MICROSOFT PURVIEW
Data Governance at Scale
Enterprise Architecture, Implementation Patterns & Operational Excellence
| Data Platform Practice | Microsoft Ecosystem Veratas | 2025 Edition |
Classification: Public | Version 1.0 | 2025
Executive Summary
Data is the most valuable asset enterprises possess — yet for most organizations, it remains ungoverned, undiscovered, and untrustworthy. Regulatory scrutiny has never been higher. GDPR fines exceeded €4.2 billion in 2023. The average cost of a data breach reached $4.45 million in 2024 (IBM). And yet, surveys consistently find that fewer than 30% of enterprise data assets are formally catalogued or classified.
Microsoft Purview addresses this challenge head-on. Released as a unified governance platform in 2022 (consolidating Azure Purview and Microsoft 365 Compliance Center), Purview provides organizations with a single control plane for data discovery, classification, cataloguing, lineage tracking, policy enforcement, information protection, and regulatory compliance — spanning on-premises, multi-cloud, and SaaS environments.
This white paper is written for data architects, Chief Data Officers, compliance leaders, and senior engineers who need to move beyond theoretical governance frameworks toward a production-grade, measurable, and operational governance program. It draws on real-world deployments across financial services, healthcare, retail, and public sector organizations.
Key Findings
| Dimension | Without Purview | With Purview (Measured Outcomes) |
| Data Discovery | Manual inventory; 40-60% of assets undocumented | 95%+ automated classification across 200+ source types |
| Time to Compliance Audit | 6-12 weeks manual preparation | 2-3 days with automated evidence collection |
| Data Breach Detection | Mean time 197 days (IBM 2024) | Near-real-time DLP alerts, sensitivity label enforcement |
| Data Consumer Productivity | Avg 4.2 hours/week searching for trusted data | 68% reduction in data search time (customer benchmark) |
| Regulatory Fine Risk | Unquantified exposure | Documented control framework mapped to GDPR, HIPAA, ISO 27001 |
| Governance TCO (3-year) | Distributed tools: $2.1M-$4.8M | Purview consolidation: $800K-$1.6M (45-65% reduction) |
| Key Insight: Microsoft Purview is not merely a data catalog — it is an integrated governance, risk, and compliance (GRC) platform that creates a closed-loop governance operating model. Organizations that treat it as only a catalog leave 60% of its capability untapped. |
CHAPTER 1
The Data Governance Imperative
1.1 Why Governance Fails Without Architecture
Most governance programs fail not because of lack of intent, but because of architectural sprawl. Governance teams operate spreadsheets, data stewards work in isolation, and lineage is tracked manually if at all. The root causes are structural:
- Federated data ownership without centralized metadata: Business units own data but metadata lives nowhere. No single system answers “where does this sales figure come from?”
- Tool fragmentation: Organizations accumulate Collibra, Informatica, Alation, Apache Atlas, and custom wiki pages — each partial, none authoritative.
- Classification as a one-time project: Data classification efforts produce point-in-time inventories that decay immediately as new data arrives and pipelines evolve.
- Compliance as reactive audit response: Evidence collection happens at audit time, not continuously — creating gaps, inconsistencies, and audit fatigue.
- Data access governed by infrastructure, not policy: Access control lives in storage ACLs, database permissions, and firewall rules — not in business-level policies aligned to data sensitivity.
1.2 The Regulatory Landscape
The regulatory environment demands systematic, demonstrable data governance. The following table maps key regulations to specific Purview capabilities:
| Regulation | Key Requirement | Purview Capability | Implementation Evidence |
| GDPR (EU) | Data subject rights (Art. 15-22), consent tracking, cross-border transfer controls | Data Map classification, Subject Rights Requests module, sensitivity labels | Automated PII detection; SRR workflow audit trail |
| HIPAA (US) | PHI identification, access controls, audit logging, minimum necessary standard | Custom classification rules for PHI patterns, policy enforcement, access reviews | Classification report; access policy audit log export |
| CCPA (California) | Consumer data inventory, opt-out rights, data sale disclosure | Data estate inventory export, lineage documentation | Asset inventory report; consent metadata tagging |
| PCI-DSS v4.0 | Cardholder data environment scoping, encryption, access logging | Sensitive info type: Credit Card, encryption insights, DLP policies | Data map export; DLP incident reports |
| ISO 27001:2022 | Information classification, asset inventory, supplier management | Full data catalog, sensitivity labels, third-party scanner integration | Control mapping export from Compliance Manager |
| SOC 2 Type II | Continuous monitoring, change management, logical access | Policy enforcement, continuous compliance score, activity logs | Evidence packages from Compliance Manager |
1.3 The Microsoft Purview Value Proposition
Unlike point solutions, Microsoft Purview delivers governance across the full data estate — from structured SQL databases to unstructured SharePoint documents, from on-premises SQL Server to Google BigQuery. Its value derives from three architectural properties:
Unified metadata fabric: A single metadata layer connects every data source, creating one authoritative catalog regardless of where data lives.
Automated classification at scale: Machine learning-based scanning continuously discovers and classifies data without manual tagging — critical when estates contain billions of assets.
Policy-to-enforcement bridge: Governance policies defined in Purview propagate to enforcement points in Azure Storage, SQL, Fabric, Power BI, and Microsoft 365 — eliminating the gap between policy documents and actual access control.
| War Story: A global bank with 47 data sources and 12 governance tools spent 11 weeks preparing for a GDPR audit. After deploying Purview with automated scanning across all sources, the same audit was prepared in 4 days — with higher confidence and full lineage documentation. Annual compliance preparation cost reduced from $1.8M to $340K. |
CHAPTER 2
Microsoft Purview: Platform Architecture
2.1 Architectural Overview
Microsoft Purview is delivered as a cloud-native SaaS service with no infrastructure to manage. Its architecture comprises four primary planes:
| Plane | Components | Primary Function |
| Data Map Plane | Automated scanning, classification engine, lineage collector, Atlas-compatible metadata store | Discovery, classification, and relationship mapping across all data sources |
| Data Catalog Plane | Search index, glossary engine, data products, collections framework, stewardship workflows | Self-service discovery, business context, and data access for consumers |
| Governance Insights Plane | Estate health dashboards, sensitivity coverage reports, stewardship metrics, Data Estate Insights | Measurement, reporting, and continuous improvement of governance program |
| Compliance & Protection Plane | Information Protection (sensitivity labels), DLP engine, Compliance Manager, eDiscovery, Audit | Regulatory compliance, data protection, legal hold, and investigation |
2.2 The Metadata Architecture
At the core of Purview is an Apache Atlas-compatible metadata store, extended significantly by Microsoft. Every entity in Purview — whether a SQL table, Power BI report, ADF pipeline, or SharePoint document — is represented as a metadata entity with:
- Entity type definition: Structural schema (e.g., azure_sql_table has columns, a database parent, and connection properties)
- Attribute payload: Business metadata, technical metadata, system metadata, and user-defined custom attributes
- Relationships: Parent-child containment (server → database → schema → table → column), lineage edges (source → transformation → target), and semantic links (term → asset)
- Labels and classifications: Applied automatically by the classification engine or manually by stewards
- Contacts: Owners and experts assigned to assets for stewardship accountability
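The five building blocks above can be seen together on a single entity. The sketch below uses Atlas-style field names; exact attribute sets vary by entity type and API version, and all values are hypothetical:

```python
# Hypothetical azure_sql_table entity in Atlas-style form. Each of the five
# building blocks above appears as a top-level field.
entity = {
    "typeName": "azure_sql_table",                        # 1. entity type definition
    "attributes": {                                       # 2. attribute payload
        "name": "orders",
        "qualifiedName": "mssql://server/db/sales/orders",
        "description": "Raw order transactions",
    },
    "relationshipAttributes": {                           # 3. relationships (containment)
        "dbSchema": {
            "typeName": "azure_sql_schema",
            "uniqueAttributes": {"qualifiedName": "mssql://server/db/sales"},
        },
    },
    "classifications": [                                  # 4. labels and classifications
        {"typeName": "MICROSOFT.PERSONAL.EMAIL"},
    ],
    "contacts": {                                         # 5. owners and experts
        "Owner": [{"id": "aad-object-id-of-owner"}],
        "Expert": [{"id": "aad-object-id-of-expert"}],
    },
}
```

Lineage edges and glossary links attach to the same entity through additional relationship attributes, which is why one metadata model can serve discovery, lineage, and stewardship at once.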
Entity Relationship Model
| Entity Tier | Example Entities | Relationship Type | Governance Action |
| Collection | Root Collection, Business Unit Collections | Hierarchical container | Access control scope, policy boundary |
| Data Source | Azure SQL Server, ADLS Gen2 Account, Fabric Workspace | Registered source in Data Map | Scanning target, credential binding |
| Dataset | SQL Table, Parquet file, Power BI dataset, Synapse Dedicated Pool | Owned by data source | Classification, lineage, ownership, sensitivity |
| Column/Field | SQL column, file schema field, report column | Child of dataset | Column-level classification, PII detection |
| Process | ADF pipeline, Spark job, Power Query transformation | Lineage edge between datasets | Data flow tracking, impact analysis |
| Glossary Term | Customer ID, Revenue, Churn Rate | Semantic link to datasets/columns | Business context, certified definitions |
2.3 Multi-Tenant & Collections Architecture
Purview’s Collections framework provides a hierarchical organizational structure that governs both metadata access and policy scope. The collection hierarchy is a critical architectural decision that cannot be changed without significant rework; it must be designed upfront.
Collection Design Patterns
Pattern 1 — Organizational Hierarchy: Root → Business Unit → Domain → Subdomain. Best for large enterprises with clear organizational governance. Enables business unit autonomy with enterprise-level roll-up reporting.
Pattern 2 — Data Zone Architecture: Root → Raw Zone → Curated Zone → Certified Zone → Restricted Zone. Best for data platform teams managing lakehouse tiers. Aligns collections to data quality and trust levels.
Pattern 3 — Regulatory Scope: Root → PCI Scope → PHI Scope → PII Scope → Internal. Best for compliance-driven organizations. Enables precise policy enforcement by regulatory classification.
Pattern 4 — Hybrid (Recommended for Enterprise): Combines organizational hierarchy at top level with regulatory scope sub-collections and data zone sub-collections. Most flexible but requires careful RBAC planning.
| Design Principle: Collections cannot be deleted if they contain assets. Design your collection hierarchy before onboarding any data sources. A flat hierarchy is always easier to expand than a complex hierarchy is to flatten. Start with no more than 3 levels for organizations under 10,000 employees. |
2.4 Identity & RBAC Architecture
Purview uses Microsoft Entra ID (formerly Azure Active Directory) for authentication and implements its own RBAC model on top. The key roles and their operational implications:
| Purview Role | Scope | Permissions | Recommended Assignment |
| Collection Admin | Per-collection | Manage sub-collections, assign roles within scope | Business unit data governance leads |
| Data Source Admin | Per-collection | Register and manage data sources, create scan rule sets | Data engineering team leads |
| Data Curator | Per-collection | Edit metadata, apply glossary terms, manage classifications | Data stewards, domain data owners |
| Data Reader | Per-collection | Read-only access to catalog, lineage, and classifications | Data consumers, analysts, report developers |
| Insights Reader | Account-level | Access Data Estate Insights dashboards | CDO, governance program manager |
| Policy Author | Account-level | Create and publish data access policies | Security architects, data governance lead |
CHAPTER 3
Data Map: Discovery & Classification at Scale
3.1 Scanning Architecture
The Data Map’s scanning engine is the foundation of Purview governance. Scans extract metadata, apply classification rules, collect lineage, and populate the catalog — continuously and at scale. Understanding scan architecture is critical to building a reliable governance program.
Scan Execution Models
Managed Virtual Network (MVNet) — Recommended: Purview manages the integration runtime within a Microsoft-managed VNet. No infrastructure to deploy. Supports private endpoint connectivity to sources. Best for most Azure-native deployments.
Self-Hosted Integration Runtime (SHIR): Customer-deployed VM running the Purview runtime agent. Required for: on-premises sources, private network sources without private endpoint support, and non-Azure cloud sources with strict network controls.
Azure Integration Runtime (AIR): Used for public-endpoint Azure sources. Simplest deployment model. Not recommended for sensitive environments where data should not traverse public internet, even transiently.
Self-Hosted Integration Runtime: Sizing Guide
| Data Volume (Assets) | CPU Recommendation | RAM Recommendation | Network Bandwidth | Node Count |
| < 100K assets | 4 vCPU | 8 GB | 100 Mbps | 1 (no HA) |
| 100K – 1M assets | 8 vCPU | 16 GB | 1 Gbps | 2 (active-active HA) |
| 1M – 10M assets | 16 vCPU | 32 GB | 10 Gbps | 4 (2 + 2 failover) |
| > 10M assets | 32 vCPU | 64 GB | 10 Gbps dedicated | 8+ (scale-out cluster) |
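For capacity planning, the sizing table can be encoded as a simple lookup. A sketch follows; the table leaves tier boundaries approximate, so this version treats each upper bound as inclusive:

```python
# SHIR sizing lookup transcribed from the table above. Tier boundaries are
# approximate in the source table; upper bounds are treated as inclusive here.
def shir_sizing(asset_count: int) -> dict:
    tiers = [
        (100_000,      {"vcpu": 4,  "ram_gb": 8,  "bandwidth": "100 Mbps",          "nodes": 1}),
        (1_000_000,    {"vcpu": 8,  "ram_gb": 16, "bandwidth": "1 Gbps",            "nodes": 2}),
        (10_000_000,   {"vcpu": 16, "ram_gb": 32, "bandwidth": "10 Gbps",           "nodes": 4}),
        (float("inf"), {"vcpu": 32, "ram_gb": 64, "bandwidth": "10 Gbps dedicated", "nodes": 8}),
    ]
    for upper_bound, spec in tiers:
        if asset_count <= upper_bound:
            return spec
```

In practice, asset counts grow as scanning coverage expands, so size for the estate you expect after full onboarding, not the assets catalogued today.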
3.2 Classification Engine Deep Dive
Classification in Purview is applied through a layered engine that combines pattern matching, machine learning, context analysis, and custom rules. Understanding the engine’s precedence model prevents misclassification at scale.
Classification Execution Order
- System classification rules (200+ built-in): Microsoft-maintained patterns for global PII types, financial data, health information, credentials, and national identifiers. Examples: Credit Card Number, SSN, IBAN, NHS Number, Email Address.
- Custom classification rules: Organization-specific patterns defined via regex, dictionary matching, or column name pattern. Applied after system rules. Override behavior configurable per rule.
- ML-based classification: Trained models for complex patterns that resist simple regex (e.g., detecting salary ranges, internally-coded identifiers). Runs after rule-based classification.
- Propagated classifications: Column-level classifications propagate upward to table level, and table-level classifications contribute to dataset-level sensitivity scoring.
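The propagation step (the last item above) reduces to a set union: a table carries every classification found on its columns. A minimal sketch, with illustrative classification names:

```python
# Classification propagation: a table's classifications are the union of its
# columns' classifications. Column and classification names are illustrative.
def table_classifications(columns: dict[str, set[str]]) -> set[str]:
    """Column-level classifications propagate upward to the table level."""
    return set().union(*columns.values()) if columns else set()

orders_columns = {
    "email":    {"MICROSOFT.PERSONAL.EMAIL"},
    "card_pan": {"MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER"},
    "order_id": set(),
}
```

This is why one misfiring column rule pollutes classification reporting at every level above it, and why threshold tuning (next section) matters.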
Custom Classification Rule Design
Custom rules are the most common source of governance debt. The following patterns prevent the most frequent issues:
- Example: Custom classification for internal Employee ID format (EMP-XXXXXXXX)
- Rule Type: Regular Expression
- Pattern: \bEMP-[0-9]{8}\b
- Minimum Match Threshold: 60% of sampled values must match
- Column Name Pattern: employee_id, emp_id, staff_id (case-insensitive)
- Data sampling behavior:
  - Purview samples 128 rows per column by default
  - Increase to 1000 rows for low-density sensitive data (e.g., executive records)
  - Use "Full scan" for regulatory-critical sources (performance tradeoff: 3-8x slower)
- Anti-pattern: Setting threshold to 1% causes false positive explosion
- Anti-pattern: Using [0-9]{8} without word boundary causes substring matches
- Best practice: Test regex against 1000 representative values before deployment
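The rule above can be tested offline before deployment. The sketch below checks the pattern against sampled values and applies the 60% minimum match threshold; the sample values are hypothetical:

```python
import re

# Offline test of the EMP-XXXXXXXX rule: does the pattern clear the 60%
# minimum match threshold against a sample of column values?
EMP_ID = re.compile(r"\bEMP-[0-9]{8}\b")

def match_rate(samples: list[str]) -> float:
    """Fraction of sampled values the rule's regex matches."""
    if not samples:
        return 0.0
    return sum(1 for v in samples if EMP_ID.search(v)) / len(samples)

samples = ["EMP-00012345", "EMP-99887766", "n/a", "EMP-1234"]  # hypothetical
rate = match_rate(samples)            # 2 of 4 values match here
would_classify = rate >= 0.60         # below threshold: column not classified
```

Note how the word boundaries pay off: `EMP-123456789` (nine digits) and `IDEMP-12345678` (embedded in a longer token) are both rejected, which is exactly the substring anti-pattern called out above.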
3.3 Supported Data Sources: Complete Matrix
| Source Category | Supported Sources | Metadata Extracted | Lineage Support | Classification |
| Azure Data | ADLS Gen1/Gen2, Azure Blob, Azure SQL DB, Azure SQL MI, Synapse Analytics, Cosmos DB, Azure Database for PostgreSQL/MySQL | Schema, statistics, partitions, resource metadata | Yes (native connectors) | Full (all classification types) |
| Microsoft Fabric | Fabric Lakehouse, Fabric Warehouse, Fabric Dataflows, Power BI datasets/reports | Schemas, measures, relationships, sensitivity labels | Yes (deep integration, column-level) | Full, bidirectional label sync |
| Power BI | Workspaces, datasets, reports, dataflows, dashboards | Dataset schema, report lineage, refresh history, endorsements | Yes (full report-to-source lineage) | Label push/pull from M365 Information Protection |
| Azure Data Factory | Pipelines, datasets, linked services, data flows | Pipeline structure, activity configuration | Yes (automated lineage extraction) | Via dataset classification |
| On-Premises | SQL Server 2012+, Oracle 12c+, SAP HANA, Teradata, HDFS | Full schema extraction | SQL Server: Yes. Others: Limited | Full classification on all |
| Multi-Cloud | AWS S3, AWS RDS, Google BigQuery, Google Cloud Storage, Snowflake | Schema and storage metadata | Limited (no native lineage; requires custom lineage API) | Full classification |
| SaaS Applications | Salesforce, SAP ECC, Erwin, Looker | Configured via connectors | Limited | Classification on connected datasets |
| Office 365 | SharePoint Online, Exchange Online, Teams, OneDrive | Document metadata, content classification | N/A (unstructured) | Full M365 sensitivity label integration |
3.4 Scan Rule Sets: Enterprise Design Patterns
Scan Rule Sets define which file types and classification rules apply to each scan. Enterprise-grade Scan Rule Sets require careful design to balance coverage, performance, and noise reduction.
// Recommended Scan Rule Set Strategy
//
// Rule Set 1: PCI Scope (used for financial data sources)
//   - File types: Parquet, Delta, CSV, ORC, JSON
//   - Classifications: Credit Card, CVV, Account Number, IBAN, SWIFT Code
//   - Sampling: Full scan (not sampled) for audit compliance
//   - Custom rules: Internal account number format regex
//
// Rule Set 2: PHI Scope (used for healthcare sources)
//   - File types: HL7, CSV, Parquet, JSON, XML
//   - Classifications: NHS Number, SSN, Date of Birth, Medical Record Number,
//     Diagnosis Code (custom), Prescription Data (custom)
//   - Sampling: Full scan
//
// Rule Set 3: Standard PII (all other sources)
//   - File types: All common formats
//   - Classifications: Email, Phone, Address, Name (ML-based)
//   - Sampling: 128 rows (default)
//
// Rule Set 4: Technical Metadata Only (dev/test environments)
//   - File types: All
//   - Classifications: NONE (disable all classification rules)
//   - Purpose: Capture lineage and schema without false positive noise in non-prod
CHAPTER 4
Data Catalog: Enterprise Search & Lineage
4.1 Catalog Architecture
The Purview Data Catalog is a faceted search index built on Azure Cognitive Search, surfacing all metadata entities discovered by the Data Map. Its power comes not from indexing but from curation: the governance workflows, glossary integration, certification processes, and ownership models that turn raw metadata into trusted, consumable data assets.
4.2 Business Glossary Design
The Business Glossary is the semantic layer of Purview — the bridge between technical metadata and business meaning. A poorly designed glossary becomes a maintenance burden; a well-designed glossary becomes the authoritative vocabulary of the organization.
Glossary Governance Model
| Term Status | Meaning | Who Sets It | Catalog Behavior |
| Draft | Term being developed; not yet authoritative | Term authors (data stewards) | Discoverable but not recommended for use |
| Approved | Reviewed and endorsed by domain owner | Domain data owner | Shown as authoritative in search results |
| Deprecated | Term being replaced; avoid new usage | Governance team | Shown with deprecation warning; redirects to replacement term |
| Expired | Term no longer valid; historical reference only | Governance program manager | Hidden from default search; accessible via filter |
Glossary Hierarchy Best Practices
Flat glossaries fail at scale. A 2,000-term flat list is unsearchable. Structure terms in a parent-child hierarchy reflecting business domains:
- L1 — Domain: Customer, Product, Finance, Risk, Operations, HR
- L2 — Subdomain: Customer → Prospect, Customer → Active, Customer → Churned
- L3 — Concept: Customer → Active → Customer Lifetime Value, Customer → Active → Net Promoter Score
- L4 — Attribute: Customer → Active → CLV → Predicted CLV (12-month), Actual CLV (trailing 12-month)
Terms should carry: formal definition, example values, calculation formula (where applicable), related terms, steward contact, and linked certified data assets. Incomplete terms erode trust faster than no glossary at all.
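The required-fields rule above can be enforced mechanically before a term is promoted to Approved. A minimal sketch, with illustrative field names loosely modelled on Atlas glossary term attributes:

```python
# Completeness gate for glossary terms: the attributes the text requires,
# expressed as a validator. Field names and the sample term are illustrative.
REQUIRED_FIELDS = {"name", "definition", "examples", "steward", "linkedAssets"}

def is_complete(term: dict) -> bool:
    """A term is fit to publish only when every required field is populated."""
    return all(term.get(field) for field in REQUIRED_FIELDS)

clv_term = {
    "name": "Customer Lifetime Value",
    "definition": "Projected net revenue attributable to a customer relationship.",
    "examples": ["1450.00", "310.75"],
    "formula": "sum(margin) over the predicted relationship horizon",  # where applicable
    "relatedTerms": ["Net Promoter Score"],
    "steward": "finance-data-stewards@contoso.example",
    "linkedAssets": ["mssql://server/dwh/gold/customer_ltv"],
}
```

Wiring a check like this into the term approval workflow prevents the half-filled terms that, as noted above, erode trust faster than having no glossary at all.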
4.3 Lineage Architecture & Extraction
Data lineage is Purview’s most strategically valuable capability — and the most complex to implement correctly. Lineage answers the questions that matter most: “Where does this metric come from?”, “What would break if we changed this table?”, and “How was this data transformed before I received it?”
Lineage Extraction Mechanisms
Automated lineage (preferred): Purview automatically extracts lineage from Azure Data Factory (pipeline activities), Synapse Spark (PySpark/Scala), Azure Synapse Pipelines, Microsoft Fabric dataflows, and Power BI (report-to-dataset-to-source chain). Zero code required — enabled at scan time.
SQL-based lineage parsing: Purview parses stored procedures, views, and CREATE TABLE AS SELECT statements to extract column-level lineage from SQL-based transformations. Supports: Azure SQL Database, Synapse Dedicated Pool, SQL Server (via SHIR).
Custom lineage via Atlas API: For sources without native connectors (custom Spark jobs, dbt, Informatica, Talend), lineage is submitted programmatically via the Apache Atlas REST API. This is the integration path for third-party ETL tools.
Lineage from Microsoft Fabric (recommended 2024+): Fabric’s native integration with Purview provides the most granular lineage available: column-level lineage through Lakehouse, Warehouse, Dataflows, and Power BI reports in a single unbroken chain.
Lineage API Integration Example
# Submit custom lineage to Purview Atlas API (Python)
# Use case: dbt transformation lineage
import requests
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://purview.azure.net/.default").token

# Replace {account_name} with your Purview account name
purview_endpoint = "https://{account_name}.purview.azure.com"

lineage_payload = {
    "entities": [
        {
            "typeName": "Process",
            "attributes": {
                "name": "dbt_model_customer_lifetime_value",
                "qualifiedName": "dbt://project/model/customer_lifetime_value",
                "inputs": [
                    {"guid": "-1", "typeName": "azure_sql_table",
                     "uniqueAttributes": {"qualifiedName": "mssql://server/db/sales/orders"}},
                    {"guid": "-2", "typeName": "azure_sql_table",
                     "uniqueAttributes": {"qualifiedName": "mssql://server/db/crm/customers"}}
                ],
                "outputs": [
                    {"guid": "-3", "typeName": "azure_sql_table",
                     "uniqueAttributes": {"qualifiedName": "mssql://server/dwh/gold/customer_ltv"}}
                ]
            }
        }
    ]
}

response = requests.post(
    f"{purview_endpoint}/catalog/api/atlas/v2/entity/bulk",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=lineage_payload
)
# Response: {"mutatedEntities": {"CREATE": [{"guid": "…"}]}}
4.4 Search Optimization & Data Product Design
As catalogs scale past 100,000 assets, findability degrades without intentional design. Enterprise search optimization requires four capabilities:
Endorsed and certified assets: Purview endorsement levels (Promoted, Certified) surface trusted assets first. Certification should require: data quality SLA documentation, ownership assignment, glossary linkage, and lineage completeness. Certified assets appear with a visual badge in search results and are prioritized in the search ranking.
Custom attributes for domain context: Extend asset metadata with business-critical attributes: Data Domain, Business Owner, SLA Tier (Gold/Silver/Bronze), Refresh Frequency, Data Quality Score, GDPR Lawful Basis. These become facets in the search experience.
Data Products (Fabric integration): In organizations using Microsoft Fabric, Data Products group related datasets, reports, and pipelines into a consumable unit with a single access request, SLA, and ownership model. This is the recommended abstraction for data mesh architectures on the Microsoft stack.
Term-to-asset linking at ingestion: Automate glossary term linking during scan using column name pattern matching. A column named “customer_id” should automatically receive the “Customer ID” glossary term link as a suggestion — reducing manual stewardship burden by 40-70% in mature estates.
CHAPTER 5
Data Estate Insights & Health Metrics
5.1 Insights Architecture
Data Estate Insights is Purview’s built-in analytics layer — pre-built dashboards that measure governance program health across five dimensions. Unlike custom reporting, Insights uses pre-aggregated metrics from the Data Map, making it near-real-time with no additional cost.
| Insight Category | Key Metrics | Governance Question Answered |
| Asset Insights | Total assets by source type, collection, asset type; new assets per week; scan success rate | “How many data assets do we govern? Are we growing coverage?” |
| Classification Insights | % classified, top classifications by frequency, unclassified assets by source | “What percentage of our data estate has been classified? Where are our gaps?” |
| Sensitivity Insights | Sensitivity label distribution, label coverage %, cross-source sensitive asset count | “How much sensitive data do we have and where does it live?” |
| Glossary Insights | Term coverage (% of assets with linked terms), incomplete terms, term usage by domain | “Is our business glossary actually being used to annotate data?” |
| Stewardship Insights | Assets without owners, assets without expert contacts, overdue reviews | “Do we have accountability for our data assets?” |
5.2 Governance Maturity Scoring
The most operationally valuable use of Insights is building a Governance Maturity Score — a composite metric that tracks program progress over time. This is the single number a CDO should report to the board.
// Governance Maturity Score Calculation (KQL, run in Log Analytics)
// Score range: 0-100. Target: >75 for "Managed" maturity level.
let totalAssets = toscalar(PurviewInsights_AssetCounts_CL
    | summarize sum(AssetCount_d));
let classifiedAssets = toscalar(PurviewInsights_ClassificationCoverage_CL
    | summarize sum(ClassifiedCount_d));
let labelledAssets = toscalar(PurviewInsights_SensitivityCoverage_CL
    | summarize sum(LabelledCount_d));
let ownedAssets = toscalar(PurviewInsights_StewardshipCoverage_CL
    | summarize sum(OwnedCount_d));
let glossaryLinked = toscalar(PurviewInsights_GlossaryCoverage_CL
    | summarize sum(TermLinkedCount_d));
// Weighted composite score:
//   Classification coverage: 30%
//   Sensitivity labelling:   25%
//   Asset ownership:         25%
//   Glossary linkage:        20%
print GovernanceMaturityScore =
      (classifiedAssets / totalAssets * 30)
    + (labelledAssets / totalAssets * 25)
    + (ownedAssets / totalAssets * 25)
    + (glossaryLinked / totalAssets * 20),
    CalculatedOn = now()
5.3 Integrating Purview Insights with Power BI
For executive reporting, Purview Insights data should be surfaced in Power BI via the Purview REST API. The recommended approach: nightly pipeline (ADF or Fabric Data Factory) calls the Purview Insights REST API endpoints, writes results to a Fabric Lakehouse, and a certified Power BI report visualizes governance KPIs with week-over-week trending.
Key metrics to visualize: Governance Maturity Score trend, Classification coverage by data domain, Sensitive asset growth rate, Stewardship SLA compliance (% of assets reviewed within policy window), and Open stewardship actions by owner.
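A sketch of the snapshot record such a pipeline might append each night follows. The fetch itself (the Insights REST calls) is deployment-specific and omitted; the metric names are illustrative:

```python
import datetime

# Shape of one nightly governance snapshot written to the Lakehouse.
# 'metrics' holds absolute counts pulled from the Insights API (not shown);
# pre-computed coverage ratios make week-over-week trending trivial in Power BI.
def build_snapshot(metrics: dict, total_assets: int) -> dict:
    return {
        "snapshotDate": datetime.date.today().isoformat(),
        "totalAssets": total_assets,
        "coverage": {name: round(count / total_assets, 4)
                     for name, count in metrics.items()},
    }

snap = build_snapshot({"classified": 8200, "labelled": 6100, "owned": 7400},
                      total_assets=10_000)
```

Storing ratios alongside raw counts lets the report trend coverage correctly even as the denominator (total governed assets) grows.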
CHAPTER 6
Data Policy & Access Governance
6.1 Purview Data Policy Architecture
Purview’s Data Policy capability represents a fundamental shift in how data access is governed: from infrastructure-level ACLs managed by engineers to business-level policies managed by data owners and governance teams. This capability is often the least understood and most transformative feature in the platform.
A Purview data access policy states: “Users in group X can perform action Y on data assets matching classification Z.” This policy is then automatically enforced at the storage layer — without requiring changes to Azure RBAC, storage ACLs, or database permissions.
6.2 Policy Types
Data owner policies: Grant read or read/modify access to specific data assets in Azure Storage, ADLS Gen2, Azure SQL, and Fabric. The data owner (assigned in Purview catalog) can self-service grant access without involving the infrastructure team — a governance-controlled self-service model.
DevOps policies: Grant SQL performance monitoring access (VIEW DATABASE STATE, VIEW SERVER STATE) to DevOps engineers without granting data read permissions. Solves a critical principle of least privilege gap in most organizations.
Self-service data access policies: Enable data consumers to request access directly through the Purview catalog. The request triggers an approval workflow routed to the data owner; approved requests are provisioned automatically, rejections are returned with a reason, and a full audit trail is maintained.
Attribute-based access control (ABAC) policies: Grant access based on asset classifications rather than specific named assets. Example: “Allow data scientists in the ML team to read assets classified as Non-Sensitive.” As new assets are classified and onboarded, they automatically inherit the correct access policy without manual intervention.
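The ABAC rule in the example reduces to a set check. A conceptual sketch follows; group and classification names are illustrative, and real enforcement happens inside Purview's policy engine rather than application code:

```python
# "Allow data scientists in the ML team to read assets classified as
# Non-Sensitive": grant iff the requester is in the policy's group and the
# asset carries no sensitive classification. All names are illustrative.
SENSITIVE = {"MICROSOFT.PERSONAL.SSN", "MICROSOFT.FINANCIAL.CREDIT_CARD_NUMBER"}

def abac_allows(user_groups: set[str], asset_classifications: set[str],
                policy_group: str = "ml-team-data-scientists") -> bool:
    return policy_group in user_groups and not (asset_classifications & SENSITIVE)
```

The operational benefit is in the second operand: because the decision keys on classifications rather than asset names, newly scanned assets inherit the right access posture with no policy edit at all.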
6.3 Policy Enforcement Architecture
| Data Source | Policy Enforcement Point | Latency to Enforce | Granularity |
| ADLS Gen2 | Azure Storage RBAC (via Purview policy propagation) | < 5 minutes | Container, folder, file level |
| Azure Blob Storage | Azure Storage RBAC | < 5 minutes | Container level |
| Azure SQL Database | SQL permissions (system-managed) | < 2 minutes | Database, schema, table level |
| Azure SQL Managed Instance | SQL permissions (system-managed) | < 2 minutes | Database, schema, table level |
| Microsoft Fabric | Fabric workspace and item permissions | < 10 minutes | Workspace, lakehouse, table level |
| Azure Synapse Analytics | Synapse workspace RBAC | < 5 minutes | Workspace, pool level |
| Architectural Constraint: Purview data policies do NOT replace row-level security (RLS), column masking, or Dynamic Data Masking (DDM) in SQL databases. Purview policies govern access to data assets — who can connect and query. RLS/DDM governs what data they see within an allowed connection. Both layers are required for complete access governance. |
6.4 Access Workflow Implementation
# Purview Self-Service Access Request -- Backend Configuration
# This Logic App workflow handles access request notifications and provisioning

# Step 1: Enable event publishing on the Purview account so that access
# request events flow to Event Grid. This is configured on the account
# itself (portal or ARM template, depending on your deployment).

# Step 2: Event Grid subscription -- route access requests to Logic App
az eventgrid event-subscription create \
    --name "purview-access-requests" \
    --source-resource-id "/subscriptions/…/purview/accounts/contoso-purview" \
    --endpoint "https://prod-logic-app.azurewebsites.net:443/api/access-request/triggers/manual/invoke" \
    --included-event-types "Microsoft.Purview.DataAccessRequestCreated"

# Step 3: Logic App action -- Teams notification to data owner
# {
#   "type": "message",
#   "to": "@{triggerBody()?['dataOwnerEmail']}",
#   "body": "Access request from @{triggerBody()?['requestorEmail']} for @{triggerBody()?['assetName']}.",
#   "attachments": [{"approve_url": "@{triggerBody()?['approvalUrl']}"}]
# }

# Step 4: On approval -- Purview API call to grant policy
# POST /policyStore/dataPlane/policies/{policyId}/approve
CHAPTER 7
Information Protection & DLP Integration
7.1 Sensitivity Labels: Architecture & Design
Microsoft Purview Information Protection (formerly Azure Information Protection and Microsoft Information Protection in Microsoft 365) provides the sensitivity label framework that connects data classification in Purview to protection enforcement across the Microsoft ecosystem. Sensitivity labels are the governance primitive that spans cloud storage, databases, Office documents, Teams messages, and third-party applications.
Sensitivity Label Taxonomy Design
The label taxonomy must balance business usability with enforcement precision. The most common failure mode: too many labels (>10) that data users cannot distinguish between, leading to incorrect labelling or label fatigue.
| Label | Definition | Protection Actions | Auto-labelling Trigger |
| Public | Approved for external publication. No restrictions. | None | No sensitive classifications detected |
| Internal | Business information for employee use. Not for public sharing. | Watermark on documents | Default label applied to all unlabelled items |
| Confidential | Sensitive business data. External sharing requires approval. | Encryption, external sharing DLP block, watermark | PII classification, financial data patterns |
| Highly Confidential | Regulated data, trade secrets, executive communications. | Encryption, download restrictions, MFA for access, audit logging | PHI, PCI data, credentials, classified IP |
| Restricted | Legal hold, regulatory investigation, M&A sensitive. | Encryption, access list restricted to named individuals, no forwarding | Legal trigger or manual assignment only |
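For illustration, the taxonomy above can be reduced to a deterministic label-selection rule, which is useful for testing auto-labelling behaviour offline before deploying policies. The sketch below is illustrative Python, not a Purview API; the classification names are placeholders standing in for Purview's built-in sensitive information types.

```python
# Illustrative sketch of the label taxonomy above. Classification names are
# placeholders, not official Purview sensitive information type names.
HIGHLY_CONFIDENTIAL = {"PHI", "PCI", "Credentials", "ClassifiedIP"}
CONFIDENTIAL = {"PII", "FinancialData"}

def select_label(detected: set, manual_restricted: bool = False) -> str:
    """Return the most restrictive applicable label per the taxonomy table."""
    if manual_restricted:          # Restricted: legal trigger or manual only
        return "Restricted"
    if detected & HIGHLY_CONFIDENTIAL:
        return "Highly Confidential"
    if detected & CONFIDENTIAL:
        return "Confidential"
    # No sensitive classifications detected: default unlabelled items to
    # Internal; "Public" should require explicit publication approval.
    return "Internal"
```

A mapping of this shape also doubles as documentation for stewards reviewing auto-labelling simulation results.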
7.2 Automatic Labelling: Configuration & Tuning
Auto-labelling policies scan Exchange, SharePoint, OneDrive, Teams, and Azure-native sources to apply labels without user intervention. Configuration requires careful threshold tuning to avoid both under-labelling (missing sensitive content) and over-labelling (creating label fatigue and compliance noise).
| # PowerShell: Configure Auto-labelling Policy for Confidential label |
| # Applies "Confidential" label when content matches any of the PII patterns below |
| Connect-IPPSSession -UserPrincipalName admin@contoso.com |
| # Create the auto-labelling policy |
| New-AutoSensitivityLabelPolicy ` |
| -Name "Confidential-AutoLabel-Policy" ` |
| -ApplySensitivityLabel "Confidential" ` |
| -ExchangeLocation All ` |
| -SharePointLocation All ` |
| -OneDriveLocation All ` |
| -Mode TestWithoutNotifications  # start in simulation mode |
| # Create the labelling rule (trigger conditions; listed patterns match as OR) |
| New-AutoSensitivityLabelRule ` |
| -Name "PII-Detection-Rule" ` |
| -Policy "Confidential-AutoLabel-Policy" ` |
| -ContentContainsSensitiveInformation @( |
| @{Name="Credit Card Number"; minCount="1"}, |
| @{Name="EU National Identification Number"; minCount="1"}, |
| @{Name="Email Address"; minCount="5"} |
| ) |
| # Run in simulation mode first (30 days) — never deploy directly to production |
| # Review simulation report: Security & Compliance Center → Information Protection |
| # → Auto-labelling → Policy Name → Simulation Results |
7.3 Data Loss Prevention Policies
DLP policies in Purview define actions taken when sensitive data violates sharing rules — blocked email, restricted external sharing, user notification, or incident report generation. DLP is the enforcement layer; sensitivity labels are the classification layer. Both are required.
| DLP Scope | Trigger Condition | Action | Business Justification Override |
| Exchange Email | Highly Confidential label; external recipient | Block delivery; notify sender; generate incident | Yes — manager approval workflow |
| SharePoint/OneDrive | Confidential label; public sharing link created | Block link creation; notify user; generate incident | Yes — data owner approval |
| Teams Messages | Credit card number, SSN pattern in message | Block send; notify user; policy tip displayed | No — hard block (financial regulatory) |
| Azure Storage (Defender for Storage) | Highly Confidential label; unusual download volume | Alert to SOC; optional: revoke access token | No — security team investigation required |
| Power BI | Confidential label; export to Excel/CSV | Audit log entry; notify workspace admin | Yes — self-service analytics exception |
CHAPTER 8
Compliance Manager & Regulatory Frameworks
8.1 Compliance Manager Architecture
Microsoft Purview Compliance Manager provides a risk-based compliance assessment framework that maps Microsoft 365 and Azure control implementations to regulatory requirements. It automates evidence collection for controls that Microsoft manages (Microsoft-managed actions) and guides remediation for customer-managed controls.
Compliance Manager’s compliance score is not a certification — it is a risk signal. A score of 85% does not mean an organization is 85% compliant with GDPR; it means 85% of the control objectives mapped in the assessment are in a satisfactory state. External audit remains necessary for formal certification.
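As a concrete illustration of the distinction, a points-based score of this kind is computed roughly as follows. This is an assumed simplification; Compliance Manager's actual weighting of Microsoft-managed versus customer-managed actions is more granular.

```python
def compliance_score(controls) -> float:
    """Achieved points / total assigned points, as a percentage.
    Each control: {"points": int, "achieved": int}. Illustrative model only,
    not the exact Compliance Manager weighting."""
    total = sum(c["points"] for c in controls)
    achieved = sum(c["achieved"] for c in controls)
    return round(100.0 * achieved / total, 1) if total else 0.0

# A score of 85.0 means 85% of mapped control points are satisfied,
# not "85% compliant with GDPR" in any legal sense.
```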
8.2 Regulatory Assessment Configuration
| Regulation | Assessment Type | Control Count (Approx) | Automated Evidence % | Key Gap Areas Typically Found |
| GDPR | Pre-built (Microsoft) | 165 controls | ~45% | Subject Rights Request process, DPIA documentation, third-party processor contracts |
| HIPAA/HITECH | Pre-built | 80 controls | ~55% | PHI access audit logs, workforce training records, BAA documentation |
| ISO 27001:2022 | Pre-built | 93 controls | ~40% | Asset management procedures, incident response testing, supplier risk assessments |
| PCI-DSS v4.0 | Pre-built | 264 controls | ~35% | Network segmentation evidence, penetration test reports, PA-DSS documentation |
| SOC 2 Type II | Pre-built | 64 controls | ~50% | Vendor management evidence, background check procedures, BC/DR testing records |
| Custom Regulatory | Custom assessment builder | User-defined | Varies | Depends on regulation; use for DORA, NIS2, local data residency laws |
8.3 Continuous Compliance Operating Model
The highest-maturity organizations do not prepare for compliance audits — they operate in a state of continuous compliance. This requires three capabilities:
Automated control testing: Use Microsoft Defender for Cloud’s regulatory compliance dashboard linked to Compliance Manager. Defender continuously tests security controls (encryption at rest, network security groups, MFA enforcement) and feeds results directly into Compliance Manager control scores — daily, not quarterly.
Evidence automation: Build evidence packages programmatically. The Compliance Manager API allows: listing all controls in an assessment, retrieving current control status, uploading evidence documents from SharePoint, and triggering evidence refresh. Automate this in ADF or Logic Apps on a monthly schedule.
Audit-ready documentation: Compliance Manager generates exportable evidence packages (Excel workbooks with control evidence, test results, and remediation notes). Configure these to be generated automatically 30 days before known audit windows and delivered to the compliance team SharePoint site via Power Automate.
| # Compliance Manager API: Export assessment evidence package |
| # Use in Power Automate or ADF for automated audit package generation |
| $token = (Get-AzAccessToken -ResourceUrl "https://compliance.microsoft.com").Token |
| # List all assessments |
| $assessments = Invoke-RestMethod ` |
| -Uri "https://compliance.microsoft.com/api/ComplianceManager/assessments" ` |
| -Headers @{Authorization = "Bearer $token"} ` |
| -Method GET |
| # Get controls for GDPR assessment |
| $gdprId = ($assessments.value | Where-Object {$_.name -like "*GDPR*"}).id |
| $controls = Invoke-RestMethod ` |
| -Uri "https://compliance.microsoft.com/api/ComplianceManager/assessments/$gdprId/controls" ` |
| -Headers @{Authorization = "Bearer $token"} ` |
| -Method GET |
| # Upload evidence document to a control |
| $body = @{ |
| controlId = "GDPR-Art32-EncryptionAtRest"; |
| evidenceFile = [Convert]::ToBase64String([IO.File]::ReadAllBytes("encryption_policy.pdf")); |
| evidenceType = "PolicyDocument" |
| } | ConvertTo-Json |
| Invoke-RestMethod ` |
| -Uri "https://compliance.microsoft.com/api/ComplianceManager/assessments/$gdprId/evidence" ` |
| -Headers @{Authorization = "Bearer $token"; "Content-Type" = "application/json"} ` |
| -Method POST -Body $body |
CHAPTER 9
Integration Patterns: Fabric, Synapse & ADF
9.1 Microsoft Fabric Integration (Deep Dive)
Microsoft Fabric’s native integration with Purview is the most complete governance integration in the Microsoft ecosystem. When fully configured, every Fabric artifact (Lakehouse table, Warehouse table, Dataflow, Pipeline, Power BI dataset, report) is automatically discoverable in Purview with column-level lineage, sensitivity labels, and endorsement status synchronized bidirectionally.
Enabling Fabric-Purview Integration
| # Step 1: Connect Fabric tenant to Purview account |
| # Fabric Admin Portal → Governance → Microsoft Purview hub |
| # Enter Purview account name and subscription |
| # Step 2: Configure Purview scanning of Fabric workspaces |
| # In Purview Studio: Data Map → Sources → Register |
| # Source type: Microsoft Fabric |
| # Select workspace(s) to scan |
| # Step 3: Enable sensitivity label sync (bidirectional) |
| # In Fabric Admin Portal: Tenant Settings → Information Protection |
| # Toggle ON: "Apply sensitivity labels from data sources to their data in Power BI" |
| # Toggle ON: "Allow workspace admins to override automatically applied sensitivity labels" |
| # Step 4: Verify lineage (check in Purview catalog after first scan completes) |
| # Expected lineage chain: |
| # Source System → ADF/Dataflow ingestion → Bronze Lakehouse → |
| # Silver Lakehouse → Gold Lakehouse → Warehouse → Power BI Dataset → Report |
| # Troubleshooting: If lineage gaps appear between Lakehouse and Warehouse, |
| # ensure Fabric Warehouse is using shortcuts to Lakehouse (not COPY INTO) |
| # COPY INTO breaks automated lineage; use Lakehouse shortcuts or Dataflows instead |
9.2 Azure Data Factory Lineage Integration
ADF lineage extraction in Purview is one of the most powerful automated capabilities — transforming opaque ETL pipelines into transparent, traceable data flows. When configured correctly, every ADF pipeline run generates lineage edges in the Purview Data Map automatically.
| ADF Activity Type | Lineage Extracted | Granularity | Configuration Required |
| Copy Activity | Source dataset → Copy activity → Sink dataset | Dataset level | Auto (no config required once Purview connected) |
| Data Flow Activity | Source → Transformation stages → Sink | Column level (via data flow schema) | Enable “Lineage reporting” in Data Flow settings |
| Execute Pipeline Activity | Parent pipeline → Child pipeline linkage | Pipeline level | Auto |
| Stored Procedure Activity | Database → SP execution → Database | Dataset level (column via SQL parsing) | SQL Server: auto. Others: manual Atlas API |
| Lookup Activity | Source dataset read (no lineage to target) | Dataset level (read only) | Auto |
| Azure Function Activity | No native lineage | None | Custom Atlas API call from within the Azure Function |
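For the Azure Function row above, custom lineage is submitted as an Atlas process entity to the bulk endpoint listed in Appendix B. The following Python sketch only constructs the request payload; the `Process` and `DataSet` type names follow the generic Atlas convention and should be checked against the actual asset types in your Data Map.

```python
def build_lineage_process(process_name: str, input_qns: list, output_qns: list) -> dict:
    """Build an Atlas 'Process' entity payload linking input assets to outputs.
    Qualified names (qns) must identify assets already in the Data Map."""
    def ref(qn: str) -> dict:
        # Reference an existing entity by unique attribute. "DataSet" is the
        # generic supertype; real calls should use the concrete asset type.
        return {"typeName": "DataSet", "uniqueAttributes": {"qualifiedName": qn}}
    return {
        "entities": [{
            "typeName": "Process",
            "attributes": {
                "qualifiedName": f"custom-etl://{process_name}",
                "name": process_name,
                "inputs": [ref(q) for q in input_qns],
                "outputs": [ref(q) for q in output_qns],
            },
        }]
    }

# POST the payload to /catalog/api/atlas/v2/entity/bulk (see Appendix B) with
# an AAD bearer token, from inside the Azure Function after each run.
```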
9.3 Azure Synapse Analytics Integration
Synapse Analytics integration with Purview provides lineage and classification for dedicated SQL pools, serverless SQL, and Spark pools. Key integration points:
Dedicated SQL Pool lineage: Purview scans DDL and DML to extract lineage from stored procedures, views, and CTAS statements. For complex multi-step transformations in stored procedures, Purview parses the SQL AST (abstract syntax tree) to identify source-to-target column mappings.
Spark pool lineage: PySpark and Scala notebooks emit lineage events via the OpenLineage connector. Install the Purview Spark connector package in the Synapse Spark pool configuration to enable automatic lineage emission from all Spark jobs.
| # Install Purview Spark Connector in Synapse Spark Pool |
| # Synapse Studio → Manage → Apache Spark pools → [Your Pool] → Packages |
| # requirements.txt: |
| # azure-purview-lineage-spark==1.0.0 |
| # Spark configuration (add to Spark pool properties): |
| # spark.extraListeners: com.microsoft.azure.purview.spark.PurviewSparkListener |
| # spark.purview.account.name: contoso-purview |
| # spark.purview.tenant.id: {tenant-id} |
| # After configuration, all Spark DataFrame read/write operations emit lineage: |
| # df = spark.read.parquet("abfss://bronze@storage.dfs.core.windows.net/sales/") |
| # df_transformed = df.groupBy("region").agg(sum("revenue").alias("total_revenue")) |
| # df_transformed.write.format("delta").save("abfss://silver@storage…/sales_by_region/") |
| # This generates: bronze/sales → Spark job → silver/sales_by_region (column-level) |
CHAPTER 10
Deployment Architecture & Operating Model
10.1 Deployment Architecture Patterns
Pattern 1: Centralized Governance (Single Purview Account)
Best for: Organizations with <50,000 data assets, single-geography operation, or strong central governance team. One Purview account, one data map, one catalog. Simple operations, lower cost, but single point of governance control.
Pattern 2: Federated Governance (Hub-and-Spoke)
Best for: Large multi-geography organizations with autonomous business units. A central “governance hub” Purview account maintained by the enterprise CDO office. Each business unit has its own “spoke” Purview account for local governance. Metadata is synchronized to the hub via the Purview metadata API on a scheduled basis. Hub provides enterprise-wide search and reporting; spokes provide local stewardship autonomy.
Pattern 3: Domain-Aligned (Data Mesh Architecture)
Best for: Organizations implementing a Data Mesh operating model. Each data domain (Customer, Product, Finance, Risk) owns and operates its own Purview account and is responsible for its governance outcomes. The enterprise governance team sets standards and taxonomy but does not operate the domain catalogs. Federated computational governance via Purview’s shared glossary and common classification taxonomy ensures interoperability.
10.2 Network Architecture
| Component | Network Option | Recommended For | Tradeoffs |
| Purview Studio (UI) | Public endpoint (default) or Private Endpoint | Private EP for regulated industries | Private EP requires VPN/ExpressRoute for analyst access |
| Scanning runtime | Managed VNet (preferred) or SHIR | Managed VNet for Azure sources; SHIR for on-prem | SHIR adds VM management overhead; MVNet is zero-ops |
| Purview API | Public with AAD auth or Private EP | Private EP for automation in private VNets | Private EP: no public API calls from outside VNet |
| Kafka (event streaming) | Kafka endpoint for event notification | Event-driven metadata workflows | Requires Event Hubs consumption for real-time events |
10.3 Governance Operating Model
Technology deployment without an operating model produces an expensive, unused catalog. The governance operating model defines who does what, when, and how — continuously.
| Role | Responsibilities | Time Commitment | Purview Role |
| Chief Data Officer (CDO) | Set governance strategy, approve glossary, report metrics to board | 2-4 hrs/week | Insights Reader, Collection Admin (Root) |
| Data Governance Lead | Operate Purview program, manage stewards, evolve policies | Full time | Collection Admin, Policy Author |
| Domain Data Owner | Own data quality for domain, approve certifications, approve access requests | 4-8 hrs/week | Data Curator (domain collection) |
| Data Steward | Enrich metadata, link glossary terms, resolve classification issues, review flagged assets | 50-100% FTE | Data Curator |
| Data Engineer | Register data sources, configure scans, build custom lineage integration | 20% allocation | Data Source Admin |
| Data Consumer | Search catalog, request access, rate asset quality, report metadata issues | Ad hoc | Data Reader |
| Operating Model Insight: A Purview deployment without assigned Data Stewards is a catalog that fills with metadata but never becomes trusted. The steward-to-asset ratio matters: for a 50,000 asset estate, plan for 2-3 full-time stewards in the first year. Automation (bulk classification, bulk term linking) can increase effective steward capacity to 1 FTE per 100,000 assets at maturity. |
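The staffing ratios in the insight above translate into a simple planning calculation. The ratios are taken from the text; the linear scaling between them is an assumption.

```python
def stewards_required(asset_count: int, mature_automation: bool = False) -> float:
    """Estimate steward FTEs from estate size, per the ratios above:
    roughly 1 FTE per 20,000 assets in year one (2-3 FTE for a 50K estate),
    rising to 1 FTE per 100,000 assets once bulk automation is in place."""
    assets_per_fte = 100_000 if mature_automation else 20_000
    return round(asset_count / assets_per_fte, 1)
```

For the 50,000-asset estate cited above, this yields 2.5 FTE in year one.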
CHAPTER 11
Performance, Scale & Cost Optimization
11.1 Scan Performance Optimization
| Optimization Lever | Description | Performance Impact | Tradeoff |
| Incremental scanning | Scan only new/modified assets since last scan using watermark-based detection | 60-80% reduction in scan time for stable sources | May miss classification changes on unmodified assets |
| Targeted scanning | Scope scans to specific folders, schemas, or file patterns | 40-70% faster | Requires good source naming conventions |
| Scan concurrency tuning | Increase concurrent threads on SHIR from default 4 to 8-16 | 2-4x throughput increase | Higher CPU/RAM on SHIR VMs required |
| Classification rule optimization | Reduce classification rules per scan rule set; use targeted rule sets per source type | 30-50% faster per scanned asset | Requires maintaining multiple scan rule sets |
| Off-hours scheduling | Schedule large scans for 2-6 AM to avoid competing with production workloads | No throughput gain but avoids source contention | Delayed freshness; not suitable for compliance triggers |
11.2 Capacity Planning for Large Estates
Purview’s Data Map has published scale limits that define the upper bounds of single-account deployments. Planning against these limits prevents mid-program architectural pivots:
| Resource | Scale Limit (as of 2024) | Recommendation |
| Assets in Data Map | 100 million assets per account | For >80M assets, begin planning federated architecture |
| Registered sources | 3,000 sources per account | Consolidate similar source types into single registered sources where possible |
| Concurrent scans | 100 concurrent scan runs | Use scan scheduling to avoid peak concurrency; prioritize by source criticality |
| Glossary terms | 100,000 terms per account | Maintain term hygiene; deprecate unused terms quarterly |
| Collections | 256 collections per account | Design flat-ish hierarchies; max 4-5 levels for most organizations |
| Custom classification rules | 500 per account | Consolidate similar patterns; use regex groups over multiple single patterns |
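A guardrail check against these published limits can be automated. The sketch below copies the limit values from the table; the 80% warning threshold is an assumption to tune per organization.

```python
# Single-account Data Map limits from the table above (as of 2024).
SCALE_LIMITS = {
    "assets": 100_000_000,
    "registered_sources": 3_000,
    "concurrent_scans": 100,
    "glossary_terms": 100_000,
    "collections": 256,
    "custom_classification_rules": 500,
}

def capacity_warnings(current: dict, threshold: float = 0.8) -> list:
    """Flag every dimension at or above `threshold` of its account limit."""
    warnings = []
    for key, limit in SCALE_LIMITS.items():
        used = current.get(key, 0)
        if used / limit >= threshold:
            warnings.append(f"{key}: {used}/{limit} ({100 * used / limit:.0f}%)")
    return warnings
```

Run monthly against Data Map metrics, a check like this gives early warning of the >80M-asset point where federated architecture planning should begin.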
11.3 Cost Optimization
Purview pricing is based on Data Map capacity units (CUs), scan compute, and Microsoft 365 Compliance licensing. Understanding the cost model prevents surprise bills in large-scale deployments.
| Cost Component | Billing Model | Optimization Strategy |
| Data Map capacity units | $0.496/CU/hour (1 CU = 1GB metadata storage + processing capacity) | Audit scan frequency; incremental scans reduce CU consumption by 60-75% |
| Scan compute | Billed per vCore-hour for SHIR; Managed VNet included in CUs | Right-size SHIR VMs; schedule to minimize runtime; use MVNet where possible |
| Data insights compute | Included in CU pricing up to 1M assets/day | No separate cost; do not over-provision CUs speculatively |
| M365 Compliance (DLP, Labels) | Included in M365 E5 or E5 Compliance add-on | Audit license assignments; unused Compliance seats are common waste |
| Defender for Cloud integration | Charged per resource per hour for Defender plans | Enable Defender Storage/SQL only for in-scope regulated workloads |
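For budgeting, the Data Map line above translates into a rough monthly estimator. The rate and the 60-75% incremental-scan saving are the figures from the table; applying the saving to the whole capacity bill is a simplifying assumption.

```python
CU_RATE_PER_HOUR = 0.496  # $/CU/hour, per the cost table above
HOURS_PER_MONTH = 730

def datamap_monthly_cost(capacity_units: int, incremental_saving: float = 0.0) -> float:
    """Rough monthly Data Map cost in USD. incremental_saving of 0.60-0.75
    models the table's reduction from moving full scans to incremental."""
    base = capacity_units * CU_RATE_PER_HOUR * HOURS_PER_MONTH
    return round(base * (1 - incremental_saving), 2)
```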
CHAPTER 12
Real-World Case Studies
Case Study 1: Global Financial Services Firm — GDPR & PCI Compliance Transformation
| Dimension | Details |
| Organization | Pan-European retail bank, 12,000 employees, 85 data sources |
| Challenge | GDPR audit failed in 2022 due to inability to demonstrate PII data inventory. €2.3M fine issued. Compliance team spent 14 weeks per audit cycle manually documenting data assets. |
| Purview Scope | Data Map (85 sources), full estate classification, GDPR & PCI assessments in Compliance Manager, sensitivity labels across M365 and Azure Storage, DLP policies for credit card and IBAN patterns |
| Timeline | 16 weeks to full production deployment across all 85 sources |
Technical Implementation Highlights
- Deployed Managed VNet scanning for 67 Azure-native sources; SHIR cluster (4 nodes, 16 vCPU each) for 18 on-premises SQL Server and Oracle sources
- Custom classification rules developed for: IBAN variants (23 country formats), internal account number format, and proprietary customer reference codes — regex validated against 500K live records before deployment
- Column-level lineage from core banking Oracle DB → ADF pipelines → Azure SQL DWH → Power BI reports established within 6 weeks of scan completion
- GDPR Subject Rights Request module deployed: automated data discovery across all 85 sources on SRR submission, reducing SRR response time from 28 days to 4 days
| Metric | Before Purview | After Purview (12 months) |
| Compliance audit preparation time | 14 weeks manual | 6 days automated |
| PII coverage (classified assets) | 23% (manual inventory) | 94% (automated) |
| Sensitivity label coverage | 8% (M365 only) | 87% across Azure + M365 |
| GDPR SRR response time | 28 days | 4 days |
| Annual compliance staffing cost | £1.2M (12 FTE) | £380K (3 FTE + automation) |
| Purview TCO (Year 1) | — | £420K (all-in) |
Case Study 2: Healthcare Network — PHI Governance & HIPAA Continuous Compliance
| Dimension | Details |
| Organization | US regional hospital network, 22 hospitals, 6,500 clinical staff, 140TB of health data across Azure and on-premises |
| Challenge | Inability to demonstrate minimum-necessary access principle for PHI (HIPAA §164.514). Multiple breaches of PHI to non-clinical staff through misconfigured Power BI reports. No systematic tracking of PHI data flows. |
| Purview Scope | Healthcare-specific classification rules (PHI types), Data Map across Epic EHR integration layer + Azure SQL + ADLS Gen2, Purview policy for PHI access restriction, DLP to block PHI in Teams/email, Compliance Manager HIPAA assessment |
| Timeline | 20 weeks (regulatory urgency drove accelerated timeline) |
Key Technical Decisions
Custom PHI classification taxonomy: Standard Purview PHI rules (SSN, DOB) were insufficient. Built 34 custom classification rules for: Epic patient MRN format, ICD-10 diagnosis codes in free-text, medication names from formulary dictionary match, insurance member ID patterns, and clinical note markers. Classification accuracy validated at 96.3% against 10,000 manually labelled records.
Power BI PHI governance: Deployed sensitivity label policy preventing download/export of reports containing PHI unless user holds HIPAA-authorized role (Entra ID security group). Power BI lineage in Purview enabled identification of 147 reports containing PHI columns — 23 of which had no sensitivity label applied. All remediated within 4 weeks.
Continuous HIPAA monitoring: Compliance Manager HIPAA assessment connected to Defender for Cloud. 42 automated control tests run daily. Compliance score reported weekly to CISO and Privacy Officer. First external HIPAA audit post-deployment: no significant findings.
Case Study 3: Retail Enterprise — Data Mesh Governance with Microsoft Fabric
A FTSE 100 retailer with 8 data domains (Customer, Product, Supply Chain, Finance, Marketing, Store Operations, Loyalty, Digital) implemented a Data Mesh architecture on Microsoft Fabric. The governance challenge: ensure interoperability and trust across domain-owned data products without a central data team bottleneck.
Governance Architecture
Federated Purview (one account per domain): 8 Purview accounts, one per domain. Each domain team operates their own catalog with full autonomy. Enterprise glossary terms synchronized from a central “governance hub” account via nightly API sync.
Cross-domain lineage: Custom lineage federation solution: each domain emits lineage events to a central Azure Event Hub. A Fabric Data Factory pipeline consumes events and writes cross-domain lineage to a central Purview account. CDO can view full end-to-end customer journey lineage from Customer domain through to Finance domain.
Data Product certification: A data product is certified (Purview “Certified” endorsement) only when: data quality SLA is documented, owner is assigned, sensitivity label is applied, and a data contract (schema + SLA) is published to the enterprise API registry. Automated certification check runs weekly via Purview API.
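The certification gate described above reduces to four boolean checks. The sketch below shows the core of such a weekly check; the field names are assumptions for illustration, not Purview API attributes.

```python
def is_certifiable(product: dict):
    """A data product earns the 'Certified' endorsement only when all four
    conditions from the governance model hold. Returns (ok, missing)."""
    checks = {
        "data quality SLA documented": bool(product.get("quality_sla")),
        "owner assigned": bool(product.get("owner")),
        "sensitivity label applied": bool(product.get("sensitivity_label")),
        "data contract published": bool(product.get("contract_url")),
    }
    missing = [name for name, ok in checks.items() if not ok]
    return (not missing, missing)
```

A check of this shape is what a weekly Purview API job would evaluate before applying or revoking the endorsement.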
| KPI | Target | Achieved (Month 12) |
| Data products certified | 80% of published products | 84% |
| Cross-domain data access time (request to provision) | < 3 business days | 1.2 days average |
| Data quality incidents crossing domain boundary | < 5/month | 2.1/month average |
| Time to identify root cause of cross-domain data issue | < 4 hours | 47 minutes average |
CHAPTER 13
90-Day Implementation Roadmap
Based on 20+ Purview deployments, the following 90-day roadmap represents the optimal sequencing for enterprise governance programs. It balances quick wins (demonstrable value in Week 6) with foundation-setting activities required for long-term scale.
Phase 1: Foundation (Days 1-30)
| Week | Activity | Owner | Success Criteria |
| 1 | Purview account provisioning, network design (MVNet vs SHIR), AAD groups creation, collection hierarchy design | Data Engineer + Architect | Purview account live; network connectivity validated; collection hierarchy documented and approved |
| 1-2 | Source inventory: document all data sources (type, location, sensitivity level, volume, business owner) | Data Governance Lead | Complete source inventory spreadsheet; sources prioritized by regulatory risk |
| 2 | Glossary foundation: identify 50-100 core business terms per domain with definitions, owners, and related terms | Data Governance Lead + Domain Owners | Core glossary terms in Draft status; term owners assigned |
| 2-3 | Register and scan Tier 1 sources (highest regulatory risk: PCI scope, PHI scope, GDPR-critical sources) | Data Engineer | Tier 1 sources scanned; classification results reviewed; false positive rate < 5% |
| 3-4 | Classification review and tuning: review auto-classification results; build custom rules for organizational patterns; set thresholds | Data Steward + Data Engineer | Custom classification rules deployed; classification accuracy > 90% on sampled validation set |
| 4 | RBAC configuration: assign Data Curator, Data Reader roles to domain teams; configure collection-level permissions | Data Governance Lead | All roles assigned; domain teams can access catalog; data engineers can register sources |
Phase 2: Activation (Days 31-60)
| Week | Activity | Owner | Success Criteria |
| 5-6 | Register and scan all remaining data sources; configure incremental scan schedules | Data Engineer | >90% of data estate registered and scanned; scan schedule operational |
| 6 | Quick win: publish Governance Maturity Score dashboard in Power BI; present Week 6 scorecard to leadership | Data Governance Lead | Dashboard live; leadership briefing completed; program funded for Phase 3 |
| 6-7 | Sensitivity label deployment: configure label taxonomy; deploy auto-labelling policies in simulation mode for M365 and Azure | Security + Data Governance | Labels published; simulation mode running; simulation report reviewed |
| 7-8 | Lineage validation: verify ADF, Synapse, Fabric lineage; build custom lineage for non-native sources via Atlas API | Data Engineer | End-to-end lineage visible for >3 critical data pipelines; column-level lineage for Power BI reports |
| 7-8 | Stewardship workflows: configure ownership assignment workflows; assign owners to all Tier 1 assets; begin Tier 2 | Data Steward | >95% of Tier 1 assets have assigned owner and expert contact |
| 8 | Compliance Manager setup: create GDPR, HIPAA, or relevant regulatory assessments; map controls to organizational evidence | Compliance Officer + Data Governance | At least one regulatory assessment active; initial compliance score baseline established |
Phase 3: Optimization (Days 61-90)
| Week | Activity | Owner | Success Criteria |
| 9-10 | Enable sensitivity labels in production (exit simulation mode); deploy DLP policies; monitor DLP incidents for first 2 weeks before tuning | Security + Data Governance | Labels applying to new content; DLP incidents appearing in dashboard; < 10% false positive rate on DLP |
| 10-11 | Data access policy deployment: configure self-service access request workflow; pilot with 2-3 data domains | Data Governance Lead + Data Engineer | Self-service access workflow live; first access requests processed through Purview; data owner satisfaction confirmed |
| 11 | Glossary completion: approve Tier 1 glossary terms; link certified terms to classified assets via bulk assignment | Data Steward | >80% of Tier 1 assets linked to at least one approved glossary term |
| 12 | Program review: measure Governance Maturity Score vs. Week 1 baseline; document lessons learned; plan 90-180 day roadmap | CDO + Data Governance Lead | Governance score improvement documented; 90-180 day roadmap approved; ongoing operating model confirmed |
| Critical Success Factor: Governance programs that fail typically do so in Days 31-60 — the “activation phase.” Quick wins must be demonstrated by Day 45 to maintain organizational momentum and leadership confidence. The Power BI Governance Maturity Score dashboard is specifically designed as this early value demonstration. |
APPENDIX
Appendix: Reference Materials
A. Essential KQL Queries for Purview Operations
A1. Assets Without Owners (Stewardship Gap Report)
| // Find all assets in Purview Data Map without assigned owners |
| // Run in Log Analytics workspace linked to Purview |
| PurviewAssetMetadata_CL |
| | where TimeGenerated > ago(7d) |
| | where isnull(Owner_s) or Owner_s == "" |
| | summarize UnownedAssets = count() by SourceType_s, Collection_s |
| | order by UnownedAssets desc |
| | project Collection_s, SourceType_s, UnownedAssets |
A2. Classification Coverage by Data Domain
| // Sensitivity classification coverage heat map by collection |
| PurviewAssetMetadata_CL |
| | where TimeGenerated > ago(1d) |
| | summarize |
| TotalAssets = count(), |
| ClassifiedAssets = countif(isnotempty(Classifications_s)), |
| SensitiveAssets = countif(SensitivityLabel_s in ("Confidential","HighlyConfidential")) |
| by Collection_s |
| | extend |
| ClassificationCoverage = round(100.0 * ClassifiedAssets / TotalAssets, 1), |
| SensitiveCoverage = round(100.0 * SensitiveAssets / TotalAssets, 1) |
| | order by ClassificationCoverage asc // Surface lowest coverage domains first |
A3. Scan Failure Detection & Alerting
| // Alert when scan success rate drops below 95% in any 24-hour window |
| PurviewScanLogs_CL |
| | where TimeGenerated > ago(24h) |
| | summarize |
| TotalScans = count(), |
| FailedScans = countif(Status_s == "Failed"), |
| SuccessRate = round(100.0 * countif(Status_s == "Succeeded") / count(), 1) |
| by SourceName_s |
| | where SuccessRate < 95 |
| | project SourceName_s, TotalScans, FailedScans, SuccessRate |
| | order by SuccessRate asc |
| // Use this query as basis for Azure Monitor alert rule |
| // Alert when count() > 0 (any source below 95% success) |
B. Purview REST API Quick Reference
| Operation | Method | Endpoint | Use Case |
| List collections | GET | /account/collections | Audit collection hierarchy; generate governance reports |
| Get asset by qualified name | GET | /catalog/api/atlas/v2/entity/uniqueAttribute/type/{typeName}?attr:qualifiedName={qn} | Look up specific asset metadata in automation |
| Update asset contacts | PUT | /catalog/api/atlas/v2/entity/guid/{guid}/businessattribute/Contacts | Bulk owner assignment in onboarding automation |
| Submit lineage | POST | /catalog/api/atlas/v2/entity/bulk | Custom lineage for non-native sources (dbt, custom ETL) |
| Run scan | POST | /scan/datasources/{dsName}/scans/{scanName}/runs | Trigger scan on-demand from CI/CD pipeline on schema change |
| Get scan status | GET | /scan/datasources/{dsName}/scans/{scanName}/runs/{runId} | Poll scan completion in automation workflow |
| Create glossary term | POST | /catalog/api/atlas/v2/glossary/term | Bulk glossary population from existing business dictionaries |
| Assign term to asset | POST | /catalog/api/atlas/v2/glossary/terms/{termGuid}/assignedEntities | Automated term linking after scan completion |
C. Governance Maturity Model
| Level | Name | Characteristics | Target Score | Typical Timeline |
| L1 | Initial | Ad hoc governance; no systematic catalog; manual compliance preparation; governance by tribal knowledge | 0-20 | Starting point for most organizations |
| L2 | Managed | Data sources registered and scanned; basic classification applied; glossary under development; ownership partially assigned | 20-50 | 0-6 months post-Purview deployment |
| L3 | Defined | Full estate classification; glossary approved and linked; lineage documented for critical pipelines; compliance assessments active | 50-70 | 6-12 months post-deployment |
| L4 | Quantitatively Governed | Governance Maturity Score tracked weekly; stewardship SLAs enforced; access policies active; DLP protecting sensitive data; self-service access working | 70-85 | 12-24 months |
| L5 | Optimizing | Automated certification, continuous compliance, AI-assisted stewardship, domain-level governance ownership, governance embedded in CI/CD pipelines | 85-100 | 24-36 months |
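Teams that track the Governance Maturity Score in dashboards can map a score to its level with a simple lookup, per the bands above. Boundary handling (a score of exactly 20 counting as L2, and so on) is our assumption, since the table expresses the bands as open ranges.

```python
# Score floors per the maturity model table, highest first.
LEVELS = [(85, "L5"), (70, "L4"), (50, "L3"), (20, "L2"), (0, "L1")]

def maturity_level(score):
    """Return the maturity level (L1-L5) for a 0-100 governance score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, level in LEVELS:
        if score >= floor:
            return level
```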
D. Glossary of Key Terms
| Term | Definition |
| Apache Atlas | Open-source metadata management and governance framework; the foundational metadata model underlying Purview’s Data Map |
| Business Glossary | Curated vocabulary of business terms linked to data assets; provides semantic context and shared language across data consumers |
| Collection | Hierarchical container in Purview that scopes metadata, access control, and policy enforcement; the primary organizational unit of the Data Map |
| Data Lineage | Documentation of data origin, movement, and transformation — tracing how data flows from source systems through processing layers to consumption |
| Data Map | The live, continuously updated inventory of all registered data assets in Purview, including metadata, classifications, lineage, and relationships |
| DLP (Data Loss Prevention) | Policies that detect and block unauthorized movement or sharing of sensitive data across email, documents, cloud storage, and messaging platforms |
| Endorsement | Trust signal applied to catalog assets: “Promoted” (recommended by workspace member) or “Certified” (validated by designated authority) |
| Glossary Term | A business concept formally defined in Purview with name, definition, steward, status, and asset linkages |
| Managed Virtual Network | Microsoft-managed network infrastructure for Purview scanning; eliminates need for customer-managed integration runtime VMs |
| OpenLineage | Open standard for data lineage metadata; used by Purview Spark connector to emit lineage from Spark jobs |
| Policy Author | Purview RBAC role with permission to create and publish data access policies that enforce at the storage layer |
| Scan Rule Set | Configuration defining which file types and classification rules apply when scanning a data source |
| Self-Hosted Integration Runtime (SHIR) | Customer-managed agent VM that enables Purview scanning of on-premises or private network data sources |
| Sensitivity Label | Classification tag applied to data assets and documents (e.g., Confidential, Highly Confidential) that drives downstream protection actions |
