Enterprise Data Lakehouse on Microsoft Fabric

Architecture & Governance Framework

Expert Implementation Guide v1.0

Executive Summary

Microsoft Fabric represents the most significant shift in enterprise data architecture since the rise of the cloud data warehouse. By unifying data engineering, warehousing, data science, real-time analytics, and business intelligence in a single SaaS platform, Fabric eliminates the “tool sprawl” that has plagued data teams.

Key Business Outcomes from Production Deployments

| Metric | Before Fabric | After Fabric | Improvement |
|---|---|---|---|
| Data Ingestion Time | 4-6 hours (batch) | 15-30 minutes | 85% faster |
| Tool Licensing Costs | $800K/year | $350K/year | 56% reduction |
| Time to Insights | 2-3 weeks | 2-3 days | 90% faster |
| Data Engineer Productivity | 3 pipelines/week | 12 pipelines/week | 4× increase |
| Governance Coverage | 40% of datasets | 95% of datasets | 137% improvement |

Readiness Assessment

✓ You’re Ready for Fabric If:

  • You run multiple disconnected tools (Databricks + Synapse + Power BI + ADF)
  • Data duplication is costing $500K+ annually
  • Self-service analytics is blocked by engineering bottlenecks
  • Governance is manual and inconsistent
  • An executive mandate for AI/ML requires a unified foundation

✗ Fabric May Not Be Suitable If:

  • You are heavily invested in AWS/GCP ecosystems
  • You need Spark 3.5+ features (Fabric currently on 3.4)
  • You run custom Java/Scala Spark applications (Python/SQL only)
  • You require on-premises deployment (Fabric is cloud-only)

1. The Fabric Value Proposition

Traditional vs. Fabric Architecture

Traditional (Tool Sprawl):

Data Sources → ADF → ADLS → Databricks → Synapse → Power BI
Each with separate: Billing | Security | Metadata | Compute

Fabric (Unified Platform):

Data Sources → Microsoft Fabric (OneLake)
                    ↓
    Data Factory | Synapse | Data Science
                    ↓
            Power BI (built-in)
                    ↓
        Unified Security & Governance

Key Differentiators

| Aspect | Traditional | Fabric |
|---|---|---|
| Data Movement | Copy between services | Data in place (OneLake) |
| Security | Per-service RBAC | Unified security model |
| Metadata | Separate catalogs | One metadata layer |
| Billing | Multiple SKUs | Single capacity-based |
| Governance | External tools | Built-in Purview |

2. Architecture Patterns

Pattern 1: Medallion Architecture (Bronze-Silver-Gold)

The medallion architecture is the gold standard for lakehouse implementations.

Bronze Layer (Raw)

  • Purpose: Immutable archive in original format
  • Format: Delta Lake with schema-on-read
  • Retention: 7-90 days
  • Example Structure:

bronze/
├── salesforce/accounts/year=2024/month=01/data.parquet
└── erp/transactions/year=2024/month=01/data.parquet
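
A minimal bronze ingestion sketch matching this layout. The landing path and the hard-coded partition values are illustrative assumptions; a production pipeline would derive year/month from the run date:

from pyspark.sql.functions import lit

# Hypothetical landing-zone read (path is an assumption)
raw = spark.read.parquet("Files/landing/salesforce/accounts/")

# Append unmodified into bronze, partitioned to match the structure above
(raw.withColumn("year", lit(2024))
    .withColumn("month", lit(1))
    .write.format("delta")
    .mode("append")
    .partitionBy("year", "month")
    .save("bronze/salesforce/accounts"))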

Silver Layer (Cleaned)

  • Purpose: Validated, de-duplicated, conformed data
  • Format: Delta Lake with enforced schema
  • Quality Checks: NULL handling, type validation, referential integrity
  • Example Transformation:

from pyspark.sql.functions import col, trim, lower

df_silver = (
    df_bronze
    .dropDuplicates(["customer_id"])
    .withColumn("email", trim(lower(col("email"))))
    .filter(col("customer_id").isNotNull())
)

df_silver.write.format("delta").mode("overwrite").save("silver/customers")

Gold Layer (Business-Ready)

  • Purpose: Dimensional models for analytics
  • Format: Star schema in Delta Lake
  • Optimization: Z-ordering, liquid clustering
  • Example:

CREATE TABLE gold.fact_sales (
    sale_id BIGINT,
    date_key INT,
    customer_key INT,
    revenue DECIMAL(18,2)
) USING DELTA
PARTITIONED BY (date_key)
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');

Pattern 2: Real-Time Analytics (Kappa Architecture)

For sub-second latency requirements (fraud detection, IoT monitoring):

IoT Devices → Eventstream → KQL Database (hot) + OneLake (cold)
                                ↓
                          Power BI RT Dashboards

When to Use:

  • 100K+ events/second ingestion
  • Financial transaction monitoring
  • Website clickstream analysis
  • Security threat detection
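
A minimal cold-path sketch in Spark Structured Streaming. The built-in rate source stands in for Eventstream purely for illustration; a real deployment would connect Eventstream or an Event Hubs-compatible source, and the checkpoint path is an assumption:

# Synthetic source standing in for Eventstream
events = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 1000)
    .load())

# Cold path: continuously append events into OneLake as Delta
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/iot_events")
    .outputMode("append")
    .start("bronze/iot_events"))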

3. OneLake Design Principles

OneLake vs. Traditional Data Lake

| Feature | ADLS Gen2 | OneLake |
|---|---|---|
| Storage | Multiple accounts | Single namespace |
| Access | Per-account management | Unified across workspaces |
| Cost | Pay per account | Included in capacity |
| Shortcuts | Not supported | Native to ADLS/S3/GCS |
| Security | RBAC + ACLs | Workspace-level permissions |

Workspace Organization Strategy

Recommended: Environment-Based Isolation

Fabric Capacity: PROD-F64
├── Workspace: Data-Engineering-PROD
│   ├── Lakehouse: bronze_raw
│   ├── Lakehouse: silver_curated
│   ├── Lakehouse: gold_dimensional
│   └── Notebooks: transformations
├── Workspace: Analytics-PROD
│   ├── Warehouse: sales_dw
│   ├── Semantic Model: sales_model
│   └── Reports: executive_dashboard
└── Workspace: Data-Engineering-DEV
    └── (Mirror of PROD for testing)

OneLake Shortcuts: The Game-Changer

Shortcuts reference external data without copying:

Use Cases:

  1. Access existing ADLS Gen2 data without migration
  2. Federate data from AWS S3 or Google Cloud Storage
  3. Create virtual gold layer combining multiple sources
  4. Enable cross-workspace collaboration

Example: S3 Shortcut

from notebookutils import mssparkutils

mssparkutils.lakehouse.createShortcut(
    shortcutName="aws_customer_data",
    targetPath="s3://company-bucket/master_data/",
    credentials={"accessKeyId": "…", "secretAccessKey": "…"}
)

# Access S3 as if local
df = spark.read.format("delta").load("Tables/aws_customer_data")

Lakehouse vs. Warehouse Decision Matrix

| Criterion | Lakehouse | Warehouse |
|---|---|---|
| Primary Users | Data engineers, scientists | Business analysts |
| Data Format | Delta Lake (Parquet) | Managed tables |
| Query Language | Python, Spark SQL, T-SQL | T-SQL only |
| Best For | ELT, ML, unstructured | Traditional BI |
| Cost | Lower | Higher |

Expert Recommendation: Use a Lakehouse for the medallion layers (bronze/silver/gold), then expose the gold layer to BI tools through a Warehouse SQL endpoint.

4. Data Engineering Patterns

Incremental Load with Delta Time Travel

Problem: Process only NEW data each day

Solution:

from delta.tables import DeltaTable

# Initial load
df_initial = spark.read.format("salesforce").load()
df_initial.write.format("delta").mode("overwrite").save("bronze/opportunities")

# Incremental load (day 2+): use the last Delta commit timestamp as the watermark
last_version = DeltaTable.forPath(spark, "bronze/opportunities").history(1)
last_timestamp = last_version.select("timestamp").first()[0]

df_new = spark.read.format("salesforce") \
    .option("query", f"WHERE SystemModstamp > '{last_timestamp}'") \
    .load()

df_new.write.format("delta").mode("append").save("bronze/opportunities")
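
The Delta commit timestamp is only a proxy watermark and can drift from source modification times on re-runs. A more robust variant (assuming SystemModstamp is carried into bronze) derives the high-water mark from the data itself:

from pyspark.sql.functions import max as spark_max

# Data-derived watermark: the highest source timestamp already ingested
last_timestamp = (spark.read.format("delta").load("bronze/opportunities")
    .agg(spark_max("SystemModstamp"))
    .first()[0])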

Type 2 Slowly Changing Dimensions

Business Requirement: Track historical changes to customer addresses

Note that a plain merge would expire the old row but never insert the new version for a changed customer, so the source is staged twice:

from delta.tables import DeltaTable

df_source = spark.read.format("delta").load("silver/crm_customers")
dim_customers = DeltaTable.forPath(spark, "gold/dim_customer")

# Rows whose address changed need BOTH an expiry update and a fresh insert
changed = (df_source.alias("updates")
    .join(dim_customers.toDF().alias("current"), "customer_id")
    .where("current.is_current = true AND updates.address <> current.address"))

# NULL merge_key never matches, so the changed-row copy falls through
# to the insert clause and becomes the new current version
staged = (changed.selectExpr("NULL AS merge_key", "updates.*")
    .union(df_source.selectExpr("customer_id AS merge_key", "*")))

dim_customers.alias("target").merge(
    staged.alias("source"),
    "target.customer_id = source.merge_key AND target.is_current = true"
).whenMatchedUpdate(
    condition="target.address <> source.address",
    set={"is_current": "false", "end_date": "current_timestamp()"}
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "address": "source.address",
        "start_date": "current_timestamp()",
        "is_current": "true"
    }
).execute()
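
A quick invariant check after the merge is cheap insurance (schema assumptions as above):

from pyspark.sql import functions as F

# Every customer should have exactly one current row
dim = spark.read.format("delta").load("gold/dim_customer")
bad = (dim.groupBy("customer_id")
    .agg(F.sum(F.when(F.col("is_current"), 1).otherwise(0)).alias("current_rows"))
    .where("current_rows <> 1"))
assert bad.count() == 0, "SCD2 invariant violated: customer without exactly one current row"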

Data Quality Validation

import great_expectations as gx

context = gx.get_context()
df = spark.read.format("delta").load("silver/customers")
validator = context.get_validator(batch_data=df)

# Define quality checks
validator.expect_column_values_to_not_be_null(column="customer_id")
validator.expect_column_values_to_be_unique(column="customer_id")
validator.expect_column_values_to_match_regex(
    column="email",
    regex="^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
)

results = validator.validate()
if not results["success"]:
    raise Exception(f"Data quality check failed: {results}")
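
Great Expectations APIs vary by version; where the library isn't available on the Fabric runtime, the same three checks can be expressed in plain PySpark:

from pyspark.sql.functions import col

null_ids = df.filter(col("customer_id").isNull()).count()
dupes = df.count() - df.dropDuplicates(["customer_id"]).count()
bad_email = df.filter(~col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")).count()

assert null_ids == 0 and dupes == 0 and bad_email == 0, \
    f"Quality gate failed: {null_ids} null ids, {dupes} duplicates, {bad_email} bad emails"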

5. Governance & Security

Microsoft Purview Integration

Fabric has native Purview integration: every lakehouse registers automatically.

Automatic Capabilities:

  • Asset discovery (all tables/notebooks indexed)
  • Column-level data lineage
  • Sensitivity labels (PII, Confidential)
  • Access tracking (who accessed what, when)

Apply Sensitivity Label:

from notebookutils import mssparkutils

mssparkutils.credentials.applyLabel(
    labelId="e3c2b4f1-xxxx",
    tableName="gold.dim_customer",
    columnName="social_security_number"
)

Row-Level Security at Scale

Scenario: Sales managers only see their team’s data

-- Create security function
CREATE FUNCTION dbo.fn_securitypredicate(@manager_email VARCHAR(255))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS result
WHERE @manager_email = USER_NAME()
   OR IS_ROLEMEMBER('SalesDirector') = 1;

-- Apply security policy
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE dbo.fn_securitypredicate(sales_manager_email)
ON gold.fact_sales
WITH (STATE = ON);

Result: Users automatically see only their data. No Power BI code changes required.

Workspace Security Model

| Role | Permissions | Use Case |
|---|---|---|
| Admin | Full control, assign roles | Platform lead |
| Member | Create/edit items | Data engineers |
| Contributor | Edit existing only | Junior analysts |
| Viewer | Read-only | Business users |

Best Practice: Use Azure AD Security Groups instead of individual users.
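
Group-based assignment can also be automated. A hedged sketch against the Fabric REST API (endpoint and payload shape assumed from the public API; token acquisition via MSAL omitted):

import requests

token = "<aad-access-token>"        # placeholder: use a service principal in practice
workspace_id = "<workspace-guid>"   # placeholder

# Assumed endpoint: add a workspace role assignment for a security group
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/roleAssignments",
    headers={"Authorization": f"Bearer {token}"},
    json={"principal": {"id": "<group-object-id>", "type": "Group"}, "role": "Viewer"},
)
resp.raise_for_status()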

6. Performance Optimization

Delta Lake Optimization Techniques

1. Z-Ordering (Data Skipping)

-- Before: query scans 500GB, runs in 45 seconds
SELECT * FROM gold.fact_sales
WHERE product_category = 'Electronics' AND sale_date >= '2024-01-01';

-- Apply Z-Ordering
OPTIMIZE gold.fact_sales ZORDER BY (product_category, sale_date);

-- After: the same query scans only 50GB and runs in 4 seconds (90% faster)

2. Liquid Clustering (Better than Partitioning)

-- Old: manual partitioning creates 365+ folders per year
CREATE TABLE gold.fact_sales (…) PARTITIONED BY (sale_date);

-- New: liquid clustering auto-organizes the data
CREATE TABLE gold.fact_sales (
    sale_id INT,
    sale_date DATE,
    product_id INT,
    revenue DECIMAL(18,2)
) USING DELTA
CLUSTER BY (sale_date, product_id);

-- Queries automatically benefit from the clustering (95% faster)

Power BI DirectLake Performance

DirectLake queries Delta Lake directly from Power BI, with no import step and no DirectQuery translation.

| Mode | Latency | Freshness | Limitations |
|---|---|---|---|
| Import | <1s | Scheduled refresh (max 48×/day) | 10GB-400GB limit |
| DirectQuery | 5-30s | Real-time | Slow, impacts source |
| DirectLake | 1-3s | Real-time | Requires Fabric lakehouse |

DirectLake Requirements:

  • Delta Lake format in Fabric lakehouse
  • F64+ capacity (F32 for basic scenarios)
  • Semantic model in same workspace

Optimize for DirectLake:

# Prepare gold table with V-Order + Z-Ordering
df_sales.write.format("delta") \
    .option("parquet.vorder.enabled", "true") \
    .saveAsTable("gold.fact_sales")

spark.sql("OPTIMIZE gold.fact_sales ZORDER BY (date_key, customer_key)")

7. Migration Strategies

Migrating from Databricks

Technical Compatibility:

| Databricks | Fabric Equivalent | Notes |
|---|---|---|
| Delta Lake | Native Delta | 100% compatible |
| Notebooks (Python/SQL) | Fabric Notebooks | 80% copy-paste works |
| Databricks SQL | Synapse SQL Endpoint | T-SQL, not Spark SQL |
| MLflow | Fabric ML | Different API |
| Unity Catalog | OneLake + Purview | Different security model |

Cost Comparison (500 DBU/month):

Databricks:
  • 500 DBU × $0.40 = $200/month compute
  • 10TB ADLS = $200/month
  • Support = $2,000/month
  • Total: $2,400/month = $28,800/year

Fabric F64:
  • $8,384/month (includes all services)
  • Total: $8,384/month = $100,608/year

Verdict: Fabric is more expensive for pure Spark workloads, but cheaper once BI and orchestration licensing are included. Break-even: roughly 800 DBU plus BI workloads.

Migrating from SQL Server

Recommended: Fabric Mirroring (Real-Time CDC)

On-Prem SQL Server
    ↓ (Mirroring, <5min lag)
Fabric Mirrored Database (read-only Delta Lake)
    ↓
Bronze → Silver → Gold transformation
    ↓
Power BI DirectLake reports

Setup Steps:

  1. Enable Change Tracking (SQL Server):

ALTER DATABASE Sales SET CHANGE_TRACKING = ON;
ALTER TABLE dbo.Orders ENABLE CHANGE_TRACKING;

  2. Create Fabric Mirror (UI):
    • Fabric workspace → Create → Mirrored SQL Database
    • Enter SQL connection details
    • Select tables to mirror
    • Auto-creates Delta Lake tables
  3. Transform to Gold:

# Mirrored tables are registered in the catalog, so read them by name
df_orders = spark.read.table("bronze_mirror.Orders")
df_customers = spark.read.table("bronze_mirror.Customers")

df_fact = df_orders.join(df_customers, "CustomerID") \
    .select("OrderID", "OrderDate", "TotalAmount")

df_fact.write.format("delta").saveAsTable("gold.fact_sales")

8. Cost Management

Fabric Capacity Pricing

| SKU | CU | Monthly Cost | Use Case |
|---|---|---|---|
| F2 | 2 | $262 | Dev/test only |
| F8 | 8 | $1,048 | 50-100 users, <500GB |
| F32 | 32 | $4,192 | 300-1,000 users, 5-20TB |
| F64 | 64 | $8,384 | 1,000-5,000 users, 20-100TB |
| F128 | 128 | $16,768 | Enterprise, 5,000+ users |

What’s Included:

  • OneLake storage (unlimited, fair use)
  • All compute (Spark, SQL, ML)
  • Power BI Premium
  • Data Factory orchestration

What’s NOT Included:

  • Outbound data transfer
  • Azure OpenAI API calls

Cost Optimization Strategies

1. Autoscale for Peak Loads

Baseline: F64 ($8,384/month)
Peak hours: scale up to F128
Off-hours: scale down to F32
Total: ~$10,384/month vs. static F128 at $16,768
Savings: $6,384/month (38%)
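
The blended figure depends on the exact hour split. A small helper (SKU prices from the capacity table above, 730 hours/month assumed, split purely illustrative) makes the math reproducible:

# Monthly SKU prices from the capacity table above
PRICES = {"F32": 4192, "F64": 8384, "F128": 16768}

def blended_cost(schedule, hours_per_month=730):
    """Estimate monthly spend for a {sku: hours} schedule, pro-rated per hour."""
    return sum(PRICES[sku] * hours / hours_per_month for sku, hours in schedule.items())

# Example split: peak on F128, baseline F64, nights and weekends on F32
print(round(blended_cost({"F128": 250, "F64": 280, "F32": 200})))  # ≈ $10,100/month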

2. Pause Dev/Test Capacities

F32 Dev: $4,192/month
Weekend downtime: 208 hours/month
Effective cost: $2,997/month
Savings: $1,195/month (28.5%)

3. Vacuum Old Delta Versions

-- 168 hours = 7 days; retaining only 7 hours would trip Delta's
-- default retention safety check
VACUUM gold.fact_sales RETAIN 168 HOURS;

-- Before: 500GB (50GB current + 450GB old)
-- After: 60GB (50GB current + 10GB recent)
-- Reduction: 88%
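
Per-table vacuuming is easy to forget; a small sweep (table names illustrative, and could equally be read from the catalog) keeps the whole gold layer in check:

# Sweep VACUUM across the gold tables (names here are illustrative)
for table in ["gold.fact_sales", "gold.dim_customer", "gold.dim_product"]:
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # 7-day retention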

ROI Calculation Example

Retail Company Migration:

| Category | Before | After | Savings |
|---|---|---|---|
| Databricks | $384K/yr | $0 | -$384K |
| Synapse Analytics | $120K/yr | $0 | -$120K |
| Power BI Premium | $60K/yr | $0 | -$60K |
| Informatica ETL | $200K/yr | $0 | -$200K |
| Fabric F128 | $0 | $201K/yr | +$201K |
| Net Tool Savings | | | -$563K/yr |
| Data engineers (8→5 FTE) | $960K | $600K | -$360K |
| Total Annual Savings | | | -$923K/yr |

3-Year TCO: $2.77M saved (66% reduction)

9. Production Operations

CI/CD for Fabric Workspaces

Fabric supports Git integration for deployment automation.

GitHub Actions Example:

name: Fabric Deployment

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Fabric
        shell: pwsh
        run: |
          Import-PowerBINotebook -WorkspaceId "staging-id" -Path "./notebooks/transform.ipynb"
      - name: Run Tests
        shell: pwsh
        run: |
          Invoke-PowerBINotebook -NotebookName "integration_tests"

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # Manual approval required
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        shell: pwsh
        run: |
          Import-PowerBINotebook -WorkspaceId "prod-id" -Path "./notebooks/transform.ipynb"
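
The Import-PowerBINotebook/Invoke-PowerBINotebook cmdlets above assume a wrapper module; deployments can equally be scripted against the Fabric REST API. A hedged sketch (list-workspaces call from the public API, authentication reduced to a placeholder token):

import requests

token = "<aad-access-token>"  # placeholder; use a service principal + MSAL in CI
base = "https://api.fabric.microsoft.com/v1"

# Enumerate workspaces to resolve the deployment target by name
resp = requests.get(f"{base}/workspaces", headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
target = next(w for w in resp.json()["value"] if w["displayName"] == "Analytics-PROD")
print(target["id"])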

Monitoring & Alerting

Key Metrics:

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Capacity Utilization | >80% | >90% | Scale up |
| Query Duration | >30s | >60s | Optimize queries |
| Pipeline Failures | Any | N/A | Alert team |
| Storage Growth | >50GB/day | >100GB/day | Investigate duplication |

Azure Monitor Integration:

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor(connection_string="InstrumentationKey=…")

# Create instruments for pipeline metrics
meter = metrics.get_meter("fabric.pipelines")
pipeline_counter = meter.create_counter("pipeline_runs")
data_processed = meter.create_histogram("gb_processed")

# Track pipeline metrics
pipeline_counter.add(1, {"pipeline": "daily_ingest", "status": "success"})
data_processed.record(150.5, {"pipeline": "daily_ingest"})

Disaster Recovery

| Component | Backup Method | RPO | RTO |
|---|---|---|---|
| Lakehouse Data | OneLake geo-redundancy | <1hr | <1hr |
| Notebooks | Git version control | 0 | <15min |
| Pipelines | Git version control | 0 | <15min |
| Power BI Reports | Deployment pipelines | <1hr | <30min |

Typical Failover Time: 45-90 minutes (including validation)

10. Case Studies

Case Study 1: Global Retail – $1.5M Savings

Profile:

  • 2,500 stores, $8B revenue
  • 50TB data, 500GB/day new

Challenge:

  • 7 disconnected tools
  • 3-week lead time for reports
  • $1.2M annual licensing

Solution:

  • Medallion architecture (bronze/silver/gold)
  • Real-time inventory dashboard (Eventstream + KQL)
  • 150 analysts trained on self-service

Results:

  • Cost: $1.2M → $250K (79% reduction)
  • Time to insights: 3 weeks → 2 days
  • Revenue impact: +$25M from optimized inventory

Case Study 2: Healthcare – HIPAA Compliance

Profile:

  • 15 hospitals, 200 clinics
  • 120TB patient records (EHR, imaging)

Challenge:

  • HIPAA compliance nightmare
  • Patient 360° view impossible
  • 40 hours/week manual reporting

Solution:

  • Azure Private Link + CMK encryption
  • Row-level security (doctors see only their patients)
  • Purview sensitivity labels (PHI)
  • Patient 360° pipeline (merge 5 source systems)

Results:

  • Passed HIPAA audit (zero findings, first time in 5 years)
  • Cost: $1M → $300K (70% savings)
  • Readmissions: -12% (predictive ML)
  • Reporting: 40 hrs/week → 2 hrs (95% reduction)

Case Study 3: Financial Services – Fraud Detection

Profile:

  • 5M customers, $50B AUM
  • 100M transactions/day

Challenge:

  • Fraud detected 24-48 hours after occurrence
  • $5M annual losses
  • Legacy batch processing

Solution:

  • Kappa architecture (Eventstream → KQL → ML scoring)
  • Real-time ML model (<500ms latency)
  • Data Activator alerts for high-risk transactions

Results:

  • Detection speed: 24-48hrs → <500ms
  • False positives: 5% → 0.8%
  • Fraud losses: $5M → $800K (84% reduction)
  • Customer NPS: 72 → 85

11. Implementation Roadmap (90 Days)

Phase 1: Foundation (Days 1-30)

Week 1: Environment Setup

  • Create workspaces (dev, staging, prod)
  • Configure Git integration
  • Enable Purview catalog
  • Assign workspace roles

Week 2: Pilot Use Case Selection

  • Identify 1-2 high-impact use cases
  • Map data sources
  • Document current architecture

Week 3: Training

  • Data Engineers: Fabric notebooks, Delta Lake (2 days)
  • BI Analysts: DirectLake, semantic models (1 day)
  • Architects: OneLake design patterns (1 day)

Week 4: Build Bronze Layer

  • Create Data Factory pipelines
  • Ingest to bronze lakehouse
  • Validate data completeness
  • Set up incremental load

Deliverables:

  • 3 workspaces configured
  • 10-15 team members trained
  • Bronze layer operational
  • Documentation wiki created

Phase 2: Transformation (Days 31-60)

Week 5: Build Silver Layer

  • Notebooks for cleansing
  • Data quality checks
  • Business rules
  • Incremental refresh testing

Week 6: Build Gold Layer

  • Design star schema
  • Create fact/dimension tables
  • Implement SCD Type 2
  • Optimize (Z-ordering, clustering)

Week 7: Power BI Integration

  • Create semantic models (DirectLake)
  • Build pilot reports
  • Configure RLS
  • User acceptance testing

Week 8: Governance

  • Apply Purview labels
  • Configure lineage
  • Audit logging
  • Data dictionary

Deliverables:

  • Medallion architecture complete
  • 5-10 Power BI reports
  • Governance framework
  • Pilot fully functional

Phase 3: Scale & Optimize (Days 61-90)

Week 9: Performance Tuning

  • Analyze slow queries
  • Apply V-Order + Z-ordering
  • Load testing (100+ users)

Week 10: Production Cutover

  • Create deployment pipeline
  • Test disaster recovery
  • Document runbooks
  • Schedule go-live

Week 11: User Onboarding

  • Train 50-100 business users
  • Self-service guidelines
  • Support channel setup
  • Internal marketing

Week 12: Go-Live

  • Production cutover
  • Monitor capacity
  • Daily user check-ins
  • Bug fixes

Deliverables:

  • Production live
  • 100+ users onboarded
  • Monitoring operational
  • Support documented

12. Conclusion

Microsoft Fabric represents a platform shift. Organizations adopting early gain:

  1. Competitive Advantage – Real-time insights enable faster decisions
  2. Cost Efficiency – 50-70% reduction vs. multi-tool stacks
  3. AI Readiness – Unified foundation for enterprise AI/ML
  4. Talent Retention – Modern tools attract top engineers

The Risk of Waiting:

  • Tool sprawl continues increasing costs
  • Competitors move faster with Fabric
  • Best talent leaves for modern platforms

Final Recommendation:
Start with a 90-day pilot (one use case), prove the ROI, then scale. Success requires treating Fabric as a strategic platform investment, not a tactical replacement.

Appendix: Quick Reference

Fabric Terminology

| Term | Definition |
|---|---|
| Lakehouse | Delta Lake + Spark + SQL endpoint |
| OneLake | Unified data lake namespace |
| DirectLake | Power BI mode (no import/DirectQuery) |
| Eventstream | Real-time ingestion (Kafka-like) |
| KQL Database | Time-series analytics |
| Capacity Unit | Billing unit for compute/storage |
| Shortcut | Pointer to external data |

Common Limits

| Resource | Limit | Workaround |
|---|---|---|
| Lakehouse size | 10TB | Multiple lakehouses + shortcuts |
| Notebook execution | 24 hours max | Break into smaller jobs |
| DirectLake dataset | 400GB (F64+) | Optimize tables, aggregations |
| Concurrent Spark sessions | 20 (F64) | Queue jobs, scale capacity |

Useful SQL Queries

-- Find largest tables
SELECT table_name,
       SUM(file_size_bytes)/1024/1024/1024 AS size_gb
FROM INFORMATION_SCHEMA.FILES
GROUP BY table_name
ORDER BY size_gb DESC;

-- Tables needing optimization (high small-file counts)
SELECT table_name, COUNT(*) AS file_count
FROM INFORMATION_SCHEMA.FILES
GROUP BY table_name
HAVING COUNT(*) > 1000;

-- Check Delta versions
DESCRIBE HISTORY gold.fact_sales;
