DATABRICKS ON AZURE
Lakehouse Architecture for Petabyte-Scale Analytics
Unity Catalog, Delta Live Tables, ML at Scale & Azure Integration Patterns
| Data Platform Practice | Microsoft Ecosystem Veratas | 2025 Edition |
Classification: Public | Version 1.0 | 2025
Executive Summary
The modern data stack has converged on a single architectural truth: the lakehouse. By unifying the low-cost, flexible storage of a data lake with the reliability, performance, and governance of a data warehouse, the lakehouse eliminates the duplication, inconsistency, and operational overhead that plagued the two-tier architectures of the previous decade. Databricks — the company that originated Apache Spark, Delta Lake, and MLflow — delivers the most mature and battle-tested lakehouse platform available, and on Azure it integrates deeply with the Microsoft ecosystem that most enterprises already operate.
This white paper addresses the gap between Databricks introductory documentation and the architectural decisions that determine whether a Databricks deployment scales from a proof-of-concept to a petabyte-scale production platform. It is written for data architects, senior data engineers, ML engineers, and technical leaders making platform decisions for organizations processing millions to trillions of records daily.
Key Findings
| Dimension | Traditional Architecture | Databricks Lakehouse (Measured Outcomes) |
| Storage cost per TB/month | $180-$450 (data warehouse) | $23-$45 (ADLS Gen2 + Delta) |
| Batch-to-serving latency | 4-24 hours (nightly ETL) | < 5 minutes (Delta Live Tables streaming) |
| ML model deployment time | 2-8 weeks (manual MLOps) | < 4 hours (MLflow + Model Serving) |
| Data pipeline reliability | 60-75% on-time SLA (complex DAGs) | 95%+ with DLT expectations & auto-retry |
| Governance coverage | Per-tool ACLs; no unified lineage | Unity Catalog: column-level lineage, ABAC, row filters |
| Compute cost (analytics) | Fixed warehouse capacity | 40-70% reduction via auto-scaling + spot instances |
| Time to insight (ad hoc) | Hours (provisioning + query queue) | Minutes (serverless SQL warehouse, instant start) |
| Strategic Insight: Databricks and Microsoft Fabric are not mutually exclusive. The optimal enterprise architecture in 2025 uses both: Databricks for heavy data engineering, ML training, and streaming at scale; Microsoft Fabric for business-facing analytics, Power BI integration, and self-service BI. Unity Catalog bridges both platforms through its Fabric connector, enabling shared governance across the full data estate. |
CHAPTER 1
Why Databricks on Azure — Strategic Positioning
1.1 The Lakehouse Market in 2025
The lakehouse market has reached a tipping point. Snowflake, Google BigQuery, and Microsoft Fabric all offer lakehouse-compatible architectures, but Databricks retains three durable advantages that make it the preferred platform for engineering-heavy, ML-intensive, or high-volume data workloads:
Apache Spark ownership: Databricks employs many of the core Spark committers. Every Databricks Runtime release ships Spark optimizations that are 1-2 years ahead of open-source availability. For workloads processing terabytes or petabytes per job, this translates to 2-5x faster execution than commodity Spark environments.
Delta Lake protocol maturity: Delta Lake — created by Databricks — is the most widely adopted open table format, with 30,000+ production deployments. Delta Universal Format (UniForm) enables interoperability with Apache Iceberg and Apache Hudi readers, future-proofing storage investments against format fragmentation.
Unified ML platform: MLflow (also created by Databricks) is the de facto standard for ML experiment tracking and model registry, with integrations in Azure ML, SageMaker, Google Vertex AI, and every major ML framework. No other lakehouse platform matches Databricks’ native integration between data engineering and ML workflows.
1.2 Databricks vs. Microsoft Fabric: Decision Framework
The most common architectural question in the Microsoft ecosystem: when to use Databricks, when to use Fabric, and when to use both. The answer is workload-driven:
| Workload Characteristic | Recommended Platform | Rationale |
| Large-scale Spark ETL (>10TB/day) | Databricks | Databricks Runtime 2-5x faster than open-source Spark; better autoscaling for burst workloads |
| Streaming ingestion at high volume (>100K events/sec) | Databricks (Structured Streaming) | Native Kafka connector, exactly-once semantics, DLT auto-scaling for variable throughput |
| ML model training (>1M parameter models) | Databricks | Distributed training (Horovod, Ray), GPU cluster management, native MLflow |
| Business analytics & self-service BI | Microsoft Fabric / Power BI | DirectLake mode, Fabric semantic model, M365 integration, non-technical user experience |
| SQL analytics on small-medium datasets (<1TB) | Microsoft Fabric (Warehouse) | Easier SQL authoring, better Power BI integration, lower cost for SQL-only workloads |
| Enterprise data catalog & governance | Microsoft Purview + Unity Catalog | Purview for estate-wide governance; Unity Catalog for Databricks-specific fine-grained control |
| Mixed workload (ETL + analytics + ML) | Databricks + Fabric (integrated) | Databricks processes & governs; Fabric serves; Unity Catalog bridges via Fabric connector |
1.3 Azure-Native Integration Advantages
Deploying Databricks on Azure (Azure Databricks) provides integration benefits that are unavailable on other cloud providers:
- Microsoft Entra ID (formerly Azure Active Directory) as the identity provider — no separate Databricks user management; SSO and conditional access policies apply automatically
- Azure Private Link for Databricks control plane and data plane — full network isolation without public internet exposure
- Unity Catalog integration with Microsoft Purview — bidirectional metadata sync for enterprise-wide governance
- Native connectors to Azure Event Hubs (Kafka protocol), Azure Blob Storage, ADLS Gen2, Azure SQL, Cosmos DB, Azure Synapse — no third-party connector licensing
- Azure Monitor + Log Analytics for unified observability — Databricks cluster logs, job metrics, and SQL warehouse query history in the same observability stack as all Azure resources
- Microsoft Fabric shortcut to Databricks-managed Delta tables — Fabric reports can query Databricks Lakehouse tables directly without data movement
| Cost Reality Check: Azure Databricks pricing combines DBU (Databricks Unit) consumption with underlying Azure VM costs. A common planning mistake is budgeting only DBU costs. Total cost = DBU cost + VM cost + storage cost + networking cost. In practice, VM cost represents 40-60% of total Databricks spend. This white paper’s cost optimization chapter addresses all four components. |
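The four-component budgeting rule in the callout above can be sketched as a small model. This is illustrative only: every rate below is a placeholder assumption, not a published Azure or Databricks price.

```python
def monthly_databricks_cost(dbu_per_hour: float, dbu_rate: float,
                            vm_per_hour: float, hours: float,
                            storage_tb: float, storage_rate_tb: float,
                            egress_gb: float, egress_rate_gb: float) -> dict:
    """Break a monthly Databricks budget into its four cost components:
    total = DBU cost + VM cost + storage cost + networking cost."""
    costs = {
        "dbu": dbu_per_hour * dbu_rate * hours,
        "vm": vm_per_hour * hours,
        "storage": storage_tb * storage_rate_tb,
        "network": egress_gb * egress_rate_gb,
    }
    costs["total"] = sum(costs.values())
    costs["vm_share"] = costs["vm"] / costs["total"]
    return costs

# Hypothetical example: an 8 DBU/hr job cluster running 400 hours/month
c = monthly_databricks_cost(dbu_per_hour=8, dbu_rate=0.30,
                            vm_per_hour=2.40, hours=400,
                            storage_tb=50, storage_rate_tb=23,
                            egress_gb=1000, egress_rate_gb=0.05)
```

With these placeholder rates, VM cost is roughly 31% of total spend, consistent with the 40-60% range observed in practice once larger VM SKUs are involved.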
CHAPTER 2
Lakehouse Architecture: Foundations & Design Patterns
2.1 The Medallion Architecture
The Medallion Architecture — Bronze, Silver, Gold — is the de facto standard for organizing data in a Databricks Lakehouse. Its value is not the three-tier label but the quality contract each tier enforces, the decoupling it provides between ingestion and consumption, and the lineage clarity it creates for governance teams.
| Layer | Data State | Quality Contract | Consumers | Storage Format |
| Bronze (Raw) | Exact copy of source data; no transformations; append-only | Schema-on-read; no business rules applied; complete audit trail of source data | Data engineers, audit/compliance | Delta (with source schema preserved); partitioned by ingestion date |
| Silver (Curated) | Cleaned, deduplicated, conformed; business rules applied; joins across source systems | Schema enforced; data quality rules pass; SLA-driven freshness; no PII in open columns | Data engineers, data scientists, advanced analysts | Delta; partitioned by business date; Z-ordered on high-cardinality query columns |
| Gold (Certified) | Business-ready aggregations, dimensional models, feature stores; domain-specific semantics | Certified by domain owner; SLA guaranteed; documentation complete; sensitivity labelled | Analysts, Power BI / Fabric, ML feature pipelines, APIs | Delta; highly optimized; partitioned and Z-ordered for query patterns |
Extended Medallion: Platinum Layer
High-maturity organizations add a Platinum layer: ML feature store output, aggregated serving tables with sub-second query SLAs, or API-serving materialized views. Platinum tables are the most heavily governed, most tightly SLA-bound, and most aggressively optimized assets in the lakehouse.
2.2 Storage Organization Patterns
The physical organization of the lakehouse in ADLS Gen2 has major implications for performance, security, cost, and governance. Two primary patterns:
Pattern 1 — Zone-Isolated Storage Accounts (Recommended for Enterprise)
Separate ADLS Gen2 storage accounts per medallion layer: adls-bronze-prod, adls-silver-prod, adls-gold-prod. Advantages: independent encryption keys per zone, network security group rules per zone, independent soft-delete and versioning policies, clear billing separation. Disadvantage: slightly more complex pipeline configuration.
Pattern 2 — Container-Isolated Single Account
Single ADLS Gen2 account with containers per layer: /bronze, /silver, /gold. Advantages: simpler configuration, single place for RBAC. Disadvantage: shared encryption key, harder to enforce zone-level network policies, difficult to separate storage costs by layer.
| Recommendation: Use zone-isolated storage accounts for any deployment handling regulated data (PCI, PHI, PII at scale). The per-key encryption enables different key rotation policies per zone and makes regulatory scoping unambiguous. For development environments, a single account with containers is sufficient. |
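The two layouts differ only in how the abfss:// path is assembled. A minimal sketch, assuming hypothetical account-naming conventions (adls<layer><env> and adlslake<env> are invented for illustration):

```python
def table_path(layer: str, table: str, zone_isolated: bool = True,
               env: str = "prod") -> str:
    """Build an ADLS Gen2 path for a lakehouse table under either layout."""
    if zone_isolated:
        # Pattern 1: one storage account per medallion layer
        account = f"adls{layer}{env}"   # hypothetical naming convention
        container = "data"
        return f"abfss://{container}@{account}.dfs.core.windows.net/{table}/"
    # Pattern 2: single account, one container per layer
    account = f"adlslake{env}"
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{table}/"
```

Centralizing this logic in one helper (or in pipeline configuration) keeps a later migration between patterns a one-line change rather than a find-and-replace across notebooks.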
2.3 Workspace Architecture for Enterprise
A Databricks workspace is the primary deployment unit — the boundary for compute resources, notebooks, jobs, and most RBAC. Enterprise deployments require deliberate workspace strategy:
| Workspace Pattern | Use Case | Advantages | Disadvantages |
| Single workspace | Small teams (<20 engineers), single project, low sensitivity | Simple administration, low overhead | No environment isolation; production/dev share resources and governance |
| Environment-separated (dev/test/prod) | Most enterprise deployments | Environment isolation; CI/CD promotion; prod cluster policies enforced separately | Data sharing between envs requires Delta Sharing or data copies |
| Domain-separated (Data Mesh) | Large orgs with multiple data domains | Domain autonomy; per-domain billing; independent upgrade cycles | Unity Catalog metastore still shared; requires cross-workspace governance coordination |
| Hub-and-spoke (Recommended) | Enterprise with shared platform team + domain teams | Shared infra (hub) + domain autonomy (spokes); central governance + local engineering | Most complex to set up; requires strong platform team |
2.4 Cluster Architecture
Cluster selection is the single biggest driver of both performance and cost. Databricks offers three compute types with fundamentally different operational models:
All-Purpose Clusters: Interactive, long-lived clusters for notebook development and ad hoc analysis. Charged continuously while running. Never use all-purpose clusters for production jobs — auto-termination misconfiguration is the single most common source of unexpected Databricks cost overruns. Enforce auto-termination of 30 minutes via cluster policy.
Job Clusters: Ephemeral clusters created at job start and terminated at job end. The correct compute type for all production jobs. Cost = compute time only. Cluster start time (2-5 minutes) is the only overhead. Use cluster pools to pre-warm instances and reduce start latency to 30-60 seconds.
SQL Warehouses (Serverless preferred): Purpose-built for SQL analytics workloads. Classic SQL Warehouses run on customer-managed VMs. Serverless SQL Warehouses run on Databricks-managed compute and start in < 5 seconds. Use serverless for all interactive BI and analyst SQL workloads. Classic warehouses only when data residency requirements prohibit serverless.
| Cluster Type | Start Time | Cost Model | Best For | Avoid For |
| All-Purpose | 3-7 min | Per-second while running | Interactive development, notebooks | Production jobs (cost risk) |
| Job Cluster | 2-5 min (cold) | Per-second for job duration only | All production batch/streaming jobs | Interactive exploration |
| Job Cluster + Pool | 30-60 sec (warm) | Pool idle cost + per-second for job | Latency-sensitive production jobs | Infrequent jobs (pool idle cost not worth it) |
| Serverless SQL Warehouse | < 5 sec | Per-second, serverless pricing | BI analytics, analyst SQL, Power BI DirectQuery | Large Spark batch processing |
| Classic SQL Warehouse | 2-4 min | Per-second (customer VM) | Data residency-constrained SQL analytics | General use (serverless preferred) |
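The cost risk of all-purpose clusters in the table above comes down to billable hours. A simplified sketch, assuming a 30-day month and ignoring per-cluster-type DBU rate differences:

```python
def monthly_compute_hours(job_hours_per_day: float, always_on: bool,
                          auto_term_minutes: float = 0.0,
                          runs_per_day: int = 1) -> float:
    """Billable hours per month: an always-on all-purpose cluster bills
    around the clock; a job cluster bills only job duration plus any idle
    time before auto-termination."""
    if always_on:
        return 24.0 * 30
    idle_hours = (auto_term_minutes / 60.0) * runs_per_day
    return (job_hours_per_day + idle_hours) * 30

# A 1-hour daily job left on an always-running all-purpose cluster
always_on_hours = monthly_compute_hours(1.0, always_on=True)
# The same job on an ephemeral job cluster
job_cluster_hours = monthly_compute_hours(1.0, always_on=False)
```

Even with a 30-minute auto-termination window per run, the job cluster bills 45 hours/month against 720 for an always-on cluster, which is why cluster policies enforcing job clusters for production pay for themselves immediately.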
CHAPTER 3
Delta Lake: The Storage Layer Deep Dive
3.1 Delta Lake Architecture
Delta Lake is not merely a file format — it is a transactional storage layer that adds ACID guarantees, scalable metadata management, and time travel to Parquet files stored in cloud object storage. Understanding its internals is essential for designing high-performance, cost-efficient lakehouse pipelines.
Every Delta table consists of: a data directory (Parquet files organized by partition), a transaction log directory (_delta_log/) containing JSON commit files, and periodically generated Parquet checkpoint files that compact the JSON transaction log for fast metadata reads.
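The transaction log mechanism can be made concrete with a toy replay: a Delta reader reconstructs the current file set by applying each commit's add/remove actions in version order. This is a simplified sketch of the protocol (real commits carry many more fields), using invented file names.

```python
import json

def active_files(commits: list[str]) -> set[str]:
    """Replay _delta_log commits (one JSON-lines string per version)
    and return the set of Parquet files in the current snapshot."""
    files: set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Version 0: initial write of two files
v0 = '{"add": {"path": "part-000.parquet"}}\n{"add": {"path": "part-001.parquet"}}'
# Version 1: a rewrite replaces part-000 with part-002
v1 = '{"remove": {"path": "part-000.parquet"}}\n{"add": {"path": "part-002.parquet"}}'
```

Time travel falls out of this design for free: reading "VERSION AS OF 0" simply stops the replay after v0, which is why old data files must be retained for the retention window.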
| Delta Capability | Technical Mechanism | Operational Significance |
| ACID transactions | Optimistic concurrency control via transaction log; writers serialize via log entry; readers see consistent snapshots | Multiple concurrent writers without data corruption; safe concurrent reads during writes |
| Time travel | Transaction log preserves all historical versions; TIMESTAMP AS OF or VERSION AS OF syntax | Audit compliance; rollback from bad pipeline runs; reproducible ML training datasets |
| Schema enforcement | Writer validates incoming data against table schema; rejects non-conforming writes by default | Prevents silent data corruption from schema drift in upstream sources |
| Schema evolution | MERGE INTO, ALTER TABLE ADD COLUMN, or mergeSchema option | Safely add new columns without breaking existing readers |
| Scalable metadata | Parquet checkpoints compact JSON log files; column stats stored in log for data skipping | Tables with billions of files read metadata in seconds, not minutes |
| Data skipping | Min/max column statistics per file stored in transaction log; queries skip files outside predicate range | 10-100x query speedup on filtered queries without full table scan |
| Z-ordering | Co-locates related data in same files based on specified columns; improves data skipping effectiveness | Especially effective for high-cardinality columns (customer_id, product_id) used in WHERE clauses |
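Data skipping, the mechanism behind the 10-100x speedups in the table above, is conceptually simple: the reader compares each file's min/max statistics against the query predicate and scans only files whose range could contain a match. A minimal sketch with invented statistics:

```python
def files_to_scan(file_stats: dict[str, tuple], value) -> list[str]:
    """file_stats maps file name -> (min, max) for the filtered column.
    Return only the files whose [min, max] range can contain `value`."""
    return [f for f, (lo, hi) in file_stats.items() if lo <= value <= hi]

# Hypothetical per-file stats for a customer_id column
stats = {
    "part-000.parquet": (1, 100),
    "part-001.parquet": (101, 200),
    "part-002.parquet": (201, 300),
}
```

Z-ordering improves this pruning by narrowing each file's min/max ranges: when related values are co-located, fewer files have ranges that straddle any given predicate value.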
3.2 Delta Table Optimization: Production Patterns
OPTIMIZE and Z-ORDER
-- Run after bulk loads or when file count exceeds ~1000 per partition
-- Schedule as a post-load job, not inline in the pipeline

-- Basic OPTIMIZE (compacts small files toward a 1 GB target file size)
OPTIMIZE silver.sales_transactions;

-- OPTIMIZE with Z-ORDER (co-locates data by query columns)
-- Z-ORDER columns: choose 1-4 columns used in WHERE/JOIN predicates
-- Do NOT Z-ORDER on partition columns (already co-located by partition)
-- Do NOT Z-ORDER on columns with low cardinality (<100 distinct values)
OPTIMIZE silver.sales_transactions
ZORDER BY (customer_id, product_id);

-- Predictive Optimization: let Databricks decide when to OPTIMIZE
-- Enable at the catalog or schema level (Unity Catalog managed tables required):
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
-- Predictive Optimization monitors file stats and triggers OPTIMIZE automatically,
-- eliminating the need for scheduled OPTIMIZE jobs in most cases
Partitioning Strategy
-- Partitioning rules:
-- Partition only on LOW cardinality columns (date, region, status)
-- Target partition size: 1GB - 10GB of data per partition
-- NEVER partition on high-cardinality columns (customer_id, order_id)
-- High-cardinality partitioning = millions of tiny files = metadata explosion

-- Good: partition by date (daily batches ~1-10GB per day)
CREATE TABLE silver.sales_transactions (
  transaction_id STRING,
  customer_id STRING,
  product_id STRING,
  amount DECIMAL(18,2),
  transaction_ts TIMESTAMP,
  transaction_dt DATE
) USING DELTA
PARTITIONED BY (transaction_dt)
LOCATION "abfss://silver@storage.dfs.core.windows.net/sales/transactions/";

-- After partitioning by date, Z-ORDER by customer_id and product_id
-- to optimize point lookups within each date partition
OPTIMIZE silver.sales_transactions
WHERE transaction_dt >= current_date() - INTERVAL 7 DAYS
ZORDER BY (customer_id, product_id);
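The partitioning rules above are mechanical enough to encode as a pre-deployment check. A sketch, with the 100K-partition ceiling chosen as an illustrative threshold rather than a documented Databricks limit:

```python
def partition_check(distinct_values: int, total_gb: float,
                    max_partitions: int = 100_000) -> tuple[bool, str]:
    """Validate a candidate partition column against the rules:
    low cardinality, and roughly 1-10 GB of data per partition."""
    if distinct_values > max_partitions:
        return False, "cardinality too high: metadata explosion risk"
    gb_per_partition = total_gb / distinct_values
    if gb_per_partition < 1:
        return False, "partitions too small: small-file problem"
    if gb_per_partition > 10:
        return False, "partitions too large: use a finer grain or skip partitioning"
    return True, "ok"

# A year of daily partitions over ~1 TB: ~2.7 GB per partition, acceptable
daily_ok = partition_check(distinct_values=365, total_gb=1000)
# customer_id with 10M distinct values: rejected outright
customer_bad = partition_check(distinct_values=10_000_000, total_gb=1000)
```

Running a check like this in CI against table DDL catches the customer_id-partitioning mistake before it produces millions of tiny files in production.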
3.3 Change Data Capture with Delta
Delta’s Change Data Feed (CDF) is the recommended pattern for incremental processing pipelines that need to propagate changes downstream without full table scans. CDF captures insert, update, and delete operations at the row level with change type metadata.
-- Enable Change Data Feed on a Delta table
ALTER TABLE silver.customers
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

// Read changes since a specific version (Scala)
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, max}

val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", lastProcessedVersion) // stored in a metadata table
  .table("silver.customers")

// _change_type values: insert, update_preimage, update_postimage, delete
// Process only inserts and updates (ignore preimage rows)
val upserts = changes.filter("_change_type IN ('insert', 'update_postimage')")

// Apply to the downstream gold table using MERGE
val deltaTable = DeltaTable.forName(spark, "gold.customers")
deltaTable.alias("target")
  .merge(upserts.alias("source"), "target.customer_id = source.customer_id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute()

// Store the last processed version for idempotency
val newVersion = changes.select(max(col("_commit_version"))).collect()(0)(0)
spark.sql(s"UPDATE pipeline_metadata.watermarks SET last_version = $newVersion WHERE table_name = 'silver.customers'")
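The watermark logic at the end of the pipeline is what makes it idempotent, and it is worth isolating so it can be unit-tested outside Spark. A language-agnostic sketch of the pattern (the table and field names mirror the example above but the class itself is invented):

```python
class Watermarks:
    """In-memory stand-in for the pipeline_metadata.watermarks table."""

    def __init__(self):
        self._last: dict[str, int] = {}

    def last_version(self, table: str) -> int:
        return self._last.get(table, -1)

    def process_changes(self, table: str, changes: list[dict]) -> list[dict]:
        """Return only rows committed after the stored watermark,
        then advance the watermark. Replaying the same feed is a no-op."""
        start = self.last_version(table) + 1
        new = [c for c in changes if c["_commit_version"] >= start]
        if new:
            self._last[table] = max(c["_commit_version"] for c in new)
        return new

wm = Watermarks()
feed = [{"_commit_version": v} for v in range(3)]
first_run = wm.process_changes("silver.customers", feed)
second_run = wm.process_changes("silver.customers", feed)  # replay: nothing new
```

In production the watermark write should share a transaction with the MERGE (or be written after it) so a failure between the two steps re-processes rather than skips changes; MERGE makes the re-processing safe.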
3.4 Time Travel & Data Retention
| Operation | Syntax | Use Case | Retention Constraint |
| Query historical version | SELECT * FROM table VERSION AS OF 42 | Debug pipeline issues; compare before/after transformation | Only works within retention window |
| Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2025-01-15' | Reproduce ML training dataset as of a specific date | Timestamp must be within retention window |
| Restore table | RESTORE TABLE silver.customers TO VERSION AS OF 38 | Rollback bad pipeline run or accidental delete | Restores table state; does not delete newer log entries |
| View table history | DESCRIBE HISTORY silver.customers | Audit all operations; find version before bad write | Shows full transaction log within retention window |
Default log retention (delta.logRetentionDuration) is 30 days; deleted data files are retained for 7 days (delta.deletedFileRetentionDuration) before VACUUM may remove them. For regulatory environments requiring longer audit trails, set: ALTER TABLE silver.customers SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 365 days', 'delta.deletedFileRetentionDuration' = 'interval 365 days');
CHAPTER 4
Unity Catalog: Enterprise Data Governance
4.1 Unity Catalog Architecture
Unity Catalog (UC) is Databricks’ unified governance layer — the control plane for data access, audit logging, lineage tracking, and data sharing across all workspaces in an organization. It is the most significant governance advancement in Databricks’ history, replacing the legacy per-workspace Hive metastore with a centrally managed, ANSI-SQL-compatible catalog with fine-grained access control.
The UC object hierarchy determines how permissions propagate and how assets are organized:
| Object Level | Example | Grants Propagate To | Primary Governance Action |
| Metastore | Contoso-AUED-Metastore (one per Azure region) | All catalogs, schemas, tables below | Assign metastore admins; attach to workspaces; configure audit log destination |
| Catalog | main, dev_catalog, finance_catalog | All schemas and tables within catalog | Assign catalog owners; set default privileges; configure storage credentials |
| Schema (Database) | main.silver, finance_catalog.reporting | All tables, views, functions within schema | Assign schema owners; set managed location; grant domain team access |
| Table / View | main.silver.sales_transactions | Columns (if column mask applied) | Assign owners; grant SELECT/MODIFY; apply row filters; apply column masks |
| Column | sales_transactions.customer_email | N/A (leaf node) | Apply column masks (e.g., mask PII for non-privileged users) |
| Function | main.silver.calculate_ltv() | N/A | Control which users can call UDFs and table-valued functions |
| Storage Credential | adls-silver-credential | External locations using this credential | Grant to specific workspaces or service principals; never grant to end users |
4.2 Privilege Model & RBAC Design
Unity Catalog uses a privilege model inspired by ANSI SQL GRANT/REVOKE but extended for cloud-native data access patterns. Understanding privilege inheritance prevents common over-permissioning mistakes.
Privilege Inheritance Rules
- Privileges granted at a higher level (catalog) are inherited by all objects below (schemas, tables, columns)
- DENY is not supported — UC uses positive permissions only; to restrict access, do not grant at that level
- Table owner automatically has all privileges on that table; schema owner has all privileges on all tables in the schema
- USE CATALOG and USE SCHEMA are “traversal” privileges — required to access any object within, but do not grant read access to data
- SELECT on a table requires: USE CATALOG on the parent catalog AND USE SCHEMA on the parent schema AND SELECT on the table — all three must be explicitly granted
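The three-grant rule in the last bullet is a frequent source of "permission denied" tickets, so it is worth stating as code. A minimal sketch of the check (the grant-string encoding is invented for illustration; Unity Catalog stores grants in its metastore):

```python
def can_select(grants: set[str], catalog: str, schema: str, table: str) -> bool:
    """SELECT succeeds only when all three grants are present;
    none of them is implied by another."""
    required = {
        f"USE CATALOG:{catalog}",
        f"USE SCHEMA:{catalog}.{schema}",
        f"SELECT:{catalog}.{schema}.{table}",
    }
    return required <= grants

# An analyst granted access to main.gold.sales only
grants = {"USE CATALOG:main", "USE SCHEMA:main.gold", "SELECT:main.gold.sales"}
```

Note that holding SELECT on a silver table does nothing without USE SCHEMA on main.silver, which is exactly how the tiered RBAC pattern below keeps analysts out of silver and bronze.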
Recommended RBAC Pattern
-- Unity Catalog RBAC: Principle of Least Privilege
-- Use Entra ID (AAD) groups, not individual users

-- Tier 1: Read-only analysts (most common role)
GRANT USE CATALOG ON CATALOG main TO `data-analysts@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-analysts@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.gold TO `data-analysts@contoso.com`;
-- Note: analysts CAN see gold data but cannot see silver or bronze

-- Tier 2: Data engineers (read silver/bronze, write gold)
GRANT USE CATALOG ON CATALOG main TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT SELECT, MODIFY ON ALL TABLES IN SCHEMA main.gold TO `data-engineers@contoso.com`;

-- Tier 3: Service principals for production pipelines
-- Use service principals (not groups) for job clusters
GRANT USE CATALOG, USE SCHEMA ON CATALOG main TO `sp-pipeline-prod`;
GRANT SELECT ON SCHEMA main.silver TO `sp-pipeline-prod`;
GRANT SELECT, MODIFY ON SCHEMA main.gold TO `sp-pipeline-prod`;
-- Principle: pipelines have exactly the permissions needed for their specific tables
4.3 Column Masks & Row Filters
Column masks and row filters are the most powerful and underutilized Unity Catalog features. They implement attribute-based access control (ABAC) at the data level — the same data asset serves different data consumers with different visibility based on their group membership or attributes, without duplicating data or maintaining separate views.
-- Column Mask: Mask email addresses for non-privileged users
-- Users in "data-scientists-privileged" group see the real email
-- All other users see the masked version
CREATE FUNCTION main.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('data-scientists-privileged') THEN email
  ELSE CONCAT(LEFT(email, 2), '***@***.com')
END;

ALTER TABLE main.silver.customers
ALTER COLUMN customer_email
SET MASK main.security.mask_email;

-- Row Filter: Data scientists only see rows for their assigned region
-- Useful for multi-tenant datasets, regional data residency, or cost centre allocation
CREATE FUNCTION main.security.filter_by_region(region STRING)
RETURNS BOOLEAN
RETURN is_member('global-data-access')        -- global team sees all
    OR is_member(CONCAT('region-', region));  -- regional team sees their region

ALTER TABLE main.silver.sales_transactions
SET ROW FILTER main.security.filter_by_region ON (region);

-- Verify mask/filter is working (run as non-privileged user):
SELECT customer_id, customer_email FROM main.silver.customers LIMIT 5;
-- Should return: "jo***@***.com" style masked emails
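The masking logic itself is simple enough to unit-test outside the workspace before wiring it into a UDF. A Python equivalent of mask_email, with group membership passed explicitly since is_member() only exists inside Databricks:

```python
def mask_email(email: str, privileged: bool) -> str:
    """Mirror of the SQL mask: privileged users see the real address,
    everyone else sees the first two characters plus a fixed suffix."""
    if privileged:
        return email
    return email[:2] + "***@***.com"
```

Keeping a plain-language twin of each masking UDF makes it cheap to assert, in CI, that a format change to the mask does not accidentally leak more of the address.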
4.4 Automated Lineage & Data Discovery
Unity Catalog automatically captures lineage for all operations executed through Databricks — SQL queries, Spark DataFrame transformations, Python, R, and Scala notebooks — without any instrumentation code. Lineage is captured at the column level and stored in the UC metastore, queryable via the Databricks REST API or viewable in the Catalog Explorer UI.
Lineage from Unity Catalog is also exported to Microsoft Purview via the native connector, creating a unified lineage graph that spans Databricks (engineering/ML) and the broader Microsoft data estate (Fabric, ADF, Power BI). This is the recommended governance architecture for organizations using both platforms.
| Lineage Source | Granularity | Capture Method | Purview Export |
| Spark SQL (notebooks, jobs) | Column-level | Automatic (UC query analysis) | Yes, via Purview-UC connector |
| Python DataFrame API | Table-level | Automatic (I/O operation capture) | Yes |
| Delta Live Tables pipelines | Column-level (full DAG) | Automatic (DLT execution graph) | Yes |
| ML training runs (MLflow) | Dataset-to-model | Automatic (MLflow UC integration) | Yes (model entity in Purview) |
| dbt on Databricks | Column-level (dbt metadata) | Via dbt-databricks adapter + UC integration | Yes (via dbt-purview connector) |
CHAPTER 5
Delta Live Tables: Declarative Pipeline Engineering
5.1 DLT Architecture & Value Proposition
Delta Live Tables (DLT) is Databricks’ declarative pipeline framework that manages the full lifecycle of data pipelines: execution ordering, dependency resolution, error handling, retry logic, data quality enforcement, autoscaling, and monitoring. Instead of writing orchestration logic, engineers declare what data should look like — DLT determines how to produce it.
The productivity gain is substantial. A pipeline that required 800 lines of Spark code with custom error handling, dependency management, and retry logic can be expressed in 150 lines of DLT SQL or Python. More importantly, the operational properties (idempotency, exactly-once semantics, automatic backfill) are guaranteed by the framework, not the engineer.
5.2 DLT Pipeline Design Patterns
Pattern 1 — Full Medallion DLT Pipeline (SQL)
-- Bronze: Ingest raw events from Event Hubs (streaming)
CREATE OR REFRESH STREAMING LIVE TABLE bronze_events
COMMENT "Raw events from IoT sensors via Event Hubs. Schema preserved exactly."
AS SELECT * FROM cloud_files(
  "abfss://landing@storage.dfs.core.windows.net/events/",
  "json",
  map("cloudFiles.inferColumnTypes", "true",
      "cloudFiles.schemaLocation", "/checkpoints/bronze_events/schema")
);
-- Silver: Clean, validate, and enrich (streaming)
-- Note: DLT expectations support DROP ROW and FAIL UPDATE; quarantining is a
-- pattern built with a second table whose expectation inverts the rule
CREATE OR REFRESH STREAMING LIVE TABLE silver_events (
  CONSTRAINT valid_device_id EXPECT (device_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_temperature EXPECT (temperature BETWEEN -50 AND 150) ON VIOLATION DROP ROW,
  CONSTRAINT valid_timestamp EXPECT (event_ts > '2020-01-01') ON VIOLATION FAIL UPDATE
)
COMMENT "Validated and enriched sensor events. Dropped rows surface in DLT quality metrics."
AS SELECT
  e.device_id,
  e.temperature,
  e.humidity,
  e.event_ts,
  d.device_name,
  d.location,
  d.manufacturer
FROM STREAM(LIVE.bronze_events) e
LEFT JOIN LIVE.dim_devices d ON e.device_id = d.device_id;
-- Gold: Business aggregation (batch, refreshed every 15 minutes)
CREATE OR REFRESH LIVE TABLE gold_device_hourly_stats
COMMENT "Hourly aggregated stats per device. Power BI source."
AS SELECT
  device_id,
  location,
  date_trunc('HOUR', event_ts) AS hour,
  avg(temperature) AS avg_temp,
  max(temperature) AS max_temp,
  min(temperature) AS min_temp,
  avg(humidity) AS avg_humidity,
  count(*) AS event_count
FROM LIVE.silver_events
GROUP BY device_id, location, date_trunc('HOUR', event_ts);
5.3 Data Quality with Expectations
DLT Expectations are the single most important feature for production pipeline reliability. They formalize data quality rules as executable constraints, replacing ad hoc validation notebooks and undocumented “it looks wrong” discoveries.
| Expectation Violation Mode | Behaviour | Use When | Monitoring |
| ON VIOLATION DROP ROW | Invalid rows are silently dropped; valid rows continue processing | Noise in source data is expected and acceptable; business logic tolerates missing records | Metrics track dropped row count; alert if drop rate exceeds threshold |
| Quarantine (pattern, not a built-in mode) | Invalid rows routed to a separate quarantine table defined with the inverted expectation; valid rows continue in the main flow | Invalid rows need investigation without blocking the pipeline; compliance requires an audit of rejected data | Query the quarantine table; alert on non-zero counts for critical constraints |
| ON VIOLATION FAIL UPDATE | Pipeline fails if constraint violated; no partial writes | Zero tolerance for invalid data; downstream consumers cannot handle bad data | Pipeline failure triggers alert; requires manual investigation before rerun |
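The three handling strategies can be simulated in a few lines, which is useful for reasoning about what downstream consumers will and will not see. A sketch with invented sensor rows (in DLT itself, drop and fail are declared modes and quarantine is the inverted-expectation pattern):

```python
def apply_expectation(rows, predicate, mode):
    """Split rows by a quality predicate and apply one of three strategies:
    DROP discards invalid rows, QUARANTINE routes them aside, FAIL aborts."""
    valid = [r for r in rows if predicate(r)]
    invalid = [r for r in rows if not predicate(r)]
    if mode == "FAIL" and invalid:
        raise ValueError(f"{len(invalid)} rows violated the constraint")
    if mode == "DROP":
        return valid, []          # invalid rows discarded; count them in metrics
    return valid, invalid         # QUARANTINE: invalid rows go to a side table

rows = [{"temp": 20}, {"temp": 500}]
in_range = lambda r: -50 <= r["temp"] <= 150

kept, quarantined = apply_expectation(rows, in_range, "QUARANTINE")
dropped_kept, dropped_aside = apply_expectation(rows, in_range, "DROP")
```

The key operational difference is observability: DROP requires metric-based alerting to notice data loss, QUARANTINE preserves the rejected rows for audit, and FAIL trades availability for a hard correctness guarantee.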
5.4 DLT Enhanced Autoscaling
DLT Enhanced Autoscaling is the recommended compute mode for streaming pipelines with variable throughput. Unlike standard Databricks autoscaling (which scales based on task queue depth), DLT Enhanced Autoscaling uses pipeline-level metrics — lag behind source, processing rate, memory pressure — to make scaling decisions optimized for streaming throughput.
// DLT pipeline configuration (JSON, with illustrative comments) for Enhanced Autoscaling
{
  "name": "iot-sensor-pipeline",
  "clusters": [{
    "label": "default",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 20,
      "mode": "ENHANCED"  // Key: use ENHANCED, not LEGACY
    },
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.databricks.delta.optimizeWrite.enabled": "true",
      "spark.databricks.delta.autoCompact.enabled": "true"
    }
  }],
  "continuous": true,     // For streaming: always-on pipeline
  "development": false,   // Production mode: no interactive restarts
  "channel": "CURRENT",   // Production: always use CURRENT, not PREVIEW
  "edition": "ADVANCED"   // Required for Enhanced Autoscaling + expectations
}
CHAPTER 6
Machine Learning at Scale: MLflow & Model Serving
6.1 The Databricks ML Platform
Databricks’ ML platform integrates three components into a unified workflow: the Databricks Runtime for ML (pre-installed with TensorFlow, PyTorch, scikit-learn, XGBoost, and 50+ ML libraries), MLflow for experiment tracking and model lifecycle management, and Model Serving for production model deployment. Together, they reduce the time from experiment to production from weeks to hours.
6.2 MLflow: Experiment Tracking & Model Registry
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Set experiment (creates if not exists; tracked in Unity Catalog)
mlflow.set_experiment("/Shared/churn-prediction/experiments/gbm-v2")

with mlflow.start_run(run_name="gbm-depth6-lr0.05") as run:
    # Log hyperparameters
    params = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05}
    mlflow.log_params(params)

    # Train model
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
    mlflow.log_metric("train_auc", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))

    # Log model with input example and signature (required for Model Serving)
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model, "model",
        signature=signature,
        input_example=X_train[:5],
        registered_model_name="main.ml_models.churn_prediction"  # UC-registered
    )

    print(f"Run ID: {run.info.run_id} | AUC: {auc:.4f}")
6.3 Model Registry & Deployment Lifecycle
| Stage | Registry Status | Who Transitions | Automated Gates |
| Development | None (experiment runs) | Data Scientist | N/A — free experimentation |
| Candidate | Registered (no alias) | Data Scientist | Auto: unit tests, schema validation, bias metrics check |
| Staging | @staging alias | ML Engineer | Auto: integration test, shadow mode evaluation against production traffic, data drift check |
| Production | @production alias | ML Lead (approval gate) | Auto: A/B test gate (>2% lift required), latency SLA check (<100ms p99), manual sign-off |
| Archived | @archived alias | Automated (on new production) | Retain for 90 days for rollback capability; auto-delete after retention window |
6.4 Model Serving: Production Deployment
Databricks Model Serving provides serverless, auto-scaling REST endpoints for ML models registered in Unity Catalog. The endpoint handles traffic routing, A/B testing, canary deployments, and feature lookup — eliminating the need for a separate model serving infrastructure.
# Deploy model to Databricks Model Serving via REST API
# (also configurable via UI or Terraform)
import requests

token = dbutils.secrets.get(scope="ml-secrets", key="databricks-token")
workspace_url = "https://adb-<workspace-id>.azuredatabricks.net"

endpoint_config = {
    "name": "churn-prediction-v2",
    "config": {
        "served_models": [
            {
                "name": "churn-gbm-production",
                "model_name": "main.ml_models.churn_prediction",
                "model_version": "3",
                "workload_size": "Small",       # Small/Medium/Large (CPU); GPU options available
                "scale_to_zero_enabled": True,  # Cost saving for non-critical endpoints
            }
        ],
        "traffic_config": {
            "routes": [{"served_model_name": "churn-gbm-production", "traffic_percentage": 100}]
        }
    }
}

response = requests.post(
    f"{workspace_url}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config
)
# Endpoint ready in ~5 minutes. SLA: <100ms p99 for Small workload size.
6.5 Feature Store
The Databricks Feature Store (now Unity Catalog Feature Store) solves the training-serving skew problem — the most common source of ML model degradation in production. Features computed during training are stored in Unity Catalog, and the same feature values are retrieved at inference time, guaranteeing consistency.
Key capability: Point-in-time lookups. The Feature Store retrieves the feature value that was valid at the timestamp of each training record — not the current value. This is critical for time-series problems (churn prediction, fraud detection, demand forecasting) where using future feature values in training creates data leakage.
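The point-in-time semantics can be illustrated without any Feature Store API. The sketch below is a minimal, pure-Python model of the lookup rule (the feature names, timestamps, and values are invented for illustration): for each training record, return the most recent feature snapshot at or before the record's timestamp, never a later one.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (effective_timestamp, value)
# pairs sorted ascending, e.g. snapshots of a "30-day spend" feature.
feature_history = [(100, 250.0), (200, 310.0), (300, 120.0)]
timestamps = [ts for ts, _ in feature_history]

def point_in_time_lookup(as_of_ts):
    """Return the feature value that was valid at as_of_ts (no future leakage)."""
    idx = bisect_right(timestamps, as_of_ts) - 1
    if idx < 0:
        return None  # No feature value existed yet at this timestamp
    return feature_history[idx][1]

# A training record labelled at ts=250 must see the ts=200 value (310.0),
# never the ts=300 value (120.0), which did not exist yet.
print(point_in_time_lookup(250))  # 310.0
print(point_in_time_lookup(50))   # None: before the first snapshot
```

Using the current value instead of this as-of lookup is exactly the leakage failure mode described above: the model trains on information that was unavailable at prediction time.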
CHAPTER 7
Azure Integration Architecture
7.1 Network Architecture
Production Databricks deployments on Azure require deliberate network design. The default “no VNet injection” configuration routes all cluster traffic through public endpoints — acceptable for development, unacceptable for regulated data workloads.
VNet Injection (Required for Production)
VNet injection deploys Databricks cluster VMs into a customer-managed Azure Virtual Network. Both the control plane (Databricks-managed) and data plane (cluster VMs) communicate through private endpoints. This architecture enables:
- Databricks clusters that cannot be reached from the public internet
- Outbound traffic routed through customer-managed Azure Firewall or NVA with full packet inspection
- Private endpoint connectivity from clusters to ADLS Gen2, Azure Key Vault, Azure SQL, Event Hubs, and all other Azure PaaS services
- NSG rules that restrict lateral movement between cluster nodes and to/from non-approved destinations
| Network Component | Configuration | Purpose |
| VNet | Dedicated /16 or /17 (e.g., 10.10.0.0/16) | Container for all Databricks network resources |
| Public Subnet | /17 or /18; NSG with Databricks service tag rules | Host-side NICs for every cluster node; required by Databricks (even in "no public IP" mode) |
| Private Subnet | /17 or /18; NSG with Databricks service tag rules | Container-side NICs for every cluster node; no public IPs; carries data plane traffic |
| Azure Private Endpoints | One per PaaS service (ADLS, Key Vault, ACR, etc.) | Private IP access to Azure services; eliminates data exfiltration via public endpoints |
| Azure Firewall / NVA | Custom route table (0.0.0.0/0 → Firewall) | Inspect and log all outbound traffic; allow only approved destinations |
| Private DNS Zones | One per private endpoint service type | Resolve Azure service hostnames to private IPs within VNet |
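Subnet sizing caps cluster capacity, so it is worth computing before deployment. The sketch below uses the standard-library ipaddress module; it assumes each cluster node consumes one IP in the subnet (nodes need an address in both the host and container subnets, so size them identically) and that Azure reserves 5 addresses per subnet. The function name is illustrative, not a Databricks API.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet

def max_cluster_nodes(cidr: str) -> int:
    """Rough upper bound on Databricks cluster nodes a subnet can host.

    Assumes one IP per node in this subnet; because every node also needs
    an address in the paired subnet, apply the same bound to both.
    """
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AZURE_RESERVED_IPS

print(max_cluster_nodes("10.10.128.0/18"))  # 16379
print(max_cluster_nodes("10.10.0.0/24"))    # 251
```

A /18 pair comfortably supports thousands of concurrent nodes; a /24 caps the workspace at roughly 250, which large autoscaling fleets can exhaust quickly.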
7.2 Identity Integration with Azure Active Directory
Azure Databricks uses Microsoft Entra ID (formerly Azure Active Directory) as its identity provider. With Unity Catalog enabled, this eliminates Databricks-local user management and enables:
- Single Sign-On via Entra ID — users authenticate with their corporate credentials
- Conditional Access Policies — MFA requirements, device compliance, IP location restrictions apply to Databricks access
- Group synchronization via SCIM — Entra ID security groups sync to Databricks automatically; deprovisioning in Entra ID immediately revokes Databricks access
- Service principal authentication — pipelines and automation use managed identities or service principals with federated credentials; no secrets in code
7.3 Key Azure Integration Patterns
| Azure Service | Integration Pattern | Authentication | Use Case |
| ADLS Gen2 | Direct mount via Unity Catalog external location; no storage credentials in notebooks | Managed Identity on cluster; or service principal via Unity Catalog storage credential | All Bronze/Silver/Gold storage; primary lakehouse storage layer |
| Azure Event Hubs | Kafka-compatible connector (EventHubsConf); or Azure Event Hubs Spark connector | SAS token via Key Vault secret; or Managed Identity (recommended) | Streaming ingestion into Bronze tables via DLT or Structured Streaming |
| Azure Key Vault | Databricks secret scope backed by Azure Key Vault | Managed Identity; no AKV credentials stored in Databricks | All secrets (connection strings, API keys, tokens); never hardcode credentials |
| Azure SQL Database | JDBC connector; or recommended: Data Factory staging to Delta | Service principal with SQL login; or Managed Identity (SQL MI) | Incremental extraction from operational databases into Bronze |
| Azure Data Factory | ADF Databricks Notebook/Jar/Python activities | Service principal with Contributor role on workspace | Orchestrate Databricks jobs from ADF parent pipelines; cross-system dependencies |
| Microsoft Fabric | Fabric shortcut to Delta tables in ADLS Gen2; or Delta Sharing protocol | Entra ID passthrough; or Delta Sharing credential | Serve Databricks gold tables in Fabric/Power BI without data movement |
| Azure Monitor | Cluster log delivery to Log Analytics via diagnostic settings | Workspace MSI; no configuration required beyond enabling | Unified observability; query Databricks logs alongside all Azure resources |
7.4 Secrets Management
# NEVER do this:
# connection_string = "Server=myserver.database.windows.net;User=admin;Password=P@ssw0rd!"

# ALWAYS do this: Key Vault-backed secret scope

# Step 1: Create Key Vault-backed secret scope (one-time, via Databricks CLI)
# databricks secrets create-scope \
#   --scope "production-secrets" \
#   --scope-backend-type AZURE_KEYVAULT \
#   --resource-id "/subscriptions/…/vaults/contoso-kv-prod" \
#   --dns-name "https://contoso-kv-prod.vault.azure.net/"

# Step 2: Read secret in notebook/job (secret never exposed in plain text)
jdbc_password = dbutils.secrets.get(scope="production-secrets", key="sql-db-password")
# The value of jdbc_password is REDACTED in all Databricks logs
# print(jdbc_password) → [REDACTED]

# Step 3: Use in connection
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://server.database.windows.net") \
    .option("dbtable", "dbo.orders") \
    .option("user", dbutils.secrets.get("production-secrets", "sql-db-user")) \
    .option("password", jdbc_password) \
    .load()
CHAPTER 8
Security Architecture & Zero-Trust Design
8.1 Security Layers
A production Databricks security architecture operates across five layers. Each layer is necessary; none is sufficient alone:
| Security Layer | Controls | Databricks Mechanism |
| Identity & Authentication | SSO, MFA, conditional access, SCIM provisioning, service principal governance | Entra ID integration, Unity Catalog identity management |
| Network | VNet injection, private endpoints, NSG rules, firewall egress control, no-public-IP clusters | Azure VNet injection, Private Link, Azure Firewall integration |
| Data Access | Table/column/row-level access control, column masks, row filters, data masking | Unity Catalog RBAC, column masks, row filters |
| Compute | Cluster policies, single-user access mode, no SSH access, immutable cluster configuration | Cluster access control, cluster policies, admin restrictions |
| Audit & Monitoring | All data access logged, admin actions logged, anomaly detection, retention compliance | Unity Catalog audit logs, Azure Monitor, Databricks audit log delivery |
8.2 Cluster Access Modes
Cluster access mode is one of the most critical security decisions in Databricks. The access mode determines who can use the cluster and what data they can access — a misconfigured access mode is the most common source of unintended data exposure in Databricks deployments.
| Access Mode | Who Can Use | Unity Catalog Support | Use Case | Recommended |
| Single User | One user or service principal only | Full support (all features) | Production jobs, sensitive ML workloads, privileged analyst access | Yes (all production jobs) |
| Shared | Multiple users, process isolation via kernel namespacing | Full support | Shared interactive clusters for teams with different permissions | Yes (multi-user interactive) |
| No Isolation Shared (Legacy) | Multiple users, NO process isolation | Limited (no row filters, column masks) | Legacy workloads only | No (avoid for any new deployment) |
| Custom (legacy) | Databricks-local users/groups only | Not supported | Pre-UC legacy workloads | No (migrate to Unity Catalog) |
Security Anti-Pattern: Many organizations set cluster access mode to "No Isolation Shared" because it was the historical default and "things just work." This mode provides zero process isolation between users: a notebook run by User A can access the memory, temp files, and Spark state of User B running concurrently on the same cluster. Never use No Isolation Shared mode in production environments handling sensitive data.
8.3 Cluster Policies
Cluster policies enforce governance controls on cluster creation — preventing cost overruns, security misconfigurations, and unapproved software. In enterprise deployments, cluster policies are mandatory: without them, any user with CREATE CLUSTER permission can launch arbitrarily large clusters with any configuration.
// Cluster Policy JSON: Production Job Clusters
// Enforce: spot instances, auto-termination, approved node types, no SSH
{
  "cluster_type": {"type": "fixed", "value": "job"},  // Jobs only, no all-purpose
  "autotermination_minutes": {
    "type": "range", "minValue": 10, "maxValue": 30, "defaultValue": 20
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS4_v2", "Standard_DS5_v2", "Standard_E8s_v3", "Standard_E16s_v3"]
  },
  "azure_attributes.availability": {
    "type": "fixed", "value": "SPOT_WITH_FALLBACK_AZURE"  // Spot instances required
  },
  "ssh_public_keys": {"type": "fixed", "value": []},  // No SSH access
  "data_security_mode": {
    "type": "fixed", "value": "SINGLE_USER"  // Enforce single-user access mode
  },
  "runtime_engine": {"type": "fixed", "value": "PHOTON"}  // Require Photon for performance
}
CHAPTER 9
Performance Engineering & Cost Optimization
9.1 Photon Engine
Photon is Databricks’ C++ vectorized query engine that replaces the JVM-based Spark execution engine for SQL and DataFrame operations. Photon delivers 2-8x query speedup for analytical workloads with no code changes required — it is drop-in compatible with any Spark SQL or DataFrame API code.
| Workload Type | Photon Speedup | Recommendation |
| Aggregations (GROUP BY, COUNT, SUM) | 3-8x | Enable Photon on all analytics clusters; cost of DBU premium pays back immediately |
| Hash joins on large tables | 2-5x | Critical for gold layer transformations with multi-table joins |
| Sorting and ordering | 2-4x | Especially impactful for OPTIMIZE Z-ORDER operations |
| Data scans with filters (WHERE clauses) | 2-6x | Combined with Delta data skipping, Photon makes scans of 10TB tables feel like point lookups |
| Streaming (Structured Streaming) | 1.5-3x | Enable on DLT clusters; reduces streaming lag during throughput spikes |
| Complex Python UDFs | No improvement | Python UDFs bypass Photon; rewrite as SQL expressions or pandas UDFs for Photon benefit |
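Because Photon-enabled compute carries a DBU premium, the economic question is whether the speedup outpaces the premium. The back-of-envelope model below is illustrative only: the 2.0x premium multiplier is an assumption, not published pricing, and it ignores the VM-hour savings from shorter runtimes (which make the real case for Photon stronger than shown).

```python
def photon_cost_ratio(speedup: float, dbu_premium: float = 2.0) -> float:
    """Relative DBU cost of a job with Photon vs. without.

    Simplified model: job DBU cost scales with runtime, so cost ratio is
    (premium multiplier) / (speedup). Ratio < 1.0 means Photon is cheaper
    despite the premium. The 2.0 premium is an illustrative assumption.
    """
    return dbu_premium / speedup

# A 3x aggregation speedup at an assumed 2x DBU premium: ~33% cheaper, 3x faster.
print(f"{photon_cost_ratio(3.0):.2f}")  # 0.67
# A Python-UDF-bound job (speedup ~1.0) just pays the full premium.
print(f"{photon_cost_ratio(1.0):.2f}")  # 2.00
```

This is why the table recommends Photon everywhere except UDF-heavy workloads: any speedup above the premium multiplier pays for itself even before counting VM savings.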
9.2 Cost Optimization: The Four Levers
Lever 1 — Spot Instances
Spot (preemptible) instances reduce VM cost by 60-80% compared to on-demand pricing. Databricks handles spot instance interruptions transparently: if a worker is preempted, the task is automatically retried on a replacement worker. The job runs longer but at 1/3 to 1/5 the compute cost.
Use SPOT_WITH_FALLBACK_AZURE in cluster policies: spot instances preferred, falls back to on-demand if no spot capacity. In practice, Azure spot availability for DS4_v2 and DS5_v2 exceeds 95% in major regions, making spot suitable for all production batch jobs.
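The expected savings from SPOT_WITH_FALLBACK_AZURE can be sanity-checked with a simple expected-value model. Every parameter below (discount, availability, retry overhead) is an illustrative assumption to be replaced with your region's observed numbers, not a quoted Azure price.

```python
def expected_job_cost(on_demand_cost: float,
                      spot_discount: float = 0.7,    # spot is ~70% cheaper (assumed)
                      spot_availability: float = 0.95,
                      retry_overhead: float = 0.10) -> float:
    """Expected compute cost of a job under SPOT_WITH_FALLBACK_AZURE.

    Illustrative model: with probability spot_availability the job runs on
    spot capacity (discounted rate, plus a runtime penalty for preemption
    retries); otherwise it falls back to on-demand at full price.
    """
    spot_cost = on_demand_cost * (1 - spot_discount) * (1 + retry_overhead)
    fallback_cost = on_demand_cost
    return spot_availability * spot_cost + (1 - spot_availability) * fallback_cost

# A batch job costing $100 on-demand:
print(f"${expected_job_cost(100):.2f}")  # $36.35
```

Even after pricing in a 10% retry penalty and a 5% fallback rate, the expected cost lands near one-third of on-demand, consistent with the 60-80% savings range cited above.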
Lever 2 — Cluster Right-Sizing
// Finding the right cluster size: use Ganglia metrics or Databricks Compute UI
//
// Symptoms of an over-provisioned cluster:
//   - Executor CPU utilization < 40% during active processing
//   - Memory utilization < 50% (driver and executors)
//   - Shuffle spill to disk = 0 (could handle smaller workers)
//
// Symptoms of an under-provisioned cluster:
//   - Executor CPU utilization > 85% sustained
//   - Shuffle spill to disk > 0 (increase worker count or use memory-optimized VMs)
//   - High task time variance (some tasks 10x slower = data skew; fix in code, not hardware)
//
// Rule of thumb: start with 1 worker per 50GB of data processed
// For memory-intensive operations (ML, wide joins): use E-series (memory-optimized)
// For compute-intensive operations (Photon SQL, streaming): use D-series
//
// Enable cluster autoscaling for variable workloads:
//   min_workers = minimum needed for SLA at lowest throughput
//   max_workers = maximum needed at peak throughput
//   autoscale.mode = "ENHANCED" for DLT, "LEGACY" for general clusters
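The rule of thumb above can be turned into a starting-point calculator. This helper is purely illustrative (the function name and defaults are not a Databricks API); its output is an initial guess to refine against the CPU, memory, and shuffle-spill symptoms listed above.

```python
import math

def starter_cluster_size(data_gb: float,
                         gb_per_worker: float = 50,
                         min_workers: int = 2,
                         max_workers: int = 100) -> dict:
    """Starting autoscale bounds from the 1-worker-per-50GB rule of thumb.

    max_workers covers peak throughput; min_workers is a floor (here a
    quarter of peak, an arbitrary illustrative choice) for SLA at low load.
    """
    workers = math.ceil(data_gb / gb_per_worker)
    return {
        "min_workers": max(min_workers, workers // 4),
        "max_workers": min(max_workers, max(workers, min_workers)),
    }

# An 800GB daily batch:
print(starter_cluster_size(800))  # {'min_workers': 4, 'max_workers': 16}
```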
Lever 3 — Serverless Compute
Databricks Serverless SQL Warehouses and Serverless Job Compute eliminate the cluster start time and idle cost problem entirely. With serverless, compute is ready in < 5 seconds, scales to zero when not in use, and charges only for active query time. For workloads with unpredictable or bursty patterns, serverless is almost always the lowest total cost option.
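The serverless-vs-classic decision reduces to a utilization break-even. The sketch below is an illustrative model, not a pricing calculator: the 1.5x serverless rate premium is an assumed ratio, and real comparisons should use your negotiated DBU rates.

```python
def serverless_vs_classic(active_hours: float,
                          provisioned_hours: float,
                          classic_rate: float = 1.0,
                          serverless_premium: float = 1.5) -> str:
    """Which compute option is cheaper for a bursty workload (toy model).

    Classic pays classic_rate for every provisioned hour, idle included;
    serverless pays an assumed per-hour premium but only for active hours.
    """
    classic_cost = provisioned_hours * classic_rate
    serverless_cost = active_hours * classic_rate * serverless_premium
    return "serverless" if serverless_cost < classic_cost else "classic"

# 2 active hours spread across a 10-hour provisioned window:
print(serverless_vs_classic(2, 10))   # serverless
# A fully-utilized cluster pays the premium for nothing:
print(serverless_vs_classic(10, 10))  # classic
```

Under these assumptions the break-even sits where active time is 1/premium of provisioned time, which is why bursty, unpredictable workloads favor serverless and steady 24/7 pipelines favor classic clusters.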
Lever 4 — Storage Optimization
| Storage Optimization | Implementation | Cost Impact |
| ADLS Gen2 tiering | Move Bronze data older than 90 days to Cool tier; archive after 1 year | 40-70% reduction in Bronze storage cost |
| Delta VACUUM | Run VACUUM … RETAIN 720 HOURS (30 days) weekly on all Delta tables | Removes old Parquet files orphaned by deletes/updates; 20-40% storage reduction on high-churn tables |
| Compression tuning | Use Snappy for hot data (faster decompress); Zstd for cold data (better ratio) | 10-30% storage reduction with no query performance impact for ADLS-backed tables |
| Predictive Optimization | Enable UC Predictive Optimization to auto-manage OPTIMIZE and VACUUM | Avoids over-optimizing small tables and missed VACUUM runs; saves engineering time |
9.3 Handling Data Skew
Data skew — where some partition keys have vastly more rows than others — is the most common cause of mysterious Spark job slowdowns that more hardware cannot fix. Identifying and resolving skew is a core Databricks engineering competency.
# Identify skew: look for tasks taking 10x+ longer than the median task
# In Spark UI: Stage Detail → Task Metrics → sort by Duration descending

# Solution 1: Salting (for joins with skewed keys)
# Add a random salt to the skewed key to distribute it across multiple partitions
from pyspark.sql import functions as F

SALT_FACTOR = 20  # Tune based on skew severity; more salt = more parallelism

# Salt the large (skewed) table
large_df_salted = large_df.withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"),
             (F.rand() * SALT_FACTOR).cast("int").cast("string"))
)

# Explode the small (lookup) table to match all salt values
small_df_exploded = small_df.crossJoin(
    spark.range(SALT_FACTOR).select(F.col("id").cast("string").alias("salt"))
).withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"), F.col("salt"))
)

# Join on the salted key
result = large_df_salted.join(small_df_exploded, "customer_id_salted")

# Solution 2: Adaptive Query Execution (AQE) for automatic skew handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE automatically splits skewed partitions at runtime (Databricks Runtime 8.0+)
# Enable AQE first; only implement manual salting if AQE is insufficient
CHAPTER 10
Operations, Monitoring & SRE Practices
10.1 Observability Stack
A production Databricks deployment requires three observability layers: job-level monitoring (did the job succeed? how long did it take?), cluster-level monitoring (CPU, memory, network utilization), and data-level monitoring (did the data arrive? is it fresh? does it pass quality checks?).
| Observability Layer | Data Source | Tooling | Key Alerts |
| Job & Pipeline | Databricks Jobs API, DLT pipeline events, MLflow run metrics | Azure Monitor (Log Analytics), Databricks Workflow emails | Job failure, duration > SLA, DLT expectation violation rate spike |
| Cluster & Compute | Databricks cluster metrics to Azure Monitor, Ganglia (legacy) | Azure Monitor dashboards, custom KQL queries | Spot interruption rate > 10%, memory OOM events, driver failure |
| Data Quality | DLT expectations metrics, custom Great Expectations/Soda suites | Databricks SQL dashboards, Power BI, PagerDuty | Null rate spike, row count anomaly, freshness breach (> SLA hours since last update) |
| Cost | Databricks cost API, Azure Cost Management | Azure Cost Management dashboards + budget alerts | Daily DBU spend > budget threshold, unexpected long-running cluster |
10.2 CI/CD for Databricks
Production Databricks deployments require a CI/CD pipeline that tests notebooks and jobs before deployment, promotes code through environments (dev → test → prod), and maintains configuration as code for all workspace resources.
# Azure DevOps Pipeline: Databricks CI/CD (YAML)
# Stages: Test → Deploy to Test → Integration Test → Deploy to Prod
trigger:
  branches:
    include: [main, release/*]

stages:
  - stage: Test
    jobs:
      - job: UnitTests
        steps:
          - script: pip install databricks-connect pytest pytest-mock
          - script: pytest tests/unit/ -v --junitxml=test-results.xml
          - task: PublishTestResults@2
            inputs:
              testResultsFiles: test-results.xml

  - stage: DeployTest
    dependsOn: Test
    jobs:
      - job: DeployToTest
        steps:
          - script: |
              pip install databricks-cli
              databricks workspace import_dir ./notebooks /Shared/pipelines/test --overwrite
              databricks jobs reset --job-id $TEST_JOB_ID --json @job_configs/test/ingest_job.json
            env:
              DATABRICKS_HOST: $(DATABRICKS_TEST_HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_TEST_TOKEN)

  - stage: DeployProd
    dependsOn: IntegrationTest
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToProd
        environment: production  # Requires manual approval gate
        strategy:
          runOnce:
            deploy:
              steps:
                # Databricks Asset Bundles: recommended for production deployments
                - script: databricks bundle deploy --target prod
10.3 Databricks Asset Bundles
Databricks Asset Bundles (DABs) are the recommended approach for managing Databricks resources as code. DABs define jobs, pipelines, notebooks, and cluster configurations in a YAML manifest that can be version-controlled, deployed consistently across environments, and integrated with CI/CD pipelines.
# databricks.yml: Asset Bundle definition
bundle:
  name: analytics-platform

variables:
  catalog: {default: dev_catalog}
  environment: {default: dev}

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-dev.azuredatabricks.net
    variables:
      catalog: dev_catalog
      environment: dev

  prod:
    mode: production
    workspace:
      host: https://adb-prod.azuredatabricks.net
    variables:
      catalog: main
      environment: prod

resources:
  pipelines:
    iot_sensor_pipeline:
      name: "IoT Sensor Pipeline (${var.environment})"
      catalog: ${var.catalog}
      schema: silver
      libraries:
        - notebook: {path: ./src/pipelines/iot_sensor_pipeline.py}
      continuous: true
      channel: CURRENT

# Deploy:  databricks bundle deploy --target prod
# Destroy: databricks bundle destroy --target dev
CHAPTER 11
Migration Patterns: On-Premises & Cloud to Databricks
11.1 Migration Assessment Framework
Before migrating any workload to Databricks, a structured assessment determines which workloads benefit most, which require significant re-engineering, and which should remain on existing platforms.
| Workload Type | Migration Complexity | Migration Value | Recommended Approach |
| SQL Server SSIS packages | Medium | High | Rewrite as DLT pipelines or ADF; use SSIS → Databricks migration accelerator tools |
| Informatica PowerCenter mappings | High | High | Use Databricks professional services migration toolkit; expect 60-70% automation, 30-40% manual rewrite |
| Teradata SQL workloads | Medium-High | Very High (cost reduction) | Use Databricks Teradata migration utilities; SQL dialect conversion tools; validate results |
| On-premises Hadoop (HDFS + Hive) | Low-Medium | Very High | Lift-and-shift HDFS to ADLS Gen2; convert Hive tables to Delta; Spark code works with minimal changes |
| Azure Synapse Analytics (dedicated pool) | Low | Medium | Share Delta tables with Synapse via Delta Sharing; or move workloads that benefit from Databricks compute onto Databricks |
| Snowflake | Medium | High (for ML-heavy orgs) | Use Delta Sharing to mirror key tables; migrate engineering workloads; evaluate co-existence vs. full migration |
11.2 Hadoop Migration Pattern
Hadoop-to-Databricks is the most common migration pattern and also the most straightforward. HDFS and Hive are architecturally similar to ADLS Gen2 and Delta Lake, making it possible to migrate 80% of Hadoop workloads with minimal code changes.
- Storage migration: Use Azure Data Box for initial bulk transfer of HDFS data to ADLS Gen2 (for datasets >10TB). Use AzCopy for incremental transfer during cutover. Target: Bronze layer in ADLS Gen2.
- Hive table conversion: Run Databricks CONVERT TO DELTA on all Parquet-backed Hive tables. For ORC-backed tables, read as Spark DataFrame and write as Delta. Register in Unity Catalog metastore.
- Spark code migration: Existing PySpark and Scala Spark code runs on Databricks Runtime without modification in ~85% of cases. Common issues: Hadoop-specific configuration properties, HDFS path references, and deprecated Spark 2.x APIs that were removed in Spark 3.x.
- Hive metastore migration: Use Databricks Unity Catalog migration tool to migrate existing Hive metastore to Unity Catalog, preserving table definitions, partition metadata, and (where possible) access control information.
- Scheduler migration: Replace Apache Oozie or Airflow (on Hadoop YARN) with Databricks Workflows or Azure Data Factory. Export Oozie workflow XML and use the Databricks Oozie migration converter for automated translation.
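Of the code-migration issues above, hard-coded HDFS path references are the most mechanical to fix. The helper below is a minimal sketch of the rewrite; the container and storage-account names are invented placeholders, and real migrations typically drive the mapping from a per-directory lookup table rather than a single default.

```python
import re

def rewrite_hdfs_path(path: str,
                      container: str = "bronze",
                      account: str = "contosodatalake") -> str:
    """Rewrite a hard-coded HDFS URI to its ADLS Gen2 (abfss) equivalent.

    Strips the hdfs://namenode:port authority and re-roots the path under
    an ADLS Gen2 container. Non-HDFS paths are returned unchanged.
    """
    match = re.match(r"hdfs://[^/]+(/.*)", path)
    if not match:
        return path  # Relative or already-migrated path: leave as-is
    return f"abfss://{container}@{account}.dfs.core.windows.net{match.group(1)}"

print(rewrite_hdfs_path("hdfs://namenode:8020/data/pos/2024/01"))
# abfss://bronze@contosodatalake.dfs.core.windows.net/data/pos/2024/01
```

Running a rewrite like this across a codebase (plus grepping for residual `hdfs://` literals) covers most of the path-reference failures seen in the ~15% of Spark code that needs changes.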
11.3 Teradata Migration Pattern
Teradata migrations deliver the highest ROI of any Databricks migration — typically 70-85% cost reduction on compute — but require the most careful SQL dialect conversion and performance validation.
| Migration Phase | Activities | Duration (Typical) | Key Risk |
| Discovery | Inventory all Teradata objects (tables, views, stored procedures, macros); profile query patterns; identify BTEQ and TPT scripts | 2-4 weeks | Undocumented dependencies; legacy macros with no documentation |
| Schema conversion | Convert Teradata DDL to Delta Lake DDL; map data types; handle SET tables (implicit dedup) vs. MULTISET (allow duplication) | 4-8 weeks | SET table semantics differ from Delta; requires explicit deduplication logic |
| SQL conversion | Convert BTEQ → PySpark/SQL; convert Teradata-specific functions (QUALIFY, SAMPLE, periods) to Spark equivalents | 8-16 weeks | QUALIFY (row numbering within groups) needs careful WINDOW function translation |
| Performance validation | Run parallel execution: Teradata vs. Databricks; validate row counts, aggregations, and business metrics match | 4-8 weeks | Floating point rounding differences; NULL handling differences |
| Cutover | Final data sync; redirect pipelines; decommission Teradata | 2-4 weeks | Cutover window; rollback plan if validation fails |
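The QUALIFY risk flagged in the SQL-conversion phase deserves a concrete picture. A typical Teradata pattern is `QUALIFY ROW_NUMBER() OVER (PARTITION BY acct ORDER BY ts DESC) = 1`; on Spark versions without QUALIFY support, the usual translation wraps the window function in a subquery and filters on it. The semantics the translation must preserve (exactly one row per key, the latest by timestamp) are shown below in plain Python as a validation oracle; the column names are invented for illustration.

```python
def latest_per_key(rows):
    """Keep the newest row per account, mirroring QUALIFY ROW_NUMBER() = 1.

    rows: iterable of dicts with 'acct' and 'ts' keys. Returns one row per
    acct, the one with the greatest ts, sorted by acct for stable comparison.
    """
    best = {}
    for row in rows:
        if row["acct"] not in best or row["ts"] > best[row["acct"]]["ts"]:
            best[row["acct"]] = row
    return sorted(best.values(), key=lambda r: r["acct"])

txns = [
    {"acct": "A", "ts": 1, "amt": 10},
    {"acct": "A", "ts": 3, "amt": 30},
    {"acct": "B", "ts": 2, "amt": 20},
]
print(latest_per_key(txns))
# [{'acct': 'A', 'ts': 3, 'amt': 30}, {'acct': 'B', 'ts': 2, 'amt': 20}]
```

During parallel validation, running a reference implementation like this against a sample and diffing it with the converted Spark SQL output catches off-by-one window-frame and tie-breaking mistakes early.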
CHAPTER 12
Real-World Case Studies
Case Study 1: Global Retailer — Real-Time Inventory Analytics
| Dimension | Details |
| Organization | Pan-European omnichannel retailer; 1,200 stores; 85M transactions/day across POS, e-commerce, and fulfilment systems |
| Challenge | Inventory visibility latency of 4-6 hours causing 12% out-of-stock rate. Demand forecasting models running on stale batch data producing recommendations 2 days behind actual demand signals. €47M annual revenue loss attributed to stockouts and overstocking combined. |
| Databricks Scope | DLT streaming pipelines ingesting POS events from 1,200 stores via Event Hubs (120K events/second peak); real-time inventory position table updated every 60 seconds; XGBoost demand forecasting model retrained daily; 180-day sales history in Silver; real-time serving via Model Serving API |
| Architecture | Bronze: raw POS events (Event Hubs → DLT streaming, Delta CDF enabled). Silver: cleaned transactions, inventory deltas (DLT streaming). Gold: real-time inventory position table (MERGE every 60 seconds), store-level demand signals. ML: daily demand forecast model using Gold features; served via Model Serving REST API to merchandising system. |
Key Engineering Decisions
Event deduplication: POS systems occasionally double-emit events during network retries. DLT Expectations with FAIL UPDATE on duplicate transaction_id, combined with a 2-hour deduplication window using Delta CDF, eliminate duplicate inventory decrements without requiring a full table scan.
Inventory MERGE optimization: The inventory position table (450M rows, one per SKU per store) required MERGE every 60 seconds. Naive MERGE was too slow (45 seconds). Solution: partition by store_id (1,200 partitions), MERGE with WHEN NOT MATCHED THEN INSERT only on the affected store partition. Merge time reduced to 8 seconds — well within the 60-second SLA.
Model serving latency: Merchandising system requires forecast in < 200ms. Databricks Model Serving (Small workload size, scale-to-zero disabled) delivers p99 < 85ms for the XGBoost model with 320 input features. Feature lookup from Delta table adds 12ms average — within SLA.
| KPI | Before Databricks | After Databricks (6 months) |
| Inventory visibility latency | 4-6 hours | < 2 minutes |
| Out-of-stock rate | 12% | 7.3% |
| Demand forecast accuracy (MAPE) | 31% | 18% |
| Data infrastructure cost (annual) | €4.2M (Teradata + SAS) | €1.1M (Databricks + ADLS) |
| Engineering team size (data) | 28 FTE | 16 FTE (12 freed for product work) |
Case Study 2: Financial Services Firm — Credit Risk Model Factory
| Dimension | Details |
| Organization | UK challenger bank; 2.4M customers; regulatory capital models (IFRS 9 ECL, Basel III PD/LGD/EAD) requiring quarterly revalidation and monthly production runs |
| Challenge | Monthly credit risk model runs taking 72 hours on legacy SAS environment. Regulatory validation requiring complete audit trail of every model input, transformation, and output — no automated lineage. Model retraining requiring 3-week manual process (data preparation, training, validation, sign-off, deployment). |
| Databricks Scope | Full model factory: data ingestion from core banking system via CDC (Azure SQL → Bronze Delta), feature engineering pipeline (Silver), model training and registry (MLflow + Unity Catalog), monthly production scoring (Gold), regulatory reporting (Gold → Azure SQL → Power BI) |
Regulatory Compliance Architecture
Immutable audit trail: Delta Lake time travel serves as the regulatory audit trail. Every model training run records the exact Delta table version of all input tables in the MLflow run metadata. Regulators can request reconstruction of any model output by restoring the exact input data state — guaranteed by Delta’s transaction log.
Model governance workflow: MLflow model registry stages (Candidate → Staging → Production) map to the bank’s model risk governance process. Each stage transition requires documented challenger model test results, populated as MLflow run tags. The Production promotion requires manual sign-off in the registry UI from the Model Risk team — creating an automated, auditable governance record.
Parallel validation: During the 6-month transition, Databricks and SAS ran in parallel. Databricks results were compared to SAS output at each stage. Any discrepancy >0.01% triggered an investigation. Final validation showed 99.97% numerical agreement — the 0.03% difference traced to floating-point precision differences between SAS and Python, accepted by the regulator after documentation.
| KPI | Before Databricks (SAS) | After Databricks |
| Monthly model run duration | 72 hours | 4.5 hours |
| Model retraining cycle | 3 weeks manual | 6 hours automated (CI/CD + MLflow) |
| Audit trail completeness | Manual documentation; gaps common | 100% automated via Delta time travel + MLflow |
| Regulatory audit preparation | 8 weeks | 3 days |
| Annual platform cost | £1.8M (SAS licences) | £420K (Databricks + compute) |
Case Study 3: Manufacturing — Predictive Maintenance at IoT Scale
A global industrial manufacturer deployed Databricks to process sensor telemetry from 85,000 connected machines across 140 factories in 28 countries. The business case: reduce unplanned downtime (average cost: $260,000/hour per production line) through ML-based predictive maintenance.
Scale challenge: 85,000 machines, each emitting 20 sensor readings per second, generate 1.7 million events per second sustained (4.2 million at peak). Event Hubs (100 throughput units) feeds a Databricks DLT streaming pipeline; the Bronze layer receives 147TB of raw sensor data per month.
ML architecture: One model per machine type (42 machine types) trained on 18-month historical data including 3,200 documented failure events. Isolation Forest for anomaly detection (unsupervised), Gradient Boosting for time-to-failure prediction (supervised, where labelled failure data available). Models retrained weekly using latest telemetry via automated Databricks Workflow.
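The production models here are Isolation Forest and gradient boosting; as a minimal, dependency-free stand-in for the unsupervised anomaly-detection step, a rolling z-score over a single sensor stream illustrates the core idea. The window size, threshold, and sample readings are illustrative assumptions, not the manufacturer's configuration.

```python
# Toy stand-in for streaming anomaly detection: flag readings more than
# `threshold` standard deviations from the rolling mean of recent history.
import statistics
from collections import deque

def anomaly_flags(readings, window=20, threshold=3.0):
    """Return a per-reading list of booleans: True = anomalous."""
    history = deque(maxlen=window)
    flags = []
    for x in readings:
        if len(history) >= 2:
            mu = statistics.fmean(history)
            sigma = statistics.stdev(history)
            flags.append(sigma > 0 and abs(x - mu) > threshold * sigma)
        else:
            flags.append(False)  # not enough history to score yet
        history.append(x)
    return flags

# A flat signal with one spike: only the spike is flagged
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 55.0, 10.0]
```

An Isolation Forest generalizes this to many correlated sensors at once, which is why it is the better fit at 20 readings per second per machine.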
Edge-to-cloud integration: High-priority alerts (predicted failure within 4 hours) routed from Model Serving → Event Grid → factory floor SCADA system via Azure API Management. Round-trip latency from sensor reading to SCADA alert: < 90 seconds.
| KPI | Baseline | After 12 Months |
| Unplanned downtime (hours/year) | 14,200 across all factories | 8,900 (37% reduction) |
| Annual downtime cost avoidance | — | $135M (at $260K/hour average) |
| False positive rate (maintenance alerts) | N/A (no predictive system) | 8.3% (within 10% target) |
| Model prediction accuracy (failure within 4hrs) | — | 84% precision, 79% recall |
| Databricks platform cost (annual) | — | $2.1M all-in |
CHAPTER 13
90-Day Implementation Roadmap
The following roadmap reflects deployment patterns across 15+ enterprise Databricks implementations. The sequencing is optimized for: early value demonstration (quick wins by Day 30), platform stability before scale (foundation before onboarding all workloads), and sustainable operations (CI/CD and monitoring before production traffic).
Phase 1: Foundation (Days 1-30)
| Week | Activity | Owner | Success Criteria |
| 1 | Azure Databricks workspace deployment with VNet injection; Unity Catalog metastore creation; Entra ID SCIM sync configuration; Key Vault-backed secret scope setup | Data Platform Engineer + Azure Architect | Workspace live in private network; Unity Catalog active; first Entra ID group synced; Key Vault scope working |
| 1-2 | ADLS Gen2 storage account creation (bronze/silver/gold); Unity Catalog external locations and storage credentials; cluster policies defined and published | Data Platform Engineer | Storage accounts with hierarchical namespace; external locations in Unity Catalog; cluster policies active |
| 2 | Catalog and schema design in Unity Catalog; RBAC model design; data domain owner identification; Entra ID group structure for Databricks access | Data Governance Lead + Data Architect | Catalog hierarchy documented and deployed; RBAC matrix approved; all data domain owners identified and briefed |
| 2-3 | First pipeline: select highest-value Bronze ingestion (one source); build as DLT pipeline with expectations; deploy to dev; validate against source | Senior Data Engineer | First DLT pipeline ingesting into Bronze Delta table; expectations defined; data validated against source system |
| 3-4 | CI/CD pipeline setup (Azure DevOps); Databricks Asset Bundles configured; dev/prod workspace deployment automation; first automated unit tests | Data Engineer + DevOps | Code deployment automated; dev→prod promotion via pipeline; unit tests running on every PR |
| 4 | Azure Monitor integration; Log Analytics workspace linked; first cost alert configured; Databricks audit log delivery enabled | Data Platform Engineer | Cluster metrics in Log Analytics; cost alert active; audit logs flowing |
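The Asset Bundle step in weeks 3-4 can be illustrated with a minimal `databricks.yml`. The bundle name, job definition, notebook path, and workspace hosts are placeholder assumptions, shown only to indicate the shape of an environment-aware, version-controlled deployment manifest:

```yaml
# Minimal Databricks Asset Bundle sketch -- all names and hosts are
# illustrative placeholders, not a production configuration.
bundle:
  name: orders-ingestion

resources:
  jobs:
    bronze_ingest:
      name: bronze-orders-ingest
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../src/bronze_orders.py

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
```

The CI/CD pipeline then promotes code by running `databricks bundle deploy -t prod`, so the same manifest drives both environments.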
Phase 2: Activation (Days 31-60)
| Week | Activity | Owner | Success Criteria |
| 5-6 | Onboard Tier 1 workloads: build Silver transformation pipelines with DLT; define data quality expectations; deploy to production | Data Engineering Team | Tier 1 Silver tables populated; DLT expectations capturing quality metrics; pipeline SLAs defined |
| 6 | Quick win: first Gold table serving Power BI / Fabric; first analyst using Databricks SQL for self-service query; Governance Maturity demo to leadership | Data Engineering + Analytics Lead | First Gold table live; at least 3 analysts running queries; leadership demo delivered |
| 6-7 | Unity Catalog lineage verification; column masks and row filters for PII data; Purview integration configured (if in scope) | Data Engineer + Data Governance | Lineage visible in Catalog Explorer for Tier 1 tables; PII columns masked; Purview sync active |
| 7-8 | MLflow setup (if ML in scope): first experiment tracked; model registered in Unity Catalog; model serving endpoint deployed for at least one model | ML Engineer | MLflow experiments running; first model in registry with Champion/Challenger staging; serving endpoint live |
| 7-8 | Streaming pipeline deployment (if real-time in scope): Event Hubs connector; DLT streaming with Enhanced Autoscaling; latency SLA validation | Senior Data Engineer | Streaming pipeline ingesting at production volume; end-to-end latency within SLA; autoscaling tested under load |
| 8 | Cost review: spot instance utilization report; cluster right-sizing audit; Predictive Optimization enabled on UC catalogs; budget alert refinement | Data Platform Engineer + FinOps | Spot utilization >80%; identified and actioned top 3 cost reduction opportunities |
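Conceptually, the DLT expectations that drive the quality metrics in weeks 5-6 are named row-level predicates with a pass/fail policy. The following dependency-free sketch mimics `expect_or_drop` semantics outside Databricks — the rule names, predicates, and sample rows are illustrative, and this is not the DLT API itself:

```python
# Pure-Python illustration of DLT-style expectations: each expectation is a
# named predicate; failing rows are dropped (expect_or_drop semantics) and
# per-rule violation counts are recorded, as DLT does in its event log.

def apply_expectations(rows, expectations):
    """Return (passing_rows, violation_counts) for a list of dict rows."""
    violations = {name: 0 for name in expectations}
    passing = []
    for row in rows:
        ok = True
        for name, predicate in expectations.items():
            if not predicate(row):
                violations[name] += 1
                ok = False
        if ok:
            passing.append(row)
    return passing, violations

rules = {
    "valid_order_id": lambda r: r.get("order_id") is not None,
    "positive_amount": lambda r: (r.get("amount") or 0) > 0,
}
rows = [{"order_id": 1, "amount": 10.0},
        {"order_id": None, "amount": 5.0},
        {"order_id": 2, "amount": -3.0}]
good, stats = apply_expectations(rows, rules)
```

In DLT the equivalent is a `@dlt.expect_or_drop` decorator per rule, and the violation counts surface automatically in the pipeline event log rather than a return value.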
Phase 3: Scale & Optimise (Days 61-90)
| Week | Activity | Owner | Success Criteria |
| 9-10 | Onboard remaining workloads: complete Bronze ingestion for all Tier 2 sources; Silver and Gold for priority domains; DLT for all new pipelines (no raw Spark jobs) | Data Engineering Team | >80% of planned data sources ingested; all new pipelines using DLT; data quality SLAs documented |
| 10-11 | Performance engineering: identify top 10 slowest jobs; apply Z-ORDER, OPTIMIZE, and Photon; validate speedup; document optimization patterns for team | Senior Data Engineer | Top 10 jobs optimized; average job time reduced by >30%; performance tuning runbook created |
| 11 | Operations runbook: incident response procedures; on-call rotation; DLT pipeline restart procedures; cost escalation process; SLA breach response | Data Platform Lead + SRE | Runbook published; on-call rotation active; first gameday exercise completed |
| 12 | Platform review: measure KPIs vs. Day 1 baseline; business case revalidation; 90-180 day roadmap (next workloads, ML roadmap, Fabric integration plan) | CDO + Data Platform Lead | KPI improvement documented; business case validated; 90-180 day plan approved by leadership |
Critical Success Factor: The most common Databricks deployment failure mode is “platform without pipelines” — the infrastructure is deployed correctly, but business workloads are never migrated to it. Ensure at least one business-critical pipeline is live in production by Day 45. Early production usage drives adoption, surfaces real-world issues before scale, and maintains organizational momentum.
APPENDIX
Appendix: Reference Materials
A. Essential PySpark Patterns
A1. Efficient Incremental Load with Watermarks
# Production pattern: incremental load with a high-water mark
# stored in a metadata Delta table
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the last processed timestamp for this pipeline
watermark = (
    spark.table("main.metadata.pipeline_watermarks")
    .filter("pipeline_name = 'orders_ingestion'")
    .select("last_processed_ts")
    .collect()[0][0]
)

# Extract only new/changed records from the source system
# (jdbc_url and credentials are configured elsewhere)
new_records = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", f"SELECT * FROM orders WHERE updated_at > '{watermark}'")
    .load()
)

# Upsert into the Bronze Delta table
bronze_table = DeltaTable.forName(spark, "main.bronze.orders")
(
    bronze_table.alias("target")
    .merge(new_records.alias("source"), "target.order_id = source.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Advance the watermark -- but only if new records were actually read
new_watermark = new_records.agg(F.max("updated_at")).collect()[0][0]
if new_watermark is not None:
    spark.sql(f"""
        UPDATE main.metadata.pipeline_watermarks
        SET last_processed_ts = '{new_watermark}'
        WHERE pipeline_name = 'orders_ingestion'
    """)
A2. Dynamic Partition Overwrite Pattern
# Safe overwrite of specific partitions without touching others
# Use case: reprocess a date range after a source data correction

# Option 1: dynamic partition overwrite -- replaces only the partitions
# present in df, leaving all other partitions untouched
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Option 2 (preferred for Delta): replaceWhere -- atomically replaces
# exactly the rows matching the predicate
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            "transaction_dt >= '2025-01-01' AND transaction_dt <= '2025-01-07'")
    .saveAsTable("main.silver.sales_transactions")
)

# Note: replaceWhere is safer than dynamic partition overwrite for Delta --
# it uses Delta transaction semantics and rolls back atomically on failure
B. Databricks REST API Quick Reference
| Operation | Method | Endpoint | Common Use |
| List jobs | GET | /api/2.1/jobs/list | Audit all jobs; generate cost attribution reports |
| Run job now | POST | /api/2.1/jobs/run-now | Trigger job from external orchestrator (ADF, Logic Apps) |
| Get run status | GET | /api/2.1/jobs/runs/get?run_id={id} | Poll completion from external orchestrator |
| List DLT pipelines | GET | /api/2.0/pipelines | Audit pipelines; governance inventory |
| Get pipeline events | GET | /api/2.0/pipelines/{id}/events | Extract DLT expectation metrics for monitoring |
| List Unity Catalog tables | GET | /api/2.1/unity-catalog/tables | Data catalog integration; asset inventory |
| Get table lineage | GET | /api/2.0/lineage-tracking/table-lineage | Extract lineage for Purview integration |
| Create cluster policy | POST | /api/2.0/policies/clusters/create | Infrastructure-as-code cluster governance |
| List MLflow experiments | GET | /api/2.0/mlflow/experiments/list | ML governance; experiment inventory |
| Get model versions | GET | /api/2.0/mlflow/model-versions/search | Model registry governance; deployment audit |
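As an illustration of driving the Jobs API from an external orchestrator, the sketch below builds the `run-now` request using only the standard library. The host, token, job ID, and notebook parameters are placeholders, and the request is only constructed, never sent:

```python
# Build (but do not send) a POST to /api/2.1/jobs/run-now.
# Host, token, and job_id are illustrative placeholders.
import json
import urllib.request

def build_run_now_request(host, token, job_id, params=None):
    """Construct the run-now request an external orchestrator would send."""
    body = {"job_id": job_id}
    if params:
        body["notebook_params"] = params
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_run_now_request("https://adb-123.azuredatabricks.net",
                            "dapi-placeholder", 42,
                            {"run_date": "2025-01-07"})
# urllib.request.urlopen(req) would trigger the job; the orchestrator then
# polls /api/2.1/jobs/runs/get?run_id=... until the run terminates
```

The same construction pattern applies to the other endpoints in the table; only the path, method, and body change.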
C. Databricks Runtime Version Policy
| Runtime Channel | Description | Recommendation |
| CURRENT | Latest stable DBR; receives bug fixes and security patches | Use for all production workloads; update quarterly |
| PREVIEW | Next DBR version; not yet GA; may contain breaking changes | Use in development/test only; never in production |
| LTS (Long-Term Support) | Extended support cycle (2 years); conservative updates | Use for regulated workloads with strict change control; accept performance tradeoff vs. CURRENT |
| Custom (ML Runtime) | Includes TensorFlow, PyTorch, scikit-learn, XGBoost pre-installed | Use for all ML workloads; eliminates library installation time on cluster start |
D. Glossary
| Term | Definition |
| Adaptive Query Execution (AQE) | Spark feature that re-optimizes query plans at runtime based on actual data statistics; automatically handles skew, coalesces partitions, and selects join strategies |
| Asset Bundle (DAB) | Databricks resource configuration framework that defines jobs, pipelines, and clusters as YAML manifests for version-controlled, environment-aware deployment |
| Change Data Feed (CDF) | Delta Lake feature that captures row-level changes (inserts, updates, deletes) with change type metadata, enabling efficient incremental downstream processing |
| Delta Lake | Open-source storage layer providing ACID transactions, scalable metadata, time travel, and schema enforcement on top of Parquet files in cloud object storage |
| Delta Live Tables (DLT) | Databricks declarative pipeline framework; engineers declare data transformations and quality expectations; DLT manages execution, retry, monitoring, and autoscaling |
| Lakehouse | Architecture combining data lake storage economics with data warehouse reliability and performance, using open table formats (Delta Lake, Iceberg, Hudi) as the unifying layer |
| Medallion Architecture | Data organization pattern (Bronze/Silver/Gold) representing progressive data quality levels, from raw ingestion to business-certified analytics-ready assets |
| MLflow | Open-source platform for ML lifecycle management: experiment tracking, model registry, model serving, and evaluation; created by Databricks; integrated into Unity Catalog |
| Photon | Databricks C++ vectorized query engine replacing JVM Spark execution for analytical workloads; delivers 2-8x performance improvement with no code changes |
| Unity Catalog | Databricks unified governance layer providing centralized access control, lineage tracking, data sharing, and audit across all workspaces in an organization |
| Z-Ordering | Delta Lake data layout optimization that co-locates related data in the same files based on specified columns, improving query performance by enabling more aggressive data skipping |
