DATABRICKS ON AZURE
Lakehouse Architecture for Petabyte-Scale Analytics
Unity Catalog, Delta Live Tables, ML at Scale & Azure Integration Patterns
| Data Platform Practice | Microsoft Ecosystem Veratas | 2025 Edition |
Classification: Public | Version 1.0 | 2025
Executive Summary
The modern data stack has converged on a single architectural truth: the lakehouse. By unifying the low-cost, flexible storage of a data lake with the reliability, performance, and governance of a data warehouse, the lakehouse eliminates the duplication, inconsistency, and operational overhead that plagued the two-tier architectures of the previous decade. Databricks — the company that originated Apache Spark, Delta Lake, and MLflow — delivers the most mature and battle-tested lakehouse platform available, and on Azure it integrates deeply with the Microsoft ecosystem that most enterprises already operate.
This white paper addresses the gap between Databricks introductory documentation and the architectural decisions that determine whether a Databricks deployment scales from a proof-of-concept to a petabyte-scale production platform. It is written for data architects, senior data engineers, ML engineers, and technical leaders making platform decisions for organizations processing millions to trillions of records daily.
Key Findings
| Dimension | Traditional Architecture | Databricks Lakehouse (Measured Outcomes) |
| Storage cost per TB/month | $180-$450 (data warehouse) | $23-$45 (ADLS Gen2 + Delta) |
| Batch-to-serving latency | 4-24 hours (nightly ETL) | < 5 minutes (Delta Live Tables streaming) |
| ML model deployment time | 2-8 weeks (manual MLOps) | < 4 hours (MLflow + Model Serving) |
| Data pipeline reliability | 60-75% on-time SLA (complex DAGs) | 95%+ with DLT expectations & auto-retry |
| Governance coverage | Per-tool ACLs; no unified lineage | Unity Catalog: column-level lineage, ABAC, row filters |
| Compute cost (analytics) | Fixed warehouse capacity | 40-70% reduction via auto-scaling + spot instances |
| Time to insight (ad hoc) | Hours (provisioning + query queue) | Minutes (serverless SQL warehouse, instant start) |
| Strategic Insight: Databricks and Microsoft Fabric are not mutually exclusive. The optimal enterprise architecture in 2025 uses both: Databricks for heavy data engineering, ML training, and streaming at scale; Microsoft Fabric for business-facing analytics, Power BI integration, and self-service BI. Unity Catalog bridges both platforms through its Fabric connector, enabling shared governance across the full data estate. |
CHAPTER 1
Why Databricks on Azure — Strategic Positioning
1.1 The Lakehouse Market in 2025
The lakehouse market has reached a tipping point. Snowflake, Google BigQuery, and Microsoft Fabric all offer lakehouse-compatible architectures, but Databricks retains three durable advantages that make it the preferred platform for engineering-heavy, ML-intensive, or high-volume data workloads:
Apache Spark ownership: Databricks employs many of the core Spark committers. Every Databricks Runtime release ships Spark optimizations that are 1-2 years ahead of open-source availability. For workloads processing terabytes or petabytes per job, this translates to 2-5x faster execution than commodity Spark environments.
Delta Lake protocol maturity: Delta Lake — created by Databricks — is the most widely adopted open table format, with 30,000+ production deployments. Delta Universal Format (UniForm) enables interoperability with Apache Iceberg and Apache Hudi readers, future-proofing storage investments against format fragmentation.
Unified ML platform: MLflow (also created by Databricks) is the de facto standard for ML experiment tracking and model registry, with integrations in Azure ML, SageMaker, Google Vertex AI, and every major ML framework. No other lakehouse platform matches Databricks’ native integration between data engineering and ML workflows.
1.2 Databricks vs. Microsoft Fabric: Decision Framework
The most common architectural question in the Microsoft ecosystem: when to use Databricks, when to use Fabric, and when to use both. The answer is workload-driven:
| Workload Characteristic | Recommended Platform | Rationale |
| Large-scale Spark ETL (>10TB/day) | Databricks | Databricks Runtime 2-5x faster than open-source Spark; better autoscaling for burst workloads |
| Streaming ingestion at high volume (>100K events/sec) | Databricks (Structured Streaming) | Native Kafka connector, exactly-once semantics, DLT auto-scaling for variable throughput |
| ML model training (>1M parameter models) | Databricks | Distributed training (Horovod, Ray), GPU cluster management, native MLflow |
| Business analytics & self-service BI | Microsoft Fabric / Power BI | DirectLake mode, Fabric semantic model, M365 integration, non-technical user experience |
| SQL analytics on small-medium datasets (<1TB) | Microsoft Fabric (Warehouse) | Easier SQL authoring, better Power BI integration, lower cost for SQL-only workloads |
| Enterprise data catalog & governance | Microsoft Purview + Unity Catalog | Purview for estate-wide governance; Unity Catalog for Databricks-specific fine-grained control |
| Mixed workload (ETL + analytics + ML) | Databricks + Fabric (integrated) | Databricks processes & governs; Fabric serves; Unity Catalog bridges via Fabric connector |
1.3 Azure-Native Integration Advantages
Deploying Databricks on Azure (Azure Databricks) provides integration benefits that are unavailable on other cloud providers:
- Microsoft Entra ID (formerly Azure Active Directory) as the identity provider — no separate Databricks user management; SSO and conditional access policies apply automatically
- Azure Private Link for Databricks control plane and data plane — full network isolation without public internet exposure
- Unity Catalog integration with Microsoft Purview — bidirectional metadata sync for enterprise-wide governance
- Native connectors to Azure Event Hubs (Kafka protocol), Azure Blob Storage, ADLS Gen2, Azure SQL, Cosmos DB, Azure Synapse — no third-party connector licensing
- Azure Monitor + Log Analytics for unified observability — Databricks cluster logs, job metrics, and SQL warehouse query history in the same observability stack as all Azure resources
- Microsoft Fabric shortcut to Databricks-managed Delta tables — Fabric reports can query Databricks Lakehouse tables directly without data movement
| Cost Reality Check: Azure Databricks pricing combines DBU (Databricks Unit) consumption with underlying Azure VM costs. A common planning mistake is budgeting only DBU costs. Total cost = DBU cost + VM cost + storage cost + networking cost. In practice, VM cost represents 40-60% of total Databricks spend. This white paper’s cost optimization chapter addresses all four components. |
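The four-component budgeting rule in the callout above can be sketched as a small model. This is illustrative only: every rate below is a placeholder assumption, not a published Azure or Databricks price.

```python
def monthly_databricks_cost(dbu_per_hour: float, dbu_rate: float,
                            vm_per_hour: float, hours: float,
                            storage_tb: float, storage_rate_tb: float,
                            egress_gb: float, egress_rate_gb: float) -> dict:
    """Break a monthly Databricks budget into its four cost components:
    total = DBU cost + VM cost + storage cost + networking cost."""
    costs = {
        "dbu": dbu_per_hour * dbu_rate * hours,
        "vm": vm_per_hour * hours,
        "storage": storage_tb * storage_rate_tb,
        "network": egress_gb * egress_rate_gb,
    }
    costs["total"] = sum(costs.values())
    costs["vm_share"] = costs["vm"] / costs["total"]
    return costs

# Hypothetical example: an 8 DBU/hr job cluster running 400 hours/month
c = monthly_databricks_cost(dbu_per_hour=8, dbu_rate=0.30,
                            vm_per_hour=2.40, hours=400,
                            storage_tb=50, storage_rate_tb=23,
                            egress_gb=1000, egress_rate_gb=0.05)
```

With these placeholder rates, VM cost is roughly 31% of total spend, consistent with the 40-60% range observed in practice once larger VM SKUs are involved.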
CHAPTER 2
Lakehouse Architecture: Foundations & Design Patterns
2.1 The Medallion Architecture
The Medallion Architecture — Bronze, Silver, Gold — is the de facto standard for organizing data in a Databricks Lakehouse. Its value is not the three-tier label but the quality contract each tier enforces, the decoupling it provides between ingestion and consumption, and the lineage clarity it creates for governance teams.
| Layer | Data State | Quality Contract | Consumers | Storage Format |
| Bronze (Raw) | Exact copy of source data; no transformations; append-only | Schema-on-read; no business rules applied; complete audit trail of source data | Data engineers, audit/compliance | Delta (with source schema preserved); partitioned by ingestion date |
| Silver (Curated) | Cleaned, deduplicated, conformed; business rules applied; joins across source systems | Schema enforced; data quality rules pass; SLA-driven freshness; no PII in open columns | Data engineers, data scientists, advanced analysts | Delta; partitioned by business date; Z-ordered on high-cardinality query columns |
| Gold (Certified) | Business-ready aggregations, dimensional models, feature stores; domain-specific semantics | Certified by domain owner; SLA guaranteed; documentation complete; sensitivity labelled | Analysts, Power BI / Fabric, ML feature pipelines, APIs | Delta; highly optimized; partitioned and Z-ordered for query patterns |
Extended Medallion: Platinum Layer
High-maturity organizations add a Platinum layer: ML feature store output, aggregated serving tables with sub-second query SLAs, or API-serving materialized views. Platinum tables are the most heavily governed, most tightly SLA-bound, and most aggressively optimized assets in the lakehouse.
2.2 Storage Organization Patterns
The physical organization of the lakehouse in ADLS Gen2 has major implications for performance, security, cost, and governance. Two primary patterns:
Pattern 1 — Zone-Isolated Storage Accounts (Recommended for Enterprise)
Separate ADLS Gen2 storage accounts per medallion layer: adls-bronze-prod, adls-silver-prod, adls-gold-prod. Advantages: independent encryption keys per zone, network security group rules per zone, independent soft-delete and versioning policies, clear billing separation. Disadvantage: slightly more complex pipeline configuration.
Pattern 2 — Container-Isolated Single Account
Single ADLS Gen2 account with containers per layer: /bronze, /silver, /gold. Advantages: simpler configuration, single place for RBAC. Disadvantage: shared encryption key, harder to enforce zone-level network policies, difficult to separate storage costs by layer.
| Recommendation: Use zone-isolated storage accounts for any deployment handling regulated data (PCI, PHI, PII at scale). The per-key encryption enables different key rotation policies per zone and makes regulatory scoping unambiguous. For development environments, a single account with containers is sufficient. |
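The two layouts differ only in how the abfss:// path is assembled. A minimal sketch, assuming hypothetical account-naming conventions (adls<layer><env> and adlslake<env> are invented for illustration):

```python
def table_path(layer: str, table: str, zone_isolated: bool = True,
               env: str = "prod") -> str:
    """Build an ADLS Gen2 path for a lakehouse table under either layout."""
    if zone_isolated:
        # Pattern 1: one storage account per medallion layer
        account = f"adls{layer}{env}"   # hypothetical naming convention
        container = "data"
        return f"abfss://{container}@{account}.dfs.core.windows.net/{table}/"
    # Pattern 2: single account, one container per layer
    account = f"adlslake{env}"
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{table}/"
```

Centralizing this logic in one helper (or in pipeline configuration) keeps a later migration between patterns a one-line change rather than a find-and-replace across notebooks.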
2.3 Workspace Architecture for Enterprise
A Databricks workspace is the primary deployment unit — the boundary for compute resources, notebooks, jobs, and most RBAC. Enterprise deployments require deliberate workspace strategy:
| Workspace Pattern | Use Case | Advantages | Disadvantages |
| Single workspace | Small teams (<20 engineers), single project, low sensitivity | Simple administration, low overhead | No environment isolation; production/dev share resources and governance |
| Environment-separated (dev/test/prod) | Most enterprise deployments | Environment isolation; CI/CD promotion; prod cluster policies enforced separately | Data sharing between envs requires Delta Sharing or data copies |
| Domain-separated (Data Mesh) | Large orgs with multiple data domains | Domain autonomy; per-domain billing; independent upgrade cycles | Unity Catalog metastore still shared; requires cross-workspace governance coordination |
| Hub-and-spoke (Recommended) | Enterprise with shared platform team + domain teams | Shared infra (hub) + domain autonomy (spokes); central governance + local engineering | Most complex to set up; requires strong platform team |
2.4 Cluster Architecture
Cluster selection is the single biggest driver of both performance and cost. Databricks offers three compute types with fundamentally different operational models:
All-Purpose Clusters: Interactive, long-lived clusters for notebook development and ad hoc analysis. Charged continuously while running. Never use all-purpose clusters for production jobs — auto-termination misconfiguration is the single most common source of unexpected Databricks cost overruns. Enforce auto-termination of 30 minutes via cluster policy.
Job Clusters: Ephemeral clusters created at job start and terminated at job end. The correct compute type for all production jobs. Cost = compute time only. Cluster start time (2-5 minutes) is the only overhead. Use cluster pools to pre-warm instances and reduce start latency to 30-60 seconds.
SQL Warehouses (Serverless preferred): Purpose-built for SQL analytics workloads. Classic SQL Warehouses run on customer-managed VMs. Serverless SQL Warehouses run on Databricks-managed compute and start in < 5 seconds. Use serverless for all interactive BI and analyst SQL workloads. Classic warehouses only when data residency requirements prohibit serverless.
| Cluster Type | Start Time | Cost Model | Best For | Avoid For |
| All-Purpose | 3-7 min | Per-second while running | Interactive development, notebooks | Production jobs (cost risk) |
| Job Cluster | 2-5 min (cold) | Per-second for job duration only | All production batch/streaming jobs | Interactive exploration |
| Job Cluster + Pool | 30-60 sec (warm) | Pool idle cost + per-second for job | Latency-sensitive production jobs | Infrequent jobs (pool idle cost not worth it) |
| Serverless SQL Warehouse | < 5 sec | Per-second, serverless pricing | BI analytics, analyst SQL, Power BI DirectQuery | Large Spark batch processing |
| Classic SQL Warehouse | 2-4 min | Per-second (customer VM) | Data residency-constrained SQL analytics | General use (serverless preferred) |
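The cost risk of all-purpose clusters in the table above comes down to billable hours. A simplified sketch, assuming a 30-day month and ignoring per-cluster-type DBU rate differences:

```python
def monthly_compute_hours(job_hours_per_day: float, always_on: bool,
                          auto_term_minutes: float = 0.0,
                          runs_per_day: int = 1) -> float:
    """Billable hours per month: an always-on all-purpose cluster bills
    around the clock; a job cluster bills only job duration plus any idle
    time before auto-termination."""
    if always_on:
        return 24.0 * 30
    idle_hours = (auto_term_minutes / 60.0) * runs_per_day
    return (job_hours_per_day + idle_hours) * 30

# A 1-hour daily job left on an always-running all-purpose cluster
always_on_hours = monthly_compute_hours(1.0, always_on=True)
# The same job on an ephemeral job cluster
job_cluster_hours = monthly_compute_hours(1.0, always_on=False)
```

Even with a 30-minute auto-termination window per run, the job cluster bills 45 hours/month against 720 for an always-on cluster, which is why cluster policies enforcing job clusters for production pay for themselves immediately.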
CHAPTER 3
Delta Lake: The Storage Layer Deep Dive
3.1 Delta Lake Architecture
Delta Lake is not merely a file format — it is a transactional storage layer that adds ACID guarantees, scalable metadata management, and time travel to Parquet files stored in cloud object storage. Understanding its internals is essential for designing high-performance, cost-efficient lakehouse pipelines.
Every Delta table consists of: a data directory (Parquet files organized by partition), a transaction log directory (_delta_log/) containing JSON commit files, and periodically generated Parquet checkpoint files that compact the JSON transaction log for fast metadata reads.
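The transaction log mechanism can be made concrete with a toy replay: a Delta reader reconstructs the current file set by applying each commit's add/remove actions in version order. This is a simplified sketch of the protocol (real commits carry many more fields), using invented file names.

```python
import json

def active_files(commits: list[str]) -> set[str]:
    """Replay _delta_log commits (one JSON-lines string per version)
    and return the set of Parquet files in the current snapshot."""
    files: set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Version 0: initial write of two files
v0 = '{"add": {"path": "part-000.parquet"}}\n{"add": {"path": "part-001.parquet"}}'
# Version 1: a rewrite replaces part-000 with part-002
v1 = '{"remove": {"path": "part-000.parquet"}}\n{"add": {"path": "part-002.parquet"}}'
```

Time travel falls out of this design for free: reading "VERSION AS OF 0" simply stops the replay after v0, which is why old data files must be retained for the retention window.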
| Delta Capability | Technical Mechanism | Operational Significance |
| ACID transactions | Optimistic concurrency control via transaction log; writers serialize via log entry; readers see consistent snapshots | Multiple concurrent writers without data corruption; safe concurrent reads during writes |
| Time travel | Transaction log preserves all historical versions; TIMESTAMP AS OF or VERSION AS OF syntax | Audit compliance; rollback from bad pipeline runs; reproducible ML training datasets |
| Schema enforcement | Writer validates incoming data against table schema; rejects non-conforming writes by default | Prevents silent data corruption from schema drift in upstream sources |
| Schema evolution | MERGE INTO, ALTER TABLE ADD COLUMN, or mergeSchema option | Safely add new columns without breaking existing readers |
| Scalable metadata | Parquet checkpoints compact JSON log files; column stats stored in log for data skipping | Tables with billions of files read metadata in seconds, not minutes |
| Data skipping | Min/max column statistics per file stored in transaction log; queries skip files outside predicate range | 10-100x query speedup on filtered queries without full table scan |
| Z-ordering | Co-locates related data in same files based on specified columns; improves data skipping effectiveness | Especially effective for high-cardinality columns (customer_id, product_id) used in WHERE clauses |
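Data skipping, the mechanism behind the 10-100x speedups in the table above, is conceptually simple: the reader compares each file's min/max statistics against the query predicate and scans only files whose range could contain a match. A minimal sketch with invented statistics:

```python
def files_to_scan(file_stats: dict[str, tuple], value) -> list[str]:
    """file_stats maps file name -> (min, max) for the filtered column.
    Return only the files whose [min, max] range can contain `value`."""
    return [f for f, (lo, hi) in file_stats.items() if lo <= value <= hi]

# Hypothetical per-file stats for a customer_id column
stats = {
    "part-000.parquet": (1, 100),
    "part-001.parquet": (101, 200),
    "part-002.parquet": (201, 300),
}
```

Z-ordering improves this pruning by narrowing each file's min/max ranges: when related values are co-located, fewer files have ranges that straddle any given predicate value.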
3.2 Delta Table Optimization: Production Patterns
OPTIMIZE and Z-ORDER
-- Run after bulk loads or when file count exceeds ~1000 per partition
-- Schedule as a post-load job, not inline in the pipeline

-- Basic OPTIMIZE (compacts small files toward a 1 GB target file size)
OPTIMIZE silver.sales_transactions;

-- OPTIMIZE with Z-ORDER (co-locates data by query columns)
-- Z-ORDER columns: choose 1-4 columns used in WHERE/JOIN predicates
-- Do NOT Z-ORDER on partition columns (already co-located by partition)
-- Do NOT Z-ORDER on columns with low cardinality (<100 distinct values)
OPTIMIZE silver.sales_transactions
ZORDER BY (customer_id, product_id);

-- Predictive Optimization: let Databricks decide when to OPTIMIZE
-- Enable at the catalog or schema level (Unity Catalog managed tables required):
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
-- Predictive Optimization monitors file stats and triggers OPTIMIZE automatically,
-- eliminating the need for scheduled OPTIMIZE jobs in most cases
Partitioning Strategy
-- Partitioning rules:
-- Partition only on LOW cardinality columns (date, region, status)
-- Target partition size: 1GB - 10GB of data per partition
-- NEVER partition on high-cardinality columns (customer_id, order_id)
-- High-cardinality partitioning = millions of tiny files = metadata explosion

-- Good: partition by date (daily batches ~1-10GB per day)
CREATE TABLE silver.sales_transactions (
  transaction_id STRING,
  customer_id STRING,
  product_id STRING,
  amount DECIMAL(18,2),
  transaction_ts TIMESTAMP,
  transaction_dt DATE
) USING DELTA
PARTITIONED BY (transaction_dt)
LOCATION "abfss://silver@storage.dfs.core.windows.net/sales/transactions/";

-- After partitioning by date, Z-ORDER by customer_id and product_id
-- to optimize point lookups within each date partition
OPTIMIZE silver.sales_transactions
WHERE transaction_dt >= current_date() - INTERVAL 7 DAYS
ZORDER BY (customer_id, product_id);
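The partitioning rules above are mechanical enough to encode as a pre-deployment check. A sketch, with the 100K-partition ceiling chosen as an illustrative threshold rather than a documented Databricks limit:

```python
def partition_check(distinct_values: int, total_gb: float,
                    max_partitions: int = 100_000) -> tuple[bool, str]:
    """Validate a candidate partition column against the rules:
    low cardinality, and roughly 1-10 GB of data per partition."""
    if distinct_values > max_partitions:
        return False, "cardinality too high: metadata explosion risk"
    gb_per_partition = total_gb / distinct_values
    if gb_per_partition < 1:
        return False, "partitions too small: small-file problem"
    if gb_per_partition > 10:
        return False, "partitions too large: use a finer grain or skip partitioning"
    return True, "ok"

# A year of daily partitions over ~1 TB: ~2.7 GB per partition, acceptable
daily_ok = partition_check(distinct_values=365, total_gb=1000)
# customer_id with 10M distinct values: rejected outright
customer_bad = partition_check(distinct_values=10_000_000, total_gb=1000)
```

Running a check like this in CI against table DDL catches the customer_id-partitioning mistake before it produces millions of tiny files in production.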
3.3 Change Data Capture with Delta
Delta’s Change Data Feed (CDF) is the recommended pattern for incremental processing pipelines that need to propagate changes downstream without full table scans. CDF captures insert, update, and delete operations at the row level with change type metadata.
-- Enable Change Data Feed on a Delta table
ALTER TABLE silver.customers
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

// Read changes since a specific version (Scala)
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, max}

val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", lastProcessedVersion) // stored in a metadata table
  .table("silver.customers")

// _change_type values: insert, update_preimage, update_postimage, delete
// Process only inserts and updates (ignore preimage rows)
val upserts = changes.filter("_change_type IN ('insert', 'update_postimage')")

// Apply to the downstream gold table using MERGE
val deltaTable = DeltaTable.forName(spark, "gold.customers")
deltaTable.alias("target")
  .merge(upserts.alias("source"), "target.customer_id = source.customer_id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute()

// Store the last processed version for idempotency
val newVersion = changes.select(max(col("_commit_version"))).collect()(0)(0)
spark.sql(s"UPDATE pipeline_metadata.watermarks SET last_version = $newVersion WHERE table_name = 'silver.customers'")
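The watermark logic at the end of the pipeline is what makes it idempotent, and it is worth isolating so it can be unit-tested outside Spark. A language-agnostic sketch of the pattern (the table and field names mirror the example above but the class itself is invented):

```python
class Watermarks:
    """In-memory stand-in for the pipeline_metadata.watermarks table."""

    def __init__(self):
        self._last: dict[str, int] = {}

    def last_version(self, table: str) -> int:
        return self._last.get(table, -1)

    def process_changes(self, table: str, changes: list[dict]) -> list[dict]:
        """Return only rows committed after the stored watermark,
        then advance the watermark. Replaying the same feed is a no-op."""
        start = self.last_version(table) + 1
        new = [c for c in changes if c["_commit_version"] >= start]
        if new:
            self._last[table] = max(c["_commit_version"] for c in new)
        return new

wm = Watermarks()
feed = [{"_commit_version": v} for v in range(3)]
first_run = wm.process_changes("silver.customers", feed)
second_run = wm.process_changes("silver.customers", feed)  # replay: nothing new
```

In production the watermark write should share a transaction with the MERGE (or be written after it) so a failure between the two steps re-processes rather than skips changes; MERGE makes the re-processing safe.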
3.4 Time Travel & Data Retention
| Operation | Syntax | Use Case | Retention Constraint |
| Query historical version | SELECT * FROM table VERSION AS OF 42 | Debug pipeline issues; compare before/after transformation | Only works within retention window |
| Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2025-01-15' | Reproduce ML training dataset as of a specific date | Timestamp must be within retention window |
| Restore table | RESTORE TABLE silver.customers TO VERSION AS OF 38 | Rollback bad pipeline run or accidental delete | Restores table state; does not delete newer log entries |
| View table history | DESCRIBE HISTORY silver.customers | Audit all operations; find version before bad write | Shows full transaction log within retention window |
Default log retention (delta.logRetentionDuration) is 30 days; deleted data files are retained for 7 days (delta.deletedFileRetentionDuration) before VACUUM may remove them. For regulatory environments requiring longer audit trails, set: ALTER TABLE silver.customers SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 365 days', 'delta.deletedFileRetentionDuration' = 'interval 365 days');
CHAPTER 4
Unity Catalog: Enterprise Data Governance
4.1 Unity Catalog Architecture
Unity Catalog (UC) is Databricks’ unified governance layer — the control plane for data access, audit logging, lineage tracking, and data sharing across all workspaces in an organization. It is the most significant governance advancement in Databricks’ history, replacing the legacy per-workspace Hive metastore with a centrally managed, ANSI-SQL-compatible catalog with fine-grained access control.
The UC object hierarchy determines how permissions propagate and how assets are organized:
| Object Level | Example | Grants Propagate To | Primary Governance Action |
| Metastore | Contoso-AUED-Metastore (one per Azure region) | All catalogs, schemas, tables below | Assign metastore admins; attach to workspaces; configure audit log destination |
| Catalog | main, dev_catalog, finance_catalog | All schemas and tables within catalog | Assign catalog owners; set default privileges; configure storage credentials |
| Schema (Database) | main.silver, finance_catalog.reporting | All tables, views, functions within schema | Assign schema owners; set managed location; grant domain team access |
| Table / View | main.silver.sales_transactions | Columns (if column mask applied) | Assign owners; grant SELECT/MODIFY; apply row filters; apply column masks |
| Column | sales_transactions.customer_email | N/A (leaf node) | Apply column masks (e.g., mask PII for non-privileged users) |
| Function | main.silver.calculate_ltv() | N/A | Control which users can call UDFs and table-valued functions |
| Storage Credential | adls-silver-credential | External locations using this credential | Grant to specific workspaces or service principals; never grant to end users |
4.2 Privilege Model & RBAC Design
Unity Catalog uses a privilege model inspired by ANSI SQL GRANT/REVOKE but extended for cloud-native data access patterns. Understanding privilege inheritance prevents common over-permissioning mistakes.
Privilege Inheritance Rules
- Privileges granted at a higher level (catalog) are inherited by all objects below (schemas, tables, columns)
- DENY is not supported — UC uses positive permissions only; to restrict access, do not grant at that level
- Table owner automatically has all privileges on that table; schema owner has all privileges on all tables in the schema
- USE CATALOG and USE SCHEMA are “traversal” privileges — required to access any object within, but do not grant read access to data
- SELECT on a table requires: USE CATALOG on the parent catalog AND USE SCHEMA on the parent schema AND SELECT on the table — all three must be explicitly granted
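The three-grant rule in the last bullet is a frequent source of "permission denied" tickets, so it is worth stating as code. A minimal sketch of the check (the grant-string encoding is invented for illustration; Unity Catalog stores grants in its metastore):

```python
def can_select(grants: set[str], catalog: str, schema: str, table: str) -> bool:
    """SELECT succeeds only when all three grants are present;
    none of them is implied by another."""
    required = {
        f"USE CATALOG:{catalog}",
        f"USE SCHEMA:{catalog}.{schema}",
        f"SELECT:{catalog}.{schema}.{table}",
    }
    return required <= grants

# An analyst granted access to main.gold.sales only
grants = {"USE CATALOG:main", "USE SCHEMA:main.gold", "SELECT:main.gold.sales"}
```

Note that holding SELECT on a silver table does nothing without USE SCHEMA on main.silver, which is exactly how the tiered RBAC pattern below keeps analysts out of silver and bronze.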
Recommended RBAC Pattern
-- Unity Catalog RBAC: Principle of Least Privilege
-- Use Entra ID (AAD) groups, not individual users

-- Tier 1: Read-only analysts (most common role)
GRANT USE CATALOG ON CATALOG main TO `data-analysts@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-analysts@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.gold TO `data-analysts@contoso.com`;
-- Note: analysts CAN see gold data but cannot see silver or bronze

-- Tier 2: Data engineers (read silver/bronze, write gold)
GRANT USE CATALOG ON CATALOG main TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT SELECT, MODIFY ON ALL TABLES IN SCHEMA main.gold TO `data-engineers@contoso.com`;

-- Tier 3: Service principals for production pipelines
-- Use service principals (not groups) for job clusters
GRANT USE CATALOG, USE SCHEMA ON CATALOG main TO `sp-pipeline-prod`;
GRANT SELECT ON SCHEMA main.silver TO `sp-pipeline-prod`;
GRANT SELECT, MODIFY ON SCHEMA main.gold TO `sp-pipeline-prod`;
-- Principle: pipelines have exactly the permissions needed for their specific tables
4.3 Column Masks & Row Filters
Column masks and row filters are the most powerful and underutilized Unity Catalog features. They implement attribute-based access control (ABAC) at the data level — the same data asset serves different data consumers with different visibility based on their group membership or attributes, without duplicating data or maintaining separate views.
-- Column Mask: Mask email addresses for non-privileged users
-- Users in "data-scientists-privileged" group see the real email
-- All other users see the masked version
CREATE FUNCTION main.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('data-scientists-privileged') THEN email
  ELSE CONCAT(LEFT(email, 2), '***@***.com')
END;

ALTER TABLE main.silver.customers
ALTER COLUMN customer_email
SET MASK main.security.mask_email;

-- Row Filter: Data scientists only see rows for their assigned region
-- Useful for multi-tenant datasets, regional data residency, or cost centre allocation
CREATE FUNCTION main.security.filter_by_region(region STRING)
RETURNS BOOLEAN
RETURN is_member('global-data-access')        -- global team sees all
    OR is_member(CONCAT('region-', region));  -- regional team sees their region

ALTER TABLE main.silver.sales_transactions
SET ROW FILTER main.security.filter_by_region ON (region);

-- Verify mask/filter is working (run as non-privileged user):
SELECT customer_id, customer_email FROM main.silver.customers LIMIT 5;
-- Should return: "jo***@***.com" style masked emails
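The masking logic itself is simple enough to unit-test outside the workspace before wiring it into a UDF. A Python equivalent of mask_email, with group membership passed explicitly since is_member() only exists inside Databricks:

```python
def mask_email(email: str, privileged: bool) -> str:
    """Mirror of the SQL mask: privileged users see the real address,
    everyone else sees the first two characters plus a fixed suffix."""
    if privileged:
        return email
    return email[:2] + "***@***.com"
```

Keeping a plain-language twin of each masking UDF makes it cheap to assert, in CI, that a format change to the mask does not accidentally leak more of the address.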
4.4 Automated Lineage & Data Discovery
Unity Catalog automatically captures lineage for all operations executed through Databricks — SQL queries, Spark DataFrame transformations, Python, R, and Scala notebooks — without any instrumentation code. Lineage is captured at the column level and stored in the UC metastore, queryable via the Databricks REST API or viewable in the Catalog Explorer UI.
Lineage from Unity Catalog is also exported to Microsoft Purview via the native connector, creating a unified lineage graph that spans Databricks (engineering/ML) and the broader Microsoft data estate (Fabric, ADF, Power BI). This is the recommended governance architecture for organizations using both platforms.
| Lineage Source | Granularity | Capture Method | Purview Export |
| Spark SQL (notebooks, jobs) | Column-level | Automatic (UC query analysis) | Yes, via Purview-UC connector |
| Python DataFrame API | Table-level | Automatic (I/O operation capture) | Yes |
| Delta Live Tables pipelines | Column-level (full DAG) | Automatic (DLT execution graph) | Yes |
| ML training runs (MLflow) | Dataset-to-model | Automatic (MLflow UC integration) | Yes (model entity in Purview) |
| dbt on Databricks | Column-level (dbt metadata) | Via dbt-databricks adapter + UC integration | Yes (via dbt-purview connector) |
CHAPTER 5
Delta Live Tables: Declarative Pipeline Engineering
5.1 DLT Architecture & Value Proposition
Delta Live Tables (DLT) is Databricks’ declarative pipeline framework that manages the full lifecycle of data pipelines: execution ordering, dependency resolution, error handling, retry logic, data quality enforcement, autoscaling, and monitoring. Instead of writing orchestration logic, engineers declare what data should look like — DLT determines how to produce it.
The productivity gain is substantial. A pipeline that required 800 lines of Spark code with custom error handling, dependency management, and retry logic can be expressed in 150 lines of DLT SQL or Python. More importantly, the operational properties (idempotency, exactly-once semantics, automatic backfill) are guaranteed by the framework, not the engineer.
5.2 DLT Pipeline Design Patterns
Pattern 1 — Full Medallion DLT Pipeline (SQL)
-- Bronze: Ingest raw events from Event Hubs (streaming)
CREATE OR REFRESH STREAMING LIVE TABLE bronze_events
COMMENT "Raw events from IoT sensors via Event Hubs. Schema preserved exactly."
AS SELECT * FROM cloud_files(
  "abfss://landing@storage.dfs.core.windows.net/events/",
  "json",
  map("cloudFiles.inferColumnTypes", "true",
      "cloudFiles.schemaLocation", "/checkpoints/bronze_events/schema")
);
-- Silver: Clean, validate, and enrich (streaming)
-- Note: DLT expectations support DROP ROW and FAIL UPDATE; quarantining is a
-- pattern built with a second table whose expectation inverts the rule
CREATE OR REFRESH STREAMING LIVE TABLE silver_events (
  CONSTRAINT valid_device_id EXPECT (device_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_temperature EXPECT (temperature BETWEEN -50 AND 150) ON VIOLATION DROP ROW,
  CONSTRAINT valid_timestamp EXPECT (event_ts > '2020-01-01') ON VIOLATION FAIL UPDATE
)
COMMENT "Validated and enriched sensor events. Dropped rows surface in DLT quality metrics."
AS SELECT
  e.device_id,
  e.temperature,
  e.humidity,
  e.event_ts,
  d.device_name,
  d.location,
  d.manufacturer
FROM STREAM(LIVE.bronze_events) e
LEFT JOIN LIVE.dim_devices d ON e.device_id = d.device_id;
-- Gold: Business aggregation (batch, refreshed every 15 minutes)
CREATE OR REFRESH LIVE TABLE gold_device_hourly_stats
COMMENT "Hourly aggregated stats per device. Power BI source."
AS SELECT
  device_id,
  location,
  date_trunc('HOUR', event_ts) AS hour,
  avg(temperature) AS avg_temp,
  max(temperature) AS max_temp,
  min(temperature) AS min_temp,
  avg(humidity) AS avg_humidity,
  count(*) AS event_count
FROM LIVE.silver_events
GROUP BY device_id, location, date_trunc('HOUR', event_ts);
5.3 Data Quality with Expectations
DLT Expectations are the single most important feature for production pipeline reliability. They formalize data quality rules as executable constraints, replacing ad hoc validation notebooks and undocumented “it looks wrong” discoveries.
| Expectation Violation Mode | Behaviour | Use When | Monitoring |
| ON VIOLATION DROP ROW | Invalid rows are silently dropped; valid rows continue processing | Noise in source data is expected and acceptable; business logic tolerates missing records | Metrics track dropped row count; alert if drop rate exceeds threshold |
| Quarantine (pattern, not a built-in mode) | Invalid rows routed to a separate quarantine table defined with the inverted expectation; valid rows continue in the main flow | Invalid rows need investigation without blocking the pipeline; compliance requires an audit of rejected data | Query the quarantine table; alert on non-zero counts for critical constraints |
| ON VIOLATION FAIL UPDATE | Pipeline fails if constraint violated; no partial writes | Zero tolerance for invalid data; downstream consumers cannot handle bad data | Pipeline failure triggers alert; requires manual investigation before rerun |
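The three handling strategies can be simulated in a few lines, which is useful for reasoning about what downstream consumers will and will not see. A sketch with invented sensor rows (in DLT itself, drop and fail are declared modes and quarantine is the inverted-expectation pattern):

```python
def apply_expectation(rows, predicate, mode):
    """Split rows by a quality predicate and apply one of three strategies:
    DROP discards invalid rows, QUARANTINE routes them aside, FAIL aborts."""
    valid = [r for r in rows if predicate(r)]
    invalid = [r for r in rows if not predicate(r)]
    if mode == "FAIL" and invalid:
        raise ValueError(f"{len(invalid)} rows violated the constraint")
    if mode == "DROP":
        return valid, []          # invalid rows discarded; count them in metrics
    return valid, invalid         # QUARANTINE: invalid rows go to a side table

rows = [{"temp": 20}, {"temp": 500}]
in_range = lambda r: -50 <= r["temp"] <= 150

kept, quarantined = apply_expectation(rows, in_range, "QUARANTINE")
dropped_kept, dropped_aside = apply_expectation(rows, in_range, "DROP")
```

The key operational difference is observability: DROP requires metric-based alerting to notice data loss, QUARANTINE preserves the rejected rows for audit, and FAIL trades availability for a hard correctness guarantee.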
5.4 DLT Enhanced Autoscaling
DLT Enhanced Autoscaling is the recommended compute mode for streaming pipelines with variable throughput. Unlike standard Databricks autoscaling (which scales based on task queue depth), DLT Enhanced Autoscaling uses pipeline-level metrics — lag behind source, processing rate, memory pressure — to make scaling decisions optimized for streaming throughput.
// DLT pipeline configuration (JSON, with illustrative comments) for Enhanced Autoscaling
{
  "name": "iot-sensor-pipeline",
  "clusters": [{
    "label": "default",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 20,
      "mode": "ENHANCED"  // Key: use ENHANCED, not LEGACY
    },
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.databricks.delta.optimizeWrite.enabled": "true",
      "spark.databricks.delta.autoCompact.enabled": "true"
    }
  }],
  "continuous": true,     // For streaming: always-on pipeline
  "development": false,   // Production mode: no interactive restarts
  "channel": "CURRENT",   // Production: always use CURRENT, not PREVIEW
  "edition": "ADVANCED"   // Required for Enhanced Autoscaling + expectations
}
CHAPTER 6
Machine Learning at Scale: MLflow & Model Serving
6.1 The Databricks ML Platform
Databricks’ ML platform integrates three components into a unified workflow: the Databricks Runtime for ML (pre-installed with TensorFlow, PyTorch, scikit-learn, XGBoost, and 50+ ML libraries), MLflow for experiment tracking and model lifecycle management, and Model Serving for production model deployment. Together, they reduce the time from experiment to production from weeks to hours.
6.2 MLflow: Experiment Tracking & Model Registry
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Set experiment (creates if not exists; tracked in Unity Catalog)
mlflow.set_experiment("/Shared/churn-prediction/experiments/gbm-v2")

with mlflow.start_run(run_name="gbm-depth6-lr0.05") as run:
    # Log hyperparameters
    params = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05}
    mlflow.log_params(params)

    # Train model
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
    mlflow.log_metric("train_auc", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))

    # Log model with input example and signature (required for Model Serving)
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model, "model",
        signature=signature,
        input_example=X_train[:5],
        registered_model_name="main.ml_models.churn_prediction"  # UC-registered
    )

    print(f"Run ID: {run.info.run_id} | AUC: {auc:.4f}")
6.3 Model Registry & Deployment Lifecycle
| Stage | Registry Status | Who Transitions | Automated Gates |
| Development | None (experiment runs) | Data Scientist | N/A — free experimentation |
| Candidate | Registered (no alias) | Data Scientist | Auto: unit tests, schema validation, bias metrics check |
| Staging | @staging alias | ML Engineer | Auto: integration test, shadow mode evaluation against production traffic, data drift check |
| Production | @production alias | ML Lead (approval gate) | Auto: A/B test gate (>2% lift required), latency SLA check (<100ms p99), manual sign-off |
| Archived | @archived alias | Automated (on new production) | Retain for 90 days for rollback capability; auto-delete after retention window |
6.4 Model Serving: Production Deployment
Databricks Model Serving provides serverless, auto-scaling REST endpoints for ML models registered in Unity Catalog. The endpoint handles traffic routing, A/B testing, canary deployments, and feature lookup — eliminating the need for a separate model serving infrastructure.
# Deploy model to Databricks Model Serving via REST API
# (also configurable via UI or Terraform)
import requests

token = dbutils.secrets.get(scope="ml-secrets", key="databricks-token")
workspace_url = "https://adb-<workspace-id>.azuredatabricks.net"

endpoint_config = {
    "name": "churn-prediction-v2",
    "config": {
        "served_models": [
            {
                "name": "churn-gbm-production",
                "model_name": "main.ml_models.churn_prediction",
                "model_version": "3",
                "workload_size": "Small",       # Small/Medium/Large (CPU); GPU options available
                "scale_to_zero_enabled": True,  # Cost saving for non-critical endpoints
            }
        ],
        "traffic_config": {
            "routes": [{"served_model_name": "churn-gbm-production", "traffic_percentage": 100}]
        }
    }
}

response = requests.post(
    f"{workspace_url}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config
)
# Endpoint ready in ~5 minutes. SLA: <100ms p99 for Small workload size.
6.5 Feature Store
The Databricks Feature Store (now Unity Catalog Feature Store) solves the training-serving skew problem — the most common source of ML model degradation in production. Features computed during training are stored in Unity Catalog, and the same feature values are retrieved at inference time, guaranteeing consistency.
Key capability: Point-in-time lookups. The Feature Store retrieves the feature value that was valid at the timestamp of each training record — not the current value. This is critical for time-series problems (churn prediction, fraud detection, demand forecasting) where using future feature values in training creates data leakage.
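The point-in-time semantics can be illustrated without any Feature Store API. The sketch below is a minimal, pure-Python model of the lookup rule (the feature names, timestamps, and values are invented for illustration): for each training record, return the most recent feature snapshot at or before the record's timestamp, never a later one.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (effective_timestamp, value)
# pairs sorted ascending, e.g. snapshots of a "30-day spend" feature.
feature_history = [(100, 250.0), (200, 310.0), (300, 120.0)]
timestamps = [ts for ts, _ in feature_history]

def point_in_time_lookup(as_of_ts):
    """Return the feature value that was valid at as_of_ts (no future leakage)."""
    idx = bisect_right(timestamps, as_of_ts) - 1
    if idx < 0:
        return None  # No feature value existed yet at this timestamp
    return feature_history[idx][1]

# A training record labelled at ts=250 must see the ts=200 value (310.0),
# never the ts=300 value (120.0), which did not exist yet.
print(point_in_time_lookup(250))  # 310.0
print(point_in_time_lookup(50))   # None: before the first snapshot
```

Using the current value instead of this as-of lookup is exactly the leakage failure mode described above: the model trains on information that was unavailable at prediction time.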
CHAPTER 7
Azure Integration Architecture
7.1 Network Architecture
Production Databricks deployments on Azure require deliberate network design. The default “no VNet injection” configuration routes all cluster traffic through public endpoints — acceptable for development, unacceptable for regulated data workloads.
VNet Injection (Required for Production)
VNet injection deploys Databricks cluster VMs into a customer-managed Azure Virtual Network. Both the control plane (Databricks-managed) and data plane (cluster VMs) communicate through private endpoints. This architecture enables:
- Databricks clusters that cannot be reached from the public internet
- Outbound traffic routed through customer-managed Azure Firewall or NVA with full packet inspection
- Private endpoint connectivity from clusters to ADLS Gen2, Azure Key Vault, Azure SQL, Event Hubs, and all other Azure PaaS services
- NSG rules that restrict lateral movement between cluster nodes and to/from non-approved destinations
| Network Component | Configuration | Purpose |
| VNet | Dedicated /16 or /17 (e.g., 10.10.0.0/16) | Container for all Databricks network resources |
| Public Subnet | /17 or /18; NSG with Databricks service tag rules | Host-side NICs for every cluster node; required by Databricks (even in "no public IP" mode) |
| Private Subnet | /17 or /18; NSG with Databricks service tag rules | Container-side NICs for every cluster node; no public IPs; carries data plane traffic |
| Azure Private Endpoints | One per PaaS service (ADLS, Key Vault, ACR, etc.) | Private IP access to Azure services; eliminates data exfiltration via public endpoints |
| Azure Firewall / NVA | Custom route table (0.0.0.0/0 → Firewall) | Inspect and log all outbound traffic; allow only approved destinations |
| Private DNS Zones | One per private endpoint service type | Resolve Azure service hostnames to private IPs within VNet |
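Subnet sizing caps cluster capacity, so it is worth computing before deployment. The sketch below uses the standard-library ipaddress module; it assumes each cluster node consumes one IP in the subnet (nodes need an address in both the host and container subnets, so size them identically) and that Azure reserves 5 addresses per subnet. The function name is illustrative, not a Databricks API.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet

def max_cluster_nodes(cidr: str) -> int:
    """Rough upper bound on Databricks cluster nodes a subnet can host.

    Assumes one IP per node in this subnet; because every node also needs
    an address in the paired subnet, apply the same bound to both.
    """
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AZURE_RESERVED_IPS

print(max_cluster_nodes("10.10.128.0/18"))  # 16379
print(max_cluster_nodes("10.10.0.0/24"))    # 251
```

A /18 pair comfortably supports thousands of concurrent nodes; a /24 caps the workspace at roughly 250, which large autoscaling fleets can exhaust quickly.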
7.2 Identity Integration with Azure Active Directory
Azure Databricks uses Microsoft Entra ID (formerly Azure Active Directory) as its identity provider. With Unity Catalog enabled, this eliminates Databricks-local user management and enables:
- Single Sign-On via Entra ID — users authenticate with their corporate credentials
- Conditional Access Policies — MFA requirements, device compliance, IP location restrictions apply to Databricks access
- Group synchronization via SCIM — Entra ID security groups sync to Databricks automatically; deprovisioning in Entra ID immediately revokes Databricks access
- Service principal authentication — pipelines and automation use managed identities or service principals with federated credentials; no secrets in code
7.3 Key Azure Integration Patterns
| Azure Service | Integration Pattern | Authentication | Use Case |
| ADLS Gen2 | Direct mount via Unity Catalog external location; no storage credentials in notebooks | Managed Identity on cluster; or service principal via Unity Catalog storage credential | All Bronze/Silver/Gold storage; primary lakehouse storage layer |
| Azure Event Hubs | Kafka-compatible connector (EventHubsConf); or Azure Event Hubs Spark connector | SAS token via Key Vault secret; or Managed Identity (recommended) | Streaming ingestion into Bronze tables via DLT or Structured Streaming |
| Azure Key Vault | Databricks secret scope backed by Azure Key Vault | Managed Identity; no AKV credentials stored in Databricks | All secrets (connection strings, API keys, tokens); never hardcode credentials |
| Azure SQL Database | JDBC connector; or recommended: Data Factory staging to Delta | Service principal with SQL login; or Managed Identity (SQL MI) | Incremental extraction from operational databases into Bronze |
| Azure Data Factory | ADF Databricks Notebook/Jar/Python activities | Service principal with Contributor role on workspace | Orchestrate Databricks jobs from ADF parent pipelines; cross-system dependencies |
| Microsoft Fabric | Fabric shortcut to Delta tables in ADLS Gen2; or Delta Sharing protocol | Entra ID passthrough; or Delta Sharing credential | Serve Databricks gold tables in Fabric/Power BI without data movement |
| Azure Monitor | Cluster log delivery to Log Analytics via diagnostic settings | Workspace MSI; no configuration required beyond enabling | Unified observability; query Databricks logs alongside all Azure resources |
7.4 Secrets Management
# NEVER do this:
# connection_string = "Server=myserver.database.windows.net;User=admin;Password=P@ssw0rd!"

# ALWAYS do this: Key Vault-backed secret scope

# Step 1: Create Key Vault-backed secret scope (one-time, via Databricks CLI)
# databricks secrets create-scope \
#   --scope "production-secrets" \
#   --scope-backend-type AZURE_KEYVAULT \
#   --resource-id "/subscriptions/…/vaults/contoso-kv-prod" \
#   --dns-name "https://contoso-kv-prod.vault.azure.net/"

# Step 2: Read secret in notebook/job (secret never exposed in plain text)
jdbc_password = dbutils.secrets.get(scope="production-secrets", key="sql-db-password")
# The value of jdbc_password is REDACTED in all Databricks logs
# print(jdbc_password) → [REDACTED]

# Step 3: Use in connection
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://server.database.windows.net") \
    .option("dbtable", "dbo.orders") \
    .option("user", dbutils.secrets.get("production-secrets", "sql-db-user")) \
    .option("password", jdbc_password) \
    .load()
CHAPTER 8
Security Architecture & Zero-Trust Design
8.1 Security Layers
A production Databricks security architecture operates across five layers. Each layer is necessary; none is sufficient alone:
| Security Layer | Controls | Databricks Mechanism |
| Identity & Authentication | SSO, MFA, conditional access, SCIM provisioning, service principal governance | Entra ID integration, Unity Catalog identity management |
| Network | VNet injection, private endpoints, NSG rules, firewall egress control, no-public-IP clusters | Azure VNet injection, Private Link, Azure Firewall integration |
| Data Access | Table/column/row-level access control, column masks, row filters, data masking | Unity Catalog RBAC, column masks, row filters |
| Compute | Cluster policies, single-user access mode, no SSH access, immutable cluster configuration | Cluster access control, cluster policies, admin restrictions |
| Audit & Monitoring | All data access logged, admin actions logged, anomaly detection, retention compliance | Unity Catalog audit logs, Azure Monitor, Databricks audit log delivery |
8.2 Cluster Access Modes
Cluster access mode is one of the most critical security decisions in Databricks. The access mode determines who can use the cluster and what data they can access — a misconfigured access mode is the most common source of unintended data exposure in Databricks deployments.
| Access Mode | Who Can Use | Unity Catalog Support | Use Case | Recommended |
| Single User | One user or service principal only | Full support (all features) | Production jobs, sensitive ML workloads, privileged analyst access | Yes (all production jobs) |
| Shared | Multiple users, process isolation via kernel namespacing | Full support | Shared interactive clusters for teams with different permissions | Yes (multi-user interactive) |
| No Isolation Shared (Legacy) | Multiple users, NO process isolation | Limited (no row filters, column masks) | Legacy workloads only | No (avoid for any new deployment) |
| Custom (legacy) | Databricks-local users/groups only | Not supported | Pre-UC legacy workloads | No (migrate to Unity Catalog) |
Security Anti-Pattern: Many organizations set cluster access mode to "No Isolation Shared" because it was the historical default and "things just work." This mode provides zero process isolation between users: a notebook run by User A can access the memory, temp files, and Spark state of User B running concurrently on the same cluster. Never use No Isolation Shared mode in production environments handling sensitive data.
8.3 Cluster Policies
Cluster policies enforce governance controls on cluster creation — preventing cost overruns, security misconfigurations, and unapproved software. In enterprise deployments, cluster policies are mandatory: without them, any user with CREATE CLUSTER permission can launch arbitrarily large clusters with any configuration.
// Cluster Policy JSON: Production Job Clusters
// Enforce: spot instances, auto-termination, approved node types, no SSH
{
  "cluster_type": {"type": "fixed", "value": "job"},  // Jobs only, no all-purpose
  "autotermination_minutes": {
    "type": "range", "minValue": 10, "maxValue": 30, "defaultValue": 20
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS4_v2", "Standard_DS5_v2", "Standard_E8s_v3", "Standard_E16s_v3"]
  },
  "azure_attributes.availability": {
    "type": "fixed", "value": "SPOT_WITH_FALLBACK_AZURE"  // Spot instances required
  },
  "ssh_public_keys": {"type": "fixed", "value": []},  // No SSH access
  "data_security_mode": {
    "type": "fixed", "value": "SINGLE_USER"  // Enforce single-user access mode
  },
  "runtime_engine": {"type": "fixed", "value": "PHOTON"}  // Require Photon for performance
}
CHAPTER 9
Performance Engineering & Cost Optimization
9.1 Photon Engine
Photon is Databricks’ C++ vectorized query engine that replaces the JVM-based Spark execution engine for SQL and DataFrame operations. Photon delivers 2-8x query speedup for analytical workloads with no code changes required — it is drop-in compatible with any Spark SQL or DataFrame API code.
| Workload Type | Photon Speedup | Recommendation |
| Aggregations (GROUP BY, COUNT, SUM) | 3-8x | Enable Photon on all analytics clusters; cost of DBU premium pays back immediately |
| Hash joins on large tables | 2-5x | Critical for gold layer transformations with multi-table joins |
| Sorting and ordering | 2-4x | Especially impactful for OPTIMIZE Z-ORDER operations |
| Data scans with filters (WHERE clauses) | 2-6x | Combined with Delta data skipping, Photon makes scans of 10TB tables feel like point lookups |
| Streaming (Structured Streaming) | 1.5-3x | Enable on DLT clusters; reduces streaming lag during throughput spikes |
| Complex Python UDFs | No improvement | Python UDFs bypass Photon; rewrite as SQL expressions or pandas UDFs for Photon benefit |
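Because Photon-enabled compute carries a DBU premium, the economic question is whether the speedup outpaces the premium. The back-of-envelope model below is illustrative only: the 2.0x premium multiplier is an assumption, not published pricing, and it ignores the VM-hour savings from shorter runtimes (which make the real case for Photon stronger than shown).

```python
def photon_cost_ratio(speedup: float, dbu_premium: float = 2.0) -> float:
    """Relative DBU cost of a job with Photon vs. without.

    Simplified model: job DBU cost scales with runtime, so cost ratio is
    (premium multiplier) / (speedup). Ratio < 1.0 means Photon is cheaper
    despite the premium. The 2.0 premium is an illustrative assumption.
    """
    return dbu_premium / speedup

# A 3x aggregation speedup at an assumed 2x DBU premium: ~33% cheaper, 3x faster.
print(f"{photon_cost_ratio(3.0):.2f}")  # 0.67
# A Python-UDF-bound job (speedup ~1.0) just pays the full premium.
print(f"{photon_cost_ratio(1.0):.2f}")  # 2.00
```

This is why the table recommends Photon everywhere except UDF-heavy workloads: any speedup above the premium multiplier pays for itself even before counting VM savings.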
9.2 Cost Optimization: The Four Levers
Lever 1 — Spot Instances
Spot (preemptible) instances reduce VM cost by 60-80% compared to on-demand pricing. Databricks handles spot instance interruptions transparently: if a worker is preempted, the task is automatically retried on a replacement worker. The job runs longer but at 1/3 to 1/5 the compute cost.
Use SPOT_WITH_FALLBACK_AZURE in cluster policies: spot instances preferred, falls back to on-demand if no spot capacity. In practice, Azure spot availability for DS4_v2 and DS5_v2 exceeds 95% in major regions, making spot suitable for all production batch jobs.
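The expected savings from SPOT_WITH_FALLBACK_AZURE can be sanity-checked with a simple expected-value model. Every parameter below (discount, availability, retry overhead) is an illustrative assumption to be replaced with your region's observed numbers, not a quoted Azure price.

```python
def expected_job_cost(on_demand_cost: float,
                      spot_discount: float = 0.7,    # spot is ~70% cheaper (assumed)
                      spot_availability: float = 0.95,
                      retry_overhead: float = 0.10) -> float:
    """Expected compute cost of a job under SPOT_WITH_FALLBACK_AZURE.

    Illustrative model: with probability spot_availability the job runs on
    spot capacity (discounted rate, plus a runtime penalty for preemption
    retries); otherwise it falls back to on-demand at full price.
    """
    spot_cost = on_demand_cost * (1 - spot_discount) * (1 + retry_overhead)
    fallback_cost = on_demand_cost
    return spot_availability * spot_cost + (1 - spot_availability) * fallback_cost

# A batch job costing $100 on-demand:
print(f"${expected_job_cost(100):.2f}")  # $36.35
```

Even after pricing in a 10% retry penalty and a 5% fallback rate, the expected cost lands near one-third of on-demand, consistent with the 60-80% savings range cited above.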
Lever 2 — Cluster Right-Sizing
// Finding the right cluster size: use Ganglia metrics or Databricks Compute UI
//
// Symptoms of an over-provisioned cluster:
//   - Executor CPU utilization < 40% during active processing
//   - Memory utilization < 50% (driver and executors)
//   - Shuffle spill to disk = 0 (could handle smaller workers)
//
// Symptoms of an under-provisioned cluster:
//   - Executor CPU utilization > 85% sustained
//   - Shuffle spill to disk > 0 (increase worker count or use memory-optimized VMs)
//   - High task time variance (some tasks 10x slower = data skew; fix in code, not hardware)
//
// Rule of thumb: start with 1 worker per 50GB of data processed
// For memory-intensive operations (ML, wide joins): use E-series (memory-optimized)
// For compute-intensive operations (Photon SQL, streaming): use D-series
//
// Enable cluster autoscaling for variable workloads:
//   min_workers = minimum needed for SLA at lowest throughput
//   max_workers = maximum needed at peak throughput
//   autoscale.mode = "ENHANCED" for DLT, "LEGACY" for general clusters
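The rule of thumb above can be turned into a starting-point calculator. This helper is purely illustrative (the function name and defaults are not a Databricks API); its output is an initial guess to refine against the CPU, memory, and shuffle-spill symptoms listed above.

```python
import math

def starter_cluster_size(data_gb: float,
                         gb_per_worker: float = 50,
                         min_workers: int = 2,
                         max_workers: int = 100) -> dict:
    """Starting autoscale bounds from the 1-worker-per-50GB rule of thumb.

    max_workers covers peak throughput; min_workers is a floor (here a
    quarter of peak, an arbitrary illustrative choice) for SLA at low load.
    """
    workers = math.ceil(data_gb / gb_per_worker)
    return {
        "min_workers": max(min_workers, workers // 4),
        "max_workers": min(max_workers, max(workers, min_workers)),
    }

# An 800GB daily batch:
print(starter_cluster_size(800))  # {'min_workers': 4, 'max_workers': 16}
```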
Lever 3 — Serverless Compute
Databricks Serverless SQL Warehouses and Serverless Job Compute eliminate the cluster start time and idle cost problem entirely. With serverless, compute is ready in < 5 seconds, scales to zero when not in use, and charges only for active query time. For workloads with unpredictable or bursty patterns, serverless is almost always the lowest total cost option.
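The serverless-vs-classic decision reduces to a utilization break-even. The sketch below is an illustrative model, not a pricing calculator: the 1.5x serverless rate premium is an assumed ratio, and real comparisons should use your negotiated DBU rates.

```python
def serverless_vs_classic(active_hours: float,
                          provisioned_hours: float,
                          classic_rate: float = 1.0,
                          serverless_premium: float = 1.5) -> str:
    """Which compute option is cheaper for a bursty workload (toy model).

    Classic pays classic_rate for every provisioned hour, idle included;
    serverless pays an assumed per-hour premium but only for active hours.
    """
    classic_cost = provisioned_hours * classic_rate
    serverless_cost = active_hours * classic_rate * serverless_premium
    return "serverless" if serverless_cost < classic_cost else "classic"

# 2 active hours spread across a 10-hour provisioned window:
print(serverless_vs_classic(2, 10))   # serverless
# A fully-utilized cluster pays the premium for nothing:
print(serverless_vs_classic(10, 10))  # classic
```

Under these assumptions the break-even sits where active time is 1/premium of provisioned time, which is why bursty, unpredictable workloads favor serverless and steady 24/7 pipelines favor classic clusters.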
Lever 4 — Storage Optimization
| Storage Optimization | Implementation | Cost Impact |
| ADLS Gen2 tiering | Move Bronze data older than 90 days to Cool tier; archive after 1 year | 40-70% reduction in Bronze storage cost |
| Delta VACUUM | Run VACUUM … RETAIN 720 HOURS (30 days) weekly on all Delta tables | Removes old Parquet files orphaned by deletes/updates; 20-40% storage reduction on high-churn tables |
| Compression tuning | Use Snappy for hot data (faster decompress); Zstd for cold data (better ratio) | 10-30% storage reduction with no query performance impact for ADLS-backed tables |
| Predictive Optimization | Enable UC Predictive Optimization to auto-manage OPTIMIZE and VACUUM | Avoids over-optimizing small tables and missed VACUUM runs; saves engineering time |
9.3 Handling Data Skew
Data skew — where some partition keys have vastly more rows than others — is the most common cause of mysterious Spark job slowdowns that more hardware cannot fix. Identifying and resolving skew is a core Databricks engineering competency.
# Identify skew: look for tasks taking 10x+ longer than the median task
# In Spark UI: Stage Detail → Task Metrics → sort by Duration descending

# Solution 1: Salting (for joins with skewed keys)
# Add a random salt to the skewed key to distribute it across multiple partitions
from pyspark.sql import functions as F

SALT_FACTOR = 20  # Tune based on skew severity; more salt = more parallelism

# Salt the large (skewed) table
large_df_salted = large_df.withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"),
             (F.rand() * SALT_FACTOR).cast("int").cast("string"))
)

# Explode the small (lookup) table to match all salt values
small_df_exploded = small_df.crossJoin(
    spark.range(SALT_FACTOR).select(F.col("id").cast("string").alias("salt"))
).withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"), F.col("salt"))
)

# Join on the salted key
result = large_df_salted.join(small_df_exploded, "customer_id_salted")

# Solution 2: Adaptive Query Execution (AQE) for automatic skew handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE automatically splits skewed partitions at runtime (Databricks Runtime 8.0+)
# Enable AQE first; only implement manual salting if AQE is insufficient
CHAPTER 10
Operations, Monitoring & SRE Practices
10.1 Observability Stack
A production Databricks deployment requires three observability layers: job-level monitoring (did the job succeed? how long did it take?), cluster-level monitoring (CPU, memory, network utilization), and data-level monitoring (did the data arrive? is it fresh? does it pass quality checks?).
| Observability Layer | Data Source | Tooling | Key Alerts |
| Job & Pipeline | Databricks Jobs API, DLT pipeline events, MLflow run metrics | Azure Monitor (Log Analytics), Databricks Workflow emails | Job failure, duration > SLA, DLT expectation violation rate spike |
| Cluster & Compute | Databricks cluster metrics to Azure Monitor, Ganglia (legacy) | Azure Monitor dashboards, custom KQL queries | Spot interruption rate > 10%, memory OOM events, driver failure |
| Data Quality | DLT expectations metrics, custom Great Expectations/Soda suites | Databricks SQL dashboards, Power BI, PagerDuty | Null rate spike, row count anomaly, freshness breach (> SLA hours since last update) |
| Cost | Databricks cost API, Azure Cost Management | Azure Cost Management dashboards + budget alerts | Daily DBU spend > budget threshold, unexpected long-running cluster |
10.2 CI/CD for Databricks
Production Databricks deployments require a CI/CD pipeline that tests notebooks and jobs before deployment, promotes code through environments (dev → test → prod), and maintains configuration as code for all workspace resources.
# Azure DevOps Pipeline: Databricks CI/CD (YAML)
# Stages: Test → Deploy to Test → Integration Test → Deploy to Prod
trigger:
  branches:
    include: [main, release/*]

stages:
  - stage: Test
    jobs:
      - job: UnitTests
        steps:
          - script: pip install databricks-connect pytest pytest-mock
          - script: pytest tests/unit/ -v --junitxml=test-results.xml
          - task: PublishTestResults@2
            inputs:
              testResultsFiles: test-results.xml

  - stage: DeployTest
    dependsOn: Test
    jobs:
      - job: DeployToTest
        steps:
          - script: |
              pip install databricks-cli
              databricks workspace import_dir ./notebooks /Shared/pipelines/test --overwrite
              databricks jobs reset --job-id $TEST_JOB_ID --json @job_configs/test/ingest_job.json
            env:
              DATABRICKS_HOST: $(DATABRICKS_TEST_HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_TEST_TOKEN)

  - stage: DeployProd
    dependsOn: IntegrationTest
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToProd
        environment: production  # Requires manual approval gate
        strategy:
          runOnce:
            deploy:
              steps:
                # Databricks Asset Bundles: recommended for production deployments
                - script: databricks bundle deploy --target prod
10.3 Databricks Asset Bundles
Databricks Asset Bundles (DABs) are the recommended approach for managing Databricks resources as code. DABs define jobs, pipelines, notebooks, and cluster configurations in a YAML manifest that can be version-controlled, deployed consistently across environments, and integrated with CI/CD pipelines.
# databricks.yml: Asset Bundle definition
bundle:
  name: analytics-platform

variables:
  catalog: {default: dev_catalog}
  environment: {default: dev}

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-dev.azuredatabricks.net
    variables:
      catalog: dev_catalog
      environment: dev

  prod:
    mode: production
    workspace:
      host: https://adb-prod.azuredatabricks.net
    variables:
      catalog: main
      environment: prod

resources:
  pipelines:
    iot_sensor_pipeline:
      name: "IoT Sensor Pipeline (${var.environment})"
      catalog: ${var.catalog}
      schema: silver
      libraries:
        - notebook: {path: ./src/pipelines/iot_sensor_pipeline.py}
      continuous: true
      channel: CURRENT

# Deploy:  databricks bundle deploy --target prod
# Destroy: databricks bundle destroy --target dev
CHAPTER 11
Migration Patterns: On-Premises & Cloud to Databricks
11.1 Migration Assessment Framework
Before migrating any workload to Databricks, a structured assessment determines which workloads benefit most, which require significant re-engineering, and which should remain on existing platforms.
| Workload Type | Migration Complexity | Migration Value | Recommended Approach |
| SQL Server SSIS packages | Medium | High | Rewrite as DLT pipelines or ADF; use SSIS → Databricks migration accelerator tools |
| Informatica PowerCenter mappings | High | High | Use Databricks professional services migration toolkit; expect 60-70% automation, 30-40% manual rewrite |
| Teradata SQL workloads | Medium-High | Very High (cost reduction) | Use Databricks Teradata migration utilities; SQL dialect conversion tools; validate results |
| On-premises Hadoop (HDFS + Hive) | Low-Medium | Very High | Lift-and-shift HDFS to ADLS Gen2; convert Hive tables to Delta; Spark code works with minimal changes |
| Azure Synapse Analytics (dedicated pool) | Low | Medium | Share Delta tables with Synapse via Delta Sharing; or move workloads that benefit from Databricks compute onto Databricks |
| Snowflake | Medium | High (for ML-heavy orgs) | Use Delta Sharing to mirror key tables; migrate engineering workloads; evaluate co-existence vs. full migration |
11.2 Hadoop Migration Pattern
Hadoop-to-Databricks is the most common migration pattern and also the most straightforward. HDFS and Hive are architecturally similar to ADLS Gen2 and Delta Lake, making it possible to migrate 80% of Hadoop workloads with minimal code changes.
- Storage migration: Use Azure Data Box for initial bulk transfer of HDFS data to ADLS Gen2 (for datasets >10TB). Use AzCopy for incremental transfer during cutover. Target: Bronze layer in ADLS Gen2.
- Hive table conversion: Run Databricks CONVERT TO DELTA on all Parquet-backed Hive tables. For ORC-backed tables, read as Spark DataFrame and write as Delta. Register in Unity Catalog metastore.
- Spark code migration: Existing PySpark and Scala Spark code runs on Databricks Runtime without modification in ~85% of cases. Common issues: Hadoop-specific configuration properties, HDFS path references, and deprecated Spark 2.x APIs that were removed in Spark 3.x.
- Hive metastore migration: Use Databricks Unity Catalog migration tool to migrate existing Hive metastore to Unity Catalog, preserving table definitions, partition metadata, and (where possible) access control information.
- Scheduler migration: Replace Apache Oozie or Airflow (on Hadoop YARN) with Databricks Workflows or Azure Data Factory. Export Oozie workflow XML and use the Databricks Oozie migration converter for automated translation.
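Of the code-migration issues above, hard-coded HDFS path references are the most mechanical to fix. The helper below is a minimal sketch of the rewrite; the container and storage-account names are invented placeholders, and real migrations typically drive the mapping from a per-directory lookup table rather than a single default.

```python
import re

def rewrite_hdfs_path(path: str,
                      container: str = "bronze",
                      account: str = "contosodatalake") -> str:
    """Rewrite a hard-coded HDFS URI to its ADLS Gen2 (abfss) equivalent.

    Strips the hdfs://namenode:port authority and re-roots the path under
    an ADLS Gen2 container. Non-HDFS paths are returned unchanged.
    """
    match = re.match(r"hdfs://[^/]+(/.*)", path)
    if not match:
        return path  # Relative or already-migrated path: leave as-is
    return f"abfss://{container}@{account}.dfs.core.windows.net{match.group(1)}"

print(rewrite_hdfs_path("hdfs://namenode:8020/data/pos/2024/01"))
# abfss://bronze@contosodatalake.dfs.core.windows.net/data/pos/2024/01
```

Running a rewrite like this across a codebase (plus grepping for residual `hdfs://` literals) covers most of the path-reference failures seen in the ~15% of Spark code that needs changes.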
11.3 Teradata Migration Pattern
Teradata migrations deliver the highest ROI of any Databricks migration — typically 70-85% cost reduction on compute — but require the most careful SQL dialect conversion and performance validation.
| Migration Phase | Activities | Duration (Typical) | Key Risk |
| Discovery | Inventory all Teradata objects (tables, views, stored procedures, macros); profile query patterns; identify BTEQ and TPT scripts | 2-4 weeks | Undocumented dependencies; legacy macros with no documentation |
| Schema conversion | Convert Teradata DDL to Delta Lake DDL; map data types; handle SET tables (implicit dedup) vs. MULTISET (allow duplication) | 4-8 weeks | SET table semantics differ from Delta; requires explicit deduplication logic |
| SQL conversion | Convert BTEQ → PySpark/SQL; convert Teradata-specific functions (QUALIFY, SAMPLE, periods) to Spark equivalents | 8-16 weeks | QUALIFY (row numbering within groups) needs careful WINDOW function translation |
| Performance validation | Run parallel execution: Teradata vs. Databricks; validate row counts, aggregations, and business metrics match | 4-8 weeks | Floating point rounding differences; NULL handling differences |
| Cutover | Final data sync; redirect pipelines; decommission Teradata | 2-4 weeks | Cutover window; rollback plan if validation fails |
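The QUALIFY risk flagged in the SQL-conversion phase deserves a concrete picture. A typical Teradata pattern is `QUALIFY ROW_NUMBER() OVER (PARTITION BY acct ORDER BY ts DESC) = 1`; on Spark versions without QUALIFY support, the usual translation wraps the window function in a subquery and filters on it. The semantics the translation must preserve (exactly one row per key, the latest by timestamp) are shown below in plain Python as a validation oracle; the column names are invented for illustration.

```python
def latest_per_key(rows):
    """Keep the newest row per account, mirroring QUALIFY ROW_NUMBER() = 1.

    rows: iterable of dicts with 'acct' and 'ts' keys. Returns one row per
    acct, the one with the greatest ts, sorted by acct for stable comparison.
    """
    best = {}
    for row in rows:
        if row["acct"] not in best or row["ts"] > best[row["acct"]]["ts"]:
            best[row["acct"]] = row
    return sorted(best.values(), key=lambda r: r["acct"])

txns = [
    {"acct": "A", "ts": 1, "amt": 10},
    {"acct": "A", "ts": 3, "amt": 30},
    {"acct": "B", "ts": 2, "amt": 20},
]
print(latest_per_key(txns))
# [{'acct': 'A', 'ts': 3, 'amt': 30}, {'acct': 'B', 'ts': 2, 'amt': 20}]
```

During parallel validation, running a reference implementation like this against a sample and diffing it with the converted Spark SQL output catches off-by-one window-frame and tie-breaking mistakes early.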
CHAPTER 12
Real-World Case Studies
Case Study 1: Global Retailer — Real-Time Inventory Analytics
| Dimension | Details |
| Organization | Pan-European omnichannel retailer; 1,200 stores; 85M transactions/day across POS, e-commerce, and fulfilment systems |
| Challenge | Inventory visibility latency of 4-6 hours causing 12% out-of-stock rate. Demand forecasting models running on stale batch data producing recommendations 2 days behind actual demand signals. €47M annual revenue loss attributed to stockouts and overstocking combined. |
| Databricks Scope | DLT streaming pipelines ingesting POS events from 1,200 stores via Event Hubs (120K events/second peak); real-time inventory position table updated every 60 seconds; XGBoost demand forecasting model retrained daily; 180-day sales history in Silver; real-time serving via Model Serving API |
| Architecture | Bronze: raw POS events (Event Hubs → DLT streaming, Delta CDF enabled). Silver: cleaned transactions, inventory deltas (DLT streaming). Gold: real-time inventory position table (MERGE every 60 seconds), store-level demand signals. ML: daily demand forecast model using Gold features; served via Model Serving REST API to merchandising system. |
Key Engineering Decisions
Event deduplication: POS systems occasionally double-emit events during network retries. DLT Expectations with FAIL UPDATE on duplicate transaction_id, combined with a 2-hour deduplication window using Delta CDF, eliminate duplicate inventory decrements without requiring a full table scan.
Inventory MERGE optimization: The inventory position table (450M rows, one per SKU per store) required MERGE every 60 seconds. Naive MERGE was too slow (45 seconds). Solution: partition by store_id (1,200 partitions), MERGE with WHEN NOT MATCHED THEN INSERT only on the affected store partition. Merge time reduced to 8 seconds — well within the 60-second SLA.
Model serving latency: Merchandising system requires forecast in < 200ms. Databricks Model Serving (Small workload size, scale-to-zero disabled) delivers p99 < 85ms for the XGBoost model with 320 input features. Feature lookup from Delta table adds 12ms average — within SLA.
| KPI | Before Databricks | After Databricks (6 months) |
| Inventory visibility latency | 4-6 hours | < 2 minutes |
| Out-of-stock rate | 12% | 7.3% |
| Demand forecast accuracy (MAPE) | 31% | 18% |
| Data infrastructure cost (annual) | €4.2M (Teradata + SAS) | €1.1M (Databricks + ADLS) |
| Engineering team size (data) | 28 FTE | 16 FTE (12 freed for product work) |
Case Study 2: Financial Services Firm — Credit Risk Model Factory
| Dimension | Details |
| Organization | UK challenger bank; 2.4M customers; regulatory capital models (IFRS 9 ECL, Basel III PD/LGD/EAD) requiring quarterly revalidation and monthly production runs |
| Challenge | Monthly credit risk model runs taking 72 hours on legacy SAS environment. Regulatory validation requiring complete audit trail of every model input, transformation, and output — no automated lineage. Model retraining requiring 3-week manual process (data preparation, training, validation, sign-off, deployment). |
| Databricks Scope | Full model factory: data ingestion from core banking system via CDC (Azure SQL → Bronze Delta), feature engineering pipeline (Silver), model training and registry (MLflow + Unity Catalog), monthly production scoring (Gold), regulatory reporting (Gold → Azure SQL → Power BI) |
Regulatory Compliance Architecture
Immutable audit trail: Delta Lake time travel serves as the regulatory audit trail. Every model training run records the exact Delta table version of all input tables in the MLflow run metadata. Regulators can request reconstruction of any model output by restoring the exact input data state — guaranteed by Delta’s transaction log.
Model governance workflow: MLflow model registry stages (Candidate → Staging → Production) map to the bank’s model risk governance process. Each stage transition requires documented challenger model test results, populated as MLflow run tags. The Production promotion requires manual sign-off in the registry UI from the Model Risk team — creating an automated, auditable governance record.
Parallel validation: During the 6-month transition, Databricks and SAS ran in parallel. Databricks results were compared to SAS output at each stage. Any discrepancy >0.01% triggered an investigation. Final validation showed 99.97% numerical agreement — the 0.03% difference traced to floating-point precision differences between SAS and Python, accepted by the regulator after documentation.
| KPI | Before Databricks (SAS) | After Databricks |
| Monthly model run duration | 72 hours | 4.5 hours |
| Model retraining cycle | 3 weeks manual | 6 hours automated (CI/CD + MLflow) |
| Audit trail completeness | Manual documentation; gaps common | 100% automated via Delta time travel + MLflow |
| Regulatory audit preparation | 8 weeks | 3 days |
| Annual platform cost | £1.8M (SAS licences) | £420K (Databricks + compute) |
Case Study 3: Manufacturing — Predictive Maintenance at IoT Scale
A global industrial manufacturer deployed Databricks to process sensor telemetry from 85,000 connected machines across 140 factories in 28 countries. The business case: reduce unplanned downtime (average cost: $260,000/hour per production line) through ML-based predictive maintenance.
Scale challenge: 85,000 machines, each emitting 20 sensor readings per second, generate 1.7 million events per second sustained (4.2 million at peak). Event Hubs (100 throughput units) feeds a Databricks DLT streaming pipeline; the Bronze layer receives 147TB of raw sensor data per month.
ML architecture: One model per machine type (42 machine types) trained on 18-month historical data including 3,200 documented failure events. Isolation Forest for anomaly detection (unsupervised), Gradient Boosting for time-to-failure prediction (supervised, where labelled failure data available). Models retrained weekly using latest telemetry via automated Databricks Workflow.
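The production models here are Isolation Forest and gradient boosting; as a minimal, dependency-free stand-in for the unsupervised anomaly-detection step, a rolling z-score over a single sensor stream illustrates the core idea. The window size, threshold, and sample readings are illustrative assumptions, not the manufacturer's configuration.

```python
# Toy stand-in for streaming anomaly detection: flag readings more than
# `threshold` standard deviations from the rolling mean of recent history.
import statistics
from collections import deque

def anomaly_flags(readings, window=20, threshold=3.0):
    """Return a per-reading list of booleans: True = anomalous."""
    history = deque(maxlen=window)
    flags = []
    for x in readings:
        if len(history) >= 2:
            mu = statistics.fmean(history)
            sigma = statistics.stdev(history)
            flags.append(sigma > 0 and abs(x - mu) > threshold * sigma)
        else:
            flags.append(False)  # not enough history to score yet
        history.append(x)
    return flags

# A flat signal with one spike: only the spike is flagged
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 55.0, 10.0]
```

An Isolation Forest generalizes this to many correlated sensors at once, which is why it is the better fit at 20 readings per second per machine.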
Edge-to-cloud integration: High-priority alerts (predicted failure within 4 hours) routed from Model Serving → Event Grid → factory floor SCADA system via Azure API Management. Round-trip latency from sensor reading to SCADA alert: < 90 seconds.
| KPI | Baseline | After 12 Months |
| Unplanned downtime (hours/year) | 14,200 across all factories | 8,900 (37% reduction) |
| Annual downtime cost avoidance | — | $135M (at $260K/hour average) |
| False positive rate (maintenance alerts) | N/A (no predictive system) | 8.3% (within 10% target) |
| Model prediction accuracy (failure within 4hrs) | — | 84% precision, 79% recall |
| Databricks platform cost (annual) | — | $2.1M all-in |
CHAPTER 13
90-Day Implementation Roadmap
The following roadmap reflects deployment patterns across 15+ enterprise Databricks implementations. The sequencing is optimized for: early value demonstration (quick wins by Day 30), platform stability before scale (foundation before onboarding all workloads), and sustainable operations (CI/CD and monitoring before production traffic).
Phase 1: Foundation (Days 1-30)
| Week | Activity | Owner | Success Criteria |
| 1 | Azure Databricks workspace deployment with VNet injection; Unity Catalog metastore creation; Entra ID SCIM sync configuration; Key Vault-backed secret scope setup | Data Platform Engineer + Azure Architect | Workspace live in private network; Unity Catalog active; first Entra ID group synced; Key Vault scope working |
| 1-2 | ADLS Gen2 storage account creation (bronze/silver/gold); Unity Catalog external locations and storage credentials; cluster policies defined and published | Data Platform Engineer | Storage accounts with hierarchical namespace; external locations in Unity Catalog; cluster policies active |
| 2 | Catalog and schema design in Unity Catalog; RBAC model design; data domain owner identification; Entra ID group structure for Databricks access | Data Governance Lead + Data Architect | Catalog hierarchy documented and deployed; RBAC matrix approved; all data domain owners identified and briefed |
| 2-3 | First pipeline: select highest-value Bronze ingestion (one source); build as DLT pipeline with expectations; deploy to dev; validate against source | Senior Data Engineer | First DLT pipeline ingesting into Bronze Delta table; expectations defined; data validated against source system |
| 3-4 | CI/CD pipeline setup (Azure DevOps); Databricks Asset Bundles configured; dev/prod workspace deployment automation; first automated unit tests | Data Engineer + DevOps | Code deployment automated; dev→prod promotion via pipeline; unit tests running on every PR |
| 4 | Azure Monitor integration; Log Analytics workspace linked; first cost alert configured; Databricks audit log delivery enabled | Data Platform Engineer | Cluster metrics in Log Analytics; cost alert active; audit logs flowing |
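The Asset Bundle step in weeks 3-4 can be illustrated with a minimal `databricks.yml`. The bundle name, job definition, notebook path, and workspace hosts are placeholder assumptions, shown only to indicate the shape of an environment-aware, version-controlled deployment manifest:

```yaml
# Minimal Databricks Asset Bundle sketch -- all names and hosts are
# illustrative placeholders, not a production configuration.
bundle:
  name: orders-ingestion

resources:
  jobs:
    bronze_ingest:
      name: bronze-orders-ingest
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../src/bronze_orders.py

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
```

The CI/CD pipeline then promotes code by running `databricks bundle deploy -t prod`, so the same manifest drives both environments.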
Phase 2: Activation (Days 31-60)
| Week | Activity | Owner | Success Criteria |
| 5-6 | Onboard Tier 1 workloads: build Silver transformation pipelines with DLT; define data quality expectations; deploy to production | Data Engineering Team | Tier 1 Silver tables populated; DLT expectations capturing quality metrics; pipeline SLAs defined |
| 6 | Quick win: first Gold table serving Power BI / Fabric; first analyst using Databricks SQL for self-service query; Governance Maturity demo to leadership | Data Engineering + Analytics Lead | First Gold table live; at least 3 analysts running queries; leadership demo delivered |
| 6-7 | Unity Catalog lineage verification; column masks and row filters for PII data; Purview integration configured (if in scope) | Data Engineer + Data Governance | Lineage visible in Catalog Explorer for Tier 1 tables; PII columns masked; Purview sync active |
| 7-8 | MLflow setup (if ML in scope): first experiment tracked; model registered in Unity Catalog; model serving endpoint deployed for at least one model | ML Engineer | MLflow experiments running; first model in registry with Champion/Challenger staging; serving endpoint live |
| 7-8 | Streaming pipeline deployment (if real-time in scope): Event Hubs connector; DLT streaming with Enhanced Autoscaling; latency SLA validation | Senior Data Engineer | Streaming pipeline ingesting at production volume; end-to-end latency within SLA; autoscaling tested under load |
| 8 | Cost review: spot instance utilization report; cluster right-sizing audit; Predictive Optimization enabled on UC catalogs; budget alert refinement | Data Platform Engineer + FinOps | Spot utilization >80%; identified and actioned top 3 cost reduction opportunities |
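Conceptually, the DLT expectations that drive the quality metrics in weeks 5-6 are named row-level predicates with a pass/fail policy. The following dependency-free sketch mimics `expect_or_drop` semantics outside Databricks — the rule names, predicates, and sample rows are illustrative, and this is not the DLT API itself:

```python
# Pure-Python illustration of DLT-style expectations: each expectation is a
# named predicate; failing rows are dropped (expect_or_drop semantics) and
# per-rule violation counts are recorded, as DLT does in its event log.

def apply_expectations(rows, expectations):
    """Return (passing_rows, violation_counts) for a list of dict rows."""
    violations = {name: 0 for name in expectations}
    passing = []
    for row in rows:
        ok = True
        for name, predicate in expectations.items():
            if not predicate(row):
                violations[name] += 1
                ok = False
        if ok:
            passing.append(row)
    return passing, violations

rules = {
    "valid_order_id": lambda r: r.get("order_id") is not None,
    "positive_amount": lambda r: (r.get("amount") or 0) > 0,
}
rows = [{"order_id": 1, "amount": 10.0},
        {"order_id": None, "amount": 5.0},
        {"order_id": 2, "amount": -3.0}]
good, stats = apply_expectations(rows, rules)
```

In DLT the equivalent is a `@dlt.expect_or_drop` decorator per rule, and the violation counts surface automatically in the pipeline event log rather than a return value.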
Phase 3: Scale & Optimise (Days 61-90)
| Week | Activity | Owner | Success Criteria |
| 9-10 | Onboard remaining workloads: complete Bronze ingestion for all Tier 2 sources; Silver and Gold for priority domains; DLT for all new pipelines (no raw Spark jobs) | Data Engineering Team | >80% of planned data sources ingested; all new pipelines using DLT; data quality SLAs documented |
| 10-11 | Performance engineering: identify top 10 slowest jobs; apply Z-ORDER, OPTIMIZE, and Photon; validate speedup; document optimization patterns for team | Senior Data Engineer | Top 10 jobs optimized; average job time reduced by >30%; performance tuning runbook created |
| 11 | Operations runbook: incident response procedures; on-call rotation; DLT pipeline restart procedures; cost escalation process; SLA breach response | Data Platform Lead + SRE | Runbook published; on-call rotation active; first gameday exercise completed |
| 12 | Platform review: measure KPIs vs. Day 1 baseline; business case revalidation; 90-180 day roadmap (next workloads, ML roadmap, Fabric integration plan) | CDO + Data Platform Lead | KPI improvement documented; business case validated; 90-180 day plan approved by leadership |
Critical Success Factor: The most common Databricks deployment failure mode is “platform without pipelines” — the infrastructure is deployed correctly, but business workloads are never migrated to it. Ensure at least one business-critical pipeline is live in production by Day 45. Early production usage drives adoption, surfaces real-world issues before scale, and maintains organizational momentum.
APPENDIX
Appendix: Reference Materials
A. Essential PySpark Patterns
A1. Efficient Incremental Load with Watermarks
# Production pattern: incremental load with a high-water mark
# stored in a metadata Delta table
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the last processed timestamp for this pipeline
watermark = (
    spark.table("main.metadata.pipeline_watermarks")
    .filter("pipeline_name = 'orders_ingestion'")
    .select("last_processed_ts")
    .collect()[0][0]
)

# Extract only new/changed records from the source system
# (jdbc_url and credentials are configured elsewhere)
new_records = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", f"SELECT * FROM orders WHERE updated_at > '{watermark}'")
    .load()
)

# Upsert into the Bronze Delta table
bronze_table = DeltaTable.forName(spark, "main.bronze.orders")
(
    bronze_table.alias("target")
    .merge(new_records.alias("source"), "target.order_id = source.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Advance the watermark -- but only if new records were actually read
new_watermark = new_records.agg(F.max("updated_at")).collect()[0][0]
if new_watermark is not None:
    spark.sql(f"""
        UPDATE main.metadata.pipeline_watermarks
        SET last_processed_ts = '{new_watermark}'
        WHERE pipeline_name = 'orders_ingestion'
    """)
A2. Dynamic Partition Overwrite Pattern
# Safe overwrite of specific partitions without touching others
# Use case: reprocess a date range after a source data correction

# Option 1: dynamic partition overwrite -- replaces only the partitions
# present in df, leaving all other partitions untouched
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Option 2 (preferred for Delta): replaceWhere -- atomically replaces
# exactly the rows matching the predicate
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            "transaction_dt >= '2025-01-01' AND transaction_dt <= '2025-01-07'")
    .saveAsTable("main.silver.sales_transactions")
)

# Note: replaceWhere is safer than dynamic partition overwrite for Delta --
# it uses Delta transaction semantics and rolls back atomically on failure
B. Databricks REST API Quick Reference
| Operation | Method | Endpoint | Common Use |
| List jobs | GET | /api/2.1/jobs/list | Audit all jobs; generate cost attribution reports |
| Run job now | POST | /api/2.1/jobs/run-now | Trigger job from external orchestrator (ADF, Logic Apps) |
| Get run status | GET | /api/2.1/jobs/runs/get?run_id={id} | Poll completion from external orchestrator |
| List DLT pipelines | GET | /api/2.0/pipelines | Audit pipelines; governance inventory |
| Get pipeline events | GET | /api/2.0/pipelines/{id}/events | Extract DLT expectation metrics for monitoring |
| List Unity Catalog tables | GET | /api/2.1/unity-catalog/tables | Data catalog integration; asset inventory |
| Get table lineage | GET | /api/2.0/lineage-tracking/table-lineage | Extract lineage for Purview integration |
| Create cluster policy | POST | /api/2.0/policies/clusters/create | Infrastructure-as-code cluster governance |
| List MLflow experiments | GET | /api/2.0/mlflow/experiments/list | ML governance; experiment inventory |
| Get model versions | GET | /api/2.0/mlflow/model-versions/search | Model registry governance; deployment audit |
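As an illustration of driving the Jobs API from an external orchestrator, the sketch below builds the `run-now` request using only the standard library. The host, token, job ID, and notebook parameters are placeholders, and the request is only constructed, never sent:

```python
# Build (but do not send) a POST to /api/2.1/jobs/run-now.
# Host, token, and job_id are illustrative placeholders.
import json
import urllib.request

def build_run_now_request(host, token, job_id, params=None):
    """Construct the run-now request an external orchestrator would send."""
    body = {"job_id": job_id}
    if params:
        body["notebook_params"] = params
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_run_now_request("https://adb-123.azuredatabricks.net",
                            "dapi-placeholder", 42,
                            {"run_date": "2025-01-07"})
# urllib.request.urlopen(req) would trigger the job; the orchestrator then
# polls /api/2.1/jobs/runs/get?run_id=... until the run terminates
```

The same construction pattern applies to the other endpoints in the table; only the path, method, and body change.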
C. Databricks Runtime Version Policy
| Runtime Channel | Description | Recommendation |
| CURRENT | Latest stable DBR; receives bug fixes and security patches | Use for all production workloads; update quarterly |
| PREVIEW | Next DBR version; not yet GA; may contain breaking changes | Use in development/test only; never in production |
| LTS (Long-Term Support) | Extended support cycle (2 years); conservative updates | Use for regulated workloads with strict change control; accept performance tradeoff vs. CURRENT |
| Custom (ML Runtime) | Includes TensorFlow, PyTorch, scikit-learn, XGBoost pre-installed | Use for all ML workloads; eliminates library installation time on cluster start |
D. Glossary
| Term | Definition |
| Adaptive Query Execution (AQE) | Spark feature that re-optimizes query plans at runtime based on actual data statistics; automatically handles skew, coalesces partitions, and selects join strategies |
| Asset Bundle (DAB) | Databricks resource configuration framework that defines jobs, pipelines, and clusters as YAML manifests for version-controlled, environment-aware deployment |
| Change Data Feed (CDF) | Delta Lake feature that captures row-level changes (inserts, updates, deletes) with change type metadata, enabling efficient incremental downstream processing |
| Delta Lake | Open-source storage layer providing ACID transactions, scalable metadata, time travel, and schema enforcement on top of Parquet files in cloud object storage |
| Delta Live Tables (DLT) | Databricks declarative pipeline framework; engineers declare data transformations and quality expectations; DLT manages execution, retry, monitoring, and autoscaling |
| Lakehouse | Architecture combining data lake storage economics with data warehouse reliability and performance, using open table formats (Delta Lake, Iceberg, Hudi) as the unifying layer |
| Medallion Architecture | Data organization pattern (Bronze/Silver/Gold) representing progressive data quality levels, from raw ingestion to business-certified analytics-ready assets |
| MLflow | Open-source platform for ML lifecycle management: experiment tracking, model registry, model serving, and evaluation; created by Databricks; integrated into Unity Catalog |
| Photon | Databricks C++ vectorized query engine replacing JVM Spark execution for analytical workloads; delivers 2-8x performance improvement with no code changes |
| Unity Catalog | Databricks unified governance layer providing centralized access control, lineage tracking, data sharing, and audit across all workspaces in an organization |
| Z-Ordering | Delta Lake data layout optimization that co-locates related data in the same files based on specified columns, improving query performance by enabling more aggressive data skipping |
