DATABRICKS ON AZURE

Lakehouse Architecture for Petabyte-Scale Analytics

Unity Catalog, Delta Live Tables, ML at Scale & Azure Integration Patterns

Data Platform Practice  |  Microsoft Ecosystem Veratas  |  2025 Edition

Classification: Public  |  Version 1.0  |  2025

Executive Summary

The modern data stack has converged on a single architectural truth: the lakehouse. By unifying the low-cost, flexible storage of a data lake with the reliability, performance, and governance of a data warehouse, the lakehouse eliminates the duplication, inconsistency, and operational overhead that plagued the two-tier architectures of the previous decade. Databricks — the company that originated Apache Spark, Delta Lake, and MLflow — delivers the most mature and battle-tested lakehouse platform available, and on Azure it integrates deeply with the Microsoft ecosystem that most enterprises already operate.

This white paper addresses the gap between Databricks introductory documentation and the architectural decisions that determine whether a Databricks deployment scales from a proof-of-concept to a petabyte-scale production platform. It is written for data architects, senior data engineers, ML engineers, and technical leaders making platform decisions for organizations processing millions to trillions of records daily.

Key Findings

| Dimension | Traditional Architecture | Databricks Lakehouse (Measured Outcomes) |
| --- | --- | --- |
| Storage cost per TB/month | $180-$450 (data warehouse) | $23-$45 (ADLS Gen2 + Delta) |
| Batch-to-serving latency | 4-24 hours (nightly ETL) | < 5 minutes (Delta Live Tables streaming) |
| ML model deployment time | 2-8 weeks (manual MLOps) | < 4 hours (MLflow + Model Serving) |
| Data pipeline reliability | 60-75% on-time SLA (complex DAGs) | 95%+ with DLT expectations & auto-retry |
| Governance coverage | Per-tool ACLs; no unified lineage | Unity Catalog: column-level lineage, ABAC, row filters |
| Compute cost (analytics) | Fixed warehouse capacity | 40-70% reduction via auto-scaling + spot instances |
| Time to insight (ad hoc) | Hours (provisioning + query queue) | Minutes (serverless SQL warehouse, instant start) |
Strategic Insight: Databricks and Microsoft Fabric are not mutually exclusive. The optimal enterprise architecture in 2025 uses both: Databricks for heavy data engineering, ML training, and streaming at scale; Microsoft Fabric for business-facing analytics, Power BI integration, and self-service BI. Unity Catalog bridges both platforms through its Fabric connector, enabling shared governance across the full data estate.

CHAPTER 1

Why Databricks on Azure — Strategic Positioning

1.1  The Lakehouse Market in 2025

The lakehouse market has reached a tipping point. Snowflake, Google BigQuery, and Microsoft Fabric all offer lakehouse-compatible architectures, but Databricks retains three durable advantages that make it the preferred platform for engineering-heavy, ML-intensive, or high-volume data workloads:

Apache Spark ownership: Databricks employs the core Spark committers. Every Databricks Runtime release ships Spark optimizations that are 1-2 years ahead of open-source availability. For workloads processing terabytes or petabytes per job, this translates to 2-5x faster execution over commodity Spark environments.

Delta Lake protocol maturity: Delta Lake — created by Databricks — is the most widely adopted open table format, with 30,000+ production deployments. Delta Universal Format (UniForm) enables interoperability with Apache Iceberg and Apache Hudi readers, future-proofing storage investments against format fragmentation.

Unified ML platform: MLflow (also created by Databricks) is the de facto standard for ML experiment tracking and model registry, with integrations in Azure ML, SageMaker, Google Vertex AI, and every major ML framework. No other lakehouse platform matches Databricks’ native integration between data engineering and ML workflows.

1.2  Databricks vs. Microsoft Fabric: Decision Framework

The most common architectural question in the Microsoft ecosystem: when to use Databricks, when to use Fabric, and when to use both. The answer is workload-driven:

| Workload Characteristic | Recommended Platform | Rationale |
| --- | --- | --- |
| Large-scale Spark ETL (>10TB/day) | Databricks | Databricks Runtime 2-5x faster than open-source Spark; better autoscaling for burst workloads |
| Streaming ingestion at high volume (>100K events/sec) | Databricks (Structured Streaming) | Native Kafka connector, exactly-once semantics, DLT auto-scaling for variable throughput |
| ML model training (>1M parameter models) | Databricks | Distributed training (Horovod, Ray), GPU cluster management, native MLflow |
| Business analytics & self-service BI | Microsoft Fabric / Power BI | DirectLake mode, Fabric semantic model, M365 integration, non-technical user experience |
| SQL analytics on small-medium datasets (<1TB) | Microsoft Fabric (Warehouse) | Easier SQL authoring, better Power BI integration, lower cost for SQL-only workloads |
| Enterprise data catalog & governance | Microsoft Purview + Unity Catalog | Purview for estate-wide governance; Unity Catalog for Databricks-specific fine-grained control |
| Mixed workload (ETL + analytics + ML) | Databricks + Fabric (integrated) | Databricks processes & governs; Fabric serves; Unity Catalog bridges via Fabric connector |

1.3  Azure-Native Integration Advantages

Deploying Databricks on Azure (Azure Databricks) provides integration benefits that are unavailable on other cloud providers:

  • Azure Active Directory (Entra ID) as the identity provider — no separate Databricks user management; SSO and conditional access policies apply automatically
  • Azure Private Link for Databricks control plane and data plane — full network isolation without public internet exposure
  • Unity Catalog integration with Microsoft Purview — bidirectional metadata sync for enterprise-wide governance
  • Native connectors to Azure Event Hubs (Kafka protocol), Azure Blob Storage, ADLS Gen2, Azure SQL, Cosmos DB, Azure Synapse — no third-party connector licensing
  • Azure Monitor + Log Analytics for unified observability — Databricks cluster logs, job metrics, and SQL warehouse query history in the same observability stack as all Azure resources
  • Microsoft Fabric shortcut to Databricks-managed Delta tables — Fabric reports can query Databricks Lakehouse tables directly without data movement
Cost Reality Check: Azure Databricks pricing combines DBU (Databricks Unit) consumption with underlying Azure VM costs. A common planning mistake is budgeting only DBU costs. Total cost = DBU cost + VM cost + storage cost + networking cost. In practice, VM cost represents 40-60% of total Databricks spend. This white paper’s cost optimization chapter addresses all four components.
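The four-component breakdown above can be sanity-checked with simple arithmetic. A minimal sketch, in which every rate is a hypothetical placeholder rather than an Azure list price:

```python
# Illustrative only: all rates below are hypothetical placeholders, not Azure list prices.
def monthly_databricks_cost(dbu_hours, dbu_rate, vm_hours, vm_rate,
                            storage_tb, storage_rate_per_tb,
                            egress_gb, egress_rate_per_gb):
    """Total cost = DBU + VM + storage + networking, per the breakdown above."""
    components = {
        "dbu": dbu_hours * dbu_rate,
        "vm": vm_hours * vm_rate,
        "storage": storage_tb * storage_rate_per_tb,
        "network": egress_gb * egress_rate_per_gb,
    }
    components["total"] = sum(components.values())
    return components

cost = monthly_databricks_cost(
    dbu_hours=10_000, dbu_rate=0.40,       # hypothetical DBU rate
    vm_hours=10_000, vm_rate=0.55,         # hypothetical VM rate
    storage_tb=200, storage_rate_per_tb=23.0,
    egress_gb=5_000, egress_rate_per_gb=0.05,
)
print(f"VM share of spend: {cost['vm'] / cost['total']:.0%}")
```

Budgeting only the "dbu" line in a sketch like this is exactly the planning mistake the callout describes.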

CHAPTER 2

Lakehouse Architecture: Foundations & Design Patterns

2.1  The Medallion Architecture

The Medallion Architecture — Bronze, Silver, Gold — is the de facto standard for organizing data in a Databricks Lakehouse. Its value is not the three-tier label but the quality contract each tier enforces, the decoupling it provides between ingestion and consumption, and the lineage clarity it creates for governance teams.

| Layer | Data State | Quality Contract | Consumers | Storage Format |
| --- | --- | --- | --- | --- |
| Bronze (Raw) | Exact copy of source data; no transformations; append-only | Schema-on-read; no business rules applied; complete audit trail of source data | Data engineers, audit/compliance | Delta (with source schema preserved); partitioned by ingestion date |
| Silver (Curated) | Cleaned, deduplicated, conformed; business rules applied; joins across source systems | Schema enforced; data quality rules pass; SLA-driven freshness; no PII in open columns | Data engineers, data scientists, advanced analysts | Delta; partitioned by business date; Z-ordered on high-cardinality query columns |
| Gold (Certified) | Business-ready aggregations, dimensional models, feature stores; domain-specific semantics | Certified by domain owner; SLA guaranteed; documentation complete; sensitivity labelled | Analysts, Power BI / Fabric, ML feature pipelines, APIs | Delta; highly optimized; partitioned and Z-ordered for query patterns |

Extended Medallion: Platinum Layer

High-maturity organizations add a Platinum layer: ML feature store output, aggregated serving tables with sub-second query SLAs, or API-serving materialized views. Platinum tables are the most heavily governed, most tightly SLA-bound, and most aggressively optimized assets in the lakehouse.

2.2  Storage Organization Patterns

The physical organization of the lakehouse in ADLS Gen2 has major implications for performance, security, cost, and governance. Two primary patterns:

Pattern 1 — Zone-Isolated Storage Accounts (Recommended for Enterprise)

Separate ADLS Gen2 storage accounts per medallion layer: adls-bronze-prod, adls-silver-prod, adls-gold-prod. Advantages: independent encryption keys per zone, network security group rules per zone, independent soft-delete and versioning policies, clear billing separation. Disadvantage: slightly more complex pipeline configuration.

Pattern 2 — Container-Isolated Single Account

Single ADLS Gen2 account with containers per layer: /bronze, /silver, /gold. Advantages: simpler configuration, single place for RBAC. Disadvantage: shared encryption key, harder to enforce zone-level network policies, difficult to separate storage costs by layer.

Recommendation: Use zone-isolated storage accounts for any deployment handling regulated data (PCI, PHI, PII at scale). The per-key encryption enables different key rotation policies per zone and makes regulatory scoping unambiguous. For development environments, a single account with containers is sufficient.

2.3  Workspace Architecture for Enterprise

A Databricks workspace is the primary deployment unit — the boundary for compute resources, notebooks, jobs, and most RBAC. Enterprise deployments require deliberate workspace strategy:

| Workspace Pattern | Use Case | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Single workspace | Small teams (<20 engineers), single project, low sensitivity | Simple administration, low overhead | No environment isolation; production/dev share resources and governance |
| Environment-separated (dev/test/prod) | Most enterprise deployments | Environment isolation; CI/CD promotion; prod cluster policies enforced separately | Data sharing between envs requires Delta Sharing or data copies |
| Domain-separated (Data Mesh) | Large orgs with multiple data domains | Domain autonomy; per-domain billing; independent upgrade cycles | Unity Catalog metastore still shared; requires cross-workspace governance coordination |
| Hub-and-spoke (Recommended) | Enterprise with shared platform team + domain teams | Shared infra (hub) + domain autonomy (spokes); central governance + local engineering | Most complex to set up; requires strong platform team |

2.4  Cluster Architecture

Cluster selection is the single biggest driver of both performance and cost. Databricks offers three compute types with fundamentally different operational models:

All-Purpose Clusters: Interactive, long-lived clusters for notebook development and ad hoc analysis. Charged continuously while running. Never use all-purpose clusters for production jobs — auto-termination misconfiguration is the single most common source of unexpected Databricks cost overruns. Enforce auto-termination of 30 minutes via cluster policy.

Job Clusters: Ephemeral clusters created at job start and terminated at job end. The correct compute type for all production jobs. Cost = compute time only. Cluster start time (2-5 minutes) is the only overhead. Use cluster pools to pre-warm instances and reduce start latency to 30-60 seconds.

SQL Warehouses (Serverless preferred): Purpose-built for SQL analytics workloads. Classic SQL Warehouses run on customer-managed VMs. Serverless SQL Warehouses run on Databricks-managed compute and start in < 5 seconds. Use serverless for all interactive BI and analyst SQL workloads. Classic warehouses only when data residency requirements prohibit serverless.

| Cluster Type | Start Time | Cost Model | Best For | Avoid For |
| --- | --- | --- | --- | --- |
| All-Purpose | 3-7 min | Per-second while running | Interactive development, notebooks | Production jobs (cost risk) |
| Job Cluster | 2-5 min (cold) | Per-second for job duration only | All production batch/streaming jobs | Interactive exploration |
| Job Cluster + Pool | 30-60 sec (warm) | Pool idle cost + per-second for job | Latency-sensitive production jobs | Infrequent jobs (pool idle cost not worth it) |
| Serverless SQL Warehouse | < 5 sec | Per-second, serverless pricing | BI analytics, analyst SQL, Power BI DirectQuery | Large Spark batch processing |
| Classic SQL Warehouse | 2-4 min | Per-second (customer VM) | Data residency-constrained SQL analytics | General use (serverless preferred) |
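The table's guidance can be condensed into a small decision helper. A sketch only; the workload labels are hypothetical, not an official Databricks taxonomy:

```python
def recommend_compute(workload: str, latency_sensitive: bool = False) -> str:
    """Map a workload category to the cluster type suggested in the table above."""
    if workload == "interactive":
        return "all-purpose cluster (enforce auto-termination)"
    if workload == "sql-analytics":
        return "serverless SQL warehouse"
    if workload == "production-job":
        # Pools trade idle cost for 30-60 second warm starts
        return "job cluster + pool" if latency_sensitive else "job cluster"
    raise ValueError(f"unknown workload: {workload}")

print(recommend_compute("production-job", latency_sensitive=True))  # → job cluster + pool
```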

CHAPTER 3

Delta Lake: The Storage Layer Deep Dive

3.1  Delta Lake Architecture

Delta Lake is not merely a file format — it is a transactional storage layer that adds ACID guarantees, scalable metadata management, and time travel to Parquet files stored in cloud object storage. Understanding its internals is essential for designing high-performance, cost-efficient lakehouse pipelines.

Every Delta table consists of: a data directory (Parquet files organized by partition), a transaction log directory (_delta_log/) containing JSON commit files, and periodically generated Parquet checkpoint files that compact the JSON transaction log for fast metadata reads.

| Delta Capability | Technical Mechanism | Operational Significance |
| --- | --- | --- |
| ACID transactions | Optimistic concurrency control via transaction log; writers serialize via log entry; readers see consistent snapshots | Multiple concurrent writers without data corruption; safe concurrent reads during writes |
| Time travel | Transaction log preserves all historical versions; TIMESTAMP AS OF or VERSION AS OF syntax | Audit compliance; rollback from bad pipeline runs; reproducible ML training datasets |
| Schema enforcement | Writer validates incoming data against table schema; rejects non-conforming writes by default | Prevents silent data corruption from schema drift in upstream sources |
| Schema evolution | MERGE INTO, ALTER TABLE ADD COLUMN, or mergeSchema option | Safely add new columns without breaking existing readers |
| Scalable metadata | Parquet checkpoints compact JSON log files; column stats stored in log for data skipping | Tables with billions of files read metadata in seconds, not minutes |
| Data skipping | Min/max column statistics per file stored in transaction log; queries skip files outside predicate range | 10-100x query speedup on filtered queries without full table scan |
| Z-ordering | Co-locates related data in same files based on specified columns; improves data skipping effectiveness | Especially effective for high-cardinality columns (customer_id, product_id) used in WHERE clauses |
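To make the data-skipping mechanism concrete, here is a pure-Python sketch of the min/max pruning described above; the file names and statistics are invented for illustration:

```python
# Sketch of Delta-style data skipping: each file carries min/max stats per column
# (stored in the transaction log), and a filtered query reads only the files
# whose value range overlaps the predicate range.
files = [
    {"path": "part-000.parquet", "min_amount": 0,   "max_amount": 99},
    {"path": "part-001.parquet", "min_amount": 100, "max_amount": 499},
    {"path": "part-002.parquet", "min_amount": 500, "max_amount": 10_000},
]

def files_for_predicate(files, lo, hi):
    """Return only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in files if f["max_amount"] >= lo and f["min_amount"] <= hi]

# A query like WHERE amount BETWEEN 200 AND 300 touches one file out of three:
print(files_for_predicate(files, 200, 300))  # → ['part-001.parquet']
```

Z-ordering improves this pruning by packing related values into fewer files, so the min/max ranges overlap fewer predicates.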

3.2  Delta Table Optimization: Production Patterns

OPTIMIZE and Z-ORDER

-- Run after bulk loads or when file count exceeds 1000 per partition
-- Schedule as a post-load job, not inline in the pipeline

-- Basic OPTIMIZE (compacts small files into 1GB target file size)
OPTIMIZE silver.sales_transactions;

-- OPTIMIZE with Z-ORDER (co-locates data by query columns)
-- Z-ORDER columns: choose 1-4 columns used in WHERE/JOIN predicates
-- Do NOT Z-ORDER on partition columns (already co-located by partition)
-- Do NOT Z-ORDER on columns with low cardinality (<100 distinct values)
OPTIMIZE silver.sales_transactions
ZORDER BY (customer_id, product_id);

-- Predictive Optimization (DBR 12.2+): let Databricks decide when to OPTIMIZE
-- Enable at catalog or schema level (Unity Catalog required):
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
-- Predictive Optimization monitors file stats and triggers OPTIMIZE automatically
-- Eliminates need for scheduled OPTIMIZE jobs in most cases

Partitioning Strategy

-- Partitioning rules:
--   Partition only on LOW cardinality columns (date, region, status)
--   Target partition size: 1GB - 10GB of data per partition
--   NEVER partition on high-cardinality columns (customer_id, order_id)
--   High-cardinality partitioning = millions of tiny files = metadata explosion

-- Good: partition by date (daily batches ~1-10GB per day)
CREATE TABLE silver.sales_transactions (
  transaction_id STRING,
  customer_id    STRING,
  product_id     STRING,
  amount         DECIMAL(18,2),
  transaction_ts TIMESTAMP,
  transaction_dt DATE
) USING DELTA
PARTITIONED BY (transaction_dt)
LOCATION 'abfss://silver@storage.dfs.core.windows.net/sales/transactions/';

-- After partitioning by date, Z-ORDER by customer_id and product_id
-- to optimize point lookups within each date partition
OPTIMIZE silver.sales_transactions
WHERE transaction_dt >= current_date() - INTERVAL 7 DAYS
ZORDER BY (customer_id, product_id);
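The 1-10 GB per-partition guideline reduces to quick arithmetic on daily volume. A sketch with hypothetical numbers:

```python
def partition_check(daily_gb: float, target_file_gb: float = 1.0):
    """Check a daily-date partition against the 1-10 GB guideline and estimate
    the post-OPTIMIZE file count per partition (1 GB target file size)."""
    within_guideline = 1.0 <= daily_gb <= 10.0
    est_files = max(1, round(daily_gb / target_file_gb))
    return within_guideline, est_files

ok, files = partition_check(daily_gb=6.0)
print(ok, files)  # → True 6
```

A table ingesting 6 GB/day lands comfortably in the guideline with ~6 compacted files per partition; the same check on 0.05 GB/day would suggest coarser partitioning (weekly or monthly).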

3.3  Change Data Capture with Delta

Delta’s Change Data Feed (CDF) is the recommended pattern for incremental processing pipelines that need to propagate changes downstream without full table scans. CDF captures insert, update, and delete operations at the row level with change type metadata.

-- Enable Change Data Feed on a Delta table
ALTER TABLE silver.customers
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

// Read changes since a specific version or timestamp (Scala)
val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", lastProcessedVersion)  // store in metadata table
  .table("silver.customers")

// _change_type values: insert, update_preimage, update_postimage, delete
// Process only inserts and updates (ignore preimage rows)
val upserts = changes.filter("_change_type IN ('insert', 'update_postimage')")

// Apply to downstream gold table using MERGE
deltaTable.alias("target")
  .merge(upserts.alias("source"), "target.customer_id = source.customer_id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute()

// Store the last processed version for idempotency
// (requires: import org.apache.spark.sql.functions.max)
val lastVersion = changes.agg(max("_commit_version")).first().getLong(0)
spark.sql(s"UPDATE pipeline_metadata.watermarks SET last_version = $lastVersion WHERE table_name = 'silver.customers'")
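The consumption loop above can be illustrated without Spark. This pure-Python sketch mimics the `_change_type` filtering, the MERGE upsert, and the watermark advance on invented change rows:

```python
# Pure-Python sketch of CDF consumption: filter change rows by _change_type,
# apply upserts keyed on customer_id, and advance a watermark for idempotency.
changes = [
    {"customer_id": "c1", "_change_type": "insert",           "_commit_version": 41, "tier": "gold"},
    {"customer_id": "c2", "_change_type": "update_preimage",  "_commit_version": 42, "tier": "silver"},
    {"customer_id": "c2", "_change_type": "update_postimage", "_commit_version": 42, "tier": "gold"},
    {"customer_id": "c3", "_change_type": "delete",           "_commit_version": 43, "tier": "bronze"},
]

target = {}  # downstream gold table keyed by customer_id

# Keep only inserts and post-update images (preimage rows describe the old value)
upserts = [r for r in changes if r["_change_type"] in ("insert", "update_postimage")]
for row in upserts:
    target[row["customer_id"]] = row["tier"]  # MERGE: matched -> update, else insert

# Persist the highest commit version seen so a restart resumes exactly here
watermark = max(r["_commit_version"] for r in changes)
print(target, watermark)  # → {'c1': 'gold', 'c2': 'gold'} 43
```

Replaying the same batch against the same watermark produces the same `target`, which is the idempotency property the metadata table exists to guarantee.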

3.4  Time Travel & Data Retention

| Operation | Syntax | Use Case | Retention Constraint |
| --- | --- | --- | --- |
| Query historical version | SELECT * FROM table VERSION AS OF 42 | Debug pipeline issues; compare before/after transformation | Only works within retention window |
| Query by timestamp | SELECT * FROM table TIMESTAMP AS OF '2025-01-15' | Reproduce ML training dataset as of a specific date | Timestamp must be within retention window |
| Restore table | RESTORE TABLE silver.customers TO VERSION AS OF 38 | Rollback bad pipeline run or accidental delete | Restores table state; does not delete newer log entries |
| View table history | DESCRIBE HISTORY silver.customers | Audit all operations; find version before bad write | Shows full transaction log within retention window |

Default log retention is 30 days; deleted data files are retained for 7 days (the VACUUM window). For regulatory environments requiring longer audit trails, set: ALTER TABLE silver.customers SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 365 days', 'delta.deletedFileRetentionDuration' = 'interval 365 days');
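The retention-window constraint on time travel reduces to a date comparison. A minimal sketch:

```python
from datetime import datetime, timedelta

def within_retention(requested: datetime, now: datetime, retention_days: int = 30) -> bool:
    """Time travel to a timestamp succeeds only if it falls inside the
    configured log retention window (default 30 days)."""
    return now - timedelta(days=retention_days) <= requested <= now

now = datetime(2025, 6, 1)
print(within_retention(datetime(2025, 5, 20), now))       # → True
print(within_retention(datetime(2025, 1, 15), now))       # → False (outside 30-day window)
print(within_retention(datetime(2025, 1, 15), now, 365))  # → True  (after extending retention)
```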

CHAPTER 4

Unity Catalog: Enterprise Data Governance

4.1  Unity Catalog Architecture

Unity Catalog (UC) is Databricks’ unified governance layer — the control plane for data access, audit logging, lineage tracking, and data sharing across all workspaces in an organization. It is the most significant governance advancement in Databricks’ history, replacing the legacy per-workspace Hive metastore with a centrally managed, ANSI-SQL-compatible catalog with fine-grained access control.

The UC object hierarchy determines how permissions propagate and how assets are organized:

| Object Level | Example | Grants Propagate To | Primary Governance Action |
| --- | --- | --- | --- |
| Metastore | Contoso-AUED-Metastore (one per Azure region) | All catalogs, schemas, tables below | Assign metastore admins; attach to workspaces; configure audit log destination |
| Catalog | main, dev_catalog, finance_catalog | All schemas and tables within catalog | Assign catalog owners; set default privileges; configure storage credentials |
| Schema (Database) | main.silver, finance_catalog.reporting | All tables, views, functions within schema | Assign schema owners; set managed location; grant domain team access |
| Table / View | main.silver.sales_transactions | Columns (if column mask applied) | Assign owners; grant SELECT/MODIFY; apply row filters; apply column masks |
| Column | sales_transactions.customer_email | N/A (leaf node) | Apply column masks (e.g., mask PII for non-privileged users) |
| Function | main.silver.calculate_ltv() | N/A | Control which users can call UDFs and table-valued functions |
| Storage Credential | adls-silver-credential | External locations using this credential | Grant to specific workspaces or service principals; never grant to end users |

4.2  Privilege Model & RBAC Design

Unity Catalog uses a privilege model inspired by ANSI SQL GRANT/REVOKE but extended for cloud-native data access patterns. Understanding privilege inheritance prevents common over-permissioning mistakes.

Privilege Inheritance Rules

  • Privileges granted at a higher level (catalog) are inherited by all objects below (schemas, tables, columns)
  • DENY is not supported — UC uses positive permissions only; to restrict access, do not grant at that level
  • Table owner automatically has all privileges on that table; schema owner has all privileges on all tables in the schema
  • USE CATALOG and USE SCHEMA are “traversal” privileges — required to access any object within, but do not grant read access to data
  • SELECT on a table requires: USE CATALOG on the parent catalog AND USE SCHEMA on the parent schema AND SELECT on the table — all three must be explicitly granted
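The three-grant rule can be sketched as a pure-Python access check; the grant tuples and group membership below are hypothetical:

```python
# Sketch of the "all three grants" rule above: SELECT on a table is effective
# only when USE CATALOG, USE SCHEMA, and SELECT are all present (directly or
# via inheritance). Grants are modeled as (privilege, securable) pairs.
def can_select(grants, catalog, schema, table):
    return (
        ("USE CATALOG", catalog) in grants
        and ("USE SCHEMA", f"{catalog}.{schema}") in grants
        and ("SELECT", f"{catalog}.{schema}.{table}") in grants
    )

# A hypothetical analyst granted only the gold tier:
analyst = {
    ("USE CATALOG", "main"),
    ("USE SCHEMA", "main.gold"),
    ("SELECT", "main.gold.sales_summary"),
}
print(can_select(analyst, "main", "gold", "sales_summary"))  # → True
print(can_select(analyst, "main", "silver", "customers"))    # → False (no USE SCHEMA on silver)
```

Because there is no DENY, restricting silver access is achieved simply by never granting the traversal privilege on that schema.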

Recommended RBAC Pattern

-- Unity Catalog RBAC: Principle of Least Privilege
-- Use Entra ID (AAD) groups, not individual users

-- Tier 1: Read-only analysts (most common role)
GRANT USE CATALOG ON CATALOG main TO `data-analysts@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-analysts@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.gold TO `data-analysts@contoso.com`;
-- Note: analysts CAN see gold data but cannot see silver or bronze

-- Tier 2: Data engineers (read silver/bronze, write gold)
GRANT USE CATALOG ON CATALOG main TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.bronze TO `data-engineers@contoso.com`;
GRANT SELECT ON ALL TABLES IN SCHEMA main.silver TO `data-engineers@contoso.com`;
GRANT SELECT, MODIFY ON ALL TABLES IN SCHEMA main.gold TO `data-engineers@contoso.com`;

-- Tier 3: Service principals for production pipelines
-- Use service principals (not groups) for job clusters
GRANT USE CATALOG, USE SCHEMA ON CATALOG main TO `sp-pipeline-prod`;
GRANT SELECT ON SCHEMA main.silver TO `sp-pipeline-prod`;
GRANT SELECT, MODIFY ON SCHEMA main.gold TO `sp-pipeline-prod`;
-- Principle: pipelines have exactly the permissions needed for their specific tables

4.3  Column Masks & Row Filters

Column masks and row filters are the most powerful and underutilized Unity Catalog features. They implement attribute-based access control (ABAC) at the data level — the same data asset serves different data consumers with different visibility based on their group membership or attributes, without duplicating data or maintaining separate views.

-- Column Mask: Mask email addresses for non-privileged users
-- Users in "data-scientists-privileged" group see real email
-- All other users see masked version

CREATE FUNCTION main.security.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('data-scientists-privileged') THEN email
  ELSE CONCAT(LEFT(email, 2), '***@***.com')
END;

ALTER TABLE main.silver.customers
ALTER COLUMN customer_email
SET MASK main.security.mask_email;


-- Row Filter: Data scientists only see rows for their assigned region
-- Useful for multi-tenant datasets, regional data residency, or cost centre allocation

CREATE FUNCTION main.security.filter_by_region(region STRING)
RETURNS BOOLEAN
RETURN is_member('global-data-access')      -- global team sees all
    OR is_member(CONCAT('region-', region)); -- regional team sees their region

ALTER TABLE main.silver.sales_transactions
SET ROW FILTER main.security.filter_by_region ON (region);

-- Verify mask/filter is working (run as non-privileged user):
SELECT customer_id, customer_email FROM main.silver.customers LIMIT 5;
-- Should return: "jo***@***.com" style masked emails
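For clarity, here is a pure-Python replica of the `mask_email` logic, with `is_member` stubbed by a group set; the email address is invented:

```python
# Replica of the mask_email UDF above, showing the intended output shape.
# Group membership (normally resolved by Unity Catalog) is stubbed as a set.
def mask_email(email: str, groups: set) -> str:
    if "data-scientists-privileged" in groups:
        return email                      # privileged users see the real value
    return email[:2] + "***@***.com"      # everyone else sees a masked value

print(mask_email("john.smith@contoso.com", set()))
# → jo***@***.com
print(mask_email("john.smith@contoso.com", {"data-scientists-privileged"}))
# → john.smith@contoso.com
```

The key property is that both callers query the same table with the same SQL; only the evaluated mask differs.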

4.4  Automated Lineage & Data Discovery

Unity Catalog automatically captures lineage for all operations executed through Databricks — SQL queries, Spark DataFrame transformations, Python, R, and Scala notebooks — without any instrumentation code. Lineage is captured at the column level and stored in the UC metastore, queryable via the Databricks REST API or viewable in the Catalog Explorer UI.

Lineage from Unity Catalog is also exported to Microsoft Purview via the native connector, creating a unified lineage graph that spans Databricks (engineering/ML) and the broader Microsoft data estate (Fabric, ADF, Power BI). This is the recommended governance architecture for organizations using both platforms.

| Lineage Source | Granularity | Capture Method | Purview Export |
| --- | --- | --- | --- |
| Spark SQL (notebooks, jobs) | Column-level | Automatic (UC query analysis) | Yes, via Purview-UC connector |
| Python DataFrame API | Table-level | Automatic (I/O operation capture) | Yes |
| Delta Live Tables pipelines | Column-level (full DAG) | Automatic (DLT execution graph) | Yes |
| ML training runs (MLflow) | Dataset-to-model | Automatic (MLflow UC integration) | Yes (model entity in Purview) |
| dbt on Databricks | Column-level (dbt metadata) | Via dbt-databricks adapter + UC integration | Yes (via dbt-purview connector) |

CHAPTER 5

Delta Live Tables: Declarative Pipeline Engineering

5.1  DLT Architecture & Value Proposition

Delta Live Tables (DLT) is Databricks’ declarative pipeline framework that manages the full lifecycle of data pipelines: execution ordering, dependency resolution, error handling, retry logic, data quality enforcement, autoscaling, and monitoring. Instead of writing orchestration logic, engineers declare what data should look like — DLT determines how to produce it.

The productivity gain is substantial. A pipeline that required 800 lines of Spark code with custom error handling, dependency management, and retry logic can be expressed in 150 lines of DLT SQL or Python. More importantly, the operational properties (idempotency, exactly-once semantics, automatic backfill) are guaranteed by the framework, not the engineer.

5.2  DLT Pipeline Design Patterns

Pattern 1 — Full Medallion DLT Pipeline (SQL)

-- Bronze: Ingest raw events from Event Hubs (streaming)
CREATE OR REFRESH STREAMING LIVE TABLE bronze_events
COMMENT "Raw events from IoT sensors via Event Hubs. Schema preserved exactly."
AS SELECT * FROM cloud_files(
  "abfss://landing@storage.dfs.core.windows.net/events/",
  "json",
  map("cloudFiles.inferColumnTypes", "true",
      "cloudFiles.schemaLocation", "/checkpoints/bronze_events/schema")
);
 
-- Silver: Clean, validate, and enrich (streaming)
-- Note: DLT supports EXPECT (warn), ON VIOLATION DROP ROW, and ON VIOLATION FAIL UPDATE;
-- quarantining is built as a separate table with the inverse expectation.
CREATE OR REFRESH STREAMING LIVE TABLE silver_events (
  CONSTRAINT valid_device_id EXPECT (device_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_temperature EXPECT (temperature BETWEEN -50 AND 150) ON VIOLATION DROP ROW,
  CONSTRAINT valid_timestamp EXPECT (event_ts > '2020-01-01') ON VIOLATION FAIL UPDATE
)
COMMENT "Validated and enriched sensor events. Dropped rows are tracked in the DLT event log."
AS SELECT
  e.device_id,
  e.temperature,
  e.humidity,
  e.event_ts,
  d.device_name,
  d.location,
  d.manufacturer
FROM STREAM(LIVE.bronze_events) e
LEFT JOIN LIVE.dim_devices d ON e.device_id = d.device_id;
 
-- Gold: Business aggregation (batch, refreshed every 15 minutes)
CREATE OR REFRESH LIVE TABLE gold_device_hourly_stats
COMMENT "Hourly aggregated stats per device. Power BI source."
AS SELECT
  device_id,
  location,
  date_trunc('HOUR', event_ts) AS hour,
  avg(temperature) AS avg_temp,
  max(temperature) AS max_temp,
  min(temperature) AS min_temp,
  avg(humidity) AS avg_humidity,
  count(*) AS event_count
FROM LIVE.silver_events
GROUP BY device_id, location, date_trunc('HOUR', event_ts);
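The gold-layer aggregation can be restated as a pure-Python sketch on invented events, performing the same hourly truncation and grouping the SQL expresses:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sensor events; the SQL above does the same thing at scale.
events = [
    {"device_id": "d1", "temperature": 20.0, "event_ts": datetime(2025, 1, 1, 9, 5)},
    {"device_id": "d1", "temperature": 24.0, "event_ts": datetime(2025, 1, 1, 9, 40)},
    {"device_id": "d1", "temperature": 30.0, "event_ts": datetime(2025, 1, 1, 10, 2)},
]

buckets = defaultdict(list)
for e in events:
    # Equivalent of date_trunc('HOUR', event_ts)
    hour = e["event_ts"].replace(minute=0, second=0, microsecond=0)
    buckets[(e["device_id"], hour)].append(e["temperature"])

stats = {
    key: {"avg_temp": sum(temps) / len(temps), "max_temp": max(temps), "event_count": len(temps)}
    for key, temps in buckets.items()
}
print(stats[("d1", datetime(2025, 1, 1, 9))])
# → {'avg_temp': 22.0, 'max_temp': 24.0, 'event_count': 2}
```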

5.3  Data Quality with Expectations

DLT Expectations are the single most important feature for production pipeline reliability. They formalize data quality rules as executable constraints, replacing ad hoc validation notebooks and undocumented “it looks wrong” discoveries.

| Expectation Mode | Behaviour | Use When | Monitoring |
| --- | --- | --- | --- |
| EXPECT (default: warn) | Invalid rows are kept and flow through; violations are counted in pipeline metrics | Observing data quality without changing results | Review expectation metrics in the DLT event log; alert on rising violation rates |
| ON VIOLATION DROP ROW | Invalid rows are silently dropped; valid rows continue processing | Noise in source data is expected and acceptable; business logic tolerates missing records | Metrics track dropped row count; alert if drop rate exceeds threshold |
| Quarantine pattern (separate table with the inverse expectation; not a built-in ON VIOLATION mode) | Invalid rows routed to a quarantine table; valid rows continue | Invalid rows need investigation without blocking the pipeline; compliance requires audit of rejected data | Query the quarantine table; alert on non-zero counts for critical constraints |
| ON VIOLATION FAIL UPDATE | Pipeline fails if a constraint is violated; no partial writes | Zero tolerance for invalid data; downstream consumers cannot handle bad data | Pipeline failure triggers alert; requires manual investigation before rerun |
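How an engine applies these modes can be sketched in pure Python; the constraint names mirror the pipeline above, but this is an illustration, not DLT internals:

```python
# Sketch of expectation enforcement: each constraint has a predicate and a
# violation mode ("warn", "drop", or "fail"); violations are counted either way.
def apply_expectations(rows, constraints):
    kept, metrics = [], {name: 0 for name, _, _ in constraints}
    for row in rows:
        drop = False
        for name, predicate, mode in constraints:
            if not predicate(row):
                metrics[name] += 1
                if mode == "fail":
                    raise ValueError(f"constraint {name} violated: {row}")
                if mode == "drop":
                    drop = True
        if not drop:
            kept.append(row)
    return kept, metrics

constraints = [
    ("valid_device_id", lambda r: r["device_id"] is not None, "drop"),
    ("valid_temperature", lambda r: -50 <= r["temperature"] <= 150, "warn"),
]
rows = [
    {"device_id": "d1", "temperature": 21.0},
    {"device_id": None, "temperature": 500.0},  # violates both constraints
]
kept, metrics = apply_expectations(rows, constraints)
print(len(kept), metrics)  # → 1 {'valid_device_id': 1, 'valid_temperature': 1}
```

The metrics dictionary plays the role of the DLT event log: even "warn" violations are observable, which is what makes threshold-based alerting possible.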

5.4  DLT Enhanced Autoscaling

DLT Enhanced Autoscaling is the recommended compute mode for streaming pipelines with variable throughput. Unlike standard Databricks autoscaling (which scales based on task queue depth), DLT Enhanced Autoscaling uses pipeline-level metrics — lag behind source, processing rate, memory pressure — to make scaling decisions optimized for streaming throughput.

// DLT Pipeline configuration for Enhanced Autoscaling
// (shown as annotated JSON; strip the comments before use)
{
  "name": "iot-sensor-pipeline",
  "clusters": [{
    "label": "default",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 20,
      "mode": "ENHANCED"  // Key: use ENHANCED not LEGACY
    },
    "node_type_id": "Standard_DS4_v2",
    "spark_conf": {
      "spark.databricks.delta.optimizeWrite.enabled": "true",
      "spark.databricks.delta.autoCompact.enabled": "true"
    }
  }],
  "continuous": true,   // For streaming: always-on pipeline
  // "development": false  (production mode: no interactive restarts)
  "channel": "CURRENT", // Production: always use CURRENT, not PREVIEW
  "edition": "ADVANCED" // Required for Enhanced Autoscaling + expectations
}
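The difference from queue-depth scaling can be illustrated with a lag-driven sizing function. A sketch only; the thresholds are hypothetical, not the actual DLT heuristics:

```python
# Illustrative lag-driven scaling in the spirit of Enhanced Autoscaling:
# size the cluster on how far the pipeline is behind its source, not on
# task queue depth. Thresholds below are invented for demonstration.
def next_worker_count(current, min_workers, max_workers, lag_seconds):
    if lag_seconds > 300:      # falling badly behind: scale out aggressively
        target = current * 2
    elif lag_seconds < 30:     # comfortably caught up: scale in gently
        target = current - 1
    else:                      # keeping pace: hold steady
        target = current
    return max(min_workers, min(max_workers, target))

print(next_worker_count(4, 2, 20, lag_seconds=600))   # → 8
print(next_worker_count(4, 2, 20, lag_seconds=10))    # → 3
print(next_worker_count(16, 2, 20, lag_seconds=900))  # → 20 (clamped to max_workers)
```

The asymmetry (double out, step in) mirrors the general streaming principle that falling behind is costlier than briefly over-provisioning.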

CHAPTER 6

Machine Learning at Scale: MLflow & Model Serving

6.1  The Databricks ML Platform

Databricks’ ML platform integrates three components into a unified workflow: the Databricks Runtime for ML (pre-installed with TensorFlow, PyTorch, scikit-learn, XGBoost, and 50+ ML libraries), MLflow for experiment tracking and model lifecycle management, and Model Serving for production model deployment. Together, they reduce the time from experiment to production from weeks to hours.

6.2  MLflow: Experiment Tracking & Model Registry

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Register models in Unity Catalog rather than the workspace registry
mlflow.set_registry_uri("databricks-uc")

# Set experiment (creates if not exists)
mlflow.set_experiment("/Shared/churn-prediction/experiments/gbm-v2")

with mlflow.start_run(run_name="gbm-depth6-lr0.05") as run:

    # Log hyperparameters
    params = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05}
    mlflow.log_params(params)

    # Train model
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)
    mlflow.log_metric("train_auc", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))

    # Log model with input example and signature (required for Model Serving)
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model, "model",
        signature=signature,
        input_example=X_train[:5],
        registered_model_name="main.ml_models.churn_prediction"  # UC three-level name
    )

print(f"Run ID: {run.info.run_id} | AUC: {auc:.4f}")

6.3  Model Registry & Deployment Lifecycle

Stage | Registry Status | Who Transitions | Automated Gates
Development | None (experiment runs) | Data Scientist | N/A — free experimentation
Candidate | Registered (no alias) | Data Scientist | Auto: unit tests, schema validation, bias metrics check
Staging | @staging alias | ML Engineer | Auto: integration test, shadow-mode evaluation against production traffic, data drift check
Production | @production alias | ML Lead (approval gate) | Auto: A/B test gate (>2% lift required), latency SLA check (<100ms p99), manual sign-off
Archived | @archived alias | Automated (on new production) | Retain for 90 days for rollback capability; auto-delete after retention window

6.4  Model Serving: Production Deployment

Databricks Model Serving provides serverless, auto-scaling REST endpoints for ML models registered in Unity Catalog. The endpoint handles traffic routing, A/B testing, canary deployments, and feature lookup — eliminating the need for a separate model serving infrastructure.

# Deploy model to Databricks Model Serving via REST API
# (also configurable via UI or Terraform)
 
import requests
import json
 
token = dbutils.secrets.get(scope="ml-secrets", key="databricks-token")
workspace_url = "https://adb-<workspace-id>.azuredatabricks.net"

endpoint_config = {
    "name": "churn-prediction-v2",
    "config": {
        "served_models": [
            {
                "name": "churn-gbm-production",
                "model_name": "main.ml_models.churn_prediction",
                "model_version": "3",
                "workload_size": "Small",   # Small/Medium/Large (CPU); GPU options available
                "scale_to_zero_enabled": True,  # Cost saving for non-critical endpoints
            }
        ],
        "traffic_config": {
            "routes": [{"served_model_name": "churn-gbm-production", "traffic_percentage": 100}]
        }
    }
}

response = requests.post(
    f"{workspace_url}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config
)
# Endpoint ready in ~5 minutes. SLA: <100ms p99 for Small workload size.

6.5  Feature Store

The Databricks Feature Store (now Unity Catalog Feature Store) solves the training-serving skew problem — the most common source of ML model degradation in production. Features computed during training are stored in Unity Catalog, and the same feature values are retrieved at inference time, guaranteeing consistency.

Key capability: Point-in-time lookups. The Feature Store retrieves the feature value that was valid at the timestamp of each training record — not the current value. This is critical for time-series problems (churn prediction, fraud detection, demand forecasting) where using future feature values in training creates data leakage.
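To make the point-in-time semantics concrete, here is a minimal pure-Python sketch (the feature history and customer key are hypothetical illustrations, not the Feature Store API; the real Feature Store performs this lookup as a join over Delta tables at scale):

```python
import bisect
from datetime import date

# Hypothetical feature history: (effective_date, value) pairs per entity,
# sorted ascending by date — analogous to a time-series feature table.
feature_history = {
    "cust_1": [(date(2024, 1, 1), 0.12), (date(2024, 3, 1), 0.34), (date(2024, 6, 1), 0.55)],
}

def point_in_time_lookup(history, as_of):
    """Return the feature value that was valid at `as_of`: the latest
    entry whose effective date is <= as_of (None if no entry qualifies)."""
    dates = [d for d, _ in history]
    i = bisect.bisect_right(dates, as_of)
    return history[i - 1][1] if i else None

# A training record dated 2024-04-15 must see the March value (0.34),
# never the future June value (0.55) — using it would be data leakage.
print(point_in_time_lookup(feature_history["cust_1"], date(2024, 4, 15)))
```

The same rule, applied per training row at the row's own timestamp, is what eliminates training-serving skew for time-series models.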

CHAPTER 7

Azure Integration Architecture

7.1  Network Architecture

Production Databricks deployments on Azure require deliberate network design. The default “no VNet injection” configuration routes all cluster traffic through public endpoints — acceptable for development, unacceptable for regulated data workloads.

VNet Injection (Required for Production)

VNet injection deploys Databricks cluster VMs into a customer-managed Azure Virtual Network. Both the control plane (Databricks-managed) and data plane (cluster VMs) communicate through private endpoints. This architecture enables:

  • Databricks clusters that cannot be reached from the public internet
  • Outbound traffic routed through customer-managed Azure Firewall or NVA with full packet inspection
  • Private endpoint connectivity from clusters to ADLS Gen2, Azure Key Vault, Azure SQL, Event Hubs, and all other Azure PaaS services
  • NSG rules that restrict lateral movement between cluster nodes and to/from non-approved destinations
Network Component | Configuration | Purpose
VNet | Dedicated /16 or /17 (e.g., 10.10.0.0/16) | Container for all Databricks network resources
Public Subnet | /17 or /18; NSG with Databricks service tag rules | Cluster driver nodes; required by Databricks (even in "no public IP" mode)
Private Subnet | /17 or /18; NSG with Databricks service tag rules | Cluster worker nodes; no public IP; data plane traffic
Azure Private Endpoints | One per PaaS service (ADLS, Key Vault, ACR, etc.) | Private IP access to Azure services; eliminates data exfiltration via public endpoints
Azure Firewall / NVA | Custom route table (0.0.0.0/0 → Firewall) | Inspect and log all outbound traffic; allow only approved destinations
Private DNS Zones | One per private endpoint service type | Resolve Azure service hostnames to private IPs within the VNet

7.2  Identity Integration with Azure Active Directory

Azure Databricks uses Azure Active Directory (Entra ID) as the exclusive identity provider when Unity Catalog is enabled. This eliminates Databricks-local user management and enables:

  • Single Sign-On via Entra ID SAML 2.0 — users authenticate with their corporate credentials
  • Conditional Access Policies — MFA requirements, device compliance, IP location restrictions apply to Databricks access
  • Group synchronization via SCIM — Entra ID security groups sync to Databricks automatically; deprovisioning in Entra ID immediately revokes Databricks access
  • Service principal authentication — pipelines and automation use managed identities or service principals with federated credentials; no secrets in code

7.3  Key Azure Integration Patterns

Azure Service | Integration Pattern | Authentication | Use Case
ADLS Gen2 | Direct mount via Unity Catalog external location; no storage credentials in notebooks | Managed Identity on cluster; or service principal via Unity Catalog storage credential | All Bronze/Silver/Gold storage; primary lakehouse storage layer
Azure Event Hubs | Kafka-compatible connector (EventHubsConf); or Azure Event Hubs Spark connector | SAS token via Key Vault secret; or Managed Identity (recommended) | Streaming ingestion into Bronze tables via DLT or Structured Streaming
Azure Key Vault | Databricks secret scope backed by Azure Key Vault | Managed Identity; no AKV credentials stored in Databricks | All secrets (connection strings, API keys, tokens); never hardcode credentials
Azure SQL Database | JDBC connector; or recommended: Data Factory staging to Delta | Service principal with SQL login; or Managed Identity (SQL MI) | Incremental extraction from operational databases into Bronze
Azure Data Factory | ADF Databricks Notebook/Jar/Python activities | Service principal with Contributor role on workspace | Orchestrate Databricks jobs from ADF parent pipelines; cross-system dependencies
Microsoft Fabric | Fabric shortcut to Delta tables in ADLS Gen2; or Delta Sharing protocol | Entra ID passthrough; or Delta Sharing credential | Serve Databricks gold tables in Fabric/Power BI without data movement
Azure Monitor | Cluster log delivery to Log Analytics via diagnostic settings | Workspace MSI; no configuration required beyond enabling | Unified observability; query Databricks logs alongside all Azure resources

7.4  Secrets Management

# NEVER do this:
# connection_string = "Server=myserver.database.windows.net;User=admin;Password=P@ssw0rd!"

# ALWAYS do this: Key Vault-backed secret scope

# Step 1: Create Key Vault-backed secret scope (one-time, via Databricks CLI)
# databricks secrets create-scope \
#   --scope "production-secrets" \
#   --scope-backend-type AZURE_KEYVAULT \
#   --resource-id "/subscriptions/…/vaults/contoso-kv-prod" \
#   --dns-name "https://contoso-kv-prod.vault.azure.net/"

# Step 2: Read secret in notebook/job (secret never exposed in plain text)
jdbc_password = dbutils.secrets.get(scope="production-secrets", key="sql-db-password")

# The value of jdbc_password is REDACTED in all Databricks logs
# print(jdbc_password) → [REDACTED]

# Step 3: Use in connection
df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://server.database.windows.net") \
    .option("dbtable", "dbo.orders") \
    .option("user", dbutils.secrets.get("production-secrets", "sql-db-user")) \
    .option("password", jdbc_password) \
    .load()

CHAPTER 8

Security Architecture & Zero-Trust Design

8.1  Security Layers

A production Databricks security architecture operates across five layers. Each layer is necessary; none is sufficient alone:

Security Layer | Controls | Databricks Mechanism
Identity & Authentication | SSO, MFA, conditional access, SCIM provisioning, service principal governance | Entra ID integration, Unity Catalog identity management
Network | VNet injection, private endpoints, NSG rules, firewall egress control, no-public-IP clusters | Azure VNet injection, Private Link, Azure Firewall integration
Data Access | Table/column/row-level access control, column masks, row filters, data masking | Unity Catalog RBAC, column masks, row filters
Compute | Cluster policies, single-user access mode, no SSH access, immutable cluster configuration | Cluster access control, cluster policies, admin restrictions
Audit & Monitoring | All data access logged, admin actions logged, anomaly detection, retention compliance | Unity Catalog audit logs, Azure Monitor, Databricks audit log delivery

8.2  Cluster Access Modes

Cluster access mode is one of the most critical security decisions in Databricks. The access mode determines who can use the cluster and what data they can access — a misconfigured access mode is the most common source of unintended data exposure in Databricks deployments.

Access Mode | Who Can Use | Unity Catalog Support | Use Case | Recommended?
Single User | One user or service principal only | Full support (all features) | Production jobs, sensitive ML workloads, privileged analyst access | Yes — for all production jobs
Shared | Multiple users; process isolation via kernel namespacing | Full support | Shared interactive clusters for teams with different permissions | Yes — for multi-user interactive
No Isolation Shared (legacy) | Multiple users; NO process isolation | Limited (no row filters, column masks) | Legacy workloads only | No — avoid for any new deployment
Custom (legacy) | Databricks-local users/groups only | Not supported | Pre-UC legacy workloads | No — migrate to Unity Catalog
 Security Anti-Pattern: Many organizations set cluster access mode to “No Isolation Shared” because it was the historical default and “things just work.” This mode provides zero process isolation between users — a notebook run by User A can access the memory, temp files, and Spark state of User B running concurrently on the same cluster. Never use No Isolation Shared mode in production environments handling sensitive data.

8.3  Cluster Policies

Cluster policies enforce governance controls on cluster creation — preventing cost overruns, security misconfigurations, and unapproved software. In enterprise deployments, cluster policies are mandatory: without them, any user with CREATE CLUSTER permission can launch arbitrarily large clusters with any configuration.

// Cluster Policy JSON: Production Job Clusters
// Enforce: spot instances, auto-termination, approved node types, no SSH
{
  "cluster_type": {"type": "fixed", "value": "job"},  // Jobs only, no all-purpose
  "autotermination_minutes": {
    "type": "range", "minValue": 10, "maxValue": 30, "defaultValue": 20
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS4_v2", "Standard_DS5_v2", "Standard_E8s_v3", "Standard_E16s_v3"]
  },
  "azure_attributes.availability": {
    "type": "fixed", "value": "SPOT_WITH_FALLBACK_AZURE"  // Spot instances required
  },
  "ssh_public_keys": {"type": "fixed", "value": []},  // No SSH access
  "data_security_mode": {
    "type": "fixed", "value": "SINGLE_USER"  // Enforce single-user access mode
  },
  "runtime_engine": {"type": "fixed", "value": "PHOTON"}  // Require Photon for performance
}

CHAPTER 9

Performance Engineering & Cost Optimization

9.1  Photon Engine

Photon is Databricks’ C++ vectorized query engine that replaces the JVM-based Spark execution engine for SQL and DataFrame operations. Photon delivers 2-8x query speedup for analytical workloads with no code changes required — it is drop-in compatible with any Spark SQL or DataFrame API code.

Workload Type | Photon Speedup | Recommendation
Aggregations (GROUP BY, COUNT, SUM) | 3-8x | Enable Photon on all analytics clusters; the DBU premium pays back immediately
Hash joins on large tables | 2-5x | Critical for gold layer transformations with multi-table joins
Sorting and ordering | 2-4x | Especially impactful for OPTIMIZE Z-ORDER operations
Data scans with filters (WHERE clauses) | 2-6x | Combined with Delta data skipping, Photon makes scans of 10TB tables feel like point lookups
Streaming (Structured Streaming) | 1.5-3x | Enable on DLT clusters; reduces streaming lag during throughput spikes
Complex Python UDFs | No improvement | Python UDFs bypass Photon; rewrite as SQL expressions or pandas UDFs for Photon benefit

9.2  Cost Optimization: The Four Levers

Lever 1 — Spot Instances

Spot (preemptible) instances reduce VM cost by 60-80% compared to on-demand pricing. Databricks handles spot instance interruptions transparently: if a worker is preempted, the task is automatically retried on a replacement worker. The job runs longer but at 1/3 to 1/5 the compute cost.

Use SPOT_WITH_FALLBACK_AZURE in cluster policies: spot instances are preferred, with automatic fallback to on-demand capacity when no spot capacity is available. In practice, Azure spot availability for DS4_v2 and DS5_v2 exceeds 95% in major regions, making spot suitable for all production batch jobs.
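The fallback behaviour can be expressed directly in a cluster specification. The sketch below shows the relevant `azure_attributes` block of a cluster spec (node type and worker counts are illustrative): `first_on_demand: 1` keeps the driver on on-demand capacity, and `spot_bid_max_price: -1` bids up to the on-demand price.

```json
{
  "autoscale": {"min_workers": 2, "max_workers": 16},
  "node_type_id": "Standard_DS4_v2",
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  }
}
```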

Lever 2 — Cluster Right-Sizing

// Finding the right cluster size: use Ganglia metrics or the Databricks Compute UI
//
// Symptoms of an over-provisioned cluster:
// - Executor CPU utilization < 40% during active processing
// - Memory utilization < 50% (driver and executors)
// - Shuffle spill to disk = 0 (could handle smaller workers)
//
// Symptoms of an under-provisioned cluster:
// - Executor CPU utilization > 85% sustained
// - Shuffle spill to disk > 0 (increase worker count or use memory-optimized VMs)
// - High task time variance (some tasks 10x slower = data skew; fix in code, not hardware)
//
// Rule of thumb: start with 1 worker per 50GB of data processed
// For memory-intensive operations (ML, wide joins): use E-series (memory-optimized)
// For compute-intensive operations (Photon SQL, streaming): use D-series
//
// Enable cluster autoscaling for variable workloads:
// min_workers = minimum needed for SLA at lowest throughput
// max_workers = maximum needed at peak throughput
// autoscale.mode = "ENHANCED" for DLT, "LEGACY" for general clusters

Lever 3 — Serverless Compute

Databricks Serverless SQL Warehouses and Serverless Job Compute eliminate the cluster start time and idle cost problem entirely. With serverless, compute is ready in < 5 seconds, scales to zero when not in use, and charges only for active query time. For workloads with unpredictable or bursty patterns, serverless is almost always the lowest total cost option.

Lever 4 — Storage Optimization

Storage Optimization | Implementation | Cost Impact
ADLS Gen2 tiering | Move Bronze data older than 90 days to Cool tier; archive after 1 year | 40-70% reduction in Bronze storage cost
Delta VACUUM | Run VACUUM … RETAIN 720 HOURS (30 days) weekly on all Delta tables | Removes Parquet files left behind by deleted/updated rows; 20-40% storage reduction on high-churn tables
Compression tuning | Use Snappy for hot data (faster decompression); Zstd for cold data (better ratio) | 10-30% storage reduction with no query performance impact for ADLS-backed tables
Predictive Optimization | Enable UC Predictive Optimization to auto-manage OPTIMIZE and VACUUM | Avoids over-optimizing small tables and missed VACUUM runs; saves engineering time
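As a concrete sketch of the weekly maintenance described above (the table name is hypothetical), the work amounts to two SQL statements; note that VACUUM expresses its retention window in hours:

```sql
-- Compact small files and co-locate rows on a commonly filtered column
OPTIMIZE main.silver.transactions ZORDER BY (customer_id);

-- Remove files unreferenced by any table version newer than 30 days (720 hours)
VACUUM main.silver.transactions RETAIN 720 HOURS;
```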

9.3  Handling Data Skew

Data skew — where some partition keys have vastly more rows than others — is the most common cause of mysterious Spark job slowdowns that more hardware cannot fix. Identifying and resolving skew is a core Databricks engineering competency.

# Identify skew: look for tasks taking 10x+ longer than the median task
# In Spark UI: Stage Detail → Task Metrics → sort by Duration descending

# Solution 1: Salting (for joins with skewed keys)
# Add a random salt to the skewed key to distribute rows across partitions
from pyspark.sql import functions as F

SALT_FACTOR = 20  # Tune based on skew severity; more salt = more parallelism

# Salt the large (skewed) table
large_df_salted = large_df.withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"), (F.rand() * SALT_FACTOR).cast("int"))
)

# Explode the small (lookup) table to match every salt value
small_df_exploded = small_df.crossJoin(
    spark.range(SALT_FACTOR).select(F.col("id").cast("string").alias("salt"))
).withColumn(
    "customer_id_salted",
    F.concat(F.col("customer_id"), F.lit("_"), F.col("salt"))
)

# Join on the salted key
result = large_df_salted.join(small_df_exploded, "customer_id_salted")

# Solution 2: Adaptive Query Execution (AQE) — enable for automatic skew handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE automatically splits skewed partitions at runtime (Databricks Runtime 8.0+)
# Enable first; only implement manual salting if AQE is insufficient

CHAPTER 10

Operations, Monitoring & SRE Practices

10.1  Observability Stack

A production Databricks deployment requires three observability layers: job-level monitoring (did the job succeed? how long did it take?), cluster-level monitoring (CPU, memory, network utilization), and data-level monitoring (did the data arrive? is it fresh? does it pass quality checks?).

Observability Layer | Data Source | Tooling | Key Alerts
Job & Pipeline | Databricks Jobs API, DLT pipeline events, MLflow run metrics | Azure Monitor (Log Analytics), Databricks Workflow emails | Job failure, duration > SLA, DLT expectation violation rate spike
Cluster & Compute | Databricks cluster metrics to Azure Monitor, Ganglia (legacy) | Azure Monitor dashboards, custom KQL queries | Spot interruption rate > 10%, memory OOM events, driver failure
Data Quality | DLT expectations metrics, custom Great Expectations/Soda suites | Databricks SQL dashboards, Power BI, PagerDuty | Null rate spike, row count anomaly, freshness breach (> SLA hours since last update)
Cost | Databricks cost API, Azure Cost Management | Azure Cost Management dashboards + budget alerts | Daily DBU spend > budget threshold, unexpected long-running cluster

10.2  CI/CD for Databricks

Production Databricks deployments require a CI/CD pipeline that tests notebooks and jobs before deployment, promotes code through environments (dev → test → prod), and maintains configuration as code for all workspace resources.

# Azure DevOps Pipeline: Databricks CI/CD (YAML)
# Stages: Test → Deploy to Test → Integration Test → Deploy to Prod

trigger:
  branches:
    include: [main, release/*]

stages:
- stage: Test
  jobs:
  - job: UnitTests
    steps:
    - script: pip install databricks-connect pytest pytest-mock
    - script: pytest tests/unit/ -v --junitxml=test-results.xml
    - task: PublishTestResults@2
      inputs:
        testResultsFiles: test-results.xml

- stage: DeployTest
  dependsOn: Test
  jobs:
  - job: DeployToTest
    steps:
    - script: |
        pip install databricks-cli
        databricks workspace import_dir ./notebooks /Shared/pipelines/test --overwrite
        databricks jobs reset --job-id $TEST_JOB_ID --json @job_configs/test/ingest_job.json
      env:
        DATABRICKS_HOST: $(DATABRICKS_TEST_HOST)
        DATABRICKS_TOKEN: $(DATABRICKS_TEST_TOKEN)

# IntegrationTest stage (end-to-end tests against the test workspace) omitted for brevity

- stage: DeployProd
  dependsOn: IntegrationTest
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - deployment: DeployToProd
    environment: production  # Requires manual approval gate
    strategy:
      runOnce:
        deploy:
          steps:
          - script: databricks bundle deploy --target prod
            # Databricks Asset Bundles: recommended for production deployments

10.3  Databricks Asset Bundles

Databricks Asset Bundles (DABs) are the recommended approach for managing Databricks resources as code. DABs define jobs, pipelines, notebooks, and cluster configurations in a YAML manifest that can be version-controlled, deployed consistently across environments, and integrated with CI/CD pipelines.

# databricks.yml — Asset Bundle definition
bundle:
  name: analytics-platform
 
variables:
  catalog: {default: dev_catalog}
  environment: {default: dev}
 
targets:
  dev:
    mode: development
    workspace:
      host: https://adb-dev.azuredatabricks.net
    variables:
      catalog: dev_catalog
      environment: dev
  prod:
    mode: production
    workspace:
      host: https://adb-prod.azuredatabricks.net
    variables:
      catalog: main
      environment: prod
 
resources:
  pipelines:
    iot_sensor_pipeline:
      name: "IoT Sensor Pipeline (${var.environment})"
      catalog: ${var.catalog}
      schema: silver
      libraries:
        - notebook: {path: ./src/pipelines/iot_sensor_pipeline.py}
      continuous: true
      channel: CURRENT

# Deploy:  databricks bundle deploy --target prod
# Destroy: databricks bundle destroy --target dev

CHAPTER 11

Migration Patterns: On-Premises & Cloud to Databricks

11.1  Migration Assessment Framework

Before migrating any workload to Databricks, a structured assessment determines which workloads benefit most, which require significant re-engineering, and which should remain on existing platforms.

Workload Type | Migration Complexity | Migration Value | Recommended Approach
SQL Server SSIS packages | Medium | High | Rewrite as DLT pipelines or ADF; use SSIS-to-Databricks migration accelerator tools
Informatica PowerCenter mappings | High | High | Use Databricks professional services migration toolkit; expect 60-70% automation, 30-40% manual rewrite
Teradata SQL workloads | Medium-High | Very High (cost reduction) | Use Databricks Teradata migration utilities; SQL dialect conversion tools; validate results
On-premises Hadoop (HDFS + Hive) | Low-Medium | Very High | Lift-and-shift HDFS to ADLS Gen2; convert Hive tables to Delta; Spark code works with minimal changes
Azure Synapse Analytics (dedicated pool) | Low | Medium | Delta Sharing from Databricks to Synapse; or move Synapse workloads that benefit from Databricks compute onto Databricks
Snowflake | Medium | High (for ML-heavy orgs) | Use Delta Sharing to mirror key tables; migrate engineering workloads; evaluate co-existence vs. full migration

11.2  Hadoop Migration Pattern

Hadoop-to-Databricks is the most common migration pattern and also the most straightforward. HDFS and Hive are architecturally similar to ADLS Gen2 and Delta Lake, making it possible to migrate 80% of Hadoop workloads with minimal code changes.

  1. Storage migration: Use Azure Data Box for initial bulk transfer of HDFS data to ADLS Gen2 (for datasets >10TB). Use AzCopy for incremental transfer during cutover. Target: Bronze layer in ADLS Gen2.
  2. Hive table conversion: Run Databricks CONVERT TO DELTA on all Parquet-backed Hive tables. For ORC-backed tables, read as Spark DataFrame and write as Delta. Register in Unity Catalog metastore.
  3. Spark code migration: Existing PySpark and Scala Spark code runs on Databricks Runtime without modification in ~85% of cases. Common issues: Hadoop-specific configuration properties, HDFS path references, and deprecated Spark 2.x APIs that were removed in Spark 3.x.
  4. Hive metastore migration: Use Databricks Unity Catalog migration tool to migrate existing Hive metastore to Unity Catalog, preserving table definitions, partition metadata, and (where possible) access control information.
  5. Scheduler migration: Replace Apache Oozie or Airflow (on Hadoop YARN) with Databricks Workflows or Azure Data Factory. Export Oozie workflow XML and use the Databricks Oozie migration converter for automated translation.
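The Hive table conversion in step 2 can be sketched in two statements (storage path and table names here are hypothetical): Parquet-backed tables convert in place, while ORC-backed tables must be rewritten, since in-place conversion supports Parquet only.

```sql
-- Parquet-backed Hive table: converted in place, no data rewrite
CONVERT TO DELTA parquet.`abfss://bronze@contosodl.dfs.core.windows.net/warehouse/sales`
  PARTITIONED BY (event_date DATE);

-- ORC-backed table: read and rewrite as Delta (CTAS writes Delta by default on Databricks)
CREATE TABLE main.bronze.sales_migrated AS
SELECT * FROM hive_metastore.default.sales_orc;
```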

11.3  Teradata Migration Pattern

Teradata migrations deliver the highest ROI of any Databricks migration — typically 70-85% cost reduction on compute — but require the most careful SQL dialect conversion and performance validation.

Migration Phase | Activities | Duration (Typical) | Key Risk
Discovery | Inventory all Teradata objects (tables, views, stored procedures, macros); profile query patterns; identify BTEQ and TPT scripts | 2-4 weeks | Undocumented dependencies; legacy macros with no documentation
Schema conversion | Convert Teradata DDL to Delta Lake DDL; map data types; handle SET tables (implicit dedup) vs. MULTISET (duplicates allowed) | 4-8 weeks | SET table semantics differ from Delta; requires explicit deduplication logic
SQL conversion | Convert BTEQ → PySpark/SQL; convert Teradata-specific functions (QUALIFY, SAMPLE, period types) to Spark equivalents | 8-16 weeks | QUALIFY (row filtering on window-function results) needs careful window-function translation
Performance validation | Run parallel execution: Teradata vs. Databricks; validate that row counts, aggregations, and business metrics match | 4-8 weeks | Floating-point rounding differences; NULL handling differences
Cutover | Final data sync; redirect pipelines; decommission Teradata | 2-4 weeks | Cutover window; rollback plan if validation fails
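The QUALIFY risk above is mechanical to resolve: QUALIFY filters rows on a window-function result, which standard SQL expresses as a subquery plus an outer WHERE clause (recent Databricks runtimes also accept QUALIFY directly). Because the rewritten form is plain ANSI SQL, the pattern can be exercised with Python's stdlib sqlite3; the table and column names below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE account_balances (account_id TEXT, snapshot_ts TEXT, balance REAL);
INSERT INTO account_balances VALUES
  ('A', '2024-01-01', 100.0),
  ('A', '2024-02-01', 150.0),
  ('B', '2024-01-15', 75.0);
""")

# Teradata form (filter on a window function without a subquery):
#   SELECT account_id, snapshot_ts, balance
#   FROM account_balances
#   QUALIFY ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY snapshot_ts DESC) = 1;
#
# Portable rewrite: compute the window function in a subquery, filter outside.
rows = con.execute("""
    SELECT account_id, snapshot_ts, balance FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY account_id ORDER BY snapshot_ts DESC) AS rn
        FROM account_balances
    ) WHERE rn = 1
    ORDER BY account_id
""").fetchall()

print(rows)  # one row per account: the latest balance snapshot
```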

CHAPTER 12

Real-World Case Studies

Case Study 1: Global Retailer — Real-Time Inventory Analytics

Dimension | Details
Organization | Pan-European omnichannel retailer; 1,200 stores; 85M transactions/day across POS, e-commerce, and fulfilment systems
Challenge | Inventory visibility latency of 4-6 hours causing a 12% out-of-stock rate. Demand forecasting models running on stale batch data, producing recommendations 2 days behind actual demand signals. €47M annual revenue loss attributed to stockouts and overstocking combined.
Databricks Scope | DLT streaming pipelines ingesting POS events from 1,200 stores via Event Hubs (120K events/second peak); real-time inventory position table updated every 60 seconds; XGBoost demand forecasting model retrained daily; 180-day sales history in Silver; real-time serving via Model Serving API
Architecture | Bronze: raw POS events (Event Hubs → DLT streaming, Delta CDF enabled). Silver: cleaned transactions, inventory deltas (DLT streaming). Gold: real-time inventory position table (MERGE every 60 seconds), store-level demand signals. ML: daily demand forecast model using Gold features; served via Model Serving REST API to the merchandising system.

Key Engineering Decisions

Event deduplication: POS systems occasionally double-emit events during network retries. DLT Expectations with FAIL UPDATE on duplicate transaction_id, combined with a 2-hour deduplication window using Delta CDF, eliminate duplicate inventory decrements without requiring a full table scan.

Inventory MERGE optimization: The inventory position table (450M rows, one per SKU per store) required MERGE every 60 seconds. Naive MERGE was too slow (45 seconds). Solution: partition by store_id (1,200 partitions), MERGE with WHEN NOT MATCHED THEN INSERT only on the affected store partition. Merge time reduced to 8 seconds — well within the 60-second SLA.

Model serving latency: Merchandising system requires forecast in < 200ms. Databricks Model Serving (Small workload size, scale-to-zero disabled) delivers p99 < 85ms for the XGBoost model with 320 input features. Feature lookup from Delta table adds 12ms average — within SLA.

KPI | Before Databricks | After Databricks (6 months)
Inventory visibility latency | 4-6 hours | < 2 minutes
Out-of-stock rate | 12% | 7.3%
Demand forecast accuracy (MAPE) | 31% | 18%
Data infrastructure cost (annual) | €4.2M (Teradata + SAS) | €1.1M (Databricks + ADLS)
Engineering team size (data) | 28 FTE | 16 FTE (12 freed for product work)

Case Study 2: Financial Services Firm — Credit Risk Model Factory

Dimension | Details
Organization | UK challenger bank; 2.4M customers; regulatory capital models (IFRS 9 ECL, Basel III PD/LGD/EAD) requiring quarterly revalidation and monthly production runs
Challenge | Monthly credit risk model runs taking 72 hours on a legacy SAS environment. Regulatory validation requiring a complete audit trail of every model input, transformation, and output — no automated lineage. Model retraining requiring a 3-week manual process (data preparation, training, validation, sign-off, deployment).
Databricks Scope | Full model factory: data ingestion from the core banking system via CDC (Azure SQL → Bronze Delta), feature engineering pipeline (Silver), model training and registry (MLflow + Unity Catalog), monthly production scoring (Gold), regulatory reporting (Gold → Azure SQL → Power BI)

Regulatory Compliance Architecture

Immutable audit trail: Delta Lake time travel serves as the regulatory audit trail. Every model training run records the exact Delta table version of all input tables in the MLflow run metadata. Regulators can request reconstruction of any model output by restoring the exact input data state — guaranteed by Delta’s transaction log.

Model governance workflow: MLflow model registry stages (Candidate → Staging → Production) map to the bank’s model risk governance process. Each stage transition requires documented challenger model test results, populated as MLflow run tags. The Production promotion requires manual sign-off in the registry UI from the Model Risk team — creating an automated, auditable governance record.

Parallel validation: During the 6-month transition, Databricks and SAS ran in parallel. Databricks results were compared to SAS output at each stage. Any discrepancy >0.01% triggered an investigation. Final validation showed 99.97% numerical agreement — the 0.03% difference traced to floating-point precision differences between SAS and Python, accepted by the regulator after documentation.

KPI | Before Databricks (SAS) | After Databricks
Monthly model run duration | 72 hours | 4.5 hours
Model retraining cycle | 3 weeks, manual | 6 hours, automated (CI/CD + MLflow)
Audit trail completeness | Manual documentation; gaps common | 100% automated via Delta time travel + MLflow
Regulatory audit preparation | 8 weeks | 3 days
Annual platform cost | £1.8M (SAS licences) | £420K (Databricks + compute)

Case Study 3: Manufacturing — Predictive Maintenance at IoT Scale

A global industrial manufacturer deployed Databricks to process sensor telemetry from 85,000 connected machines across 140 factories in 28 countries. The business case: reduce unplanned downtime (average cost: $260,000/hour per production line) through ML-based predictive maintenance.

Scale challenge: 85,000 machines each emitting 20 sensor readings per second = 1.7 million events per second sustained, 4.2 million at peak. Event Hubs (100 throughput units) ingests to Databricks DLT streaming pipeline. Bronze layer receives 147TB of raw sensor data per month.

ML architecture: One model per machine type (42 machine types) trained on 18-month historical data including 3,200 documented failure events. Isolation Forest for anomaly detection (unsupervised), Gradient Boosting for time-to-failure prediction (supervised, where labelled failure data available). Models retrained weekly using latest telemetry via automated Databricks Workflow.

Edge-to-cloud integration: High-priority alerts (predicted failure within 4 hours) routed from Model Serving → Event Grid → factory floor SCADA system via Azure API Management. Round-trip latency from sensor reading to SCADA alert: < 90 seconds.

KPI | Baseline | After 12 Months
Unplanned downtime (hours/year) | 14,200 across all factories | 8,900 (37% reduction)
Annual downtime cost avoidance | — | $135M (at $260K/hour average)
False positive rate (maintenance alerts) | N/A (no predictive system) | 8.3% (within 10% target)
Model prediction accuracy (failure within 4 hrs) | — | 84% precision, 79% recall
Databricks platform cost (annual) | — | $2.1M all-in

CHAPTER 13

90-Day Implementation Roadmap

The following roadmap reflects deployment patterns across 15+ enterprise Databricks implementations. The sequencing is optimized for: early value demonstration (quick wins by Day 30), platform stability before scale (foundation before onboarding all workloads), and sustainable operations (CI/CD and monitoring before production traffic).

Phase 1: Foundation (Days 1-30)

| Week | Activity | Owner | Success Criteria |
| --- | --- | --- | --- |
| 1 | Azure Databricks workspace deployment with VNet injection; Unity Catalog metastore creation; Entra ID SCIM sync configuration; Key Vault-backed secret scope setup | Data Platform Engineer + Azure Architect | Workspace live in private network; Unity Catalog active; first Entra ID group synced; Key Vault scope working |
| 1-2 | ADLS Gen2 storage account creation (bronze/silver/gold); Unity Catalog external locations and storage credentials; cluster policies defined and published | Data Platform Engineer | Storage accounts with hierarchical namespace; external locations in Unity Catalog; cluster policies active |
| 2 | Catalog and schema design in Unity Catalog; RBAC model design; data domain owner identification; Entra ID group structure for Databricks access | Data Governance Lead + Data Architect | Catalog hierarchy documented and deployed; RBAC matrix approved; all data domain owners identified and briefed |
| 2-3 | First pipeline: select highest-value Bronze ingestion (one source); build as DLT pipeline with expectations; deploy to dev; validate against source | Senior Data Engineer | First DLT pipeline ingesting into Bronze Delta table; expectations defined; data validated against source system |
| 3-4 | CI/CD pipeline setup (Azure DevOps); Databricks Asset Bundles configured; dev/prod workspace deployment automation; first automated unit tests | Data Engineer + DevOps | Code deployment automated; dev→prod promotion via pipeline; unit tests running on every PR |
| 4 | Azure Monitor integration; Log Analytics workspace linked; first cost alert configured; Databricks audit log delivery enabled | Data Platform Engineer | Cluster metrics in Log Analytics; cost alert active; audit logs flowing |

Phase 2: Activation (Days 31-60)

| Week | Activity | Owner | Success Criteria |
| --- | --- | --- | --- |
| 5-6 | Onboard Tier 1 workloads: build Silver transformation pipelines with DLT; define data quality expectations; deploy to production | Data Engineering Team | Tier 1 Silver tables populated; DLT expectations capturing quality metrics; pipeline SLAs defined |
| 6 | Quick win: first Gold table serving Power BI / Fabric; first analyst using Databricks SQL for self-service query; governance maturity demo to leadership | Data Engineering + Analytics Lead | First Gold table live; at least 3 analysts running queries; leadership demo delivered |
| 6-7 | Unity Catalog lineage verification; column masks and row filters for PII data; Purview integration configured (if in scope) | Data Engineer + Data Governance | Lineage visible in Catalog Explorer for Tier 1 tables; PII columns masked; Purview sync active |
| 7-8 | MLflow setup (if ML in scope): first experiment tracked; model registered in Unity Catalog; model serving endpoint deployed for at least one model | ML Engineer | MLflow experiments running; first model in registry with Champion/Challenger staging; serving endpoint live |
| 7-8 | Streaming pipeline deployment (if real-time in scope): Event Hubs connector; DLT streaming with Enhanced Autoscaling; latency SLA validation | Senior Data Engineer | Streaming pipeline ingesting at production volume; end-to-end latency within SLA; autoscaling tested under load |
| 8 | Cost review: spot instance utilization report; cluster right-sizing audit; Predictive Optimization enabled on UC catalogs; budget alert refinement | Data Platform Engineer + FinOps | Spot utilization >80%; identified and actioned top 3 cost reduction opportunities |
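The week 6-7 PII step relies on Unity Catalog column masks. The sketch below generates the DDL involved; the table, column, and group names are hypothetical, and in a notebook each statement would be executed with spark.sql():

```python
# Unity Catalog column-mask DDL, parameterized for reuse.
# Table, column, function, and group names are hypothetical examples.

def mask_function_ddl(func_name):
    """SQL UDF that reveals the value only to members of a PII group."""
    return f"""
        CREATE OR REPLACE FUNCTION {func_name}(val STRING)
        RETURN CASE
            WHEN is_account_group_member('pii_readers') THEN val
            ELSE '***REDACTED***'
        END
    """

def apply_mask_ddl(table, column, func_name):
    """Attach the masking function to a column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {func_name}"

stmt = apply_mask_ddl("main.silver.customers", "email", "main.silver.mask_pii")
print(stmt)
```

Because the mask is attached to the column in Unity Catalog, every access path (SQL warehouse, notebook, BI tool) sees the redacted value unless the caller is in the privileged group.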

Phase 3: Scale & Optimise (Days 61-90)

| Week | Activity | Owner | Success Criteria |
| --- | --- | --- | --- |
| 9-10 | Onboard remaining workloads: complete Bronze ingestion for all Tier 2 sources; Silver and Gold for priority domains; DLT for all new pipelines (no raw Spark jobs) | Data Engineering Team | >80% of planned data sources ingested; all new pipelines using DLT; data quality SLAs documented |
| 10-11 | Performance engineering: identify top 10 slowest jobs; apply Z-ORDER, OPTIMIZE, and Photon; validate speedup; document optimization patterns for team | Senior Data Engineer | Top 10 jobs optimized; average job time reduced by >30%; performance tuning runbook created |
| 11 | Operations runbook: incident response procedures; on-call rotation; DLT pipeline restart procedures; cost escalation process; SLA breach response | Data Platform Lead + SRE | Runbook published; on-call rotation active; first gameday exercise completed |
| 12 | Platform review: measure KPIs vs. Day 1 baseline; business case revalidation; 90-180 day roadmap (next workloads, ML roadmap, Fabric integration plan) | CDO + Data Platform Lead | KPI improvement documented; business case validated; 90-180 day plan approved by leadership |
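The week 10-11 performance pass applies OPTIMIZE and Z-ORDER to the slowest tables. A small helper like the one below keeps the generated statements consistent across jobs; the table and column names are illustrative, and the statements would be executed with spark.sql():

```python
# Build Delta maintenance statements for the performance-tuning pass.
# Table and column names are illustrative examples.

def optimize_stmt(table, zorder_cols=None):
    """Build an OPTIMIZE statement, optionally with ZORDER BY."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt

print(optimize_stmt("main.silver.sales_transactions",
                    ["customer_id", "transaction_dt"]))
# OPTIMIZE main.silver.sales_transactions ZORDER BY (customer_id, transaction_dt)
```

Choose Z-ORDER columns from high-cardinality predicates that appear in the slowest queries' WHERE clauses; Z-ORDERing on low-cardinality columns adds compaction cost without improving data skipping.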
 Critical Success Factor: The most common Databricks deployment failure mode is “platform without pipelines” — the infrastructure is deployed correctly but business workloads are never migrated to it. Ensure at least one business-critical pipeline is live in production by Day 45. Early production usage drives adoption, surfaces real-world issues before scale, and maintains organizational momentum.

APPENDIX

Appendix: Reference Materials

A. Essential PySpark Patterns

A1. Efficient Incremental Load with Watermarks

# Production pattern: incremental load with a high-water mark
# Stores the last processed timestamp in a metadata Delta table

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read the current watermark
watermark = spark.table("main.metadata.pipeline_watermarks") \
    .filter("pipeline_name = 'orders_ingestion'") \
    .select("last_processed_ts") \
    .collect()[0][0]

# Extract only new/changed records from the source
# (the timestamp is interpolated as a SQL literal; acceptable here
# because it originates from our own metadata table, not user input)
new_records = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("query", f"SELECT * FROM orders WHERE updated_at > '{watermark}'") \
    .load()

# Upsert into the Bronze Delta table
bronze_table = DeltaTable.forName(spark, "main.bronze.orders")
bronze_table.alias("target") \
    .merge(new_records.alias("source"), "target.order_id = source.order_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# Advance the watermark to the max timestamp just processed;
# skip the update when no new rows arrived (max of empty is None)
new_watermark = new_records.agg(F.max("updated_at")).collect()[0][0]
if new_watermark is not None:
    spark.sql(f"""
        UPDATE main.metadata.pipeline_watermarks
        SET last_processed_ts = '{new_watermark}'
        WHERE pipeline_name = 'orders_ingestion'
    """)

A2. Dynamic Partition Overwrite Pattern

# Safe overwrite of specific partitions without touching others
# Use case: reprocess a date range after a source data correction

# Option 1: dynamic partition overwrite. Replaces only the partitions
# present in df; all other partitions are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("main.silver.sales_transactions")

# Option 2 (preferred for Delta): replaceWhere. Replaces only the rows
# matching the predicate, using Delta transaction semantics: the
# operation is atomic and rolls back on failure.
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "transaction_dt >= '2025-01-01' AND transaction_dt <= '2025-01-07'") \
    .saveAsTable("main.silver.sales_transactions")

B. Databricks REST API Quick Reference

| Operation | Method | Endpoint | Common Use |
| --- | --- | --- | --- |
| List jobs | GET | /api/2.1/jobs/list | Audit all jobs; generate cost attribution reports |
| Run job now | POST | /api/2.1/jobs/run-now | Trigger job from external orchestrator (ADF, Logic Apps) |
| Get run status | GET | /api/2.1/jobs/runs/get?run_id={id} | Poll completion from external orchestrator |
| List DLT pipelines | GET | /api/2.0/pipelines | Audit pipelines; governance inventory |
| Get pipeline events | GET | /api/2.0/pipelines/{id}/events | Extract DLT expectation metrics for monitoring |
| List Unity Catalog tables | GET | /api/2.1/unity-catalog/tables | Data catalog integration; asset inventory |
| Get table lineage | GET | /api/2.0/lineage-tracking/table-lineage | Extract lineage for Purview integration |
| Create cluster policy | POST | /api/2.0/policies/clusters/create | Infrastructure-as-code cluster governance |
| List MLflow experiments | GET | /api/2.0/mlflow/experiments/list | ML governance; experiment inventory |
| Get model versions | GET | /api/2.0/mlflow/model-versions/search | Model registry governance; deployment audit |
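As an example of driving these endpoints from an external orchestrator, the sketch below builds a jobs/run-now request with the Python standard library. The workspace host, token, and job ID are placeholders; the request is constructed but only sent when executed against a real workspace:

```python
# Trigger a Databricks job via POST /api/2.1/jobs/run-now.
# Host, token, and job_id below are placeholder values.
import json
import urllib.request

def build_run_now_request(host, token, job_id, notebook_params):
    payload = {"job_id": job_id, "notebook_params": notebook_params}
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_run_now_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    "dapi-example-token",                                  # placeholder
    271828,                                                # placeholder
    {"run_date": "2025-01-07"},
)
print(req.full_url)
# To execute: urllib.request.urlopen(req), then poll
# GET /api/2.1/jobs/runs/get?run_id={id} until the run completes.
```

In production the token should come from a Key Vault-backed secret scope or a service principal, never a hard-coded string.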

C. Databricks Runtime Version Policy

| Runtime Channel | Description | Recommendation |
| --- | --- | --- |
| CURRENT | Latest stable DBR; receives bug fixes and security patches | Use for all production workloads; update quarterly |
| PREVIEW | Next DBR version; not yet GA; may contain breaking changes | Use in development/test only; never in production |
| LTS (Long-Term Support) | Extended support cycle (2 years); conservative updates | Use for regulated workloads with strict change control; accept performance tradeoff vs. CURRENT |
| Custom (ML Runtime) | Includes TensorFlow, PyTorch, scikit-learn, XGBoost pre-installed | Use for all ML workloads; eliminates library installation time on cluster start |
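These channel recommendations can be enforced mechanically with a cluster policy (created via /api/2.0/policies/clusters/create from the REST quick reference). The sketch below pins clusters to a fixed runtime; the version string and policy name are examples, not a recommendation of a specific release:

```python
# Cluster policy fragment pinning clusters to a fixed DBR version
# and bounding auto-termination. The version string is an example;
# substitute your organization's approved release.
import json

policy = {
    "spark_version": {
        "type": "fixed",
        "value": "15.4.x-scala2.12",   # example version string
    },
    "autotermination_minutes": {
        "type": "range",
        "maxValue": 60,
        "defaultValue": 30,
    },
}

# The Cluster Policies API expects the definition as a JSON string
request_body = {
    "name": "prod-lts-policy",
    "definition": json.dumps(policy),
}
print(request_body["name"])
```

Delivering the policy through CI/CD (rather than the UI) keeps runtime governance version-controlled alongside the rest of the platform configuration.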

D. Glossary

| Term | Definition |
| --- | --- |
| Adaptive Query Execution (AQE) | Spark feature that re-optimizes query plans at runtime based on actual data statistics; automatically handles skew, coalesces partitions, and selects join strategies |
| Asset Bundle (DAB) | Databricks resource configuration framework that defines jobs, pipelines, and clusters as YAML manifests for version-controlled, environment-aware deployment |
| Change Data Feed (CDF) | Delta Lake feature that captures row-level changes (inserts, updates, deletes) with change type metadata, enabling efficient incremental downstream processing |
| Delta Lake | Open-source storage layer providing ACID transactions, scalable metadata, time travel, and schema enforcement on top of Parquet files in cloud object storage |
| Delta Live Tables (DLT) | Databricks declarative pipeline framework; engineers declare data transformations and quality expectations; DLT manages execution, retry, monitoring, and autoscaling |
| Lakehouse | Architecture combining data lake storage economics with data warehouse reliability and performance, using open table formats (Delta Lake, Iceberg, Hudi) as the unifying layer |
| Medallion Architecture | Data organization pattern (Bronze/Silver/Gold) representing progressive data quality levels, from raw ingestion to business-certified analytics-ready assets |
| MLflow | Open-source platform for ML lifecycle management: experiment tracking, model registry, model serving, and evaluation; created by Databricks; integrated into Unity Catalog |
| Photon | Databricks C++ vectorized query engine replacing JVM Spark execution for analytical workloads; delivers 2-8x performance improvement with no code changes |
| Unity Catalog | Databricks unified governance layer providing centralized access control, lineage tracking, data sharing, and audit across all workspaces in an organization |
| Z-Ordering | Delta Lake data layout optimization that co-locates related data in the same files based on specified columns, improving query performance by enabling more aggressive data skipping |
