Implementing Medallion Architecture in Databricks: A Complete Guide
The medallion architecture (also called the multi-hop pattern) has become the de facto standard for organizing data lakes. It provides a structured, incremental approach to data quality improvement — from raw ingestion to analytics-ready gold tables. In this guide, we'll walk through building a production-ready medallion pipeline on Databricks with Delta Lake, PySpark, and Unity Catalog.
What Is Medallion Architecture?
The medallion architecture organizes data into three layers:
| Layer | Description | Typical Use Cases |
|---|---|---|
| Bronze | Raw data as-is, append-only | Audit, reprocessing, lineage |
| Silver | Cleaned, validated, deduplicated | Operational analytics, feature engineering |
| Gold | Business-aggregated, domain-specific | Dashboards, ML training sets, reports |
Each layer refines the previous one, allowing you to always trace data back to its source while providing clean, performant tables to consumers.
Prerequisites
- Databricks Runtime 12.2 LTS or higher
- Delta Lake (bundled with Databricks)
- Unity Catalog enabled (recommended)
- A cloud storage mount or Unity Catalog external location
Setting Up the Unity Catalog Structure
Before writing any code, establish a clear namespace:
```sql
-- Create catalogs per environment
CREATE CATALOG IF NOT EXISTS prod;
CREATE CATALOG IF NOT EXISTS dev;

-- Create schemas for each medallion layer
CREATE SCHEMA IF NOT EXISTS prod.bronze;
CREATE SCHEMA IF NOT EXISTS prod.silver;
CREATE SCHEMA IF NOT EXISTS prod.gold;
```
Bronze Layer: Raw Ingestion
The Bronze layer captures data exactly as it arrives — no transformations, no filtering. This is your audit log and your safety net.
Auto Loader for Streaming Ingestion
Databricks Auto Loader is the recommended way to ingest files at scale. It efficiently processes new files as they arrive using file notification or directory listing:
```python
from typing import Optional

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name, lit

spark = SparkSession.builder.getOrCreate()

def ingest_bronze(source_path: str, target_table: str, schema_hint: Optional[str] = None):
    """Incrementally ingest raw files into the Bronze layer using Auto Loader."""
    reader = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{target_table}/schema")
        .option("cloudFiles.inferColumnTypes", "true")
    )
    if schema_hint:
        # An explicit DDL schema string overrides inference
        reader = reader.schema(schema_hint)

    df = (
        reader.load(source_path)
        .withColumn("_ingested_at", current_timestamp())
        .withColumn("_source_file", input_file_name())
        .withColumn("_batch_id", lit("manual"))
    )

    (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", f"/mnt/checkpoints/{target_table}/stream")
        .option("mergeSchema", "true")
        .outputMode("append")
        .trigger(availableNow=True)
        .table(target_table)
        .awaitTermination()
    )

# Usage
ingest_bronze(
    source_path="abfss://landing@storageaccount.dfs.core.windows.net/events/",
    target_table="prod.bronze.events_raw",
)
```
Key Bronze Principles
- Never delete from Bronze — it's your source of truth
- Add metadata columns (`_ingested_at`, `_source_file`, `_batch_id`)
- Use `mergeSchema=true` to handle schema evolution gracefully
- Partition by ingestion date for performance and lifecycle management
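The last principle, partitioning by ingestion date, is what makes lifecycle management practical: partitions older than a retention window can be identified and archived to cheaper storage (never deleted). A minimal sketch; the function name and the 90-day window are illustrative assumptions, not Databricks APIs:

```python
from datetime import date, timedelta

def expired_partitions(today: date, partition_dates: list, retention_days: int = 90) -> list:
    """Return the ingestion-date partition values older than the retention
    window -- candidates for archival to cold storage, not deletion."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)

# Partitions ingested before 2024-03-03 are archival candidates
old = expired_partitions(date(2024, 6, 1), [date(2024, 1, 1), date(2024, 5, 30)])
```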
Silver Layer: Cleansing and Standardization
The Silver layer is where business logic starts. Here you validate, deduplicate, type-cast, and join reference data.
Idempotent Merge with Delta Lake
Use MERGE INTO (upsert) to make Silver writes idempotent — safe to re-run after failures:
```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, to_timestamp, trim, upper

def transform_silver_events(spark):
    """Clean and standardize raw events into the Silver layer."""
    bronze_df = spark.read.table("prod.bronze.events_raw")
    silver_df = (
        bronze_df
        .withColumn("event_timestamp", to_timestamp(col("event_ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
        .withColumn("user_id", trim(col("user_id")))
        .withColumn("event_type", upper(trim(col("event_type"))))
        .withColumn("amount", col("amount").cast("decimal(18,4)"))
        .filter(col("event_timestamp").isNotNull())
        .filter(col("user_id").isNotNull() & (col("user_id") != ""))
        # Deduplicate on the merge key: MERGE fails if multiple
        # source rows match the same target row
        .dropDuplicates(["event_id"])
        .withColumn("_silver_processed_at", current_timestamp())
    )
    return silver_df

def write_silver_merge(df, target_table: str, merge_keys: list):
    """Upsert into Silver using Delta MERGE for idempotency."""
    # Create an empty target table on first run; no-op if it already exists
    df.limit(0).write.format("delta").mode("ignore").saveAsTable(target_table)

    target = DeltaTable.forName(df.sparkSession, target_table)
    merge_condition = " AND ".join(f"target.{k} = source.{k}" for k in merge_keys)
    (
        target.alias("target")
        .merge(df.alias("source"), merge_condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

silver_df = transform_silver_events(spark)
write_silver_merge(
    df=silver_df,
    target_table="prod.silver.events",
    merge_keys=["event_id"],
)
```
Data Quality with Delta Constraints
Enforce data quality at the table level using Delta Lake constraints:
```sql
ALTER TABLE prod.silver.events
ADD CONSTRAINT chk_amount_positive CHECK (amount >= 0);

ALTER TABLE prod.silver.events
ADD CONSTRAINT chk_event_type_valid CHECK (event_type IN ('CLICK', 'PURCHASE', 'VIEW', 'SIGN_UP'));
```
You can also use Databricks Delta Live Tables (DLT) expectations for pipeline-level quality control:
```python
import dlt
from pyspark.sql.functions import col, to_timestamp

@dlt.table(name="events_silver")
@dlt.expect_or_drop("valid_amount", "amount >= 0")
@dlt.expect_or_fail("non_null_user", "user_id IS NOT NULL")
def events_silver():
    return (
        dlt.read("events_raw")
        .withColumn("event_timestamp", to_timestamp(col("event_ts")))
        .dropDuplicates(["event_id"])
    )
```
Gold Layer: Business-Ready Aggregations
Gold tables are purpose-built for specific consumers — BI dashboards, ML models, or APIs. Optimize for read performance.
Building a Daily Revenue Aggregation
```python
from pyspark.sql.functions import (
    avg, col, count, countDistinct, current_timestamp, date_trunc,
    sum as spark_sum,
)

def build_gold_daily_revenue(spark):
    """Aggregate Silver events into a daily revenue Gold table."""
    df = (
        spark.read.table("prod.silver.events")
        .filter(col("event_type") == "PURCHASE")
        .filter(col("event_timestamp") >= "2024-01-01")
    )
    gold_df = (
        df
        .withColumn("event_date", date_trunc("day", col("event_timestamp")))
        .groupBy("event_date", "country", "product_category")
        .agg(
            spark_sum("amount").alias("total_revenue"),
            count("event_id").alias("transaction_count"),
            countDistinct("user_id").alias("unique_buyers"),
            avg("amount").alias("avg_order_value"),
        )
        .withColumn("_gold_updated_at", current_timestamp())
    )
    return gold_df

gold_df = build_gold_daily_revenue(spark)
(
    gold_df.write
    .format("delta")
    .mode("overwrite")
    # Replace only the date range being recomputed, not the whole table
    .option("replaceWhere", "event_date >= '2024-01-01'")
    .partitionBy("event_date")
    .saveAsTable("prod.gold.daily_revenue")
)
```
Optimizing Gold Tables for BI Tools
```sql
-- Option A: optimize file sizes and co-locate common filter columns
OPTIMIZE prod.gold.daily_revenue ZORDER BY (country, product_category);

-- Option B: liquid clustering (DBR 13.3+). Note that liquid clustering
-- replaces both partitioning and ZORDER, so apply it to a table created
-- with CLUSTER BY rather than one that is already partitioned.
ALTER TABLE prod.gold.daily_revenue
CLUSTER BY (event_date, country);

-- Set table properties for write-time optimization
ALTER TABLE prod.gold.daily_revenue
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
```
Orchestration with Databricks Workflows
Wire all three layers together in a Databricks Workflow:
```yaml
# workflow.yml (Databricks Asset Bundle format)
resources:
  jobs:
    medallion_pipeline:
      name: "Medallion Pipeline - Events"
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion
          job_cluster_key: bronze_cluster
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/silver_transform
          job_cluster_key: standard_cluster
        - task_key: gold_aggregation
          depends_on:
            - task_key: silver_transform
          notebook_task:
            notebook_path: ./notebooks/gold_aggregation
          job_cluster_key: standard_cluster
```
Monitoring and Observability
Track data quality metrics across layers:
```python
from datetime import datetime

def log_layer_metrics(table: str, layer: str, spark):
    """Log row counts and quality metrics for observability."""
    df = spark.read.table(table)
    metrics = {
        "table": table,
        "layer": layer,
        "row_count": df.count(),
        "timestamp": datetime.now().isoformat(),
    }
    print(f"[{layer.upper()}] {table}: {metrics['row_count']:,} rows")
    return metrics
```
You can extend this pattern to push metrics into a monitoring Delta table and visualize them in Databricks SQL dashboards — or connect tools like Harbinger Explorer to get cross-platform visibility into your data pipeline health.
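The metrics-table extension can be sketched as a small row builder; the column set and the `prod.gold.pipeline_metrics` table name are illustrative assumptions:

```python
from datetime import datetime, timezone

def build_metrics_row(table: str, layer: str, row_count: int) -> dict:
    """Assemble one observability record. A batch of these rows can then be
    appended to a monitoring Delta table, e.g.:
        spark.createDataFrame(rows).write.mode("append") \\
             .saveAsTable("prod.gold.pipeline_metrics")
    """
    return {
        "table": table,
        "layer": layer,
        "row_count": row_count,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

row = build_metrics_row("prod.bronze.events_raw", "bronze", 42)
```

Keeping the row shape in one place makes it easy to query trends per layer from a Databricks SQL dashboard.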
Best Practices Summary
| Concern | Recommendation |
|---|---|
| Bronze writes | Append-only, add metadata columns |
| Silver writes | Idempotent MERGE INTO by natural key |
| Gold writes | replaceWhere for incremental partition overwrite |
| Schema evolution | mergeSchema=true in Bronze, explicit in Silver+ |
| Data quality | Delta constraints + DLT expectations |
| Performance | OPTIMIZE ZORDER or Liquid Clustering on Gold |
| Orchestration | Databricks Workflows with dependency chains |
| Monitoring | Delta history + custom metrics tables |
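The "Delta history" entry in the last row refers to the transaction log that every Delta table already maintains; no custom code is needed to inspect it:

```sql
-- Each write (streaming batch, MERGE, OPTIMIZE) appears as a versioned entry
DESCRIBE HISTORY prod.silver.events LIMIT 10;
```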
Conclusion
The medallion architecture isn't just about organizing folders — it's a discipline for managing data quality, reliability, and trust across your entire lakehouse. By combining Delta Lake's ACID guarantees with Databricks' processing power and Unity Catalog's governance, you get a production-grade data platform that scales with your organization.
Start small: even a two-layer Bronze/Gold setup provides enormous benefits over a raw data swamp. Add Silver when you need reusable cleaned datasets.