Harbinger Explorer

Implementing Medallion Architecture in Databricks: A Complete Guide

12 min read · Tags: databricks, medallion-architecture, delta-lake, data-engineering, pyspark, unity-catalog

The medallion architecture (also called the multi-hop pattern) has become the de facto standard for organizing data lakes. It provides a structured, incremental approach to data quality improvement — from raw ingestion to analytics-ready gold tables. In this guide, we'll walk through building a production-ready medallion pipeline on Databricks with Delta Lake, PySpark, and Unity Catalog.


What Is Medallion Architecture?

The medallion architecture organizes data into three layers:

| Layer | Description | Typical Use Cases |
|---|---|---|
| Bronze | Raw data as-is, append-only | Audit, reprocessing, lineage |
| Silver | Cleaned, validated, deduplicated | Operational analytics, feature engineering |
| Gold | Business-aggregated, domain-specific | Dashboards, ML training sets, reports |

Each layer refines the previous one, allowing you to always trace data back to its source while providing clean, performant tables to consumers.


Prerequisites

  • Databricks Runtime 12.2 LTS or higher
  • Delta Lake (bundled with Databricks)
  • Unity Catalog enabled (recommended)
  • A cloud storage mount or Unity Catalog external location

Setting Up the Unity Catalog Structure

Before writing any code, establish a clear namespace:

-- Create catalogs per environment
CREATE CATALOG IF NOT EXISTS prod;
CREATE CATALOG IF NOT EXISTS dev;

-- Create schemas for each medallion layer
CREATE SCHEMA IF NOT EXISTS prod.bronze;
CREATE SCHEMA IF NOT EXISTS prod.silver;
CREATE SCHEMA IF NOT EXISTS prod.gold;
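With the catalogs and schemas above in place, every fully qualified table name follows a predictable `catalog.layer.table` pattern. Encoding that convention in one place keeps notebooks consistent; here is a plain-Python sketch (the `table_name` helper is hypothetical, purely for illustration, not a Databricks API):

```python
def table_name(catalog: str, layer: str, name: str) -> str:
    """Build a three-level Unity Catalog identifier for a medallion layer."""
    valid_layers = {"bronze", "silver", "gold"}
    if layer not in valid_layers:
        raise ValueError(f"layer must be one of {sorted(valid_layers)}, got {layer!r}")
    return f"{catalog}.{layer}.{name}"

# Usage
print(table_name("prod", "bronze", "events_raw"))  # prod.bronze.events_raw
```

Centralizing the naming rule means an environment switch (prod vs. dev) is a single argument change rather than a find-and-replace across notebooks.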

Bronze Layer: Raw Ingestion

The Bronze layer captures data exactly as it arrives — no transformations, no filtering. This is your audit log and your safety net.

Auto Loader for Streaming Ingestion

Databricks Auto Loader is the recommended way to ingest files at scale. It efficiently processes new files as they arrive using file notification or directory listing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name, lit

spark = SparkSession.builder.getOrCreate()

def ingest_bronze(source_path: str, target_table: str, schema_hint: str = None):
    """Incrementally ingest raw files into the Bronze layer using Auto Loader."""
    
    reader = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{target_table}/schema")
        .option("cloudFiles.inferColumnTypes", "true")
    )
    
    if schema_hint:
        # A DDL string (e.g. "id STRING, amount DOUBLE") overrides schema inference
        reader = reader.schema(schema_hint)
    
    df = (
        reader.load(source_path)
        .withColumn("_ingested_at", current_timestamp())
        .withColumn("_source_file", input_file_name())
        .withColumn("_batch_id", lit("manual"))
    )
    
    (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", f"/mnt/checkpoints/{target_table}/stream")
        .option("mergeSchema", "true")
        .outputMode("append")
        .trigger(availableNow=True)
        .table(target_table)
        .awaitTermination()
    )

# Usage
ingest_bronze(
    source_path="abfss://landing@storageaccount.dfs.core.windows.net/events/",
    target_table="prod.bronze.events_raw"
)

Key Bronze Principles

  1. Never delete from Bronze — it's your source of truth
  2. Add metadata columns (_ingested_at, _source_file, _batch_id)
  3. Use mergeSchema=true to handle schema evolution gracefully
  4. Partition by ingestion date for performance and lifecycle management
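Principle 4 usually means deriving a date column from `_ingested_at` and partitioning on it; in PySpark that is `to_date(col("_ingested_at"))`. The partition value itself is just a UTC calendar date, which this plain-Python sketch makes explicit (the `ingest_date` helper is hypothetical, shown only to pin down the semantics):

```python
from datetime import datetime, timezone

def ingest_date(ts: datetime) -> str:
    """Derive a yyyy-MM-dd partition value from a timezone-aware ingestion timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")

# An event ingested at 23:30 UTC on March 5 lands in the 2024-03-05 partition
print(ingest_date(datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc)))  # 2024-03-05
```

Normalizing to UTC before truncating matters: otherwise the same file can land in different partitions depending on the cluster's timezone.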

Silver Layer: Cleansing and Standardization

The Silver layer is where business logic starts. Here you validate, deduplicate, type-cast, and join reference data.

Idempotent Merge with Delta Lake

Use MERGE INTO (upsert) to make Silver writes idempotent — safe to re-run after failures:

from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_timestamp, trim, upper, when, current_timestamp

def transform_silver_events(spark):
    """Clean and standardize raw events into the Silver layer."""
    
    bronze_df = spark.read.table("prod.bronze.events_raw")
    
    silver_df = (
        bronze_df
        .withColumn("event_timestamp", to_timestamp(col("event_ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
        .withColumn("user_id", trim(col("user_id")))
        .withColumn("event_type", upper(trim(col("event_type"))))
        .withColumn("amount", col("amount").cast("decimal(18,4)"))
        .filter(col("event_timestamp").isNotNull())
        .filter(col("user_id").isNotNull() & (col("user_id") != ""))
        .dropDuplicates(["event_id"])  # exactly one row per merge key, or MERGE will fail
        .withColumn("_silver_processed_at", current_timestamp())
    )
    
    return silver_df

def write_silver_merge(df, target_table: str, merge_keys: list):
    """Upsert into Silver using Delta MERGE for idempotency."""
    spark = df.sparkSession

    # Create an empty target table with the right schema if it does not already exist
    df.limit(0).write.format("delta").mode("ignore").saveAsTable(target_table)

    target = DeltaTable.forName(spark, target_table)
    
    merge_condition = " AND ".join([f"target.{k} = source.{k}" for k in merge_keys])
    
    (
        target.alias("target")
        .merge(df.alias("source"), merge_condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

silver_df = transform_silver_events(spark)
write_silver_merge(
    df=silver_df,
    target_table="prod.silver.events",
    merge_keys=["event_id"]
)

Data Quality with Delta Constraints

Enforce data quality at the table level using Delta Lake constraints:

ALTER TABLE prod.silver.events
  ADD CONSTRAINT chk_amount_positive CHECK (amount >= 0);

ALTER TABLE prod.silver.events
  ADD CONSTRAINT chk_event_type_valid CHECK (event_type IN ('CLICK', 'PURCHASE', 'VIEW', 'SIGN_UP'));
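Delta rejects an entire write when any row violates a constraint, so it often pays to pre-screen rows before the write and quarantine the offenders. A minimal plain-Python mirror of the two CHECKs above (a sketch; the `violates_constraints` helper and the constraint-name strings are assumptions matching the SQL, not a Delta API):

```python
VALID_EVENT_TYPES = {"CLICK", "PURCHASE", "VIEW", "SIGN_UP"}

def violates_constraints(amount: float, event_type: str) -> list:
    """Return the names of table constraints a row would violate, mirroring the CHECKs."""
    violations = []
    if amount < 0:
        violations.append("chk_amount_positive")
    if event_type not in VALID_EVENT_TYPES:
        violations.append("chk_event_type_valid")
    return violations

# Usage
print(violates_constraints(19.99, "PURCHASE"))  # []
print(violates_constraints(-1.0, "REFUND"))     # ['chk_amount_positive', 'chk_event_type_valid']
```

In a real pipeline the same predicates would run as PySpark filters, routing failing rows to a quarantine table instead of failing the batch.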

You can also use Databricks Delta Live Tables (DLT) expectations for pipeline-level quality control:

import dlt
from pyspark.sql.functions import col, to_timestamp

@dlt.table(name="events_silver")
@dlt.expect_or_drop("valid_amount", "amount >= 0")
@dlt.expect_or_fail("non_null_user", "user_id IS NOT NULL")
def events_silver():
    return (
        dlt.read("events_raw")
        .withColumn("event_timestamp", to_timestamp(col("event_ts")))
        .dropDuplicates(["event_id"])
    )

Gold Layer: Business-Ready Aggregations

Gold tables are purpose-built for specific consumers — BI dashboards, ML models, or APIs. Optimize for read performance.

Building a Daily Revenue Aggregation

from pyspark.sql.functions import (
    col, sum as spark_sum, count, countDistinct,
    date_trunc, avg, current_timestamp
)

def build_gold_daily_revenue(spark):
    """Aggregate Silver events into a daily revenue Gold table."""
    
    df = (
        spark.read.table("prod.silver.events")
        .filter(col("event_type") == "PURCHASE")
        .filter(col("event_timestamp") >= "2024-01-01")
    )
    
    gold_df = (
        df
        .withColumn("event_date", date_trunc("day", col("event_timestamp")))
        .groupBy("event_date", "country", "product_category")
        .agg(
            spark_sum("amount").alias("total_revenue"),
            count("event_id").alias("transaction_count"),
            countDistinct("user_id").alias("unique_buyers"),
            avg("amount").alias("avg_order_value")
        )
        .withColumn("_gold_updated_at", current_timestamp())
    )
    
    return gold_df

gold_df = build_gold_daily_revenue(spark)

(
    gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date >= '2024-01-01'")
    .partitionBy("event_date")
    .saveAsTable("prod.gold.daily_revenue")
)
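The `replaceWhere` option takes a SQL predicate string, and it must stay in sync with the filter used on the read side; if the two drift apart, the overwrite silently drops or duplicates partitions. Generating both from a single value avoids that. A sketch (the `replace_where_predicate` helper is hypothetical):

```python
from datetime import date

def replace_where_predicate(start: date, column: str = "event_date") -> str:
    """Build a replaceWhere predicate that overwrites only partitions from `start` onward."""
    return f"{column} >= '{start.isoformat()}'"

# Usage: the same `start` value feeds both the read filter and the write option
print(replace_where_predicate(date(2024, 1, 1)))  # event_date >= '2024-01-01'
```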

Optimizing Gold Tables for BI Tools

-- Option 1: compact files and co-locate frequently filtered columns
OPTIMIZE prod.gold.daily_revenue ZORDER BY (country, product_category);

-- Option 2 (DBR 13.3+): liquid clustering. This is an alternative to
-- partitioning and Z-ordering, not an addition: CLUSTER BY requires a table
-- defined without partitions, so a table partitioned by event_date (as above)
-- would need to be recreated unpartitioned first.
ALTER TABLE prod.gold.daily_revenue
  CLUSTER BY (event_date, country);

-- Set table properties for BI performance
ALTER TABLE prod.gold.daily_revenue
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  );

Orchestration with Databricks Workflows

Wire all three layers together in a Databricks Workflow:

# workflow.yml (Databricks Asset Bundle format)
resources:
  jobs:
    medallion_pipeline:
      name: "Medallion Pipeline - Events"
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion
          job_cluster_key: bronze_cluster
          
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/silver_transform
          job_cluster_key: standard_cluster
          
        - task_key: gold_aggregation
          depends_on:
            - task_key: silver_transform
          notebook_task:
            notebook_path: ./notebooks/gold_aggregation
          job_cluster_key: standard_cluster
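Databricks Workflows resolves the `depends_on` edges into a DAG before scheduling anything. The resulting execution order of the three tasks above can be sanity-checked locally with the standard library (a sketch of the dependency logic only, not how Workflows actually schedules):

```python
from graphlib import TopologicalSorter

def task_order(deps: dict) -> list:
    """Return one valid execution order for tasks, given each task's depends_on list."""
    return list(TopologicalSorter(deps).static_order())

# Dependency graph mirroring the workflow.yml above: task -> its prerequisites
deps = {
    "bronze_ingestion": [],
    "silver_transform": ["bronze_ingestion"],
    "gold_aggregation": ["silver_transform"],
}
print(task_order(deps))  # ['bronze_ingestion', 'silver_transform', 'gold_aggregation']
```

`TopologicalSorter` also raises `CycleError` on circular `depends_on` chains, which is a cheap pre-deploy check for larger bundles.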

Monitoring and Observability

Track data quality metrics across layers:

from datetime import datetime

def log_layer_metrics(table: str, layer: str, spark):
    """Log row counts and quality metrics for observability."""
    df = spark.read.table(table)
    
    metrics = {
        "table": table,
        "layer": layer,
        "row_count": df.count(),
        "timestamp": datetime.now().isoformat()
    }
    
    print(f"[{layer.upper()}] {table}: {metrics['row_count']:,} rows")
    return metrics

You can extend this pattern to push metrics into a monitoring Delta table and visualize them in Databricks SQL dashboards — or connect tools like Harbinger Explorer to get cross-platform visibility into your data pipeline health.
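One derived metric worth tracking is the row-count drop between adjacent layers: Silver should lose some rows to validation and dedup, but a sudden jump usually means a filter or join is misbehaving. A hedged plain-Python sketch (the `layer_drop_ratio` helper is hypothetical):

```python
def layer_drop_ratio(bronze_rows: int, silver_rows: int) -> float:
    """Fraction of Bronze rows filtered out on the way to Silver (a simple quality signal)."""
    if bronze_rows == 0:
        return 0.0
    return round(1 - silver_rows / bronze_rows, 4)

# Usage: 5% of rows dropped between Bronze and Silver
print(layer_drop_ratio(1000, 950))  # 0.05
```

Persisting this ratio per run into a monitoring Delta table makes regressions visible as a trend line on a Databricks SQL dashboard.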


Best Practices Summary

| Concern | Recommendation |
|---|---|
| Bronze writes | Append-only, add metadata columns |
| Silver writes | Idempotent MERGE INTO by natural key |
| Gold writes | replaceWhere for incremental partition overwrite |
| Schema evolution | mergeSchema=true in Bronze, explicit in Silver+ |
| Data quality | Delta constraints + DLT expectations |
| Performance | OPTIMIZE ZORDER or liquid clustering on Gold |
| Orchestration | Databricks Workflows with dependency chains |
| Monitoring | Delta history + custom metrics tables |

Conclusion

The medallion architecture isn't just about organizing folders — it's a discipline for managing data quality, reliability, and trust across your entire lakehouse. By combining Delta Lake's ACID guarantees with Databricks' processing power and Unity Catalog's governance, you get a production-grade data platform that scales with your organization.

Start small: even a two-layer Bronze/Gold setup provides enormous benefits over a raw data swamp. Add Silver when you need reusable cleaned datasets.


Try Harbinger Explorer free for 7 days — get instant visibility into your Databricks pipelines, lineage, and data quality metrics without writing a single line of monitoring code. Start your free trial at harbingerexplorer.com

