
Medallion Architecture Explained

9 min read · Tags: medallion architecture, lakehouse, bronze silver gold, data quality, delta lake, pyspark, data engineering

Every data engineer setting up a lakehouse eventually hits the same question: how do you go from a raw S3 dump to a clean, trustworthy table that a data scientist or analyst can actually use? Medallion architecture is the answer most production lakehouses converge on — a Bronze → Silver → Gold layering pattern that turns raw ingestion chaos into governed, business-ready data. Here's what it actually means, how it works in practice, and where it breaks down.

What Medallion Architecture Is

Medallion architecture is a data design pattern for organizing lakehouse data into progressively cleaner layers. Each layer has a specific quality contract:

  • Bronze: Raw ingestion. Land data exactly as it arrives from source systems — no transformation, no filtering.
  • Silver: Cleaned and conformed. Apply schema enforcement, deduplication, type casting, and business rules. This is where most transformation work happens.
  • Gold: Business-ready aggregates. Pre-joined, pre-aggregated tables optimized for specific analytical use cases.

The name is a quality metaphor, not a technical requirement. You'll also see it called "multi-hop architecture" or the "Bronze-Silver-Gold" pattern. The underlying storage is typically Delta Lake (in Databricks) or Apache Iceberg (in other lakehouses), though the pattern works with any ACID-capable storage format.

Architecture Overview


Raw Sources — any origin system: REST APIs, relational databases via CDC, CSV uploads, event streams, third-party SaaS exports.

Bronze Layer — data lands here immediately and unchanged. The only transformations allowed are technical ones: adding an _ingested_at timestamp, a _source tag, or a _raw JSON column to capture the full payload. Think of Bronze as your time machine — if something breaks downstream, you replay from here.

Silver Layer — this is where engineering happens. Schema enforcement, deduplication, NULL handling, type casting, and cross-source joins. Silver tables are the "single version of truth" for a given domain (e.g., silver.orders, silver.customers).

Gold Layer — denormalized, aggregated tables built for specific consumers. A gold.daily_revenue_by_region table serves a dashboard. A gold.customer_360 table serves a model feature store. Gold tables are often rebuilt from Silver on a schedule.

Consumers — dashboards, ML models, APIs, and ad-hoc analysts. Consumers should read from Gold (or occasionally Silver for power users). Never from Bronze.


Layer Comparison

| Dimension | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| Data quality | Raw, unvalidated | Cleaned, validated | Aggregated, denormalized |
| Schema | Flexible / schema-on-read | Enforced (schema-on-write) | Optimized for specific use cases |
| Transformations | Minimal (timestamps, source tags) | Heavy (dedup, casts, joins) | Aggregations, business logic |
| Latency | Near-real-time (minutes) | Batch or micro-batch | Batch (hourly/daily typical) |
| Who reads it | Data engineers (debugging) | Data engineers, senior analysts | Analysts, dashboards, ML models |
| Retention | Long (years — audit trail) | Medium (months) | Short (rebuild from Silver) |
| Storage cost | High (all raw data) | Medium | Low (aggregated) |
| Typical format | Parquet / Delta, append-only | Delta with MERGE | Delta, optimized with ZORDER |

Bronze to Silver: A PySpark Example

This is a realistic Bronze → Silver transformation for an orders dataset. It handles schema casting, deduplication, and NULL filtering — the three most common Silver-layer tasks.

# PySpark — Bronze to Silver transformation for orders
# Dialect: Spark SQL / PySpark (Databricks Runtime 14+)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import DoubleType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("bronze_to_silver_orders").getOrCreate()

# 1. Read from Bronze (append-only, raw ingestion)
# Watermark-based incremental: only process records not yet in Silver.
# MAX(processed_at) is NULL on the first run, so fall back to a full read.
watermark = spark.sql("SELECT MAX(processed_at) FROM silver.orders").collect()[0][0]

bronze_df = spark.read.format("delta").load("/mnt/datalake/bronze/orders")
if watermark is not None:
    bronze_df = bronze_df.filter(F.col("_ingested_at") > F.lit(watermark))

# 2. Apply Silver-layer transformations
silver_df = (
    bronze_df
    # Cast types — Bronze lands everything as string from the JSON payload
    .withColumn("order_id",    F.col("order_id").cast("string"))
    .withColumn("customer_id", F.col("customer_id").cast("string"))
    .withColumn("amount",      F.col("amount").cast(DoubleType()))
    .withColumn("order_date",  F.to_timestamp(F.col("order_date"), "yyyy-MM-dd'T'HH:mm:ss"))

    # Drop rows missing critical keys
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("amount") > 0)

    # Deduplicate — keep the latest record per order_id
    .withColumn(
        "row_num",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())
        )
    )
    .filter(F.col("row_num") == 1)
    .drop("row_num")

    # Add processing metadata
    .withColumn("processed_at", F.current_timestamp())
    .withColumn("_source_layer", F.lit("silver"))
)

# 3. MERGE into Silver (upsert — handle late-arriving records)
silver_table = DeltaTable.forName(spark, "silver.orders")

(
    silver_table.alias("target")
    .merge(
        silver_df.alias("source"),
        "target.order_id = source.order_id"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Note: count() re-runs the transformation plan; cache silver_df upstream if exact metrics matter.
print(f"Silver merge complete. Records processed: {silver_df.count()}")

Key decisions in this example:

  • Schema casting at Silver, not Bronze — Bronze accepts whatever the source sends; Silver enforces contracts.
  • MERGE instead of overwrite — handles late-arriving records and corrections from the source system.
  • Watermark-based incremental processing — don't reprocess the entire Bronze table on every run.
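Stripped of Spark, the watermark idea from the last bullet is just a high-water-mark filter. The record shape here is illustrative:

```python
from datetime import datetime

# Stand-in for Bronze records with their ingestion timestamps.
bronze_records = [
    {"order_id": "o-1", "_ingested_at": datetime(2024, 5, 1, 9, 0)},
    {"order_id": "o-2", "_ingested_at": datetime(2024, 5, 1, 11, 0)},
]

def incremental_batch(records, watermark):
    """Keep only records ingested after the last Silver run (None = first run)."""
    if watermark is None:
        return list(records)
    return [r for r in records if r["_ingested_at"] > watermark]

# If Silver last processed up to 10:00, only o-2 is picked up on this run.
batch = incremental_batch(bronze_records, datetime(2024, 5, 1, 10, 0))
```

The only subtlety worth encoding is the first-run case, where no watermark exists yet and the whole Bronze table must be read.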

Common Pitfalls

1. Putting business logic in Bronze. Bronze is a safety net. The moment you start filtering, renaming, or transforming in Bronze, you lose the ability to replay cleanly when business logic changes. Keep Bronze as raw as possible.

2. Skipping Silver and going straight to Gold. It's tempting to merge Bronze directly into Gold for simple use cases. This works until your source schema changes, a deduplication bug appears, or a new downstream consumer needs slightly different logic. Silver is the insurance policy.

3. Gold tables that nobody owns. Gold tables accumulate. After 12 months, you often have 40 Gold tables of which 8 are actively used and 32 are stale. Treat Gold tables like APIs: they need owners, SLAs, and deprecation processes.

4. Using a single pipeline for all three layers. Bronze → Silver → Gold should be independent, schedulable jobs. Coupling them in a single pipeline means a Silver failure blocks your Bronze landing, and a Gold rebuild triggers full Silver reprocessing.

5. Forgetting about schema evolution. Source systems change their schemas. Bronze should absorb schema changes silently (schema evolution enabled). Silver should validate and alert on unexpected columns. Gold should be rebuilt when the Silver schema changes.
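The Silver-side validation from pitfall 5 can be a plain contract check; the column set below is a hypothetical orders contract. (On the Bronze side, Delta's mergeSchema write option lets new columns land without failing the job.)

```python
# Hypothetical Silver schema contract for an orders domain (names only;
# a real check would also validate types).
EXPECTED_ORDER_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def unexpected_columns(incoming_columns, expected=EXPECTED_ORDER_COLUMNS):
    """Return columns the source added that the Silver contract doesn't know about."""
    return set(incoming_columns) - expected

# A Silver job would alert (or fail) on a non-empty result:
surprises = unexpected_columns(
    ["order_id", "customer_id", "amount", "order_date", "coupon_code"]
)
if surprises:
    print(f"Schema drift detected: {sorted(surprises)}")

# Bronze, by contrast, absorbs drift silently, e.g.:
# df.write.format("delta").mode("append").option("mergeSchema", "true").save(bronze_path)
```

Failing loudly at Silver rather than at Gold keeps a surprise column from silently corrupting every downstream aggregate.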

When NOT to Use Medallion Architecture

Medallion architecture adds overhead. For some scenarios it's the wrong tool:

Small or static datasets: If you're loading a 50k-row reference table from a vendor CSV once a month, the three-layer pattern is overkill. A single cleaned table is fine.

Pure streaming pipelines: Real-time pipelines where sub-second latency is required don't benefit from a batch-oriented Bronze → Silver → Gold cadence. Use Kafka + Flink or Spark Structured Streaming with a simpler schema.

Proof-of-concept work: Don't spend two weeks building medallion infrastructure to answer a one-off analytical question. Use it when the pipeline needs to be production-grade and maintained by a team.

Single-source, single-consumer pipelines: If one source feeds one consumer with no reuse, Bronze-Silver-Gold adds layers without adding value. A single transformation job is more maintainable.

Once your Gold tables are clean and business-ready, exploring them ad-hoc is where a tool like Harbinger Explorer becomes useful — it lets you run natural language queries against your Gold-layer data directly in the browser via DuckDB WASM, without spinning up a full analytics cluster for exploratory questions.

Conclusion

Medallion architecture is the right default for production lakehouses with multiple sources, multiple consumers, and a team that needs clear ownership boundaries. Bronze gives you durability and replayability. Silver gives you the single version of truth. Gold gives you performance and consumer-specific optimization.

Start with Bronze and Silver first — get one domain (e.g., orders or customers) working end-to-end before adding Gold aggregations. Build Gold tables only when you have a specific consumer with a clear SLA. Let the pattern grow with your data complexity, not ahead of it.


Continue Reading





Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial