Medallion Architecture Explained
Every data engineer setting up a lakehouse eventually hits the same question: how do you go from a raw S3 dump to a clean, trustworthy table that a data scientist or analyst can actually use? Medallion architecture is the answer most production lakehouses converge on — a Bronze → Silver → Gold layering pattern that turns raw ingestion chaos into governed, business-ready data. Here's what it actually means, how it works in practice, and where it breaks down.
What Medallion Architecture Is
Medallion architecture is a data design pattern for organizing lakehouse data into progressively cleaner layers. Each layer has a specific quality contract:
- Bronze: Raw ingestion. Land data exactly as it arrives from source systems — no transformation, no filtering.
- Silver: Cleaned and conformed. Apply schema enforcement, deduplication, type casting, and business rules. This is where most transformation work happens.
- Gold: Business-ready aggregates. Pre-joined, pre-aggregated tables optimized for specific analytical use cases.
The name is a quality metaphor, not a technical requirement. The term was popularized by Databricks, and you'll also see it called "multi-hop architecture" or the "Bronze-Silver-Gold" pattern. The underlying storage is typically Delta Lake (in Databricks) or Apache Iceberg (in other lakehouses), though the pattern works with any ACID-capable storage format.
Architecture Overview
[Diagram: Raw Sources → Bronze Layer → Silver Layer → Gold Layer → Consumers]
① Raw Sources — any origin system: REST APIs, relational databases via CDC, CSV uploads, event streams, third-party SaaS exports.
② Bronze Layer — data lands here immediately and unchanged. The only transformations allowed are technical ones: adding an _ingested_at timestamp, a _source tag, or a _raw JSON column to capture the full payload. Think of Bronze as your time machine — if something breaks downstream, you replay from here.
③ Silver Layer — this is where engineering happens. Schema enforcement, deduplication, NULL handling, type casting, and cross-source joins. Silver tables are the "single version of truth" for a given domain (e.g., silver.orders, silver.customers).
④ Gold Layer — denormalized, aggregated tables built for specific consumers. A gold.daily_revenue_by_region table serves a dashboard. A gold.customer_360 table serves a model feature store. Gold tables are often rebuilt from Silver on a schedule.
⑤ Consumers — dashboards, ML models, APIs, and ad-hoc analysts. Consumers should read from Gold (or occasionally Silver for power users). Never from Bronze.
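The Bronze-only technical transforms described in ② (an _ingested_at timestamp, a _source tag, a _raw payload column) are simple enough to sketch without a cluster. This is a plain-Python illustration of the envelope a Bronze ingest step might add; the helper function is hypothetical, but the field names mirror the ones above:

```python
import json
from datetime import datetime, timezone

def to_bronze_record(raw_payload: dict, source: str) -> dict:
    """Wrap a raw source payload with Bronze technical metadata.

    No filtering or transformation of the payload itself: Bronze
    stores the data exactly as it arrived, plus lineage columns.
    """
    return {
        **raw_payload,                                        # payload untouched
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source": source,                                    # where it came from
        "_raw": json.dumps(raw_payload),                      # full original payload
    }

record = to_bronze_record({"order_id": "o-1", "amount": "19.99"}, source="orders_api")
print(sorted(k for k in record if k.startswith("_")))
# → ['_ingested_at', '_raw', '_source']
```

Because the payload is passed through untouched and captured again in `_raw`, a downstream replay can always reconstruct exactly what the source sent.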
Layer Comparison
| Dimension | Bronze | Silver | Gold |
|---|---|---|---|
| Data quality | Raw, unvalidated | Cleaned, validated | Aggregated, denormalized |
| Schema | Flexible / schema-on-read | Enforced (schema-on-write) | Optimized for specific use cases |
| Transformations | Minimal (timestamps, source tags) | Heavy (dedup, casts, joins) | Aggregations, business logic |
| Latency | Near-real-time (minutes) | Batch or micro-batch | Batch (hourly/daily typical) |
| Who reads it | Data engineers (debugging) | Data engineers, senior analysts | Analysts, dashboards, ML models |
| Retention | Long (years — audit trail) | Medium (months) | Short (rebuild from Silver) |
| Storage cost | High (all raw data) | Medium | Low (aggregated) |
| Typical format | Parquet / Delta, append-only | Delta with MERGE | Delta, optimized with ZORDER |
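The Gold layer's defining work, aggregation, reduces to a group-by over Silver rows. Here is a plain-Python sketch of the semantics behind a table like gold.daily_revenue_by_region (no Spark; the record shape is an assumption for illustration):

```python
from collections import defaultdict

# Silver-style order records (shape assumed for illustration)
silver_orders = [
    {"order_date": "2024-01-01", "region": "EU", "amount": 40.0},
    {"order_date": "2024-01-01", "region": "EU", "amount": 10.0},
    {"order_date": "2024-01-01", "region": "US", "amount": 25.0},
    {"order_date": "2024-01-02", "region": "EU", "amount": 5.0},
]

def daily_revenue_by_region(orders):
    """Rebuildable Gold aggregate: SUM(amount) grouped by (day, region)."""
    totals = defaultdict(float)
    for o in orders:
        totals[(o["order_date"], o["region"])] += o["amount"]
    return dict(totals)

gold = daily_revenue_by_region(silver_orders)
print(gold[("2024-01-01", "EU")])  # → 50.0
```

Because the function is pure with respect to its Silver input, the Gold table can be dropped and rebuilt on a schedule, which is exactly why Gold retention can stay short in the table above.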
Bronze to Silver: A PySpark Example
This is a realistic Bronze → Silver transformation for an orders dataset. It handles schema casting, deduplication, and NULL filtering — the three most common Silver-layer tasks.
# PySpark — Bronze to Silver transformation for orders
# Dialect: Spark SQL / PySpark (Databricks Runtime 14+)
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("bronze_to_silver_orders").getOrCreate()

# 1. Read from Bronze (append-only, raw ingestion)
bronze_df = (
    spark.read
    .format("delta")
    .load("/mnt/datalake/bronze/orders")
)

# Only process records not yet in Silver (watermark-based incremental).
# MAX(processed_at) is NULL on the very first run, so guard before filtering.
watermark = spark.sql("SELECT MAX(processed_at) FROM silver.orders").collect()[0][0]
if watermark is not None:
    bronze_df = bronze_df.filter(F.col("_ingested_at") > F.lit(watermark))

# 2. Apply Silver-layer transformations
silver_df = (
    bronze_df
    # Cast types — Bronze lands everything as string from the JSON payload
    .withColumn("order_id", F.col("order_id").cast("string"))
    .withColumn("customer_id", F.col("customer_id").cast("string"))
    .withColumn("amount", F.col("amount").cast(DoubleType()))
    .withColumn("order_date", F.to_timestamp(F.col("order_date"), "yyyy-MM-dd'T'HH:mm:ss"))
    # Drop rows missing critical keys or with invalid amounts
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("amount") > 0)
    # Deduplicate — keep the latest record per order_id
    .withColumn(
        "row_num",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("_ingested_at").desc())
        )
    )
    .filter(F.col("row_num") == 1)
    .drop("row_num")
    # Add processing metadata
    .withColumn("processed_at", F.current_timestamp())
    .withColumn("_source_layer", F.lit("silver"))
)

# 3. MERGE into Silver (upsert — handle late-arriving records)
silver_table = DeltaTable.forName(spark, "silver.orders")
(
    silver_table.alias("target")
    .merge(
        silver_df.alias("source"),
        "target.order_id = source.order_id"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

print(f"Silver merge complete. Records processed: {silver_df.count()}")
Key decisions in this example:
- Schema casting at Silver, not Bronze — Bronze accepts whatever the source sends; Silver enforces contracts.
- MERGE instead of overwrite — handles late-arriving records and corrections from the source system.
- Watermark-based incremental processing — don't reprocess the entire Bronze table on every run.
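The MERGE in step 3 can be pictured as a keyed update-or-insert. This plain-Python sketch shows the semantics of whenMatchedUpdateAll / whenNotMatchedInsertAll using a dict keyed on order_id; it is an illustration of the behavior, not the Delta implementation:

```python
def merge_upsert(target: dict, source_rows: list, key: str) -> dict:
    """Upsert semantics of a Delta MERGE: rows whose key matches an
    existing target row are fully updated; unmatched rows are inserted."""
    merged = dict(target)  # leave the input target untouched
    for row in source_rows:
        merged[row[key]] = row  # update-all if present, insert-all if not
    return merged

silver = {"o-1": {"order_id": "o-1", "amount": 10.0}}
incoming = [
    {"order_id": "o-1", "amount": 12.5},  # late correction → update
    {"order_id": "o-2", "amount": 30.0},  # new order → insert
]
result = merge_upsert(silver, incoming, key="order_id")
print(result["o-1"]["amount"], len(result))  # → 12.5 2
```

This is why MERGE beats overwrite for Silver: a late correction replaces one row instead of forcing a rebuild, and genuinely new rows still land.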
Common Pitfalls
1. Putting business logic in Bronze. Bronze is a safety net. The moment you start filtering, renaming, or transforming in Bronze, you lose the ability to replay cleanly when business logic changes. Keep Bronze as raw as possible.
2. Skipping Silver and going straight to Gold. It's tempting to merge Bronze directly into Gold for simple use cases. This works until your source schema changes, a deduplication bug appears, or a new downstream consumer needs slightly different logic. Silver is the insurance policy.
3. Gold tables that nobody owns. Gold tables accumulate. After 12 months, you often have 40 Gold tables where 8 are actively used and 32 are stale. Treat Gold tables like APIs — they need owners, SLAs, and deprecation processes.
4. Using a single pipeline for all three layers. Bronze → Silver → Gold should be independent, schedulable jobs. Coupling them in a single pipeline means a Silver failure blocks your Bronze landing, and a Gold rebuild triggers full Silver reprocessing.
5. Forgetting about schema evolution. Source systems change their schemas. Bronze should absorb schema changes silently (schema evolution enabled). Silver should validate and alert on unexpected columns. Gold should be rebuilt when Silver schema changes.
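For the schema-evolution pitfall, the Silver-side check ("validate and alert on unexpected columns") can be as small as a set comparison. A minimal sketch; the expected column list is illustrative:

```python
EXPECTED_SILVER_COLUMNS = {"order_id", "customer_id", "amount", "order_date"}

def check_schema(actual_columns, expected=EXPECTED_SILVER_COLUMNS):
    """Return (unexpected, missing) column sets so the caller can alert.

    Bronze absorbs schema drift silently; Silver is where drift should
    become a loud, actionable signal instead of silent corruption.
    """
    actual = set(actual_columns)
    return actual - expected, expected - actual

unexpected, missing = check_schema(
    ["order_id", "customer_id", "amount", "order_date", "coupon_code"]
)
print(unexpected, missing)  # → {'coupon_code'} set()
```

In practice you would run this against `df.columns` before the MERGE and route non-empty results to your alerting channel rather than failing silently.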
When NOT to Use Medallion Architecture
Medallion architecture adds overhead. For some scenarios it's the wrong tool:
Small or static datasets: If you're loading a 50k-row reference table from a vendor CSV once a month, the three-layer pattern is overkill. A single cleaned table is fine.
Pure streaming pipelines: Real-time pipelines where sub-second latency is required don't benefit from a batch-oriented Bronze → Silver → Gold cadence. Use Kafka + Flink or Spark Structured Streaming with a simpler schema.
Proof-of-concept work: Don't spend two weeks building medallion infrastructure to answer a one-off analytical question. Use it when the pipeline needs to be production-grade and maintained by a team.
Single-source, single-consumer pipelines: If one source feeds one consumer with no reuse, Bronze-Silver-Gold adds layers without adding value. A single transformation job is more maintainable.
Once your Gold tables are clean and business-ready, exploring them ad-hoc is where a tool like Harbinger Explorer becomes useful — it lets you run natural language queries against your Gold-layer data directly in the browser via DuckDB WASM, without spinning up a full analytics cluster for exploratory questions.
Conclusion
Medallion architecture is the right default for production lakehouses with multiple sources, multiple consumers, and a team that needs clear ownership boundaries. Bronze gives you durability and replayability. Silver gives you the single version of truth. Gold gives you performance and consumer-specific optimization.
Start with Bronze and Silver first — get one domain (e.g., orders or customers) working end-to-end before adding Gold aggregations. Build Gold tables only when you have a specific consumer with a clear SLA. Let the pattern grow with your data complexity, not ahead of it.
Continue Reading
- Databricks vs Snowflake vs BigQuery (2026)
- Delta Live Tables vs Classic ETL: Which Fits Your Pipeline?
- Excel to SQL: A Migration Guide for Business Analysts