Delta Live Tables vs Classic ETL: Which Fits Your Pipeline?
You've built classic ETL pipelines: PySpark jobs, Airflow DAGs, explicit MERGE statements. It works. Then someone on your team mentions Delta Live Tables and you wonder whether it's genuinely better or just new syntax over the same complexity. The answer: DLT solves specific problems very well and introduces different problems of its own. Here's how to evaluate the Delta Live Tables vs classic ETL tradeoff without the hype.
What Each Approach Actually Is
Classic ETL is explicit pipeline code: you write PySpark (or SQL) transformations, wire them together with an orchestrator (Airflow, Prefect, Databricks Workflows), manage dependencies manually, and implement your own error handling and quality checks.
Delta Live Tables (DLT) is Databricks' declarative ETL framework. You define what tables should contain, not how to build them. DLT handles dependency resolution, pipeline execution ordering, quality enforcement, and retry logic. It's opinionated by design.
The fundamental difference: classic ETL is imperative (you control execution), DLT is declarative (you declare expectations and DLT handles execution).
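To make that distinction concrete, here is a toy, framework-agnostic sketch of what "declarative" buys you. This is not DLT's implementation; the `table` decorator, registry, and table names below are all hypothetical. The point is that each table declares what it reads from, and the framework derives the execution order instead of you wiring it by hand.

```python
# Toy sketch of declarative dependency resolution. Illustrative only;
# this is NOT DLT internals, just the core idea: definitions declare
# their inputs, and the framework topologically sorts the DAG.
from graphlib import TopologicalSorter

registry = {}  # table name -> (declared dependencies, build function)

def table(name, depends_on=()):
    """Hypothetical decorator: register a table definition and its inputs."""
    def wrap(fn):
        registry[name] = (set(depends_on), fn)
        return fn
    return wrap

@table("bronze_orders")
def bronze():
    return "raw rows"

@table("silver_orders", depends_on=["bronze_orders"])
def silver():
    return "clean rows"

@table("gold_daily_revenue", depends_on=["silver_orders"])
def gold():
    return "aggregates"

def run_pipeline():
    # The framework, not the pipeline author, resolves execution order.
    graph = {name: deps for name, (deps, _) in registry.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        registry[name][1]()
    return order

print(run_pipeline())  # bronze runs before silver, silver before gold
```

In classic ETL, `run_pipeline` is effectively something you write and maintain yourself (as an Airflow DAG, for instance); in DLT, that resolution step is the framework's job.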
Feature Comparison
| Dimension | Delta Live Tables | Classic ETL (PySpark + Airflow) |
|---|---|---|
| Paradigm | Declarative — define what, not how | Imperative — define how, step by step |
| Dependency resolution | Automatic (DLT builds the DAG) | Manual (you wire jobs/tasks) |
| Data quality checks | Built-in expectations (warn/drop/fail) | DIY (assert statements, custom checks) |
| Streaming support | Native (batch and streaming in one pipeline) | Structured Streaming (separate setup) |
| Schema evolution | Automatic schema evolution | Manual handling required |
| Error handling | Built-in retry, quarantine tables | Custom error handling |
| Observability | Pipeline UI with lineage graph | Depends on orchestrator + logging setup |
| Debugging | Harder — less control over execution order | Easier — run individual jobs in isolation |
| Testing | Limited (DLT unit-testing support is still early stage) | Standard pytest / databricks-connect |
| Flexibility | Constrained — DLT API is the boundary | Full — write any valid Spark/Python code |
| Multi-platform | Databricks only | Platform-agnostic (can run on any Spark) |
| Learning curve | Low for simple pipelines | High (Spark + Airflow + Delta mastery) |
DLT Pricing
Last verified: March 2026. DLT adds a surcharge on top of standard Databricks DBU costs. Verify current figures at databricks.com/product/pricing.
| Pipeline Type | DLT Surcharge | When to Use |
|---|---|---|
| Core (formerly Classic) | 0.2 DBU/hour additional | Development, simple batch pipelines |
| Pro | 0.25 DBU/hour additional | Change Data Capture (CDC), advanced streaming |
| Advanced | 0.36 DBU/hour additional | Enhanced autoscaling, SLA guarantees |
DLT costs are additive to your underlying cluster compute. A medium-size cluster running DLT Pro pipelines can be meaningfully more expensive than equivalent classic jobs — model this before committing, especially for high-frequency streaming pipelines.
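As a back-of-envelope illustration of that modeling step, the surcharge math is simple multiplication. Every input below is an assumption (cluster size, runtime hours, and the dollars-per-DBU rate vary by cloud, region, and contract), so substitute your own numbers.

```python
# Back-of-envelope DLT surcharge estimate. All inputs are assumptions;
# substitute your own cluster size, hours, and current Databricks rates.
def dlt_monthly_surcharge(nodes, hours_per_day, days, surcharge_dbu_per_hour, dollars_per_dbu):
    """Extra monthly cost from the DLT tier surcharge alone (on top of base compute)."""
    return nodes * hours_per_day * days * surcharge_dbu_per_hour * dollars_per_dbu

# Example: 8-node cluster, 12 h/day, 30 days, Pro tier (0.25 DBU/h extra),
# at an assumed $0.55 per DBU.
extra = dlt_monthly_surcharge(nodes=8, hours_per_day=12, days=30,
                              surcharge_dbu_per_hour=0.25, dollars_per_dbu=0.55)
print(f"${extra:,.2f} per month in DLT surcharge")  # prints: $396.00 per month in DLT surcharge
```

Note this is only the surcharge delta; the base cluster DBUs are billed on top of it in both the DLT and classic cases.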
Note: DLT pricing tiers have been restructured since the Lakeflow rebrand, so verify the current tier names and surcharges at databricks.com/product/pricing.
DLT in Practice: The Expectations Syntax
The most compelling feature of DLT is data quality expectations. Instead of writing custom assertion code, you declare quality rules that DLT enforces at runtime.
```python
# Python — Delta Live Tables with expectations
# Databricks Runtime with Delta Live Tables
import dlt
from pyspark.sql import functions as F

# Bronze layer: raw ingestion (DLT handles scheduling and incrementalism)
@dlt.table(
    name="bronze_orders",
    comment="Raw order events from the source API — no transformations applied",
    table_properties={"quality": "bronze"}
)
def bronze_orders():
    return (
        spark.readStream
        .format("cloudFiles")  # Auto Loader — handles new file detection
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/datalake/schemas/orders")
        .load("/mnt/datalake/landing/orders/")
    )

# Silver layer: cleaned orders with quality expectations
@dlt.table(
    name="silver_orders",
    comment="Cleaned and validated orders — enforced quality contract",
    table_properties={"quality": "silver"}
)
# expect: record the violation, but keep the row (for monitoring)
@dlt.expect("positive_amount", "amount > 0")
# expect_or_fail: halt the pipeline if ANY row violates this rule
@dlt.expect_or_fail("non_null_order_id", "order_id IS NOT NULL")
# expect_or_drop: silently remove rows that violate this rule
@dlt.expect_or_drop("valid_status", "status IN ('pending', 'shipped', 'delivered', 'cancelled')")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .select(
            F.col("order_id").cast("string"),
            F.col("customer_id").cast("string"),
            F.col("amount").cast("double"),
            F.col("status").cast("string"),
            F.to_timestamp(F.col("order_date"), "yyyy-MM-dd'T'HH:mm:ss").alias("order_date")
        )
    )

# Gold layer: daily revenue aggregation (batch, reads from Silver)
@dlt.table(
    name="gold_daily_revenue",
    comment="Daily revenue aggregated from silver_orders — rebuilt daily"
)
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
        .groupBy(F.date_trunc("day", F.col("order_date")).alias("date"))
        .agg(
            F.sum("amount").alias("total_revenue"),
            F.count("*").alias("order_count")
        )
        .orderBy("date")
    )
```
The three expectation modes are the core DLT differentiator:
- @dlt.expect — log the violation in the pipeline event log, keep the row
- @dlt.expect_or_fail — stop the pipeline on any violation (good for critical keys)
- @dlt.expect_or_drop — silently quarantine invalid rows (good for optional fields)
In classic ETL, you'd implement all three modes as custom code — typically 30-50 lines of assertion logic, custom exception handling, and logging setup. DLT handles this in one decorator.
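For a sense of what that hand-rolled code looks like, here is a minimal, framework-agnostic sketch of the three modes applied to plain Python rows. The `apply_expectations` helper is hypothetical, not PySpark; in a real classic pipeline each rule would be a DataFrame filter plus logging and metrics plumbing.

```python
# Hand-rolled equivalent of DLT's three expectation modes, sketched over
# plain Python rows. A real classic-ETL version would express each rule
# as a DataFrame filter and ship violations to a log or metrics sink.
class ExpectationError(Exception):
    pass

def apply_expectations(rows, rules):
    """rules: list of (name, predicate, mode) with mode in {'warn', 'drop', 'fail'}."""
    kept, violations = [], []
    for row in rows:
        keep = True
        for name, predicate, mode in rules:
            if predicate(row):
                continue
            violations.append((name, row))  # every violation is recorded
            if mode == "fail":
                raise ExpectationError(f"expectation '{name}' violated: {row}")
            if mode == "drop":
                keep = False  # quarantine the row, keep processing others
        if keep:
            kept.append(row)
    return kept, violations

rules = [
    ("positive_amount", lambda r: r["amount"] > 0, "warn"),
    ("non_null_order_id", lambda r: r["order_id"] is not None, "fail"),
    ("valid_status", lambda r: r["status"] in {"pending", "shipped", "delivered", "cancelled"}, "drop"),
]

rows = [
    {"order_id": "a1", "amount": -5.0, "status": "pending"},  # warn: kept, but logged
    {"order_id": "a2", "amount": 10.0, "status": "unknown"},  # drop: removed
    {"order_id": "a3", "amount": 20.0, "status": "shipped"},  # clean: kept
]
kept, violations = apply_expectations(rows, rules)
# kept contains a1 and a3; violations records positive_amount and valid_status
```

Even this stripped-down version needs an exception type, a violation log, and per-mode branching; the DataFrame-native version with proper logging is where the 30-50 lines come from.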
Honest Trade-offs
Where DLT Genuinely Wins
Data quality enforcement is legitimately better. The expectations system covers the 80% case with zero boilerplate. The pipeline event log captures every expectation violation with row-level detail — this is something most classic ETL pipelines only have if someone invested significant engineering effort.
Streaming + batch in one framework. DLT abstracts whether a table is streaming or batch — you can switch a table from batch to streaming by changing dlt.read to dlt.read_stream without restructuring the pipeline. Classic ETL keeps these as fundamentally different code paths.
Observability out of the box. The DLT pipeline graph UI shows data flow, lineage, and quality metrics without any setup. Classic pipelines require assembling this from Airflow/Databricks Workflows logs, custom dashboards, and Great Expectations or similar.
Where Classic ETL Wins
Debugging. DLT pipelines are harder to debug in isolation. You can't easily run a single table definition outside the pipeline context. In classic ETL, you run the Spark job directly in a notebook and inspect intermediate DataFrames.
Testing. Unit testing DLT code is an active pain point. The DLT unit testing framework is still evolving. Classic PySpark code is testable with standard pytest and databricks-connect.
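One common workaround is a pattern rather than a DLT feature: keep the business logic in plain functions that both the pipeline and pytest can import, so the Spark/DLT wiring stays thin. The function and aliases below are illustrative, not from any real pipeline.

```python
# Pattern sketch: factor pure logic out of the Spark/DLT wiring so pytest
# can exercise it without a cluster. Names here are illustrative.
def normalize_status(raw):
    """Business rule shared by the pipeline and the unit tests."""
    cleaned = (raw or "").strip().lower()
    aliases = {"complete": "delivered", "canceled": "cancelled"}
    return aliases.get(cleaned, cleaned)

# In the pipeline this becomes a UDF, e.g. F.udf(normalize_status, "string").
# In tests it is just a function call, no SparkSession required:
def test_normalize_status():
    assert normalize_status("  Complete ") == "delivered"
    assert normalize_status("canceled") == "cancelled"
    assert normalize_status("shipped") == "shipped"
    assert normalize_status(None) == ""
```

The thin Spark layer that remains can then be covered by integration tests against a remote cluster via databricks-connect.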
Flexibility. DLT is constrained to the DLT API surface. If you need custom checkpoint logic, complex conditional branching, or integration with non-Databricks systems, you hit the framework boundaries fast. Classic ETL has no such ceiling.
Portability. DLT is Databricks-only. Classic PySpark runs on any Spark cluster — EMR, GCP Dataproc, a self-hosted cluster. If cross-cloud portability matters, DLT is a lock-in risk.
When to Choose Each
Choose DLT when:
- You're building streaming pipelines on Databricks and want batch/streaming unified
- Data quality enforcement is a core requirement and you want it without DIY boilerplate
- The pipeline follows a clear Bronze → Silver → Gold pattern with defined expectations
- Your team is Databricks-focused and operational simplicity matters more than flexibility
- You're comfortable with the Databricks cost premium
Choose Classic ETL when:
- You need platform portability (may leave Databricks or run on multi-cloud)
- Debugging and unit testing in isolation are priorities
- The pipeline logic is complex enough to require full programmatic control
- You have existing Airflow infrastructure and team expertise
- The pipeline involves non-Databricks systems or custom checkpointing
The Gold Layer and What Comes After
Whether you use DLT or classic ETL, the Gold-layer tables it produces need to be explored. That's where the pipeline ends and analysis begins — and for teams doing ad-hoc exploration without spinning up a full BI tool, Harbinger Explorer lets you query those tables directly in the browser using DuckDB WASM with natural language query support.
Conclusion
DLT is not a universal upgrade over classic ETL — it's a different set of tradeoffs. If your team is Databricks-native, building streaming lakehouses with data quality requirements, and willing to accept reduced debugging flexibility, DLT is genuinely the right choice. If you need portability, testability, or pipeline complexity that exceeds what the DLT API handles, classic ETL with Airflow remains the more practical option.
The expectation syntax and streaming unification are DLT's real arguments. Evaluate them against your actual pipeline needs, not the abstract promise of "less code."
Continue Reading
- Medallion Architecture Explained
- Databricks vs Snowflake vs BigQuery (2026)
- Excel to SQL: A Migration Guide for Business Analysts