Databricks Streaming Tables: DLT vs Structured Streaming
Most Databricks teams start with raw Structured Streaming — spark.readStream, writeStream, checkpoints. It works. Then Delta Live Tables arrives and suddenly you have two ways to do the same thing. Understanding when each is the right tool saves you from painful mid-project rewrites.
Two Ways to Build Streaming Pipelines
Databricks gives you two first-class options for streaming:
- Structured Streaming — the underlying Spark API, full control, maximum flexibility
- DLT Streaming Tables — a declarative abstraction on top of Structured Streaming, opinionated and managed
They're not mutually exclusive — DLT Streaming Tables are Structured Streaming under the hood. The question is how much of the plumbing you want to own.
Structured Streaming: Full Control
Structured Streaming processes data as a continuous sequence of micro-batches. You define a source, a transformation, and a sink. Databricks handles the execution.
```python
# PySpark — Basic Structured Streaming pipeline
from pyspark.sql.functions import col, from_json

# Read from Event Hubs (Kafka protocol); SASL auth options omitted for brevity
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
    .option("subscribe", "sensor-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload into typed columns
parsed = (
    raw
    .selectExpr("CAST(value AS STRING) AS json_payload", "timestamp AS kafka_ts")
    .withColumn("data", from_json(col("json_payload"), "sensor_id STRING, temp DOUBLE, ts LONG"))
    .select("kafka_ts", "data.*")
)

# Write to Delta; toTable() starts the streaming query
(
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/sensors_bronze")
    .trigger(processingTime="30 seconds")
    .toTable("bronze.sensor_events")
)
```
You own the checkpoint, the trigger interval, the retry logic, and the monitoring. Maximum control, maximum responsibility.
DLT Streaming Tables: Declarative Pipelines
Delta Live Tables introduces a declarative SQL or Python API where you define what you want, not how to run it. DLT manages execution, retries, schema enforcement, and data quality.
```python
# Python DLT — Streaming Table definition
import dlt
from pyspark.sql.functions import col, from_json

@dlt.table(
    name="sensor_events_bronze",
    comment="Raw sensor events from Event Hub",
    table_properties={"quality": "bronze"},
)
def sensor_events_bronze():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
        .option("subscribe", "sensor-events")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS raw_payload", "timestamp AS kafka_ts")
    )

@dlt.table(name="sensor_events_silver")
@dlt.expect_or_drop("valid_temp", "temp BETWEEN -50 AND 150")
def sensor_events_silver():
    return (
        dlt.read_stream("sensor_events_bronze")
        .withColumn("data", from_json(col("raw_payload"), "sensor_id STRING, temp DOUBLE"))
        .select("kafka_ts", "data.sensor_id", "data.temp")
    )
```
DLT handles checkpoints automatically. The @dlt.expect_or_drop decorator enforces data quality: rows that fail the expectation are dropped, while the sibling decorators @dlt.expect (record violations but keep the rows) and @dlt.expect_or_fail (abort the update) give you other failure modes. No checkpoint management, no manual recovery.
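Conceptually, expect_or_drop behaves like a filter plus a violation metric. A minimal pure-Python sketch of the drop semantics (illustration only, not DLT's actual implementation):

```python
# Pure-Python sketch of DLT expect_or_drop semantics (illustration only).
def expect_or_drop(rows, name, predicate):
    """Keep rows that satisfy the predicate; count the rest as violations."""
    kept = [r for r in rows if predicate(r)]
    violations = len(rows) - len(kept)
    print(f"expectation '{name}': {violations} row(s) dropped")
    return kept

readings = [
    {"sensor_id": "a1", "temp": 21.5},
    {"sensor_id": "a2", "temp": 999.0},  # out of range -> dropped
    {"sensor_id": "a3", "temp": -12.0},
]
valid = expect_or_drop(readings, "valid_temp", lambda r: -50 <= r["temp"] <= 150)
# valid now holds two rows; the out-of-range reading was dropped
```

The key point DLT adds on top of this filter is observability: violation counts surface in the pipeline event log instead of vanishing silently.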
Streaming Tables in SQL
DLT also supports SQL-first Streaming Tables, which became GA in late 2024:
```sql
-- DLT SQL — Streaming Table
CREATE OR REFRESH STREAMING TABLE sensor_events_silver
COMMENT 'Validated sensor readings'
AS SELECT
  kafka_ts,
  raw_payload:sensor_id::STRING AS sensor_id,
  raw_payload:temp::DOUBLE AS temp
FROM STREAM(LIVE.sensor_events_bronze)
WHERE raw_payload:temp::DOUBLE BETWEEN -50 AND 150;
```
The STREAM() wrapper tells DLT this is a streaming read. The LIVE. prefix references another DLT table in the same pipeline.
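The colon syntax (raw_payload:temp::DOUBLE) is Databricks SQL's JSON path extraction followed by a cast. As a plain-Python sketch of what the query does per row (sample payloads are made up):

```python
import json

# Rough pure-Python equivalent of the SQL above: extract JSON fields
# from each raw payload and keep only in-range temperatures.
payloads = [
    '{"sensor_id": "a1", "temp": 21.5}',
    '{"sensor_id": "a2", "temp": 999.0}',  # fails the range check
]

silver = []
for raw in payloads:
    data = json.loads(raw)
    temp = float(data["temp"])           # the ::DOUBLE cast
    if -50 <= temp <= 150:               # the WHERE clause
        silver.append({"sensor_id": data["sensor_id"], "temp": temp})
# silver contains only the a1 reading
```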
Side-by-Side Comparison
| Dimension | Structured Streaming | DLT Streaming Tables |
|---|---|---|
| API style | Imperative (Python/Scala) | Declarative (Python or SQL) |
| Checkpoint management | Manual | Automatic |
| Data quality | Custom validation logic | Built-in expectations framework |
| Pipeline orchestration | External (Workflows, Airflow) | Built-in DLT pipeline |
| Debugging | Spark UI, logs | DLT pipeline UI with lineage |
| Schema enforcement | Manual mergeSchema | Automatic with evolution |
| Cost | Job cluster | DLT cluster (DBU surcharge) |
| Custom sources | Any Spark source | Any Spark source |
| Stateful operations | Full support | Supported, but complex |
| Multi-hop (bronze→silver→gold) | Manual chaining | Native table references |
DLT DBU Surcharge: The Real Cost
DLT pipelines run on one of three product tiers (Core, Pro, or Advanced), each adding a DBU multiplier on top of the base compute cost:
| Tier | DBU Multiplier | When to use |
|---|---|---|
| Core | 0.2x extra | Non-critical pipelines, dev |
| Pro | 0.25x extra | Expectations, autoscaling |
| Advanced | 0.3x extra | Row-level security, enhanced monitoring |
(Last verified: April 2026 — [PRICING-CHECK])
For cost-sensitive pipelines with simple transformations, raw Structured Streaming on a standard job cluster can be significantly cheaper. Run the numbers before committing to DLT for high-volume, always-on pipelines.
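Back-of-envelope arithmetic makes the tradeoff concrete. With entirely hypothetical rates (a $0.30/DBU base price and a cluster consuming 10 DBU/hour are placeholders, not Databricks list prices; check current pricing), a 24/7 pipeline on the Pro tier compares like this:

```python
# Hypothetical cost comparison; all rates are made-up placeholders,
# not Databricks list prices.
DBU_PRICE = 0.30           # $ per DBU (placeholder)
CLUSTER_DBU_PER_HOUR = 10  # placeholder cluster size
HOURS_PER_MONTH = 730
DLT_SURCHARGE = 0.25       # "Pro" multiplier from the table above

base = DBU_PRICE * CLUSTER_DBU_PER_HOUR * HOURS_PER_MONTH
dlt = base * (1 + DLT_SURCHARGE)
print(f"job cluster: ${base:,.2f}/month")
print(f"DLT Pro:     ${dlt:,.2f}/month (+${dlt - base:,.2f})")
```

With these placeholder numbers the surcharge adds roughly $550/month per always-on cluster, which compounds quickly across many pipelines.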
Stateful Streaming: Watermarks and Windows
Both approaches support stateful aggregations — but you need to be explicit about watermarks or your state store will grow unbounded.
```python
# PySpark — Windowed aggregation with watermark
from pyspark.sql.functions import avg, col, window

# Assumes `parsed` carries an event_time timestamp column derived from the payload
aggregated = (
    parsed
    .withWatermark("event_time", "10 minutes")   # tolerate 10-min late data
    .groupBy(
        window(col("event_time"), "5 minutes"),  # 5-min tumbling window
        col("sensor_id"),
    )
    .agg(avg("temp").alias("avg_temp"))
)
```
In DLT, stateful aggregations are supported but require more care — use dlt.read_stream() with explicit watermarks and avoid unbounded state by always specifying windows.
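To build intuition for what the watermark buys you, here is a pure-Python simulation of tumbling windows with a late-data cutoff. None of this is Spark API; it only mimics the semantics: events older than the maximum event time seen minus the watermark are discarded rather than reopening finalized window state.

```python
# Pure-Python simulation of watermark + tumbling-window semantics.
# Illustration only; this is not Spark code.
WINDOW = 300      # 5-minute tumbling windows (seconds)
WATERMARK = 600   # tolerate 10 minutes of lateness

windows = {}          # window start -> list of temps
max_event_time = 0
dropped_late = 0

events = [  # (event_time_seconds, temp), made-up sample data
    (100, 20.0), (250, 22.0), (900, 25.0),
    (400, 21.0),   # late relative to 900, but within the watermark -> accepted
    (1800, 24.0),
    (100, 19.0),   # 1800 - 600 = 1200 > 100 -> too late, dropped
]

for event_time, temp in events:
    max_event_time = max(max_event_time, event_time)
    if event_time < max_event_time - WATERMARK:
        dropped_late += 1       # state for this window was already finalized
        continue
    start = (event_time // WINDOW) * WINDOW
    windows.setdefault(start, []).append(temp)

averages = {start: sum(v) / len(v) for start, v in sorted(windows.items())}
```

Without the cutoff check, the windows dict (Spark's state store) would grow for as long as the stream runs, which is exactly the unbounded-state failure mode described above.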
Common Pitfalls
1. Missing watermarks on stateful aggregations. Without a watermark, Spark holds all state forever; memory pressure and eventual OOM errors follow.
2. Wrong trigger interval. trigger(once=True) processes all available data and stops; it's a one-shot micro-batch job, not continuous streaming. Use trigger(availableNow=True) for the modern equivalent in Databricks Runtime 11.3+.
3. DLT for one-off batch loads. DLT pipelines have startup overhead; for simple one-time ingestion, a regular notebook job is faster and cheaper.
4. Sharing checkpoints between stream definitions. Never reuse a checkpoint directory for a different stream definition: the checkpoint encodes the query plan and source offsets, and mismatches cause failures or silent data loss.
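One defensive convention (an assumption of this article, not an official Databricks API) is to derive the checkpoint path from a stable query identity plus an explicit version, so a changed stream definition can never silently pick up incompatible state:

```python
import hashlib

# Hypothetical convention for checkpoint paths: hash the query identity
# and include an explicit version, so a redefined stream gets a fresh
# checkpoint instead of silently reusing an incompatible one.
def checkpoint_path(source: str, sink: str, version: int,
                    root: str = "/mnt/checkpoints") -> str:
    identity = hashlib.sha256(f"{source}->{sink}".encode()).hexdigest()[:12]
    return f"{root}/{identity}/v{version}"

p1 = checkpoint_path("kafka:sensor-events", "bronze.sensor_events", version=1)
p2 = checkpoint_path("kafka:sensor-events", "bronze.sensor_events", version=2)
# p1 != p2: bumping the version forces a clean checkpoint after a query change
```

The version bump is deliberate friction: it forces whoever changes the query to decide whether the old state is still valid.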
5. Ignoring _rescued_data from Auto Loader sources. If you feed Databricks Auto Loader into a DLT pipeline, handle the _rescued_data column explicitly in your silver table; it captures fields that didn't match the inferred or declared schema.
When to Choose Which
Use DLT Streaming Tables when:
- You're building a medallion architecture with multiple hops
- Data quality expectations are a first-class requirement
- Your team includes SQL-first engineers
- You want built-in lineage and pipeline observability
- You're already using DLT for batch tables
Use raw Structured Streaming when:
- You need complex stateful logic (custom flatMapGroupsWithState)
- Cost is a primary constraint and DLT DBU surcharge matters
- You need non-Delta sinks (Cassandra, Elasticsearch, custom)
- You're integrating with an existing Airflow or Databricks Workflows orchestration layer
Key Takeaways
DLT Streaming Tables are the right default for new medallion pipelines — they remove operational overhead and add data quality enforcement. Raw Structured Streaming remains the right choice when you need full control over stateful logic, custom sinks, or when DLT's cost overhead doesn't justify the convenience.
The two approaches aren't rivals — many production architectures use raw Structured Streaming for high-volume event ingestion at the edge and DLT for the bronze-to-gold transformation layer.
Continue Reading
Databricks Autoloader: The Complete Guide
CI/CD Pipelines for Databricks Projects: A Production-Ready Guide
Build a robust CI/CD pipeline for your Databricks projects using GitHub Actions, Databricks Asset Bundles, and automated testing. Covers branching strategy, testing, and deployment.
Databricks Cluster Policies for Cost Control: A Practical Guide
Learn how to use Databricks cluster policies to enforce cost guardrails, standardize cluster configurations, and prevent cloud bill surprises without blocking your team's productivity.