Databricks Streaming Tables: DLT vs Structured Streaming
Most Databricks teams start with raw Structured Streaming — spark.readStream, writeStream, checkpoints. It works. Then Delta Live Tables arrives and suddenly you have two ways to do the same thing. Understanding when each is the right tool saves you from painful mid-project rewrites.
Two Ways to Build Streaming Pipelines
Databricks gives you two first-class options for streaming:
- Structured Streaming — the underlying Spark API, full control, maximum flexibility
- DLT Streaming Tables — a declarative abstraction on top of Structured Streaming, opinionated and managed
They're not mutually exclusive — DLT Streaming Tables are Structured Streaming under the hood. The question is how much of the plumbing you want to own.
Structured Streaming: Full Control
Structured Streaming processes data as a continuous sequence of micro-batches. You define a source, a transformation, and a sink. Databricks handles the execution.
```python
# PySpark — Basic Structured Streaming pipeline
from pyspark.sql.functions import col, from_json

# Read from Event Hubs (Kafka protocol); SASL auth options omitted for brevity
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
    .option("subscribe", "sensor-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload into typed columns
parsed = (
    raw
    .selectExpr("CAST(value AS STRING) AS json_payload", "timestamp AS kafka_ts")
    .withColumn("data", from_json(col("json_payload"), "sensor_id STRING, temp DOUBLE, ts LONG"))
    .select("kafka_ts", "data.*")
)

# Write to Delta; toTable() starts the streaming query
(
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/sensors_bronze")
    .trigger(processingTime="30 seconds")
    .toTable("bronze.sensor_events")
)
```
You own the checkpoint, the trigger interval, the retry logic, and the monitoring. Maximum control, maximum responsibility.
DLT Streaming Tables: Declarative Pipelines
Delta Live Tables introduces a declarative SQL or Python API where you define what you want, not how to run it. DLT manages execution, retries, schema enforcement, and data quality.
```python
# Python DLT — Streaming Table definition
import dlt
from pyspark.sql.functions import col, from_json

@dlt.table(
    name="sensor_events_bronze",
    comment="Raw sensor events from Event Hub",
    table_properties={"quality": "bronze"},
)
def sensor_events_bronze():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "eh-namespace.servicebus.windows.net:9093")
        .option("subscribe", "sensor-events")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS raw_payload", "timestamp AS kafka_ts")
    )

@dlt.table(name="sensor_events_silver")
@dlt.expect_or_drop("valid_temp", "temp BETWEEN -50 AND 150")
def sensor_events_silver():
    return (
        dlt.read_stream("sensor_events_bronze")
        .withColumn("data", from_json(col("raw_payload"), "sensor_id STRING, temp DOUBLE"))
        .select("kafka_ts", "data.sensor_id", "data.temp")
    )
```
DLT handles checkpoints automatically. The @dlt.expect_or_drop decorator enforces data quality: rows that fail the expectation are dropped, while the sibling decorators @dlt.expect (record violations but keep the rows) and @dlt.expect_or_fail (abort the update) give you other failure modes. No checkpoint management, no manual recovery.
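Conceptually, expect_or_drop behaves like a filter plus a violation metric. A minimal pure-Python sketch of the drop semantics (illustration only, not DLT's actual implementation):

```python
# Pure-Python sketch of DLT expect_or_drop semantics (illustration only).
def expect_or_drop(rows, name, predicate):
    """Keep rows that satisfy the predicate; count the rest as violations."""
    kept = [r for r in rows if predicate(r)]
    violations = len(rows) - len(kept)
    print(f"expectation '{name}': {violations} row(s) dropped")
    return kept

readings = [
    {"sensor_id": "a1", "temp": 21.5},
    {"sensor_id": "a2", "temp": 999.0},  # out of range -> dropped
    {"sensor_id": "a3", "temp": -12.0},
]
valid = expect_or_drop(readings, "valid_temp", lambda r: -50 <= r["temp"] <= 150)
# valid now holds two rows; the out-of-range reading was dropped
```

The key point DLT adds on top of this filter is observability: violation counts surface in the pipeline event log instead of vanishing silently.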
Streaming Tables in SQL
DLT also supports SQL-first Streaming Tables, which became GA in late 2024:
```sql
-- DLT SQL — Streaming Table
CREATE OR REFRESH STREAMING TABLE sensor_events_silver
COMMENT 'Validated sensor readings'
AS SELECT
  kafka_ts,
  raw_payload:sensor_id::STRING AS sensor_id,
  raw_payload:temp::DOUBLE AS temp
FROM STREAM(LIVE.sensor_events_bronze)
WHERE raw_payload:temp::DOUBLE BETWEEN -50 AND 150;
```
The STREAM() wrapper tells DLT this is a streaming read. The LIVE. prefix references another DLT table in the same pipeline.
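The colon syntax (raw_payload:temp::DOUBLE) is Databricks SQL's JSON path extraction followed by a cast. As a plain-Python sketch of what the query does per row (sample payloads are made up):

```python
import json

# Rough pure-Python equivalent of the SQL above: extract JSON fields
# from each raw payload and keep only in-range temperatures.
payloads = [
    '{"sensor_id": "a1", "temp": 21.5}',
    '{"sensor_id": "a2", "temp": 999.0}',  # fails the range check
]

silver = []
for raw in payloads:
    data = json.loads(raw)
    temp = float(data["temp"])           # the ::DOUBLE cast
    if -50 <= temp <= 150:               # the WHERE clause
        silver.append({"sensor_id": data["sensor_id"], "temp": temp})
# silver contains only the a1 reading
```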
Side-by-Side Comparison
| Dimension | Structured Streaming | DLT Streaming Tables |
|---|---|---|
| API style | Imperative (Python/Scala) | Declarative (Python or SQL) |
| Checkpoint management | Manual | Automatic |
| Data quality | Custom validation logic | Built-in expectations framework |
| Pipeline orchestration | External (Workflows, Airflow) | Built-in DLT pipeline |
| Debugging | Spark UI, logs | DLT pipeline UI with lineage |
| Schema enforcement | Manual mergeSchema | Automatic with evolution |
| Cost | Job cluster | DLT cluster (DBU surcharge) |
| Custom sources | Any Spark source | Any Spark source |
| Stateful operations | Full support | Supported, but complex |
| Multi-hop (bronze→silver→gold) | Manual chaining | Native table references |
DLT DBU Surcharge: The Real Cost
DLT pipelines run on one of three product tiers (Core, Pro, or Advanced), each adding a DBU multiplier on top of the base compute cost:
| Tier | DBU Multiplier | When to use |
|---|---|---|
| Core | 0.2x extra | Non-critical pipelines, dev |
| Pro | 0.25x extra | Expectations, autoscaling |
| Advanced | 0.3x extra | Row-level security, enhanced monitoring |
(Last verified: April 2026 — [PRICING-CHECK])
For cost-sensitive pipelines with simple transformations, raw Structured Streaming on a standard job cluster can be significantly cheaper. Run the numbers before committing to DLT for high-volume, always-on pipelines.
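Back-of-envelope arithmetic makes the tradeoff concrete. With entirely hypothetical rates (a $0.30/DBU base price and a cluster consuming 10 DBU/hour are placeholders, not Databricks list prices; check current pricing), a 24/7 pipeline on the Pro tier compares like this:

```python
# Hypothetical cost comparison; all rates are made-up placeholders,
# not Databricks list prices.
DBU_PRICE = 0.30           # $ per DBU (placeholder)
CLUSTER_DBU_PER_HOUR = 10  # placeholder cluster size
HOURS_PER_MONTH = 730
DLT_SURCHARGE = 0.25       # "Pro" multiplier from the table above

base = DBU_PRICE * CLUSTER_DBU_PER_HOUR * HOURS_PER_MONTH
dlt = base * (1 + DLT_SURCHARGE)
print(f"job cluster: ${base:,.2f}/month")
print(f"DLT Pro:     ${dlt:,.2f}/month (+${dlt - base:,.2f})")
```

With these placeholder numbers the surcharge adds roughly $550/month per always-on cluster, which compounds quickly across many pipelines.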
Stateful Streaming: Watermarks and Windows
Both approaches support stateful aggregations — but you need to be explicit about watermarks or your state store will grow unbounded.
```python
# PySpark — Windowed aggregation with watermark
from pyspark.sql.functions import avg, col, window

# Assumes `parsed` carries an event_time timestamp column derived from the payload
aggregated = (
    parsed
    .withWatermark("event_time", "10 minutes")   # tolerate 10-min late data
    .groupBy(
        window(col("event_time"), "5 minutes"),  # 5-min tumbling window
        col("sensor_id"),
    )
    .agg(avg("temp").alias("avg_temp"))
)
```
In DLT, stateful aggregations are supported but require more care — use dlt.read_stream() with explicit watermarks and avoid unbounded state by always specifying windows.
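To build intuition for what the watermark buys you, here is a pure-Python simulation of tumbling windows with a late-data cutoff. None of this is Spark API; it only mimics the semantics: events older than the maximum event time seen minus the watermark are discarded rather than reopening finalized window state.

```python
# Pure-Python simulation of watermark + tumbling-window semantics.
# Illustration only; this is not Spark code.
WINDOW = 300      # 5-minute tumbling windows (seconds)
WATERMARK = 600   # tolerate 10 minutes of lateness

windows = {}          # window start -> list of temps
max_event_time = 0
dropped_late = 0

events = [  # (event_time_seconds, temp), made-up sample data
    (100, 20.0), (250, 22.0), (900, 25.0),
    (400, 21.0),   # late relative to 900, but within the watermark -> accepted
    (1800, 24.0),
    (100, 19.0),   # 1800 - 600 = 1200 > 100 -> too late, dropped
]

for event_time, temp in events:
    max_event_time = max(max_event_time, event_time)
    if event_time < max_event_time - WATERMARK:
        dropped_late += 1       # state for this window was already finalized
        continue
    start = (event_time // WINDOW) * WINDOW
    windows.setdefault(start, []).append(temp)

averages = {start: sum(v) / len(v) for start, v in sorted(windows.items())}
```

Without the cutoff check, the windows dict (Spark's state store) would grow for as long as the stream runs, which is exactly the unbounded-state failure mode described above.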
Common Pitfalls
1. Missing watermarks on stateful aggregations. Without a watermark, Spark holds all state forever; memory pressure and eventual OOM errors follow.
2. Wrong trigger interval. trigger(once=True) processes all available data and stops; it's a one-shot micro-batch job, not continuous streaming. Use trigger(availableNow=True) for the modern equivalent in Databricks Runtime 11.3+.
3. DLT for one-off batch loads. DLT pipelines have startup overhead; for simple one-time ingestion, a regular notebook job is faster and cheaper.
4. Sharing checkpoints between stream definitions. Never reuse a checkpoint directory for a different stream definition: the checkpoint encodes the query plan and source offsets, and mismatches cause failures or silent data loss.
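One defensive convention (an assumption of this article, not an official Databricks API) is to derive the checkpoint path from a stable query identity plus an explicit version, so a changed stream definition can never silently pick up incompatible state:

```python
import hashlib

# Hypothetical convention for checkpoint paths: hash the query identity
# and include an explicit version, so a redefined stream gets a fresh
# checkpoint instead of silently reusing an incompatible one.
def checkpoint_path(source: str, sink: str, version: int,
                    root: str = "/mnt/checkpoints") -> str:
    identity = hashlib.sha256(f"{source}->{sink}".encode()).hexdigest()[:12]
    return f"{root}/{identity}/v{version}"

p1 = checkpoint_path("kafka:sensor-events", "bronze.sensor_events", version=1)
p2 = checkpoint_path("kafka:sensor-events", "bronze.sensor_events", version=2)
# p1 != p2: bumping the version forces a clean checkpoint after a query change
```

The version bump is deliberate friction: it forces whoever changes the query to decide whether the old state is still valid.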
5. Ignoring _rescued_data from Auto Loader sources. If you feed Databricks Auto Loader into a DLT pipeline, handle the _rescued_data column explicitly in your silver table; it captures fields that didn't match the inferred or declared schema.
When to Choose Which
Use DLT Streaming Tables when:
- You're building a medallion architecture with multiple hops
- Data quality expectations are a first-class requirement
- Your team includes SQL-first engineers
- You want built-in lineage and pipeline observability
- You're already using DLT for batch tables
Use raw Structured Streaming when:
- You need complex stateful logic (custom flatMapGroupsWithState)
- Cost is a primary constraint and DLT DBU surcharge matters
- You need non-Delta sinks (Cassandra, Elasticsearch, custom)
- You're integrating with an existing Airflow or Databricks Workflows orchestration layer
Key Takeaways
DLT Streaming Tables are the right default for new medallion pipelines — they remove operational overhead and add data quality enforcement. Raw Structured Streaming remains the right choice when you need full control over stateful logic, custom sinks, or when DLT's cost overhead doesn't justify the convenience.
The two approaches aren't rivals — many production architectures use raw Structured Streaming for high-volume event ingestion at the edge and DLT for the bronze-to-gold transformation layer.
Continue Reading
Databricks Autoloader: The Complete Guide
CI/CD Pipelines for Databricks Projects: A Production-Ready Guide
Build a robust CI/CD pipeline for your Databricks projects using GitHub Actions, Databricks Asset Bundles, and automated testing. Covers branching strategy, testing, and deployment.
Databricks Cluster Policies for Cost Control: A Practical Guide
Learn how to use Databricks cluster policies to enforce cost guardrails, standardize cluster configurations, and prevent cloud bill surprises without blocking your team's productivity.