Streaming vs Batch Processing: When to Use Which
Your fraud detection model fires 90 seconds after the transaction completes. By then, the money is gone. That's the wrong processing model for the wrong use case — and it's a mistake that happens more often than you'd think.
Choosing between streaming and batch processing isn't about which is technically superior. It's about matching the processing model to the latency requirement of your use case. Here's a clear framework for making that call.
TL;DR
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high | Moderate (per-event overhead) |
| Complexity | Lower | Significantly higher |
| Cost | Lower (bursty compute) | Higher (always-on infra) |
| Fault tolerance | Rerun the job | Checkpointing, exactly-once semantics |
| Best for | Reports, ETL, ML training | Fraud detection, alerts, live dashboards |
What Is Batch Processing?
Batch processing collects data over a period of time — an hour, a day, a week — and processes it all at once. The paradigm is simple: read a bounded dataset, transform it, write results.
Most data warehouses are built on batch. Your nightly dbt run, your weekly reporting pipeline, your monthly ML retraining job — these are all batch.
```python
# PySpark batch job example — daily aggregation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as spark_sum, to_date

spark = SparkSession.builder.appName("daily_orders").getOrCreate()

df = spark.read.parquet("s3://datalake/orders/date=2024-01-15/")

daily_summary = (
    df
    .groupBy(to_date(col("created_at")).alias("order_date"), col("region"))
    .agg(
        spark_sum("amount").alias("total_revenue"),
        count("*").alias("order_count"),
    )
)

daily_summary.write.mode("overwrite").parquet("s3://datalake/daily_order_summary/date=2024-01-15/")
```
Batch is the right default. It's operationally simpler, cheaper to run, and easier to debug. The question is whether your use case can tolerate the latency.
What Is Streaming Processing?
Streaming processing treats data as an unbounded, continuous flow. Each event is processed as it arrives — or within a configurable time window.
The latency profile is fundamentally different: instead of "we'll know in 6 hours," you're in the range of milliseconds to a few seconds.
```python
# PySpark Structured Streaming — real-time Kafka consumer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as spark_sum, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("fraud_alerts").getOrCreate()

schema = (
    StructType()
    .add("transaction_id", StringType())
    .add("user_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka delivers bytes; parse the JSON payload into typed columns
transactions = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# 5-minute tumbling window — flag users exceeding $10k
windowed = (
    transactions
    .withWatermark("event_time", "10 minutes")
    .groupBy(col("user_id"), window(col("event_time"), "5 minutes"))
    .agg(spark_sum("amount").alias("total_amount"))
    .filter(col("total_amount") > 10000)
)

query = (
    windowed
    # The Kafka sink requires a string/binary "value" column
    .selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "fraud_alerts")
    .option("checkpointLocation", "/checkpoints/fraud_alerts")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```
The Three Main Frameworks Compared
Apache Kafka
Kafka is not a stream processor — it's a distributed event log and message broker. It's the ingestion layer: events are produced to Kafka topics, and downstream consumers (Flink, Spark, your own apps) read from those topics.
What Kafka does well:
- Durable, replayable event storage (configurable retention)
- High-throughput ingestion (millions of events/second)
- Decoupling producers from consumers
- Exactly-once delivery semantics (with Kafka Streams or transactional producers)
What Kafka is not:
- A full stream processor (no windowed aggregations, no SQL joins out of the box without Kafka Streams/ksqlDB)
- A database (topic retention is time-based, not query-optimized)
Kafka is the foundation. Almost every production streaming architecture starts here.
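To make the exactly-once point concrete, here is a hedged sketch of the producer-side settings involved. These are standard Kafka producer property names, but the values (and the consumer note below) are illustrative, not a complete transactional setup:

```properties
# Illustrative producer settings for exactly-once writes (values are examples)

# Broker deduplicates retried sends
enable.idempotence=true
# A stable transactional.id enables atomic commits and fencing across restarts
transactional.id=orders-producer-1
# Wait for the full in-sync replica set before acknowledging
acks=all
```

Downstream consumers that should see only committed results additionally set `isolation.level=read_committed`.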
Spark Structured Streaming
Spark Structured Streaming is micro-batch under the hood: it processes data in small intervals set by a configurable trigger, typically in the range of hundreds of milliseconds to seconds. An experimental continuous processing mode targets millisecond-level latency, but it supports only a limited set of operations.
Strengths:
- Same DataFrame/SQL API as batch Spark — low learning curve for existing Spark teams
- Tight integration with Delta Lake (streaming writes, MERGE operations)
- Strong ecosystem: MLlib, GraphX, Delta Live Tables
- Runs well on Databricks, EMR, and GCP Dataproc
Weaknesses:
- True sub-second latency is hard to achieve reliably
- Stateful operations (joins, aggregations) require careful watermark tuning
- Operational complexity increases significantly vs. batch
Good fit when: your team already uses Spark for batch and you need "near real-time" (seconds to minutes latency is acceptable).
Apache Flink
Flink is a true stream processor — it processes events one at a time, not in micro-batches. It was built for streaming from day one.
Strengths:
- True event-time processing with low, predictable latency
- Sophisticated stateful operations (session windows, complex event processing)
- Exactly-once guarantees with minimal overhead
- SQL and Table APIs, plus Java/Python DataStream APIs
Weaknesses:
- Steeper learning curve than Spark
- Smaller ecosystem and community than Spark
- Operational complexity (JobManager, TaskManagers, checkpointing)
Good fit when: you need genuine low-latency (sub-second), complex event-driven logic, or you're building event-driven microservices.
```sql
-- Flink SQL — tumbling window aggregation (Flink SQL dialect)
CREATE TABLE transactions (
    transaction_id STRING,
    user_id STRING,
    amount DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'broker:9092',
    'format' = 'json'
);

SELECT
    user_id,
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    SUM(amount) AS total_amount,
    COUNT(*) AS tx_count
FROM transactions
GROUP BY user_id, TUMBLE(event_time, INTERVAL '5' MINUTE);
```
Framework Comparison
| Feature | Kafka | Spark Structured Streaming | Apache Flink |
|---|---|---|---|
| Primary role | Ingestion / broker | Near-real-time processing | True stream processing |
| Latency | N/A (transport) | 100ms–seconds | Milliseconds |
| Processing model | Event log | Micro-batch | Event-at-a-time |
| SQL support | ksqlDB (limited) | Full Spark SQL | Full Table API SQL |
| State management | Via Kafka Streams only | State store (HDFS- or RocksDB-backed) | Heap or RocksDB state backends |
| Learning curve | Medium | Low (if know Spark) | High |
| Cloud-native | Confluent, MSK | Databricks, EMR | Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), self-hosted |
When to Choose Streaming
Streaming is the right choice when:
- Latency < 1 minute is a business requirement, not a nice-to-have (fraud, alerting, live dashboards)
- Downstream systems need to react to events (trigger workflows, update user state, power microservices)
- Data volume makes full reprocessing impractical (you can't afford to re-read 1TB/day on every run)
- Stateful per-user tracking is required in real-time (session tracking, live recommendation updates)
When to Choose Batch
Batch is the right choice when:
- Reports and dashboards can tolerate T+1 or T+few-hours freshness (most business intelligence falls here)
- ML model training — most training jobs want full-dataset passes, not incremental event-by-event updates
- Complex transformations and joins that are genuinely easier to reason about over bounded datasets
- Cost is a constraint — always-on Flink/Kafka infrastructure is expensive
- Your team is small — streaming doubles operational overhead
The honest truth: most data engineering work is batch. The vast majority of dashboards, reports, and analytical pipelines don't need sub-minute freshness. Default to batch, add streaming only where the latency requirement genuinely demands it.
Lambda vs Kappa Architecture
Two patterns exist for combining both:
Lambda Architecture: Separate batch layer (accurate, delayed) + speed layer (approximate, fresh). Results are merged at query time. Operationally painful — you maintain two codebases.
Kappa Architecture: A single streaming pipeline that handles both real-time and historical reprocessing by replaying the event log. Simpler operationally, but requires a durable event log (Kafka with long retention or Apache Iceberg-backed storage).
Modern teams generally prefer Kappa when they commit to streaming — Lambda complexity rarely pays off.
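The core Kappa idea — one processing function serving both live traffic and historical rebuilds by replaying the log — fits in a few lines. This is a framework-free sketch; the function and the in-memory "log" are illustrative stand-ins for a real stream job and a Kafka topic:

```python
# Kappa in miniature: the same fold function handles live events and
# historical reprocessing, because the log can be replayed from offset 0.

def process(state, event):
    """Fold one event into running per-user totals."""
    user = event["user_id"]
    state[user] = state.get(user, 0.0) + event["amount"]
    return state

event_log = [  # durable, replayable log (stand-in for a Kafka topic)
    {"user_id": "a", "amount": 10.0},
    {"user_id": "b", "amount": 5.0},
    {"user_id": "a", "amount": 7.5},
]

# "Historical" reprocessing is just replaying the log through the same code
state = {}
for event in event_log:
    state = process(state, event)

print(state)  # {'a': 17.5, 'b': 5.0}
```

Contrast with Lambda, where the replay path would be a second codebase (a batch job) whose results must be reconciled with the speed layer.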
Common Pitfalls
1. Underestimating operational overhead. A Kafka + Flink stack requires monitoring consumer lag, managing checkpoints, handling rebalances, and debugging stateful failures. Budget for this.
2. Ignoring late data. Events arrive late. Always. Watermarks in Flink and Spark define how long you wait for late arrivals — setting them too tight drops data, too loose increases latency. There's no free lunch.
3. Overbuilding for future requirements. "We might need real-time eventually" is not a reason to build streaming today. Batch is dramatically easier to maintain and debug.
4. Conflating Kafka with stream processing. Kafka moves data; it doesn't transform it. If you're doing complex aggregations in Kafka consumers without Kafka Streams or ksqlDB, you're building a stream processor by accident.
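The watermark trade-off in pitfall 2 is easy to see in a framework-free sketch. Nothing here is Flink or Spark internals — the windowing and watermark logic are simplified for illustration:

```python
# Illustrative only: how a watermark decides whether a late event
# still lands in its 5-minute tumbling window.

WINDOW = 300  # tumbling window size in seconds

def assign_window(event_time):
    start = (event_time // WINDOW) * WINDOW
    return (start, start + WINDOW)

def run(events, watermark_delay):
    """events: (event_time, amount) pairs in arrival order."""
    windows, max_seen, dropped = {}, 0, []
    for event_time, amount in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - watermark_delay
        if event_time < watermark:        # window considered finalized
            dropped.append((event_time, amount))
            continue
        key = assign_window(event_time)
        windows[key] = windows.get(key, 0) + amount
    return windows, dropped

# An event stamped t=100 arrives after we have already seen t=700
events = [(100, 10), (700, 20), (100, 5)]

# Tight watermark (60s): the late event is dropped
_, dropped_tight = run(events, watermark_delay=60)
# Loose watermark (10 min): the late event is kept, at the cost of
# holding window state open longer before emitting results
_, dropped_loose = run(events, watermark_delay=600)

print(len(dropped_tight), len(dropped_loose))  # 1 0
```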
Practical Decision Framework
```
Do you need results in < 1 minute?
├── No  -> Use batch. Done.
└── Yes -> Do you need sub-second latency?
           ├── No  -> Spark Structured Streaming (easier for existing Spark teams)
           └── Yes -> Apache Flink (true streaming, more complexity)
```
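The tree above reduces to a tiny function. The labels are this article's categories, not an official taxonomy, and the thresholds are the same rough ones used throughout:

```python
# The decision framework as code: latency budget in, processing model out.

def choose_processing(latency_budget_seconds):
    if latency_budget_seconds >= 60:
        return "batch"                        # default: simpler, cheaper
    if latency_budget_seconds >= 1:
        return "spark-structured-streaming"   # near real-time, micro-batch
    return "flink"                            # genuine sub-second needs

print(choose_processing(3600))  # batch
print(choose_processing(5))     # spark-structured-streaming
print(choose_processing(0.2))   # flink
```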
If you're exploring streaming event data or need to run ad-hoc queries against ingested Kafka output, Harbinger Explorer lets you query result datasets directly in the browser using DuckDB WASM — useful for validating window aggregations or spot-checking event distributions without spinning up another Spark cluster. The AI-powered natural language query layer is particularly handy for exploratory analysis of streaming outputs landing in object storage.
The Bottom Line
Streaming is powerful and sometimes genuinely necessary. But it comes with real operational costs that batch doesn't. Know your latency requirement before choosing your architecture — not after.
For most teams: start with batch, monitor your SLAs, and introduce streaming surgically where the latency gap actually hurts.
Next step: If you're running batch pipelines today, read Airflow vs Dagster vs Prefect to make sure your orchestration layer can handle the transition.
Continue Reading
- ETL vs ELT: Which Approach Fits Your Stack?
- Delta Live Tables vs Classic ETL
- Data Pipeline Monitoring: What to Track