Streaming vs Batch Processing: When to Use Which
Your fraud detection model fires 90 seconds after the transaction completes. By then, the money is gone. That's the wrong processing model for the wrong use case — and it's a mistake that happens more often than you'd think.
Choosing between streaming and batch processing isn't about which is technically superior. It's about matching the processing model to the latency requirement of your use case. Here's a clear framework for making that call.
TL;DR
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high | Moderate (per-event overhead) |
| Complexity | Lower | Significantly higher |
| Cost | Lower (bursty compute) | Higher (always-on infra) |
| Fault tolerance | Rerun the job | Checkpointing, exactly-once semantics |
| Best for | Reports, ETL, ML training | Fraud detection, alerts, live dashboards |
What Is Batch Processing?
Batch processing collects data over a period of time — an hour, a day, a week — and processes it all at once. The paradigm is simple: read a bounded dataset, transform it, write results.
Most data warehouses are built on batch. Your nightly dbt run, your weekly reporting pipeline, your monthly ML retraining job — these are all batch.
```python
# PySpark batch job example — daily aggregation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as spark_sum, to_date

spark = SparkSession.builder.appName("daily_orders").getOrCreate()

df = spark.read.parquet("s3://datalake/orders/date=2024-01-15/")

daily_summary = (
    df
    .groupBy(to_date(col("created_at")).alias("order_date"), col("region"))
    .agg(
        spark_sum("amount").alias("total_revenue"),
        count("*").alias("order_count"),
    )
)

daily_summary.write.mode("overwrite").parquet("s3://datalake/daily_order_summary/date=2024-01-15/")
```
Batch is the right default. It's operationally simpler, cheaper to run, and easier to debug. The question is whether your use case can tolerate the latency.
What Is Streaming Processing?
Streaming processing treats data as an unbounded, continuous flow. Each event is processed as it arrives — or within a configurable time window.
The latency profile is fundamentally different: instead of "we'll know in 6 hours," you're in the range of milliseconds to a few seconds.
```python
# PySpark Structured Streaming — real-time Kafka consumer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as spark_sum, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("fraud_alerts").getOrCreate()

schema = (
    StructType()
    .add("transaction_id", StringType())
    .add("user_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka delivers bytes; parse the JSON payload into typed columns
transactions = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# 5-minute tumbling window — flag users exceeding $10k
windowed = (
    transactions
    .withWatermark("event_time", "10 minutes")
    .groupBy(col("user_id"), window(col("event_time"), "5 minutes"))
    .agg(spark_sum("amount").alias("total_amount"))
    .filter(col("total_amount") > 10000)
)

query = (
    windowed
    # The Kafka sink requires a string/binary "value" column
    .selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "fraud_alerts")
    .option("checkpointLocation", "/checkpoints/fraud_alerts")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```
The Three Main Frameworks Compared
Apache Kafka
Kafka is not a stream processor — it's a distributed event log and message broker. It's the ingestion layer: events are produced to Kafka topics, and downstream consumers (Flink, Spark, your own apps) read from those topics.
What Kafka does well:
- Durable, replayable event storage (configurable retention)
- High-throughput ingestion (millions of events/second)
- Decoupling producers from consumers
- Exactly-once delivery semantics (with Kafka Streams or transactional producers)
What Kafka is not:
- A full stream processor (no windowed aggregations, no SQL joins out of the box without Kafka Streams/ksqlDB)
- A database (topic retention is time-based, not query-optimized)
Kafka is the foundation. Almost every production streaming architecture starts here.
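To make the exactly-once point concrete, here is a hedged sketch of the producer-side settings involved. These are standard Kafka producer property names, but the values (and the consumer note below) are illustrative, not a complete transactional setup:

```properties
# Illustrative producer settings for exactly-once writes (values are examples)

# Broker deduplicates retried sends
enable.idempotence=true
# A stable transactional.id enables atomic commits and fencing across restarts
transactional.id=orders-producer-1
# Wait for the full in-sync replica set before acknowledging
acks=all
```

Downstream consumers that should see only committed results additionally set `isolation.level=read_committed`.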
Spark Structured Streaming
Spark Structured Streaming is micro-batch under the hood: it processes data in small intervals set by a configurable trigger, typically in the range of hundreds of milliseconds to seconds. An experimental continuous processing mode targets millisecond-level latency, but it supports only a limited set of operations.
Strengths:
- Same DataFrame/SQL API as batch Spark — low learning curve for existing Spark teams
- Tight integration with Delta Lake (streaming writes, MERGE operations)
- Strong ecosystem: MLlib, GraphX, Delta Live Tables
- Runs well on Databricks, EMR, and GCP Dataproc
Weaknesses:
- True sub-second latency is hard to achieve reliably
- Stateful operations (joins, aggregations) require careful watermark tuning
- Operational complexity increases significantly vs. batch
Good fit when: your team already uses Spark for batch and you need "near real-time" (seconds to minutes latency is acceptable).
Apache Flink
Flink is a true stream processor — it processes events one at a time, not in micro-batches. It was built for streaming from day one.
Strengths:
- True event-time processing with low, predictable latency
- Sophisticated stateful operations (session windows, complex event processing)
- Exactly-once guarantees with minimal overhead
- SQL and Table APIs, plus Java/Python DataStream APIs
Weaknesses:
- Steeper learning curve than Spark
- Smaller ecosystem and community than Spark
- Operational complexity (JobManager, TaskManagers, checkpointing)
Good fit when: you need genuine low-latency (sub-second), complex event-driven logic, or you're building event-driven microservices.
```sql
-- Flink SQL — tumbling window aggregation (Flink SQL dialect)
CREATE TABLE transactions (
    transaction_id STRING,
    user_id STRING,
    amount DOUBLE,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'transactions',
    'properties.bootstrap.servers' = 'broker:9092',
    'format' = 'json'
);

SELECT
    user_id,
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    SUM(amount) AS total_amount,
    COUNT(*) AS tx_count
FROM transactions
GROUP BY user_id, TUMBLE(event_time, INTERVAL '5' MINUTE);
```
Framework Comparison
| Feature | Kafka | Spark Structured Streaming | Apache Flink |
|---|---|---|---|
| Primary role | Ingestion / broker | Near-real-time processing | True stream processing |
| Latency | N/A (transport) | 100ms–seconds | Milliseconds |
| Processing model | Event log | Micro-batch | Event-at-a-time |
| SQL support | ksqlDB (limited) | Full Spark SQL | Full Table API SQL |
| State management | Via Kafka Streams only | State store (HDFS- or RocksDB-backed) | Heap or RocksDB state backends |
| Learning curve | Medium | Low (if know Spark) | High |
| Cloud-native | Confluent, MSK | Databricks, EMR | Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), self-hosted |
When to Choose Streaming
Streaming is the right choice when:
- Latency < 1 minute is a business requirement, not a nice-to-have (fraud, alerting, live dashboards)
- Downstream systems need to react to events (trigger workflows, update user state, power microservices)
- Data volume makes full reprocessing impractical (you can't afford to re-read 1TB/day on every run)
- Stateful per-user tracking is required in real-time (session tracking, live recommendation updates)
When to Choose Batch
Batch is the right choice when:
- Reports and dashboards can tolerate T+1 or T+few-hours freshness (most business intelligence falls here)
- ML model training — most training jobs want full-dataset passes, not incremental event-by-event updates
- Complex transformations and joins that are genuinely easier to reason about over bounded datasets
- Cost is a constraint — always-on Flink/Kafka infrastructure is expensive
- Your team is small — streaming doubles operational overhead
The honest truth: most data engineering work is batch. The vast majority of dashboards, reports, and analytical pipelines don't need sub-minute freshness. Default to batch, add streaming only where the latency requirement genuinely demands it.
Lambda vs Kappa Architecture
Two patterns exist for combining both:
Lambda Architecture: Separate batch layer (accurate, delayed) + speed layer (approximate, fresh). Results are merged at query time. Operationally painful — you maintain two codebases.
Kappa Architecture: A single streaming pipeline that handles both real-time and historical reprocessing by replaying the event log. Simpler operationally, but requires a durable event log (Kafka with long retention or Apache Iceberg-backed storage).
Modern teams generally prefer Kappa when they commit to streaming — Lambda complexity rarely pays off.
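The core Kappa idea — one processing function serving both live traffic and historical rebuilds by replaying the log — fits in a few lines. This is a framework-free sketch; the function and the in-memory "log" are illustrative stand-ins for a real stream job and a Kafka topic:

```python
# Kappa in miniature: the same fold function handles live events and
# historical reprocessing, because the log can be replayed from offset 0.

def process(state, event):
    """Fold one event into running per-user totals."""
    user = event["user_id"]
    state[user] = state.get(user, 0.0) + event["amount"]
    return state

event_log = [  # durable, replayable log (stand-in for a Kafka topic)
    {"user_id": "a", "amount": 10.0},
    {"user_id": "b", "amount": 5.0},
    {"user_id": "a", "amount": 7.5},
]

# "Historical" reprocessing is just replaying the log through the same code
state = {}
for event in event_log:
    state = process(state, event)

print(state)  # {'a': 17.5, 'b': 5.0}
```

Contrast with Lambda, where the replay path would be a second codebase (a batch job) whose results must be reconciled with the speed layer.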
Common Pitfalls
1. Underestimating operational overhead. A Kafka + Flink stack requires monitoring consumer lag, managing checkpoints, handling rebalances, and debugging stateful failures. Budget for this.
2. Ignoring late data. Events arrive late. Always. Watermarks in Flink and Spark define how long you wait for late arrivals — setting them too tight drops data, too loose increases latency. There's no free lunch.
3. Overbuilding for future requirements. "We might need real-time eventually" is not a reason to build streaming today. Batch is dramatically easier to maintain and debug.
4. Conflating Kafka with stream processing. Kafka moves data; it doesn't transform it. If you're doing complex aggregations in Kafka consumers without Kafka Streams or ksqlDB, you're building a stream processor by accident.
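The watermark trade-off in pitfall 2 is easy to see in a framework-free sketch. Nothing here is Flink or Spark internals — the windowing and watermark logic are simplified for illustration:

```python
# Illustrative only: how a watermark decides whether a late event
# still lands in its 5-minute tumbling window.

WINDOW = 300  # tumbling window size in seconds

def assign_window(event_time):
    start = (event_time // WINDOW) * WINDOW
    return (start, start + WINDOW)

def run(events, watermark_delay):
    """events: (event_time, amount) pairs in arrival order."""
    windows, max_seen, dropped = {}, 0, []
    for event_time, amount in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - watermark_delay
        if event_time < watermark:        # window considered finalized
            dropped.append((event_time, amount))
            continue
        key = assign_window(event_time)
        windows[key] = windows.get(key, 0) + amount
    return windows, dropped

# An event stamped t=100 arrives after we have already seen t=700
events = [(100, 10), (700, 20), (100, 5)]

# Tight watermark (60s): the late event is dropped
_, dropped_tight = run(events, watermark_delay=60)
# Loose watermark (10 min): the late event is kept, at the cost of
# holding window state open longer before emitting results
_, dropped_loose = run(events, watermark_delay=600)

print(len(dropped_tight), len(dropped_loose))  # 1 0
```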
Practical Decision Framework
```
Do you need results in < 1 minute?
├── No  -> Use batch. Done.
└── Yes -> Do you need sub-second latency?
           ├── No  -> Spark Structured Streaming (easier for existing Spark teams)
           └── Yes -> Apache Flink (true streaming, more complexity)
```
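The tree above reduces to a tiny function. The labels are this article's categories, not an official taxonomy, and the thresholds are the same rough ones used throughout:

```python
# The decision framework as code: latency budget in, processing model out.

def choose_processing(latency_budget_seconds):
    if latency_budget_seconds >= 60:
        return "batch"                        # default: simpler, cheaper
    if latency_budget_seconds >= 1:
        return "spark-structured-streaming"   # near real-time, micro-batch
    return "flink"                            # genuine sub-second needs

print(choose_processing(3600))  # batch
print(choose_processing(5))     # spark-structured-streaming
print(choose_processing(0.2))   # flink
```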
If you're exploring streaming event data or need to run ad-hoc queries against ingested Kafka output, Harbinger Explorer lets you query result datasets directly in the browser using DuckDB WASM — useful for validating window aggregations or spot-checking event distributions without spinning up another Spark cluster. The AI-powered natural language query layer is particularly handy for exploratory analysis of streaming outputs landing in object storage.
The Bottom Line
Streaming is powerful and sometimes genuinely necessary. But it comes with real operational costs that batch doesn't. Know your latency requirement before choosing your architecture — not after.
For most teams: start with batch, monitor your SLAs, and introduce streaming surgically where the latency gap actually hurts.
Next step: If you're running batch pipelines today, read Airflow vs Dagster vs Prefect to make sure your orchestration layer can handle the transition.
Continue Reading
- ETL vs ELT: Which Approach Fits Your Stack?
- Delta Live Tables vs Classic ETL
- Data Pipeline Monitoring: What to Track