Harbinger Explorer

Engineering

Event-Driven Data Architecture with Kafka and CQRS

11 min read·Tags: event-driven, kafka, event-sourcing, cqrs, streaming, data-architecture, real-time

Batch jobs run at midnight and produce data that's 12 hours stale by the time anyone reads it. Meanwhile, your customers are making decisions — buying, churning, clicking — in real time. Event-driven data architecture is the pattern that bridges this gap by making data flow as events happen, not on a schedule.

What Is Event-Driven Data Architecture?

In an event-driven architecture, every meaningful state change in your system is recorded as an immutable event: OrderPlaced, UserSignedUp, PaymentFailed, InventoryUpdated. These events flow through a message broker (most commonly Apache Kafka) and are consumed by multiple downstream systems independently.

For data platforms specifically, this means:

  • Your warehouse can consume events and update tables in near-real-time
  • ML models can react to events without polling a database
  • Operational systems and analytics systems share the same event stream
  • Historical state can be reconstructed by replaying the event log

Application Services
       │ emits events
       ▼
  Apache Kafka (Event Log)
       │
  ┌────┴─────────────────────────────────┐
  │             │                │       │
  ▼             ▼                ▼       ▼
Warehouse   ML Feature        Search   Notification
(Flink/     Store             Index    Service
 Spark)     (Redis/Feast)     (ES)     (SendGrid)

Core Concepts

Event Sourcing

Event Sourcing means your application's state is derived from an append-only log of events — not from a mutable database table. Instead of updating a row (UPDATE orders SET status = 'shipped'), you append an event (OrderShipped { order_id, shipped_at, carrier }).

The current state of any entity is computed by replaying all events for that entity.

Benefits for data teams:

  • Perfect audit trail — every state change is recorded with timestamp and actor
  • Time travel is natural — replay events up to any point in time
  • Schema evolution is explicit — new event types don't break old consumers
  • Rebuilding derived tables from scratch is always possible

The cost: Replaying millions of events to get current state is slow. You need snapshots (periodic state captures) to make reads practical.
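The replay-plus-snapshot idea needs no infrastructure to demonstrate. This is a minimal in-memory sketch; the event shapes and `OrderState` fields are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderState:
    status: str = "new"
    events_applied: int = 0

def apply_event(state: OrderState, event: dict) -> OrderState:
    # Each event type is a pure state transition; unknown types leave status unchanged
    if event["type"] == "OrderPlaced":
        state.status = "placed"
    elif event["type"] == "OrderShipped":
        state.status = "shipped"
    state.events_applied += 1
    return state

def rebuild(events: list, snapshot: Optional[OrderState] = None,
            snapshot_offset: int = 0) -> OrderState:
    """Replay from a snapshot when available, otherwise from the start of the log."""
    state = snapshot or OrderState()
    for event in events[snapshot_offset:]:
        state = apply_event(state, event)
    return state

log = [{"type": "OrderPlaced"}, {"type": "OrderShipped"}]
full = rebuild(log)  # replay the whole log
fast = rebuild(log, snapshot=OrderState(status="placed", events_applied=1),
               snapshot_offset=1)  # replay only the tail after the snapshot
assert full.status == fast.status == "shipped"
```

Both paths end in the same state; the snapshot version just skips the events already captured, which is what makes reads practical at millions of events.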

CQRS — Command Query Responsibility Segregation

CQRS separates the write path (commands that change state, emitting events) from the read path (queries that read from pre-built read models).

Command Side                    Query Side
(Write)                         (Read)

User Action                     User Query
    │                               │
    ▼                               ▼
Command Handler              Read Model / View
    │                         (denormalized table,
    │ emits event              optimized for query)
    ▼                               ▲
Event Store (Kafka)                 │
    │                               │
    └────────── Projector ──────────┘
                (builds and updates
                 read models from events)

For data platforms, CQRS maps naturally to the Medallion Architecture:

  • Bronze layer = raw event log (event store)
  • Silver layer = cleansed, joined projections (read models)
  • Gold layer = business-ready aggregations (denormalized views)
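A projector is essentially a fold over the event stream into a query-optimized structure. This in-memory sketch (no Kafka; the aggregation is a hypothetical "total spend per customer" read model) shows the shape, using the event envelope from the schema section below:

```python
from collections import defaultdict

def project_customer_totals(events: list) -> dict:
    """Build a denormalized read model: total spend in cents per customer."""
    read_model = defaultdict(int)
    for event in events:
        if event["event_type"] == "order.placed":
            payload = event["payload"]
            read_model[payload["customer_id"]] += payload["total_cents"]
    return dict(read_model)

events = [
    {"event_type": "order.placed",
     "payload": {"customer_id": "cust_xyz789", "total_cents": 3998}},
    {"event_type": "order.placed",
     "payload": {"customer_id": "cust_xyz789", "total_cents": 1999}},
]
assert project_customer_totals(events) == {"cust_xyz789": 5997}
```

In production the projector runs continuously (Flink, Spark, or a plain consumer) and writes to a table, but the logic is the same fold.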

Apache Kafka as the Backbone

Kafka is the most common event broker for data platforms at scale. Key properties that make it useful:

Property                What It Means for Data Teams
Persistent log          Consumers can replay from any offset, enabling backfill
Consumer groups         Multiple teams can consume the same events independently
Partitioning            Parallelism by key (e.g., user_id) for ordered processing
Retention               Configure retention by time or size — days to forever
Schema Registry         Enforce Avro/Protobuf schemas on producers
Exactly-once semantics  Configurable guarantees for critical financial pipelines
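The first two properties, a persistent log plus independent consumer groups, can be mimicked in a few lines. This toy stand-in (not the Kafka API) shows why one team can backfill from offset 0 while another keeps tailing the head:

```python
class EventLog:
    """Toy stand-in for a single Kafka partition: an append-only list
    plus a per-consumer-group offset into it."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, group: str, max_records: int = 10) -> list:
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int):
        # Replay/backfill: rewind one group without affecting any other
        self.offsets[group] = offset

log = EventLog()
for i in range(3):
    log.produce(f"evt_{i}")

assert log.consume("warehouse") == ["evt_0", "evt_1", "evt_2"]
assert log.consume("ml-features") == ["evt_0", "evt_1", "evt_2"]  # independent group
log.seek("warehouse", 0)                                          # backfill
assert log.consume("warehouse") == ["evt_0", "evt_1", "evt_2"]
```

Real Kafka adds partitioning, durability, and group rebalancing on top, but the offset-per-group model is exactly this.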

Building an Event-Driven Data Pipeline

1. Event Schema Design

Design events with data consumers in mind. Bad event schemas are the #1 source of pain in event-driven systems.

// Good: self-describing, versioned, no domain-specific abbreviations
{
  "event_type": "order.placed",
  "event_id": "evt_01HXK3M7N8P2Q4R5T6V7W8X9Y0",
  "schema_version": "2.1",
  "occurred_at": "2026-04-03T09:15:00Z",
  "producer": "order-service",
  "payload": {
    "order_id": "ord_abc123",
    "customer_id": "cust_xyz789",
    "items": [
      {"sku": "WIDGET-42", "quantity": 2, "unit_price_cents": 1999}
    ],
    "total_cents": 3998,
    "currency": "EUR"
  }
}

// Bad: abbreviated, no version, no schema info
{
  "t": "ord_plcd",
  "oid": "abc123",
  "cid": "xyz",
  "amt": 39.98
}
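A shared envelope helper keeps every producer emitting the same self-describing shape. The field names follow the good example above; the UUID-based ID scheme is an illustrative choice, not a requirement:

```python
import uuid
from datetime import datetime, timezone

SCHEMA_VERSION = "2.1"

def make_event(event_type: str, producer: str, payload: dict) -> dict:
    """Wrap a domain payload in the standard, self-describing envelope."""
    return {
        "event_type": event_type,
        "event_id": f"evt_{uuid.uuid4().hex}",
        "schema_version": SCHEMA_VERSION,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "producer": producer,
        "payload": payload,
    }

event = make_event("order.placed", "order-service",
                   {"order_id": "ord_abc123", "total_cents": 3998, "currency": "EUR"})
assert event["schema_version"] == "2.1"
assert event["payload"]["order_id"] == "ord_abc123"
```

Centralizing the envelope in one function (or better, enforcing it via the Schema Registry) is what prevents the "bad" variant from ever reaching a topic.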

2. Kafka Producer (Python)

# Python — Kafka producer with Avro schema
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField
from datetime import datetime, timezone

KAFKA_BOOTSTRAP = "kafka:9092"
SCHEMA_REGISTRY_URL = "http://schema-registry:8081"

ORDER_PLACED_SCHEMA = '''
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.harbinger.orders",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "occurred_at", "type": "string"}
  ]
}
'''

schema_registry = SchemaRegistryClient({"url": SCHEMA_REGISTRY_URL})
avro_serializer = AvroSerializer(schema_registry, ORDER_PLACED_SCHEMA)
producer = Producer({"bootstrap.servers": KAFKA_BOOTSTRAP})

def publish_order_placed(order_id: str, customer_id: str, total_cents: int):
    now = datetime.now(timezone.utc)
    event = {
        "event_id": f"evt_{order_id}_{int(now.timestamp())}",
        "order_id": order_id,
        "customer_id": customer_id,
        "total_cents": total_cents,
        "occurred_at": now.isoformat(),
    }
    producer.produce(
        topic="orders.placed",
        key=customer_id,          # partition by customer for ordering
        value=avro_serializer(event, SerializationContext("orders.placed", MessageField.VALUE)),
    )
    producer.flush()  # blocking flush per call keeps the example simple; batch flushes in production

3. Stream Processing with Flink / Spark Structured Streaming

# PySpark Structured Streaming — consume Kafka, write to Delta Lake
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (
    SparkSession.builder
    .appName("OrderEventsConsumer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("total_cents", LongType()),
    StructField("occurred_at", StringType()),
])

raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.placed")
    .option("startingOffsets", "latest")
    .load()
)

# Note: this parses JSON-encoded values. If the producer uses Avro via the
# Schema Registry (as above), decode with from_avro instead of from_json.
parsed = (
    raw_stream
    .select(from_json(col("value").cast("string"), event_schema).alias("data"))
    .select("data.*")
    .withColumn("occurred_at", to_timestamp("occurred_at"))
)

query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://checkpoints/orders-placed/")
    .start("s3://lakehouse/silver/orders_placed/")
)

query.awaitTermination()

Event-Driven Patterns for Data Platforms

Outbox Pattern

The Outbox pattern solves a common reliability problem: you write to your database and want to publish an event to Kafka atomically. If you write to both independently, one can succeed while the other fails.

Solution: Write the event to an outbox table in the same database transaction as your business write. A separate process (Debezium + Change Data Capture) reads the outbox and publishes to Kafka.
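The atomic write can be demonstrated with SQLite standing in for the business database; table and column names here are illustrative, and the relay callback stands in for Debezium:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "topic TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id: str, total_cents: int):
    # Business write and outbox write commit in ONE transaction:
    # either both are durable, or neither is.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.placed",
             json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )

def relay_outbox(publish):
    """Debezium-style relay: read unpublished rows, publish, mark as done."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

place_order("ord_abc123", 3998)
sent = []
relay_outbox(lambda topic, payload: sent.append((topic, payload)))
assert sent[0][0] == "orders.placed"
```

If the process crashes after the transaction but before the relay runs, the event is still in the outbox and gets published on the next pass, giving at-least-once delivery without a distributed transaction.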

Event-Driven Feature Store

ML features derived from streaming events:

  1. UserSignedUp event → Kafka → Flink aggregates 7-day session counts → Redis feature store
  2. Model inference reads from Redis (low-latency, fresh features)
  3. Same events land in Delta Lake for batch training

This pattern avoids training/serving skew — both paths read from the same event source.

Saga Pattern for Distributed Transactions

When an operation spans multiple services (create order → reserve inventory → charge payment), a Saga coordinates via events rather than a distributed transaction:

  • Each service publishes a success or failure event
  • Compensating events roll back previous steps on failure
  • The data platform observes all saga events for a complete audit trail
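The coordination logic can be sketched as a loop that records an event per step and runs compensations in reverse on failure. The step names mirror the order example above; the `execute` callback stands in for real service calls:

```python
def run_saga(steps, execute):
    """Run saga steps in order; on failure, compensate completed steps in reverse.

    `steps` is a list of (name, action, compensation) tuples and `execute`
    returns True on success. Every outcome is recorded as an event, which is
    what the data platform observes for the audit trail.
    """
    events, completed = [], []
    for name, action, compensation in steps:
        if execute(action):
            events.append(f"{name}.succeeded")
            completed.append((name, compensation))
        else:
            events.append(f"{name}.failed")
            for done_name, comp in reversed(completed):
                execute(comp)
                events.append(f"{done_name}.compensated")
            break
    return events

steps = [
    ("order",     "create_order",      "cancel_order"),
    ("inventory", "reserve_inventory", "release_inventory"),
    ("payment",   "charge_payment",    "refund_payment"),
]
# Simulate the payment step failing
events = run_saga(steps, execute=lambda action: action != "charge_payment")
assert events == ["order.succeeded", "inventory.succeeded", "payment.failed",
                  "inventory.compensated", "order.compensated"]
```

In a real system each step is a separate service reacting to the previous event rather than a central loop, but the event sequence, and therefore what lands in the warehouse, looks the same.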

Trade-Offs to Be Honest About

Event-driven is not universally better. The operational complexity is real.

Consideration         Event-Driven               Batch
Latency               Seconds                    Minutes to hours
Complexity            High                       Low
Debugging             Hard (distributed)         Easier (logs)
Ordering guarantees   Per-partition only         Natural
Exactly-once          Configurable, hard         Easier
Schema changes        Requires coordination      Column add is easy
Team skill required   Kafka expertise            SQL/Python
Cost                  Kafka cluster overhead     Compute on schedule

Start with batch if your team is small and your business doesn't need real-time. Adding event-driven complexity for a batch use case is a form of over-engineering. See Streaming vs Batch Processing for the decision framework.

When to Use Event-Driven Architecture for Data

Strong signals:

  • You need sub-minute data freshness in dashboards or operational systems
  • Multiple systems need the same events independently (no tight coupling)
  • You're building ML features that need fresh behavioral signals
  • Audit trail and replay capability are requirements

Weak signals (consider batch instead):

  • Daily reporting is sufficient
  • Single downstream consumer
  • Team has no Kafka experience
  • Data volumes are small enough that scheduled queries work fine

Wrapping Up

Event-driven data architecture is powerful when freshness matters and multiple consumers need the same data stream. Event Sourcing gives you a complete, replayable history. CQRS separates write optimization from read optimization. Kafka provides the durable, scalable backbone.

The pattern has real costs: Kafka clusters, schema management, exactly-once semantics, and distributed debugging. Adopt it where the business genuinely needs real-time, not because it sounds modern.

Next step: Identify one batch pipeline where the 24-hour delay causes a business problem. That's your first candidate for an event-driven migration.

