Event Streaming Architecture in the Cloud: A Platform Engineer's Guide
Event streaming is no longer a niche capability reserved for high-frequency trading platforms or social media giants. Today, it is the backbone of modern cloud-native data platforms — enabling real-time analytics, microservice decoupling, change data capture, and operational intelligence at scale. This guide walks through the architectural decisions, deployment patterns, and operational realities of building event streaming infrastructure on the cloud.
Why Event Streaming?
Traditional batch ETL pipelines introduce latency by design. Data collected throughout the day is processed overnight and available in dashboards by morning. For most use cases in the 2010s, this was acceptable. In 2024, the landscape has fundamentally shifted:
- Real-time fraud detection needs sub-second decisioning
- Operational monitoring must surface anomalies before they become incidents
- Microservice choreography requires decoupled, asynchronous communication
- Event sourcing demands an immutable audit log with replay capability
Event streaming solves all of these by treating data as a continuous, ordered log of facts rather than rows in a table.
Core Concepts
Before choosing a platform, it's essential to align on terminology:
| Concept | Description |
|---|---|
| Topic / Stream | Named, append-only log of messages |
| Partition | Ordered sub-unit of a topic enabling parallelism |
| Consumer Group | Set of consumers that collectively process a topic |
| Offset | Position of a message within a partition |
| Schema Registry | Central store for message schemas (Avro, Protobuf, JSON Schema) |
| Exactly-once semantics | Guarantee that each message is processed exactly one time |
| Compaction | Retention strategy that keeps only the latest value per key |
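These concepts are easier to internalise with a toy model. The sketch below is plain Python with no Kafka client, purely illustrative: it models a topic as a set of partitions, routes messages to a partition by key hash, assigns sequential offsets, and applies compaction by keeping only the latest value per key.

```python
class ToyTopic:
    """Illustrative in-memory model of a partitioned, append-only topic."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The same key always hashes to the same partition,
        # which is what preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def compacted(self, partition):
        # Compaction keeps only the latest value per key within a partition.
        latest = {}
        for key, value in self.partitions[partition]:
            latest[key] = value
        return latest

topic = ToyTopic()
p, offset0 = topic.produce("user-42", {"status": "active"})
_, offset1 = topic.produce("user-42", {"status": "suspended"})
assert offset1 == offset0 + 1  # offsets are sequential within a partition
assert topic.compacted(p) == {"user-42": {"status": "suspended"}}
```

Real brokers add replication, durability, and consumer coordination on top, but the log-plus-offsets mental model is the same.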
Cloud Platform Comparison
Three major managed event streaming services dominate cloud deployments:
| Feature | Amazon Kinesis | Google Pub/Sub | Azure Event Hubs |
|---|---|---|---|
| Native SDK | AWS SDK | Google Cloud SDK | Azure SDK / AMQP |
| Kafka compatibility | Via MSK | Via Confluent on GCP | Kafka-compatible endpoint |
| Retention | 1–365 days | Up to 31 days (7-day default) | 1–90 days |
| Partitioning | Shards (fixed) | Automatic | Partitions (fixed) |
| Schema registry | Glue Schema Registry | Confluent on GCP | Schema Registry (EventHub) |
| Delivery guarantee (default) | At-least-once | At-least-once | At-least-once |
| Managed Kafka | MSK (fully managed) | N/A (partner) | No |
| Serverless option | Kinesis On-Demand | Pub/Sub (always serverless) | Event Hubs Serverless |
For teams with Kafka expertise and multi-cloud ambitions, Amazon MSK or a self-managed Kafka cluster on Kubernetes (via Strimzi) provides the most portable solution.
Reference Architecture
A production-grade event streaming topology for a geopolitical intelligence platform, the kind of real-time data architecture powering tools like Harbinger Explorer, is organised as a raw-to-enriched pipeline:
This architecture separates concerns cleanly:
- Raw topics receive unvalidated, high-volume ingestion
- Stream processors enrich, filter, and validate in flight
- Downstream consumers materialise views appropriate to each service
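The raw-to-enriched flow can be sketched as a small validate/enrich pipeline. This is plain Python with hypothetical field names (country_code, severity), not a real Kafka Streams topology:

```python
RAW_EVENTS = [
    {"event_id": "e1", "country_code": "UA", "severity": 0.9},
    {"event_id": "e2"},                                  # invalid: missing fields
    {"event_id": "e3", "country_code": "FR", "severity": 0.2},
]

def validate(event):
    # Gate unvalidated ingestion: require the fields downstream consumers rely on.
    return "country_code" in event and "severity" in event

def enrich(event):
    # Add derived attributes in flight; a real processor might also
    # join against reference data here.
    return {**event, "is_critical": event["severity"] >= 0.8}

def process(raw_stream):
    for event in raw_stream:
        if validate(event):
            yield enrich(event)
        # invalid events would be routed to a dead letter topic instead

enriched = list(process(RAW_EVENTS))
assert [e["event_id"] for e in enriched] == ["e1", "e3"]
assert enriched[0]["is_critical"] is True
```

The generator shape mirrors the streaming model: events flow through one at a time, and the raw topic is never mutated.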
Provisioning with Terraform
Here's a minimal Terraform configuration for an MSK cluster on AWS:
```hcl
resource "aws_msk_cluster" "harbinger_events" {
  cluster_name           = "harbinger-events"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 1000
      }
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.main.arn
    revision = aws_msk_configuration.main.latest_revision
  }

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

resource "aws_msk_configuration" "main" {
  name           = "harbinger-kafka-config"
  kafka_versions = ["3.5.1"]

  server_properties = <<-EOF
    auto.create.topics.enable=false
    default.replication.factor=3
    min.insync.replicas=2
    num.partitions=12
    log.retention.hours=168
    compression.type=lz4
  EOF
}
```
Schema Registry and Schema Evolution
Untyped JSON events are a production disaster waiting to happen. Every schema change becomes a breaking change, and you have no way to query historical messages reliably.
Use Apache Avro with Confluent Schema Registry for:
- Backward compatibility: new consumers can read old messages
- Forward compatibility: old consumers can read new messages
- Schema-on-read: storage is compact binary; schema is resolved at read time
Example Avro schema for a geopolitical event:
```json
{
  "type": "record",
  "name": "GeopoliticalEvent",
  "namespace": "com.harbinger.events",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "country_code", "type": "string"},
    {"name": "event_type", "type": {"type": "enum", "name": "EventType", "symbols": ["CONFLICT", "ELECTION", "SANCTION", "PROTEST", "NATURAL_DISASTER"]}},
    {"name": "severity", "type": "float"},
    {"name": "source_url", "type": ["null", "string"], "default": null}
  ]
}
```

Note that the `logicalType` annotation must sit inside the type object; as a sibling of `"type": "long"` it is silently ignored by Avro.
Register it via CLI:
```bash
curl -X POST http://schema-registry:8081/subjects/geopolitical-events-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"GeopoliticalEvent\",...}"}'
```

Note that the schema is passed as a JSON string, so its inner quotes must be escaped.
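Backward compatibility rests on schema resolution: a reader fills in defaults for fields the writer did not supply. The sketch below imitates that rule in plain Python on dicts; it is not the Avro library, and the `confidence` field is a hypothetical addition from a newer schema version:

```python
_NO_DEFAULT = object()  # sentinel: field has no default, so it is required

READER_FIELDS = [
    ("event_id", _NO_DEFAULT),
    ("country_code", _NO_DEFAULT),
    ("source_url", None),   # nullable with default null, as in the schema above
    ("confidence", 1.0),    # hypothetical field added in a newer schema version
]

def resolve(record):
    """Apply reader-side defaults, mimicking Avro schema resolution."""
    out = {}
    for name, default in READER_FIELDS:
        if name in record:
            out[name] = record[name]
        elif default is not _NO_DEFAULT:
            out[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

# A message written before 'confidence' existed still resolves cleanly.
old_message = {"event_id": "e1", "country_code": "UA"}
new_view = resolve(old_message)
assert new_view["confidence"] == 1.0
assert new_view["source_url"] is None
```

This is why every newly added field should carry a default: without one, old messages become unreadable under the new schema.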
Consumer Patterns
Fan-Out
One topic, multiple independent consumer groups. Each group maintains its own offset. Use for:
- Parallel processing pipelines (analytics vs. alerting)
- Different SLAs (real-time vs. batch)
CQRS Read Models
Consume an event stream to materialise query-optimised read models in separate databases. This decouples write throughput from read performance.
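A read model materialiser can be as simple as folding the stream into a query-optimised structure. A minimal sketch, with an in-memory dict standing in for a real database and the event shape assumed:

```python
from collections import defaultdict

def materialise(events):
    """Fold an event stream into a read model: latest severity per country."""
    read_model = {}
    counts = defaultdict(int)
    for e in events:
        read_model[e["country_code"]] = e["severity"]  # upsert: last write wins
        counts[e["country_code"]] += 1
    return read_model, dict(counts)

stream = [
    {"country_code": "UA", "severity": 0.9},
    {"country_code": "FR", "severity": 0.2},
    {"country_code": "UA", "severity": 0.7},  # newer event supersedes the first
]
latest, totals = materialise(stream)
assert latest == {"UA": 0.7, "FR": 0.2}
assert totals == {"UA": 2, "FR": 1}
```

Because the source stream is replayable, a read model can always be rebuilt from offset zero, which makes schema changes on the read side low-risk.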
Dead Letter Queue (DLQ)
Always configure a DLQ topic for messages that fail processing. Structure it identically to the main topic, with an additional header carrying the error reason. This enables:
- Non-blocking failure handling
- Forensic replay after bug fixes
- Alerting on error rate thresholds
```properties
# Kafka Streams error handler config: log and skip bad records instead of crashing
default.deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
default.production.exception.handler=org.apache.kafka.streams.errors.DefaultProductionExceptionHandler
```
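The DLQ pattern itself is straightforward to sketch: process each message, and on failure publish an identical copy to the DLQ with an error header. Plain Python, with in-memory lists standing in for topics:

```python
def process_with_dlq(messages, handler):
    """Route failed messages to a DLQ with an error-reason header."""
    output, dlq = [], []
    for msg in messages:
        try:
            output.append(handler(msg))
        except Exception as exc:
            # Same payload as the main topic, plus a header carrying the reason,
            # so the message can be replayed verbatim after a fix.
            dlq.append({"payload": msg, "headers": {"error": str(exc)}})
    return output, dlq

def handler(msg):
    if "severity" not in msg:
        raise ValueError("missing severity")
    return msg

ok, dlq = process_with_dlq(
    [{"event_id": "e1", "severity": 0.5}, {"event_id": "e2"}], handler
)
assert len(ok) == 1 and len(dlq) == 1
assert dlq[0]["headers"]["error"] == "missing severity"
```

The key property is that a poison message never blocks the partition: processing continues, and the failure is preserved for forensic replay.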
Operational Runbook
Key Metrics to Monitor
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Consumer lag (per partition) | > 10,000 | > 100,000 |
| Broker disk utilisation | > 70% | > 85% |
| Under-replicated partitions | > 0 | > 5 |
| Producer error rate | > 0.1% | > 1% |
| Request handler pool idle % | < 30% | < 10% |
Lag Alerting with Prometheus
```yaml
- alert: KafkaConsumerLagHigh
  expr: kafka_consumer_group_lag > 50000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer group {{ $labels.group }} is lagging on topic {{ $labels.topic }}"
    description: "Lag is {{ $value }} messages. Investigate consumer throughput."
```
Exactly-Once Semantics in Practice
Kafka supports exactly-once end-to-end when:
- Producers use idempotent mode (`enable.idempotence=true`)
- Producers use transactions for atomic multi-topic writes
- Consumers use the `read_committed` isolation level
For Flink on AWS (via Kinesis Data Analytics or self-managed), configure:
```yaml
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
```
Exactly-once is more expensive (higher latency, more coordination overhead). Reserve it for financial transactions, billing events, and audit trails. For analytics workloads, at-least-once with idempotent downstream sinks (upsert by event_id) is usually the pragmatic choice.
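The idempotent-sink alternative is easy to sketch: if the sink upserts by event_id, redelivered messages are harmless, so at-least-once delivery looks like exactly-once from the reader's side. An in-memory illustration, with a dict standing in for the sink table:

```python
def idempotent_sink(deliveries):
    """Upsert by event_id so at-least-once redelivery causes no duplicates."""
    table = {}
    for event in deliveries:
        table[event["event_id"]] = event  # replay of the same id is a no-op
    return table

# At-least-once delivery: e1 is redelivered after a consumer restart.
deliveries = [
    {"event_id": "e1", "severity": 0.9},
    {"event_id": "e2", "severity": 0.3},
    {"event_id": "e1", "severity": 0.9},  # duplicate delivery
]
table = idempotent_sink(deliveries)
assert len(table) == 2  # duplicates collapse; each effect applied once
```

In a real database this is an upsert keyed on event_id; the deduplication cost is pushed to the sink instead of paying for transactional coordination in the pipeline.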
Multi-Region and Disaster Recovery
For global platforms, MirrorMaker 2 (MM2) enables cross-cluster replication:
```bash
# Start MirrorMaker 2
./bin/connect-mirror-maker.sh mm2.properties
```

```properties
# mm2.properties
clusters = us-east-1, eu-west-1
us-east-1.bootstrap.servers = broker1.us-east-1:9092
eu-west-1.bootstrap.servers = broker1.eu-west-1:9092
us-east-1->eu-west-1.enabled = true
us-east-1->eu-west-1.topics = .*
replication.factor = 3
```
For active-active topologies, rely on topic namespacing to avoid replication loops: MM2's default replication policy prefixes replicated topics with the source cluster alias (us-east-1.topic-name vs. eu-west-1.topic-name), so a mirrored topic is never mirrored back.
Cost Optimisation
Event streaming costs can spiral without discipline. Key levers:
- Compression: LZ4 provides good compression ratio with minimal CPU overhead. Expect 3–5x reduction on JSON payloads.
- Tiered storage: MSK Tiered Storage offloads cold partitions to S3, reducing broker EBS costs by up to 60%.
- Partition sizing: Over-partitioning wastes broker resources. Target 1–3 partitions per consumer core at peak load.
- Retention tuning: Set retention to match your SLA, not infinity. Most analytics pipelines only need 7–30 days of raw events.
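The compression claim is easy to sanity-check. The snippet below uses Python's stdlib zlib rather than LZ4 (which needs a third-party binding) on a batch of structurally similar JSON events; the repeated field names are exactly what compresses well:

```python
import json
import zlib

# A batch of structurally similar events, as a producer would send them.
batch = [
    {"event_id": f"e{i}", "country_code": "UA", "event_type": "PROTEST",
     "severity": 0.5, "source_url": None}
    for i in range(1000)
]
raw = "\n".join(json.dumps(e) for e in batch).encode()
compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
assert ratio > 3  # repetitive JSON comfortably clears the 3x claim
```

LZ4 trades some of that ratio for much lower CPU cost, which is why it is the usual default for broker-side compression.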
Conclusion
Event streaming in the cloud requires thoughtful architectural decisions across ingestion, processing, schema management, and operations. The platform that gets these right gains a durable competitive advantage: real-time intelligence, resilient decoupling, and an auditable record of everything that happened.
For teams building geopolitical or market intelligence platforms, event streaming is what enables the continuous, low-latency signal processing that separates a dashboard from a true intelligence tool.
Try Harbinger Explorer free for 7 days — see what real-time geopolitical intelligence looks like when it's powered by a production-grade event streaming backbone.