Event Streaming Architecture in the Cloud: A Platform Engineer's Guide
Event streaming is no longer a niche capability reserved for high-frequency trading platforms or social media giants. Today, it is the backbone of modern cloud-native data platforms — enabling real-time analytics, microservice decoupling, change data capture, and operational intelligence at scale. This guide walks through the architectural decisions, deployment patterns, and operational realities of building event streaming infrastructure on the cloud.
Why Event Streaming?
Traditional batch ETL pipelines introduce latency by design. Data collected throughout the day is processed overnight and available in dashboards by morning. For most use cases in the 2010s, this was acceptable. In 2024, the landscape has fundamentally shifted:
- Real-time fraud detection needs sub-second decisioning
- Operational monitoring must surface anomalies before they become incidents
- Microservice choreography requires decoupled, asynchronous communication
- Event sourcing demands an immutable audit log with replay capability
Event streaming solves all of these by treating data as a continuous, ordered log of facts rather than rows in a table.
Core Concepts
Before choosing a platform, it's essential to align on terminology:
| Concept | Description |
|---|---|
| Topic / Stream | Named, append-only log of messages |
| Partition | Ordered sub-unit of a topic enabling parallelism |
| Consumer Group | Set of consumers that collectively process a topic |
| Offset | Position of a message within a partition |
| Schema Registry | Central store for message schemas (Avro, Protobuf, JSON Schema) |
| Exactly-once semantics | Guarantee that each message is processed exactly one time |
| Compaction | Retention strategy that keeps only the latest value per key |
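These concepts are easier to internalise with a toy model. The sketch below is plain Python with no Kafka client, purely illustrative: it models a topic as a set of partitions, routes messages to a partition by key hash, assigns sequential offsets, and applies compaction by keeping only the latest value per key.

```python
class ToyTopic:
    """Illustrative in-memory model of a partitioned, append-only topic."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The same key always hashes to the same partition,
        # which is what preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def compacted(self, partition):
        # Compaction keeps only the latest value per key within a partition.
        latest = {}
        for key, value in self.partitions[partition]:
            latest[key] = value
        return latest

topic = ToyTopic()
p, offset0 = topic.produce("user-42", {"status": "active"})
_, offset1 = topic.produce("user-42", {"status": "suspended"})
assert offset1 == offset0 + 1  # offsets are sequential within a partition
assert topic.compacted(p) == {"user-42": {"status": "suspended"}}
```

Real brokers add replication, durability, and consumer coordination on top, but the log-plus-offsets mental model is the same.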
Cloud Platform Comparison
Three major managed event streaming services dominate cloud deployments:
| Feature | Amazon Kinesis | Google Pub/Sub | Azure Event Hubs |
|---|---|---|---|
| Native SDK | AWS SDK | Google Cloud SDK | Azure SDK / AMQP |
| Kafka compatibility | Via MSK | Via Confluent on GCP | Kafka-compatible endpoint |
| Retention | 1–365 days | Up to 31 days (7-day default) | 1–90 days |
| Partitioning | Shards (fixed) | Automatic | Partitions (fixed) |
| Schema registry | Glue Schema Registry | Confluent on GCP | Schema Registry (EventHub) |
| Delivery guarantee (default) | At-least-once | At-least-once | At-least-once |
| Managed Kafka | MSK (fully managed) | N/A (partner) | No |
| Serverless option | Kinesis On-Demand | Pub/Sub (always serverless) | Event Hubs Serverless |
For teams with Kafka expertise and multi-cloud ambitions, Amazon MSK or a self-managed Kafka cluster on Kubernetes (via Strimzi) provides the most portable solution.
Reference Architecture
A production-grade event streaming topology for a geopolitical intelligence platform, the kind of real-time data architecture powering tools like Harbinger Explorer, is organised as a raw-to-enriched pipeline:
This architecture separates concerns cleanly:
- Raw topics receive unvalidated, high-volume ingestion
- Stream processors enrich, filter, and validate in flight
- Downstream consumers materialise views appropriate to each service
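The raw-to-enriched flow can be sketched as a small validate/enrich pipeline. This is plain Python with hypothetical field names (country_code, severity), not a real Kafka Streams topology:

```python
RAW_EVENTS = [
    {"event_id": "e1", "country_code": "UA", "severity": 0.9},
    {"event_id": "e2"},                                  # invalid: missing fields
    {"event_id": "e3", "country_code": "FR", "severity": 0.2},
]

def validate(event):
    # Gate unvalidated ingestion: require the fields downstream consumers rely on.
    return "country_code" in event and "severity" in event

def enrich(event):
    # Add derived attributes in flight; a real processor might also
    # join against reference data here.
    return {**event, "is_critical": event["severity"] >= 0.8}

def process(raw_stream):
    for event in raw_stream:
        if validate(event):
            yield enrich(event)
        # invalid events would be routed to a dead letter topic instead

enriched = list(process(RAW_EVENTS))
assert [e["event_id"] for e in enriched] == ["e1", "e3"]
assert enriched[0]["is_critical"] is True
```

The generator shape mirrors the streaming model: events flow through one at a time, and the raw topic is never mutated.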
Provisioning with Terraform
Here's a minimal Terraform configuration for an MSK cluster on AWS:
```hcl
resource "aws_msk_cluster" "harbinger_events" {
  cluster_name           = "harbinger-events"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 1000
      }
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.main.arn
    revision = aws_msk_configuration.main.latest_revision
  }

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

resource "aws_msk_configuration" "main" {
  name           = "harbinger-kafka-config"
  kafka_versions = ["3.5.1"]

  server_properties = <<-EOF
    auto.create.topics.enable=false
    default.replication.factor=3
    min.insync.replicas=2
    num.partitions=12
    log.retention.hours=168
    compression.type=lz4
  EOF
}
```
Schema Registry and Schema Evolution
Untyped JSON events are a production disaster waiting to happen. Every schema change becomes a breaking change, and you have no way to query historical messages reliably.
Use Apache Avro with Confluent Schema Registry for:
- Backward compatibility: new consumers can read old messages
- Forward compatibility: old consumers can read new messages
- Schema-on-read: storage is compact binary; schema is resolved at read time
Example Avro schema for a geopolitical event:
```json
{
  "type": "record",
  "name": "GeopoliticalEvent",
  "namespace": "com.harbinger.events",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "country_code", "type": "string"},
    {"name": "event_type", "type": {"type": "enum", "name": "EventType", "symbols": ["CONFLICT", "ELECTION", "SANCTION", "PROTEST", "NATURAL_DISASTER"]}},
    {"name": "severity", "type": "float"},
    {"name": "source_url", "type": ["null", "string"], "default": null}
  ]
}
```

Note that the `logicalType` annotation must sit inside the type object; as a sibling of `"type": "long"` it is silently ignored by Avro.
Register it via CLI:
```bash
curl -X POST http://schema-registry:8081/subjects/geopolitical-events-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"GeopoliticalEvent\",...}"}'
```

Note that the schema is passed as a JSON string, so its inner quotes must be escaped.
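Backward compatibility rests on schema resolution: a reader fills in defaults for fields the writer did not supply. The sketch below imitates that rule in plain Python on dicts; it is not the Avro library, and the `confidence` field is a hypothetical addition from a newer schema version:

```python
_NO_DEFAULT = object()  # sentinel: field has no default, so it is required

READER_FIELDS = [
    ("event_id", _NO_DEFAULT),
    ("country_code", _NO_DEFAULT),
    ("source_url", None),   # nullable with default null, as in the schema above
    ("confidence", 1.0),    # hypothetical field added in a newer schema version
]

def resolve(record):
    """Apply reader-side defaults, mimicking Avro schema resolution."""
    out = {}
    for name, default in READER_FIELDS:
        if name in record:
            out[name] = record[name]
        elif default is not _NO_DEFAULT:
            out[name] = default
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

# A message written before 'confidence' existed still resolves cleanly.
old_message = {"event_id": "e1", "country_code": "UA"}
new_view = resolve(old_message)
assert new_view["confidence"] == 1.0
assert new_view["source_url"] is None
```

This is why every newly added field should carry a default: without one, old messages become unreadable under the new schema.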
Consumer Patterns
Fan-Out
One topic, multiple independent consumer groups. Each group maintains its own offset. Use for:
- Parallel processing pipelines (analytics vs. alerting)
- Different SLAs (real-time vs. batch)
CQRS Read Models
Consume an event stream to materialise query-optimised read models in separate databases. This decouples write throughput from read performance.
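A read model materialiser can be as simple as folding the stream into a query-optimised structure. A minimal sketch, with an in-memory dict standing in for a real database and the event shape assumed:

```python
from collections import defaultdict

def materialise(events):
    """Fold an event stream into a read model: latest severity per country."""
    read_model = {}
    counts = defaultdict(int)
    for e in events:
        read_model[e["country_code"]] = e["severity"]  # upsert: last write wins
        counts[e["country_code"]] += 1
    return read_model, dict(counts)

stream = [
    {"country_code": "UA", "severity": 0.9},
    {"country_code": "FR", "severity": 0.2},
    {"country_code": "UA", "severity": 0.7},  # newer event supersedes the first
]
latest, totals = materialise(stream)
assert latest == {"UA": 0.7, "FR": 0.2}
assert totals == {"UA": 2, "FR": 1}
```

Because the source stream is replayable, a read model can always be rebuilt from offset zero, which makes schema changes on the read side low-risk.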
Dead Letter Queue (DLQ)
Always configure a DLQ topic for messages that fail processing. Structure it identically to the main topic, with an additional header carrying the error reason. This enables:
- Non-blocking failure handling
- Forensic replay after bug fixes
- Alerting on error rate thresholds
```properties
# Kafka Streams error handler config: log and skip bad records instead of crashing
default.deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
default.production.exception.handler=org.apache.kafka.streams.errors.DefaultProductionExceptionHandler
```
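The DLQ pattern itself is straightforward to sketch: process each message, and on failure publish an identical copy to the DLQ with an error header. Plain Python, with in-memory lists standing in for topics:

```python
def process_with_dlq(messages, handler):
    """Route failed messages to a DLQ with an error-reason header."""
    output, dlq = [], []
    for msg in messages:
        try:
            output.append(handler(msg))
        except Exception as exc:
            # Same payload as the main topic, plus a header carrying the reason,
            # so the message can be replayed verbatim after a fix.
            dlq.append({"payload": msg, "headers": {"error": str(exc)}})
    return output, dlq

def handler(msg):
    if "severity" not in msg:
        raise ValueError("missing severity")
    return msg

ok, dlq = process_with_dlq(
    [{"event_id": "e1", "severity": 0.5}, {"event_id": "e2"}], handler
)
assert len(ok) == 1 and len(dlq) == 1
assert dlq[0]["headers"]["error"] == "missing severity"
```

The key property is that a poison message never blocks the partition: processing continues, and the failure is preserved for forensic replay.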
Operational Runbook
Key Metrics to Monitor
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Consumer lag (per partition) | > 10,000 | > 100,000 |
| Broker disk utilisation | > 70% | > 85% |
| Under-replicated partitions | > 0 | > 5 |
| Producer error rate | > 0.1% | > 1% |
| Request handler pool idle % | < 30% | < 10% |
Lag Alerting with Prometheus
```yaml
- alert: KafkaConsumerLagHigh
  expr: kafka_consumer_group_lag > 50000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer group {{ $labels.group }} is lagging on topic {{ $labels.topic }}"
    description: "Lag is {{ $value }} messages. Investigate consumer throughput."
```
Exactly-Once Semantics in Practice
Kafka supports exactly-once end-to-end when:
- Producers use idempotent mode (`enable.idempotence=true`)
- Producers use transactions for atomic multi-topic writes
- Consumers use the `read_committed` isolation level
For Flink on AWS (via Kinesis Data Analytics or self-managed), configure:
```yaml
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
```
Exactly-once is more expensive (higher latency, more coordination overhead). Reserve it for financial transactions, billing events, and audit trails. For analytics workloads, at-least-once with idempotent downstream sinks (upsert by event_id) is usually the pragmatic choice.
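The idempotent-sink alternative is easy to sketch: if the sink upserts by event_id, redelivered messages are harmless, so at-least-once delivery looks like exactly-once from the reader's side. An in-memory illustration, with a dict standing in for the sink table:

```python
def idempotent_sink(deliveries):
    """Upsert by event_id so at-least-once redelivery causes no duplicates."""
    table = {}
    for event in deliveries:
        table[event["event_id"]] = event  # replay of the same id is a no-op
    return table

# At-least-once delivery: e1 is redelivered after a consumer restart.
deliveries = [
    {"event_id": "e1", "severity": 0.9},
    {"event_id": "e2", "severity": 0.3},
    {"event_id": "e1", "severity": 0.9},  # duplicate delivery
]
table = idempotent_sink(deliveries)
assert len(table) == 2  # duplicates collapse; each effect applied once
```

In a real database this is an upsert keyed on event_id; the deduplication cost is pushed to the sink instead of paying for transactional coordination in the pipeline.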
Multi-Region and Disaster Recovery
For global platforms, MirrorMaker 2 (MM2) enables cross-cluster replication:
```bash
# Start MirrorMaker 2
./bin/connect-mirror-maker.sh mm2.properties
```

```properties
# mm2.properties
clusters = us-east-1, eu-west-1
us-east-1.bootstrap.servers = broker1.us-east-1:9092
eu-west-1.bootstrap.servers = broker1.eu-west-1:9092
us-east-1->eu-west-1.enabled = true
us-east-1->eu-west-1.topics = .*
replication.factor = 3
```
For active-active topologies, rely on topic namespacing to avoid replication loops: MM2's default replication policy prefixes replicated topics with the source cluster alias (us-east-1.topic-name vs. eu-west-1.topic-name), so a mirrored topic is never mirrored back.
Cost Optimisation
Event streaming costs can spiral without discipline. Key levers:
- Compression: LZ4 provides good compression ratio with minimal CPU overhead. Expect 3–5x reduction on JSON payloads.
- Tiered storage: MSK Tiered Storage offloads cold partitions to S3, reducing broker EBS costs by up to 60%.
- Partition sizing: Over-partitioning wastes broker resources. Target 1–3 partitions per consumer core at peak load.
- Retention tuning: Set retention to match your SLA, not infinity. Most analytics pipelines only need 7–30 days of raw events.
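The compression claim is easy to sanity-check. The snippet below uses Python's stdlib zlib rather than LZ4 (which needs a third-party binding) on a batch of structurally similar JSON events; the repeated field names are exactly what compresses well:

```python
import json
import zlib

# A batch of structurally similar events, as a producer would send them.
batch = [
    {"event_id": f"e{i}", "country_code": "UA", "event_type": "PROTEST",
     "severity": 0.5, "source_url": None}
    for i in range(1000)
]
raw = "\n".join(json.dumps(e) for e in batch).encode()
compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
assert ratio > 3  # repetitive JSON comfortably clears the 3x claim
```

LZ4 trades some of that ratio for much lower CPU cost, which is why it is the usual default for broker-side compression.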
Conclusion
Event streaming in the cloud requires thoughtful architectural decisions across ingestion, processing, schema management, and operations. The platform that gets these right gains a durable competitive advantage: real-time intelligence, resilient decoupling, and an auditable record of everything that happened.
For teams building geopolitical or market intelligence platforms, event streaming is what enables the continuous, low-latency signal processing that separates a dashboard from a true intelligence tool.
Try Harbinger Explorer free for 7 days — see what real-time geopolitical intelligence looks like when it's powered by a production-grade event streaming backbone.