Event Streaming Architecture in the Cloud: A Platform Engineer's Guide

12 min read · Tags: event-streaming, kafka, kinesis, pubsub, real-time, cloud-architecture


Event streaming is no longer a niche capability reserved for high-frequency trading platforms or social media giants. Today, it is the backbone of modern cloud-native data platforms — enabling real-time analytics, microservice decoupling, change data capture, and operational intelligence at scale. This guide walks through the architectural decisions, deployment patterns, and operational realities of building event streaming infrastructure on the cloud.


Why Event Streaming?

Traditional batch ETL pipelines introduce latency by design. Data collected throughout the day is processed overnight and available in dashboards by morning. For most use cases in the 2010s, this was acceptable. In 2024, the landscape has fundamentally shifted:

  • Real-time fraud detection needs sub-second decisioning
  • Operational monitoring must surface anomalies before they become incidents
  • Microservice choreography requires decoupled, asynchronous communication
  • Event sourcing demands an immutable audit log with replay capability

Event streaming solves all of these by treating data as a continuous, ordered log of facts rather than rows in a table.


Core Concepts

Before choosing a platform, it's essential to align on terminology:

| Concept | Description |
| --- | --- |
| Topic / Stream | Named, append-only log of messages |
| Partition | Ordered sub-unit of a topic enabling parallelism |
| Consumer Group | Set of consumers that collectively process a topic |
| Offset | Position of a message within a partition |
| Schema Registry | Central store for message schemas (Avro, Protobuf, JSON Schema) |
| Exactly-once semantics | Guarantee that each message is processed exactly one time |
| Compaction | Retention strategy that keeps only the latest value per key |
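
The relationship between partitions, offsets, and consumer groups is easiest to see in miniature. Below is a toy in-memory sketch (the `Topic` and `ConsumerGroup` classes are invented for illustration, not a client API):

```python
from collections import defaultdict

class Topic:
    """Toy append-only log split into partitions, keyed like Kafka."""
    def __init__(self, name: str, partitions: int):
        self.name = name
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key: str, value: str) -> tuple[int, int]:
        # The same key always hashes to the same partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

class ConsumerGroup:
    """Each group tracks its own offsets, independent of every other group."""
    def __init__(self, topic: Topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition: int) -> list[str]:
        records = self.topic.partitions[partition][self.offsets[partition]:]
        self.offsets[partition] += len(records)  # "commit" after processing
        return records

events = Topic("geopolitical-events", partitions=3)
p, off = events.append("DE", "election-called")
events.append("DE", "coalition-formed")

analytics = ConsumerGroup(events)
alerting = ConsumerGroup(events)
print(len(analytics.poll(p)))  # 2 — each group reads the full partition
print(len(alerting.poll(p)))   # 2 — independently of the other group
```

This is the property that makes fan-out cheap: adding a consumer group costs only another set of offsets, not another copy of the data.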

Cloud Platform Comparison

Three major managed event streaming services dominate cloud deployments:

| Feature | Amazon Kinesis | Google Pub/Sub | Azure Event Hubs |
| --- | --- | --- | --- |
| Native SDK | AWS SDK | Google Cloud SDK | Azure SDK / AMQP |
| Kafka compatibility | Via MSK | Via Confluent on GCP | Kafka-compatible endpoint |
| Retention | 1–365 days | 7 days (configurable) | 1–90 days |
| Partitioning | Shards (fixed) | Automatic | Partitions (fixed) |
| Schema registry | Glue Schema Registry | Confluent on GCP | Schema Registry (Event Hubs) |
| Delivery guarantee | At-least-once | At-least-once | At-least-once |
| Managed Kafka | MSK (fully managed) | N/A (partner) | No |
| Serverless option | Kinesis On-Demand | Pub/Sub (always serverless) | Event Hubs Serverless |

For teams with Kafka expertise and multi-cloud ambitions, Amazon MSK or a self-managed Kafka cluster on Kubernetes (via Strimzi) provides the most portable solution.


Reference Architecture

The following diagram illustrates a production-grade event streaming topology for a geopolitical intelligence platform — the kind of real-time data architecture powering tools like Harbinger Explorer:

[Diagram: ingestion sources feeding raw topics, stream processors enriching in flight, and downstream consumers materialising views]

This architecture separates concerns cleanly:

  • Raw topics receive unvalidated, high-volume ingestion
  • Stream processors enrich, filter, and validate in flight
  • Downstream consumers materialise views appropriate to each service
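
The middle tier — enrich, filter, and validate in flight — can be sketched as a generator pipeline over raw records (a stdlib-only illustration; the field names and severity threshold are assumed for this example):

```python
import json
from typing import Iterator

RAW = [
    '{"event_id": "e1", "country_code": "DE", "severity": 0.7}',
    'not-json',  # malformed input that the validator must filter out
    '{"event_id": "e2", "country_code": "FR", "severity": 1.4}',
]

def validate(raw: Iterator[str]) -> Iterator[dict]:
    """Drop records that fail to parse or lack required fields."""
    for line in raw:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # in production, route to a dead letter topic instead
        if {"event_id", "country_code"} <= event.keys():
            yield event

def enrich(events: Iterator[dict]) -> Iterator[dict]:
    """Attach derived fields in flight; here, an assumed severity band."""
    for event in events:
        event["severity_band"] = "high" if event.get("severity", 0) > 1.0 else "low"
        yield event

processed = list(enrich(validate(iter(RAW))))
print([e["event_id"] for e in processed])  # ['e1', 'e2']
```

Because each stage is a generator, records flow through one at a time — the same shape a Kafka Streams or Flink job gives you, minus the distributed state and delivery guarantees.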

Provisioning with Terraform

Here's a minimal Terraform configuration for an MSK cluster on AWS:

resource "aws_msk_cluster" "harbinger_events" {
  cluster_name           = "harbinger-events"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 1000
      }
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.main.arn
    revision = aws_msk_configuration.main.latest_revision
  }

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

resource "aws_msk_configuration" "main" {
  name              = "harbinger-kafka-config"
  kafka_versions    = ["3.5.1"]
  server_properties = <<-EOF
    auto.create.topics.enable=false
    default.replication.factor=3
    min.insync.replicas=2
    num.partitions=12
    log.retention.hours=168
    compression.type=lz4
  EOF
}
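
The broker settings above (default.replication.factor=3, min.insync.replicas=2) only deliver durability if producers cooperate. A matching client-side configuration might look like this — a sketch to pair with the cluster config, tuned values are workload-dependent:

```properties
# Producer durability settings to pair with min.insync.replicas=2
acks=all                     # wait for all in-sync replicas to acknowledge
enable.idempotence=true      # broker de-duplicates on producer retry
compression.type=lz4         # matches the cluster-wide default above
linger.ms=20                 # small batching window to improve throughput
```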

Schema Registry and Schema Evolution

Untyped JSON events are a production disaster waiting to happen. Every schema change becomes a breaking change, and you have no way to query historical messages reliably.

Use Apache Avro with Confluent Schema Registry for:

  • Backward compatibility: new consumers can read old messages
  • Forward compatibility: old consumers can read new messages
  • Schema-on-read: storage is compact binary; schema is resolved at read time

Example Avro schema for a geopolitical event:

{
  "type": "record",
  "name": "GeopoliticalEvent",
  "namespace": "com.harbinger.events",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "country_code", "type": "string"},
    {"name": "event_type", "type": {"type": "enum", "name": "EventType", "symbols": ["CONFLICT", "ELECTION", "SANCTION", "PROTEST", "NATURAL_DISASTER"]}},
    {"name": "severity", "type": "float"},
    {"name": "source_url", "type": ["null", "string"], "default": null}
  ]
}

Register it via CLI:

curl -X POST http://schema-registry:8081/subjects/geopolitical-events-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"GeopoliticalEvent\",...}"}'

Consumer Patterns

Fan-Out

One topic, multiple independent consumer groups. Each group maintains its own offset. Use for:

  • Parallel processing pipelines (analytics vs. alerting)
  • Different SLAs (real-time vs. batch)

CQRS Read Models

Consume an event stream to materialise query-optimised read models in separate databases. This decouples write throughput from read performance.
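
Materialising read models is just folding the stream into query-shaped structures. A stdlib-only sketch (the event shape is assumed for illustration; a real consumer would read from the topic and write to a database):

```python
from collections import Counter

# Assumed event shape; in production these arrive from the event stream.
stream = [
    {"event_id": "e1", "country_code": "DE", "event_type": "ELECTION"},
    {"event_id": "e2", "country_code": "FR", "event_type": "PROTEST"},
    {"event_id": "e3", "country_code": "DE", "event_type": "SANCTION"},
]

# Two read models from the same stream, each shaped for its own queries.
counts_by_country = Counter(e["country_code"] for e in stream)  # dashboard totals
latest_by_country = {e["country_code"]: e for e in stream}      # "current state" lookups

print(counts_by_country["DE"])                # 2
print(latest_by_country["DE"]["event_type"])  # SANCTION
```

Rebuilding either model after a bug fix is simply replaying the stream from offset zero — no migration scripts against the write store.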

Dead Letter Queue (DLQ)

Always configure a DLQ topic for messages that fail processing. Structure it identically to the main topic, with an additional header carrying the error reason. This enables:

  • Non-blocking failure handling
  • Forensic replay after bug fixes
  • Alerting on error rate thresholds

# Kafka Streams error-handler config
default.deserialization.exception.handler: org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
default.production.exception.handler: org.apache.kafka.streams.errors.DefaultProductionExceptionHandler
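
The routing logic itself is simple: attempt processing, and on failure emit the untouched payload to the DLQ with the error reason as a header. A minimal stdlib sketch (the `process` function and field names are invented for illustration):

```python
def process(event: dict) -> dict:
    """Toy processor: fails on events missing a required field."""
    if "country_code" not in event:
        raise ValueError("missing country_code")
    return event

main_out, dlq = [], []

for record in [{"event_id": "e1", "country_code": "DE"}, {"event_id": "e2"}]:
    try:
        main_out.append(process(record))
    except ValueError as err:
        # DLQ record mirrors the main topic's payload, plus an error "header"
        dlq.append({"headers": {"error.reason": str(err)}, "payload": record})

print(len(main_out), len(dlq))  # 1 1
```

Keeping the payload byte-for-byte identical to the main topic is what makes forensic replay trivial: after the fix ships, a replay job consumes the DLQ and re-submits the payloads unchanged.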

Operational Runbook

Key Metrics to Monitor

| Metric | Warning threshold | Critical threshold |
| --- | --- | --- |
| Consumer lag (per partition) | > 10,000 | > 100,000 |
| Broker disk utilisation | > 70% | > 85% |
| Under-replicated partitions | > 0 | > 5 |
| Producer error rate | > 0.1% | > 1% |
| Request handler pool idle % | < 30% | < 10% |

Lag Alerting with Prometheus

- alert: KafkaConsumerLagHigh
  expr: kafka_consumer_group_lag > 50000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer group {{ $labels.group }} is lagging on topic {{ $labels.topic }}"
    description: "Lag is {{ $value }} messages. Investigate consumer throughput."

Exactly-Once Semantics in Practice

Kafka supports exactly-once end-to-end when:

  1. Producers use idempotent mode (enable.idempotence=true)
  2. Producers use transactions for atomic multi-topic writes
  3. Consumers use read_committed isolation level
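
In Kafka client terms, the three requirements map to a handful of configuration keys. A sketch — the `transactional.id` value is an invented example, and it must be unique and stable per producer instance:

```properties
# Producer
enable.idempotence=true
transactional.id=harbinger-enrichment-1
acks=all

# Consumer
isolation.level=read_committed
```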

For Flink on AWS (via Kinesis Data Analytics or self-managed), configure:

execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink-checkpoints

Exactly-once is more expensive (higher latency, more coordination overhead). Reserve it for financial transactions, billing events, and audit trails. For analytics workloads, at-least-once with idempotent downstream sinks (upsert by event_id) is usually the pragmatic choice.
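
That at-least-once-plus-idempotent-sink pattern can be sketched as an upsert keyed by event_id, so a redelivered message overwrites rather than duplicates (the dict stands in for any upsert-capable store):

```python
sink: dict[str, dict] = {}  # stands in for an upsert-capable store, e.g. a keyed table

def upsert(event: dict) -> None:
    # Keyed writes make redelivery harmless: the second write is a no-op overwrite.
    sink[event["event_id"]] = event

delivered = [
    {"event_id": "e1", "severity": 0.7},
    {"event_id": "e1", "severity": 0.7},  # redelivery under at-least-once
    {"event_id": "e2", "severity": 1.4},
]
for event in delivered:
    upsert(event)

print(len(sink))  # 2 — the duplicate collapsed
```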


Multi-Region and Disaster Recovery

For global platforms, MirrorMaker 2 (MM2) enables cross-cluster replication:

# Start MirrorMaker 2
./bin/connect-mirror-maker.sh mm2.properties

# mm2.properties
clusters = us-east-1, eu-west-1
us-east-1.bootstrap.servers = broker1.us-east-1:9092
eu-west-1.bootstrap.servers = broker1.eu-west-1:9092
us-east-1->eu-west-1.enabled = true
us-east-1->eu-west-1.topics = .*
replication.factor = 3

For active-active topologies, use topic namespacing to avoid replication loops (us-east-1.topic-name vs. eu-west-1.topic-name).
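
The namespacing rule can be made mechanical: a replicator should never copy a topic back to the cluster named in its prefix, which is essentially what MM2's default replication policy does. A minimal sketch of both halves of the rule:

```python
def should_replicate(topic: str, target: str) -> bool:
    """Skip topics whose cluster prefix matches the target: they originated there."""
    prefix = topic.split(".", 1)[0]
    return prefix != target

def remote_name(source: str, topic: str) -> str:
    """MM2-style rename: prefix the topic with its source cluster alias."""
    return f"{source}.{topic}"

# eu-west-1 holds its own "events" plus the mirrored "us-east-1.events".
# The eu-west-1 -> us-east-1 replicator must not bounce the mirror back:
print(should_replicate("events", target="us-east-1"))            # True
print(should_replicate("us-east-1.events", target="us-east-1"))  # False
print(remote_name("eu-west-1", "events"))                        # eu-west-1.events
```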


Cost Optimisation

Event streaming costs can spiral without discipline. Key levers:

  • Compression: LZ4 provides good compression ratio with minimal CPU overhead. Expect 3–5x reduction on JSON payloads.
  • Tiered storage: MSK Tiered Storage offloads older log segments to S3, reducing broker EBS costs by up to 60%.
  • Partition sizing: Over-partitioning wastes broker resources. Target 1–3 partitions per consumer core at peak load.
  • Retention tuning: Set retention to match your SLA, not infinity. Most analytics pipelines only need 7–30 days of raw events.
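
The compression claim is easy to sanity-check empirically. LZ4 isn't in the Python standard library, so this sketch uses zlib as a stand-in; real-world ratios depend on how repetitive the payloads are:

```python
import json
import zlib

# A batch of similar JSON events, as a real producer batch typically is.
batch = json.dumps([
    {"event_id": f"e{i}", "country_code": "DE", "event_type": "PROTEST", "severity": 0.5}
    for i in range(500)
]).encode()

compressed = zlib.compress(batch, level=6)
ratio = len(batch) / len(compressed)
print(f"{len(batch)} -> {len(compressed)} bytes, ratio {ratio:.1f}x")
# Repetitive JSON typically lands well above the 3x figure quoted above.
```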

Conclusion

Event streaming in the cloud requires thoughtful architectural decisions across ingestion, processing, schema management, and operations. The platform that gets these right gains a durable competitive advantage: real-time intelligence, resilient decoupling, and an auditable record of everything that happened.

For teams building geopolitical or market intelligence platforms, event streaming is what enables the continuous, low-latency signal processing that separates a dashboard from a true intelligence tool.


Try Harbinger Explorer free for 7 days — see what real-time geopolitical intelligence looks like when it's powered by a production-grade event streaming backbone.

