
Cloud-Native ETL Patterns for Modern Data Platforms

12 min read · Tags: etl, data-engineering, cloud-native, streaming, terraform, airflow


Modern data platforms have moved well beyond the nightly batch window. Today's architectures are expected to deliver fresh, reliable, and schema-resilient data pipelines that scale automatically, fail gracefully, and cost proportionally to usage. Cloud-native ETL is no longer a nice-to-have—it's the baseline.

This article covers the patterns that separate brittle legacy pipelines from production-grade, cloud-native data platforms. We'll walk through architecture decisions, concrete code examples, and the tradeoffs you'll encounter in the real world.


What "Cloud-Native ETL" Actually Means

Cloud-native ETL isn't just "run your SSIS packages on a VM in AWS." It's a set of design principles:

  • Stateless compute: Workers don't hold state between runs
  • Event-driven triggers: Pipelines react to data arrival, not clocks
  • Managed infrastructure: No patching Kafka clusters at 2 AM
  • Schema-aware: Pipelines evolve with your data model, not against it
  • Idempotent by default: Re-running a pipeline produces the same result
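Two of these principles — statelessness and idempotency — are easiest to see in code. A minimal sketch in plain Python (names and structure are illustrative, not a real framework):

```python
def upsert(target: dict, batch: list) -> dict:
    """Idempotent upsert keyed by 'id': an incoming record replaces the
    stored row only when its 'updated_at' is strictly newer, so replaying
    the same batch is a no-op and the worker holds no state of its own."""
    result = dict(target)  # never mutate shared state in place
    for rec in batch:
        current = result.get(rec["id"])
        if current is None or rec["updated_at"] > current["updated_at"]:
            result[rec["id"]] = rec
    return result
```

Running `upsert` twice with the same batch yields exactly the same table as running it once — the property the MERGE pattern later in this article enforces at the lakehouse level.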

Pattern 1: The Medallion Architecture

The most widely adopted pattern in modern lakehouses. Data flows through Bronze → Silver → Gold layers, each with increasing quality guarantees.

| Layer  | Purpose                          | Format          | Update Frequency |
|--------|----------------------------------|-----------------|------------------|
| Bronze | Raw ingestion, no transformation | Parquet / Delta | Near-real-time   |
| Silver | Cleaned, deduplicated, typed     | Delta           | 15–60 min        |
| Gold   | Business-ready aggregates        | Delta / Iceberg | Hourly / Daily   |

Terraform: S3 Bucket Structure for Medallion

resource "aws_s3_bucket" "data_lake" {
  for_each = toset(["bronze", "silver", "gold"])
  bucket   = "my-platform-${each.key}-${var.environment}"

  tags = {
    Layer       = each.key
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  for_each = aws_s3_bucket.data_lake
  bucket   = each.value.id

  versioning_configuration {
    status = "Enabled"
  }
}

Pattern 2: Change Data Capture (CDC) Ingestion

Polling source databases for changes hammers them with repeated queries and still misses deletes. CDC tails the transaction log instead, capturing inserts, updates, and deletes as a stream of events with minimal load on the source.

Debezium + Kafka: Core Config

# debezium-postgres-connector.yaml
name: postgres-cdc-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: prod-postgres.internal
  database.port: "5432"
  database.user: debezium_reader
  database.password: "${secrets.pg_password}"
  database.dbname: transactions
  database.server.name: prod_pg
  table.include.list: public.orders,public.customers,public.payments
  plugin.name: pgoutput
  publication.autocreate.mode: filtered
  slot.name: debezium_slot
  topic.prefix: cdc
  decimal.handling.mode: double
  heartbeat.interval.ms: "5000"

Handling CDC Events Downstream

CDC produces three event types for ongoing changes: c (create), u (update), d (delete) — plus r for rows read during the initial snapshot. Your landing logic must handle all of them:

# Spark Structured Streaming job config
spark:
  app.name: cdc-bronze-landing
  streaming:
    trigger: processingTime=30 seconds
    checkpointLocation: s3://my-platform-bronze/checkpoints/cdc/
    outputMode: append
  kafka:
    bootstrap.servers: kafka-broker-1:9092,kafka-broker-2:9092
    subscribe: cdc.public.orders,cdc.public.customers
    startingOffsets: latest
    failOnDataLoss: "false"
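Downstream of the landing job, the merge logic has to interpret the op field. A simplified, framework-free sketch of that logic (rows keyed by id; in production this runs inside a Spark foreachBatch or a Delta MERGE):

```python
def apply_cdc_events(table: dict, events: list) -> dict:
    """Fold Debezium-style change events into a keyed table.
    'c' (create), 'u' (update), and 'r' (snapshot read) carry the new
    row in 'after'; 'd' (delete) carries the old row in 'before'."""
    for ev in events:
        if ev["op"] in ("c", "u", "r"):
            row = ev["after"]
            table[row["id"]] = row
        elif ev["op"] == "d":
            # tolerate replayed deletes: the key may already be gone
            table.pop(ev["before"]["id"], None)
    return table
```

Note that deletes are tolerant of replays — popping a key that is already gone is a no-op, which keeps the fold idempotent.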

Pattern 3: Schema Evolution Without Breaking Pipelines

Schema drift is the silent killer of ETL pipelines. A source system adds a column—your pipeline breaks. The fix is to make your pipelines schema-aware from day one.

Strategy 1: Schema Registry (Apache Avro)

# Register a schema with Confluent Schema Registry
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{
    "schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"currency\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  }'

# Check compatibility before deploying schema changes
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d @new_schema.json
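The rule the registry enforces under BACKWARD compatibility can be approximated in a few lines. This is a deliberately simplified subset of Avro's full resolution rules, for illustration only:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """BACKWARD compatibility, simplified: the new (reader) schema can
    decode data written with the old one as long as every field that is
    new in the reader carries a default. Removing fields is allowed
    under this check; type changes are not modeled here."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return all(
        f["name"] in old_names or "default" in f
        for f in new_schema["fields"]
    )
```

Adding the nullable currency field with "default": null passes; adding a required field without a default does not — which is exactly the split in the compatibility table below.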

Strategy 2: Delta Lake Schema Evolution

# Enable automatic schema merging in Delta Lake
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Write with mergeSchema option
df.write \
  .format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save("s3://my-platform-silver/orders/")

| Evolution Type      | Schema Registry     | Delta Auto-Merge | Manual Migration |
|---------------------|---------------------|------------------|------------------|
| Add nullable column | ✅ BACKWARD compat  | ✅ Auto          | ✅ Manual        |
| Remove column       | ❌ FULL compat only | ❌ Fails         | ✅ Manual        |
| Change type         | ❌ Breaking         | ❌ Fails         | ✅ Manual        |
| Add required column | ❌ Breaking         | ❌ Fails         | ✅ Manual        |

Pattern 4: Idempotent Loads

Re-runnable pipelines are safe pipelines. Every load should produce the same result whether it runs once or ten times.

UPSERT Pattern with MERGE

-- Delta Lake MERGE for idempotent upserts
MERGE INTO silver.orders AS target
USING (
  SELECT * FROM bronze.orders_staging
  WHERE _ingestion_date = current_date()
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

Partition-Replace Pattern

For large historical loads, replace entire partitions atomically:

# Spark: overwrite specific partitions only
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write \
  .format("delta") \
  .mode("overwrite") \
  .partitionBy("year", "month", "day") \
  .save("s3://my-platform-silver/events/")
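The semantics of dynamic partition overwrite can be sketched without Spark. Here rows are plain dicts and the partition columns are assumed to be year/month/day, matching the partitionBy above:

```python
def replace_partitions(existing: list, batch: list,
                       keys: tuple = ("year", "month", "day")) -> list:
    """Dynamic partition overwrite: drop only the partitions that appear
    in the incoming batch, keep every other partition untouched, then
    append the batch. Re-running with the same batch is a no-op."""
    touched = {tuple(r[k] for k in keys) for r in batch}
    kept = [r for r in existing if tuple(r[k] for k in keys) not in touched]
    return kept + batch
```

The key property: partitions absent from the batch survive unchanged, so a backfill of one day can never clobber its neighbors.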

Pattern 5: Event-Driven Orchestration

Cron-based pipelines waste resources and introduce arbitrary latency. Event-driven triggers fire when data arrives.

Airflow with S3 Sensor

# dag_config.yaml
dag_id: bronze_to_silver_orders
schedule_interval: null  # Event-driven only
default_args:
  owner: data-platform
  retries: 3
  retry_delay_minutes: 5
  email_on_failure: true

tasks:
  - id: wait_for_cdc_landing
    type: S3KeySensor
    bucket: my-platform-bronze
    prefix: orders/year={{ ds_nodash[:4] }}/
    poke_interval: 60
    timeout: 3600

  - id: run_silver_transform
    type: DatabricksRunNowOperator
    depends_on: [wait_for_cdc_landing]
    job_id: "{{ var.value.silver_orders_job_id }}"

Terraform: EventBridge Rule for S3-Triggered Lambda

resource "aws_cloudwatch_event_rule" "s3_landing_trigger" {
  name        = "bronze-landing-trigger"
  description = "Trigger ETL pipeline on new S3 objects"

  event_pattern = jsonencode({
    source      = ["aws.s3"]
    detail-type = ["Object Created"]
    detail = {
      bucket = {
        name = [aws_s3_bucket.data_lake["bronze"].id]
      }
      object = {
        key = [{ prefix = "orders/" }]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "trigger_step_function" {
  rule      = aws_cloudwatch_event_rule.s3_landing_trigger.name
  target_id = "TriggerEtlStateMachine"
  arn       = aws_sfn_state_machine.etl_pipeline.arn
  role_arn  = aws_iam_role.eventbridge_sfn_role.arn
}

Pattern 6: Dead Letter Queues for Failed Records

Never silently drop records. Failed rows should land in a DLQ for inspection and replay.

# Kafka DLQ config for your consumer
spring:
  kafka:
    consumer:
      group-id: etl-silver-transform
      auto-offset-reset: earliest
    listener:
      ack-mode: manual
  cloud:
    stream:
      bindings:
        input:
          destination: cdc.public.orders
          group: etl-silver-transform
          consumer:
            max-attempts: 3
            back-off-initial-interval: 1000
            back-off-multiplier: 2.0
      rabbit:
        bindings:
          input:
            consumer:
              auto-bind-dlq: true
              dlq-ttl: 604800000  # 7 days
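Whatever the framework, the control flow is the same: bounded retries, then route the record to the DLQ rather than dropping it. A minimal, framework-free sketch (the DLQ is modeled as a plain list; in production it would be a Kafka topic or queue):

```python
def process_with_dlq(record, handler, dlq: list, max_attempts: int = 3):
    """Try the handler up to max_attempts times; on exhaustion, append
    the record plus error context to the DLQ instead of losing it.
    Returns the handler result, or None when the record was dead-lettered."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except Exception as exc:
            if attempt == max_attempts:
                dlq.append({"record": record,
                            "error": str(exc),
                            "attempts": attempt})
                return None
```

Capturing the error message and attempt count alongside the record is what makes later inspection and replay practical.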

Observability for ETL Pipelines

You can't manage what you can't measure. Every ETL pipeline needs:

  • Row counts at each stage (Bronze → Silver → Gold)
  • Lag metrics for streaming jobs
  • Schema drift alerts
  • Data quality scores per pipeline run

Tools like Harbinger Explorer make it easy to correlate pipeline health with downstream data quality—surfacing drift and anomalies before they reach your consumers.

# Prometheus metrics exposed by a typical Spark job
spark_streaming_records_in_total{job="cdc-bronze-landing"} 1402847
spark_streaming_records_out_total{job="cdc-bronze-landing"} 1402831
spark_streaming_processing_lag_seconds{job="cdc-bronze-landing"} 4.2
spark_streaming_batch_duration_seconds{job="cdc-bronze-landing"} 28.1
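Row-count metrics like these are only useful with a reconciliation check behind them. A hypothetical sketch of that check (function name and signature are illustrative):

```python
def reconcile(rows_in: int, rows_out: int, dlq: int = 0,
              tolerance: float = 0.0) -> tuple:
    """Every input row must be accounted for as either an output row or a
    dead-lettered row; anything beyond the tolerance should page someone.
    Returns (ok, unaccounted_row_count)."""
    unaccounted = rows_in - rows_out - dlq
    ok = unaccounted <= rows_in * tolerance
    return ok, unaccounted
```

For the metrics above, 1,402,847 rows in versus 1,402,831 out leaves 16 rows unaccounted for — acceptable only if those 16 landed in a DLQ.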

Choosing the Right Pattern

| Scenario                       | Recommended Pattern                |
|--------------------------------|------------------------------------|
| OLTP source, low latency       | CDC + Streaming Ingestion          |
| File-based sources (S3, SFTP)  | S3 Sensor + Batch                  |
| High-volume events             | Kafka + Spark Structured Streaming |
| Regulatory / audit             | Medallion + immutable Bronze       |
| Legacy source, no CDC          | Full-load with partition replace   |
| Mixed SLAs                     | Lambda Architecture (batch + stream) |

Summary

Cloud-native ETL is an engineering discipline, not a vendor checkbox. The patterns above—Medallion layering, CDC ingestion, schema evolution strategies, idempotent loads, and event-driven orchestration—form the backbone of any serious data platform.

Start with the Medallion Architecture as your foundation, adopt CDC for OLTP sources, and enforce idempotency from day one. Add observability at every layer and treat schema drift as a first-class concern.


Try Harbinger Explorer free for 7 days and get end-to-end visibility into your cloud data pipelines—from ingestion lag to schema drift alerts, all in one platform.

