Cloud-Native ETL Patterns for Modern Data Platforms
Modern data platforms have moved well beyond the nightly batch window. Today's architectures are expected to deliver fresh, reliable, schema-resilient data pipelines that scale automatically, fail gracefully, and incur cost in proportion to usage. Cloud-native ETL is no longer a nice-to-have; it's the baseline.
This article covers the patterns that separate brittle legacy pipelines from production-grade, cloud-native data platforms. We'll walk through architecture decisions, concrete code examples, and the tradeoffs you'll encounter in the real world.
What "Cloud-Native ETL" Actually Means
Cloud-native ETL isn't just "run your SSIS packages on a VM in AWS." It's a set of design principles:
- Stateless compute: Workers don't hold state between runs
- Event-driven triggers: Pipelines react to data arrival, not clocks
- Managed infrastructure: No patching Kafka clusters at 2 AM
- Schema-aware: Pipelines evolve with your data model, not against it
- Idempotent by default: Re-running a pipeline produces the same result
Pattern 1: The Medallion Architecture
The most widely adopted pattern in modern lakehouses. Data flows through Bronze → Silver → Gold layers, each with increasing quality guarantees.
| Layer | Purpose | Format | Update Frequency |
|---|---|---|---|
| Bronze | Raw ingestion, no transformation | Parquet / Delta | Near-real-time |
| Silver | Cleaned, deduplicated, typed | Delta | 15–60 min |
| Gold | Business-ready aggregates | Delta / Iceberg | Hourly / Daily |
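The Silver contract above (cleaned, deduplicated, typed) can be sketched without Spark. This minimal Python illustration keeps only the latest version of each key, coerces types, and routes rows that fail typing to a reject path; field names like `order_id` and `updated_at` are hypothetical, not tied to any particular schema:

```python
from datetime import datetime

def promote_to_silver(bronze_rows):
    """Dedupe and type raw bronze records for the silver layer.

    Keeps the most recent version of each order_id; rows that fail
    typing are returned separately so they can land in a reject path
    rather than being silently dropped.
    """
    latest, rejects = {}, []
    for row in bronze_rows:
        try:
            typed = {
                "order_id": str(row["order_id"]),
                "amount": float(row["amount"]),
                "updated_at": datetime.fromisoformat(row["updated_at"]),
            }
        except (KeyError, ValueError, TypeError):
            rejects.append(row)
            continue
        current = latest.get(typed["order_id"])
        # Keep only the newest version of each record
        if current is None or typed["updated_at"] > current["updated_at"]:
            latest[typed["order_id"]] = typed
    return list(latest.values()), rejects
```

In a real pipeline the same logic runs as a Spark job over Delta tables; the point is that dedup-by-latest-version and a reject path are part of the Silver layer's definition, not an afterthought.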
Terraform: S3 Bucket Structure for Medallion
resource "aws_s3_bucket" "data_lake" {
  for_each = toset(["bronze", "silver", "gold"])
  bucket   = "my-platform-${each.key}-${var.environment}"

  tags = {
    Layer       = each.key
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  for_each = aws_s3_bucket.data_lake
  bucket   = each.value.id

  versioning_configuration {
    status = "Enabled"
  }
}
Pattern 2: Change Data Capture (CDC) Ingestion
Polling source databases for changes hammers them with query load. CDC instead tails the transaction log, capturing inserts, updates, and deletes as a stream of events with negligible impact on the source.
Debezium + Kafka: Core Config
# debezium-postgres-connector.yaml
name: postgres-cdc-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: prod-postgres.internal
  database.port: "5432"
  database.user: debezium_reader
  database.password: "${secrets.pg_password}"
  database.dbname: transactions
  table.include.list: public.orders,public.customers,public.payments
  plugin.name: pgoutput
  publication.autocreate.mode: filtered
  slot.name: debezium_slot
  topic.prefix: cdc              # Debezium 2.x: replaces database.server.name
  decimal.handling.mode: double  # lossy; use "string" where exact decimals matter
  heartbeat.interval.ms: "5000"
Handling CDC Events Downstream
Debezium change events carry an op field: c (create), u (update), d (delete), plus r for snapshot reads. Your landing logic must handle all of them:
# Spark Structured Streaming job config
spark:
  app.name: cdc-bronze-landing
  streaming:
    trigger: processingTime=30 seconds
    checkpointLocation: s3://my-platform-bronze/checkpoints/cdc/
    outputMode: append
kafka:
  bootstrap.servers: kafka-broker-1:9092,kafka-broker-2:9092
  subscribe: cdc.public.orders,cdc.public.customers
  startingOffsets: latest
  failOnDataLoss: "false"
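Before wiring this into Spark, it helps to pin down the per-event semantics. A minimal Python sketch, assuming the standard Debezium envelope (op, before, after) and a hypothetical order_id key:

```python
def apply_cdc_event(table, event):
    """Apply one Debezium-style change event to an in-memory table.

    `op` is "c" (create), "u" (update), "d" (delete), or "r"
    (snapshot read); `before`/`after` carry the row images.
    Keyed on "order_id" purely for illustration.
    """
    op = event["op"]
    if op in ("c", "u", "r"):      # upserts: use the after-image
        row = event["after"]
        table[row["order_id"]] = row
    elif op == "d":                # deletes: only the before-image exists
        table.pop(event["before"]["order_id"], None)
    else:
        raise ValueError(f"unhandled op type: {op}")
    return table
```

In the real streaming job this routing typically lives inside a foreachBatch sink that issues a MERGE against the Bronze Delta table, but the three-way branch is the same.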
Pattern 3: Schema Evolution Without Breaking Pipelines
Schema drift is the silent killer of ETL pipelines. A source system adds a column—your pipeline breaks. The fix is to make your pipelines schema-aware from day one.
Strategy 1: Schema Registry (Apache Avro)
# Register a schema with Confluent Schema Registry
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"currency\",\"type\":[\"null\",\"string\"],\"default\":null}]}"}'

# Check compatibility before deploying schema changes
curl -X POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d @new_schema.json
Strategy 2: Delta Lake Schema Evolution
# Enable automatic schema merging in Delta Lake
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Write with mergeSchema option
df.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://my-platform-silver/orders/")
| Evolution Type | Schema Registry | Delta Auto-Merge | Manual Migration |
|---|---|---|---|
| Add nullable column | ✅ BACKWARD compat | ✅ Auto | ✅ |
| Remove column | ❌ FULL compat only | ❌ Fails | ✅ Manual |
| Change type | ❌ Breaking | ❌ Fails | ✅ Manual |
| Add required column | ❌ Breaking | ❌ Fails | ✅ Manual |
Pattern 4: Idempotent Loads
Re-runnable pipelines are safe pipelines. Every load should produce the same result whether it runs once or ten times.
UPSERT Pattern with MERGE
-- Delta Lake MERGE for idempotent upserts
MERGE INTO silver.orders AS target
USING (
  SELECT * FROM bronze.orders_staging
  WHERE _ingestion_date = current_date()
) AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
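The idempotence claim can be checked directly. A pure-Python analogue of the MERGE above (keyed on a hypothetical order_id, comparing updated_at) reaches the same target state no matter how many times it runs:

```python
def merge_orders(target, source_rows):
    """Pure-Python analogue of the MERGE: update only when the source
    row is strictly newer, insert when the key is absent.

    Applying it repeatedly with the same input leaves the target
    unchanged after the first run, which is the idempotence property.
    """
    for row in source_rows:
        existing = target.get(row["order_id"])
        if existing is None or row["updated_at"] > existing["updated_at"]:
            target[row["order_id"]] = row
    return target
```

The strict `>` on updated_at is what makes re-runs safe: a second pass sees equal timestamps and leaves every row alone.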
Partition-Replace Pattern
For large historical loads, replace entire partitions atomically:
# Spark: overwrite specific partitions only
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .save("s3://my-platform-silver/events/")
Pattern 5: Event-Driven Orchestration
Cron-based pipelines waste resources and introduce arbitrary latency. Event-driven triggers fire when data arrives.
Airflow with S3 Sensor
# dag_config.yaml
dag_id: bronze_to_silver_orders
schedule_interval: null  # Event-driven only
default_args:
  owner: data-platform
  retries: 3
  retry_delay_minutes: 5
  email_on_failure: true
tasks:
  - id: wait_for_cdc_landing
    type: S3KeySensor
    bucket: my-platform-bronze
    prefix: orders/year={{ ds_nodash[:4] }}/
    poke_interval: 60
    timeout: 3600
  - id: run_silver_transform
    type: DatabricksRunNowOperator
    depends_on: [wait_for_cdc_landing]
    job_id: "{{ var.value.silver_orders_job_id }}"
Terraform: EventBridge Rule for S3-Triggered Lambda
resource "aws_cloudwatch_event_rule" "s3_landing_trigger" {
  name        = "bronze-landing-trigger"
  description = "Trigger ETL pipeline on new S3 objects"

  # Requires EventBridge notifications to be enabled on the bucket
  event_pattern = jsonencode({
    source        = ["aws.s3"]
    "detail-type" = ["Object Created"]
    detail = {
      bucket = {
        name = [aws_s3_bucket.data_lake["bronze"].id]
      }
      object = {
        key = [{ prefix = "orders/" }]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "trigger_step_function" {
  rule      = aws_cloudwatch_event_rule.s3_landing_trigger.name
  target_id = "TriggerEtlStateMachine"
  arn       = aws_sfn_state_machine.etl_pipeline.arn
  role_arn  = aws_iam_role.eventbridge_sfn_role.arn
}
Pattern 6: Dead Letter Queues for Failed Records
Never silently drop records. Failed rows should land in a DLQ for inspection and replay.
# Kafka DLQ config for your consumer (Spring Cloud Stream, Kafka binder)
spring:
  kafka:
    consumer:
      group-id: etl-silver-transform
      auto-offset-reset: earliest
    listener:
      ack-mode: manual
  cloud:
    stream:
      bindings:
        input:
          destination: cdc.public.orders
          group: etl-silver-transform
          consumer:
            max-attempts: 3
            back-off-initial-interval: 1000
            back-off-multiplier: 2.0
      kafka:
        bindings:
          input:
            consumer:
              enable-dlq: true
              dlq-name: cdc.public.orders.dlq  # retention governed by topic config
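The retry-then-DLQ contract is independent of the messaging binder. A minimal Python sketch (the process callback and record shape are hypothetical) shows the essential behaviour: bounded retries, after which the record and its error land in the dead-letter path instead of vanishing:

```python
def consume_with_dlq(records, process, max_attempts=3):
    """Process records with bounded retries.

    A record that still fails after max_attempts is appended to the
    dead-letter list along with its error, so it can be inspected and
    replayed later. (Backoff between attempts is omitted here.)
    """
    dlq = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                process(record)
                break  # success: move to the next record
            except Exception as exc:
                if attempt == max_attempts:
                    dlq.append({
                        "record": record,
                        "error": repr(exc),
                        "attempts": attempt,
                    })
    return dlq
```

A real consumer would publish the dead-lettered payload to a DLQ topic with the error in headers, but the invariant is the same: no record is ever dropped without a trace.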
Observability for ETL Pipelines
You can't manage what you can't measure. Every ETL pipeline needs:
- Row counts at each stage (Bronze → Silver → Gold)
- Lag metrics for streaming jobs
- Schema drift alerts
- Data quality scores per pipeline run
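The first bullet, row-count reconciliation, can be sketched in a few lines. This assumes a hypothetical per-run counts mapping in flow order; the threshold would be tuned per pipeline, since deduplicating stages legitimately drop rows:

```python
def check_row_counts(counts, max_loss_ratio=0.001):
    """Compare row counts across consecutive layers of a pipeline run.

    `counts` maps layer name to row count in flow order, e.g.
    {"bronze": 1000, "silver": 998, "gold": 998}. Returns an alert
    string for every stage that loses more than max_loss_ratio of
    its input rows.
    """
    alerts = []
    layers = list(counts.items())
    for (src, n_in), (dst, n_out) in zip(layers, layers[1:]):
        if n_in == 0:
            continue  # nothing flowed in; no ratio to compute
        loss = (n_in - n_out) / n_in
        if loss > max_loss_ratio:
            alerts.append(f"{src}->{dst}: lost {loss:.2%} of rows")
    return alerts
```

Emitting these ratios as metrics per run, rather than alerting ad hoc, is what lets you spot slow drift as well as sudden breakage.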
Tools like Harbinger Explorer make it easy to correlate pipeline health with downstream data quality—surfacing drift and anomalies before they reach your consumers.
# Prometheus metrics exposed by a typical Spark job
spark_streaming_records_in_total{job="cdc-bronze-landing"} 1402847
spark_streaming_records_out_total{job="cdc-bronze-landing"} 1402831
spark_streaming_processing_lag_seconds{job="cdc-bronze-landing"} 4.2
spark_streaming_batch_duration_seconds{job="cdc-bronze-landing"} 28.1
Choosing the Right Pattern
| Scenario | Recommended Pattern |
|---|---|
| OLTP source, low latency | CDC + Streaming Ingestion |
| File-based sources (S3, SFTP) | S3 Sensor + Batch |
| High-volume events | Kafka + Spark Structured Streaming |
| Regulatory / audit | Medallion + immutable Bronze |
| Legacy source, no CDC | Full-load with partition replace |
| Mixed SLAs | Lambda Architecture (batch + stream) |
Summary
Cloud-native ETL is an engineering discipline, not a vendor checkbox. The patterns above—Medallion layering, CDC ingestion, schema evolution strategies, idempotent loads, and event-driven orchestration—form the backbone of any serious data platform.
Start with the Medallion Architecture as your foundation, adopt CDC for OLTP sources, and enforce idempotency from day one. Add observability at every layer and treat schema drift as a first-class concern.
Try Harbinger Explorer free for 7 days and get end-to-end visibility into your cloud data pipelines—from ingestion lag to schema drift alerts, all in one platform.