Disaster Recovery for Data Platforms: RPO, RTO, and Runbooks That Actually Work
Most data platform DR plans exist as a PDF in a shared drive that was last opened during an audit. When a region goes down and an on-call engineer needs to execute a recovery, they discover the runbook references systems that were renamed eight months ago and assumes access to a bastion host that was decommissioned in a cost-cutting initiative.
This guide is about building DR for data platforms that actually works when you need it.
Start with Failure Mode Analysis
Before designing any recovery mechanism, systematically enumerate failure modes. Data platforms fail in ways that differ from transactional systems:
| Component | Failure Mode | Impact | Detection Signal |
|---|---|---|---|
| Object storage (S3/GCS) | Regional outage | Complete data lake unavailability | CloudWatch/Cloud Monitoring alerts |
| Data warehouse | Compute failure | Query unavailability (data intact) | Warehouse health endpoint |
| Streaming brokers (Kafka) | Broker loss | Consumer lag, message loss risk | Lag monitoring, broker count |
| Orchestrator (Airflow) | Metadata DB failure | No new pipeline runs | Heartbeat monitoring |
| ETL compute (Spark/Databricks) | Cluster provisioning failure | Pipeline backlog | Job queue depth |
| Schema registry | Unavailability | Producer/consumer serialization failure | Registry health check |
| Data catalog / lineage | Outage | Loss of discovery (not data loss) | Catalog health endpoint |
Not all failure modes require DR. Some (warehouse compute failure, orchestrator outage) are availability problems, not data recovery problems. Separate these — they have different playbooks.
Defining RPO and RTO by Data Tier
Don't set a single RPO/RTO for your entire platform. Define tiers based on business criticality:
A common three-tier scheme (targets are illustrative — set yours with business stakeholders):

| Tier | Example Data | RPO | RTO |
|---|---|---|---|
| Tier 1 | Revenue, orders, compliance-critical | ≤ 15 min | ≤ 1 hour |
| Tier 2 | Core analytics, customer behavior | ≤ 4 hours | ≤ 8 hours |
| Tier 3 | Derived datasets reproducible from Tier 1/2 | ≤ 24 hours | Best effort |
Assign every dataset in your catalog to a tier. This assignment drives replication frequency, backup retention, and recovery priority order. Harbinger Explorer can help maintain this classification at scale by tracking dataset criticality alongside operational metadata.
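One lightweight way to make the tier assignment explicit is a version-controlled manifest alongside the catalog. A minimal sketch — the dataset names, file name, and field layout are illustrative, not a prescribed schema:

```yaml
# dataset-tiers.yaml — criticality classification for DR planning
datasets:
  - name: analytics.orders
    tier: tier1          # revenue-critical: replicate continuously
    rpo_minutes: 15
    rto_minutes: 60
  - name: analytics.clickstream_sessions
    tier: tier2
    rpo_minutes: 240
    rto_minutes: 480
  - name: analytics.daily_aggregates
    tier: tier3          # reproducible from tier1/tier2 inputs
    rpo_minutes: 1440
    rto_minutes: null    # best effort; rebuild via backfill
```

A manifest like this can drive both replication configuration (which prefixes get cross-region rules) and recovery priority order during an incident.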
Object Storage Replication
Cross-Region Replication
For AWS S3:
# Terraform: S3 cross-region replication for data lake (AWS provider v4+)
resource "aws_s3_bucket" "data_lake_primary" {
  bucket = "company-data-lake-us-east-1"
}

resource "aws_s3_bucket_versioning" "data_lake_primary" {
  bucket = aws_s3_bucket.data_lake_primary.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = "company-data-lake-us-west-2"
}

resource "aws_s3_bucket_versioning" "data_lake_replica" {
  provider = aws.us_west_2
  bucket   = aws_s3_bucket.data_lake_replica.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_iam_role" "replication_role" {
  name = "s3-replication-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
    }]
  })
  # Note: the role also needs an attached policy granting
  # s3:GetObjectVersion*, s3:ListBucket, s3:ReplicateObject,
  # s3:ReplicateDelete, and s3:ReplicateTags on both buckets.
}

resource "aws_s3_bucket_replication_configuration" "data_lake" {
  role   = aws_iam_role.replication_role.arn
  bucket = aws_s3_bucket.data_lake_primary.id

  # Versioning must be enabled before replication can be configured.
  depends_on = [aws_s3_bucket_versioning.data_lake_primary]

  rule {
    id       = "replicate-tier1-data"
    status   = "Enabled"
    priority = 1

    filter {
      prefix = "tier1/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD"

      replication_time {
        status = "Enabled"
        time {
          minutes = 15
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }

    delete_marker_replication {
      status = "Enabled"
    }
  }

  rule {
    id       = "replicate-tier2-data"
    status   = "Enabled"
    priority = 2

    filter {
      prefix = "tier2/"
    }

    destination {
      bucket        = aws_s3_bucket.data_lake_replica.arn
      storage_class = "STANDARD_IA"
    }

    delete_marker_replication {
      status = "Disabled"
    }
  }
}
Point-in-Time Recovery with S3 Versioning
Enable versioning on all Tier 1 and Tier 2 buckets, and add lifecycle rules to cap noncurrent-version retention. With versioning in place, a prefix can be restored to a point in time:
#!/bin/bash
# Restore a specific S3 prefix to a point in time.
# For each key, copies the newest version created at or before TARGET_TIME.
# (Delete markers are not considered; deleted keys are restored too.)
set -euo pipefail

BUCKET="company-data-lake-us-east-1"
PREFIX="tier1/orders/2024/"
TARGET_TIME="2024-03-15T10:00:00Z"
RESTORE_BUCKET="company-data-lake-restore-us-east-1"

aws s3api list-object-versions \
  --bucket "$BUCKET" \
  --prefix "$PREFIX" \
  --query "Versions[?LastModified<='${TARGET_TIME}'].[Key,VersionId,LastModified]" \
  --output text |
  sort -k1,1 -k3,3r |               # newest qualifying version first per key
  awk '!seen[$1]++ { print $1, $2 }' |  # keep one version per key
  while read -r key version_id; do
    aws s3api copy-object \
      --copy-source "${BUCKET}/${key}?versionId=${version_id}" \
      --bucket "$RESTORE_BUCKET" \
      --key "$key"
    echo "Restored: $key (version: $version_id)"
  done
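The version-retention lifecycle rules mentioned above can be sketched in Terraform. The 35-day window is illustrative — it must cover the furthest-back restore you need to support:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "version_retention" {
  bucket = aws_s3_bucket.data_lake_primary.id

  rule {
    id     = "expire-noncurrent-versions"
    status = "Enabled"

    filter {
      prefix = "tier1/"
    }

    # Old versions are what point-in-time restore copies from, so keep
    # them at least as long as your point-in-time recovery window.
    noncurrent_version_expiration {
      noncurrent_days = 35
    }
  }
}
```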
Kafka Disaster Recovery
Kafka is the highest-risk component in most data platforms for DR purposes. Message loss during a broker outage is permanent without proper replication.
MirrorMaker 2 for Cross-Region Replication
# MirrorMaker 2 configuration for active-passive DR
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: mm2-dr-replication
  namespace: kafka
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: dr-region
  clusters:
    - alias: primary
      bootstrapServers: kafka.primary-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: primary-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-primary-user
          certificate: user.crt
          key: user.key
    - alias: dr-region
      bootstrapServers: kafka.dr-region.internal:9093
      tls:
        trustedCertificates:
          - secretName: dr-cluster-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mm2-dr-user
          certificate: user.crt
          key: user.key
  mirrors:
    - sourceCluster: primary
      targetCluster: dr-region
      sourceConnector:
        tasksMax: 10
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
          sync.topic.acls.enabled: "false"
          replication.policy.class: org.apache.kafka.connect.mirror.IdentityReplicationPolicy
      heartbeatConnector:
        config:
          heartbeats.topic.replication.factor: 3
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
          sync.group.offsets.enabled: "true"
          sync.group.offsets.interval.seconds: "60"
      topicsPattern: "tier1.*|tier2.*"
      groupsPattern: ".*"
The IdentityReplicationPolicy preserves topic names without prefixing — critical for consumer group offset synchronization during failover.
Kafka Failover Runbook
#!/bin/bash
# kafka-failover-runbook.sh
# Prerequisites: kubectl access to DR cluster, MM2 sync lag < retention period
set -euo pipefail

echo "=== KAFKA FAILOVER RUNBOOK ==="
echo "Timestamp: $(date -u)"
echo ""

# Step 1: Verify MirrorMaker 2 offset sync is current
echo "Step 1: Checking consumer group offset sync lag..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --group data-platform-consumer \
    --describe | grep -E "TOPIC|LAG"
read -rp "Is lag acceptable? (y/n): " lag_ok
[[ $lag_ok == "y" ]] || { echo "ABORT: Lag too high for safe failover"; exit 1; }

# Step 2: Stop producers on primary (if accessible)
echo "Step 2: Setting producer circuit breaker flag..."
curl -X PATCH https://config.internal/v1/features/kafka_producer_enabled \
  -d '{"region":"primary","value":false}' \
  || echo "WARNING: Could not reach config service"

# Step 3: Wait for in-flight messages to replicate
echo "Step 3: Waiting 60s for final messages to replicate..."
sleep 60

# Step 4: Reset consumer group offsets on the DR cluster
echo "Step 4: Translating consumer group offsets..."
kubectl exec -n kafka deploy/kafka-toolbox -- \
  kafka-consumer-groups.sh \
    --bootstrap-server kafka.dr-region.internal:9092 \
    --reset-offsets \
    --group data-platform-consumer \
    --from-file /checkpoints/primary.data-platform-consumer.offsets \
    --execute

# Step 5: Point consumers at the DR cluster
echo "Step 5: Updating Kafka bootstrap server in Vault..."
vault kv put secret/kafka/bootstrap \
  servers="kafka.dr-region.internal:9093" \
  region="dr" \
  failover_time="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

echo "=== FAILOVER COMPLETE ==="
echo "Monitor consumer lag at: https://monitoring.internal/d/kafka-lag"
Data Warehouse Recovery
For cloud data warehouses (BigQuery, Snowflake, Redshift), compute-layer recovery is typically automatic. The data recovery concerns are:
- Accidental deletion / DROP TABLE — handled by Time Travel
- Corruption via bad ETL — handled by snapshots + Time Travel
- Regional outage — handled by cross-region backup exports
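For the first two cases, a Time Travel restore in BigQuery uses a table decorator with a millisecond timestamp. A minimal sketch — the project, dataset, and `_restored` naming convention are illustrative, and the helper only constructs the command so it can be reviewed before running:

```shell
#!/bin/bash
# Restore a BigQuery table from its time-travel window by copying the
# snapshot at a past timestamp (ms since epoch) into a new table.
set -euo pipefail

# build_restore_cmd <project> <dataset> <table> <epoch_millis>
# Prints the bq command that copies the table as of the given timestamp.
build_restore_cmd() {
  local project=$1 dataset=$2 table=$3 ts_ms=$4
  echo "bq cp ${project}:${dataset}.${table}@${ts_ms} ${project}:${dataset}.${table}_restored"
}

# Example: print the restore command for the state one hour ago.
# build_restore_cmd company-prod analytics orders "$(( ($(date +%s) - 3600) * 1000 ))"
```

Restoring into a side table rather than overwriting in place lets you validate row counts before swapping it in.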
Automated Snapshot Export (BigQuery)
#!/bin/bash
# Daily export of critical tables to DR bucket
set -euo pipefail

TABLES=("orders" "customers" "products" "transactions")
PROJECT="company-prod"
DATASET="analytics"
DR_BUCKET="gs://company-bq-dr-us-west1"
DATE=$(date +%Y/%m/%d)

for TABLE in "${TABLES[@]}"; do
  # --field_delimiter applies only to CSV, so it is omitted for Parquet
  bq extract \
    --destination_format PARQUET \
    --compression SNAPPY \
    "${PROJECT}:${DATASET}.${TABLE}" \
    "${DR_BUCKET}/${TABLE}/${DATE}/${TABLE}_*.parquet"
  echo "Exported ${TABLE} to ${DR_BUCKET}/${TABLE}/${DATE}/"
done
Testing Your DR Plan
An untested DR plan is not a DR plan. Run quarterly DR drills with real escalation paths:
| Test Type | Frequency | Scope | Success Criterion |
|---|---|---|---|
| Tabletop exercise | Monthly | Team reads through runbook | All steps understood, owner per step |
| Component restore test | Quarterly | Restore one non-critical dataset | RTO met, data verified |
| Regional failover drill | Semi-annual | Full DR region activation | RTO met, consumers switched |
| Chaos injection | Quarterly | Inject failure in staging | System self-heals or alerts within SLA |
Document every drill result. A DR plan that consistently meets its RTO in drills has earned stakeholder trust.
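Drill results are easiest to audit when stored as structured records in version control rather than meeting notes. A minimal sketch — the file path and field names are illustrative:

```yaml
# dr-drills/2024-03-regional-failover.yaml
drill:
  type: regional-failover
  date: 2024-03-15
  scope: "Kafka + object storage, tier1 prefixes"
  rto_target_minutes: 60
  rto_actual_minutes: 48
  rpo_target_minutes: 15
  rpo_actual_minutes: 9
  issues:
    - "Vault bootstrap update required manual approval; add break-glass policy"
  owner: data-platform-oncall
```

Over time these records show whether your actual RTO/RPO is trending toward or away from the targets.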
Building a DR Dashboard
Operational visibility into DR readiness should be continuous, not just during drills. Monitor:
- Replication lag (S3, Kafka, DB)
- Last successful backup timestamp per dataset
- Time Travel window remaining
- Cross-region connectivity health
Platforms like Harbinger Explorer surface these signals as a unified DR health score, giving platform teams early warning when replication falls behind before it becomes a DR event.
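The backup-staleness check in particular is easy to automate. A minimal sketch, assuming GNU `date` and illustrative tier thresholds — the function name and tier labels are not from any particular tool:

```shell
#!/bin/bash
# Flag a dataset whose last successful backup is older than its tier's RPO.
set -euo pipefail

# RPO per tier, in seconds (example values matching a 15min/4h/24h scheme).
declare -A RPO_SECONDS=( [tier1]=900 [tier2]=14400 [tier3]=86400 )

# check_rpo <dataset> <tier> <last_backup_iso8601>
# Prints "OK ..." or "BREACH ..." based on backup age vs. the tier's RPO.
check_rpo() {
  local dataset=$1 tier=$2 last_backup=$3
  local now age
  now=$(date -u +%s)
  age=$(( now - $(date -u -d "$last_backup" +%s) ))
  if (( age > ${RPO_SECONDS[$tier]} )); then
    echo "BREACH $dataset (tier=$tier, age=${age}s, rpo=${RPO_SECONDS[$tier]}s)"
  else
    echo "OK $dataset"
  fi
}

# Example, fed from whatever stores your backup timestamps:
# check_rpo analytics.orders tier1 "2024-03-15T09:55:00Z"
```

Run on a schedule and wired to alerting, a check like this turns a silent replication stall into a page before it becomes a DR event.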
Summary
Data platform disaster recovery is a discipline, not a one-time design. The platforms that recover well are the ones that:
- Classified their data before designing recovery mechanisms
- Implemented replication at every layer (storage, streaming, warehouse)
- Wrote runbooks with actual commands, not prose
- Tested regularly and documented results
Treat your DR plan as living infrastructure — version-controlled, executable, and continuously validated.
Try Harbinger Explorer free for 7 days — monitor your data platform's DR readiness in real time, track replication health, and get alerted before RPO windows are breached.