Data Strategy for Cloud Migrations: A Platform Engineer's Playbook
Cloud migration projects fail more often at the data layer than anywhere else. Networking, compute, and IAM get thorough attention — but data is often treated as an afterthought, moved in bulk the night before cutover, and prayed over. This guide exists to change that pattern.
Whether you're lifting a 50TB data warehouse from on-prem Oracle to BigQuery, re-platforming a Kafka estate from bare metal to Amazon MSK, or migrating a fleet of Spark ETL jobs to Databricks on Azure, the underlying data strategy questions remain the same: When do you move what? How do you validate it? And what's your rollback plan?
The Four Phases of a Data Migration
Before writing a single line of Terraform, map your migration to four discrete phases. Skipping phases is how projects end up with phantom data loss at 2 AM.
Phase 1 — Inventory & Classification
Every byte of data your systems produce falls into one of four categories:
| Classification | Description | Migration Risk | Example |
|---|---|---|---|
| Hot | Actively read/written, latency-sensitive | High | OLTP tables, event streams |
| Warm | Read frequently, written in batch | Medium | Aggregated reports, feature stores |
| Cold | Archived, rarely read | Low | Compliance archives, raw event logs |
| Transient | Cache, temp tables, in-flight state | N/A (rebuild) | Redis caches, Kafka consumer offsets |
Classify before you move anything. Hot data needs a live replication strategy. Cold data can be bulk-copied off-hours. Transient data is rebuilt on the target.
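As a sketch, the classification rules above can be encoded as a small heuristic. The thresholds and the `rebuildable` flag here are illustrative assumptions, not fixed rules; in practice the inputs come from query logs or a catalog API:

```python
def classify(last_read_days: int, last_write_days: int, rebuildable: bool) -> str:
    """Rough heuristic mapping access recency to the four classes above.
    Thresholds are illustrative; tune them per estate."""
    if rebuildable:
        return "transient"   # caches, offsets: rebuild on the target
    if last_read_days <= 1 and last_write_days <= 1:
        return "hot"         # needs a live replication strategy
    if last_read_days <= 7:
        return "warm"        # batch copy plus delta sync
    return "cold"            # bulk copy off-hours
```

Feeding this from the inventory turns classification into a repeatable script instead of a one-off spreadsheet exercise.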
Use a combination of query logs, column-level lineage tools, and manual interviews with data consumers to produce this inventory. Harbinger Explorer can accelerate this by scanning metadata across multi-cloud estates and surfacing dependency graphs automatically.
Phase 2 — Dual-Write & Shadow Mode
For hot data, never hard-cutover. Instead, enter a dual-write phase where writes land on both source and target systems simultaneously, and reads continue from the source.
```yaml
# Example: Debezium CDC connector for dual-write shadow replication
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: orders-cdc-shadow
  labels:
    strimzi.io/cluster: migration-connect
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 4
  config:
    database.hostname: source-postgres.internal
    database.port: "5432"
    database.user: debezium_reader
    database.password: ${env:DEBEZIUM_PASSWORD}
    database.dbname: orders
    table.include.list: public.orders,public.order_items,public.customers
    slot.name: debezium_shadow_slot
    publication.autocreate.mode: filtered
    # Write to shadow topics for target ingestion
    topic.prefix: shadow.migration
    transforms: Reroute
    transforms.Reroute.type: io.debezium.transforms.ByLogicalTableRouter
    transforms.Reroute.topic.regex: 'shadow\.migration\.public\.(.*)'
    transforms.Reroute.topic.replacement: 'target.ingest.$1'
```
During shadow mode, run reconciliation jobs on a schedule (hourly at minimum) that compare row counts, checksums, and sampled records between source and target.
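A minimal order-insensitive checksum for such a reconciliation job might look like this in plain Python. This is a sketch; production jobs would typically push the hashing down into the source and target engines as SQL:

```python
import hashlib

def table_checksum(rows) -> str:
    """Order-insensitive table checksum: hash each row, XOR the digests.
    `rows` is an iterable of (pk, payload) tuples fetched from either side."""
    acc = 0
    for pk, payload in rows:
        digest = hashlib.sha256(f"{pk}|{payload}".encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return f"{acc:016x}"

def reconcile(source_rows, target_rows) -> bool:
    """True when both sides hash to the same value."""
    return table_checksum(source_rows) == table_checksum(target_rows)
```

Because XOR is commutative, the checksum is stable regardless of fetch order, so source and target can be scanned with different parallelism.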
Phase 3 — Cutover & Validation
Cutover is not a moment — it's a window. Define it explicitly in your runbook:
```bash
#!/bin/bash
# migration-cutover.sh — execute in a tmux session with logging
set -euo pipefail

LOG_FILE="/var/log/migration/cutover-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG_FILE")"
echo "=== CUTOVER START: $(date -u) ===" | tee -a "$LOG_FILE"

# 1. Drain write traffic to source
echo "Step 1: Enabling write-drain flag in feature flag service..." | tee -a "$LOG_FILE"
curl -fsS -X PATCH https://flags.internal/v1/flags/db_write_drain \
  -H "Content-Type: application/json" -d '{"enabled": true}' | tee -a "$LOG_FILE"

# 2. Wait for in-flight transactions to settle
echo "Step 2: Waiting 30s for in-flight writes..." | tee -a "$LOG_FILE"
sleep 30

# 3. Final reconciliation check — set -e aborts the script on divergence
echo "Step 3: Running final reconciliation..." | tee -a "$LOG_FILE"
python3 /opt/migration/reconcile.py --source postgres://source-db \
  --target bigquery://project/dataset --fail-on-diff

# 4. Switch DNS / connection strings
echo "Step 4: Updating connection string secret in Vault..." | tee -a "$LOG_FILE"
vault kv put secret/db/orders \
  connection_string="postgresql://target-db.internal:5432/orders" \
  migrated_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 5. Enable reads from target
echo "Step 5: Flipping read flag..." | tee -a "$LOG_FILE"
curl -fsS -X PATCH https://flags.internal/v1/flags/db_read_source \
  -H "Content-Type: application/json" -d '{"enabled": false}' | tee -a "$LOG_FILE"

echo "=== CUTOVER COMPLETE: $(date -u) ===" | tee -a "$LOG_FILE"
```
Phase 4 — Decommission & Observability
Don't decommission source systems until you have 30 days of clean production data flowing through the target. Set up cross-system observability:
```hcl
# Terraform: CloudWatch metric alarm for post-migration data quality
resource "aws_cloudwatch_metric_alarm" "data_freshness" {
  alarm_name          = "migration-data-freshness-breach"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "MaxAgeMinutes"
  namespace           = "DataPlatform/Migration"
  period              = 300
  statistic           = "Maximum"
  threshold           = 15
  alarm_description   = "Data freshness degraded post-migration — possible pipeline stall"
  alarm_actions       = [aws_sns_topic.oncall.arn]

  dimensions = {
    Dataset = "orders"
    Stage   = "production"
  }
}
```
Schema Evolution Strategy
Schema changes during migration are a complexity multiplier. Every change becomes three problems at once: the source schema, the migration mapping, and the target schema.
Use a Schema Registry
Whether you're on Avro, Protobuf, or JSON Schema, run a schema registry on both sides of the migration and enforce backward compatibility:
```properties
# confluent schema-registry config snippet (schema-registry.properties)
schema.compatibility.level=BACKWARD_TRANSITIVE
```
BACKWARD_TRANSITIVE means consumers using the newest schema version can read data written with every earlier version, which is critical when source and target consumers coexist during shadow mode.
Column Mapping Patterns
| Source Pattern | Target Pattern | Migration Tool |
|---|---|---|
| Camel case columns | Snake case | dbt rename macro |
| Implicit nullability | Explicit NOT NULL | Schema migration script |
| NUMERIC(18,4) | DECIMAL(18,4) | Type casting in Spark |
| Timestamp with TZ | UTC-normalized TIMESTAMP | Spark to_utc_timestamp |
| Composite PK | Surrogate key + composite index | dbt snapshot |
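The first mapping in the table doesn't require dbt; a small sketch of a camelCase-to-snake_case rename for Spark DataFrames (the helper names are illustrative):

```python
import re

def to_snake_case(name: str) -> str:
    """camelCase / PascalCase -> snake_case,
    e.g. orderTotalAmount -> order_total_amount."""
    s = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s).lower()

def rename_columns(df):
    """Apply the mapping to a Spark DataFrame via toDF."""
    return df.toDF(*[to_snake_case(c) for c in df.columns])
```

Keeping the rename in one shared helper means the migration mapping and the target schema can never drift apart on naming.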
Data Validation Framework
The gold standard is a three-tier validation approach:
- Structural — Schema matches, no missing columns, types compatible
- Statistical — Row counts, null rates, value distributions within tolerance
- Semantic — Business rules hold (e.g., order total = sum of line items)
```python
# Lightweight reconciliation using PySpark (a Great Expectations alternative)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, sum as spark_sum

spark = SparkSession.builder.appName("MigrationReconcile").getOrCreate()

source_df = spark.read.format("jdbc").options(
    url="jdbc:postgresql://source-db:5432/orders",
    dbtable="public.orders",
    user="reader",
    # dbutils is Databricks-specific; swap in your own secret store elsewhere
    password=dbutils.secrets.get("migration", "source-db-password"),
).load()

target_df = spark.read.format("bigquery").option("table", "project.dataset.orders").load()

# Structural check: identical column sets
assert set(source_df.columns) == set(target_df.columns), "Column mismatch!"

# Statistical check: row counts and a key aggregate within tolerance
source_stats = source_df.agg(
    count("*").alias("row_count"),
    spark_sum("total_amount").alias("total_amount_sum"),
).collect()[0]
target_stats = target_df.agg(
    count("*").alias("row_count"),
    spark_sum("total_amount").alias("total_amount_sum"),
).collect()[0]

tolerance = 0.001  # 0.1% tolerance
row_diff_pct = abs(source_stats["row_count"] - target_stats["row_count"]) / source_stats["row_count"]
sum_diff_pct = abs(source_stats["total_amount_sum"] - target_stats["total_amount_sum"]) / source_stats["total_amount_sum"]

assert row_diff_pct < tolerance, f"Row count divergence: {row_diff_pct:.4%}"
assert sum_diff_pct < tolerance, f"Sum divergence: {sum_diff_pct:.4%}"
print("✅ Reconciliation passed")
```
Rollback Planning
Every migration phase needs a rollback procedure documented before cutover begins. A rollback that hasn't been rehearsed in a staging environment is not a rollback plan — it's a wish.
| Phase | Rollback Trigger | Rollback Action | RTO |
|---|---|---|---|
| Shadow mode | Replication lag > 5min | Disable CDC, fix connector | 10min |
| Cutover | Error rate > 1% | Revert feature flags | 2min |
| Post-cutover | Data quality breach | Re-enable source, re-open shadow | 15min |
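The trigger column lends itself to automation. A sketch of encoding those thresholds so a script, rather than a stressed human, makes the rollback call (phase names and thresholds mirror the table above; wiring to real metrics is left out):

```python
def should_roll_back(phase: str, *, lag_seconds: float = 0.0,
                     error_rate: float = 0.0, dq_breach: bool = False) -> bool:
    """Encode the rollback triggers from the runbook table.
    Inputs would come from monitoring; here they are passed in directly."""
    if phase == "shadow" and lag_seconds > 300:      # replication lag > 5 min
        return True
    if phase == "cutover" and error_rate > 0.01:     # error rate > 1%
        return True
    if phase == "post-cutover" and dq_breach:        # data quality breach
        return True
    return False
```

Evaluating this on every monitoring tick removes the "is it bad enough yet?" debate from the incident call.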
Observability Stack for Migration Projects
Post-migration, your observability should answer: "Is the new platform delivering data with the same quality and freshness as the old one?"
Instrument three signal types:
- Pipeline latency — p50/p95/p99 end-to-end job duration
- Data freshness — max age of the latest record in critical tables
- Error rate — failed job runs as a percentage of total runs
If you're managing multiple migrated workloads across teams, a platform-level view becomes essential. Tools like Harbinger Explorer give you a unified operational view across cloud data assets without requiring per-team instrumentation overhead.
Conclusion
A cloud migration data strategy isn't a one-time document — it's a living operational practice spanning months of careful, phased execution. The teams that succeed treat data migration as a product delivery: they define acceptance criteria, run automated validation, and plan for failure.
The key takeaways:
- Classify data before moving any of it
- Use dual-write shadow mode for hot data; never hard-cutover
- Automate reconciliation — manual spot checks don't scale
- Define rollback procedures and rehearse them
- Stay in observability mode for 30 days post-cutover before decommissioning
Try Harbinger Explorer free for 7 days — get unified visibility across your cloud data estate, track migration progress across teams, and catch data quality issues before they reach production.