Hybrid Cloud Data Architecture Patterns
Pure cloud-native is the goal. Reality is messier. Most enterprises operate a hybrid landscape: mainframes that will outlive everyone in the room, on-premise data centers with regulatory constraints, and cloud workloads growing at pace. Hybrid cloud data architecture is the discipline of making these work together—reliably, securely, and without drowning in operational complexity.
This guide covers the patterns that actually work in production hybrid environments.
Why Hybrid? (And Why It's Harder Than It Looks)
| Driver | Keeps Data On-Premise | Pushes Data to Cloud |
|---|---|---|
| Latency | Real-time OT/IoT | Batch analytics |
| Regulation | GDPR, data residency | Dev/test workloads |
| Cost | Existing capex | Burst compute |
| Integration | Legacy system APIs | Modern SaaS |
| Security | Classified data | Non-sensitive workloads |
The Data Gravity Problem
Data gravity: data attracts applications and services around itself. A 10TB on-premise Oracle database has decades of stored procedures, ETL jobs, and BI tools attached. You can't "just move it to the cloud." Hybrid architecture respects data gravity while gradually shifting value-add workloads cloud-ward.
Pattern 1: Cloud Bursting for Analytics
Keep your data sources on-premise; run heavy analytics in the cloud using ephemeral compute. Data flows one-way: on-prem → cloud for processing, results flow back.
Terraform: AWS Site-to-Site VPN for Secure Transfer
resource "aws_customer_gateway" "on_prem" {
  bgp_asn    = 65000
  ip_address = var.on_prem_public_ip
  type       = "ipsec.1"

  tags = {
    Name = "on-prem-datacenter"
  }
}

resource "aws_vpn_gateway" "cloud" {
  vpc_id = aws_vpc.data_platform.id

  tags = {
    Name = "data-platform-vpn-gw"
  }
}

resource "aws_vpn_connection" "on_prem_to_cloud" {
  vpn_gateway_id      = aws_vpn_gateway.cloud.id
  customer_gateway_id = aws_customer_gateway.on_prem.id
  type                = "ipsec.1"
  static_routes_only  = false

  tunnel1_ike_versions                 = ["ikev2"]
  tunnel1_phase1_encryption_algorithms = ["AES256"]
  tunnel1_phase1_integrity_algorithms  = ["SHA2-256"]
  tunnel1_phase2_encryption_algorithms = ["AES256-GCM-16"]
}
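The VPN connection above only establishes the IPsec tunnels; because `static_routes_only = false` enables dynamic (BGP) routing, the routes learned from the on-premise side still need to be propagated into the VPC route tables. A minimal sketch of that last piece, assuming a route table named `aws_route_table.private` exists elsewhere in your configuration:

```
resource "aws_vpn_gateway_route_propagation" "private" {
  vpn_gateway_id = aws_vpn_gateway.cloud.id
  route_table_id = aws_route_table.private.id
}
```

Without this, traffic from private subnets toward on-premise CIDRs has no return path, which tends to surface as "VPN is up but nothing connects."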
AWS DataSync for Bulk Transfer
# Create DataSync location for on-premise NFS
aws datasync create-location-nfs \
  --server-hostname on-prem-nas.internal \
  --subdirectory /data/exports \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789:agent/agent-0abc

# Create S3 destination
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::my-platform-bronze \
  --subdirectory /on-prem-exports \
  --s3-config BucketAccessRoleArn=arn:aws:iam::123456789:role/DataSyncS3Role

# Create and start transfer task
aws datasync create-task \
  --source-location-arn arn:aws:datasync:...:location/loc-source \
  --destination-location-arn arn:aws:datasync:...:location/loc-dest \
  --name "nightly-export-to-s3" \
  --options VerifyMode=ONLY_FILES_TRANSFERRED,TransferMode=CHANGED

aws datasync start-task-execution --task-arn arn:aws:datasync:...:task/task-0abc
Pattern 2: Federated Query
Query data in-place across cloud and on-premise systems without moving it. Best for ad-hoc analytics where data movement is too slow or expensive.
Trino Federated Query Config
# catalog/postgresql.properties (on-prem source)
connector.name=postgresql
connection-url=jdbc:postgresql://on-prem-pg.internal:5432/production
connection-user=${ENV:POSTGRES_USER}
connection-password=${ENV:POSTGRES_PASSWORD}
# catalog/hive.properties (cloud S3)
connector.name=hive
hive.metastore=glue
hive.metastore.glue.region=us-east-1
hive.s3.sse.enabled=true
hive.s3.sse.type=KMS
hive.s3.sse.kms-key-id=alias/data-platform-prod
# catalog/iceberg.properties (cloud lakehouse)
connector.name=iceberg
iceberg.catalog.type=glue
iceberg.file-format=PARQUET
-- Federated join: cloud aggregates + on-prem customer master
SELECT
  c.customer_name,
  c.segment,
  s.total_revenue_30d,
  s.order_count_30d
FROM postgresql.public.customers c
JOIN (
  SELECT
    customer_id,
    SUM(amount) AS total_revenue_30d,
    COUNT(*) AS order_count_30d
  FROM hive.gold.daily_order_summary
  WHERE dt >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY 1
) s ON c.id = s.customer_id
WHERE c.region = 'EMEA'
ORDER BY s.total_revenue_30d DESC
LIMIT 100;
Pattern 3: Event-Driven Synchronization
For bi-directional sync between on-premise and cloud, use event streaming as the backbone. Both sides publish and consume events; neither is the master.
Kafka MirrorMaker 2 Config
# mm2.properties
clusters = on-prem, cloud
on-prem.bootstrap.servers = kafka-on-prem.internal:9092
cloud.bootstrap.servers = kafka-cloud.us-east-1.amazonaws.com:9094

# Replicate on-prem → cloud
on-prem->cloud.enabled = true
on-prem->cloud.topics = orders\..*,inventory\..*,products\..*
# "topics.exclude" replaced the deprecated "topics.blacklist" in Kafka 3.0
on-prem->cloud.topics.exclude = .*\.internal

# Replicate cloud → on-prem (predictions only)
cloud->on-prem.enabled = true
cloud->on-prem.topics = predictions\..*,enrichments\..*

# Consumer group offset sync
on-prem->cloud.sync.group.offsets.enabled = true
on-prem->cloud.sync.group.offsets.interval.seconds = 60

# Network compression
compression.type = lz4

# Security: the on-prem leg stays inside the private network/VPN;
# the cloud leg authenticates with IAM over TLS
on-prem.security.protocol = PLAINTEXT
cloud.security.protocol = SASL_SSL
cloud.sasl.mechanism = AWS_MSK_IAM
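One operational detail worth remembering: with MirrorMaker 2's DefaultReplicationPolicy, replicated topics are renamed with the source cluster alias as a prefix, so a cloud-side consumer subscribes to `on-prem.orders.created`, not `orders.created`. A minimal naming helper (in Python, purely for illustration) makes the convention explicit:

```python
def remote_topic_name(source_alias: str, topic: str, separator: str = ".") -> str:
    """Mirror of MM2's DefaultReplicationPolicy naming: a replicated
    topic is prefixed with the alias of the cluster it came from."""
    return f"{source_alias}{separator}{topic}"

# With the mm2.properties above, a cloud consumer of on-prem order events
# subscribes to the prefixed topic:
print(remote_topic_name("on-prem", "orders.created"))  # on-prem.orders.created
```

The prefix is also what prevents infinite replication loops when both directions are enabled, since MM2 skips topics that already carry a cluster-alias prefix from the target.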
Pattern 4: Data Mesh with Hybrid Domains
A data mesh distributes ownership across domains. In a hybrid environment, some domains naturally live on-premise (core banking, ERP) while others live in the cloud (analytics, ML).
Domain Ownership in Terraform
# Each domain has its own AWS account + budget
module "finance_domain" {
  source       = "./modules/data-domain"
  domain_name  = "finance"
  account_id   = var.finance_account_id
  data_gravity = "on-premise" # Primary data stays on-prem

  cloud_resources = {
    s3_landing_bucket = true
    glue_catalog      = true
    athena_workgroup  = true
  }

  data_product_topics = [
    "gl_journal_entries",
    "cost_center_hierarchy",
    "currency_exchange_rates",
  ]
}

module "customer_domain" {
  source       = "./modules/data-domain"
  domain_name  = "customer"
  account_id   = var.customer_account_id
  data_gravity = "cloud" # Primary data in cloud

  cloud_resources = {
    s3_landing_bucket = true
    rds_postgres      = true
    redshift_cluster  = true
  }
}
Pattern 5: Progressive Cloud Migration
The "big bang" migration almost always fails. Instead, use a strangler fig pattern: run hybrid while progressively moving workloads.
Migration Phases
| Phase | Duration | What Moves | What Stays |
|---|---|---|---|
| 1 - Shadow | 1-3 months | Analytics replicas | All writes |
| 2 - Read migration | 3-6 months | BI queries → cloud | Transactional writes |
| 3 - Write migration | 6-12 months | New app writes → cloud | Legacy app writes |
| 4 - Cutover | 1-3 months | All traffic | Archive only |
Dual-Write Pattern for Phase 3
# Application config: dual-write to both systems
data_stores:
  primary:
    type: postgresql
    host: on-prem-pg.internal
    database: production
    role: read_write
  secondary:
    type: postgresql
    host: cloud-rds.us-east-1.rds.amazonaws.com
    database: production
    role: read_write
    lag_threshold_ms: 500   # Alert if secondary lags

dual_write:
  enabled: true
  mode: synchronous         # Attempt both writes in-line; see fallback
  fallback: primary_only    # Tolerate secondary failure, flag for repair
  reconciliation_job:
    schedule: "0 */6 * * *"
    alert_on_drift: true
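The application-side logic behind a config like this can be sketched in a few lines. This is a minimal illustration, not a production implementation; `write_primary` and `write_secondary` are hypothetical callables the application would supply (wrappers around the two database connections):

```python
import logging

logger = logging.getLogger("dual_write")

def dual_write(record, write_primary, write_secondary, fallback_primary_only=True):
    """Synchronous dual-write sketch: the write succeeds only if the
    primary accepts it. A secondary failure is tolerated when the
    primary_only fallback is enabled, leaving the drift for the
    reconciliation job to repair."""
    write_primary(record)  # primary failure aborts the whole write
    try:
        write_secondary(record)
        return "both"
    except Exception as exc:
        if not fallback_primary_only:
            raise
        logger.warning("secondary write failed; drift queued for reconciliation: %s", exc)
        return "primary_only"
```

The key design choice is asymmetry: the legacy system remains the source of truth during Phase 3, so a cloud-side failure degrades gracefully instead of blocking writes.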
Network Architecture for Hybrid Data
Bandwidth Planning
| Data Volume | Transfer Frequency | Recommended Link | Estimated Cost |
|---|---|---|---|
| < 100 GB/day | Nightly batch | Site-to-Site VPN | ~$50/mo |
| 100 GB - 1 TB/day | Hourly | Direct Connect 1G | ~$200/mo |
| 1-10 TB/day | Streaming | Direct Connect 10G | ~$600/mo |
| > 10 TB/day | Near-real-time | Direct Connect 100G + DataSync | ~$2,000/mo |
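When sizing a link against the table above, sanity-check the arithmetic: the daily volume has to fit inside the transfer window at realistic throughput, not raw line rate. A back-of-envelope helper (the 80% effective-utilization figure is an assumption; measure your own):

```python
def transfer_hours(volume_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to move volume_tb over a link_gbps link at the
    given effective utilization (protocol overhead, contention)."""
    bits = volume_tb * 1e12 * 8              # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# 1 TB nightly batch over a 1 Gbps Direct Connect at 80% utilization:
print(round(transfer_hours(1, 1), 1))  # ≈ 2.8 hours
```

So a 1 Gbps link comfortably carries a nightly 1 TB batch, but 10 TB/day on the same link would consume most of the day, which is why the table jumps to 10G.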
Identity Federation in Hybrid Environments
Don't run two identity systems. Federate on-premise AD/LDAP with cloud IAM.
# AWS IAM Identity Center (SSO) with AD connector
# Identity Center instances are enabled outside Terraform; read them
# via the aws_ssoadmin_instances data source (there is no resource)
data "aws_ssoadmin_instances" "main" {}

resource "aws_identitystore_group" "data_engineers" {
  identity_store_id = tolist(data.aws_ssoadmin_instances.main.identity_store_ids)[0]
  display_name      = "DataEngineers"
  description       = "Data platform engineers - cloud access"
}

# Permission set for data engineers
resource "aws_ssoadmin_permission_set" "data_engineer" {
  instance_arn     = tolist(data.aws_ssoadmin_instances.main.arns)[0]
  name             = "DataEngineerAccess"
  session_duration = "PT8H"
}

resource "aws_ssoadmin_managed_policy_attachment" "data_engineer_s3" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.main.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.data_engineer.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
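A permission set grants nothing until it is assigned to a principal in a target account. A sketch of that final piece, assuming the target account id is exposed as a hypothetical `var.data_platform_account_id`:

```
resource "aws_ssoadmin_account_assignment" "data_engineers" {
  instance_arn       = aws_ssoadmin_permission_set.data_engineer.instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.data_engineer.arn
  principal_id       = aws_identitystore_group.data_engineers.group_id
  principal_type     = "GROUP"
  target_id          = var.data_platform_account_id
  target_type        = "AWS_ACCOUNT"
}
```

With AD federated into Identity Center, membership changes in the on-premise directory flow through to cloud access without touching Terraform again.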
Monitoring Hybrid Data Flows
Observability gets harder when data crosses network boundaries. Tools like Harbinger Explorer provide unified visibility across hybrid environments—tracking pipeline health, data freshness, and transfer latency regardless of whether your data lives on-premise or in the cloud.
# Monitor cross-boundary transfer volume via DataSync metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/DataSync \
  --metric-name BytesTransferred \
  --dimensions Name=TaskId,Value=task-0abc123 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Sum
Summary
Hybrid cloud data architecture isn't a transitional state—for many enterprises, it's the permanent operating model. Design for it deliberately:
- Understand your data gravity before deciding what moves
- Use event streaming (Kafka + MirrorMaker) for bi-directional sync
- Federate queries with Trino/Athena Federation to query in-place
- Migrate progressively with shadow mode and dual-write patterns
- Federate identity — one IAM to rule them all
- Instrument cross-boundary flows with transfer metrics and freshness SLOs
Try Harbinger Explorer free for 7 days and get unified observability across your entire hybrid data landscape—from on-premise RDBMS to cloud lakehouses, in a single pane of glass.
Continue Reading
GDPR Compliance for Cloud Data Platforms: A Technical Deep Dive
A comprehensive technical guide to building GDPR-compliant cloud data platforms — covering pseudonymisation architecture, Terraform infrastructure, Kubernetes deployments, right-to-erasure workflows, and cloud provider comparison tables.
Cloud Cost Allocation Strategies for Data Teams
A practitioner's guide to cloud cost allocation for data teams—covering tagging strategies, chargeback models, Spot instance patterns, query cost optimization, and FinOps tooling with real Terraform and CLI examples.
API Gateway Architecture Patterns for Data Platforms
A deep-dive into API gateway architecture patterns for data platforms — covering data serving APIs, rate limiting, authentication, schema versioning, and the gateway-as-data-mesh pattern.
Try Harbinger Explorer for free
Connect any API, upload files, and explore with AI — all in your browser. No credit card required.
Start Free Trial