
Multi-Cloud Data Strategy: Patterns and Pitfalls

12 min read · Tags: multi-cloud, data-strategy, cloud-architecture, AWS, GCP, Azure, data-mesh


Multi-cloud is no longer a trend — it's the operational reality for most large organisations. Mergers, regulatory requirements, best-of-breed service selection, and vendor risk management all push teams towards running workloads across AWS, GCP, and Azure simultaneously. The data layer is where this complexity hits hardest.

This guide covers the reference architectures that actually work, the anti-patterns that consistently blow up, and the operational disciplines that separate healthy multi-cloud data platforms from expensive chaos.


Why Multi-Cloud Data Architectures Exist

Before diving into patterns, it's worth being honest about the reasons teams end up here:

| Driver | Reality |
| --- | --- |
| Vendor lock-in avoidance | Theoretically sound, operationally expensive |
| Best-of-breed services | BigQuery for analytics, Snowflake on AWS, Cosmos DB on Azure |
| M&A integration | Acquired company runs a different cloud — you inherit it |
| Data residency / compliance | EU data on Azure, US data on AWS |
| Disaster recovery | Active-active across clouds as ultimate resilience |

Most organisations don't choose multi-cloud — they arrive at it through accumulated decisions. Understanding the actual driver shapes the right architecture.


Reference Pattern 1: The Federated Query Layer

The most pragmatic starting point. Data stays where it lives; compute crosses cloud boundaries only for queries.

[Diagram: federated query layer]

When to use it: When you need unified reporting across clouds without a full data migration. Harbinger Explorer is useful here for testing federated query API endpoints and verifying that schema metadata from the different catalogs comes back in a consistently structured form.

Implementation with Trino on Kubernetes:

# trino-values.yaml (Helm)
coordinator:
  resources:
    requests:
      memory: "8Gi"
      cpu: "2"
catalogs:
  s3_lakehouse: |
    connector.name=hive
    hive.metastore.uri=thrift://glue-metastore:9083
    hive.s3.aws-credentials-provider=com.amazonaws.auth.InstanceProfileCredentialsProvider
  bigquery: |
    connector.name=bigquery
    bigquery.project-id=my-gcp-project
    bigquery.credentials-file=/etc/secrets/gcp-sa.json
  adls_lakehouse: |
    connector.name=delta_lake
    delta.hide-non-delta-tables=true
    hive.metastore.uri=thrift://hms-azure:9083
    hive.azure.abfs.oauth.client-id=${AZURE_CLIENT_ID}
    hive.azure.abfs.oauth.secret=${AZURE_CLIENT_SECRET}
    hive.azure.abfs.oauth.endpoint=https://login.microsoftonline.com/${TENANT_ID}/oauth2/token

Pitfall: Egress costs. A federated query pulling 500 GB across cloud boundaries can cost more than a month of storage. Always push down predicates aggressively and profile query plans before running in production.
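To make the pushdown argument concrete, here is a back-of-envelope model of federated-query egress. The egress rate, table size, and predicate selectivity are illustrative assumptions, not billed prices:

```python
# Rough model of federated-query egress: only the bytes that cross the cloud
# boundary are billed. Rate, size, and selectivity below are assumptions.

EGRESS_USD_PER_GB = 0.09  # typical inter-cloud egress rate (assumption)

def federated_egress_cost(table_gb: float, selectivity: float, pushdown: bool) -> float:
    """USD cost of moving a remote table's bytes to the coordinator's cloud.

    selectivity: fraction of data the predicate keeps (0..1). With pushdown
    only that fraction crosses the boundary; without it, the whole table does.
    """
    transferred_gb = table_gb * selectivity if pushdown else table_gb
    return transferred_gb * EGRESS_USD_PER_GB

# The 500 GB scan from the text, with a 2%-selective predicate:
print(round(federated_egress_cost(500, 0.02, pushdown=False), 2))  # 45.0
print(round(federated_egress_cost(500, 0.02, pushdown=True), 2))   # 0.9
```

A 50× cost difference from one query-plan property is why profiling plans before production is worth the effort.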


Reference Pattern 2: The Hub-and-Spoke Lakehouse

One cloud hosts the canonical data lake (the hub); other clouds have lighter read replicas or purpose-built stores (the spokes). Data flows one-way from hub to spokes.

[Diagram: hub-and-spoke lakehouse]

Terraform for cross-cloud replication IAM:

# AWS side — read access for cross-cloud replication. To let a GCP service
# account use this, attach the policy to an IAM role whose trust policy
# federates the GCP identity (accounts.google.com) via web identity.
resource "aws_iam_policy" "gcp_replication_read" {
  name = "gcp-replication-read"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.lakehouse.arn,
        "${aws_s3_bucket.lakehouse.arn}/*"
      ]
    }]
  })
}

# Replication task — illustrative: DMS has no native GCS target, so in
# practice this step is a custom Airflow DAG or GCP Storage Transfer Service
resource "aws_dms_replication_task" "to_gcp" {
  replication_task_id      = "lakehouse-to-bq"
  migration_type           = "full-load-and-cdc"
  replication_instance_arn = aws_dms_replication_instance.main.replication_instance_arn
  source_endpoint_arn      = aws_dms_endpoint.s3_source.endpoint_arn
  target_endpoint_arn      = aws_dms_endpoint.gcs_target.endpoint_arn
  table_mappings           = file("table-mappings.json")
}

Reference Pattern 3: Data Mesh with Cloud Domain Alignment

For large organisations, the data mesh model maps naturally onto multi-cloud: each domain owns its data product, and the cloud assignment follows domain ownership.

[Diagram: data mesh with cloud-aligned domains]

This pattern requires a cross-cloud governance plane — typically implemented with tools like Collibra, Atlan, or a custom metadata service. Harbinger Explorer fits well as the API testing layer for validating that each domain's data API contract is honoured consistently across environments.


The Seven Deadly Anti-Patterns

1. The Egress Trap

Moving data between clouds for every query. At $0.08–0.09/GB egress, a 10 TB daily analytical workload costs $800/day just in transfer fees. Fix: Replicate once, query locally.
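The "replicate once" fix can be justified with simple arithmetic: compare the one-time replication egress plus ongoing replica storage against repeated cross-cloud scans. The rates below are assumptions for illustration:

```python
# Back-of-envelope for "replicate once, query locally": after how many days
# does a local replica beat repeated cross-cloud scans? Rates are assumptions.
import math

def replica_breakeven_days(dataset_gb: float, scans_per_day: float,
                           egress_usd_gb: float = 0.08,
                           storage_usd_gb_month: float = 0.02) -> int:
    daily_federated = dataset_gb * scans_per_day * egress_usd_gb
    one_time_copy = dataset_gb * egress_usd_gb          # initial replication egress
    daily_storage = dataset_gb * storage_usd_gb_month / 30
    # The replica wins once: one_time_copy + daily_storage * d < daily_federated * d
    return math.ceil(one_time_copy / (daily_federated - daily_storage))

# The 10 TB workload from the text, scanned once per day:
print(replica_breakeven_days(10_000, 1))  # 2
```

For the $800/day workload above, the replica pays for itself on day two.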

2. Identity Hell

Three separate IAM systems (AWS IAM, Google Cloud IAM, Microsoft Entra ID) with no unified identity plane. Engineers manage 3× the roles, 3× the policies. Fix: Federate identity through a single IdP (Okta, Microsoft Entra ID) before writing a single resource policy.

3. Schema Drift

Data copied across clouds diverges in type precision (Parquet INT32 vs BigQuery INT64), null handling, and partitioning schemes. Fix: Contract testing on every cross-cloud data pipeline.
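Contract testing for schema drift can be as simple as diffing the types a pipeline expects against what each cloud's catalog reports. A minimal sketch, with illustrative column names and types:

```python
# Minimal cross-cloud schema contract test: compare the types a pipeline
# expects against what a cloud catalog reports. Names/types are illustrative.

CONTRACT = {"order_id": "string", "amount": "numeric", "created_at": "timestamp"}

def schema_drift(contract: dict, observed: dict) -> list:
    """(column, expected, actual) for every mismatch or missing column."""
    issues = [(col, exp, observed.get(col))
              for col, exp in contract.items() if observed.get(col) != exp]
    issues += [(col, None, typ)
               for col, typ in observed.items() if col not in contract]
    return issues

# A replica whose catalog silently widened order_id to int64:
replica = {"order_id": "int64", "amount": "numeric", "created_at": "timestamp"}
print(schema_drift(CONTRACT, replica))  # [('order_id', 'string', 'int64')]
```

Run a check like this on every cross-cloud copy; a non-empty result fails the pipeline before the drift reaches consumers.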

4. Operational Silos

Three separate monitoring stacks, three cost dashboards, three on-call rotations. Fix: Centralise observability — OpenTelemetry → a single backend, unified cost allocation tags.

5. The "Best of Breed" Accumulation Tax

Every team picks the best tool for their cloud. You end up with 14 orchestrators, 6 data catalogs, and 4 transformation frameworks. Fix: Standardise on 2-3 core tools that run cloud-agnostically (Airflow/Dagster for orchestration, dbt for transformation, Apache Iceberg for table format).

6. Network Topology Neglect

Assuming cloud VPNs "just work" at scale. At 100 Gbps+ transfer rates, VPN throughput limits and latency become architectural constraints. Fix: Use cloud interconnects (AWS Direct Connect, Azure ExpressRoute, GCP Dedicated Interconnect) with private peering for data-intensive workloads.
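The throughput constraint is easy to quantify: divide the batch size by usable link capacity. The tunnel ceiling and efficiency factor below are illustrative assumptions:

```python
# Why link capacity is an architectural constraint: hours to move a nightly
# replication batch at a given usable throughput. Rates are assumptions.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to push data_tb (decimal TB) over link_gbps at the given utilisation."""
    bits = data_tb * 8e12  # TB -> bits
    return bits / (link_gbps * 1e9 * efficiency) / 3600

vpn = transfer_hours(50, 1.25)   # a single VPN tunnel often tops out ~1.25 Gbps
ix  = transfer_hours(50, 100)    # 100 Gbps dedicated interconnect
print(round(vpn), round(ix, 1))  # ~127 hours vs ~1.6 hours
```

A 50 TB nightly batch simply cannot finish overnight on a VPN tunnel, which is the point at which interconnects stop being optional.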

7. Cost Attribution Blindness

No tag strategy, no cross-cloud cost allocation, no team-level showback. Multi-cloud costs invisibly balloon. Fix: Define a mandatory tag taxonomy (env, team, domain, project) before deploying anything.
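A taxonomy is only mandatory if something enforces it. A minimal pre-deploy check, using the tag keys named above:

```python
# Enforce the mandatory tag taxonomy before deploy: a check like this can run
# in CI against planned resources. Key names follow the taxonomy in the text.

REQUIRED_TAGS = {"env", "team", "domain", "project"}

def missing_tags(resource_tags: dict) -> set:
    """Mandatory keys the resource lacks (tag keys compared case-insensitively)."""
    present = {key.lower() for key in resource_tags}
    return REQUIRED_TAGS - present

print(sorted(missing_tags({"env": "prod", "Team": "data-platform"})))
# ['domain', 'project']
```

Wired into a Terraform plan check or an admission policy, this turns cost attribution from a cleanup project into a gate.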


Operational Disciplines

Unified Tagging Strategy

# Apply consistent tags across clouds
# AWS
aws ec2 create-tags --resources i-1234567890abcdef0 \
  --tags Key=team,Value=data-platform Key=env,Value=prod Key=domain,Value=analytics

# GCP
gcloud compute instances add-labels my-instance \
  --labels=team=data-platform,env=prod,domain=analytics

# Azure
az resource tag --tags team=data-platform env=prod domain=analytics \
  --ids /subscriptions/{sub-id}/resourceGroups/{rg}/providers/...

Cross-Cloud Data Quality Contracts

Use Great Expectations or Soda with cloud-agnostic YAML contracts:

# data_contract_orders.yml
dataset: orders
columns:
  - name: order_id
    type: string
    not_null: true
    unique: true
  - name: amount
    type: decimal(18,4)
    min: 0
  - name: created_at
    type: timestamp
    not_null: true
validations:
  - freshness:
      column: created_at
      max_age: 6h
  - row_count:
      min: 1000

Run this contract check in CI/CD for every cross-cloud pipeline before data is considered valid.
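Tools like Great Expectations and Soda implement these rules out of the box; the logic itself is small. A hedged sketch of a checker for the `not_null`, `unique`, and `min` rules from the contract above:

```python
# Sketch of enforcing the YAML contract's column rules over a sample of rows.
# (Great Expectations / Soda provide this natively; this just shows the idea.)

def check_contract(rows: list, columns: list) -> list:
    """columns: dicts like {'name': ..., 'not_null': True, 'unique': True, 'min': 0}."""
    failures = []
    for col in columns:
        values = [row.get(col["name"]) for row in rows]
        if col.get("not_null") and any(v is None for v in values):
            failures.append((col["name"], "not_null"))
        if col.get("unique") and len(set(values)) != len(values):
            failures.append((col["name"], "unique"))
        if "min" in col and any(v is not None and v < col["min"] for v in values):
            failures.append((col["name"], "min"))
    return failures

orders = [{"order_id": "a1", "amount": 12.5}, {"order_id": "a1", "amount": -3.0}]
print(check_contract(orders, [
    {"name": "order_id", "not_null": True, "unique": True},
    {"name": "amount", "min": 0},
]))  # [('order_id', 'unique'), ('amount', 'min')]
```

A non-empty failure list should fail the CI job, keeping bad data out of every downstream cloud.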


Cost Benchmarks

Based on real workloads (1 TB/day processed, 5 TB stored):

| Architecture | Monthly Compute | Monthly Egress | Total/Month |
| --- | --- | --- | --- |
| Single cloud | $3,200 | $0 | ~$3,200 |
| Federated queries (naïve) | $2,800 | $4,800 | ~$7,600 |
| Federated queries (optimised) | $2,800 | $320 | ~$3,120 |
| Hub-and-spoke | $3,600 | $180 | ~$3,780 |
| Full data mesh | $4,200 | $240 | ~$4,440 |

The "federated queries (optimised)" row assumes aggressive predicate pushdown and result caching — achievable but requires significant query engine tuning.
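The totals are simply compute plus egress, which makes them easy to recompute when you substitute your own workload numbers:

```python
# Sanity-check the benchmark table: total = compute + egress, per architecture.

benchmarks = {                     # (monthly compute USD, monthly egress USD)
    "single cloud":          (3200, 0),
    "federated (naive)":     (2800, 4800),
    "federated (optimised)": (2800, 320),
    "hub-and-spoke":         (3600, 180),
    "full data mesh":        (4200, 240),
}
totals = {name: compute + egress for name, (compute, egress) in benchmarks.items()}
print(totals["federated (naive)"])  # 7600
```

Note how the naïve federated setup's egress bill exceeds its compute bill — the Egress Trap in numbers.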


Decision Framework

[Diagram: architecture decision framework]

Summary

Multi-cloud data strategy works when you're explicit about why you're multi-cloud and choose the architecture that matches that reason. The federated query pattern is the right starting point for most teams — low migration cost, fast time-to-value. Hub-and-spoke makes sense when you have a clear primary cloud. Data mesh fits large organisations with genuine domain autonomy.

The pitfalls are predictable: egress costs, identity sprawl, schema drift, and operational complexity. Each has a known mitigation. The teams that succeed treat multi-cloud as an operational discipline problem as much as an architecture problem.


Try Harbinger Explorer free for 7 days — validate your multi-cloud API contracts, explore cross-cloud data structures, and test your federated endpoints before they hit production. harbingerexplorer.com

