How to Choose the Right Cloud Database: A Decision Framework for Architects

11 min read · Tags: databases, cloud-architecture, postgresql, dynamodb, bigquery, vector-db

Choosing a cloud database is one of the highest-leverage architectural decisions a platform team makes. Get it right and you have a foundation that scales, performs, and stays maintainable for years. Get it wrong and you're facing a painful, expensive migration at the worst possible time — when your user base is growing.

This guide provides a structured decision framework and deep-dive comparisons across every major database category.


The Decision Framework

Before evaluating specific products, answer these five questions:

[Diagram: the five decision-framework questions]

Category 1: Relational Databases (OLTP)

Best for: Transactional workloads, complex queries, referential integrity, financial data, user accounts.

Managed Options Comparison

| Database | Best Cloud | Max Storage | Connection Limits | Managed? |
|---|---|---|---|---|
| Amazon Aurora PostgreSQL | AWS | Unlimited (serverless v2) | 5,000 | Full |
| Cloud SQL PostgreSQL | GCP | 64 TB | 10,000 | Full |
| Azure Database for PostgreSQL | Azure | 64 TB | ~1,000 | Full |
| Neon | Multi-cloud | Unlimited | Unlimited (serverless) | Full |
| PlanetScale | Multi-cloud | Unlimited | Unlimited | Full |

When to Use PostgreSQL

PostgreSQL is the default choice for most new applications. It handles:

  • JSONB: semi-structured data without schema migration
  • Full-text search: tsvector for lightweight search without Elasticsearch
  • Time-series: with TimescaleDB extension
  • Vector search: with pgvector extension
  • Geospatial: with PostGIS extension

The ability to extend PostgreSQL often delays the need for specialised stores by years.

Terraform: Aurora Serverless v2

resource "aws_rds_cluster" "harbinger" {
  cluster_identifier     = "harbinger-prod"
  engine                 = "aurora-postgresql"
  engine_version         = "15.4"
  database_name          = "harbinger"
  master_username        = "admin"
  manage_master_user_password = true  # Secrets Manager integration
  
  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 64
  }
  # Note: Serverless v2 also requires at least one aws_rds_cluster_instance
  # with instance_class = "db.serverless"

  backup_retention_period = 14
  preferred_backup_window = "02:00-03:00"
  
  enabled_cloudwatch_logs_exports = ["postgresql"]
  
  deletion_protection = true
  skip_final_snapshot = false
  final_snapshot_identifier = "harbinger-prod-final"

  tags = {
    Environment = "production"
  }
}

Category 2: NoSQL Document / Key-Value

Best for: High-throughput key-value access, flexible schemas, global distribution, event sourcing.

DynamoDB vs Firestore vs MongoDB Atlas

| Feature | DynamoDB | Firestore | MongoDB Atlas |
|---|---|---|---|
| Cloud | AWS only | GCP / multi | Multi-cloud |
| Pricing model | On-demand / provisioned | Per operation | Instance-based |
| Query flexibility | Single-table design required | Rich queries | Full MongoDB query API |
| Real-time subscriptions | Via Streams + Lambda | Native | Change Streams |
| Global distribution | Global Tables | Native multi-region | Atlas Global Clusters |
| Secondary indexes | GSI/LSI | Composite indexes | Any field |
| Max item size | 400 KB | 1 MB | 16 MB |

DynamoDB Single-Table Design Pattern

# Storing multiple entity types in one DynamoDB table
# using composite keys and GSIs

table_schema = {
    "TableName": "harbinger-events",
    # Every attribute referenced in a key schema (including GSI keys)
    # must be declared here, or CreateTable fails
    "AttributeDefinitions": [
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},   # e.g. COUNTRY#US
        {"AttributeName": "SK", "KeyType": "RANGE"},  # e.g. EVENT#2024-01-15#uuid
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "EventTypeIndex",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},  # EVENT_TYPE#CONFLICT
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"}, # DATE#2024-01-15
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# Access patterns this enables:
# 1. Get all events for a country: PK=COUNTRY#US, SK begins_with EVENT#
# 2. Get events by type: GSI1PK=EVENT_TYPE#CONFLICT, GSI1SK between dates
# 3. Get single event: PK=COUNTRY#US, SK=EVENT#2024-01-15#abc123
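Wrapping the key scheme in small helpers keeps composite keys consistent across the codebase. A minimal sketch for access pattern 1 (the helper names are illustrative; the query dict uses the low-level boto3 `client.query` syntax):

```python
def country_pk(country_code):
    """Partition key for all events in a country, e.g. COUNTRY#US."""
    return f"COUNTRY#{country_code}"

def event_sk(event_date, event_id):
    """Sort key that keeps a country's events ordered by date."""
    return f"EVENT#{event_date}#{event_id}"

def events_for_country_query(country_code):
    """Query parameters for access pattern 1: all events for a country."""
    return {
        "TableName": "harbinger-events",
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": country_pk(country_code)},
            ":prefix": {"S": "EVENT#"},
        },
    }
```

Centralising key construction like this matters because single-table designs fail silently: a typo in a key prefix simply returns no items.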

Category 3: Time-Series Databases

Best for: Metrics, IoT sensor data, financial tick data, monitoring telemetry.

Time-Series Store Comparison

| Store | Write Throughput | Query Language | Retention Policies | Cloud Native |
|---|---|---|---|---|
| Amazon Timestream | Millions/sec | SQL-like | Automatic tiering | AWS |
| InfluxDB Cloud | High | Flux / InfluxQL | Configurable | Multi |
| TimescaleDB | High | Full SQL | SQL-based | Multi |
| Prometheus | Moderate | PromQL | Configurable | Any K8s |
| OpenTSDB | Very high | HTTP API | Configurable | Self-hosted |

TimescaleDB Hypertable

-- Create hypertable for geopolitical risk scores
CREATE TABLE risk_scores (
    time        TIMESTAMPTZ NOT NULL,
    country     TEXT NOT NULL,
    risk_score  FLOAT NOT NULL,
    event_count INTEGER,
    source      TEXT
);

-- Convert to hypertable partitioned by time
SELECT create_hypertable('risk_scores', 'time', chunk_time_interval => INTERVAL '1 day');

-- Add compression policy (compress chunks older than 7 days)
ALTER TABLE risk_scores SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'country'
);
SELECT add_compression_policy('risk_scores', INTERVAL '7 days');

-- Add retention policy (drop data older than 2 years)
SELECT add_retention_policy('risk_scores', INTERVAL '2 years');

-- Query: risk score trend with gap-filling
SELECT
    time_bucket_gapfill('1 hour', time) AS bucket,
    country,
    AVG(risk_score) AS avg_risk,
    locf(AVG(risk_score)) AS filled_risk  -- last observation carried forward
FROM risk_scores
WHERE country = 'UA'
    AND time > NOW() - INTERVAL '30 days'
    AND time < NOW()  -- gapfill needs both bounds to infer the bucket range
GROUP BY bucket, country
ORDER BY bucket;
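For intuition, here is what the gap-filling pair above computes, sketched in plain Python (hourly buckets simplified to integers):

```python
def gapfill_locf(observations, buckets):
    """observations: {bucket: value}; buckets: ordered list of expected buckets.
    Returns (bucket, value) pairs with gaps filled by the last seen value."""
    filled = []
    last = None
    for bucket in buckets:
        if bucket in observations:
            last = observations[bucket]
        filled.append((bucket, last))  # stays None until the first observation
    return filled

series = {1: 0.42, 3: 0.57}  # risk scores observed at hours 1 and 3
print(gapfill_locf(series, [0, 1, 2, 3, 4]))
# [(0, None), (1, 0.42), (2, 0.42), (3, 0.57), (4, 0.57)]
```

Hour 0 has no prior observation so it stays NULL, hour 2 carries 0.42 forward, and hour 4 carries 0.57 — exactly the difference between `avg_risk` and `filled_risk` in the SQL result.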

Category 4: Analytical Databases (OLAP)

Best for: Business intelligence, ad-hoc analysis, aggregations over billions of rows.

OLAP Comparison: BigQuery vs Redshift vs Snowflake vs Databricks

| Feature | BigQuery | Redshift | Snowflake | Databricks SQL |
|---|---|---|---|---|
| Pricing | Per query (TB scanned) | Per cluster-hour | Per credit | Per DBU |
| Serverless | Yes (fully) | Serverless option | Yes (serverless) | Serverless option |
| Storage format | Capacitor (proprietary) | Parquet/ORC | FDN (proprietary) | Delta Lake |
| ML integration | BigQuery ML | Redshift ML | Snowpark ML | MLflow native |
| Streaming ingest | BigQuery Storage Write | Kinesis Firehose | Snowpipe | Autoloader |
| Git integration | Limited | Limited | Limited | Native (DBX) |
| Max query result | 10 GB | Unlimited | Unlimited | Unlimited |
| Time travel | 7 days | 5 days | 90 days | 30 days (Delta) |

Cost Optimisation Strategies

BigQuery:

-- Partition tables to reduce bytes scanned
CREATE TABLE harbinger_prod.events.geopolitical
PARTITION BY DATE(event_date)
CLUSTER BY country_code, event_type
AS SELECT * FROM staging.raw_events;

-- Use partition filters in queries (avoid full table scans)
SELECT country_code, COUNT(*) as event_count
FROM harbinger_prod.events.geopolitical
WHERE DATE(event_date) BETWEEN '2024-01-01' AND '2024-03-31'  -- partition pruning
GROUP BY country_code;
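To see why pruning matters under per-query pricing, a back-of-envelope cost helper (the ~$6.25 per TiB on-demand rate is an assumption here; check current BigQuery pricing for your region):

```python
def query_cost_usd(bytes_scanned, price_per_tib=6.25):
    """On-demand cost estimate; price_per_tib is an assumed rate."""
    return bytes_scanned / 2**40 * price_per_tib

# A 5 TiB table: full scan vs. a partition-pruned read of ~100 GiB
full_scan = query_cost_usd(5 * 2**40)
pruned = query_cost_usd(100 * 2**30)
print(f"full scan ${full_scan:.2f} vs pruned ${pruned:.2f}")  # ~$31.25 vs ~$0.61
```

At analyst query volumes (hundreds of ad-hoc queries a day), that 50x difference is the gap between a trivial bill and a budget conversation.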

Category 5: Vector Databases

Best for: Semantic search, RAG (retrieval-augmented generation), recommendation systems, similarity matching.

Vector Database Comparison

| Database | Indexing | Max Vectors | Filtering | Cloud Managed |
|---|---|---|---|---|
| Pinecone | HNSW | Billions | Yes (metadata) | Yes |
| Weaviate | HNSW + BM25 | Billions | Yes | Yes (Cloud) |
| Qdrant | HNSW | Billions | Yes | Yes (Cloud) |
| pgvector | HNSW / IVFFlat | ~10M (practical) | Full SQL | Via PostgreSQL |
| Chroma | HNSW | Millions | Yes | Self-hosted |
| Milvus | Multiple | Billions | Yes | Zilliz Cloud |

pgvector for Intelligence Platforms

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Store article embeddings alongside metadata
CREATE TABLE article_embeddings (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    article_id  TEXT NOT NULL,
    title       TEXT NOT NULL,
    country     TEXT,
    event_date  DATE,
    embedding   vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest neighbour search
CREATE INDEX ON article_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Semantic search: find similar geopolitical events
SELECT
    article_id,
    title,
    country,
    event_date,
    1 - (embedding <=> $1::vector) AS similarity
FROM article_embeddings
WHERE country = 'UA'  -- pre-filter by country
    AND event_date > '2024-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 20;
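The `<=>` operator used above is pgvector's cosine distance, which is why the query converts it to a similarity score with `1 - distance`. The underlying computation, in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """What pgvector's <=> operator returns for two vectors."""
    return 1.0 - cosine_similarity(a, b)

# Collinear vectors -> similarity near 1.0; orthogonal -> near 0.0
print(cosine_similarity([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))
```

Because cosine similarity ignores vector magnitude, it is the usual choice for text embeddings, where direction (not length) carries the semantic signal.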

The Polyglot Persistence Pattern

Most production platforms use multiple database types — the key is matching each store to its strengths:

[Diagram: polyglot persistence architecture — each workload routed to its matching store]

Decision Checklist

Use this checklist before committing to a database:

Workload characteristics

  • Primary access pattern documented (point lookup / range scan / aggregation)
  • Expected read/write ratio
  • Peak QPS estimates (p50, p99)
  • Data volume at launch and 2-year projection

Consistency requirements

  • Do you need ACID transactions? (choose RDBMS)
  • Can you tolerate eventual consistency? (enables NoSQL/distributed)
  • Do you need multi-region writes? (CockroachDB, DynamoDB Global Tables)

Operational

  • Managed service vs. self-hosted?
  • Team familiarity with query language?
  • Migration path if requirements change?
  • Cost model understood (per query vs. per hour vs. per row)?
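One way to make the checklist actionable is to encode the first-pass routing as code. A rough sketch (the categories and precedence rules are illustrative, not prescriptive):

```python
def suggest_category(needs_acid, access_pattern, analytical, vector_search):
    """First-pass database category from the checklist answers.
    access_pattern: "point_lookup", "range_scan", or "aggregation"."""
    if vector_search:
        return "vector (pgvector or a dedicated vector DB)"
    if analytical or access_pattern == "aggregation":
        return "OLAP warehouse"
    if needs_acid:
        return "relational (OLTP)"
    if access_pattern == "point_lookup":
        return "key-value / document"
    return "relational (OLTP)"  # safe default, per the framework above

print(suggest_category(needs_acid=True, access_pattern="range_scan",
                       analytical=False, vector_search=False))
```

Real decisions weigh cost model, team familiarity, and migration paths too — treat this as a conversation starter, not an oracle.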

Conclusion

The right cloud database decision is always contextual. Resist the temptation to standardise on one database for everything — the performance gap between an appropriately chosen store and a general-purpose one can be 10–100x.

Platforms processing diverse data types — like Harbinger Explorer, which correlates news events, risk scores, time-series indicators, and semantic signals — often run 3–5 different database types in production, each optimised for its specific access pattern.


Try Harbinger Explorer free for 7 days — built on a polyglot persistence architecture designed for intelligence-grade performance.

