How to Choose the Right Cloud Database: A Decision Framework for Architects
Choosing a cloud database is one of the highest-leverage architectural decisions a platform team makes. Get it right and you have a foundation that scales, performs, and stays maintainable for years. Get it wrong and you're facing a painful, expensive migration at the worst possible time — when your user base is growing.
This guide provides a structured decision framework and deep-dive comparisons across the major database categories.
The Decision Framework
Before evaluating specific products, answer these five questions:
(Diagram: the five-question decision framework)
Category 1: Relational Databases (OLTP)
Best for: Transactional workloads, complex queries, referential integrity, financial data, user accounts.
Managed Options Comparison
| Database | Best Cloud | Max Storage | Connection Limits | Managed? |
|---|---|---|---|---|
| Amazon Aurora PostgreSQL | AWS | 128 TiB | 5,000 | Full |
| Cloud SQL PostgreSQL | GCP | 64 TB | 10,000 | Full |
| Azure Database for PostgreSQL | Azure | 64 TB | ~1,000 | Full |
| Neon | Multi-cloud | Unlimited | Unlimited (serverless) | Full |
| PlanetScale | Multi-cloud | Unlimited | Unlimited | Full |
When to Use PostgreSQL
PostgreSQL is the default choice for most new applications. It handles:
- JSONB: semi-structured data without schema migrations
- Full-text search: `tsvector` for lightweight search without Elasticsearch
- Time-series: with the `TimescaleDB` extension
- Vector search: with the `pgvector` extension
- Geospatial: with the `PostGIS` extension
The ability to extend PostgreSQL often delays the need for specialised stores by years.
Terraform: Aurora Serverless v2
```hcl
resource "aws_rds_cluster" "harbinger" {
  cluster_identifier          = "harbinger-prod"
  engine                      = "aurora-postgresql"
  engine_version              = "15.4"
  database_name               = "harbinger"
  master_username             = "admin"
  manage_master_user_password = true # Secrets Manager integration

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 64
  }

  backup_retention_period         = 14
  preferred_backup_window         = "02:00-03:00"
  enabled_cloudwatch_logs_exports = ["postgresql"]

  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "harbinger-prod-final"

  tags = {
    Environment = "production"
  }
}

# Note: Serverless v2 also requires at least one aws_rds_cluster_instance
# with instance_class = "db.serverless" attached to this cluster.
```
Category 2: NoSQL Document / Key-Value
Best for: High-throughput key-value access, flexible schemas, global distribution, event sourcing.
DynamoDB vs Firestore vs MongoDB Atlas
| Feature | DynamoDB | Firestore | MongoDB Atlas |
|---|---|---|---|
| Cloud | AWS only | GCP / multi | Multi-cloud |
| Pricing model | On-demand / provisioned | Per operation | Instance-based |
| Query flexibility | Single-table design required | Rich queries | Full MongoDB query API |
| Real-time subscriptions | Via Streams + Lambda | Native | Change Streams |
| Global distribution | Global Tables | Native multi-region | Atlas Global Clusters |
| Secondary indexes | GSI/LSI | Composite indexes | Any field |
| Max item size | 400 KB | 1 MB | 16 MB |
DynamoDB Single-Table Design Pattern
```python
# Storing multiple entity types in one DynamoDB table
# using composite keys and GSIs
table_schema = {
    "TableName": "harbinger-events",
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},   # e.g. COUNTRY#US
        {"AttributeName": "SK", "KeyType": "RANGE"},  # e.g. EVENT#2024-01-15#uuid
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "EventTypeIndex",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},   # EVENT_TYPE#CONFLICT
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},  # DATE#2024-01-15
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
    # A real CreateTable call also needs AttributeDefinitions
    # for PK, SK, GSI1PK, and GSI1SK.
}

# Access patterns this enables:
# 1. Get all events for a country: PK=COUNTRY#US, SK begins_with EVENT#
# 2. Get events by type: GSI1PK=EVENT_TYPE#CONFLICT, GSI1SK between dates
# 3. Get single event: PK=COUNTRY#US, SK=EVENT#2024-01-15#abc123
```
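The access patterns above map one-to-one onto DynamoDB Query requests. Here is a minimal sketch of pattern 1, assuming the table schema above; the helper name is hypothetical, and a real boto3 client would execute the returned request:

```python
# Sketch: building the DynamoDB Query request for access pattern 1
# ("all events for a country"). The table name matches the schema above.

def events_for_country_query(country: str, year: str) -> dict:
    """Build a Query request: PK = COUNTRY#<code>, SK begins_with EVENT#<year>."""
    return {
        "TableName": "harbinger-events",
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk_prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"COUNTRY#{country}"},
            ":sk_prefix": {"S": f"EVENT#{year}"},
        },
    }

request = events_for_country_query("US", "2024")
# client = boto3.client("dynamodb"); client.query(**request)
```

Because every access pattern is expressed as a key condition (never a scan), cost and latency stay flat as the table grows.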
Category 3: Time-Series Databases
Best for: Metrics, IoT sensor data, financial tick data, monitoring telemetry.
Time-Series Store Comparison
| Store | Write Throughput | Query Language | Retention Policies | Cloud Native |
|---|---|---|---|---|
| Amazon Timestream | Millions/sec | SQL-like | Automatic tiering | AWS |
| InfluxDB Cloud | High | Flux / InfluxQL | Configurable | Multi |
| TimescaleDB | High | Full SQL | SQL-based | Multi |
| Prometheus | Moderate | PromQL | Configurable | Any K8s |
| OpenTSDB | Very high | HTTP API | Configurable | Self-hosted |
TimescaleDB Hypertable
```sql
-- Create hypertable for geopolitical risk scores
CREATE TABLE risk_scores (
    time        TIMESTAMPTZ NOT NULL,
    country     TEXT NOT NULL,
    risk_score  FLOAT NOT NULL,
    event_count INTEGER,
    source      TEXT
);

-- Convert to hypertable partitioned by time
SELECT create_hypertable('risk_scores', 'time', chunk_time_interval => INTERVAL '1 day');

-- Add compression policy (compress chunks older than 7 days)
ALTER TABLE risk_scores SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'country'
);
SELECT add_compression_policy('risk_scores', INTERVAL '7 days');

-- Add retention policy (drop data older than 2 years)
SELECT add_retention_policy('risk_scores', INTERVAL '2 years');

-- Query: risk score trend with gap-filling
SELECT
    time_bucket_gapfill('1 hour', time) AS bucket,
    country,
    AVG(risk_score) AS avg_risk,
    locf(AVG(risk_score)) AS filled_risk  -- last observation carried forward
FROM risk_scores
WHERE country = 'UA'
  -- gapfill needs the time range bounded on both sides
  AND time > NOW() - INTERVAL '30 days'
  AND time < NOW()
GROUP BY bucket, country
ORDER BY bucket;
```
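To make the gap-filling semantics concrete, here is a small stdlib-only Python sketch of last-observation-carried-forward, the behaviour TimescaleDB's `locf()` applies to empty buckets (the helper is illustrative, not TimescaleDB code):

```python
# Sketch: last-observation-carried-forward, as applied by locf()
# to hourly buckets that contain no readings.

def locf(values: list) -> list:
    """Replace None gaps with the most recent non-None value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

hourly_avg = [0.71, None, None, 0.74, None, 0.69]  # None = empty bucket
print(locf(hourly_avg))  # [0.71, 0.71, 0.71, 0.74, 0.74, 0.69]
```

Note that leading gaps stay empty: there is no earlier observation to carry forward until the first reading arrives.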
Category 4: Analytical Databases (OLAP)
Best for: Business intelligence, ad-hoc analysis, aggregations over billions of rows.
OLAP Comparison: BigQuery vs Redshift vs Snowflake vs Databricks
| Feature | BigQuery | Redshift | Snowflake | Databricks SQL |
|---|---|---|---|---|
| Pricing | Per query (TB scanned) | Per cluster-hour | Per credit | Per DBU |
| Serverless | Yes (fully) | Serverless option | Yes (serverless) | Serverless option |
| Storage format | Capacitor (proprietary) | Parquet/ORC | FDN (proprietary) | Delta Lake |
| ML integration | BigQuery ML | Redshift ML | Snowpark ML | MLflow native |
| Streaming ingest | BigQuery Storage Write | Kinesis Firehose | Snowpipe | Autoloader |
| Git integration | Limited | Limited | Limited | Native (DBX) |
| Max query result | 10 GB | Unlimited | Unlimited | Unlimited |
| Time travel | 7 days | 5 days | 90 days | 30 days (Delta) |
Cost Optimisation Strategies
BigQuery:
```sql
-- Partition tables to reduce bytes scanned
CREATE TABLE harbinger_prod.events.geopolitical
PARTITION BY DATE(event_date)
CLUSTER BY country_code, event_type
AS SELECT * FROM staging.raw_events;

-- Use partition filters in queries (avoid full table scans)
SELECT country_code, COUNT(*) AS event_count
FROM harbinger_prod.events.geopolitical
WHERE DATE(event_date) BETWEEN '2024-01-01' AND '2024-03-31'  -- partition pruning
GROUP BY country_code;
```
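Partition pruning translates directly into cost on BigQuery's on-demand model, which bills per byte scanned. A rough estimator, assuming the published on-demand rate of $6.25 per TiB at the time of writing (verify against current pricing before budgeting):

```python
# Sketch: estimating BigQuery on-demand query cost from bytes scanned.
# Rate assumed: $6.25 per TiB (check current pricing for your region).

TIB = 1024**4

def query_cost_usd(bytes_scanned: int, rate_per_tib: float = 6.25) -> float:
    return bytes_scanned / TIB * rate_per_tib

full_scan = query_cost_usd(2 * TIB)        # unpartitioned: scan all 2 TiB
pruned    = query_cost_usd(50 * 1024**3)   # one quarter's partitions: 50 GiB
print(f"${full_scan:.2f} vs ${pruned:.2f}")  # $12.50 vs $0.31
```

The figures are illustrative, but the ratio is the point: pruning a year-sized table down to one quarter's partitions cuts the bill by the same factor as the bytes scanned.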
Category 5: Vector Databases
Best for: Semantic search, RAG (retrieval-augmented generation), recommendation systems, similarity matching.
Vector Database Comparison
| Database | Indexing | Max Vectors | Filtering | Cloud Managed |
|---|---|---|---|---|
| Pinecone | HNSW | Billions | Yes (metadata) | Yes |
| Weaviate | HNSW + BM25 | Billions | Yes | Yes (Cloud) |
| Qdrant | HNSW | Billions | Yes | Yes (Cloud) |
| pgvector | HNSW / IVFFlat | ~10M (practical) | Full SQL | Via PostgreSQL |
| Chroma | HNSW | Millions | Yes | Self-hosted |
| Milvus | Multiple | Billions | Yes | Zilliz Cloud |
pgvector for Intelligence Platforms
```sql
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Store article embeddings alongside metadata
CREATE TABLE article_embeddings (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    article_id TEXT NOT NULL,
    title      TEXT NOT NULL,
    country    TEXT,
    event_date DATE,
    embedding  vector(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest-neighbour search
CREATE INDEX ON article_embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Semantic search: find similar geopolitical events
SELECT
    article_id,
    title,
    country,
    event_date,
    1 - (embedding <=> $1::vector) AS similarity
FROM article_embeddings
WHERE country = 'UA'            -- pre-filter by country
  AND event_date > '2024-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 20;
```
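The `<=>` operator used with `vector_cosine_ops` computes cosine distance, so the query derives similarity as 1 minus the distance. A stdlib-only sketch of the same metric:

```python
# Sketch: cosine distance, the metric behind pgvector's <=> operator
# when the index uses vector_cosine_ops. Similarity = 1 - distance.
import math

def cosine_distance(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

q   = [1.0, 0.0, 0.0]
doc = [1.0, 1.0, 0.0]
print(round(1 - cosine_distance(q, doc), 4))  # similarity: 0.7071
```

Identical vectors have distance 0 (similarity 1); orthogonal vectors have distance 1 (similarity 0), which is why `ORDER BY embedding <=> $1` returns the closest matches first.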
The Polyglot Persistence Pattern
Most production platforms use multiple database types — the key is matching each store to its strengths:
(Diagram: polyglot persistence architecture, one store per access pattern)
Decision Checklist
Use this checklist before committing to a database:
Workload characteristics
- Primary access pattern documented (point lookup / range scan / aggregation)
- Expected read/write ratio
- Peak QPS estimates (p50, p99)
- Data volume at launch and 2-year projection
Consistency requirements
- Do you need ACID transactions? (choose RDBMS)
- Can you tolerate eventual consistency? (enables NoSQL/distributed)
- Do you need multi-region writes? (CockroachDB, DynamoDB Global Tables)
Operational
- Managed service vs. self-hosted?
- Team familiarity with query language?
- Migration path if requirements change?
- Cost model understood (per query vs. per hour vs. per row)?
Conclusion
The right cloud database decision is always contextual. Resist the temptation to standardise on one database for everything — the performance gap between an appropriately chosen store and a general-purpose one can be 10–100x.
Platforms processing diverse data types — like Harbinger Explorer, which correlates news events, risk scores, time-series indicators, and semantic signals — often run 3–5 different database types in production, each optimised for its specific access pattern.
Try Harbinger Explorer free for 7 days — built on a polyglot persistence architecture designed for intelligence-grade performance.