Surrogate vs Natural Keys: When to Use Which
Surrogate vs Natural Keys: When to Use Which
You inherited a data warehouse where every dimension table uses the source system's customer_code as its primary key. Then the source system gets migrated and codes are reassigned. Now you're rebuilding half your model. This is the natural key problem.
But surrogate keys have their own failure modes. The debate between surrogate and natural keys is older than most modern data stacks — and the right answer genuinely depends on your context.
TL;DR
| Natural Key | Surrogate Key | |
|---|---|---|
| Definition | A key that exists in the real world (email, SSN, product code) | A system-generated identifier with no business meaning |
| Examples | customer_email, ISIN, order_reference | Auto-increment INT, UUID, hash key |
| Stability | Business-controlled, can change | System-controlled, never changes |
| Debuggability | High — humans can read it | Low — must join back to source |
| Join performance | Depends on data type and cardinality | INT/BIGINT is fast; UUID is slower |
| Recommended for | Lookup tables, reference data, small dimensions | Fact tables, large dimensions, SCD tracking |
Natural Keys: What They Are and Where They Break
A natural key is a column (or combination of columns) that uniquely identifies a record using real-world meaning. An ISIN identifies a financial instrument. An email address identifies a user. A product barcode identifies a product.
Where natural keys work well
-- PostgreSQL: ISO currency codes as natural key
-- Currencies don't change their codes. USD is USD. EUR is EUR.
CREATE TABLE currencies (
currency_code CHAR(3) PRIMARY KEY, -- natural key: ISO 4217
currency_name TEXT NOT NULL,
decimal_places SMALLINT
);
-- Safe join — currency_code is stable and universally understood
SELECT t.amount, c.currency_name
FROM transactions t
JOIN currencies c ON t.currency_code = c.currency_code;
ISO codes, country codes, standard enums — these are genuinely stable. Natural keys here add clarity without risk.
Where natural keys break
Natural keys fail when the business controls the value and the business changes:
- Reuse — A user deletes their account. A new user registers with the same email. Your history is now contaminated.
- Correction — A customer's national ID was entered with a typo. Fixing it cascades through every foreign key that referenced the old value.
- Merge events — Two companies merge. Their
company_idvalues must be consolidated. Every downstream table breaks. - Multi-source integration — System A uses
USER_001; System B usesU-00001. They refer to the same person. Neither is a reliable primary key across the integrated model.
-- PostgreSQL: the multi-source problem
-- You can't use natural keys when two systems have different ID schemes
-- for the same real-world entity
-- Source A
INSERT INTO customers VALUES ('USER_001', 'Alice', 'alice@example.com');
-- Source B (same person, different system)
INSERT INTO customers VALUES ('U-00001', 'Alice', 'alice@example.com');
-- Which one is the primary key? Neither.
-- You need a surrogate to represent the unified entity.
Surrogate Keys: System-Generated, Business-Agnostic
A surrogate key is a meaningless identifier created by the data system. The most common forms:
- Auto-increment integer (
SERIAL,IDENTITY) — fast, compact, sequential - UUID — globally unique, safe for distributed generation, but larger
- Hash key (used in Data Vault) — derived from natural key attributes, deterministic
-- PostgreSQL: surrogate key with IDENTITY
CREATE TABLE customers (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate
customer_id TEXT UNIQUE NOT NULL, -- natural key, still stored!
customer_name TEXT,
email TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
Note that customer_id (the natural key) is still stored as a regular column — not as the primary key. This is the standard pattern: use the surrogate as the join key, keep the natural key for lookups and debugging.
Surrogate keys in SCD Type 2
Slowly Changing Dimensions (SCD Type 2) are impossible without surrogate keys. When a customer changes their address, you insert a new version of the row with a new surrogate key — the old row and new row have the same natural key but different surrogate keys.
-- PostgreSQL: SCD Type 2 with surrogate key
CREATE TABLE dim_customer (
customer_sk BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id TEXT NOT NULL, -- natural key (same across versions)
customer_name TEXT,
city TEXT,
valid_from DATE NOT NULL,
valid_to DATE, -- NULL means current record
is_current BOOLEAN DEFAULT TRUE
);
-- Customer moves from Berlin to Hamburg: new row, same customer_id
INSERT INTO dim_customer (customer_id, customer_name, city, valid_from, valid_to, is_current)
VALUES
('CUST-042', 'Markus Bauer', 'Berlin', '2023-01-01', '2024-06-30', FALSE),
('CUST-042', 'Markus Bauer', 'Hamburg', '2024-07-01', NULL, TRUE);
Without the surrogate customer_sk, your fact table references would break every time a customer record is versioned.
UUID vs Auto-Increment Integer
Both are surrogate keys, but they behave differently at scale.
| Auto-increment (BIGINT) | UUID (v4) | |
|---|---|---|
| Size | 8 bytes | 16 bytes |
| Join performance | Fast (integer comparison) | Slower (string/byte comparison) |
| Index fragmentation | Low (sequential inserts) | High (random inserts) |
| Distributed generation | Requires coordination | No coordination needed |
| Debuggability | Low | Slightly better (globally unique) |
| Best for | Single-writer systems | Distributed / multi-source systems |
UUID v7 (time-ordered UUIDs) is increasingly popular as it combines global uniqueness with sequential insertion order, reducing index fragmentation.
Hash Keys in Data Vault
Data Vault 2.0 uses hash keys — a specific type of surrogate key derived from the natural key using a deterministic hash function (typically SHA-256 or MD5). The hash is computed at load time and used as the Hub primary key.
# Python: compute a Data Vault hash key
import hashlib
def compute_hash_key(business_key: str) -> str:
# Deterministic hash key for Data Vault Hub
normalized = business_key.strip().upper() # normalize before hashing
return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
# Example
customer_hk = compute_hash_key("CUST-042")
print(customer_hk)
# => 3e7c... (same every time for the same input)
The advantage: hash keys are deterministic across source systems. If System A and System B both load CUST-042, they generate the same hash key — enabling integration without a central sequence generator.
Decision Guide
Use this as a starting point, not a rigid rule:
| Situation | Recommendation |
|---|---|
| Reference/lookup table with stable codes (ISO, enum) | Natural key is fine |
| Fact table in a star schema | Surrogate key (integer) |
| SCD Type 2 dimension | Surrogate key — required |
| Multi-source integration (Data Vault or similar) | Hash key |
| Distributed microservices generating IDs | UUID (v4 or v7) |
| Small config table with < 1000 rows | Natural key is fine |
| Customer/user entity in a transactional DB | Surrogate + natural key stored as separate column |
The Worst of Both Worlds: Composite Natural Keys
A composite natural key combines multiple columns to achieve uniqueness — for example, (order_date, order_number). These are the most fragile choice:
- Every foreign key must carry all columns
- Changes to any component cascade everywhere
- Joins become multi-column, which is harder to index and read
If you inherit composite natural keys, replace them with surrogates at the first integration layer. Store the originals as non-key columns.
Practical Recommendation
In most analytical data models: surrogate keys as primary keys, natural keys stored as attributes. This decouples your model from upstream systems and makes SCD tracking straightforward.
The one exception: stable reference tables (ISO currencies, country codes, calendar dates). Natural keys there are actually clearer and safer than arbitrary integers.
If you're auditing an existing model and need to inspect key distribution, uniqueness, or join fan-outs across tables, Harbinger Explorer lets you query your CSV or Parquet exports directly in the browser — useful for quick ad-hoc profiling without a dedicated BI tool.
Wrapping Up
Neither surrogate nor natural keys are universally correct. Natural keys add semantic clarity but tie your model to business logic that can change. Surrogate keys decouple structure from meaning but add indirection. The pattern that scales best: keep both — surrogate as the join key, natural key as a queryable attribute.
The real mistake is treating this as a one-time decision. As your data model evolves, revisit key strategy whenever you add a new source or hit a stability problem in an existing dimension.
Continue Reading
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.
Try Harbinger Explorer for free
Connect any API, upload files, and explore with AI — all in your browser. No credit card required.
Start Free Trial