Data Vault Modeling: Hubs, Links, and Satellites Explained
Your star schema looked great at launch. Eighteen months later, you've got 14 source systems feeding it, three teams arguing about who owns the customer dimension, and a change request for a new source that requires restructuring four fact tables. Kimball is fine for stable domains. It wasn't designed for this.
Data Vault is a modeling methodology built specifically for enterprise-scale, multi-source data warehouses where source systems change frequently and auditability is non-negotiable. Here's what it is, how it works, and when to use it.
TL;DR
| Dimension | Kimball (Star Schema) | Data Vault |
|---|---|---|
| Query performance | High (out of the box) | Medium (requires business vault/mart layer) |
| Source system changes | Disruptive | Additive — no existing tables modified |
| Auditability | Limited | Full load date + record source on every row |
| Parallel loading | Moderate | High — tables are independent |
| Learning curve | Low | High |
| Best for | Stable domains, BI-first | Multi-source, regulated, agile environments |
The Three Core Constructs
Data Vault has three fundamental building blocks. Once you understand these, the methodology clicks.
1. Hubs — Business Keys
A Hub represents one business entity (Customer, Product, Order) and stores only the business key — the identifier that the business uses, independent of any source system.
-- Data Vault 2.0 Hub: Customer
-- (Standard SQL / Snowflake dialect)
CREATE TABLE hub_customer (
    hub_customer_hk VARCHAR(32)  NOT NULL, -- MD5 hash of business key (use VARCHAR(64) for SHA-256)
    customer_id     VARCHAR(50)  NOT NULL, -- The raw business key
    load_date       TIMESTAMP    NOT NULL, -- When this record was first loaded
    record_source   VARCHAR(100) NOT NULL, -- Which system sourced this key
    CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);
Rules:
- Hubs never change once a row is inserted (insert-only)
- The hash key is deterministic: same business key -> same hash, always
- No descriptive attributes — those belong in Satellites
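Because Hubs are insert-only, the load pattern is effectively an anti-join: stage the business keys, hash them, and insert only the keys the Hub has not seen. A minimal Python sketch of that pattern (an in-memory dict stands in for the table, and `load_hub`/`hub_hk` are illustrative names, not part of any library):

```python
import hashlib
from datetime import datetime, timezone

def hub_hk(business_key: str) -> str:
    # Deterministic hash key: uppercase, trim, MD5 (see hash key section)
    return hashlib.md5(str(business_key).upper().strip().encode("utf-8")).hexdigest()

def load_hub(hub_rows: dict, staged_keys, record_source: str) -> int:
    """Insert-only hub load: one row per business key not already present.

    hub_rows maps hub_customer_hk -> row dict, standing in for the table.
    Existing rows are never updated or deleted. Returns rows inserted.
    """
    inserted = 0
    now = datetime.now(timezone.utc)
    for key in staged_keys:
        hk = hub_hk(key)
        if hk not in hub_rows:  # anti-join: skip keys the hub already has
            hub_rows[hk] = {
                "customer_id": str(key).strip(),
                "load_date": now,
                "record_source": record_source,
            }
            inserted += 1
    return inserted
```

Note that re-running the same batch, or loading the same key from a second source, inserts nothing: the hub row keeps the load_date and record_source of whichever system saw the key first.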
2. Links — Relationships
A Link represents a relationship between two or more Hubs. An order placed by a customer? That's a Link between Hub_Customer and Hub_Order.
-- Data Vault 2.0 Link: Order placed by Customer
CREATE TABLE link_customer_order (
    link_customer_order_hk VARCHAR(32)  NOT NULL, -- Hash of all parent hub keys
    hub_customer_hk        VARCHAR(32)  NOT NULL, -- FK -> hub_customer
    hub_order_hk           VARCHAR(32)  NOT NULL, -- FK -> hub_order
    load_date              TIMESTAMP    NOT NULL,
    record_source          VARCHAR(100) NOT NULL,
    CONSTRAINT pk_link_customer_order PRIMARY KEY (link_customer_order_hk)
);
Rules:
- Links are also insert-only
- The link hash key is derived from all participating hub keys
- Links capture only the fact of a relationship (this order belongs to this customer); descriptive attributes about the relationship go in a Satellite hanging off the Link
3. Satellites — Descriptive Attributes
Satellites store all the descriptive, changing data around a Hub or Link. Customer name, email, address — all in a Satellite hanging off Hub_Customer.
-- Data Vault 2.0 Satellite: Customer attributes from CRM
CREATE TABLE sat_customer_crm (
    hub_customer_hk VARCHAR(32)  NOT NULL,
    load_date       TIMESTAMP    NOT NULL,
    load_end_date   TIMESTAMP,             -- NULL = current record
    hash_diff       VARCHAR(32)  NOT NULL, -- Hash of all attribute values
    first_name      VARCHAR(100),
    last_name       VARCHAR(100),
    email           VARCHAR(200),
    country         VARCHAR(50),
    record_source   VARCHAR(100) NOT NULL,
    CONSTRAINT pk_sat_customer_crm PRIMARY KEY (hub_customer_hk, load_date)
);
Rules:
- Satellites are the only tables that track history (type-2 SCD by default)
- Each source system gets its own Satellite — CRM customer data never mixes with ERP customer data
- hash_diff enables efficient change detection on load
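The hash_diff rule works like this: compute a hash over the staged attribute values and insert a new Satellite row only when it differs from the current row's hash_diff. A hedged Python sketch (function names are illustrative; a real loader would fix the column order explicitly rather than sorting keys, which is done here only for determinism):

```python
import hashlib
from typing import Optional

def hash_diff(attributes: dict) -> str:
    # Hash of all attribute values, NULL-safe, in a fixed column order
    canonical = "||".join(
        (attributes.get(col) or "").strip().upper()
        for col in sorted(attributes)  # deterministic column order
    )
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def needs_new_sat_row(current_hash_diff: Optional[str], staged: dict) -> bool:
    # Insert only when attributes changed, or when no row exists yet
    return current_hash_diff is None or hash_diff(staged) != current_hash_diff
```

The point of the precomputed hash is that the loader compares one column instead of every attribute, which matters when Satellites are wide.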
The Data Vault Architecture
[Architecture diagram: source systems feed Hubs, Links, and Satellites in the Raw Vault, with the Business Vault and Information Marts layered on top.]
① Source systems load independently into the Raw Vault — no cross-system dependencies. ② The Business Vault adds computed attributes and point-in-time structures. ③ Information Marts expose optimized dimensional models for BI tools.
Point-in-Time (PIT) Tables
Multiple Satellites hanging off one Hub create a performance problem: joining across five Satellites to reconstruct a customer record at a point in time requires complex SQL. PIT tables solve this.
-- PIT table: Customer as of a specific date
-- (Standard SQL / Snowflake dialect)
CREATE TABLE pit_customer AS
SELECT
    snap.snapshot_date,
    h.hub_customer_hk,
    -- For each satellite, get the latest load_date <= snapshot_date
    MAX(CASE WHEN s_crm.load_date <= snap.snapshot_date THEN s_crm.load_date END)
        AS sat_crm_load_date,
    MAX(CASE WHEN s_erp.load_date <= snap.snapshot_date THEN s_erp.load_date END)
        AS sat_erp_load_date
FROM hub_customer h
CROSS JOIN (
    SELECT DISTINCT CAST(load_date AS DATE) AS snapshot_date
    FROM sat_customer_crm
) snap
LEFT JOIN sat_customer_crm s_crm
    ON  h.hub_customer_hk = s_crm.hub_customer_hk
    AND s_crm.load_date <= snap.snapshot_date
LEFT JOIN sat_customer_erp s_erp
    ON  h.hub_customer_hk = s_erp.hub_customer_hk
    AND s_erp.load_date <= snap.snapshot_date
GROUP BY snap.snapshot_date, h.hub_customer_hk;
PIT tables are precomputed and rebuilt on each load. They make downstream Satellite joins trivial — one join per Satellite using the PIT-resolved load_date.
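Per satellite, the MAX(CASE ...) expression above reduces to "the latest load_date at or before the snapshot date". The same resolution step, isolated in Python for illustration (`pit_resolve` is an illustrative name):

```python
from datetime import date
from typing import List, Optional

def pit_resolve(sat_load_dates: List[date], snapshot: date) -> Optional[date]:
    # Latest satellite load_date at or before the snapshot date,
    # or None if the satellite had no row yet at that point in time
    eligible = [d for d in sat_load_dates if d <= snapshot]
    return max(eligible) if eligible else None
```

A None here is exactly the LEFT JOIN case in the SQL: the hub key existed before that satellite ever recorded an attribute row for it.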
Hash Key Generation
Hash keys must be deterministic and consistent across all loads. In practice, teams use either MD5 (128-bit) or SHA-256 (256-bit) on the upper-cased, trimmed business key.
# Python — deterministic hash key generation
import hashlib

def generate_hub_hk(business_key: str) -> str:
    # Normalize: uppercase, strip whitespace
    normalized = str(business_key).upper().strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def generate_link_hk(*hub_hks: str) -> str:
    # Concatenate sorted hub keys to ensure order-independence
    combined = "||".join(sorted(hub_hks))
    return hashlib.md5(combined.encode("utf-8")).hexdigest()

# Example
customer_hk = generate_hub_hk("CUST-00123")  # consistent hash every time
order_hk = generate_hub_hk("ORD-98765")
link_hk = generate_link_hk(customer_hk, order_hk)
Some teams prefer SHA-256 to reduce collision risk in very large datasets. Either works — just be consistent across your entire vault.
Data Vault 2.0 Additions
Dan Linstedt's Data Vault 2.0 (published ~2013) added several constructs beyond the original methodology:
| Construct | Purpose |
|---|---|
| Reference Tables | Shared lookup tables (country codes, status values) used across the vault |
| Same-As Links (SAL) | Map duplicate business keys discovered after load (deduplication without modifying Hubs) |
| Computed Satellites | Business-rule-derived data computed from Raw Vault data |
| Effectivity Satellites | Track when a Link relationship was actually in effect, not just when it was loaded |
| Business Vault | Computed, joined, rule-applied layer on top of Raw Vault |
For most teams starting out, focus on Hubs, Links, and Satellites first. Add PIT tables when query performance becomes an issue. The other constructs come later.
Why Data Vault for Enterprise?
1. Auditability is built in. Every row carries load_date and record_source. You can always answer "what did our CRM say about customer X on March 3rd, 2022 at 14:32?" — a question most Kimball star schemas simply cannot answer.
2. Source systems can be added without modifying existing tables. New ERP system? Add a new Satellite to Hub_Customer. Existing pipelines are untouched. This is the core architectural advantage.
3. Parallel loading. Hubs, Links, and Satellites have no load-order dependencies on each other (within a load cycle). This enables highly parallelized ETL.
4. Handles conflicting data gracefully. CRM says a customer lives in Germany. ERP says France. In a star schema, one source wins. In Data Vault, both live in separate Satellites — unresolved conflict is explicit, not hidden.
When Data Vault Is NOT the Right Choice
- Small, stable domains with 1-2 source systems — the overhead isn't justified. Use Kimball.
- Teams without dedicated data engineering resources — Data Vault requires disciplined tooling and modeling expertise.
- Ad-hoc analytics platforms — If your primary output is exploratory dashboards, the Information Mart layer overhead adds friction without benefit.
- Time to first dashboard matters most — You can't query a Raw Vault directly in most BI tools without a mart layer. Add 4-8 weeks of extra work to get there.
Tooling
Most modern Data Vault implementations use dbt for the transformation layer. Libraries like datavault4dbt and AutomateDV (formerly dbtvault) provide macros that generate Hub/Link/Satellite loads from YAML configuration, dramatically reducing boilerplate.
See What Is dbt? for a primer on dbt if you're not already using it.
Common Mistakes
1. Putting business logic in the Raw Vault. The Raw Vault is source-faithful — load what the source gives you, transformation-free. Business rules go in the Business Vault or mart layer.
2. Modeling relationships as Hub attributes. "Customer has a preferred_region foreign key" — that relationship belongs in a Link, not a Satellite attribute.
3. One giant Satellite per Hub. Split Satellites by source system and by rate-of-change. Attributes that change frequently (session data) shouldn't be in the same Satellite as slowly-changing attributes (customer name) — it causes excessive history rows.
4. Inconsistent hash key generation. If your Python loader and your dbt macros generate different hashes for the same business key, you'll silently create duplicate Hubs. Standardize hash generation in a shared library.
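One cheap guard against the fourth mistake is to pin the normalization contract in a shared test suite, so any loader that drifts (Python, dbt macro, Spark job) fails CI instead of silently forking hub keys. A sketch, assuming the MD5-based normalization shown in the hash key section:

```python
import hashlib

def hub_hk(business_key: str) -> str:
    # The one shared definition: uppercase + trim + MD5, identical everywhere
    return hashlib.md5(str(business_key).upper().strip().encode("utf-8")).hexdigest()

def test_normalization_contract():
    # Case-insensitivity and whitespace-trimming are the contract;
    # if any loader changes either, these assertions break the build
    assert hub_hk("cust-00123") == hub_hk("CUST-00123")
    assert hub_hk(" CUST-00123 ") == hub_hk("CUST-00123")
    assert hub_hk("CUST-00123") != hub_hk("CUST-00124")
```

In a dbt-based stack you would pin the macro output the same way, for example with a seed of known keys and expected hashes compared on every run.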
The Bottom Line
Data Vault trades query simplicity for architectural resilience. For enterprise data warehouses with many source systems, frequent changes, and regulatory audit requirements, that trade is almost always worth it.
For smaller, stable domains: use Kimball. For large, evolving, multi-source enterprise DWHs: Data Vault is the methodology that was actually designed for your problem.
Next step: If you're building a Data Vault on a modern lakehouse, read Medallion Architecture Explained to see how the Raw/Business/Mart layers map onto Bronze/Silver/Gold.
Continue Reading
- What Is dbt? The Tool That Changed Data Transformation
- Slowly Changing Dimensions: A Practical Guide
- Data Lakehouse Architecture Explained