Harbinger Explorer

Back to Knowledge Hub
Engineering

Slowly Changing Dimensions Guide

10 min read·Tags: scd, dimensional-modeling, data-warehouse, dbt, etl, slowly-changing-dimensions, data-engineering

Slowly Changing Dimensions Guide

Dimension tables describe the "who, what, where" of your data — customers, products, employees, locations. The problem is that reality changes: customers move, products get renamed, employees change departments. How you handle those changes in your warehouse determines whether historical reports tell the truth or quietly lie.

Slowly Changing Dimensions (SCDs) are the set of techniques for managing those changes in dimension tables. The concept comes from Ralph Kimball's dimensional modeling methodology and remains the most practical framework for data warehouse design. This guide covers SCD Types 1, 2, 3, and 4 with concrete SQL examples and honest trade-offs for each.

Quick Reference

TypeStrategyHistorical DataStorage CostComplexity
Type 1OverwriteLostLowLow
Type 2Add a new rowPreserved (full)HighHigh
Type 3Add a columnPreserved (limited)MediumMedium
Type 4History tablePreserved (full)Medium-HighMedium

The Setup: A Customer Dimension

Throughout this guide, we'll use a dim_customer table. A customer starts in Berlin, then moves to Munich. Their loyalty tier gets upgraded from Silver to Gold. How does each SCD type handle these changes?

Initial state:

customer_keycustomer_idnamecityloyalty_tierupdated_at
1C-100Anna MüllerBerlinSilver2024-01-10 09:00:00

SCD Type 1: Overwrite

Type 1 simply overwrites the old value with the new one. No history is kept. The dimension always reflects current state.

-- PostgreSQL: SCD Type 1 — overwrite on change
UPDATE dim_customer
SET
    city          = 'Munich',
    loyalty_tier  = 'Gold',
    updated_at    = NOW()
WHERE customer_id = 'C-100';

After the update:

customer_keycustomer_idnamecityloyalty_tierupdated_at
1C-100Anna MüllerMunichGold2024-03-15 14:30:00

The Berlin record is gone. Any historical fact table rows that joined to this dimension will now incorrectly show Anna as a Munich customer for her entire history — including orders placed when she was in Berlin.

When to use Type 1:

  • The old value is genuinely meaningless (e.g., correcting a data entry error)
  • Analysts have explicitly agreed that historical accuracy for this attribute doesn't matter
  • Storage is very constrained

When not to use Type 1:

  • For any attribute where "what was true at the time of the transaction" matters
  • For compliance-sensitive data that requires an audit trail

SCD Type 2: Add a New Row

Type 2 is the workhorse of dimensional modeling. Instead of overwriting, you expire the old row and insert a new row for the updated state. Each row gets effective and expiry date columns plus a boolean is_current flag.

-- PostgreSQL: SCD Type 2 — expire old row, insert new row

-- Step 1: Expire the current row
UPDATE dim_customer
SET
    valid_to    = '2024-03-15',
    is_current  = FALSE
WHERE customer_id = 'C-100'
  AND is_current = TRUE;

-- Step 2: Insert the new current row
INSERT INTO dim_customer
    (customer_key, customer_id, name, city, loyalty_tier,
     valid_from, valid_to, is_current, updated_at)
VALUES
    (DEFAULT, 'C-100', 'Anna Müller', 'Munich', 'Gold',
     '2024-03-15', '9999-12-31', TRUE, NOW());

After the update:

customer_keycustomer_idnamecityloyalty_tiervalid_fromvalid_tois_current
1C-100Anna MüllerBerlinSilver2024-01-102024-03-15FALSE
2C-100Anna MüllerMunichGold2024-03-159999-12-31TRUE

Now a fact table row from January correctly joins to customer_key = 1 (Berlin, Silver). A fact row from April correctly joins to customer_key = 2 (Munich, Gold). History is preserved exactly.

Querying SCD Type 2

For current-state queries, filter on is_current = TRUE. For point-in-time queries, join on the date range:

-- PostgreSQL: point-in-time join against SCD Type 2 dimension
SELECT
    f.order_id,
    f.order_date,
    f.amount_usd,
    c.city          AS customer_city_at_time,
    c.loyalty_tier  AS loyalty_tier_at_time
FROM fact_orders f
JOIN dim_customer c
    ON  f.customer_id = c.customer_id
    AND f.order_date BETWEEN c.valid_from AND c.valid_to
WHERE f.order_date >= '2024-01-01';

When to use Type 2:

  • Attributes where historical accuracy matters (geography, tier, status)
  • Regulatory or compliance requirements for audit trails
  • Customer journey and cohort analyses

Trade-offs:

  • Dimension table grows with every change — at scale this can be substantial
  • Queries must account for multiple rows per entity
  • ETL logic is more complex to implement correctly
  • Fact table foreign keys point to surrogate keys, not natural keys

Type 2 in dbt: Snapshots

dbt has first-class support for SCD Type 2 via its snapshot feature. Define the strategy in a snapshot file and dbt handles the expire/insert logic automatically:

# dbt snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

SELECT
    customer_id,
    name,
    city,
    loyalty_tier,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}

Run dbt snapshot and dbt generates the valid_from, valid_to, dbt_scd_id, and dbt_is_current columns automatically. This eliminates the most error-prone parts of SCD Type 2 implementation.

SCD Type 3: Add a Column

Type 3 preserves a limited amount of history by adding a "previous value" column alongside the current value. It's a middle ground — you get one generation of history without the complexity of multiple rows.

-- PostgreSQL: SCD Type 3 — add previous_city column
ALTER TABLE dim_customer
    ADD COLUMN previous_city VARCHAR(100),
    ADD COLUMN city_changed_at TIMESTAMPTZ;

-- Update: shift current to previous, set new current
UPDATE dim_customer
SET
    previous_city   = city,
    city            = 'Munich',
    city_changed_at = NOW(),
    updated_at      = NOW()
WHERE customer_id = 'C-100'
  AND is_current = TRUE;

After the update:

customer_keycustomer_idnamecityprevious_cityloyalty_tier
1C-100Anna MüllerMunichBerlinGold

When to use Type 3:

  • You need to support "compare current vs. previous" reporting without full history
  • The attribute changes rarely and analysts care only about one historical value
  • You want to avoid the complexity of Type 2

When not to use Type 3:

  • When attributes change more than twice (the pattern breaks down — you'd need previous_previous_city)
  • When point-in-time accuracy across arbitrary dates is required

Type 3 is underused in practice. It solves a specific reporting pattern cleanly, but it doesn't generalize.

SCD Type 4: History Table

Type 4 separates current state from history entirely. The main dimension table holds only the current row (like Type 1), while a separate history table captures every previous version.

-- PostgreSQL: SCD Type 4 — separate history table

-- Current state table (always one row per entity)
CREATE TABLE dim_customer_current (
    customer_key BIGSERIAL PRIMARY KEY,
    customer_id  VARCHAR(20) NOT NULL UNIQUE,
    name         VARCHAR(200),
    city         VARCHAR(100),
    loyalty_tier VARCHAR(20),
    updated_at   TIMESTAMPTZ
);

-- History table (one row per historical version)
CREATE TABLE dim_customer_history (
    history_key  BIGSERIAL PRIMARY KEY,
    customer_id  VARCHAR(20) NOT NULL,
    name         VARCHAR(200),
    city         VARCHAR(100),
    loyalty_tier VARCHAR(20),
    valid_from   TIMESTAMPTZ NOT NULL,
    valid_to     TIMESTAMPTZ NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);

-- On change: archive current to history, then update current
BEGIN;

INSERT INTO dim_customer_history
    (customer_id, name, city, loyalty_tier, valid_from, valid_to)
SELECT customer_id, name, city, loyalty_tier, updated_at, NOW()
FROM dim_customer_current
WHERE customer_id = 'C-100';

UPDATE dim_customer_current
SET city = 'Munich', loyalty_tier = 'Gold', updated_at = NOW()
WHERE customer_id = 'C-100';

COMMIT;

Advantages of Type 4:

  • The current-state table stays small and fast for the common case
  • History is isolated and can be queried or archived independently
  • Query performance for non-historical use cases is much better than Type 2

Disadvantages:

  • Queries needing both current and historical data require joins across two tables
  • More complex ETL — changes must be coordinated across two tables atomically

When to use Type 4:

  • Very large dimensions where Type 2 would cause significant performance issues
  • When most queries only need current state and historical queries are rare
  • When you want clear physical separation between operational and historical data

Choosing the Right Type

This decision framework covers most real-world cases:

  1. "Was the old value simply wrong?" → Type 1 (error correction)
  2. "Does historical accuracy matter for this attribute?" → Type 2
  3. "Do we only need to compare current vs. one prior state?" → Type 3
  4. "Is the dimension huge and most queries need only current data?" → Type 4
  5. "Do we need both performance and full history?" → Type 4 (hybrid: current table for fast queries, history table for point-in-time)

In practice, most dimension tables use Type 2 for slowly changing attributes like geography, status, and category — and Type 1 for attributes where corrections are the norm and history is irrelevant.

What About Type 0, 5, 6, and 7?

Kimball defined additional types, though they see limited real-world adoption:

  • Type 0: Never changes (fixed attributes like date of birth)
  • Type 5: Type 1 + a mini-dimension for high-cardinality attributes
  • Type 6: Combined Type 1 + 2 + 3 (current value in both the current row and a "current_X" column on historical rows)
  • Type 7: Dual foreign keys — one to Type 1 (current) and one to Type 2 (historical) — allowing both query styles against the same fact table

Types 6 and 7 are worth knowing if you're working with complex dimensional models, but for most teams Type 2 (with dbt snapshots) covers 90% of use cases.

Exploring SCDs Without a Warehouse

If you're working with SCD data that lands in flat files — exported snapshots, dbt snapshot CSV outputs, or archive dumps — Harbinger Explorer lets you query them directly in the browser with DuckDB WASM. You can upload multiple CSV files representing different snapshot dates and run SQL joins across them to reconstruct point-in-time views, without loading a full warehouse environment.

Conclusion

SCD Type 2 is the default for most dimension attributes where history matters. It preserves point-in-time accuracy, pairs cleanly with dbt snapshots, and gives analysts the full picture for cohort and journey analysis. Type 1 is right for corrections. Types 3 and 4 solve specific edge cases. The key is making a deliberate choice per attribute — not defaulting to overwrite because it's simpler.

For the broader context of where dimension tables fit in your data architecture, read Medallion Architecture Explained and Data Lakehouse Architecture Explained.

Continue Reading


Continue Reading

Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial

Command Palette

Search for a command to run...