Change Data Capture Explained

10 min read · Tags: cdc, debezium, kafka, data-engineering, streaming, postgres, real-time

Your operational database doesn't care about your data warehouse. Every time a customer updates their profile, cancels an order, or changes an email address, that state change lives in PostgreSQL or MySQL — and your analytics layer is immediately out of date. Change Data Capture (CDC) solves this by detecting and streaming row-level changes from a source database to downstream systems in near-real time.

This guide walks through the three main CDC patterns, how Debezium makes log-based CDC production-ready, and how to wire it all into Kafka Connect.

TL;DR

Pattern                  | Latency | Source Impact | Complexity
Log-based (WAL / binlog) | Seconds | Very low      | High setup
Trigger-based            | Seconds | Medium        | Medium
Query-based (polling)    | Minutes | High          | Low

Log-based CDC is the gold standard for production workloads. The others are stepping stones or fallbacks.

The Three CDC Patterns

1. Log-Based CDC

Every major database maintains a transaction log for crash recovery. PostgreSQL calls it the Write-Ahead Log (WAL). MySQL calls it the binary log (binlog). These logs contain a complete, ordered record of every row-level change — including deletes.

Log-based CDC taps into this stream as a secondary consumer, without modifying the source application at all. It reads the transaction log, extracts change events, and forwards them downstream.

Advantages:

  • No additional load on source queries
  • Captures deletes and updates, not just inserts
  • Sub-second latency is achievable
  • No application-level changes required

Limitations:

  • Requires database-level privileges (replication role)
  • Log retention must be configured — logs get truncated
  • Schema changes require careful handling downstream

PostgreSQL — enabling logical replication:

-- postgresql.conf: set wal_level = logical (requires DB restart)

-- Grant replication privilege to the CDC user
ALTER ROLE cdc_user REPLICATION LOGIN;

-- Create a publication covering the captured tables (consumed via pgoutput)
CREATE PUBLICATION debezium_publication FOR TABLE public.customers, public.orders;

-- Create a replication slot for Debezium (it can also create this itself)
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

2. Trigger-Based CDC

Database triggers fire on INSERT, UPDATE, or DELETE and write a copy of the changed row to a dedicated changelog table. A downstream job polls that table and processes new rows.

-- PostgreSQL: trigger-based CDC example
CREATE TABLE customers_changes (
    change_id    BIGSERIAL PRIMARY KEY,
    change_type  VARCHAR(10),
    changed_at   TIMESTAMPTZ DEFAULT NOW(),
    old_data     JSONB,
    new_data     JSONB
);

CREATE OR REPLACE FUNCTION capture_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customers_changes (change_type, old_data)
        VALUES ('DELETE', row_to_json(OLD)::jsonb);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customers_changes (change_type, old_data, new_data)
        VALUES ('UPDATE', row_to_json(OLD)::jsonb, row_to_json(NEW)::jsonb);
    ELSE
        INSERT INTO customers_changes (change_type, new_data)
        VALUES ('INSERT', row_to_json(NEW)::jsonb);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_changes();

Trigger-based CDC works, but every write to the source table now incurs trigger overhead. The changelog table also needs regular pruning, and the approach doesn't scale gracefully under high write loads.
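The downstream side of this pattern is a checkpointed drain-and-prune job. The sketch below simulates it with Python's built-in sqlite3 standing in for PostgreSQL; the table and column names follow the customers_changes example above, while the drain helper and checkpoint handling are illustrative, not a specific tool's API.

```python
import json
import sqlite3

# In-memory stand-in for the Postgres changelog table populated by the trigger
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers_changes (
        change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        change_type TEXT,
        old_data    TEXT,
        new_data    TEXT
    )
""")
conn.execute(
    "INSERT INTO customers_changes (change_type, new_data) VALUES (?, ?)",
    ("INSERT", json.dumps({"id": 1, "email": "a@example.com"})),
)
conn.execute(
    "INSERT INTO customers_changes (change_type, old_data) VALUES (?, ?)",
    ("DELETE", json.dumps({"id": 1, "email": "a@example.com"})),
)

def drain(conn, last_seen_id):
    """Fetch changes past the checkpoint, forward them, return the new checkpoint."""
    rows = conn.execute(
        "SELECT change_id, change_type FROM customers_changes "
        "WHERE change_id > ? ORDER BY change_id",
        (last_seen_id,),
    ).fetchall()
    for change_id, change_type in rows:
        # forward (change_type, payload) to the downstream sink here
        last_seen_id = change_id
    return last_seen_id

checkpoint = drain(conn, 0)
# Prune processed rows so the changelog table stays small
conn.execute("DELETE FROM customers_changes WHERE change_id <= ?", (checkpoint,))
```

The monotonically increasing change_id doubles as the checkpoint, which is why the changelog table needs its own surrogate key rather than reusing the source table's.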

3. Query-Based CDC (Polling)

The simplest approach: run a query every N minutes that selects rows modified since the last checkpoint, using an updated_at timestamp column.

-- Query-based CDC: fetch rows changed since last run
-- Standard SQL — works on PostgreSQL, MySQL, and most other RDBMS
SELECT *
FROM customers
WHERE updated_at > '2024-01-15 10:00:00'
ORDER BY updated_at ASC;

The fundamental problem: hard deletes are invisible. If a row is removed, there is no updated_at to track. Soft deletes (a deleted_at column) partially solve this, but require discipline across the entire source application.

Query-based CDC is acceptable for small tables, infrequent updates, or environments where you control the source schema. For high-volume production workloads, it's a compromise.
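The hard-delete blind spot is easy to demonstrate. This sketch uses sqlite3 as a stand-in for the source database; the schema and checkpoint values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "a@example.com", "2024-01-15 09:00:00"),
    (2, "b@example.com", "2024-01-15 11:00:00"),
])

def changed_since(conn, checkpoint):
    # The polling query from above; "YYYY-MM-DD HH:MM:SS" strings compare correctly
    return conn.execute(
        "SELECT id FROM customers WHERE updated_at > ? ORDER BY updated_at",
        (checkpoint,),
    ).fetchall()

first = changed_since(conn, "2024-01-15 10:00:00")    # picks up row 2

conn.execute("DELETE FROM customers WHERE id = 1")    # hard delete on the source
second = changed_since(conn, "2024-01-15 11:00:00")   # empty: the delete never surfaces
```

The second poll returns nothing: as far as the pipeline can tell, row 1 still exists downstream.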

Debezium: Log-Based CDC That Actually Ships

Debezium is an open-source CDC platform built on Apache Kafka. It reads transaction logs from PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and more — then emits structured change events to Kafka topics.

Each event contains the full before/after state of the row, the operation type (c, u, d, r), and metadata including source table, transaction ID, LSN, and wall-clock timestamp. This is richer and more reliable than anything polling-based can produce.

Debezium PostgreSQL Connector Configuration

{
  "name": "customers-postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-host",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "your_password",
    "database.dbname": "production_db",
    "table.include.list": "public.customers,public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_publication",
    "topic.prefix": "prod",
    "snapshot.mode": "initial"
  }
}

This produces Kafka topics named prod.public.customers and prod.public.orders. Every insert, update, and delete on those tables becomes a Kafka message.
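That naming convention is simply topic.prefix, schema, and table joined with dots. A tiny helper mirroring it (illustrative, not Debezium code):

```python
def topic_name(prefix: str, schema: str, table: str) -> str:
    """Mirror Debezium's default '<topic.prefix>.<schema>.<table>' topic naming."""
    return f"{prefix}.{schema}.{table}"

topics = [topic_name("prod", "public", t) for t in ("customers", "orders")]
```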

The Debezium Event Structure

{
  "op": "u",
  "before": {
    "id": 42,
    "email": "old@example.com",
    "updated_at": 1705312800000
  },
  "after": {
    "id": 42,
    "email": "new@example.com",
    "updated_at": 1705316400000
  },
  "source": {
    "version": "2.5.0.Final",
    "connector": "postgresql",
    "name": "prod",
    "ts_ms": 1705316400123,
    "db": "production_db",
    "schema": "public",
    "table": "customers",
    "txId": 1234567,
    "lsn": 987654321
  }
}

op values: c = create (insert), u = update, d = delete, r = read (from initial snapshot).
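A consumer can fold these events into a current-state view by dispatching on op. This is a minimal in-memory sketch assuming the event shape shown above; apply_event is a hypothetical helper, not part of Debezium.

```python
def apply_event(state, event):
    """Fold one Debezium-style change event into an in-memory {id: row} view."""
    op = event["op"]
    if op in ("c", "r", "u"):          # insert, snapshot read, update: take "after"
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":                    # delete: only "before" is populated
        state.pop(event["before"]["id"], None)
    return state

state = {}
apply_event(state, {"op": "c", "before": None,
                    "after": {"id": 42, "email": "old@example.com"}})
apply_event(state, {"op": "u",
                    "before": {"id": 42, "email": "old@example.com"},
                    "after": {"id": 42, "email": "new@example.com"}})
```

Note that snapshot reads (r) and inserts (c) are applied identically, which is what lets a consumer bootstrap from the initial snapshot and then switch to live changes without special casing.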

Deploying with Kafka Connect

Debezium runs as a Kafka Connect source connector. If you already operate Kafka Connect, deploying Debezium is straightforward: add the connector JAR to the Connect plugin path and POST the configuration to the Connect REST API.

# Register the connector via Kafka Connect REST API
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d @customers-postgres-connector.json

# Check connector status
curl http://kafka-connect:8083/connectors/customers-postgres-connector/status

From the Kafka topic you can route change events to any sink: object storage (S3, GCS), a data warehouse (Snowflake, BigQuery), or Elasticsearch. Confluent Hub offers ready-made sink connectors for many common targets; check availability for yours.

Handling Schema Changes

The trickiest long-term challenge is schema evolution. When the source team adds a column, your CDC consumers need to handle both old and new event formats without failing.

Debezium integrates with Confluent Schema Registry (or compatible alternatives like Karapace) to manage Avro, Protobuf, or JSON Schema evolution. With BACKWARD or FULL compatibility mode, adding nullable columns is safe. Dropping or renaming columns is always a breaking change.

Best practice: Treat source schema changes as a deployment event. Use schema registry compatibility checks in your CI pipeline, and coordinate changes with downstream consumers before rolling out. Pairing Debezium with a data contract process closes this loop properly.
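On the consumer side, additive changes can be absorbed by projecting each event onto the fields the consumer knows, with defaults for columns added after deployment. A sketch under that assumption — the field names (loyalty_tier, signup_channel) are hypothetical:

```python
# Fields this consumer understands, with defaults for late-added nullable columns.
# loyalty_tier is a hypothetical column added by a later migration.
EXPECTED_FIELDS = {"id": None, "email": None, "loyalty_tier": "none"}

def normalize(after: dict) -> dict:
    """Project an event's 'after' payload onto the known fields; ignore the rest."""
    return {field: after.get(field, default)
            for field, default in EXPECTED_FIELDS.items()}

old_event = {"id": 1, "email": "a@example.com"}                       # pre-migration
new_event = {"id": 2, "email": "b@example.com",
             "loyalty_tier": "gold", "signup_channel": "web"}         # post-migration

rows = [normalize(old_event), normalize(new_event)]
```

Old events get the default, new events pass their value through, and unknown fields are dropped rather than crashing the consumer. This only works for additive changes, which is exactly why drops and renames remain breaking.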

CDC vs. Full Loads: When to Use What

Scenario                                  | CDC        | Full Load
High-volume OLTP tables (>1M changes/day) | ✅         | ❌ Too slow
Small reference tables (<10k rows)        | Overkill   | ✅ Simple
Capturing hard deletes                    | ✅         | ❌ Invisible
Source DB has no replication support      | ❌         | ✅
Near-real-time latency required (<1 min)  | ✅         | ❌
Team has no Kafka infrastructure          | ❌ Complex | ✅

Common Pitfalls

WAL retention too short. With a logical replication slot, PostgreSQL retains WAL until the connector confirms it — so an offline connector makes WAL pile up on disk, and if max_slot_wal_keep_size caps that growth, the slot can be invalidated, at which point Debezium loses its position and must re-snapshot. Monitor replication slot lag continuously and alert well before disk or the retention cap is exhausted.

Not planning the initial snapshot. Before streaming new changes, Debezium takes a consistent snapshot of existing data. On large tables this can run for hours. Plan accordingly — and validate the snapshot before enabling real-time streaming.

Ignoring tombstone events. After emitting a delete event, Debezium publishes a null-value tombstone message to the same key. Some consumers fail silently on this. Handle it explicitly.

Null before-state on inserts. For op: c, the before field is always null. Always check the op field before accessing before.

Compacted topics removing deletes. If you use log-compacted Kafka topics, tombstone messages will eventually disappear. For CDC streams that must replay the full history of changes, use retention-based (not compaction-based) topics.
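The tombstone and null-before pitfalls can both be handled in a few defensive lines. This sketch assumes (key, value) pairs as a Kafka consumer would see them, with the value already a JSON string or None; the handle function is illustrative.

```python
import json

def handle(messages):
    """Consume (key, value) pairs; skip tombstones; never read 'before' on inserts."""
    applied = []
    for key, value in messages:
        if value is None:
            continue                    # tombstone after a delete — not a decode error
        event = json.loads(value)
        if event["op"] == "c":          # "before" is always null here; read "after" only
            applied.append(("insert", event["after"]["id"]))
        elif event["op"] == "d":        # "after" is null; the row image is in "before"
            applied.append(("delete", event["before"]["id"]))
    return applied

msgs = [
    ("42", json.dumps({"op": "c", "before": None, "after": {"id": 42}})),
    ("42", json.dumps({"op": "d", "before": {"id": 42}, "after": None})),
    ("42", None),                       # the tombstone Debezium emits after the delete
]
result = handle(msgs)
```

Checking the value for None before json.loads, and checking op before touching before or after, covers both failure modes at once.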

Exploring CDC Data Once It Lands

Once your CDC stream lands in object storage as Parquet or CSV, you still need to reconstruct current state — applying inserts, updates, and deletes in sequence. Tools like dbt handle this with snapshot models. For ad-hoc validation — "did this specific record change today? What did it look like before?" — you want to query the raw change log quickly without spinning up infrastructure.
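The core of that state reconstruction is set-based: keep the latest event per key, then drop keys whose last event is a delete — roughly what a dbt snapshot or warehouse MERGE does. A sketch using sqlite3 as a stand-in for the warehouse, with a simplified change-log schema (seq plays the role of the LSN ordering):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (seq INTEGER, op TEXT, id INTEGER, email TEXT)")
conn.executemany("INSERT INTO changes VALUES (?, ?, ?, ?)", [
    (1, "c", 42, "old@example.com"),
    (2, "u", 42, "new@example.com"),
    (3, "c", 43, "c@example.com"),
    (4, "d", 43, None),
])

# Latest event per key wins; keys whose final event is a delete disappear
current = conn.execute("""
    SELECT id, email FROM changes c
    WHERE seq = (SELECT MAX(seq) FROM changes WHERE id = c.id)
      AND op != 'd'
    ORDER BY id
""").fetchall()
```

Customer 42 survives with its latest email; customer 43 was deleted after insertion and is gone from the reconstructed state.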

Harbinger Explorer lets you upload Parquet or CSV files and query them directly in the browser using DuckDB WASM. Ask in plain English: "show me all rows where op equals u and customer_id equals 42" — the AI generates the SQL against your actual schema. It's a fast way to validate CDC correctness during development.

Conclusion

CDC — specifically log-based CDC via Debezium — is the most reliable way to keep downstream systems synchronized with operational databases in real time. It's not trivial to set up, but for high-volume, latency-sensitive pipelines, it's the right tool. Start with a single table, validate the event format end to end, plan for schema changes, and expand from there.

For structuring the CDC data that lands in your lakehouse, read the Medallion Architecture Explained guide — it shows how raw change events fit into a Bronze → Silver → Gold pipeline.
