# Data Observability Explained: Freshness, Volume, Schema, Lineage
The dashboard shows $0 in revenue for the last 48 hours. The pipeline didn't fail. No alerts fired. A WHERE clause silently started dropping all rows due to a null value introduced upstream. Data observability exists to catch exactly this kind of failure.
Observability in software engineering means you can understand the internal state of a system from its external outputs. Applied to data, it means you can answer: Is my data complete? Fresh? Correct? Did anything change unexpectedly?
## The Five Pillars of Data Observability
Most frameworks for data observability converge on five core dimensions:
| Pillar | Question it answers | Example failure it catches |
|---|---|---|
| Freshness | Was the data updated recently? | Pipeline ran but loaded 0 rows; silent delay |
| Volume | Is there the expected amount of data? | Table lost 30% of rows overnight |
| Schema | Did the structure change unexpectedly? | Column renamed, type changed, column dropped |
| Distribution | Are values statistically normal? | Null rate jumped from 0.1% to 40% |
| Lineage | Where does data come from and what depends on it? | Upstream table change broke 12 downstream models |
Data testing (dbt tests, Great Expectations) covers specific assertions you define in advance. Data observability goes further — it monitors continuously and detects anomalies you didn't anticipate.
## Freshness Monitoring
Freshness is the simplest pillar to implement and often the most valuable. If a table isn't updated when expected, something upstream broke.
```sql
-- PostgreSQL: freshness check as a scheduled query
-- Alert if the orders table hasn't been updated in the last 2 hours
SELECT
    CASE
        WHEN MAX(updated_at) IS NULL THEN 'STALE'  -- empty table counts as stale
        WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours' THEN 'STALE'
        ELSE 'FRESH'
    END AS freshness_status,
    MAX(updated_at) AS last_update,
    NOW() - MAX(updated_at) AS data_age
FROM warehouse.orders;
```
In dbt, freshness is configured at the source level:
```yaml
# dbt: sources.yml with freshness SLA
sources:
  - name: raw
    schema: raw_data
    tables:
      - name: orders
        freshness:
          warn_after:
            count: 1
            period: hour
          error_after:
            count: 3
            period: hour
        loaded_at_field: updated_at  # column to check
```
Running `dbt source freshness` compares the latest `updated_at` against your SLA thresholds and raises warnings or errors accordingly.
Common freshness anti-pattern: Checking freshness only at the source. A pipeline can run successfully but produce an empty target if a filter or join kills all rows. Check freshness at the target table, not just the source.
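One way to apply the same SLA at both ends of the pipeline is to isolate the staleness decision as a pure function and run it against the raw source and the transformed target alike. A minimal sketch (the 2-hour SLA mirrors the query above; the function name is illustrative):

```python
from datetime import datetime, timedelta
from typing import Optional

def is_stale(last_update: Optional[datetime], now: datetime,
             max_age: timedelta = timedelta(hours=2)) -> bool:
    """A table is stale if it has never loaded rows or its newest
    row is older than the freshness SLA."""
    return last_update is None or now - last_update > max_age

# The same decision function covers both the raw source check and the
# target-table check, so the SLA lives in exactly one place.
now = datetime(2024, 3, 15, 12, 0)
print(is_stale(datetime(2024, 3, 15, 11, 30), now))  # False -- within SLA
print(is_stale(datetime(2024, 3, 15, 8, 0), now))    # True  -- 4 hours old
print(is_stale(None, now))                           # True  -- never loaded
```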
## Volume Monitoring
Volume monitoring detects unexpected drops or spikes in row counts. A sudden 0-row load almost never means the source had no data — it usually means something broke.
```python
# Python: volume anomaly detection using historical average
import sqlalchemy as sa

engine = sa.create_engine("postgresql://...")

def check_volume_anomaly(table: str, date: str, threshold: float = 0.3) -> dict:
    """Compare the given day's row count to the trailing 7-day average.
    Flag an anomaly if deviation exceeds threshold (default 30%)."""
    # Table name is interpolated directly -- keep it to a trusted allowlist.
    query = sa.text(f"""
        WITH daily_counts AS (
            SELECT DATE(loaded_at) AS load_date, COUNT(*) AS row_count
            FROM {table}
            WHERE loaded_at >= NOW() - INTERVAL '8 days'
            GROUP BY DATE(loaded_at)
        ),
        baseline AS (
            SELECT AVG(row_count) AS avg_count
            FROM daily_counts
            WHERE load_date < :check_date
        ),
        today AS (
            SELECT row_count AS today_count
            FROM daily_counts
            WHERE load_date = :check_date
        )
        SELECT today.today_count,
               baseline.avg_count,
               ABS(today.today_count - baseline.avg_count)
                   / NULLIF(baseline.avg_count, 0) AS deviation
        FROM today, baseline
    """)
    with engine.connect() as conn:
        result = conn.execute(query, {"check_date": date}).fetchone()
    if result is None:  # nothing loaded at all on check_date
        return {"table": table, "today_count": 0, "avg_count": None,
                "deviation_pct": None, "anomaly": True}
    deviation = float(result.deviation or 0)
    return {
        "table": table,
        "today_count": result.today_count,
        "avg_count": result.avg_count,
        "deviation_pct": round(deviation * 100, 1),
        "anomaly": deviation > threshold,
    }

check = check_volume_anomaly("warehouse.orders", "2024-03-15")
if check["anomaly"]:
    print(f"Volume anomaly: {check['deviation_pct']}% deviation from 7-day average")
```
Volume monitoring works best as a relative check (deviation from historical baseline), not an absolute threshold. Your order volume on Black Friday is 10x a normal Tuesday. A fixed threshold would fire every peak day.
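A common refinement on the relative check is to compare against the same weekday rather than a rolling average, so weekly seasonality doesn't trigger false alarms. A sketch with illustrative row counts (Fridays run at roughly 100k rows in this example):

```python
from datetime import date, timedelta
from statistics import mean

def same_weekday_deviation(counts: dict, check_date: date, weeks: int = 4) -> float:
    """Relative deviation of check_date's row count vs. the average of
    the same weekday over the previous `weeks` weeks."""
    baseline = [counts[d]
                for d in (check_date - timedelta(weeks=w)
                          for w in range(1, weeks + 1))
                if d in counts]
    if not baseline:
        return 0.0  # no history yet -- nothing to compare against
    avg = mean(baseline)
    return abs(counts[check_date] - avg) / avg if avg else 0.0

# Previous Fridays averaged ~100k rows; this Friday dropped to 55k.
counts = {
    date(2024, 3, 15): 55_000,
    date(2024, 3, 8): 100_000,
    date(2024, 3, 1): 98_000,
    date(2024, 2, 23): 102_000,
}
print(round(same_weekday_deviation(counts, date(2024, 3, 15)), 2))  # 0.45
```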
## Schema Monitoring
Schema changes are a silent killer. A column gets renamed in the source system. A new NOT NULL constraint gets added. A type changes from VARCHAR to INT. None of these cause an error in the pipeline — they just corrupt downstream outputs.
```python
# Python: schema snapshot comparison
import json
import sqlalchemy as sa

def snapshot_schema(engine, table: str) -> dict:
    """Capture the current schema as a dict keyed by column name."""
    with engine.connect() as conn:
        columns = conn.execute(sa.text(
            "SELECT column_name, data_type, is_nullable, character_maximum_length"
            " FROM information_schema.columns"
            " WHERE table_name = :table"
            " ORDER BY ordinal_position"
        ), {"table": table}).fetchall()
    return {row.column_name: dict(row._mapping) for row in columns}

def compare_schemas(old: dict, new: dict) -> list:
    """Return a list of detected schema changes."""
    changes = []
    for col in old:
        if col not in new:
            changes.append(f"DROPPED: column '{col}'")
        elif old[col]["data_type"] != new[col]["data_type"]:
            changes.append(
                f"TYPE CHANGE: '{col}' {old[col]['data_type']} -> {new[col]['data_type']}"
            )
    for col in new:
        if col not in old:
            changes.append(f"ADDED: column '{col}'")
    return changes

# Usage: compare yesterday's snapshot to today's
engine = sa.create_engine("postgresql://...")
with open("schema_snapshots/orders_yesterday.json") as f:
    old_schema = json.load(f)
new_schema = snapshot_schema(engine, "orders")

changes = compare_schemas(old_schema, new_schema)
if changes:
    print("Schema changes detected:")
    for c in changes:
        print(f"  - {c}")
```
Store schema snapshots as files or in a metadata table. Run comparisons on every pipeline run. Alert on any unexpected change — expected changes (new columns added intentionally) can be acknowledged and suppressed.
## Distribution Monitoring
Distribution monitoring tracks statistical properties of column values over time: null rates, distinct counts, min/max values, mean and median. It alerts when these deviate significantly from historical norms.
```sql
-- PostgreSQL: distribution profile for a column
-- Run this on each pipeline execution and store results
SELECT
    'orders' AS table_name,
    'amount' AS column_name,
    NOW() AS profiled_at,
    COUNT(*) AS total_rows,
    COUNT(amount) AS non_null_count,
    COUNT(*) - COUNT(amount) AS null_count,
    ROUND(100.0 * (COUNT(*) - COUNT(amount)) / COUNT(*), 2) AS null_pct,
    MIN(amount) AS min_val,
    MAX(amount) AS max_val,
    AVG(amount) AS mean_val,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_val,
    STDDEV(amount) AS stddev_val
FROM warehouse.orders
WHERE order_date = CURRENT_DATE;
```
The power comes from storing these profiles over time and comparing each run's statistics to the historical baseline. A null rate that jumps from 0.2% to 15% is a strong signal that something upstream changed — even if the row count looks normal.
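The comparison itself can be as simple as a z-score against the stored history. A minimal sketch (the history values and the |z| > 3 cutoff are illustrative; the cutoff assumes the metric is roughly normally distributed):

```python
from statistics import mean, stdev

def zscore(history: list, latest: float) -> float:
    """How many standard deviations `latest` sits from the stored
    profile history. |z| > 3 is a common anomaly cutoff."""
    mu, sigma = mean(history), stdev(history)
    return 0.0 if sigma == 0 else (latest - mu) / sigma

null_pct_history = [0.2, 0.18, 0.22, 0.19, 0.21]  # past runs' null_pct
z = zscore(null_pct_history, 15.0)                # today's run: 15% nulls
print(z > 3)  # True -> alert
```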
## Lineage Monitoring
Data lineage answers two questions: Where did this data come from? and What depends on this data?
When an upstream column changes, lineage tells you which downstream models, reports, and dashboards are affected. Without it, you're debugging blind.
*(Lineage diagram: raw.orders / raw.customers → stg_orders / stg_customers → fct_orders → report_executive_kpis)*
① raw.orders and raw.customers are source tables — changes here propagate downstream.
② stg_orders and stg_customers are the first transformation layer.
③ fct_orders joins both staging tables — a breaking change in either source affects this.
④ report_executive_kpis depends on three upstream models — it's the most fragile endpoint.
dbt generates lineage automatically from ref() and source() calls. For non-dbt pipelines, tools like OpenLineage, Marquez, or DataHub provide lineage capture via instrumentation.
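However it is captured, lineage metadata reduces to a dependency graph, and "what depends on this table?" becomes a graph traversal. A minimal sketch (the edge map is illustrative, mirroring the models named above):

```python
from collections import deque

def downstream(edges: dict, node: str) -> set:
    """All transitive downstream dependents of `node` (BFS over the DAG)."""
    seen, queue = set(), deque([node])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# parent -> children, mirroring the lineage example
edges = {
    "raw.orders": ["stg_orders"],
    "raw.customers": ["stg_customers"],
    "stg_orders": ["fct_orders"],
    "stg_customers": ["fct_orders"],
    "fct_orders": ["report_executive_kpis"],
}
print(sorted(downstream(edges, "raw.orders")))
# ['fct_orders', 'report_executive_kpis', 'stg_orders']
```

A schema change in `raw.orders` thus maps directly to the set of models and reports that need checking before the change ships.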
## Tooling Landscape
| Tool | Freshness | Volume | Schema | Distribution | Lineage | Hosted SaaS |
|---|---|---|---|---|---|---|
| Monte Carlo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Soda Cloud | ✅ | ✅ | ✅ | ✅ | Partial | ✅ |
| dbt + Elementary | ✅ | ✅ | ✅ | ✅ | ✅ (via dbt) | ✅ (Elementary) |
| Great Expectations | Partial | ✅ | Partial | ✅ | ❌ | Self-hosted |
| OpenLineage/Marquez | ❌ | ❌ | ❌ | ❌ | ✅ | Self-hosted |
| Re:data | ✅ | ✅ | ✅ | Partial | Partial | Self-hosted |
For most teams starting out: dbt with Elementary covers all five pillars without a separate tool. For larger teams with multi-engine data stacks: Monte Carlo or a similar dedicated platform is worth the investment.
## Building a Minimal Observability Stack
If you're starting from scratch and want coverage without tool sprawl:
- Freshness → `dbt source freshness` (free, built-in)
- Volume → dbt test `row_count > 0` + custom singular tests with deviation logic
- Schema → dbt schema tests (`not_null`, `accepted_values`) + schema snapshot comparison script
- Distribution → Elementary (open source dbt package) profiles columns automatically
- Lineage → `dbt docs generate` (free, built from `ref()` calls)
This stack requires no additional infrastructure, costs nothing, and covers the most common failure modes.
## Wrapping Up
Data observability is what happens when you stop treating data pipelines as batch jobs that either succeed or fail, and start treating them as systems with internal state you need to understand continuously. Freshness, volume, schema, distribution, and lineage together give you that understanding.
The hardest part isn't building the monitors — it's building the discipline to act on them. An alert that nobody investigates is just noise. Define owners, set SLAs, and treat data quality incidents with the same urgency as production outages.