# Data Quality Testing: A Practical Guide for Data Engineers
Bad data costs more than bad code. A broken pipeline throws an error and someone fixes it. Bad data silently corrupts dashboards, misleads stakeholders, and erodes trust in your entire data platform — sometimes for weeks before anyone notices.
Data quality testing is the practice of systematically validating that your data meets defined expectations at every stage of your pipeline. If you're running production data workloads without it, you're flying blind.
## Why Data Quality Testing Matters
Here's a scenario every data engineer has lived through: a source API changes its response schema overnight. No error. Your pipeline ingests the data, transforms it, loads it into the warehouse. Three days later, a VP asks why revenue numbers dropped 40%. The answer? A field renamed from total_amount to amount_total, and your transformation silently produced nulls.
Data quality testing catches this on ingestion — not in a board meeting.
### The Cost of Skipping Quality Tests
| Risk | Impact | Detection Without Tests |
|---|---|---|
| Schema drift | Null columns, type errors | Days to weeks |
| Duplicate records | Inflated metrics | Often never caught |
| Missing data | Incomplete aggregations | Next reporting cycle |
| Referential integrity breaks | Orphaned records, broken joins | Downstream failures |
| Value range violations | Nonsensical KPIs | Stakeholder complaints |
## Core Dimensions of Data Quality
Before writing tests, you need a framework for what to test. These six dimensions cover most real-world scenarios:
| Dimension | Question It Answers | Example Test |
|---|---|---|
| Completeness | Is all expected data present? | NOT NULL checks, row count thresholds |
| Uniqueness | Are there duplicates? | Primary key uniqueness |
| Validity | Do values fall within expected ranges? | status IN ('active', 'inactive', 'pending') |
| Accuracy | Does the data reflect reality? | Cross-source reconciliation |
| Consistency | Do related datasets agree? | orders.customer_id exists in customers.id |
| Timeliness | Is the data fresh enough? | MAX(updated_at) > NOW() - INTERVAL '2 hours' |
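Several of these dimensions translate directly into a few lines of pandas. A minimal sketch of the uniqueness and completeness checks (`check_dimensions` and the column names are illustrative, not part of any framework):

```python
import pandas as pd

def check_dimensions(df: pd.DataFrame, key: str, required: list[str]) -> dict:
    """Run uniqueness and completeness checks on a DataFrame."""
    return {
        # Uniqueness: no duplicate values in the key column
        "uniqueness": not df[key].duplicated().any(),
        # Completeness: no nulls in any of the required columns
        "completeness": not df[required].isna().any().any(),
    }

orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a2"],          # "a2" is duplicated
    "customer_id": ["c1", None, "c3"],        # one null reference
})
print(check_dimensions(orders, "order_id", ["customer_id"]))
# {'uniqueness': False, 'completeness': False}
```

The same pattern extends to validity (`df[col].isin(allowed).all()`) and timeliness (`df[ts_col].max()` against a cutoff).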
## Building a Data Quality Testing Strategy
### Layer 1: Schema Validation (Ingestion)
Catch structural problems before data enters your pipeline. This is your cheapest line of defense.
```python
# Python — Schema validation with Pydantic
from pydantic import BaseModel, field_validator
from datetime import datetime

class OrderRecord(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    currency: str
    created_at: datetime
    status: str

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"amount must be positive, got {v}")
        return v

    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v: str) -> str:
        valid = {"EUR", "USD", "GBP", "CHF"}
        if v not in valid:
            raise ValueError(f"unexpected currency: {v}")
        return v

# Validate each record at ingestion
def validate_batch(records: list[dict]) -> tuple[list, list]:
    valid, invalid = [], []
    for record in records:
        try:
            validated = OrderRecord(**record)
            valid.append(validated.model_dump())
        except Exception as e:
            invalid.append({"record": record, "error": str(e)})
    return valid, invalid
```
This catches type mismatches, missing fields, and out-of-range values before they hit your warehouse.
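In practice you also need somewhere for the rejects to go. Here is a sketch of an ingestion step that routes failures to a quarantine file instead of dropping them; `ingest_with_quarantine` and the stand-in validator are illustrative names, and the validator mirrors the `(valid, invalid)` contract of `validate_batch` above so the example stays self-contained:

```python
import json
from typing import Callable

def ingest_with_quarantine(
    records: list[dict],
    validate: Callable[[list[dict]], tuple[list, list]],
    quarantine_path: str,
) -> list[dict]:
    """Load valid records; append rejects to a quarantine file instead of dropping them."""
    valid, invalid = validate(records)
    if invalid:
        with open(quarantine_path, "a") as f:
            for item in invalid:
                f.write(json.dumps(item, default=str) + "\n")
    return valid

# Stand-in validator with the same (valid, invalid) contract as validate_batch
def simple_validate(records: list[dict]) -> tuple[list, list]:
    valid, invalid = [], []
    for r in records:
        if isinstance(r.get("amount"), (int, float)) and r["amount"] > 0:
            valid.append(r)
        else:
            invalid.append({"record": r, "error": "amount must be positive"})
    return valid, invalid
```

In a real pipeline the quarantine sink would be a dead-letter table or queue rather than a local file, but the shape is the same: nothing vanishes silently.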
### Layer 2: SQL-Based Assertions (Transformation)
After your data lands, run assertions against the transformed tables. These are your core quality gates.
```sql
-- PostgreSQL — Common data quality assertions

-- 1. Uniqueness: No duplicate order IDs
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Expected: 0 rows

-- 2. Completeness: No null customer references
SELECT COUNT(*) AS null_customers
FROM orders
WHERE customer_id IS NULL;
-- Expected: 0

-- 3. Referential integrity: Every order references a valid customer
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Expected: 0 rows

-- 4. Freshness: Data updated within last 2 hours
SELECT
  CASE
    WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
    THEN 'STALE'
    ELSE 'FRESH'
  END AS freshness_status
FROM orders;
-- Expected: FRESH

-- 5. Volume: Row count within expected range
SELECT
  CASE
    WHEN COUNT(*) BETWEEN 1000 AND 100000
    THEN 'OK'
    ELSE 'ANOMALY'
  END AS volume_check
FROM daily_orders
WHERE order_date = CURRENT_DATE;
```
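Assertions like these are easy to drive from a scheduler: run each query and treat any returned rows as a failure. A minimal, framework-free sketch using an in-memory SQLite database for illustration (the assertion list and table are toy versions of the PostgreSQL checks above):

```python
import sqlite3

# Each assertion: (name, query that should return ZERO rows when the data is healthy)
ASSERTIONS = [
    ("pk_uniqueness",
     "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"),
    ("customer_not_null",
     "SELECT order_id FROM orders WHERE customer_id IS NULL"),
]

def run_assertions(conn) -> dict[str, bool]:
    """Return {assertion_name: passed} for each zero-rows-expected query."""
    results = {}
    for name, query in ASSERTIONS:
        rows = conn.execute(query).fetchall()
        results[name] = len(rows) == 0
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("o1", "c1"), ("o1", "c2"), ("o2", None)])
print(run_assertions(conn))
# {'pk_uniqueness': False, 'customer_not_null': False}
```

The same loop works against any DB-API connection (psycopg2, snowflake-connector, etc.); only the connection setup changes.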
### Layer 3: Statistical Checks (Aggregation)
For metrics tables and aggregations, go beyond row-level checks. Look for distribution anomalies.
```python
# Python — Statistical quality checks with pandas
import pandas as pd

def check_metric_quality(
    current: pd.DataFrame,
    historical: pd.DataFrame,
    metric_col: str,
    z_threshold: float = 3.0,
) -> dict:
    """Flag metrics that deviate significantly from historical norms."""
    hist_mean = historical[metric_col].mean()
    hist_std = historical[metric_col].std()
    current_val = current[metric_col].sum()
    z_score = (current_val - hist_mean) / hist_std if hist_std > 0 else 0
    return {
        "metric": metric_col,
        "current_value": current_val,
        "historical_mean": round(hist_mean, 2),
        "z_score": round(z_score, 2),
        "status": "ANOMALY" if abs(z_score) > z_threshold else "OK",
    }

# Usage
# results = check_metric_quality(today_df, last_30_days_df, "revenue")
# if results["status"] == "ANOMALY":
#     alert_on_call_engineer(results)
```
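To see such a check fire, here is a self-contained run of the same z-score logic on synthetic data (the numbers are invented purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 30 days of "revenue" around 10,000, then a suspicious spike today
historical = pd.DataFrame({"revenue": rng.normal(10_000, 500, size=30)})
current = pd.DataFrame({"revenue": [25_000.0]})

hist_mean = historical["revenue"].mean()
hist_std = historical["revenue"].std()
z = (current["revenue"].sum() - hist_mean) / hist_std
status = "ANOMALY" if abs(z) > 3.0 else "OK"
print(status)  # ANOMALY: a ~15k jump is far beyond 3 standard deviations
```

For strongly seasonal metrics (weekday vs. weekend revenue, for example), compare against the same weekday's history rather than a flat 30-day window, or the check will cry wolf every Monday.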
## Tooling: What to Use
Several frameworks make data quality testing easier. Here's an honest comparison:
| Tool | Best For | Trade-off |
|---|---|---|
| dbt tests | SQL-native teams already using dbt | Limited to SQL, tied to dbt ecosystem |
| Great Expectations | Python-heavy pipelines, rich assertions | Steep learning curve, heavy config |
| Soda | Multi-warehouse checks, SodaCL syntax | Newer, smaller community |
| Elementary | dbt users wanting anomaly detection | dbt-only, adds dashboard overhead |
| Custom scripts | Full control, unusual data sources | Maintenance burden, no standardization |
My take: if you're using dbt, start with built-in tests and add Great Expectations for anything dbt can't express. If you're not on dbt, Soda's YAML-based approach has the gentlest learning curve.
## Common Mistakes and Pitfalls
1. **Testing only in production.** If your quality tests run after data is already in the warehouse, you've already served bad data to downstream consumers. Test at ingestion boundaries.
2. **Alert fatigue.** Hundreds of low-priority test failures train your team to ignore alerts. Tier your tests: P0 (blocks pipeline), P1 (needs same-day fix), P2 (tracked in backlog).
3. **Hardcoded thresholds.** "Row count must be > 10,000" breaks the first time your business has a slow day. Use rolling averages and percentage-based thresholds instead.
4. **No quarantine strategy.** When a test fails, what happens to the data? If you don't have a dead-letter queue or quarantine table, failed records just vanish — and nobody knows.
5. **Ignoring test maintenance.** Data quality tests need maintenance like any other code. When schemas evolve, tests must follow. Schedule quarterly reviews of your test suite.
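Pitfall 3 can be fixed in a few lines: derive the acceptable band from a rolling window instead of a constant. A sketch, where `dynamic_volume_check`, the window size, and the ±50% tolerance are all illustrative choices to tune for your data:

```python
import pandas as pd

def dynamic_volume_check(
    daily_counts: pd.Series,
    today_count: int,
    window: int = 7,
    tolerance: float = 0.5,
) -> str:
    """Compare today's row count to a rolling average instead of a fixed number."""
    baseline = daily_counts.tail(window).mean()
    lower = baseline * (1 - tolerance)
    upper = baseline * (1 + tolerance)
    return "OK" if lower <= today_count <= upper else "ANOMALY"

history = pd.Series([12_000, 11_500, 13_000, 12_400, 11_900, 12_700, 12_100])
print(dynamic_volume_check(history, 12_300))  # OK: within ±50% of the 7-day mean
print(dynamic_volume_check(history, 4_000))   # ANOMALY: far below the band
```

Because the baseline moves with the business, a genuinely slow day shifts the band over the following week instead of paging someone at 3 a.m.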
## Implementing a Quality Gate Pattern
The most effective pattern I've seen in production: treat quality tests as pipeline gates that block downstream processing.
```python
# Python — Quality gate pattern (framework-agnostic)
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # Blocks pipeline
    WARNING = "warning"    # Logs alert, continues
    INFO = "info"          # Logs only

@dataclass
class TestResult:
    name: str
    passed: bool
    severity: Severity
    details: str = ""

def run_quality_gate(results: list[TestResult]) -> bool:
    """Returns True if pipeline should proceed."""
    critical_failures = [
        r for r in results
        if not r.passed and r.severity == Severity.CRITICAL
    ]
    warnings = [
        r for r in results
        if not r.passed and r.severity == Severity.WARNING
    ]
    for w in warnings:
        print(f"⚠️ WARNING: {w.name} — {w.details}")
    if critical_failures:
        for f in critical_failures:
            print(f"🚨 CRITICAL: {f.name} — {f.details}")
        print(f"Pipeline blocked: {len(critical_failures)} critical failures")
        return False
    print(f"✅ Quality gate passed ({len(warnings)} warnings)")
    return True

# Example usage in an Airflow task or pipeline step:
# results = [
#     TestResult("pk_uniqueness", True, Severity.CRITICAL),
#     TestResult("freshness_check", False, Severity.WARNING, "Data is 3h old"),
#     TestResult("null_check_email", False, Severity.CRITICAL, "42 null emails"),
# ]
# if not run_quality_gate(results):
#     raise AirflowFailException("Quality gate failed")
```
## Where Harbinger Explorer Fits
If you're testing quality on data pulled from APIs — which is where a lot of ingestion issues originate — Harbinger Explorer lets you query API responses directly in your browser using DuckDB WASM. You can run SQL assertions against raw API data before it enters your pipeline, catching schema drift and value anomalies at the source without writing a single ingestion script first.
## Next Steps
Data quality testing isn't a one-time project. Start with the highest-impact layer: schema validation at ingestion. Add SQL assertions for your most critical tables. Build statistical checks for your key metrics. Most importantly, make quality gates block pipelines — don't just log failures and hope someone reads the logs.
Your immediate action item: pick your three most business-critical tables and write five assertions each (uniqueness, completeness, freshness, referential integrity, volume). Deploy them tomorrow. You'll sleep better.
## Continue Reading
**Data Pipeline Monitoring: Catch Failures Before Users Do**
A practical guide to monitoring data pipelines — covering execution tracking, data quality checks, performance metrics, and schema change detection with runnable code examples.

**Data Lakehouse Architecture Explained**
How data lakehouse architecture works, when to use it over a warehouse or lake, and the common pitfalls that trip up data engineering teams.

**dbt vs Spark SQL: How to Choose**
dbt or Spark SQL for your transformation layer? A side-by-side comparison of features, pricing, and use cases — with code examples for both and honest trade-offs for analytics engineers.