# Data Quality Testing: A Practical Guide for Data Engineers
Bad data costs more than bad code. A broken pipeline throws an error and someone fixes it. Bad data silently corrupts dashboards, misleads stakeholders, and erodes trust in your entire data platform — sometimes for weeks before anyone notices.
Data quality testing is the practice of systematically validating that your data meets defined expectations at every stage of your pipeline. If you're running production data workloads without it, you're flying blind.
## Why Data Quality Testing Matters
Here's a scenario every data engineer has lived through: a source API changes its response schema overnight. No error. Your pipeline ingests the data, transforms it, loads it into the warehouse. Three days later, a VP asks why revenue numbers dropped 40%. The answer? A field renamed from total_amount to amount_total, and your transformation silently produced nulls.
Data quality testing catches this on ingestion — not in a board meeting.
### The Cost of Skipping Quality Tests
| Risk | Impact | Detection Without Tests |
|---|---|---|
| Schema drift | Null columns, type errors | Days to weeks |
| Duplicate records | Inflated metrics | Often never caught |
| Missing data | Incomplete aggregations | Next reporting cycle |
| Referential integrity breaks | Orphaned records, broken joins | Downstream failures |
| Value range violations | Nonsensical KPIs | Stakeholder complaints |
## Core Dimensions of Data Quality
Before writing tests, you need a framework for what to test. These six dimensions cover most real-world scenarios:
| Dimension | Question It Answers | Example Test |
|---|---|---|
| Completeness | Is all expected data present? | NOT NULL checks, row count thresholds |
| Uniqueness | Are there duplicates? | Primary key uniqueness |
| Validity | Do values fall within expected ranges? | status IN ('active', 'inactive', 'pending') |
| Accuracy | Does the data reflect reality? | Cross-source reconciliation |
| Consistency | Do related datasets agree? | orders.customer_id exists in customers.id |
| Timeliness | Is the data fresh enough? | MAX(updated_at) > NOW() - INTERVAL '2 hours' |
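Several of these dimensions translate directly into a few lines of pandas. A minimal sketch of the uniqueness and completeness checks (`check_dimensions` and the column names are illustrative, not part of any framework):

```python
import pandas as pd

def check_dimensions(df: pd.DataFrame, key: str, required: list[str]) -> dict:
    """Run uniqueness and completeness checks on a DataFrame."""
    return {
        # Uniqueness: no duplicate values in the key column
        "uniqueness": not df[key].duplicated().any(),
        # Completeness: no nulls in any of the required columns
        "completeness": not df[required].isna().any().any(),
    }

orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a2"],          # "a2" is duplicated
    "customer_id": ["c1", None, "c3"],        # one null reference
})
print(check_dimensions(orders, "order_id", ["customer_id"]))
# {'uniqueness': False, 'completeness': False}
```

The same pattern extends to validity (`df[col].isin(allowed).all()`) and timeliness (`df[ts_col].max()` against a cutoff).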
## Building a Data Quality Testing Strategy
### Layer 1: Schema Validation (Ingestion)
Catch structural problems before data enters your pipeline. This is your cheapest line of defense.
```python
# Python — Schema validation with Pydantic
from pydantic import BaseModel, field_validator
from datetime import datetime

class OrderRecord(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    currency: str
    created_at: datetime
    status: str

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"amount must be positive, got {v}")
        return v

    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v: str) -> str:
        valid = {"EUR", "USD", "GBP", "CHF"}
        if v not in valid:
            raise ValueError(f"unexpected currency: {v}")
        return v

# Validate each record at ingestion
def validate_batch(records: list[dict]) -> tuple[list, list]:
    valid, invalid = [], []
    for record in records:
        try:
            validated = OrderRecord(**record)
            valid.append(validated.model_dump())
        except Exception as e:
            invalid.append({"record": record, "error": str(e)})
    return valid, invalid
```
This catches type mismatches, missing fields, and out-of-range values before they hit your warehouse.
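In practice you also need somewhere for the rejects to go. Here is a sketch of an ingestion step that routes failures to a quarantine file instead of dropping them; `ingest_with_quarantine` and the stand-in validator are illustrative names, and the validator mirrors the `(valid, invalid)` contract of `validate_batch` above so the example stays self-contained:

```python
import json
from typing import Callable

def ingest_with_quarantine(
    records: list[dict],
    validate: Callable[[list[dict]], tuple[list, list]],
    quarantine_path: str,
) -> list[dict]:
    """Load valid records; append rejects to a quarantine file instead of dropping them."""
    valid, invalid = validate(records)
    if invalid:
        with open(quarantine_path, "a") as f:
            for item in invalid:
                f.write(json.dumps(item, default=str) + "\n")
    return valid

# Stand-in validator with the same (valid, invalid) contract as validate_batch
def simple_validate(records: list[dict]) -> tuple[list, list]:
    valid, invalid = [], []
    for r in records:
        if isinstance(r.get("amount"), (int, float)) and r["amount"] > 0:
            valid.append(r)
        else:
            invalid.append({"record": r, "error": "amount must be positive"})
    return valid, invalid
```

In a real pipeline the quarantine sink would be a dead-letter table or queue rather than a local file, but the shape is the same: nothing vanishes silently.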
### Layer 2: SQL-Based Assertions (Transformation)
After your data lands, run assertions against the transformed tables. These are your core quality gates.
```sql
-- PostgreSQL — Common data quality assertions

-- 1. Uniqueness: No duplicate order IDs
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Expected: 0 rows

-- 2. Completeness: No null customer references
SELECT COUNT(*) AS null_customers
FROM orders
WHERE customer_id IS NULL;
-- Expected: 0

-- 3. Referential integrity: Every order references a valid customer
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Expected: 0 rows

-- 4. Freshness: Data updated within last 2 hours
SELECT
  CASE
    WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
    THEN 'STALE'
    ELSE 'FRESH'
  END AS freshness_status
FROM orders;
-- Expected: FRESH

-- 5. Volume: Row count within expected range
SELECT
  CASE
    WHEN COUNT(*) BETWEEN 1000 AND 100000
    THEN 'OK'
    ELSE 'ANOMALY'
  END AS volume_check
FROM daily_orders
WHERE order_date = CURRENT_DATE;
```
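Assertions like these are easy to drive from a scheduler: run each query and treat any returned rows as a failure. A minimal, framework-free sketch using an in-memory SQLite database for illustration (the assertion list and table are toy versions of the PostgreSQL checks above):

```python
import sqlite3

# Each assertion: (name, query that should return ZERO rows when the data is healthy)
ASSERTIONS = [
    ("pk_uniqueness",
     "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"),
    ("customer_not_null",
     "SELECT order_id FROM orders WHERE customer_id IS NULL"),
]

def run_assertions(conn) -> dict[str, bool]:
    """Return {assertion_name: passed} for each zero-rows-expected query."""
    results = {}
    for name, query in ASSERTIONS:
        rows = conn.execute(query).fetchall()
        results[name] = len(rows) == 0
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("o1", "c1"), ("o1", "c2"), ("o2", None)])
print(run_assertions(conn))
# {'pk_uniqueness': False, 'customer_not_null': False}
```

The same loop works against any DB-API connection (psycopg2, snowflake-connector, etc.); only the connection setup changes.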
### Layer 3: Statistical Checks (Aggregation)
For metrics tables and aggregations, go beyond row-level checks. Look for distribution anomalies.
```python
# Python — Statistical quality checks with pandas
import pandas as pd

def check_metric_quality(
    current: pd.DataFrame,
    historical: pd.DataFrame,
    metric_col: str,
    z_threshold: float = 3.0,
) -> dict:
    """Flag metrics that deviate significantly from historical norms."""
    hist_mean = historical[metric_col].mean()
    hist_std = historical[metric_col].std()
    current_val = current[metric_col].sum()
    z_score = (current_val - hist_mean) / hist_std if hist_std > 0 else 0
    return {
        "metric": metric_col,
        "current_value": current_val,
        "historical_mean": round(hist_mean, 2),
        "z_score": round(z_score, 2),
        "status": "ANOMALY" if abs(z_score) > z_threshold else "OK",
    }

# Usage
# results = check_metric_quality(today_df, last_30_days_df, "revenue")
# if results["status"] == "ANOMALY":
#     alert_on_call_engineer(results)
```
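To see such a check fire, here is a self-contained run of the same z-score logic on synthetic data (the numbers are invented purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# 30 days of "revenue" around 10,000, then a suspicious spike today
historical = pd.DataFrame({"revenue": rng.normal(10_000, 500, size=30)})
current = pd.DataFrame({"revenue": [25_000.0]})

hist_mean = historical["revenue"].mean()
hist_std = historical["revenue"].std()
z = (current["revenue"].sum() - hist_mean) / hist_std
status = "ANOMALY" if abs(z) > 3.0 else "OK"
print(status)  # ANOMALY: a ~15k jump is far beyond 3 standard deviations
```

For strongly seasonal metrics (weekday vs. weekend revenue, for example), compare against the same weekday's history rather than a flat 30-day window, or the check will cry wolf every Monday.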
## Tooling: What to Use
Several frameworks make data quality testing easier. Here's an honest comparison:
| Tool | Best For | Trade-off |
|---|---|---|
| dbt tests | SQL-native teams already using dbt | Limited to SQL, tied to dbt ecosystem |
| Great Expectations | Python-heavy pipelines, rich assertions | Steep learning curve, heavy config |
| Soda | Multi-warehouse checks, SodaCL syntax | Newer, smaller community |
| Elementary | dbt users wanting anomaly detection | dbt-only, adds dashboard overhead |
| Custom scripts | Full control, unusual data sources | Maintenance burden, no standardization |
My take: if you're using dbt, start with built-in tests and add Great Expectations for anything dbt can't express. If you're not on dbt, Soda's YAML-based approach has the gentlest learning curve.
## Common Mistakes and Pitfalls
1. **Testing only in production.** If your quality tests run after data is already in the warehouse, you've already served bad data to downstream consumers. Test at ingestion boundaries.
2. **Alert fatigue.** Hundreds of low-priority test failures train your team to ignore alerts. Tier your tests: P0 (blocks pipeline), P1 (needs same-day fix), P2 (tracked in backlog).
3. **Hardcoded thresholds.** "Row count must be > 10,000" breaks the first time your business has a slow day. Use rolling averages and percentage-based thresholds instead.
4. **No quarantine strategy.** When a test fails, what happens to the data? If you don't have a dead-letter queue or quarantine table, failed records just vanish — and nobody knows.
5. **Ignoring test maintenance.** Data quality tests need maintenance like any other code. When schemas evolve, tests must follow. Schedule quarterly reviews of your test suite.
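Pitfall 3 can be fixed in a few lines: derive the acceptable band from a rolling window instead of a constant. A sketch, where `dynamic_volume_check`, the window size, and the ±50% tolerance are all illustrative choices to tune for your data:

```python
import pandas as pd

def dynamic_volume_check(
    daily_counts: pd.Series,
    today_count: int,
    window: int = 7,
    tolerance: float = 0.5,
) -> str:
    """Compare today's row count to a rolling average instead of a fixed number."""
    baseline = daily_counts.tail(window).mean()
    lower = baseline * (1 - tolerance)
    upper = baseline * (1 + tolerance)
    return "OK" if lower <= today_count <= upper else "ANOMALY"

history = pd.Series([12_000, 11_500, 13_000, 12_400, 11_900, 12_700, 12_100])
print(dynamic_volume_check(history, 12_300))  # OK: within ±50% of the 7-day mean
print(dynamic_volume_check(history, 4_000))   # ANOMALY: far below the band
```

Because the baseline moves with the business, a genuinely slow day shifts the band over the following week instead of paging someone at 3 a.m.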
## Implementing a Quality Gate Pattern
The most effective pattern I've seen in production: treat quality tests as pipeline gates that block downstream processing.
```python
# Python — Quality gate pattern (framework-agnostic)
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # Blocks pipeline
    WARNING = "warning"    # Logs alert, continues
    INFO = "info"          # Logs only

@dataclass
class TestResult:
    name: str
    passed: bool
    severity: Severity
    details: str = ""

def run_quality_gate(results: list[TestResult]) -> bool:
    """Returns True if pipeline should proceed."""
    critical_failures = [
        r for r in results
        if not r.passed and r.severity == Severity.CRITICAL
    ]
    warnings = [
        r for r in results
        if not r.passed and r.severity == Severity.WARNING
    ]
    for w in warnings:
        print(f"⚠️ WARNING: {w.name} — {w.details}")
    if critical_failures:
        for f in critical_failures:
            print(f"🚨 CRITICAL: {f.name} — {f.details}")
        print(f"Pipeline blocked: {len(critical_failures)} critical failures")
        return False
    print(f"✅ Quality gate passed ({len(warnings)} warnings)")
    return True

# Example usage in an Airflow task or pipeline step:
# results = [
#     TestResult("pk_uniqueness", True, Severity.CRITICAL),
#     TestResult("freshness_check", False, Severity.WARNING, "Data is 3h old"),
#     TestResult("null_check_email", False, Severity.CRITICAL, "42 null emails"),
# ]
# if not run_quality_gate(results):
#     raise AirflowFailException("Quality gate failed")
```
## Where Harbinger Explorer Fits
If you're testing quality on data pulled from APIs — which is where a lot of ingestion issues originate — Harbinger Explorer lets you query API responses directly in your browser using DuckDB WASM. You can run SQL assertions against raw API data before it enters your pipeline, catching schema drift and value anomalies at the source without writing a single ingestion script first.
## Next Steps
Data quality testing isn't a one-time project. Start with the highest-impact layer: schema validation at ingestion. Add SQL assertions for your most critical tables. Build statistical checks for your key metrics. Most importantly, make quality gates block pipelines — don't just log failures and hope someone reads the logs.
Your immediate action item: pick your three most business-critical tables and write five assertions each (uniqueness, completeness, freshness, referential integrity, volume). Deploy them tomorrow. You'll sleep better.
## Continue Reading
**Data Pipeline Monitoring: Catch Failures Before Users Do**
A practical guide to monitoring data pipelines — covering execution tracking, data quality checks, performance metrics, and schema change detection with runnable code examples.

**Data Lakehouse Architecture Explained**
How data lakehouse architecture works, when to use it over a warehouse or lake, and the common pitfalls that trip up data engineering teams.

**dbt vs Spark SQL: How to Choose**
dbt or Spark SQL for your transformation layer? A side-by-side comparison of features, pricing, and use cases — with code examples for both and honest trade-offs for analytics engineers.