Harbinger Explorer

Data Quality Testing: A Practical Guide for Data Engineers

8 min read · Tags: data quality, testing, data engineering, dbt, great expectations, python, sql, data pipelines

Bad data costs more than bad code. A broken pipeline throws an error and someone fixes it. Bad data silently corrupts dashboards, misleads stakeholders, and erodes trust in your entire data platform — sometimes for weeks before anyone notices.

Data quality testing is the practice of systematically validating that your data meets defined expectations at every stage of your pipeline. If you're running production data workloads without it, you're flying blind.

Why Data Quality Testing Matters

Here's a scenario every data engineer has lived through: a source API changes its response schema overnight. No error. Your pipeline ingests the data, transforms it, loads it into the warehouse. Three days later, a VP asks why revenue numbers dropped 40%. The answer? A field renamed from total_amount to amount_total, and your transformation silently produced nulls.

Data quality testing catches this on ingestion — not in a board meeting.

The Cost of Skipping Quality Tests

| Risk | Impact | Detection Without Tests |
| --- | --- | --- |
| Schema drift | Null columns, type errors | Days to weeks |
| Duplicate records | Inflated metrics | Often never caught |
| Missing data | Incomplete aggregations | Next reporting cycle |
| Referential integrity breaks | Orphaned records, broken joins | Downstream failures |
| Value range violations | Nonsensical KPIs | Stakeholder complaints |

Core Dimensions of Data Quality

Before writing tests, you need a framework for what to test. These six dimensions cover most real-world scenarios:

| Dimension | Question It Answers | Example Test |
| --- | --- | --- |
| Completeness | Is all expected data present? | NOT NULL checks, row count thresholds |
| Uniqueness | Are there duplicates? | Primary key uniqueness |
| Validity | Do values fall within expected ranges? | `status IN ('active', 'inactive', 'pending')` |
| Accuracy | Does the data reflect reality? | Cross-source reconciliation |
| Consistency | Do related datasets agree? | `orders.customer_id` exists in `customers.id` |
| Timeliness | Is the data fresh enough? | `MAX(updated_at) > NOW() - INTERVAL '2 hours'` |

Building a Data Quality Testing Strategy

Layer 1: Schema Validation (Ingestion)

Catch structural problems before data enters your pipeline. This is your cheapest line of defense.

```python
# Python — Schema validation with Pydantic
from pydantic import BaseModel, field_validator
from datetime import datetime

class OrderRecord(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    currency: str
    created_at: datetime
    status: str

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"amount must be positive, got {v}")
        return v

    @field_validator("currency")
    @classmethod
    def currency_must_be_valid(cls, v: str) -> str:
        valid = {"EUR", "USD", "GBP", "CHF"}
        if v not in valid:
            raise ValueError(f"unexpected currency: {v}")
        return v

# Validate each record at ingestion
def validate_batch(records: list[dict]) -> tuple[list, list]:
    valid, invalid = [], []
    for record in records:
        try:
            validated = OrderRecord(**record)
            valid.append(validated.model_dump())
        except Exception as e:
            invalid.append({"record": record, "error": str(e)})
    return valid, invalid
```

This catches type mismatches, missing fields, and out-of-range values before they hit your warehouse.

Layer 2: SQL-Based Assertions (Transformation)

After your data lands, run assertions against the transformed tables. These are your core quality gates.

```sql
-- PostgreSQL — Common data quality assertions

-- 1. Uniqueness: No duplicate order IDs
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Expected: 0 rows

-- 2. Completeness: No null customer references
SELECT COUNT(*) AS null_customers
FROM orders
WHERE customer_id IS NULL;
-- Expected: 0

-- 3. Referential integrity: Every order references a valid customer
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
-- Expected: 0 rows

-- 4. Freshness: Data updated within last 2 hours
SELECT
    CASE
        WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
        THEN 'STALE'
        ELSE 'FRESH'
    END AS freshness_status
FROM orders;
-- Expected: FRESH

-- 5. Volume: Row count within expected range
SELECT
    CASE
        WHEN COUNT(*) BETWEEN 1000 AND 100000
        THEN 'OK'
        ELSE 'ANOMALY'
    END AS volume_check
FROM daily_orders
WHERE order_date = CURRENT_DATE;
```

Layer 3: Statistical Checks (Aggregation)

For metrics tables and aggregations, go beyond row-level checks. Look for distribution anomalies.

```python
# Python — Statistical quality checks with pandas
import pandas as pd

def check_metric_quality(
    current: pd.DataFrame,
    historical: pd.DataFrame,
    metric_col: str,
    z_threshold: float = 3.0
) -> dict:
    """Flag metrics that deviate significantly from historical norms."""

    hist_mean = historical[metric_col].mean()
    hist_std = historical[metric_col].std()
    current_val = current[metric_col].sum()

    z_score = (current_val - hist_mean) / hist_std if hist_std > 0 else 0

    return {
        "metric": metric_col,
        "current_value": current_val,
        "historical_mean": round(hist_mean, 2),
        "z_score": round(z_score, 2),
        "status": "ANOMALY" if abs(z_score) > z_threshold else "OK",
    }

# Usage
# results = check_metric_quality(today_df, last_30_days_df, "revenue")
# if results["status"] == "ANOMALY":
#     alert_on_call_engineer(results)
```

Tooling: What to Use

Several frameworks make data quality testing easier. Here's an honest comparison:

| Tool | Best For | Trade-off |
| --- | --- | --- |
| dbt tests | SQL-native teams already using dbt | Limited to SQL, tied to dbt ecosystem |
| Great Expectations | Python-heavy pipelines, rich assertions | Steep learning curve, heavy config |
| Soda | Multi-warehouse checks, SodaCL syntax | Newer, smaller community |
| Elementary | dbt users wanting anomaly detection | dbt-only, adds dashboard overhead |
| Custom scripts | Full control, unusual data sources | Maintenance burden, no standardization |

My take: if you're using dbt, start with built-in tests and add Great Expectations for anything dbt can't express. If you're not on dbt, Soda's YAML-based approach has the gentlest learning curve.

Common Mistakes and Pitfalls

1. Testing only in production. If your quality tests run after data is already in the warehouse, you've already served bad data to downstream consumers. Test at ingestion boundaries.

2. Alert fatigue. Hundreds of low-priority test failures train your team to ignore alerts. Tier your tests: P0 (blocks pipeline), P1 (needs same-day fix), P2 (tracked in backlog).

3. Hardcoded thresholds. "Row count must be > 10,000" breaks the first time your business has a slow day. Use rolling averages and percentage-based thresholds instead.
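
To make the rolling-average idea concrete, here's a minimal sketch. The seven-day window and 30% tolerance are illustrative assumptions, not recommendations — tune them to your data's natural variance:

```python
# Percentage-based volume check against a rolling baseline (sketch).
# The tolerance and window size here are illustrative assumptions.
from statistics import mean

def volume_check(today_count: int, recent_counts: list[int],
                 tolerance: float = 0.3) -> str:
    """Flag today's row count if it deviates from the recent mean."""
    baseline = mean(recent_counts)
    deviation = abs(today_count - baseline) / baseline
    return "ANOMALY" if deviation > tolerance else "OK"

# A 40% drop against a stable 7-day baseline trips the check,
# while an ordinary slow day does not:
# volume_check(600, [1000, 990, 1010, 1005, 995, 1000, 1002])
```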

4. No quarantine strategy. When a test fails, what happens to the data? If you don't have a dead-letter queue or quarantine table, failed records just vanish — and nobody knows.
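
As a sketch of the quarantine side, assuming failed records arrive as `{"record": ..., "error": ...}` dicts (the shape produced by `validate_batch` earlier). In production the sink would be a dead-letter table or queue rather than an in-memory list, and `source` is a hypothetical label for where the batch came from:

```python
# Quarantine sketch: preserve failed records with error context
# instead of silently dropping them.
from datetime import datetime, timezone

def quarantine_failed(invalid: list[dict], sink: list[dict],
                      source: str = "orders_api") -> int:
    """Write failed records to a quarantine sink; return how many."""
    for item in invalid:
        sink.append({
            "source": source,
            "record": item["record"],
            "error": item["error"],
            "quarantined_at": datetime.now(timezone.utc).isoformat(),
        })
    return len(invalid)
```

With the timestamp and error attached, someone can later replay or inspect quarantined records instead of discovering a silent gap.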

5. Ignoring test maintenance. Data quality tests need maintenance like any other code. When schemas evolve, tests must follow. Schedule quarterly reviews of your test suite.

Implementing a Quality Gate Pattern

The most effective pattern I've seen in production: treat quality tests as pipeline gates that block downstream processing.

# Python — Quality gate pattern (framework-agnostic)

from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # Blocks pipeline
    WARNING = "warning"    # Logs alert, continues
    INFO = "info"          # Logs only

@dataclass
class TestResult:
    name: str
    passed: bool
    severity: Severity
    details: str = ""

def run_quality_gate(results: list[TestResult]) -> bool:
    """Returns True if pipeline should proceed."""
    critical_failures = [
        r for r in results
        if not r.passed and r.severity == Severity.CRITICAL
    ]

    warnings = [
        r for r in results
        if not r.passed and r.severity == Severity.WARNING
    ]

    for w in warnings:
        print(f"⚠️  WARNING: {w.name} — {w.details}")

    if critical_failures:
        for f in critical_failures:
            print(f"🚨 CRITICAL: {f.name} — {f.details}")
        print(f"Pipeline blocked: {len(critical_failures)} critical failures")
        return False

    print(f"✅ Quality gate passed ({len(warnings)} warnings)")
    return True

# Example usage in an Airflow task or pipeline step:
# results = [
#     TestResult("pk_uniqueness", True, Severity.CRITICAL),
#     TestResult("freshness_check", False, Severity.WARNING, "Data is 3h old"),
#     TestResult("null_check_email", False, Severity.CRITICAL, "42 null emails"),
# ]
# if not run_quality_gate(results):
#     raise AirflowFailException("Quality gate failed")

Where Harbinger Explorer Fits

If you're testing quality on data pulled from APIs — which is where a lot of ingestion issues originate — Harbinger Explorer lets you query API responses directly in your browser using DuckDB WASM. You can run SQL assertions against raw API data before it enters your pipeline, catching schema drift and value anomalies at the source without writing a single ingestion script first.

Next Steps

Data quality testing isn't a one-time project. Start with the highest-impact layer: schema validation at ingestion. Add SQL assertions for your most critical tables. Build statistical checks for your key metrics. Most importantly, make quality gates block pipelines — don't just log failures and hope someone reads the logs.

Your immediate action item: pick your three most business-critical tables and write five assertions each (uniqueness, completeness, freshness, referential integrity, volume). Deploy them tomorrow. You'll sleep better.
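
A minimal way to start is a plain assertion runner: each check is a SQL query that should return zero rows. The sketch below uses an in-memory SQLite table with toy data purely for illustration; in practice you'd point the connection at your warehouse and load the checks from config:

```python
# Minimal assertion runner (sketch): a check passes when its query
# returns no rows. SQLite and the toy data are for illustration only.
import sqlite3

CHECKS = {
    "pk_uniqueness":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    "no_null_customers":
        "SELECT order_id FROM orders WHERE customer_id IS NULL",
}

def run_checks(conn: sqlite3.Connection) -> dict:
    """Return {check_name: passed} for every registered check."""
    return {
        name: conn.execute(sql).fetchone() is None
        for name, sql in CHECKS.items()
    }

# Demo with deliberately dirty toy data: a duplicate key and a null reference
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("o1", "c1"), ("o2", None), ("o1", "c3")])
results = run_checks(conn)
# results -> {"pk_uniqueness": False, "no_null_customers": False}
```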
