Data Strategy

5 Data Quality Checks Every API Pipeline Needs

4 min read · Tags: Data Quality, Pipeline, Best Practices

You have connected an API, data is flowing, and your dashboard looks great. Then one morning, half your charts are blank because a field was renamed upstream. Sound familiar?

Data quality in API pipelines requires proactive checks, not reactive firefighting. Here are the five checks every pipeline should implement from day one.

Check 1: Schema Validation

The most common cause of pipeline failures is schema drift. APIs change their response structure without notice — fields get renamed, types change, new required fields appear.

What to check:

  • Field names match your expected schema
  • Data types are consistent (a "price" field should always be numeric)
  • Required fields are present and non-null
  • No unexpected new fields that might indicate a breaking change

-- Find rows where an expected numeric field contains non-numeric values
SELECT COUNT(*) as invalid_rows
FROM api_data
WHERE TRY_CAST(price AS DOUBLE) IS NULL
  AND price IS NOT NULL;

Automated approach: Define a schema contract (JSON Schema or a simple type map) and validate every API response against it before loading.
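As a sketch of that contract check in pipeline code (the field names and type map below are illustrative, not taken from any real API):

```python
# Minimal schema-contract check: validate each API record against an
# expected type map before loading. The schema here is a made-up example.
EXPECTED_SCHEMA = {
    "id": str,
    "price": (int, float),
    "updated_at": str,
}

def validate_record(record: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Flag unexpected new fields -- often the first sign of a breaking change
    errors.extend(f"unexpected field: {f}" for f in record.keys() - schema.keys())
    return errors
```

Run it on every response before the load step; a non-empty result should quarantine the batch rather than silently load it.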

Check 2: Freshness Monitoring

Stale data is worse than no data because it creates false confidence. If your API stops returning new records, you need to know immediately.

What to check:

  • Last successful API call timestamp
  • Most recent record timestamp in the data
  • Gap between "now" and the latest data point

-- Check data freshness
SELECT
  MAX(updated_at) as latest_record,
  NOW() - MAX(updated_at) as staleness,
  CASE
    WHEN NOW() - MAX(updated_at) > INTERVAL '24 hours' THEN 'STALE'
    WHEN NOW() - MAX(updated_at) > INTERVAL '6 hours' THEN 'WARNING'
    ELSE 'FRESH'
  END as status
FROM api_data;

Rule of thumb: Set alerting thresholds at 2x your expected update frequency. If data should update hourly, alert after 2 hours of staleness.
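The 2x rule translates directly into pipeline code. A minimal sketch (the function name and status labels mirror the SQL above but are otherwise illustrative):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(latest_record, expected_frequency, now=None):
    """Classify staleness using the 2x-expected-frequency rule of thumb.

    latest_record      -- timezone-aware datetime of the newest data point
    expected_frequency -- timedelta between expected updates
    """
    now = now or datetime.now(timezone.utc)
    staleness = now - latest_record
    if staleness > 2 * expected_frequency:
        return "STALE"      # alert: more than one full cycle overdue
    if staleness > expected_frequency:
        return "WARNING"    # one expected update has been missed
    return "FRESH"
```

For hourly data this alerts at the 2-hour mark, exactly as the rule of thumb prescribes.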

Check 3: Completeness (NULL Rate)

APIs frequently return partial data. A record might exist but have critical fields missing. Tracking NULL rates per column over time reveals degradation patterns.

What to check:

  • NULL percentage per column
  • Trend over time (is completeness getting worse?)
  • Columns that should never be NULL

-- Completeness report per column
SELECT
  COUNT(*) as total_rows,
  ROUND(100.0 * COUNT(email) / COUNT(*), 1) as email_pct,
  ROUND(100.0 * COUNT(company) / COUNT(*), 1) as company_pct,
  ROUND(100.0 * COUNT(revenue) / COUNT(*), 1) as revenue_pct
FROM api_contacts;

Threshold: For most analytics use cases, columns with >20% NULL rate deserve investigation. Critical columns (IDs, timestamps) should be at 100%.
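Computed pipeline-side, the same report might look like this minimal sketch (a missing key is treated the same as an explicit NULL):

```python
def null_rates(records):
    """Percentage of NULL (None or absent) values per column, rounded to 0.1%."""
    if not records:
        return {}
    # Union of all keys, since partial records may omit columns entirely
    columns = set().union(*(r.keys() for r in records))
    return {
        col: round(100.0 * sum(r.get(col) is None for r in records) / len(records), 1)
        for col in columns
    }
```

Log the result per ingestion batch; comparing batches over time is what reveals the degradation trend the article describes.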

Check 4: Uniqueness (Duplicate Detection)

Duplicate records corrupt aggregations and inflate metrics. Paginated APIs are especially prone to this: when records shift position between page requests, the same record can appear on two pages.

What to check:

  • Primary key uniqueness
  • Near-duplicates (same name, different ID)
  • Duplicate rate over time

-- Find duplicate records by primary key
SELECT id, COUNT(*) as occurrences
FROM api_data
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;

Prevention: Always deduplicate during ingestion. Use the API's unique identifier as your primary key and implement upsert logic.
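A minimal sketch of that upsert-style deduplication, keyed on the API's unique identifier (last record wins, so a re-fetched record overwrites the earlier copy rather than duplicating it):

```python
def dedupe(records, key="id"):
    """Keep one record per key -- upsert semantics in miniature.

    Later occurrences overwrite earlier ones, so a record returned on
    two pages of a paginated API ends up in the output exactly once.
    """
    by_key = {}
    for record in records:
        by_key[record[key]] = record   # later record wins
    return list(by_key.values())
```

The same idea applies at the warehouse layer as `INSERT ... ON CONFLICT DO UPDATE` (or your dialect's MERGE equivalent).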

Check 5: Range and Distribution Analysis

Values that fall outside expected ranges often indicate upstream bugs. Percentage fields with values over 100, negative prices, and dates in the year 2099 are all real examples we have seen.

What to check:

  • Min/max values for numeric fields
  • Value distribution (are 80% of records in one category?)
  • Statistical outliers (values >3 standard deviations from the mean)

-- Distribution overview
SELECT
  MIN(amount) as min_amount,
  MAX(amount) as max_amount,
  AVG(amount) as avg_amount,
  STDDEV(amount) as stddev_amount,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median
FROM transactions;
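The 3-standard-deviation check from the list above can be sketched in pipeline code as follows (the function name and default threshold are illustrative):

```python
from statistics import mean, stdev

def sigma_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []            # stdev is undefined for fewer than two points
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []            # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

Note that a 3-sigma rule assumes a roughly normal distribution; for heavily skewed fields like revenue, a percentile-based cutoff is often a better fit.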

How Harbinger Explorer Helps

Harbinger Explorer calculates a Quality Score for every source in your catalog. The score combines:

  • Schema match percentage
  • Freshness status (based on last crawl)
  • Completeness per column
  • Duplicate detection
  • Value range validation

You can see the score on your dashboard and drill into specific issues. When quality drops below your threshold, you know exactly which check failed and why.

Start Simple, Iterate

You do not need a complex data quality framework on day one. Start with these five checks, automate them, and add sophistication over time. The goal is catching problems before they reach your stakeholders — not after.


Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial