SQL Anti-Patterns: Common Mistakes and How to Fix Them
A query that works is not the same as a query that's correct, maintainable, or fast. SQL anti-patterns are the recurring mistakes that slow down pipelines, produce wrong results, and make future engineers want to delete your work. Most of them are easy to avoid once you know what to look for.
1. SELECT * in Production Pipelines
SELECT * feels convenient. In a pipeline, it's a ticking time bomb.
The problem: When a source table adds a column, your downstream SELECT * picks it up automatically — potentially breaking transformations that expect a fixed schema, bloating storage, or passing sensitive columns downstream without intent.
-- ❌ Anti-pattern (PostgreSQL)
CREATE TABLE fct_orders AS
SELECT *
FROM raw.orders
JOIN raw.customers USING (customer_id);
-- ✅ Fix: explicit column list
CREATE TABLE fct_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_total,
    o.placed_at,
    c.email,
    c.country
FROM raw.orders o
JOIN raw.customers c USING (customer_id);
Exception: SELECT * is fine in ad-hoc exploration. Never in a dbt model, a view, or a scheduled pipeline.
2. Correlated Subqueries Instead of JOINs
A correlated subquery runs once per row in the outer query. On a table with 10 million rows, that's up to 10 million subquery executions (some planners can decorrelate simple cases, but don't count on it).
-- ❌ Anti-pattern (PostgreSQL) — correlated subquery, O(n) subquery executions
SELECT
    customer_id,
    total_spend,
    (SELECT AVG(total_spend) FROM orders) AS avg_spend,  -- OK, uncorrelated
    (SELECT MAX(placed_at)                               -- CORRELATED — runs per row
     FROM orders o2
     WHERE o2.customer_id = o1.customer_id) AS last_order_date
FROM orders o1;
-- ✅ Fix: window function
SELECT
    customer_id,
    total_spend,
    AVG(total_spend) OVER () AS avg_spend,
    MAX(placed_at) OVER (PARTITION BY customer_id) AS last_order_date
FROM orders;
When correlated subqueries are acceptable: EXISTS checks where early termination matters, or LATERAL joins in PostgreSQL for complex per-row logic. Measure first.
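As a sketch of those acceptable cases, using the same orders and customers tables (PostgreSQL; column names as above):

```sql
-- ✅ Acceptable: EXISTS can stop at the first matching row
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

-- ✅ Acceptable: LATERAL for per-row top-N logic
SELECT c.customer_id, recent.order_id, recent.placed_at
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_id, placed_at
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY placed_at DESC
    LIMIT 3
) recent;
```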
3. Implicit Type Conversions
When you compare a string column to a numeric literal, many databases silently convert one side to match the other. This often disables index usage and can produce surprising matches when the conversion is lossy (in MySQL, WHERE varchar_col = 0 matches a row containing 'abc', because 'abc' coerces to 0).
-- ❌ Anti-pattern (MySQL / SQL Server) — implicit conversion, index on status_code not used
SELECT * FROM orders WHERE status_code = 1;
-- status_code is VARCHAR — the COLUMN is cast to a number, so the index can't be used
-- (PostgreSQL rejects this comparison outright with an "operator does not exist" error)
-- ✅ Fix: match types explicitly
SELECT * FROM orders WHERE status_code = '1';
-- ❌ Anti-pattern — date as string, wrong results in some dialects
SELECT * FROM events WHERE event_date > '2026-01-01';
-- Works in PostgreSQL but silently fails in some MySQL configs
-- ✅ Fix: explicit cast
SELECT * FROM events WHERE event_date > DATE '2026-01-01'; -- PostgreSQL/BigQuery
-- or
SELECT * FROM events WHERE event_date > CAST('2026-01-01' AS DATE); -- universal
In Spark SQL, implicit type coercion is especially dangerous — in the default (non-ANSI) mode, Spark will try to cast between types and produce NULL instead of an error when the cast fails.
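A minimal illustration of the difference (Spark SQL in its default, non-ANSI mode):

```sql
-- Spark SQL: a failed cast silently yields NULL
SELECT CAST('abc' AS INT);   -- NULL in Spark's default mode
-- PostgreSQL raises an error for the same expression
-- (Spark 3.x can match that stricter behavior with spark.sql.ansi.enabled = true)
```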
4. NOT IN with NULLs
NOT IN with a subquery that can return NULL is one of the most common sources of silent data loss in SQL. If the subquery produces even one NULL, the outer query returns zero rows.
-- ❌ Anti-pattern (PostgreSQL) — returns 0 rows if cancelled_orders contains any NULL
SELECT order_id
FROM orders
WHERE order_id NOT IN (SELECT order_id FROM cancelled_orders);
-- If cancelled_orders has even one NULL order_id, this returns nothing.
-- SQL logic: NOT IN expands to order_id <> val1 AND order_id <> val2 AND ...
-- A comparison with NULL evaluates to UNKNOWN, so the whole predicate is UNKNOWN (never TRUE).
-- ✅ Fix: use NOT EXISTS
SELECT o.order_id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM cancelled_orders co WHERE co.order_id = o.order_id
);
-- ✅ Alternative: LEFT JOIN / IS NULL pattern
SELECT o.order_id
FROM orders o
LEFT JOIN cancelled_orders co ON o.order_id = co.order_id
WHERE co.order_id IS NULL;
Always use NOT EXISTS or a LEFT JOIN ... IS NULL when excluding rows based on a subquery that might contain NULLs.
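The trap is easy to reproduce with literals in any engine:

```sql
-- Both queries look like they should return a row; only the first does
SELECT 'kept' WHERE 1 NOT IN (2, 3);        -- one row
SELECT 'kept' WHERE 1 NOT IN (2, 3, NULL);  -- zero rows: 1 <> NULL is UNKNOWN
```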
5. Aggregating Before Joining
Joining large tables and then aggregating is usually slower than aggregating first and then joining, because far more rows flow through the join. Reordering changes the query's shape, though, so verify that the grouping keys still produce correct results.
-- ❌ Anti-pattern (Spark SQL) — join full tables, then aggregate (huge shuffle)
SELECT
    c.country,
    SUM(o.total_cents) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country;
-- ✅ Fix: aggregate orders first, then join (smaller table to join)
WITH order_agg AS (
    SELECT customer_id, SUM(total_cents) AS total_revenue
    FROM orders
    GROUP BY customer_id
)
SELECT
    c.country,
    SUM(oa.total_revenue) AS revenue
FROM order_agg oa
JOIN customers c ON oa.customer_id = c.customer_id
GROUP BY c.country;
The fix reduces the row count before the join. In Spark, this dramatically reduces shuffle volume. In a standard RDBMS, the query planner often does this automatically — but not always. Don't assume.
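To see what your planner actually chose, inspect the plan (PostgreSQL syntax shown; plan shapes vary by engine and version):

```sql
EXPLAIN
SELECT c.country, SUM(o.total_cents) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country;
-- Check whether the aggregate node sits below or above the join in the output
```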
6. Using HAVING Instead of WHERE for Non-Aggregate Filters
WHERE filters rows before aggregation. HAVING filters after. Using HAVING for non-aggregate predicates forces the database to aggregate all rows and then throw most of them away.
-- ❌ Anti-pattern (PostgreSQL) — country filter applied AFTER aggregation
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country
HAVING country = 'DE'; -- aggregates all countries, then filters to DE
-- ✅ Fix: filter with WHERE before aggregation
SELECT country, COUNT(*) AS customer_count
FROM customers
WHERE country = 'DE'
GROUP BY country;
HAVING is correct when the predicate involves an aggregate (HAVING COUNT(*) > 100). For plain column filters, always use WHERE.
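Written out, the legitimate case looks like this (same customers table as above):

```sql
-- ✅ Correct use of HAVING: the predicate needs the aggregate
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country
HAVING COUNT(*) > 100;  -- COUNT(*) doesn't exist yet at WHERE time
```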
7. Overusing Subqueries Instead of CTEs
Deep subquery nesting is unreadable and unmaintainable. Modern SQL engines (PostgreSQL, BigQuery, Snowflake, DuckDB, Spark SQL) all support CTEs — use them.
-- ❌ Anti-pattern — three levels of subquery nesting
SELECT *
FROM (
    SELECT customer_id, AVG(order_total) AS avg_order
    FROM (
        SELECT customer_id, order_id, SUM(line_total) AS order_total
        FROM (
            SELECT * FROM order_lines WHERE is_cancelled = FALSE
        ) active_lines
        GROUP BY customer_id, order_id
    ) order_totals
    GROUP BY customer_id
) customer_avgs
WHERE avg_order > 100;
-- ✅ Fix: CTEs for readability
-- DuckDB SQL
WITH active_lines AS (
    SELECT customer_id, order_id, line_total
    FROM order_lines
    WHERE is_cancelled = FALSE
),
order_totals AS (
    SELECT customer_id, order_id, SUM(line_total) AS order_total
    FROM active_lines
    GROUP BY customer_id, order_id
),
customer_avgs AS (
    SELECT customer_id, AVG(order_total) AS avg_order
    FROM order_totals
    GROUP BY customer_id
)
SELECT *
FROM customer_avgs
WHERE avg_order > 100;
CTEs are not just style. Each layer can be promoted to its own dbt model and tested independently, some engines can cache or materialize them, and they're understandable by the next engineer.
8. Non-SARGable Predicates
A predicate is SARGable (Search ARGument able) when the database can use an index to evaluate it. Wrapping a column in a function breaks SARGability.
-- ❌ Anti-pattern (PostgreSQL) — function on column, index on placed_at not used
SELECT * FROM orders WHERE EXTRACT(YEAR FROM placed_at) = 2026;
SELECT * FROM orders WHERE DATE_TRUNC('month', placed_at) = '2026-01-01';
SELECT * FROM orders WHERE UPPER(email) = 'ALICE@EXAMPLE.COM';
-- ✅ Fix: push the transform to the literal, not the column
SELECT * FROM orders
WHERE placed_at >= DATE '2026-01-01' AND placed_at < DATE '2027-01-01';
SELECT * FROM orders
WHERE placed_at >= DATE '2026-01-01' AND placed_at < DATE '2026-02-01';
SELECT * FROM orders WHERE email = LOWER('ALICE@EXAMPLE.COM');
-- or better: lowercase at write time, query lowercase column
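In PostgreSQL, an expression index is another way out: index the transformed value so the function-on-column form becomes SARGable (the index name here is illustrative):

```sql
-- ✅ PostgreSQL: index the expression itself
CREATE INDEX idx_orders_email_lower ON orders (LOWER(email));
-- This predicate can now use the index, because it matches the indexed expression:
SELECT * FROM orders WHERE LOWER(email) = 'alice@example.com';
```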
9. Forgetting NULL Semantics in Aggregations
COUNT(*) counts all rows. COUNT(column) counts only rows where the column is non-NULL. AVG(column) ignores NULLs (which is usually what you want, but sometimes isn't).
-- ❌ Misleading: COUNT(*) vs COUNT(column) confusion (PostgreSQL)
SELECT
    COUNT(*) AS total_rows,        -- 1000 (counts every row)
    COUNT(email) AS with_email,    -- 800  (NULL emails excluded)
    AVG(order_total) AS avg_total  -- average EXCLUDES rows with NULL order_total
FROM orders;
-- ✅ Be explicit about NULL handling
SELECT
    COUNT(*) AS total_rows,
    COUNT(email) AS rows_with_email,
    COUNT(*) FILTER (WHERE email IS NULL) AS rows_without_email, -- PostgreSQL syntax
    AVG(COALESCE(order_total, 0)) AS avg_total_incl_nulls_as_zero
FROM orders;
10. Joining on Non-Unique Keys Without Checking
If you join two tables on a key that isn't unique in both tables, you silently create a Cartesian product for matching rows. Revenue numbers explode. Duplicates appear everywhere.
-- ❌ Anti-pattern — promotions.customer_id is NOT unique (a customer can have many promos)
-- This join multiplies rows silently
SELECT o.order_id, o.total, p.promo_code
FROM orders o
JOIN promotions p ON o.customer_id = p.customer_id;
-- If a customer has 3 promotions, every order appears 3 times
-- ✅ Fix: validate uniqueness before joining
-- First, check:
SELECT customer_id, COUNT(*) FROM promotions GROUP BY 1 HAVING COUNT(*) > 1;
-- Then use the right join type (e.g., latest promo only):
WITH latest_promo AS (
    SELECT customer_id, promo_code,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM promotions
)
SELECT o.order_id, o.total, lp.promo_code
FROM orders o
LEFT JOIN latest_promo lp ON o.customer_id = lp.customer_id AND lp.rn = 1;
See Data Quality Testing for how to build automated checks that catch duplicate join keys before they reach production.
Quick Reference Cheat Sheet
| Anti-Pattern | Quick Fix |
|---|---|
| SELECT * in pipelines | Explicit column list |
| Correlated subquery | Window function or pre-aggregated JOIN |
| Implicit type cast | Match types or use explicit CAST |
| NOT IN with NULLs | Use NOT EXISTS instead |
| HAVING for non-aggregate filter | Move to WHERE clause |
| Nested subqueries | Refactor to CTEs |
| Function on column in WHERE | Push transform to literal |
| Ignoring NULL semantics | Use COUNT(col), COALESCE, FILTER |
| Fan-out join | Validate key uniqueness before joining |
A Note on Dialects
SQL is not one language — it's a family of dialects. DATE_TRUNC exists in PostgreSQL and BigQuery (with different argument orders); Spark SQL has DATE_TRUNC(fmt, timestamp) for timestamps and TRUNC(date, 'MONTH') for dates. FILTER (WHERE ...) works in PostgreSQL and DuckDB; BigQuery uses COUNTIF and Spark SQL uses COUNT_IF. Always specify your dialect when sharing SQL, and always test on the target engine.
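For example, the same conditional count in three dialects (using the orders table from earlier):

```sql
-- PostgreSQL / DuckDB
SELECT COUNT(*) FILTER (WHERE email IS NULL) AS missing_email FROM orders;
-- BigQuery
SELECT COUNTIF(email IS NULL) AS missing_email FROM orders;
-- Spark SQL
SELECT COUNT_IF(email IS NULL) AS missing_email FROM orders;
```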
If you want to explore and validate SQL across different data sources without spinning up infrastructure, Harbinger Explorer runs DuckDB SQL directly in the browser. It's useful for testing query logic on CSVs or API data before porting to your production warehouse.
Wrapping Up
Most SQL bugs are invisible — they don't throw errors, they just return wrong numbers. The anti-patterns above are especially dangerous because they work until they don't: the wrong row count only matters when someone actually checks the numbers.
Next step: Pick one query in your current pipeline and run it through this checklist. The NOT IN / NULL and fan-out join patterns are the highest-probability bugs to find in existing code.
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.