Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Your CRM has 847,000 customer records. Your analytics say revenue grew 40%. Your finance team says it grew 22%. The difference is duplicates — the same customer counted three times under slightly different names and email addresses. Data deduplication is the discipline that makes these numbers converge.
Three Categories of Deduplication
Not all duplicates are the same. The right strategy depends on why duplicates exist:
| Cause | Example | Best Approach |
|---|---|---|
| Exact technical duplicates | Same row inserted twice | Hash-based dedup |
| Typos, formatting differences | "John Smith" vs "Jon Smith" | Fuzzy matching |
| Cross-system entity merging | Same person in CRM + ERP | Record linkage |
Strategy 1 — Hash-Based Deduplication
Hash-based dedup is the fastest and most reliable approach for exact duplicates. You compute a deterministic hash of the fields that define uniqueness, then deduplicate on the hash.
This is the right tool when duplicates come from:
- Double-ingestion of the same file
- Retry logic without idempotency
- Multiple sources writing the same events
```python
# Python — hash-based deduplication with pandas
import hashlib

import pandas as pd

def hash_record(row: pd.Series, key_columns: list[str]) -> str:
    """Compute a SHA-256 hash of selected columns for deduplication."""
    composite_key = "|".join(
        str(row[col]).strip().lower() if pd.notna(row[col]) else ""
        for col in key_columns
    )
    return hashlib.sha256(composite_key.encode()).hexdigest()

# Load raw data with potential duplicates
df = pd.read_parquet("s3://raw/orders/2026/04/03/batch_*.parquet")

# Define uniqueness — what makes a record unique in your domain?
KEY_COLS = ["order_id", "customer_id", "placed_at", "total_cents"]
df["dedup_hash"] = df.apply(hash_record, axis=1, key_columns=KEY_COLS)

# Keep the first occurrence of each hash
deduped = df.drop_duplicates(subset=["dedup_hash"], keep="first")

removed = len(df) - len(deduped)
print(f"Removed {removed:,} duplicate rows ({removed / len(df):.1%})")
```
The same pattern in SQL uses a window function (shown in DuckDB; Snowflake also supports `SELECT * EXCLUDE`, while BigQuery spells it `SELECT * EXCEPT(...)`):
```sql
-- DuckDB SQL — hash-based dedup with a ROW_NUMBER window function
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id, customer_id, DATE_TRUNC('second', placed_at)
            ORDER BY ingested_at ASC  -- keep the earliest ingested copy
        ) AS row_num
    FROM raw.orders
)
SELECT * EXCLUDE (row_num)
FROM ranked
WHERE row_num = 1;
```
Pitfall: Be careful about what you include in the hash key. Including `ingested_at` or `file_path` will give every row a unique hash — you'll deduplicate nothing.
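To see the pitfall in action, here is a minimal sketch (plain Python, illustrative data) showing how the choice of key columns changes the outcome:

```python
import hashlib

def record_hash(record: dict, key_columns: list[str]) -> str:
    """Hash the selected columns into a composite dedup key."""
    composite = "|".join(str(record[c]).strip().lower() for c in key_columns)
    return hashlib.sha256(composite.encode()).hexdigest()

rows = [
    {"order_id": "A1", "ingested_at": "2026-04-03T09:00:00"},  # same order,
    {"order_id": "A1", "ingested_at": "2026-04-03T09:05:00"},  # ingested twice
]

# Business key only: both rows hash identically, so the duplicate is caught
business_hashes = {record_hash(r, ["order_id"]) for r in rows}
print(len(business_hashes))  # 1 distinct hash

# Including ingestion metadata: every row looks unique, nothing is removed
full_hashes = {record_hash(r, ["order_id", "ingested_at"]) for r in rows}
print(len(full_hashes))  # 2 distinct hashes
```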
Strategy 2 — Fuzzy Matching
Fuzzy matching handles duplicates where records represent the same entity but differ due to typos, abbreviations, formatting, or data entry inconsistencies. Classic example: names and addresses.
Common algorithms:
| Algorithm | What It Measures | Good For |
|---|---|---|
| Levenshtein distance | Edit distance (char inserts/deletes/substitutes) | Short strings, names |
| Jaro-Winkler | Transposition-weighted similarity, prefix bonus | Names, short text |
| Token sort ratio | Word-order-independent similarity | Addresses, company names |
| Soundex / Metaphone | Phonetic similarity | Names pronounced similarly |
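To make the first row of the table concrete, here is a minimal, unoptimized Levenshtein distance in plain Python (production code would reach for thefuzz, rapidfuzz, or jellyfish instead):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete from a
                current[j - 1] + 1,            # insert into a
                previous[j - 1] + (ca != cb),  # substitute
            ))
        previous = current
    return previous[-1]

print(levenshtein("john smith", "jon smith"))       # 1 (one dropped 'h')
print(levenshtein("john smith", "jonathan smith"))  # 4
```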
```python
# Python — fuzzy deduplication with thefuzz (formerly fuzzywuzzy)
import pandas as pd
from thefuzz import fuzz

customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005"],
    "name": ["John Smith", "Jon Smith", "Jonathan Smith", "Jane Doe", "J. Doe"],
    "email": ["john@acme.com", "jon@acme.com", "jonathan@acme.com", "jane@acme.com", "j.doe@acme.com"],
})

def find_fuzzy_duplicates(df: pd.DataFrame, name_col: str, threshold: int = 85) -> list[tuple]:
    """
    Find pairs of records with fuzzy name similarity above threshold.
    Returns a list of (idx_a, idx_b, similarity_score) tuples.
    """
    duplicates = []
    names = df[name_col].tolist()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # token_sort_ratio handles word order differences ("Smith John" vs "John Smith")
            score = fuzz.token_sort_ratio(names[i], names[j])
            if score >= threshold:
                duplicates.append((df.index[i], df.index[j], score))
    return duplicates

pairs = find_fuzzy_duplicates(customers, "name", threshold=85)
for idx_a, idx_b, score in pairs:
    print(f"[{score}%] '{customers.loc[idx_a, 'name']}' ≈ '{customers.loc[idx_b, 'name']}'")
# Example output (exact scores vary by backend and version):
# [95%] 'John Smith' ≈ 'Jon Smith'
# Note that 'John Smith' ≈ 'Jonathan Smith' scores around 83 and falls
# below this threshold — another reason to validate thresholds on real data.
```
Blocking for scale: Running every record against every other record is O(n²) — it doesn't scale past ~100k records. Use blocking to pre-filter candidates:
```python
# Only compare records within the same soundex block
# This reduces candidate pairs from n² to a manageable subset
import jellyfish

customers["soundex_block"] = customers["name"].apply(
    lambda x: jellyfish.soundex(x.split()[0])  # soundex of the first name
)

# Now compare only within blocks — reduces pairs dramatically
for block, group in customers.groupby("soundex_block"):
    if len(group) > 1:
        pairs = find_fuzzy_duplicates(group, "name", threshold=85)
        # process pairs within this block...
```
Pitfall: Fuzzy matching has no ground truth. A threshold of 85 that works for names might incorrectly merge "ACME Corp" and "ACME Labs". Always validate your threshold against labeled examples before running at scale.
Strategy 3 — Record Linkage
Record linkage (also called entity resolution) is the most sophisticated form of deduplication. It merges records from different datasets that refer to the same real-world entity, even when no common key exists.
Use cases:
- Merging a CRM and an ERP that have no shared customer ID
- Linking customer records across acquired companies
- Matching survey respondents to purchase history
The standard framework uses a comparison vector of multiple features:
```python
# Python — probabilistic record linkage with the recordlinkage library
import pandas as pd
import recordlinkage

# Two datasets: CRM and ERP customers, no shared ID
crm = pd.DataFrame({
    "name": ["Alice Johnson", "Bob Martinez", "Carol White"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "postcode": ["10115", "10117", "10119"],
})
erp = pd.DataFrame({
    "company_contact": ["A. Johnson", "Robert Martinez", "Carol White-Smith"],
    "contact_email": ["alice@example.com", "b.martinez@example.com", "carol@example.com"],
    "zip": ["10115", "10117", "10119"],
})

# Step 1: Generate candidate pairs (blocking on postcode to reduce n²)
indexer = recordlinkage.Index()
indexer.block("postcode", "zip")
candidate_pairs = indexer.index(crm, erp)

# Step 2: Compare feature by feature
comparer = recordlinkage.Compare()
comparer.string("name", "company_contact", method="jarowinkler", label="name_sim")
comparer.exact("email", "contact_email", label="email_exact")
comparer.exact("postcode", "zip", label="postcode_exact")
features = comparer.compute(candidate_pairs, crm, erp)

# Step 3: Classify matches (rule-based or ML classifier)
# Simple rule: email exact match = definite match
matches = features[features["email_exact"] == 1]
print(f"Matched {len(matches)} record pairs across CRM and ERP")
```
For large-scale record linkage, Splink — an open-source library developed at the UK Ministry of Justice — implements the probabilistic Fellegi-Sunter model and is designed to scale to very large datasets on backends such as Spark and DuckDB.
Choosing the Right Strategy
```
Are duplicates caused by double ingestion / retries?
└─ YES → Hash-based dedup (fast, exact)

Are duplicates caused by data entry variations (typos, abbreviations)?
└─ YES → Fuzzy matching (with blocking for scale)

Are you merging two separate datasets with no shared key?
└─ YES → Record linkage (probabilistic or ML-based)

Do you have a mix of all three?
└─ YES → Pipeline all three: hash first, then fuzzy, then linkage
```
Building a Deduplication Pipeline
A production deduplication pipeline typically chains strategies:
```
Raw Data
   │
   ▼
[1] Hash-based dedup (exact duplicates, fast)
   │
   ▼
[2] Standardization (lowercase names, normalize phone formats)
   │
   ▼
[3] Fuzzy matching within blocks (typos, abbreviations)
   │
   ▼
[4] Human review queue (low-confidence matches, 80-90% similarity)
   │
   ▼
[5] Golden record creation (merge surviving fields from matched records)
   │
   ▼
Deduplicated Output
```
The human review queue is non-negotiable for sensitive data. Automated deduplication at 99% accuracy on 1 million records still produces 10,000 wrong merges.
Common Pitfalls
Threshold tuning without labeled data: Setting a fuzzy match threshold without a labeled validation set is guesswork. Build a labeled sample of ~500 pairs (true matches and non-matches) and measure precision/recall at different thresholds.
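One way to run that sweep, sketched with a toy scoring function and three labeled pairs (in practice you would plug in `fuzz.token_sort_ratio` and your ~500-pair labeled sample):

```python
def sweep_thresholds(labeled_pairs, score_fn, thresholds=range(70, 100, 5)):
    """labeled_pairs: list of (text_a, text_b, is_true_match) tuples.
    score_fn: similarity function returning 0-100.
    Returns (threshold, precision, recall) for each candidate threshold."""
    results = []
    for t in thresholds:
        tp = fp = fn = 0
        for a, b, is_match in labeled_pairs:
            predicted = score_fn(a, b) >= t
            if predicted and is_match:
                tp += 1
            elif predicted and not is_match:
                fp += 1
            elif not predicted and is_match:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 1.0  # no predictions: vacuously precise
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((t, precision, recall))
    return results

# Toy stand-in scorer so the example is self-contained
def crude_score(a: str, b: str) -> int:
    a, b = a.lower(), b.lower()
    return 100 if a == b else 80 if a.split()[-1] == b.split()[-1] else 0

labeled = [
    ("John Smith", "Jon Smith", True),
    ("ACME Corp", "ACME Labs", False),
    ("Jane Doe", "Jane Doe", True),
]
for t, p, r in sweep_thresholds(labeled, crude_score):
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Pick the threshold where the precision/recall trade-off matches your tolerance for wrong merges versus missed duplicates.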
Not normalizing before comparing: "ACME Inc." and "acme inc" have edit distance > 0. Always normalize (lowercase, strip punctuation, expand common abbreviations) before fuzzy comparison.
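A minimal normalizer along those lines might look like this (the abbreviation map is illustrative; extend it for your own domain):

```python
import re

# Illustrative abbreviation map; extend for your own data
ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "st": "street"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace,
    and expand common abbreviations before fuzzy comparison."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes whitespace
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("ACME Inc."))  # acme incorporated
print(normalize("acme inc"))   # acme incorporated -> edit distance is now 0
```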
Merging the wrong way: When merging duplicate records into a "golden record," decide field-by-field which source is authoritative. Blindly taking the most recent value loses data.
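One way to encode field-by-field survivorship is an explicit precedence map per field; the sources and rules below are hypothetical examples:

```python
# Field-level survivorship rules: which source wins for each field.
# This precedence is illustrative; agree on it per field with your
# data owners rather than defaulting to "most recent wins" everywhere.
FIELD_PRECEDENCE = {
    "email": ["crm", "erp"],            # CRM email is verified at signup
    "postal_address": ["erp", "crm"],   # ERP address is used for billing
    "phone": ["crm", "erp"],
}

def build_golden_record(records_by_source: dict[str, dict]) -> dict:
    """Merge matched records into one golden record, taking each field
    from the highest-precedence source that has a non-empty value."""
    golden = {}
    for field, precedence in FIELD_PRECEDENCE.items():
        for source in precedence:
            value = records_by_source.get(source, {}).get(field)
            if value:  # skip None / empty string
                golden[field] = value
                break
    return golden

matched = {
    "crm": {"email": "alice@example.com", "postal_address": "", "phone": "555-0100"},
    "erp": {"email": "a.johnson@example.com", "postal_address": "10115 Berlin", "phone": None},
}
print(build_golden_record(matched))
# {'email': 'alice@example.com', 'postal_address': '10115 Berlin', 'phone': '555-0100'}
```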
Ignoring temporal duplicates: The same event can be legitimately duplicated if a customer does the same thing twice. Use timestamps and business context to distinguish true duplicates from repeated actions.
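A simple heuristic is a time window: identical events closer together than your typical retry interval are flagged as technical duplicates, while identical events further apart are kept as genuine repeats. The window size and event shape below are assumptions to adapt to your domain:

```python
from datetime import datetime, timedelta

# Identical (customer, sku, amount) events within this window are
# treated as retries; the 5-minute value is an illustrative guess.
DUPLICATE_WINDOW = timedelta(minutes=5)

def flag_temporal_duplicates(events: list[dict]) -> list[dict]:
    """Mark events that repeat an identical prior event within the window."""
    events = sorted(events, key=lambda e: (e["customer_id"], e["sku"], e["amount"], e["ts"]))
    last_seen = {}
    for e in events:
        key = (e["customer_id"], e["sku"], e["amount"])
        prev = last_seen.get(key)
        e["is_duplicate"] = prev is not None and e["ts"] - prev <= DUPLICATE_WINDOW
        last_seen[key] = e["ts"]
    return events

events = [
    {"customer_id": "C1", "sku": "X", "amount": 999, "ts": datetime(2026, 4, 3, 9, 0)},
    {"customer_id": "C1", "sku": "X", "amount": 999, "ts": datetime(2026, 4, 3, 9, 2)},   # retry
    {"customer_id": "C1", "sku": "X", "amount": 999, "ts": datetime(2026, 4, 3, 15, 0)},  # repeat buy
]
for e in flag_temporal_duplicates(events):
    print(e["ts"], "duplicate" if e["is_duplicate"] else "genuine")
```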
See also Surrogate vs Natural Keys and Data Testing Frameworks for complementary data quality patterns.
Wrapping Up
Deduplication is rarely a one-shot operation. As new data arrives, new duplicates are created. Build it as a pipeline, not a one-time cleanup script. Start with hash-based dedup to catch the easy ones, add fuzzy matching for entity resolution within a dataset, and bring in probabilistic record linkage when merging across systems.
Next step: Query your largest customer or entity table and count records per email domain — duplicate email patterns are usually the fastest way to quantify your deduplication problem.
Continue Reading
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.
Data Contracts for Teams
A practical guide to data contracts: schema agreements between producers and consumers, with YAML examples, Schema Registry, and dbt enforcement.