
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage

10 min read · Tags: data-quality, deduplication, fuzzy-matching, record-linkage, data-engineering, entity-resolution

Your CRM has 847,000 customer records. Your analytics say revenue grew 40%. Your finance team says it grew 22%. The difference is duplicates — the same customer counted three times under slightly different names and email addresses. Data deduplication is the discipline that makes these numbers converge.

Three Categories of Deduplication

Not all duplicates are the same. The right strategy depends on why duplicates exist:

| Cause | Example | Best Approach |
| --- | --- | --- |
| Exact technical duplicates | Same row inserted twice | Hash-based dedup |
| Typos, formatting differences | "John Smith" vs "Jon Smith" | Fuzzy matching |
| Cross-system entity merging | Same person in CRM + ERP | Record linkage |

Strategy 1 — Hash-Based Deduplication

Hash-based dedup is the fastest and most reliable approach for exact duplicates. You compute a deterministic hash of the fields that define uniqueness, then deduplicate on the hash.

This is the right tool when duplicates come from:

  • Double-ingestion of the same file
  • Retry logic without idempotency
  • Multiple sources writing the same events
# Python — hash-based deduplication with pandas
import pandas as pd
import hashlib

def hash_record(row: pd.Series, key_columns: list[str]) -> str:
    '''Compute SHA-256 hash of selected columns for deduplication.'''
    composite_key = "|".join(
        str(row[col]).strip().lower() if pd.notna(row[col]) else ""
        for col in key_columns
    )
    return hashlib.sha256(composite_key.encode()).hexdigest()

# Load raw data with potential duplicates
df = pd.read_parquet("s3://raw/orders/2026/04/03/batch_*.parquet")

# Define uniqueness — what makes a record unique in your domain?
KEY_COLS = ["order_id", "customer_id", "placed_at", "total_cents"]

df["dedup_hash"] = df.apply(hash_record, axis=1, key_columns=KEY_COLS)

# Keep first occurrence of each hash
deduped = df.drop_duplicates(subset=["dedup_hash"], keep="first")

print(f"Removed {len(df) - len(deduped):,} duplicate rows ({(len(df) - len(deduped)) / len(df):.1%})")

In SQL (DuckDB / BigQuery / Snowflake):

-- DuckDB SQL — hash-based dedup with ROW_NUMBER window function
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id, customer_id, DATE_TRUNC('second', placed_at)
            ORDER BY ingested_at ASC   -- keep earliest ingested copy
        ) AS row_num
    FROM raw.orders
)
SELECT * EXCLUDE (row_num)
FROM ranked
WHERE row_num = 1;

Pitfall: Be careful about what you include in the hash key. Including ingested_at or file_path will give every row a unique hash — you'll deduplicate nothing.
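To see the pitfall concretely, here is a small sketch (hash_key is a hypothetical helper mirroring the hash_record logic above):

```python
import hashlib

def hash_key(record: dict, key_columns: list[str]) -> str:
    """Hash only the listed columns, as in the dedup logic above."""
    composite = "|".join(
        str(record.get(col, "")).strip().lower() for col in key_columns
    )
    return hashlib.sha256(composite.encode()).hexdigest()

# The same order, ingested twice with different pipeline metadata
a = {"order_id": "O1", "customer_id": "C1", "ingested_at": "2026-04-03T10:00:00"}
b = {"order_id": "O1", "customer_id": "C1", "ingested_at": "2026-04-03T10:05:00"}

# Business keys only: both copies collide on the hash, so dedup works
print(hash_key(a, ["order_id", "customer_id"])
      == hash_key(b, ["order_id", "customer_id"]))  # True

# Include ingested_at and every row is "unique" -- nothing gets removed
print(hash_key(a, ["order_id", "customer_id", "ingested_at"])
      == hash_key(b, ["order_id", "customer_id", "ingested_at"]))  # False
```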

Strategy 2 — Fuzzy Matching

Fuzzy matching handles duplicates where records represent the same entity but differ due to typos, abbreviations, formatting, or data entry inconsistencies. Classic example: names and addresses.

Common algorithms:

| Algorithm | What It Measures | Good For |
| --- | --- | --- |
| Levenshtein distance | Edit distance (char inserts/deletes/substitutes) | Short strings, names |
| Jaro-Winkler | Transposition-weighted similarity, prefix bonus | Names, short text |
| Token sort ratio | Word-order-independent similarity | Addresses, company names |
| Soundex / Metaphone | Phonetic similarity | Names pronounced similarly |

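To make the edit-distance row concrete, here is a minimal pure-Python Levenshtein. This is an illustration of the algorithm only; in practice use a library such as jellyfish or rapidfuzz:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep b as the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free on match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("John Smith", "Jon Smith"))  # 1 -- drop the 'h'
print(levenshtein("ACME Corp", "ACME Labs"))   # 4 -- every char of the last word
```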
# Python — fuzzy deduplication with thefuzz (formerly fuzzywuzzy)
import pandas as pd
from thefuzz import fuzz

customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005"],
    "name": ["John Smith", "Jon Smith", "Jonathan Smith", "Jane Doe", "J. Doe"],
    "email": ["john@acme.com", "jon@acme.com", "jonathan@acme.com", "jane@acme.com", "j.doe@acme.com"],
})

def find_fuzzy_duplicates(df: pd.DataFrame, name_col: str, threshold: int = 85) -> list[tuple]:
    '''
    Find pairs of records with fuzzy name similarity above threshold.
    Returns list of (idx_a, idx_b, similarity_score) tuples.
    '''
    duplicates = []
    names = df[name_col].tolist()

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # token_sort_ratio handles word order differences ("Smith John" vs "John Smith")
            score = fuzz.token_sort_ratio(names[i], names[j])
            if score >= threshold:
                duplicates.append((df.index[i], df.index[j], score))

    return duplicates

pairs = find_fuzzy_duplicates(customers, "name", threshold=85)
for idx_a, idx_b, score in pairs:
    print(f"[{score}%] '{customers.loc[idx_a, 'name']}' ≈ '{customers.loc[idx_b, 'name']}'")

# Expected output — only one pair clears the 85 threshold
# (exact scores vary slightly by library version):
# [95%] 'John Smith' ≈ 'Jon Smith'

Blocking for scale: Running every record against every other record is O(n²) — it doesn't scale past ~100k records. Use blocking to pre-filter candidates:

# Only compare records within the same name soundex block
# This reduces candidate pairs from n² to a manageable subset
import jellyfish

customers["soundex_block"] = customers["name"].apply(
    lambda x: jellyfish.soundex(x.split()[0])  # soundex of first name
)

# Now compare only within blocks — reduces pairs dramatically
for block, group in customers.groupby("soundex_block"):
    if len(group) > 1:
        pairs = find_fuzzy_duplicates(group, "name", threshold=85)
        # process pairs within this block...

Pitfall: Fuzzy matching has no ground truth. A threshold of 85 that works for names might incorrectly merge "ACME Corp" and "ACME Labs". Always validate your threshold against labeled examples before running at scale.

Strategy 3 — Record Linkage

Record linkage (also called entity resolution) is the most sophisticated form of deduplication. It merges records from different datasets that refer to the same real-world entity, even when no common key exists.

Use cases:

  • Merging a CRM and an ERP that have no shared customer ID
  • Linking customer records across acquired companies
  • Matching survey respondents to purchase history

The standard framework uses a comparison vector of multiple features:

# Python — probabilistic record linkage with recordlinkage library
import pandas as pd
import recordlinkage

# Two datasets: CRM and ERP customers, no shared ID
crm = pd.DataFrame({
    "name": ["Alice Johnson", "Bob Martinez", "Carol White"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "postcode": ["10115", "10117", "10119"],
})

erp = pd.DataFrame({
    "company_contact": ["A. Johnson", "Robert Martinez", "Carol White-Smith"],
    "contact_email": ["alice@example.com", "b.martinez@example.com", "carol@example.com"],
    "zip": ["10115", "10117", "10119"],
})

# Step 1: Generate candidate pairs (blocking on postcode to reduce n²)
indexer = recordlinkage.Index()
indexer.block("postcode", "zip")
candidate_pairs = indexer.index(crm, erp)

# Step 2: Compare feature by feature
comparer = recordlinkage.Compare()
comparer.string("name", "company_contact", method="jarowinkler", label="name_sim")
comparer.exact("email", "contact_email", label="email_exact")
comparer.exact("postcode", "zip", label="postcode_exact")

features = comparer.compute(candidate_pairs, crm, erp)

# Step 3: Classify matches (rule-based or ML classifier)
# Simple rule: email exact match = definite match
matches = features[features["email_exact"] == 1]
print(f"Matched {len(matches)} record pairs across CRM and ERP")

For large-scale record linkage, libraries like Splink (built by the UK Ministry of Justice) implement the probabilistic Fellegi-Sunter model and scale to hundreds of millions of records on Spark or DuckDB backends.

Choosing the Right Strategy

Are duplicates caused by double ingestion / retries?
  └─ YES → Hash-based dedup (fast, exact)

Are duplicates caused by data entry variations (typos, abbreviations)?
  └─ YES → Fuzzy matching (with blocking for scale)

Are you merging two separate datasets with no shared key?
  └─ YES → Record linkage (probabilistic or ML-based)

Do you have a mix of all three?
  └─ YES → Pipeline all three: hash first, then fuzzy, then linkage

Building a Deduplication Pipeline

A production deduplication pipeline typically chains strategies:

Raw Data
   │
   ▼
[1] Hash-based dedup (exact duplicates, fast)
   │
   ▼
[2] Standardization (lowercase names, normalize phone formats)
   │
   ▼
[3] Fuzzy matching within blocks (typos, abbreviations)
   │
   ▼
[4] Human review queue (low-confidence matches, 80-90% similarity)
   │
   ▼
[5] Golden record creation (merge surviving fields from matched records)
   │
   ▼
Deduplicated Output

The human review queue is non-negotiable for sensitive data. Automated deduplication at 99% accuracy on 1 million records still produces 10,000 wrong merges.
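The review step can be sketched as confidence-band routing; the 0.80/0.90 cutoffs below are illustrative, not recommendations:

```python
def route_match(score: float) -> str:
    """Route a candidate pair by similarity score (0.0-1.0)."""
    if score >= 0.90:
        return "auto_merge"    # high confidence: merge automatically
    if score >= 0.80:
        return "human_review"  # gray zone: queue for a person
    return "no_match"          # low confidence: leave records separate

candidates = [("C001", "C002", 0.95), ("C001", "C003", 0.84), ("C004", "C005", 0.62)]
for a, b, score in candidates:
    print(a, b, route_match(score))
```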

Common Pitfalls

Threshold tuning without labeled data: Setting a fuzzy match threshold without a labeled validation set is guesswork. Build a labeled sample of ~500 pairs (true matches and non-matches) and measure precision/recall at different thresholds.

Not normalizing before comparing: "ACME Inc." and "acme inc" have edit distance > 0. Always normalize (lowercase, strip punctuation, expand common abbreviations) before fuzzy comparison.
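A minimal normalizer along those lines (the abbreviation map is a hypothetical starting point; extend it for your domain):

```python
import re

# Hypothetical abbreviation expansions -- grow this list per domain
ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "st": "street"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, expand abbreviations."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> space
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("ACME Inc."))  # acme incorporated
print(normalize("acme inc"))   # acme incorporated -> now identical
```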

Merging the wrong way: When merging duplicate records into a "golden record," decide field-by-field which source is authoritative. Blindly taking the most recent value loses data.
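Field-level survivorship can be expressed as a per-field source-priority map. A sketch, with the priorities as assumed business rules:

```python
# Which source wins for each field: a business decision, not a technical one
FIELD_PRIORITY = {
    "email": ["crm", "erp"],             # CRM owns contact data
    "billing_address": ["erp", "crm"],   # ERP owns billing data
}

def golden_record(records: dict[str, dict]) -> dict:
    """Merge records keyed by source name, field by field."""
    merged = {}
    for field, sources in FIELD_PRIORITY.items():
        for source in sources:
            value = records.get(source, {}).get(field)
            if value:  # take the first non-empty value in priority order
                merged[field] = value
                break
    return merged

crm_rec = {"email": "alice@example.com", "billing_address": ""}
erp_rec = {"email": "a.johnson@old.example.com", "billing_address": "10115 Berlin"}
print(golden_record({"crm": crm_rec, "erp": erp_rec}))
# {'email': 'alice@example.com', 'billing_address': '10115 Berlin'}
```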

Ignoring temporal duplicates: The same event can be legitimately duplicated if a customer does the same thing twice. Use timestamps and business context to distinguish true duplicates from repeated actions.
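One way to apply that business context is a time-window rule: identical events inside a short window count as retries, outside it as genuine repeats (the 5-minute window below is an assumption):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed retry window

def dedupe_events(events: list[tuple[str, datetime]]) -> list[tuple[str, datetime]]:
    """Drop events with the same key that land within WINDOW of the kept copy."""
    kept, last_seen = [], {}
    for key, ts in sorted(events, key=lambda e: e[1]):
        if key in last_seen and ts - last_seen[key] <= WINDOW:
            continue  # likely a retry of the same action
        last_seen[key] = ts
        kept.append((key, ts))
    return kept

events = [
    ("C001:add_to_cart", datetime(2026, 4, 3, 10, 0)),
    ("C001:add_to_cart", datetime(2026, 4, 3, 10, 2)),  # retry -> dropped
    ("C001:add_to_cart", datetime(2026, 4, 3, 14, 0)),  # repeat action -> kept
]
print(dedupe_events(events))  # keeps the 10:00 and 14:00 events
```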

See also Surrogate vs Natural Keys and Data Testing Frameworks for complementary data quality patterns.

Wrapping Up

Deduplication is rarely a one-shot operation. As new data arrives, new duplicates are created. Build it as a pipeline, not a one-time cleanup script. Start with hash-based dedup to catch the easy ones, add fuzzy matching for entity resolution within a dataset, and bring in probabilistic record linkage when merging across systems.

Next step: Query your largest customer or entity table and count records per email domain — duplicate email patterns are usually the fastest way to quantify your deduplication problem.
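With pandas, that check is a few lines (assuming a customers table with an email column):

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "A@x.com", "b@y.com"],
})

# Normalize, then count how many rows share each email
counts = customers["email"].str.lower().value_counts()
dupes = counts[counts > 1]
print(dupes)  # a@x.com appears 3 times
print(f"{len(dupes)} duplicated emails, {dupes.sum() - len(dupes)} excess rows")
```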
