Data Partitioning Strategies Explained

8 min read · Tags: partitioning, data engineering, spark, hive, sql, performance, lakehouse


Your query scans 900 GB to return 2,000 rows. The fix isn't a faster cluster — it's a partitioning strategy you probably skipped.

Data partitioning is how you physically organize data so your queries only touch the slices they actually need. Get it right and you cut costs, slash latency, and make downstream jobs predictable. Get it wrong and every read becomes a full table scan in disguise.

What Is Data Partitioning?

Partitioning divides a dataset into smaller, independent units — partitions — based on column values. The query engine reads only the relevant partitions instead of the entire table. This is called partition pruning.

Three common partitioning strategies exist at the storage/engine level:

| Strategy | How it splits data | Best for |
| --- | --- | --- |
| Range | Value falls within a defined range | Dates, time series, sequential IDs |
| List | Value matches an explicit set of values | Country codes, categories, status flags |
| Hash | Hash of value modulo N buckets | High-cardinality columns with no natural range |

Hive-style partitioning is a directory naming convention used by Spark, Hive, and most lakehouse formats — it layers on top of these strategies.
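All three strategies boil down to a routing decision: given a column value, pick a partition. The sketch below illustrates that decision in plain Python. The function and partition names are hypothetical, chosen to mirror the PostgreSQL examples later in this article; real engines implement this inside the storage layer.

```python
from datetime import date

# Range: route by which interval the value falls into (quarterly dates).
def range_partition(order_date: date) -> str:
    quarter = (order_date.month - 1) // 3 + 1
    return f"orders_{order_date.year}_q{quarter}"

# List: route by membership in an explicit value set.
REGION_PARTITIONS = {
    "sales_europe": {"DE", "FR", "NL", "ES", "IT"},
    "sales_americas": {"US", "CA", "BR", "MX"},
    "sales_apac": {"JP", "AU", "SG", "IN"},
}

def list_partition(region: str) -> str:
    for name, values in REGION_PARTITIONS.items():
        if region in values:
            return name
    raise ValueError(f"no partition accepts region {region!r}")

# Hash: route by hash(value) modulo the bucket count.
def hash_partition(customer_id: int, buckets: int = 4) -> str:
    return f"events_p{hash(customer_id) % buckets}"
```

Notice that only the range and list variants give the reader (and the planner) a meaningful mapping from a filter value back to a partition; hash deliberately destroys that ordering.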


Range Partitioning

Range partitioning assigns rows to partitions based on value ranges. The most common use case: partitioning by date.

-- PostgreSQL: range partition by order date
CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    amount NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

CREATE TABLE orders_2024_q3 PARTITION OF orders
    FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');

CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');

When you query WHERE order_date BETWEEN '2024-06-01' AND '2024-06-30', the planner reads only orders_2024_q2 — the other three partitions are skipped entirely.
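Under the hood, the planner's pruning decision is an interval-overlap check against each partition's bounds. A minimal sketch of that check, using hypothetical bounds that mirror the DDL above (PostgreSQL range partitions are half-open: lower bound inclusive, upper bound exclusive):

```python
from datetime import date

# Partition bounds from the DDL above, as [lower, upper) half-open ranges.
PARTITIONS = {
    "orders_2024_q1": (date(2024, 1, 1), date(2024, 4, 1)),
    "orders_2024_q2": (date(2024, 4, 1), date(2024, 7, 1)),
    "orders_2024_q3": (date(2024, 7, 1), date(2024, 10, 1)),
    "orders_2024_q4": (date(2024, 10, 1), date(2025, 1, 1)),
}

def prune(filter_from: date, filter_to: date) -> list[str]:
    """Return only the partitions whose range overlaps the filter interval."""
    return [
        name
        for name, (lo, hi) in PARTITIONS.items()
        if filter_from < hi and filter_to >= lo
    ]
```

Running `prune(date(2024, 6, 1), date(2024, 6, 30))` selects only `orders_2024_q2`, which is exactly what the planner does before touching any data.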

Trade-offs:

  • ✅ Excellent for time-series queries and date-bounded reporting
  • ✅ Easy to archive or drop old data (drop a partition, not millions of rows)
  • ❌ Vulnerable to partition skew if data isn't evenly distributed across ranges
  • ❌ Poor for point lookups on non-partitioned columns

List Partitioning

List partitioning routes each row to a partition based on a discrete value. Useful when your filter column is low-cardinality (a handful of known values).

-- PostgreSQL: list partition by region
CREATE TABLE sales (
    sale_id BIGINT,
    region TEXT,
    product_id INT,
    revenue NUMERIC
) PARTITION BY LIST (region);

CREATE TABLE sales_europe PARTITION OF sales
    FOR VALUES IN ('DE', 'FR', 'NL', 'ES', 'IT');

CREATE TABLE sales_americas PARTITION OF sales
    FOR VALUES IN ('US', 'CA', 'BR', 'MX');

CREATE TABLE sales_apac PARTITION OF sales
    FOR VALUES IN ('JP', 'AU', 'SG', 'IN');

A query like WHERE region = 'DE' will only touch sales_europe. Simple and effective.

Trade-offs:

  • ✅ Clean isolation of logical data domains
  • ✅ Easy per-region access controls and maintenance windows
  • ❌ Adding new values (new regions) requires DDL changes
  • ❌ Useless if queries don't filter on the partition column

Hash Partitioning

Hash partitioning applies a hash function to a column value and routes each row to bucket hash(value) mod N. No natural order, no explicit value list, just even distribution.

-- PostgreSQL: hash partition by customer_id
CREATE TABLE events (
    event_id BIGINT,
    customer_id BIGINT,
    event_type TEXT,
    created_at TIMESTAMPTZ
) PARTITION BY HASH (customer_id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);

CREATE TABLE events_p2 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);

CREATE TABLE events_p3 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Hash partitioning gives you even data distribution across partitions. But it doesn't help with range queries — WHERE customer_id > 100000 still scans all four partitions.
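The evenness claim is easy to verify with a quick simulation. The sketch below routes sequential customer IDs through a modulo over 4 buckets; even a monotonically increasing key spreads perfectly, which is the property that makes hash partitioning skew-resistant. (Real engines use a proper hash function such as Murmur3; plain modulo is enough to show the bucketing behavior.)

```python
from collections import Counter

# Route 100,000 sequential customer IDs into 4 buckets.
counts = Counter(cid % 4 for cid in range(100_000))

# Each bucket receives exactly 25,000 rows: no skew, even though the
# key itself grows monotonically.
spread = max(counts.values()) - min(counts.values())
```

Contrast this with range partitioning on the same key, where all new writes would pile into the latest partition.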

Trade-offs:

  • ✅ Prevents skew from hot values or monotonically growing keys
  • ✅ Great for parallel processing and write throughput
  • ❌ No partition pruning for range filters
  • ❌ Repartitioning (changing N) requires full data rewrite

Hive-Style Partitioning

Hive-style is a directory naming convention rather than a partitioning algorithm. Data is stored in a folder hierarchy that encodes the partition column values:

s3://my-bucket/events/
├── year=2024/
│   ├── month=01/
│   │   ├── day=01/
│   │   │   └── part-00000.parquet
│   │   └── day=02/
│   └── month=02/
└── year=2025/

Spark, Hive, Trino, Athena, and DuckDB all support this format natively. When you read with a filter like WHERE year=2024 AND month=01, only those directories are scanned.

# PySpark: write with Hive-style partitioning
df.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .parquet("s3://my-bucket/events/")

# PySpark: read with partition pruning
spark.read \
    .parquet("s3://my-bucket/events/") \
    .filter("year = 2024 AND month = 1")
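Because the convention is just key=value path segments, it is simple to build and parse by hand. A minimal sketch (hypothetical helper names; production readers also URL-encode special characters in values, which this ignores):

```python
def hive_path(base: str, **partition_values) -> str:
    """Build a Hive-style path from partition column values, in order."""
    dirs = "/".join(f"{k}={v}" for k, v in partition_values.items())
    return f"{base.rstrip('/')}/{dirs}/"

def parse_hive_path(path: str) -> dict[str, str]:
    """Recover partition column values from a Hive-style path."""
    return dict(
        seg.split("=", 1) for seg in path.strip("/").split("/") if "=" in seg
    )
```

This round-trip is what lets a query engine recover `year`, `month`, and `day` as columns without reading a single Parquet footer.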

Common mistake: partitioning by a high-cardinality column like user_id. With 10 million users you get 10 million directories, and the file-system overhead alone will kill performance. A reasonable rule of thumb: only partition by columns whose distinct count is under roughly 10,000.
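The worst case compounds, because total partition count is the product of the distinct counts of every partition column. A quick back-of-the-envelope check (illustrative numbers):

```python
import math

# Total partitions = product of distinct values per partition column.
def partition_count(*distinct_counts: int) -> int:
    return math.prod(distinct_counts)

by_date = partition_count(3, 12, 31)                   # year, month, day
by_date_and_user = partition_count(3, 12, 31, 10_000_000)  # + user_id
```

Three years of daily partitions is about 1,100 directories; adding user_id pushes it past 11 billion. Run this arithmetic before the first write, not after.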

Trade-offs:

  • ✅ Portable — works across Spark, Athena, Trino, DuckDB, Hive
  • ✅ Human-readable; easy to inspect data manually
  • ✅ Supports incremental writes without full rewrites
  • ❌ Small files problem: high-frequency writes per partition create many tiny files
  • ❌ Metadata overhead with too many partitions

Choosing the Right Strategy

| Scenario | Recommended strategy |
| --- | --- |
| Time-series / append-by-date | Range (date) + Hive-style |
| Multi-region reporting | List (region) |
| High-cardinality join keys | Hash |
| Lakehouse on object storage | Hive-style (range or list underneath) |
| Large parallel loads | Hash (even write distribution) |
| GDPR deletion by region | List (drop partition = delete by region) |

One practical heuristic: the column you partition by should appear in >80% of your WHERE clauses. If it doesn't, the partition won't prune anything meaningful.


Partition Pruning in Practice

Partition pruning only works when the filter is on the exact partition column. Watch out for these common pruning killers:

-- Spark SQL: PRUNING WORKS — direct equality filter
SELECT * FROM events WHERE year = 2024 AND month = 3;

-- Spark SQL: PRUNING FAILS — function applied to a data column instead of
-- filtering the partition columns directly
SELECT * FROM events WHERE YEAR(created_at) = 2024;

-- Spark SQL: PRUNING FAILS — filter on the raw timestamp column, which the
-- planner cannot map to partition directories
SELECT * FROM events WHERE created_at > NOW() - INTERVAL 30 DAYS;

If your partition column is year/month/day (integers), filter against those integers — not derived expressions on timestamp columns.
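The reason the failing filters fail is that pruning happens against partition metadata alone, before any file is opened. A hypothetical sketch of what the planner can and cannot do with directory names:

```python
# Partition directory values, as the planner sees them before reading
# any data files (hypothetical listing).
partitions = [
    {"year": 2023, "month": 12},
    {"year": 2024, "month": 3},
    {"year": 2024, "month": 7},
]

# Direct filter on partition columns: decidable from metadata alone.
pruned = [p for p in partitions if p["year"] == 2024 and p["month"] == 3]

# A predicate like YEAR(created_at) = 2024 references a data column.
# Directory names say nothing about created_at, so no partition can be
# excluded and every one must be scanned.
unprunable = partitions
```

One metadata-only comparison versus a full scan: that gap is the entire value of partitioning.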


A Note on Over-Partitioning

More partitions is not always better. Each partition is a metadata entry; each Hive-style partition is a directory listing call. A table with 500,000 partitions will have query planning overhead that drowns out any scan savings.

Signs you've over-partitioned:

  • Query planning takes longer than query execution
  • Many partitions have < 10 MB of data
  • File count grows faster than data volume

When this happens, consider partition coarsening (daily instead of hourly, monthly instead of daily) or file compaction to merge small files within partitions. Delta Lake's OPTIMIZE command and its auto-compaction settings handle this well.
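Detecting the problem is straightforward from a storage listing: compute the average file size per partition and flag anything tiny. A minimal sketch with made-up sizes (the 10 MB threshold echoes the symptom list above; tune it to your engine):

```python
# Hypothetical (partition -> file sizes in bytes) from an object-store listing.
listing = {
    "day=01": [128 * 2**20, 96 * 2**20],  # two healthy ~100 MB files
    "day=02": [512 * 1024] * 40,          # forty 512 KB files: compact these
}

SMALL_FILE_BYTES = 10 * 2**20  # flag partitions averaging under 10 MB/file

needs_compaction = [
    part
    for part, sizes in listing.items()
    if sum(sizes) / len(sizes) < SMALL_FILE_BYTES
]
```

Run a check like this on a schedule; the small-files problem creeps in gradually rather than announcing itself.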


Wrapping Up

The right partitioning strategy depends entirely on your query patterns. Range partitioning excels for time-bound analytics; list partitioning gives you clean data domain isolation; hash partitioning prevents skew in write-heavy workloads. Hive-style gives you portability across the modern data stack.

Before you partition, look at your top 20 queries and identify the filter columns. Partition on those. Then monitor file sizes after a few weeks of data ingestion — small files are the most common sign you've misconfigured something.

If you're exploring a partitioned dataset and want to inspect partition distribution or test pruning behavior quickly, Harbinger Explorer lets you query Parquet and CSV files directly in the browser using DuckDB WASM — useful for ad-hoc checks without spinning up a full Spark cluster.

