Data Partitioning Strategies Explained
Your query scans 900 GB to return 2,000 rows. The fix isn't a faster cluster — it's a partitioning strategy you probably skipped.
Data partitioning is how you physically organize data so your queries only touch the slices they actually need. Get it right and you cut costs, slash latency, and make downstream jobs predictable. Get it wrong and every read becomes a full table scan in disguise.
What Is Data Partitioning?
Partitioning divides a dataset into smaller, independent units — partitions — based on column values. The query engine reads only the relevant partitions instead of the entire table. This is called partition pruning.
Three common partitioning strategies exist at the storage/engine level:
| Strategy | How it splits data | Best for |
|---|---|---|
| Range | Value falls within a defined range | Dates, time series, sequential IDs |
| List | Value matches an explicit set of values | Country codes, categories, status flags |
| Hash | Hash of value modulo N buckets | High-cardinality columns with no natural range |
Hive-style partitioning is a directory naming convention used by Spark, Hive, and most lakehouse formats — it layers on top of these strategies.
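Stripped of engine details, each strategy is just a routing rule from a column value to a partition name. A minimal Python sketch, with illustrative partition names and boundaries (nothing here is a real engine API):

```python
from datetime import date

def range_partition(order_date: date) -> str:
    """Range strategy: route by quarter boundaries."""
    quarter = (order_date.month - 1) // 3 + 1
    return f"orders_{order_date.year}_q{quarter}"

def list_partition(region: str) -> str:
    """List strategy: route by an explicit value set."""
    regions = {"DE": "europe", "FR": "europe", "US": "americas", "JP": "apac"}
    return f"sales_{regions[region]}"

def hash_partition(customer_id: int, buckets: int = 4) -> str:
    """Hash strategy: route by value modulo N buckets."""
    return f"events_p{customer_id % buckets}"

print(range_partition(date(2024, 6, 15)))  # orders_2024_q2
print(list_partition("DE"))                # sales_europe
print(hash_partition(12345))               # events_p1
```

The sections below show what each of these rules looks like as real PostgreSQL DDL.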
Range Partitioning
Range partitioning assigns rows to partitions based on value ranges. The most common use case: partitioning by date.
```sql
-- PostgreSQL: range partition by order date
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
CREATE TABLE orders_2024_q3 PARTITION OF orders
    FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
```
When you query WHERE order_date BETWEEN '2024-06-01' AND '2024-06-30', the planner reads only orders_2024_q2 — the other three partitions are skipped entirely.
Trade-offs:
- ✅ Excellent for time-series queries and date-bounded reporting
- ✅ Easy to archive or drop old data (drop a partition, not millions of rows)
- ❌ Vulnerable to partition skew if data isn't evenly distributed across ranges
- ❌ Poor for point lookups on non-partitioned columns
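The pruning step itself is easy to picture: the planner compares the filter's bounds against each partition's range and skips every partition that cannot overlap. A toy Python sketch of that overlap check, using the quarterly partitions from the DDL above (the dict and function are illustrative, not an engine internal):

```python
from datetime import date

# Partition name -> half-open [start, end) date range, as in the DDL above.
PARTITIONS = {
    "orders_2024_q1": (date(2024, 1, 1), date(2024, 4, 1)),
    "orders_2024_q2": (date(2024, 4, 1), date(2024, 7, 1)),
    "orders_2024_q3": (date(2024, 7, 1), date(2024, 10, 1)),
    "orders_2024_q4": (date(2024, 10, 1), date(2025, 1, 1)),
}

def prune(lo: date, hi: date) -> list[str]:
    """Keep only partitions whose [start, end) range overlaps [lo, hi]."""
    return [name for name, (start, end) in PARTITIONS.items()
            if start <= hi and lo < end]

print(prune(date(2024, 6, 1), date(2024, 6, 30)))  # ['orders_2024_q2']
```

A filter spanning a quarter boundary simply keeps both partitions; the cost of the check is per-partition metadata, not per-row data.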
List Partitioning
List partitioning routes each row to a partition based on a discrete value. Useful when your filter column is low-cardinality (a handful of known values).
```sql
-- PostgreSQL: list partition by region
CREATE TABLE sales (
    sale_id    BIGINT,
    region     TEXT,
    product_id INT,
    revenue    NUMERIC
) PARTITION BY LIST (region);

CREATE TABLE sales_europe PARTITION OF sales
    FOR VALUES IN ('DE', 'FR', 'NL', 'ES', 'IT');
CREATE TABLE sales_americas PARTITION OF sales
    FOR VALUES IN ('US', 'CA', 'BR', 'MX');
CREATE TABLE sales_apac PARTITION OF sales
    FOR VALUES IN ('JP', 'AU', 'SG', 'IN');
```
A query like WHERE region = 'DE' will only touch sales_europe. Simple and effective.
Trade-offs:
- ✅ Clean isolation of logical data domains
- ✅ Easy per-region access controls and maintenance windows
- ❌ Adding new values (new regions) requires DDL changes
- ❌ Useless if queries don't filter on the partition column
Hash Partitioning
Hash partitioning applies a hash function to a column value and routes each row to bucket hash(value) mod N. No natural order, no explicit value list — just even distribution.
```sql
-- PostgreSQL: hash partition by customer_id
CREATE TABLE events (
    event_id    BIGINT,
    customer_id BIGINT,
    event_type  TEXT,
    created_at  TIMESTAMPTZ
) PARTITION BY HASH (customer_id);

CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```
Hash partitioning gives you even data distribution across partitions. But it doesn't help with range queries — WHERE customer_id > 100000 still scans all four partitions.
Trade-offs:
- ✅ Prevents skew caused by uneven data distribution
- ✅ Great for parallel processing and write throughput
- ❌ No partition pruning for range filters
- ❌ Repartitioning (changing N) requires full data rewrite
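The "even distribution" claim is easy to verify for yourself. A quick Python sketch, using md5 as a stand-in for the engine's internal hash function (real engines use their own, typically non-cryptographic, hashes):

```python
import hashlib
from collections import Counter

def bucket(value: str, n: int = 4) -> int:
    # md5 here is only a stable, portable stand-in for the engine's hash.
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n

# Route 10,000 synthetic customer keys into 4 buckets and count them.
counts = Counter(bucket(f"customer-{i}") for i in range(10_000))
print(sorted(counts.items()))  # each bucket holds roughly 2,500 rows
```

Even adversarially ordered keys (sequential IDs, timestamps) land evenly, which is exactly what range partitioning cannot guarantee.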
Hive-Style Partitioning
Hive-style is a directory naming convention rather than a partitioning algorithm. Data is stored in a folder hierarchy that encodes the partition column values:
```
s3://my-bucket/events/
├── year=2024/
│   ├── month=01/
│   │   ├── day=01/
│   │   │   └── part-00000.parquet
│   │   └── day=02/
│   └── month=02/
└── year=2025/
```
Spark, Hive, Trino, Athena, and DuckDB all support this format natively. When you read with a filter like WHERE year=2024 AND month=01, only those directories are scanned.
```python
# PySpark: write with Hive-style partitioning
df.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .parquet("s3://my-bucket/events/")

# PySpark: read with partition pruning
spark.read \
    .parquet("s3://my-bucket/events/") \
    .filter("year = 2024 AND month = 1")
```
Common mistake: partitioning by a high-cardinality column like user_id. With 10 million users you get 10 million directories, and the file-listing overhead alone will kill performance. A reasonable rule of thumb: only partition on columns with fewer than roughly 10,000 distinct values.
Trade-offs:
- ✅ Portable — works across Spark, Athena, Trino, DuckDB, Hive
- ✅ Human-readable; easy to inspect data manually
- ✅ Supports incremental writes without full rewrites
- ❌ Small files problem: high-frequency writes per partition create many tiny files
- ❌ Metadata overhead with too many partitions
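Because the convention is just key=value directory segments, the layout is trivial to generate or parse yourself. A hypothetical helper that builds paths matching the tree above (the zero-padding mirrors this example's layout; engines differ on padding):

```python
def hive_path(base: str, year: int, month: int, day: int) -> str:
    # One key=value segment per partition column, matching the tree above.
    return f"{base}/year={year}/month={month:02d}/day={day:02d}/part-00000.parquet"

print(hive_path("s3://my-bucket/events", 2024, 1, 2))
# s3://my-bucket/events/year=2024/month=01/day=02/part-00000.parquet
```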
Choosing the Right Strategy
| Scenario | Recommended strategy |
|---|---|
| Time-series / append-by-date | Range (date) + Hive-style |
| Multi-region reporting | List (region) |
| High-cardinality join keys | Hash |
| Lakehouse on object storage | Hive-style (range or list underneath) |
| Large parallel loads | Hash (even write distribution) |
| GDPR deletion by region | List (drop partition = delete by region) |
One practical heuristic: the column you partition by should appear in >80% of your WHERE clauses. If it doesn't, the partition won't prune anything meaningful.
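A rough way to test that heuristic is to count which columns actually appear in your workload's WHERE clauses. A quick-and-dirty Python sketch (the regex is illustrative, not a real SQL parser; for production use a proper parser or your warehouse's query history views):

```python
import re
from collections import Counter

# Hypothetical sample of a workload's top queries.
queries = [
    "SELECT * FROM orders WHERE order_date > '2024-01-01'",
    "SELECT * FROM orders WHERE order_date = '2024-06-01' AND region = 'DE'",
    "SELECT * FROM orders WHERE customer_id = 42",
]

counts = Counter()
for q in queries:
    where = q.split("WHERE", 1)[1]
    # Grab identifiers immediately followed by a comparison operator.
    counts.update(re.findall(r"\b([a-z_]+)\s*[=><]", where))

print(counts.most_common())  # order_date appears in 2 of 3 queries
```

Here order_date clears the bar and customer_id does not, so order_date is the stronger partitioning candidate.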
Partition Pruning in Practice
Partition pruning only works when the filter is on the exact partition column. Watch out for these common pruning killers:
```sql
-- Spark SQL: PRUNING WORKS — direct filter on the partition columns
SELECT * FROM events WHERE year = 2024 AND month = 3;

-- Spark SQL: PRUNING FAILS — filters a derived value of a non-partition column
SELECT * FROM events WHERE YEAR(created_at) = 2024;

-- Spark SQL: PRUNING FAILS — filter targets the raw timestamp, and NOW()
-- is only resolved at execution time
SELECT * FROM events WHERE created_at > NOW() - INTERVAL 30 DAYS;
```
If your partition column is year/month/day (integers), filter against those integers — not derived expressions on timestamp columns.
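One workaround for the "last 30 days" case is to translate the time window into explicit partition-column predicates in application code before submitting the query. A sketch using Python's stdlib (partition_predicate is a hypothetical helper, not a library function):

```python
from datetime import date, timedelta

def partition_predicate(today: date, days: int = 30) -> str:
    """Expand a trailing window into (year, month) partition predicates."""
    cutoff = today - timedelta(days=days)
    months = set()
    d = cutoff
    while d <= today:          # collect every (year, month) the window touches
        months.add((d.year, d.month))
        d += timedelta(days=1)
    clauses = [f"(year = {y} AND month = {m})" for y, m in sorted(months)]
    return " OR ".join(clauses)

print(partition_predicate(date(2024, 3, 15)))
# (year = 2024 AND month = 2) OR (year = 2024 AND month = 3)
```

The resulting predicate references only integer partition columns, so the planner can prune at planning time; you still keep the created_at filter for row-level precision within the surviving partitions.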
A Note on Over-Partitioning
More partitions is not always better. Each partition is a metadata entry; each Hive-style partition is a directory listing call. A table with 500,000 partitions will have query planning overhead that drowns out any scan savings.
Signs you've over-partitioned:
- Query planning takes longer than query execution
- Many partitions have < 10 MB of data
- File count grows faster than data volume
When this happens, consider partition coarsening (daily instead of hourly, monthly instead of daily) or a compaction job that merges small files within partitions. Delta Lake's OPTIMIZE command and its auto-compaction setting handle this well.
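The "many partitions under 10 MB" check is straightforward to script against a file listing. A sketch (the listing dict is hypothetical input; in practice you'd build it from an S3 list call or your catalog's file metadata):

```python
SMALL = 10 * 1024 * 1024  # 10 MB threshold from the checklist above

def small_file_partitions(listing: dict[str, list[int]]) -> list[str]:
    """Flag partitions whose average file size falls below the threshold."""
    return [part for part, sizes in listing.items()
            if sizes and sum(sizes) / len(sizes) < SMALL]

listing = {
    "year=2024/month=01": [256 * 1024 * 1024],     # one healthy 256 MB file
    "year=2024/month=02": [2 * 1024 * 1024] * 40,  # forty 2 MB files
}
print(small_file_partitions(listing))  # ['year=2024/month=02']
```

Partitions this flags are candidates for compaction; if most partitions are flagged, that is the signal to coarsen the partition scheme instead.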
Wrapping Up
The right partitioning strategy depends entirely on your query patterns. Range partitioning excels for time-bound analytics; list partitioning gives you clean data domain isolation; hash partitioning prevents skew in write-heavy workloads. Hive-style gives you portability across the modern data stack.
Before you partition, look at your top 20 queries and identify the filter columns. Partition on those. Then monitor file sizes after a few weeks of data ingestion — small files are the most common sign you've misconfigured something.
If you're exploring a partitioned dataset and want to inspect partition distribution or test pruning behavior quickly, Harbinger Explorer lets you query Parquet and CSV files directly in the browser using DuckDB WASM — useful for ad-hoc checks without spinning up a full Spark cluster.