Partitioning Large Facts
When a fact table grows to billions of rows, partitioning is how you keep queries fast.
The problem at scale
A fact table at any large e-commerce or fintech company easily hits billions of rows. At that scale, even simple queries become slow:
- Full table scans take minutes or hours
- Indexes grow huge and slow to maintain
- Backups and ETL run for entire days
- Dropping old data is painful (delete millions of rows = lock the table)
Partitioning physically splits the table into smaller pieces while keeping a single logical table for queries.
1. Time-based partitioning (most common)
Partition by month or year. New data goes into the current partition; old data lives in archived partitions.
Benefits:
- Partition pruning — query for January 2024 scans only that partition
- Easy archival — drop entire old partitions instantly (vs DELETE that locks rows)
- Parallel query processing — different partitions on different disks/nodes
- Faster maintenance — rebuild indexes per partition, not the whole table
2. Range-based partitioning
Partition by a numeric or date range that isn't time. Useful for fact-table volume tiers:
3. List partitioning
Partition by discrete values like region or country:
4. Hash partitioning
Partition by a hash of a column — evenly distributes data without business meaning. Good for parallel processing when there's no natural range:
Choosing the right strategy
| Strategy | When to use |
|---|---|
| Time-based (range on date) | Default for fact tables. Time-series data, easy archival. |
| Range (non-date) | Numeric ranges with clear tiers (small/medium/large). |
| List | Discrete categorical values with known set (regions, channels). |
| Hash | Even distribution for parallelism when no natural range exists. |
| Composite | Combine two strategies (e.g. range by date + hash by customer). |
Partition pruning in action
Partitioning best practices
- Partition on the column you filter most — usually OrderDate for facts
- Don't over-partition — thousands of tiny partitions add overhead
- Align with archival policy — if you retain 5 years, partition monthly (60 partitions, manageable)
- Test partition pruning — bad WHERE clauses can disable pruning entirely
- Compress old partitions — columnar compression on historical data
- Move cold partitions to cheap storage — most warehouses support tiered storage
What modern warehouses do automatically
Snowflake, BigQuery, and Databricks handle most of this implicitly via micro-partitioning and clustering keys — you don't manually define partitions. You set a clustering key (e.g. OrderDate) and the system handles the rest.