Partitioning Large Facts

When a fact table grows to billions of rows, partitioning is how you keep queries fast.

The interview question "What would you do if a fact table becomes too large? How would you partition it?"

The problem at scale

A fact table at any large e-commerce or fintech company easily hits billions of rows. At that scale, even simple queries become slow:

Full table scans take minutes or hours
Indexes grow huge and slow to maintain
Backups and ETL run for entire days
Dropping old data is painful (delete millions of rows = lock the table)

Partitioning physically splits the table into smaller pieces while keeping a single logical table for queries.

1. Time-based partitioning (most common)

Partition by month or year. New data goes into the current partition; old data lives in archived partitions.

SQL — partition by month

CREATE TABLE OrderFacts ( OrderID INT, CustomerID INT, Amount DECIMAL(10,2), OrderDate DATE ) PARTITION BY RANGE (YEAR(OrderDate), MONTH(OrderDate)) ( PARTITION p202301 VALUES LESS THAN (2023,2), PARTITION p202302 VALUES LESS THAN (2023,3), PARTITION p202303 VALUES LESS THAN (2023,4), -- ... etc PARTITION p_future VALUES LESS THAN MAXVALUE );

Benefits:

Partition pruning — query for January 2024 scans only that partition
Easy archival — drop entire old partitions instantly (vs DELETE that locks rows)
Parallel query processing — different partitions on different disks/nodes
Faster maintenance — rebuild indexes per partition, not the whole table

2. Range-based partitioning

Partition by a numeric or date range that isn't time. Useful for fact-table volume tiers:

SQL

PARTITION BY RANGE (SalesAmount) ( PARTITION small VALUES LESS THAN (100), PARTITION medium VALUES LESS THAN (1000), PARTITION large VALUES LESS THAN MAXVALUE );

3. List partitioning

Partition by discrete values like region or country:

SQL

PARTITION BY LIST (Region) ( PARTITION p_north VALUES IN ('North', 'Northeast'), PARTITION p_south VALUES IN ('South', 'Southeast'), PARTITION p_west VALUES IN ('West', 'Southwest') );

4. Hash partitioning

Partition by a hash of a column — evenly distributes data without business meaning. Good for parallel processing when there's no natural range:

SQL

PARTITION BY HASH(CustomerID) PARTITIONS 10;

Choosing the right strategy

Strategy	When to use
Time-based (range on date)	Default for fact tables. Time-series data, easy archival.
Range (non-date)	Numeric ranges with clear tiers (small/medium/large).
List	Discrete categorical values with known set (regions, channels).
Hash	Even distribution for parallelism when no natural range exists.
Composite	Combine two strategies (e.g. range by date + hash by customer).

Partition pruning in action

Query that benefits from pruning

SELECT SUM(Amount) FROM OrderFacts WHERE OrderDate BETWEEN '2024-01-01' AND '2024-01-31'; -- DB engine scans only the p202401 partition, -- ignoring 23 other monthly partitions. Query is 24x faster.

Partitioning best practices

Partition on the column you filter most — usually OrderDate for facts
Don't over-partition — thousands of tiny partitions add overhead
Align with archival policy — if you retain 5 years, partition monthly (60 partitions, manageable)
Test partition pruning — bad WHERE clauses can disable pruning entirely
Compress old partitions — columnar compression on historical data
Move cold partitions to cheap storage — most warehouses support tiered storage

What modern warehouses do automatically

Snowflake, BigQuery, and Databricks handle most of this implicitly via micro-partitioning and clustering keys — you don't manually define partitions. You set a clustering key (e.g. OrderDate) and the system handles the rest.

Snowflake

CREATE TABLE OrderFacts (...) CLUSTER BY (OrderDate, Region);

The structured answer "I'd partition the fact table by OrderDate, monthly. This enables partition pruning (queries scan only relevant months), makes archival easy (drop old partitions instantly), and improves parallelism. For modern cloud warehouses, I'd use clustering keys instead of manual partitions and let the system handle micro-partitioning."

← Many-to-Many with Bridges Late-Arriving Facts →