🛠 Learn ERWIN

Partitioning Large Facts

When a fact table grows to billions of rows, partitioning is how you keep queries fast.

The interview question "What would you do if a fact table becomes too large? How would you partition it?"

The problem at scale

A fact table at any large e-commerce or fintech company easily hits billions of rows. At that scale, even simple queries become slow:

Partitioning physically splits the table into smaller pieces while keeping a single logical table for queries.

1. Time-based partitioning (most common)

Partition by month or year. New data goes into the current partition; old data lives in archived partitions.

SQL — partition by month
CREATE TABLE OrderFacts ( OrderID INT, CustomerID INT, Amount DECIMAL(10,2), OrderDate DATE ) PARTITION BY RANGE (YEAR(OrderDate), MONTH(OrderDate)) ( PARTITION p202301 VALUES LESS THAN (2023,2), PARTITION p202302 VALUES LESS THAN (2023,3), PARTITION p202303 VALUES LESS THAN (2023,4), -- ... etc PARTITION p_future VALUES LESS THAN MAXVALUE );

Benefits:

2. Range-based partitioning

Partition by a numeric or date range that isn't time. Useful for fact-table volume tiers:

SQL
PARTITION BY RANGE (SalesAmount) ( PARTITION small VALUES LESS THAN (100), PARTITION medium VALUES LESS THAN (1000), PARTITION large VALUES LESS THAN MAXVALUE );

3. List partitioning

Partition by discrete values like region or country:

SQL
PARTITION BY LIST (Region) ( PARTITION p_north VALUES IN ('North', 'Northeast'), PARTITION p_south VALUES IN ('South', 'Southeast'), PARTITION p_west VALUES IN ('West', 'Southwest') );

4. Hash partitioning

Partition by a hash of a column — evenly distributes data without business meaning. Good for parallel processing when there's no natural range:

SQL
PARTITION BY HASH(CustomerID) PARTITIONS 10;

Choosing the right strategy

StrategyWhen to use
Time-based (range on date)Default for fact tables. Time-series data, easy archival.
Range (non-date)Numeric ranges with clear tiers (small/medium/large).
ListDiscrete categorical values with known set (regions, channels).
HashEven distribution for parallelism when no natural range exists.
CompositeCombine two strategies (e.g. range by date + hash by customer).

Partition pruning in action

Query that benefits from pruning
SELECT SUM(Amount) FROM OrderFacts WHERE OrderDate BETWEEN '2024-01-01' AND '2024-01-31'; -- DB engine scans only the p202401 partition, -- ignoring 23 other monthly partitions. Query is 24x faster.

Partitioning best practices

What modern warehouses do automatically

Snowflake, BigQuery, and Databricks handle most of this implicitly via micro-partitioning and clustering keys — you don't manually define partitions. You set a clustering key (e.g. OrderDate) and the system handles the rest.

Snowflake
CREATE TABLE OrderFacts (...) CLUSTER BY (OrderDate, Region);
The structured answer "I'd partition the fact table by OrderDate, monthly. This enables partition pruning (queries scan only relevant months), makes archival easy (drop old partitions instantly), and improves parallelism. For modern cloud warehouses, I'd use clustering keys instead of manual partitions and let the system handle micro-partitioning."