Late-Arriving Facts
When a transaction from January 15th shows up on January 22nd — how to handle the delay cleanly.
The scenario
Today is January 22, 2024. A sale that occurred on January 15 just landed in your data pipeline — seven days late. Maybe:
- A point-of-sale system was offline
- A partner integration failed and was retried
- A batch job missed its window
- An employee approved a backdated transaction
How do you load this into your fact table without breaking historical reports?
Two related challenges
Late-arriving fact
The transaction itself arrives late. You need to backfill it to its original date and recompute affected aggregates.
Early-arriving fact
The fact arrives but the dimension row it should join to doesn't exist yet (customer hasn't been ETL'd). Need a placeholder.
Strategy 1: Staging area with reconciliation
Strategy 2: Grace period rule
Define an acceptable lateness window. Facts arriving within the window go straight to the main table; older facts get special handling.
Strategy 3: Handle missing dimensions
Sometimes the fact arrives before the dimension. A late-loaded customer row hasn't shown up yet. Solutions:
3a. Use a default dimension row
Pre-populate dimension tables with a special "Unknown" row (typically SK = -1). Facts referencing missing dimensions point here temporarily.
3b. Inferred dimension members
Create a stub dimension row when a fact arrives referencing an unknown ID. Fill in details when the proper dimension data arrives.
Strategy 4: Update aggregate tables
If you maintain pre-aggregated tables (daily/weekly summaries), a late-arriving fact requires updating those aggregates:
Time-based partitioning impact
Late-arriving facts mean writes to old partitions. This affects:
- Archived partitions — you may need to keep recent partitions writable longer than expected
- Compressed partitions — re-opening for writes negates compression benefits temporarily
- Cache invalidation — cached daily totals must be invalidated and recomputed
Monitoring and SLAs
Best practices summary
- Define an SLA — how late is too late? Most warehouses accept 1-7 days
- Use a default dimension row (SK = -1) for unknown values
- Stub dimensions when needed, then enrich them when data arrives
- Update aggregates incrementally when late facts arrive
- Log late arrivals for monitoring and pipeline debugging
- Alert on extreme lateness — anything beyond your SLA needs investigation
- Document the grace period so downstream consumers know reports may shift slightly