Late-Arriving Facts

When a transaction from January 15th shows up on January 22nd — how to handle the delay cleanly.

The interview question "How would you handle late-arriving facts?"

The scenario

Today is January 22, 2024. A sale that occurred on January 15 just landed in your data pipeline — seven days late. Maybe:

A point-of-sale system was offline
A partner integration failed and was retried
A batch job missed its window
An employee approved a backdated transaction

How do you load this into your fact table without breaking historical reports?

Two related challenges

Late-arriving fact

The transaction itself arrives late. You need to backfill it to its original date and recompute affected aggregates.

Early-arriving fact

The fact arrives but the dimension row it should join to doesn't exist yet (customer hasn't been ETL'd). Need a placeholder.

Strategy 1: Staging area with reconciliation

Source System │ ▼ RawDataLayer (ODS) │ │ Validation + Reconciliation ▼ ┌─ FactTable ◄── On-time facts └─ LateArrivingFacts ◄── Late facts (held for processing) │ │ Nightly batch ▼ Reprocess + Update FactTable │ ▼ Final Reporting Layer

Strategy 2: Grace period rule

Define an acceptable lateness window. Facts arriving within the window go straight to the main table; older facts get special handling.

ETL pseudo-code

LATENESS_WINDOW = 7 days FOR each incoming fact: IF (CurrentDate - TransactionDate) <= 7: INSERT INTO FactTable -- standard load ELSE: INSERT INTO LateArrivingFactsStaging TriggerNightlyBackfillJob()

Strategy 3: Handle missing dimensions

Sometimes the fact arrives before the dimension. A late-loaded customer row hasn't shown up yet. Solutions:

3a. Use a default dimension row

Pre-populate dimension tables with a special "Unknown" row (typically SK = -1). Facts referencing missing dimensions point here temporarily.

CustomerDimension CustomerSK │ CustomerID │ Name │ City -1 │ 'UNKNOWN' │ Unknown │ Unknown ← placeholder 501 │ C001 │ John │ Bangalore 502 │ C002 │ Sara │ Mumbai

3b. Inferred dimension members

Create a stub dimension row when a fact arrives referencing an unknown ID. Fill in details when the proper dimension data arrives.

SQL — insert stub then update

-- Step 1: fact arrives with CustomerID = 'C999' (new customer) INSERT INTO CustomerDimension (CustomerID, IsStub, EffectiveDate) VALUES ('C999', 'Y', CURRENT_DATE); -- Step 2: fact references the stub's surrogate key INSERT INTO OrderFacts (OrderID, CustomerSK, OrderAmount, OrderDate) VALUES (10001, <new_stub_sk>, 499.99, '2024-01-22'); -- Step 3: full customer data arrives later — update the stub in place UPDATE CustomerDimension SET CustomerName = 'New Customer', Email = 'new@example.com', City = 'Delhi', IsStub = 'N' WHERE CustomerID = 'C999';

Strategy 4: Update aggregate tables

If you maintain pre-aggregated tables (daily/weekly summaries), a late-arriving fact requires updating those aggregates:

SQL

-- A $499 sale from January 15 just arrived UPDATE DailySalesAggregate SET TotalSales = TotalSales + 499.99, TransactionCount = TransactionCount + 1 WHERE SalesDate = '2024-01-15';

Time-based partitioning impact

Late-arriving facts mean writes to old partitions. This affects:

Archived partitions — you may need to keep recent partitions writable longer than expected
Compressed partitions — re-opening for writes negates compression benefits temporarily
Cache invalidation — cached daily totals must be invalidated and recomputed

Monitoring and SLAs

SQL — track late arrivals

SELECT DATE(LoadTimestamp) AS LoadDate, (LoadTimestamp - TransactionDate) AS LatenessHours, COUNT(*) AS LateFactCount FROM LateArrivingFactsLog WHERE LoadTimestamp > CURRENT_TIMESTAMP - INTERVAL '7 days' GROUP BY LoadDate, LatenessHours ORDER BY LoadDate DESC;

Best practices summary

Define an SLA — how late is too late? Most warehouses accept 1-7 days
Use a default dimension row (SK = -1) for unknown values
Stub dimensions when needed, then enrich them when data arrives
Update aggregates incrementally when late facts arrive
Log late arrivals for monitoring and pipeline debugging
Alert on extreme lateness — anything beyond your SLA needs investigation
Document the grace period so downstream consumers know reports may shift slightly

The structured answer "I'd define a grace period — say 7 days. Facts within that window load to the main fact table normally. Late facts go to a staging table for nightly reprocessing, with aggregates updated incrementally. For missing dimensions, I'd use a default 'Unknown' row (SK = -1) or create a stub that gets enriched when the real dimension data arrives."

← Partitioning Large Facts