🛠 Learn ERWIN

Late-Arriving Facts

When a transaction from January 15th shows up on January 22nd — how to handle the delay cleanly.

The interview question "How would you handle late-arriving facts?"

The scenario

Today is January 22, 2024. A sale that occurred on January 15 just landed in your data pipeline — seven days late. Maybe:

How do you load this into your fact table without breaking historical reports?

Two related challenges

Late-arriving fact

The transaction itself arrives late. You need to backfill it to its original date and recompute affected aggregates.

Early-arriving fact

The fact arrives but the dimension row it should join to doesn't exist yet (customer hasn't been ETL'd). Need a placeholder.

Strategy 1: Staging area with reconciliation

Source System │ ▼ RawDataLayer (ODS) │ │ Validation + Reconciliation ▼ ┌─ FactTable ◄── On-time facts └─ LateArrivingFacts ◄── Late facts (held for processing) │ │ Nightly batch ▼ Reprocess + Update FactTable │ ▼ Final Reporting Layer

Strategy 2: Grace period rule

Define an acceptable lateness window. Facts arriving within the window go straight to the main table; older facts get special handling.

ETL pseudo-code
LATENESS_WINDOW = 7 days FOR each incoming fact: IF (CurrentDate - TransactionDate) <= 7: INSERT INTO FactTable -- standard load ELSE: INSERT INTO LateArrivingFactsStaging TriggerNightlyBackfillJob()

Strategy 3: Handle missing dimensions

Sometimes the fact arrives before the dimension. A late-loaded customer row hasn't shown up yet. Solutions:

3a. Use a default dimension row

Pre-populate dimension tables with a special "Unknown" row (typically SK = -1). Facts referencing missing dimensions point here temporarily.

CustomerDimension CustomerSK │ CustomerID │ Name │ City -1'UNKNOWN' │ Unknown │ Unknown ← placeholder 501C001 │ John │ Bangalore 502C002 │ Sara │ Mumbai

3b. Inferred dimension members

Create a stub dimension row when a fact arrives referencing an unknown ID. Fill in details when the proper dimension data arrives.

SQL — insert stub then update
-- Step 1: fact arrives with CustomerID = 'C999' (new customer) INSERT INTO CustomerDimension (CustomerID, IsStub, EffectiveDate) VALUES ('C999', 'Y', CURRENT_DATE); -- Step 2: fact references the stub's surrogate key INSERT INTO OrderFacts (OrderID, CustomerSK, OrderAmount, OrderDate) VALUES (10001, <new_stub_sk>, 499.99, '2024-01-22'); -- Step 3: full customer data arrives later — update the stub in place UPDATE CustomerDimension SET CustomerName = 'New Customer', Email = 'new@example.com', City = 'Delhi', IsStub = 'N' WHERE CustomerID = 'C999';

Strategy 4: Update aggregate tables

If you maintain pre-aggregated tables (daily/weekly summaries), a late-arriving fact requires updating those aggregates:

SQL
-- A $499 sale from January 15 just arrived UPDATE DailySalesAggregate SET TotalSales = TotalSales + 499.99, TransactionCount = TransactionCount + 1 WHERE SalesDate = '2024-01-15';

Time-based partitioning impact

Late-arriving facts mean writes to old partitions. This affects:

Monitoring and SLAs

SQL — track late arrivals
SELECT DATE(LoadTimestamp) AS LoadDate, (LoadTimestamp - TransactionDate) AS LatenessHours, COUNT(*) AS LateFactCount FROM LateArrivingFactsLog WHERE LoadTimestamp > CURRENT_TIMESTAMP - INTERVAL '7 days' GROUP BY LoadDate, LatenessHours ORDER BY LoadDate DESC;

Best practices summary

The structured answer "I'd define a grace period — say 7 days. Facts within that window load to the main fact table normally. Late facts go to a staging table for nightly reprocessing, with aggregates updated incrementally. For missing dimensions, I'd use a default 'Unknown' row (SK = -1) or create a stub that gets enriched when the real dimension data arrives."