Surrogate vs Natural Keys
Why every data warehouse table has an auto-generated ID column.
The two options
Natural key
A column (or combination) that exists in the business domain. Examples: email address, ISBN, employee number, SSN.
Surrogate key
An artificial, auto-generated value with no business meaning. Usually a sequential integer or UUID. It exists only to identify rows.
Why surrogate keys win in data warehouses
1. Performance
Integer joins are much faster than text joins. A 4-byte INT compares faster than a 50-character email, especially with millions of rows.
2. Stability
Natural keys change. Emails get updated. Names get hyphenated after marriage. Phone numbers change. Surrogate keys never change.
3. Simplicity with composite keys
Sometimes the natural key is a combination (FirstName + LastName + DOB). Joining 5 tables on that is painful. One integer is clean.
4. Required for Type 2 SCDs
If you need historical versions of a customer (Type 2 SCD), you can't use the natural key as the PK — multiple rows have the same email. You need a surrogate key to distinguish "John at his old address" from "John at his new address".
5. Decoupling from business rules
If the business changes how they assign customer IDs (e.g. migrating from "C001" format to UUIDs), your warehouse doesn't break — the surrogate key remains stable.
6. Security and privacy
SSN as a primary key means it appears in every related table. Surrogate keys keep sensitive data isolated to one place.
When natural keys are OK
- OLTP systems where there's no Type 2 SCD requirement
- Small lookup tables (countries, currencies) where the natural key (3-letter code) is stable and short
- Composite keys in junction/bridge tables where the surrogate would just add overhead
Implementation
Comparison summary
| Aspect | Natural Key | Surrogate Key |
|---|---|---|
| Source | Business attribute | System-generated |
| Meaning | Has business meaning | None |
| Stability | Can change | Never changes |
| Join performance | Slower (often text) | Faster (integer) |
| Composite key risk | Often composite | Always single column |
| Type 2 SCD friendly? | No | Yes |
| Default in DW? | Rarely | Almost always |