🛠 Learn ERWIN

Surrogate vs Natural Keys

Why every data warehouse table has an auto-generated ID column.

The two options

Natural key

A column (or combination) that exists in the business domain. Examples: email address, ISBN, employee number, SSN.

Surrogate key

An artificial, auto-generated value with no business meaning. Usually a sequential integer or UUID. It exists only to identify rows.

Natural key approach Customer ├── Email (PK) ← natural key — business-meaningful ├── Name ├── PhoneNumber └── Address Surrogate key approach Customer ├── CustomerSK (PK) ← surrogate — auto-incremented integer ├── Email ← still stored, but not the key ├── Name ├── PhoneNumber └── Address

Why surrogate keys win in data warehouses

1. Performance

Integer joins are much faster than text joins. A 4-byte INT compares faster than a 50-character email, especially with millions of rows.

2. Stability

Natural keys change. Emails get updated. Names get hyphenated after marriage. Phone numbers change. Surrogate keys never change.

3. Simplicity with composite keys

Sometimes the natural key is a combination (FirstName + LastName + DOB). Joining 5 tables on that is painful. One integer is clean.

4. Required for Type 2 SCDs

If you need historical versions of a customer (Type 2 SCD), you can't use the natural key as the PK — multiple rows have the same email. You need a surrogate key to distinguish "John at his old address" from "John at his new address".

CustomerDim — Type 2 SCD with surrogate keys CustomerSK │ CustomerID │ Name │ Address │ Current 501C001 │ John │ 123 Main │ N 502C001 │ John │ 456 Oak │ Y ← same person, new row CustomerID is the natural key (still unique per real person). CustomerSK is the surrogate — unique per version.

5. Decoupling from business rules

If the business changes how they assign customer IDs (e.g. migrating from "C001" format to UUIDs), your warehouse doesn't break — the surrogate key remains stable.

6. Security and privacy

SSN as a primary key means it appears in every related table. Surrogate keys keep sensitive data isolated to one place.

When natural keys are OK

Implementation

SQL — auto-incrementing surrogate key
CREATE TABLE CustomerDim ( CustomerSK INT PRIMARY KEY AUTO_INCREMENT, -- MySQL -- CustomerSK INT IDENTITY(1,1) PRIMARY KEY, -- SQL Server -- CustomerSK SERIAL PRIMARY KEY, -- PostgreSQL CustomerID VARCHAR(20) NOT NULL, -- natural key CustomerName VARCHAR(100), Email VARCHAR(200), City VARCHAR(100), StartDate DATE, EndDate DATE, IsCurrent CHAR(1) );

Comparison summary

AspectNatural KeySurrogate Key
SourceBusiness attributeSystem-generated
MeaningHas business meaningNone
StabilityCan changeNever changes
Join performanceSlower (often text)Faster (integer)
Composite key riskOften compositeAlways single column
Type 2 SCD friendly?NoYes
Default in DW?RarelyAlmost always
The rule of thumb In OLTP, natural keys are often fine. In data warehouses, default to surrogate keys for every dimension. The 4-byte integer column pays for itself in performance, stability, and SCD flexibility.