Dimensional Modeling in a Lakehouse: What Changes When Constraints Are Optional

Dimensional modeling gave us a set of conventions that have held up remarkably well for thirty years: fact tables at the center, dimension tables around them, foreign keys enforced by the database, constraints preventing bad data from landing. It works because the database does active work to maintain integrity.

Move to a lakehouse — Delta Lake, Unity Catalog, Databricks — and the database stops doing that work. There are no foreign key constraints. There's no NOT NULL enforcement at the table level unless you add it. The rules that were implicit in your RDBMS design are now your responsibility to implement explicitly. Here's what that actually means in practice.

What Changes

In a traditional warehouse, a fact table row that references a nonexistent dimension key is rejected at insert time. The database enforces referential integrity. In Delta Lake, that row lands without complaint. You'll find it — probably at query time, probably when a report produces unexpected results.

This isn't a reason to abandon dimensional modeling. It's a reason to understand that the enforcement layer moves from the storage engine to your pipeline.

The Dimension Table Pattern

-- Unity Catalog: dimensional model for orders
CREATE TABLE prod_analytics.sales.dim_customer (
customer_key BIGINT NOT NULL, -- surrogate key
customer_id STRING NOT NULL, -- natural / business key
customer_name STRING NOT NULL,
region_code STRING NOT NULL,
customer_segment STRING,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP, -- NULL = current row
is_current BOOLEAN NOT NULL
)
USING DELTA
COMMENT 'Customer dimension — SCD Type 2';

-- Delta doesn't enforce NOT NULL at insert time — validate in pipeline
-- or use DLT expectations to reject violations

SCD Type 2 Without a Surrogate Key Generator

In SQL Server you'd use IDENTITY columns. In Delta Lake, surrogate key generation requires explicit code:

from pyspark.sql.functions import monotonically_increasing_id, col, current_timestamp, lit

def load_scd2_dimension(
source_df,
dim_path: str,
natural_key_col: str,
attribute_cols: list[str]
) -> None:
from delta.tables import DeltaTable

if not DeltaTable.isDeltaTable(spark, dim_path):
# First load — generate surrogate keys
initial_df = (
source_df
.withColumn("customer_key", monotonically_increasing_id())
.withColumn("valid_from", current_timestamp())
.withColumn("valid_to", lit(None).cast("timestamp"))
.withColumn("is_current", lit(True))
)
initial_df.write.format("delta").save(dim_path)
return

dim_table = DeltaTable.forPath(spark, dim_path)

# Detect changes
changed = (
source_df.alias("src")
.join(
dim_table.toDF().filter("is_current = true").alias("dim"),
col(f"src.{natural_key_col}") == col(f"dim.{natural_key_col}"),
"left"
)
.where(" OR ".join(f"src.{c} != dim.{c}" for c in attribute_cols))
)

# Close out changed current rows
(dim_table.alias("dim")
.merge(
changed.alias("chg"),
f"dim.{natural_key_col} = chg.{natural_key_col} AND dim.is_current = true"
)
.whenMatchedUpdate(set={
"valid_to": "current_timestamp()",
"is_current": "false"
})
.execute())

# Insert new current rows for changed records
new_rows = changed.withColumn("customer_key", monotonically_increasing_id() + 1000000) \
.withColumn("valid_from", current_timestamp()) \
.withColumn("valid_to", lit(None).cast("timestamp")) \
.withColumn("is_current", lit(True))
new_rows.write.format("delta").mode("append").save(dim_path)

Referential Integrity as a Pipeline Step

Since Delta doesn't enforce foreign key relationships, add an explicit check before loading fact tables:

def check_referential_integrity(
fact_df,
dim_df,
fact_key: str,
dim_key: str,
source_name: str
) -> tuple:
orphaned = fact_df.join(
dim_df.select(dim_key),
fact_df[fact_key] == dim_df[dim_key],
"left_anti"
)
orphan_count = orphaned.count()
if orphan_count > 0:
# Route orphans to the DLQ or raise depending on tolerance
print(f"WARNING: {orphan_count} fact rows have no matching {source_name} dimension key")
return fact_df.join(dim_df.select(dim_key), fact_df[fact_key] == dim_df[dim_key])

The work the database used to do for free now costs you pipeline code. The trade-off is flexibility and scale — no constraint overhead on billion-row tables, no locks during bulk loads. But the safety guarantees only exist if you build them. As always, I'm here to help.

Read more