Data Vault on Databricks: The Hub-Satellite Pattern for Audit-Ready Pipelines

Dimensional modeling works well when your business questions are known in advance and your schema can be designed around them. It falls apart when the questions change, when you need full historical auditability of every change to every record, or when you're building a centralized enterprise data warehouse that needs to survive schema evolution across multiple source systems over years.

Data Vault was designed for exactly those conditions. It's verbose, it requires more up-front modeling work, and the queries are more complex than a star schema. But for the right use case, it pays off. Here's how the core pattern works on Databricks.

The Three Core Constructs

Hubs store business keys — the unique identifiers that are meaningful to the business. An order hub stores order IDs. A customer hub stores customer IDs. Hubs never change after insertion; they're append-only.

Satellites store descriptive attributes and their history. When a customer's address changes, a new row is inserted into the customer satellite with a new load timestamp. The old row is preserved. Every state the record has ever been in is queryable.

Links store relationships between hubs. The relationship between a customer and an order is stored in a link table that holds foreign keys to both hubs. Links are also append-only.

The Hub Pattern

from pyspark.sql import DataFrame
from pyspark.sql.functions import sha2, concat_ws, lit, current_timestamp
from delta.tables import DeltaTable

def load_hub(
source_df: DataFrame,
hub_path: str,
business_key_col: str,
source_name: str
) -> None:
hub_df = (
source_df
.select(business_key_col)
.distinct()
.withColumn(
"hub_hash_key",
sha2(col(business_key_col).cast("string"), 256)
)
.withColumn("load_timestamp", current_timestamp())
.withColumn("record_source", lit(source_name))
)

if DeltaTable.isDeltaTable(spark, hub_path):
dt = DeltaTable.forPath(spark, hub_path)
(dt.alias("hub")
.merge(hub_df.alias("new"), "hub.hub_hash_key = new.hub_hash_key")
.whenNotMatchedInsertAll()
.execute())
else:
hub_df.write.format("delta").save(hub_path)

The Satellite Pattern

from pyspark.sql.functions import col, sha2, concat_ws, current_timestamp, lit

def load_satellite(
source_df: DataFrame,
sat_path: str,
hub_key_col: str,
attribute_cols: list[str],
source_name: str
) -> None:
sat_df = (
source_df
.select([hub_key_col] + attribute_cols)
.withColumn(
"hub_hash_key",
sha2(col(hub_key_col).cast("string"), 256)
)
.withColumn(
"hashdiff",
sha2(concat_ws("||", *[col(c).cast("string") for c in attribute_cols]), 256)
)
.withColumn("load_timestamp", current_timestamp())
.withColumn("load_end_timestamp", lit(None).cast("timestamp"))
.withColumn("record_source", lit(source_name))
)

if DeltaTable.isDeltaTable(spark, sat_path):
dt = DeltaTable.forPath(spark, sat_path)
# Close out records where the hashdiff changed
dt.alias("sat").merge(
sat_df.alias("new"),
"sat.hub_hash_key = new.hub_hash_key AND sat.load_end_timestamp IS NULL AND sat.hashdiff != new.hashdiff"
).whenMatchedUpdate(set={"load_end_timestamp": "new.load_timestamp"}).execute()
# Insert new rows (new records or changed records)
(dt.alias("sat")
.merge(sat_df.alias("new"),
"sat.hub_hash_key = new.hub_hash_key AND sat.hashdiff = new.hashdiff AND sat.load_end_timestamp IS NULL")
.whenNotMatchedInsertAll()
.execute())
else:
sat_df.write.format("delta").save(sat_path)

What Delta Lake Adds

Data Vault on a traditional RDBMS requires careful transaction management to maintain consistency between hub, satellite, and link loads. Delta Lake's ACID guarantees handle this — each load operation is atomic, so you don't get half-loaded hubs or orphaned satellite records from a failed run. Time travel gives you an additional layer of historical access on top of the satellite's native history.

When Not to Use Data Vault

Data Vault adds modeling complexity and query complexity. If your use case doesn't require full historical auditability, doesn't need to survive radical schema changes, and doesn't integrate ten or more source systems with conflicting business key definitions — a star schema or a well-designed medallion architecture will serve you better with less overhead. Data Vault is not the default. It's the right answer for specific, high-complexity situations. As always, I'm here to help.

Read more