A question I've been asked more than once since moving to the Databricks stack: "We're using Delta Lake, so doesn't that already handle data quality?" The answer is: partly. Delta handles one category of data quality problems very well and is silent on a different, equally important category. Knowing where Delta Lake stops and Great Expectations starts tells you how to use them together rather than as substitutes for each other.
What Delta Lake Enforces
Delta Lake schema enforcement prevents you from writing data that violates the table's schema. If the column is defined as LongType and you try to write a StringType, the write fails. If the schema says there are 12 columns and you try to write 13, the write fails. This is enforced at the storage layer — it's not optional, it's not a check you can skip.
Schema evolution with mergeSchema or overwriteSchema lets you change the schema intentionally, but the enforcement means you can't accidentally write incompatible data.
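To make the write-time rules concrete, here is a toy model in plain Python. It is not Delta's implementation: schemas are simplified to dicts of type names, the function name is made up for illustration, and strict type equality is a simplification of what Delta actually does. It assumes the documented behavior that unexpected columns and mismatched types are rejected, while table columns missing from the incoming data are filled with nulls.

```python
# Toy model of write-time schema validation (an illustration, not Delta's code).
# Schemas are plain dicts mapping column name -> type name.

def check_write_compatible(table_schema: dict, incoming_schema: dict) -> list:
    """Return the reasons a write would be rejected (empty list means compatible)."""
    problems = []
    for column, dtype in incoming_schema.items():
        if column not in table_schema:
            # Extra columns are rejected unless schema evolution is enabled.
            problems.append(f"unexpected column: {column}")
        elif table_schema[column] != dtype:
            problems.append(
                f"type mismatch on {column}: "
                f"table has {table_schema[column]}, got {dtype}"
            )
    return problems


table_schema = {"event_id": "long", "magnitude": "double", "event_date": "date"}

# Wrong type for event_id, plus a column the table doesn't have.
bad_write = {"event_id": "string", "magnitude": "double", "source": "string"}
print(check_write_compatible(table_schema, bad_write))

# Matching types: compatible, even though event_date is absent from the batch.
ok_write = {"event_id": "long", "magnitude": "double"}
print(check_write_compatible(table_schema, ok_write))
```

Note what the model never looks at: the values themselves. That is the gap the rest of this post is about.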
Delta also gives you ACID transactions, time travel, and audit history — all useful for data quality investigations, but not themselves quality enforcement.
What Delta Lake Doesn't Enforce
Delta's schema doesn't know anything about the meaning of your data. It knows the type; it doesn't know the valid range. A magnitude column of DoubleType will accept -999.0 just as readily as 2.5. An event_date column will accept January 1, 1900 just as readily as today's date. An event_status column will accept any string, not just the three values your application actually uses.
These are business rule violations, not schema violations. Schema enforcement can't catch them because they depend on domain knowledge that isn't part of the column types. Delta does offer NOT NULL and CHECK constraints for simple row-level rules, but they are all-or-nothing: a single violating row fails the whole write, with no tolerance threshold and no validation report.
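A small plain-Python sketch shows the distinction. The rows, the helper names, and the 1950 cutoff for event_date are assumptions made up for illustration; the magnitude range and status values are the ones used later in this post.

```python
from datetime import date

# Hypothetical rows that both satisfy a (double, date, string) schema.
rows = [
    {"magnitude": 2.5, "event_date": date(2024, 3, 14), "event_status": "active"},
    {"magnitude": -999.0, "event_date": date(1900, 1, 1), "event_status": "actve"},
]

VALID_STATUSES = {"active", "closed", "cancelled"}


def type_errors(row: dict) -> list:
    """Everything a type-only check can see."""
    expected = {"magnitude": float, "event_date": date, "event_status": str}
    return [col for col, typ in expected.items() if not isinstance(row[col], typ)]


def rule_errors(row: dict) -> list:
    """Domain rules; the bounds are assumptions for illustration."""
    errors = []
    if not 0.0 <= row["magnitude"] <= 12.0:
        errors.append("magnitude out of range")
    if row["event_date"] < date(1950, 1, 1):
        errors.append("event_date implausibly old")
    if row["event_status"] not in VALID_STATUSES:
        errors.append("unknown event_status")
    return errors


for row in rows:
    print(type_errors(row), rule_errors(row))
```

Both rows are clean as far as types go. Only the rule check notices that the second row is wrong in every column.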
Where Great Expectations Picks Up
# Delta Lake schema enforcement: already handled by the table definition
# Great Expectations: business rule layer on top
# `validator` is assumed to be a Great Expectations Validator for the silver batch

validator.expect_column_values_to_be_between(
    "magnitude", min_value=0.0, max_value=12.0, mostly=0.99
)  # Not a type constraint, and `mostly` tolerates up to 1% outliers

validator.expect_column_values_to_not_be_null("event_date")
# Delta's schema allows nulls unless the column is declared NOT NULL

validator.expect_column_values_to_be_in_set(
    "event_status",
    value_set=["active", "closed", "cancelled"],
)  # Delta's schema doesn't know what values are semantically valid

validator.expect_column_pair_values_a_to_be_greater_than_b(
    column_A="end_datetime", column_B="start_datetime"
)  # Cross-column constraint: outside the scope of schema enforcement
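The mostly=0.99 argument on the magnitude check deserves a closer look, because it is the kind of tolerance a hard write-time rule can't give you. As I understand it, the expectation passes when at least that fraction of the non-null values satisfy the rule. The sketch below models that idea in plain Python; the function name is hypothetical, and treating an all-null column as passing is an assumption of this sketch rather than a statement about Great Expectations internals.

```python
def passes_with_mostly(values, predicate, mostly=1.0):
    """Pass if at least `mostly` of the non-null values satisfy `predicate`."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # assumption: nothing to evaluate counts as a pass
    passing = sum(1 for v in non_null if predicate(v))
    return passing / len(non_null) >= mostly


def in_range(magnitude):
    return 0.0 <= magnitude <= 12.0


one_outlier = [2.5] * 199 + [-999.0]        # 99.5% of values in range
three_outliers = [2.5] * 197 + [-999.0] * 3  # 98.5% of values in range

print(passes_with_mostly(one_outlier, in_range, mostly=0.99))
print(passes_with_mostly(three_outliers, in_range, mostly=0.99))
print(passes_with_mostly(one_outlier, in_range))  # strict: every value must pass
```

With a 0.99 threshold a single sentinel value in 200 rows is tolerated, while three are not. Dropping the threshold makes the check strict.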
The Combined Architecture
The pattern that works:
- Delta Lake schema: enforces column types and rejects unexpected columns at write time. Your pipeline can't accidentally add a column or change a type.
- Great Expectations in the silver layer: validates business rules after raw data has been parsed and typed. Catches domain violations before data moves to the gold layer.
- Great Expectations in the gold layer: validates that aggregations and derived metrics are within expected bounds before serving to downstream consumers.
# Bronze → Silver transform
silver_df = parse_and_type_raw_data(bronze_df)

# GE validates business rules at silver
silver_validation = run_checkpoint("storm_silver_checkpoint", silver_df)
if not silver_validation.success:
    raise RuntimeError("Silver layer failed business rule validation")

# Write to Delta (schema enforcement kicks in here)
silver_df.write.format("delta").mode("append").saveAsTable("storm.silver_events")

# Silver → Gold aggregation
gold_df = compute_storm_summaries(silver_df)

# GE validates gold layer outputs
gold_validation = run_checkpoint("storm_gold_checkpoint", gold_df)
if not gold_validation.success:
    raise RuntimeError("Gold layer metrics out of expected range")

gold_df.write.format("delta").mode("overwrite").saveAsTable("storm.gold_summaries")
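The helpers in that pipeline (run_checkpoint and friends) are project-specific and not shown here. To illustrate just the gate itself, validate first and write only on success, here is a self-contained sketch with stand-ins: ValidationResult, run_checks, validate_then_write, and the gold-layer metric names are all hypothetical, and a plain list plays the role of the Delta table.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ValidationResult:
    """Minimal stand-in for a checkpoint result."""
    success: bool
    failures: list = field(default_factory=list)


def run_checks(rows: list, checks: dict) -> ValidationResult:
    """Apply each named row-level check; collect the names of checks any row fails."""
    failures = [
        name for name, check in checks.items()
        if not all(check(row) for row in rows)
    ]
    return ValidationResult(success=not failures, failures=failures)


def validate_then_write(rows: list, checks: dict, write: Callable, layer: str) -> None:
    """Write only if validation succeeds, mirroring the gate in the pipeline."""
    result = run_checks(rows, checks)
    if not result.success:
        raise RuntimeError(f"{layer} layer failed validation: {result.failures}")
    write(rows)


# Hypothetical gold-layer checks: summary metrics within expected bounds.
gold_checks = {
    "avg_magnitude_in_range": lambda r: 0.0 <= r["avg_magnitude"] <= 12.0,
    "event_count_positive": lambda r: r["event_count"] > 0,
}

written = []  # stands in for the gold table

validate_then_write(
    [{"avg_magnitude": 3.1, "event_count": 42}], gold_checks, written.extend, "Gold"
)

try:
    validate_then_write(
        [{"avg_magnitude": 57.0, "event_count": 42}], gold_checks, written.extend, "Gold"
    )
except RuntimeError as err:
    print(err)  # the bad batch never reaches `written`
```

The design point is the ordering: because the exception is raised before write is called, a failed batch leaves the target untouched, which is the same guarantee the checkpoint-then-saveAsTable sequence is relying on.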
Delta and GE are not alternatives to each other. They enforce different things at different layers. Use both. As always, I'm here to help.