Data contracts have become a hot topic in the data engineering community over the last couple of years. The idea isn't new — the concept of a formal agreement between a data producer and a data consumer about the shape, semantics, and quality of data has been implicit in every ETL integration ever built. What's new is making those agreements explicit, versioned, and machine-enforceable rather than living in a wiki page and two engineers' shared memory.
Great Expectations expectation suites are, in practice, data contracts. Here's how to use them that way deliberately.
What a Data Contract Actually Is
A data contract specifies:
- Schema: which columns, what types, what's nullable
- Semantics: what values are valid, what ranges are expected, what relationships must hold
- SLAs: timeliness, freshness, row count expectations
- Ownership: who produces it, who consumes it, who to contact when it breaks
An expectation suite naturally covers the first two. It can partially cover the third (row count expectations, freshness checks via timestamp columns). The fourth requires metadata you manage separately.
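As a rough sketch of that split, the expectation entries below are grouped by the contract section they serve. The expectation types are standard Great Expectations names, but the column names, the magnitude bounds, and the grouping dictionary itself are illustrative assumptions rather than part of any real suite.

```python
# Illustrative only: column names and thresholds are assumptions.
contract_sections = {
    "schema": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "event_id"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "event_id"}},
    ],
    "semantics": [
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "magnitude", "min_value": 0, "max_value": 10}},
    ],
    "sla": [
        {"expectation_type": "expect_table_row_count_to_be_between",
         "kwargs": {"min_value": 5000}},
    ],
}

# Flatten into the list that would sit under "expectations" in the suite JSON,
# tagging each entry's meta with the contract section it belongs to.
expectations = [
    {**exp, "meta": {"contract_section": section}}
    for section, exps in contract_sections.items()
    for exp in exps
]
```

Tagging each expectation with its section is optional, but it makes a failed validation easier to route: a schema failure and an SLA failure usually go to different people.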
Structuring Suites as Formal Contracts
The meta field in GE expectation suites and individual expectations carries arbitrary JSON. Use it to embed contract metadata:
{
  "expectation_suite_name": "storm_events_silver",
  "meta": {
    "contract_version": "2.1.0",
    "data_asset": "storm.silver_events",
    "producer": "storm-ingestion-pipeline",
    "consumers": ["model-training-pipeline", "storm-analytics-dashboard"],
    "owner": "[email protected]",
    "sla": {
      "freshness_hours": 24,
      "min_daily_row_count": 5000
    },
    "changelog": [
      {"version": "2.1.0", "date": "2023-08-01", "change": "Added magnitude validation after Q2 data issues"},
      {"version": "2.0.0", "date": "2023-01-15", "change": "Added event_type set constraint (breaking change)"},
      {"version": "1.0.0", "date": "2021-03-08", "change": "Initial contract"}
    ]
  },
  "expectations": [...]
}
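One way to put that metadata to work is to read the SLA block back out of the suite and compare it with what actually arrived. The sketch below assumes the suite has been loaded into a dict with the structure shown above; `check_sla` and its arguments are hypothetical helpers, not part of Great Expectations.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def check_sla(suite: dict, latest_event_time: datetime, row_count: int,
              now: datetime) -> List[str]:
    """Return a list of SLA violations based on the suite's meta.sla block."""
    sla = suite["meta"]["sla"]
    violations = []

    # Freshness: the newest row must be no older than freshness_hours.
    max_age = timedelta(hours=sla["freshness_hours"])
    if now - latest_event_time > max_age:
        violations.append(
            f"stale data: newest row is older than {sla['freshness_hours']}h"
        )

    # Volume: the daily load must reach the agreed minimum row count.
    if row_count < sla["min_daily_row_count"]:
        violations.append(
            f"row count {row_count} below minimum {sla['min_daily_row_count']}"
        )
    return violations


# Trimmed-down copy of the suite above; only the fields the check reads.
suite = {
    "expectation_suite_name": "storm_events_silver",
    "meta": {"sla": {"freshness_hours": 24, "min_daily_row_count": 5000}},
}
now = datetime(2023, 8, 2, 12, 0, tzinfo=timezone.utc)
violations = check_sla(suite, now - timedelta(hours=30), 4200, now)
for violation in violations:
    print(f"[{suite['expectation_suite_name']}] {violation}")
```

Keeping the thresholds in the suite meta means the contract file stays the single place where the SLA is stated, and the checking code only interprets it.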
Consumer-Driven Contract Testing
One of the most useful contract patterns in software engineering is consumer-driven: the consumer defines the minimum contract it needs, and the producer validates that its output meets that contract. The same principle applies to data.
If your ML training pipeline only needs event_id, event_date, magnitude, and state, define a consumer suite that asserts exactly those columns and their constraints. The producer runs both its own suite (full output contract) and the consumer suite (does the consumer's minimum requirement hold?) before delivery:
# run_checkpoint and write_to_training_feature_store are project-specific
# helpers assumed to be defined elsewhere.
def deliver_to_training_pipeline(silver_df):
    # Producer suite: full output contract
    producer_result = run_checkpoint("storm_silver_producer", silver_df)
    if not producer_result.success:
        raise RuntimeError("Producer suite failed: do not deliver")

    # Consumer suite: minimum contract for training pipeline
    consumer_result = run_checkpoint("storm_silver_for_training", silver_df)
    if not consumer_result.success:
        raise RuntimeError("Consumer contract not met: coordinate with training team")

    # Both contracts hold, so the data can be published
    write_to_training_feature_store(silver_df)
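A consumer suite for that training pipeline could be as small as the sketch below, written here as the dict form of the suite JSON. The specific constraints, the contents of the producer suite, and the `referenced_columns` and `missing_columns` helpers are assumptions for illustration. The helpers perform a cheap static check that every column the consumer relies on is mentioned by the producer contract at all, which can run in CI before any data is validated.

```python
consumer_suite = {
    "expectation_suite_name": "storm_silver_for_training",
    "meta": {"consumer": "model-training-pipeline"},
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "event_id"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "event_date"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "magnitude", "min_value": 0}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "state"}},
    ],
}


def referenced_columns(suite: dict) -> set:
    """Collect every column named in the suite's expectation kwargs."""
    return {
        exp["kwargs"]["column"]
        for exp in suite["expectations"]
        if "column" in exp["kwargs"]
    }


def missing_columns(consumer: dict, producer: dict) -> set:
    """Columns the consumer depends on that the producer contract never mentions."""
    return referenced_columns(consumer) - referenced_columns(producer)


# Hypothetical producer suite that forgot to cover `state`.
producer_suite = {
    "expectation_suite_name": "storm_silver_producer",
    "expectations": [
        {"expectation_type": "expect_column_to_exist", "kwargs": {"column": c}}
        for c in ["event_id", "event_date", "magnitude", "event_type"]
    ],
}
print(missing_columns(consumer_suite, producer_suite))
```

A non-empty result does not prove the data is wrong, only that the consumer depends on something the producer has never promised, which is worth resolving before the first delivery.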
Breaking vs. Non-Breaking Contract Changes
Apply semantic versioning to your expectation suites:
- Patch (1.0.x): tightening a range, adding a mostly threshold, fixing a typo in meta
- Minor (1.x.0): adding new expectations (consumers need to verify they still pass)
- Major (x.0.0): removing columns, renaming columns, changing types, removing expectations that consumers depended on
Major contract changes require explicit coordination with consumers. The version in the suite meta makes the change type visible in the Git diff. A major version bump in a PR review is a signal that downstream teams need to be notified before merge.
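A CI check along these lines could flag major bumps automatically. It is a minimal sketch that assumes contract_version is a plain MAJOR.MINOR.PATCH string, as in the meta example earlier; `parse_version`, `bump_type`, and `needs_consumer_signoff` are hypothetical helpers, and reading the old and new suite files from the PR is left out.

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_type(old: str, new: str) -> str:
    """Classify the change between two contract versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v < old_v:
        raise ValueError(f"contract version went backwards: {old} -> {new}")
    if new_v == old_v:
        return "none"
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def needs_consumer_signoff(old: str, new: str) -> bool:
    """Major bumps are the ones that require notifying downstream teams."""
    return bump_type(old, new) == "major"


if needs_consumer_signoff("1.0.0", "2.0.0"):
    print("Major contract change: notify consumers before merge")
```

Comparing the versions as integer tuples rather than strings avoids the classic mistake of treating "1.10.0" as older than "1.9.0".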
Making data contracts explicit and versioned doesn't eliminate the coordination overhead of schema changes, but it makes that coordination happen deliberately and in advance, rather than reactively when something breaks.