Building a Data Quality Framework: Architecture Decisions After Five Years with Great Expectations

Five years of using Great Expectations across different projects (pandas-based data science, SQL Server pipelines, Databricks at scale) has produced a set of architectural opinions I keep coming back to. This post covers the patterns that have held up and the mistakes that taught me to adopt them.

Maintain a Separate Data Quality Project

The mistake I made early on: embedding GE configuration inside the pipeline project it served. When the pipeline was refactored, the GE configuration got dropped or broken. When the pipeline team changed, it became orphaned.

The pattern that works better: a dedicated data quality repository that owns all expectation suites, GE configuration, and validation infrastructure. Pipeline projects reference the suites by name. The data quality project is maintained by the team that owns data quality — which may or may not be the same team that owns the pipeline.

data-quality/
  great_expectations/
    expectations/
      storm_events_raw.json
      storm_events_silver.json
      storm_events_gold.json
      weather_stations_silver.json
    checkpoints/
      storm_daily_checkpoint.yml
      stations_weekly_checkpoint.yml
    great_expectations.yml
  tests/
    test_expectation_suites.py    # verify suites are valid GE JSON
  docs/
    data_contracts.md             # human-readable summary of each suite
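
As a rough idea of what tests/test_expectation_suites.py could check, here is a minimal sketch using only the standard library. The key names are an assumption: older GE releases write "expectation_suite_name" and "expectation_type", newer ones use "name" and "type", so the sketch accepts either. The helper names suite_problems and check_suite_dir are hypothetical, and this only checks structure, not whether GE would accept every expectation.

```python
import json
from pathlib import Path


def suite_problems(suite) -> list[str]:
    """Return a list of structural problems found in one parsed suite file."""
    if not isinstance(suite, dict):
        return ["top level must be a JSON object"]
    problems = []
    # Assumption: the suite name key differs across GE versions; accept either.
    if not (suite.get("expectation_suite_name") or suite.get("name")):
        problems.append("missing suite name")
    expectations = suite.get("expectations")
    if not isinstance(expectations, list) or not expectations:
        problems.append("no expectations defined")
        return problems
    for i, exp in enumerate(expectations):
        if not isinstance(exp, dict):
            problems.append(f"expectation {i}: must be an object")
            continue
        if not (exp.get("expectation_type") or exp.get("type")):
            problems.append(f"expectation {i}: missing type")
        if not isinstance(exp.get("kwargs"), dict):
            problems.append(f"expectation {i}: kwargs must be an object")
    return problems


def check_suite_dir(directory: Path) -> dict[str, list[str]]:
    """Map each suite filename to its problems; unparseable JSON counts too."""
    report = {}
    for path in sorted(directory.glob("*.json")):
        try:
            suite = json.loads(path.read_text())
        except json.JSONDecodeError as err:
            report[path.name] = [f"invalid JSON: {err}"]
            continue
        report[path.name] = suite_problems(suite)
    return report
```

A test can then assert that every list in check_suite_dir(Path("great_expectations/expectations")) is empty, which catches a half-edited suite before any pipeline tries to load it.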

Name Suites After Data Assets, Not Pipelines

I named my early suites after pipelines: storm_ingest_validation.json, etl_output_check.json. When the pipeline was reorganized, the suite name became misleading. When the same data asset was produced by multiple pipelines, I had duplicate suites with diverging definitions.

The right naming convention: the suite describes the data asset, not the pipeline. storm_events_silver.json is the contract for the silver-layer storm events data, regardless of which pipeline produced it. Any pipeline that produces that asset should validate against that suite.
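
One way to keep the convention from drifting is a small check in the data quality repo's tests. This is a sketch under an assumed pattern of asset name plus layer suffix, with the layer list (raw, silver, gold) taken from the suites named in this post; parse_suite_name is a hypothetical helper. It only checks the shape of the name and cannot tell whether the asset part is really an asset rather than a pipeline.

```python
import re

# Layer names taken from the suites listed in this post; extend as needed.
LAYERS = ("raw", "silver", "gold")
SUITE_NAME = re.compile(
    r"^(?P<asset>[a-z][a-z0-9_]*)_(?P<layer>" + "|".join(LAYERS) + r")$"
)


def parse_suite_name(name: str):
    """Split 'storm_events_silver' into ('storm_events', 'silver'), else None."""
    match = SUITE_NAME.match(name)
    if match is None:
        return None
    return match.group("asset"), match.group("layer")
```

Under this pattern storm_events_silver parses cleanly, while the old pipeline-style names storm_ingest_validation and etl_output_check are rejected because they do not end in a layer.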

Separate Raw Suites from Processed Suites

Raw data suites are lenient — they reflect what the source actually delivers, including known messiness. Processed data suites are strict — they reflect what your transforms are supposed to guarantee.

# storm_events_raw.json — reflects source reality
"expect_column_values_to_not_be_null": {"column": "event_id", "mostly": 0.999}
# 0.1% null tolerance because NOAA has some transcription gaps

# storm_events_silver.json — reflects transform guarantees
"expect_column_values_to_not_be_null": {"column": "event_id", "mostly": 1.0}
# Your transform should have handled the nulls — no tolerance here
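
To make the arithmetic behind those two tolerances concrete, here is a plain-Python illustration of what a mostly threshold means for a not-null check. It is not GE's implementation; passes_not_null is a hypothetical stand-in, and treating an empty batch as passing is my assumption.

```python
def passes_not_null(values, mostly=1.0):
    """True when the fraction of non-null values is at least `mostly`."""
    if not values:
        return True  # assumption: an empty batch passes vacuously
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


# 10,000 rows with 5 missing event_ids: 99.95% non-null.
batch = ["evt"] * 9995 + [None] * 5
raw_ok = passes_not_null(batch, mostly=0.999)     # True: within the raw tolerance
silver_ok = passes_not_null(batch, mostly=1.0)    # False: silver allows no nulls
```

The same batch passes the raw contract and fails the silver one, which is the point: the gap between the two suites is exactly the work the transform is expected to do.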

Log Validation Results to a Queryable Store

Build a validation result history from day one. Whether that's a Delta table in Unity Catalog, a PostgreSQL table, or an S3 bucket of JSON files, you need to be able to answer "when did this suite first start failing?" and "has our null rate on this column been trending up?"

GE's built-in result stores (local filesystem, S3, Azure Blob) give you the files. Making them queryable requires a bit more work: either a simple ETL job that reads the result JSON and writes to a table, or a GE store backend that writes directly to a database.
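
The "simple ETL" option can be very small. The sketch below flattens one validation result document into one row per expectation and loads it into SQLite, standing in for whatever table you actually use. The field names (meta.expectation_suite_name, meta.run_id.run_time, results[].expectation_config, results[].result.unexpected_percent) are my assumption about the result JSON layout and vary between GE versions, so check them against a real result file. The table name validation_history and both function names are hypothetical.

```python
import sqlite3


def flatten_result(result: dict) -> list[tuple]:
    """Turn one validation result document into one row per expectation."""
    meta = result.get("meta", {})
    suite = meta.get("expectation_suite_name", "unknown")
    run_time = meta.get("run_id", {}).get("run_time", "unknown")
    rows = []
    for item in result.get("results", []):
        config = item.get("expectation_config", {})
        rows.append((
            suite,
            run_time,
            config.get("expectation_type"),
            config.get("kwargs", {}).get("column"),
            int(bool(item.get("success"))),
            item.get("result", {}).get("unexpected_percent"),
        ))
    return rows


def load_results(conn: sqlite3.Connection, result: dict) -> int:
    """Append the flattened rows to the history table; return rows written."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS validation_history (
               suite TEXT, run_time TEXT, expectation TEXT,
               column_name TEXT, success INTEGER, unexpected_percent REAL)"""
    )
    rows = flatten_result(result)
    conn.executemany(
        "INSERT INTO validation_history VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

With rows in that shape, "when did this suite first start failing?" becomes SELECT MIN(run_time) FROM validation_history WHERE suite = ? AND success = 0, and the null-rate trend is unexpected_percent over run_time for the not-null expectation on that column.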

Version Your Suites Like You Version Code

A suite change is a contract change. It should have a commit message explaining why — not just "updated magnitude range" but "widened magnitude range to accommodate new NOAA reporting methodology for EF5 tornadoes." When something goes wrong downstream and you're trying to figure out when the contract changed, the git log of the suite file is your audit trail.

PR review for suite changes is not overkill. Someone with domain knowledge should sign off on a decision to widen a range or add a mostly tolerance to an expectation that previously had none. That decision has downstream consequences.
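
Review is easier when the risky changes are pointed out automatically. Here is a sketch of a check that compares two versions of a suite and lists expectations whose mostly tolerance was lowered or newly added. It assumes the "expectation_type" and "kwargs" layout, treats a missing mostly as 1.0, and assumes each (type, column) pair appears once per suite. loosened_tolerances is a hypothetical helper, and it does not cover widened ranges.

```python
def loosened_tolerances(old_suite: dict, new_suite: dict) -> list[str]:
    """List expectations whose `mostly` dropped between two suite versions."""

    def index(suite):
        # Key each expectation by (type, column); assumes that pair is unique.
        return {
            (e.get("expectation_type"), e.get("kwargs", {}).get("column")):
                e.get("kwargs", {}).get("mostly", 1.0)
            for e in suite.get("expectations", [])
        }

    old, new = index(old_suite), index(new_suite)
    flagged = []
    for key, new_mostly in new.items():
        old_mostly = old.get(key)
        # Only flag expectations present in both versions that became laxer.
        if old_mostly is not None and new_mostly < old_mostly:
            flagged.append(
                f"{key[0]} on {key[1]}: mostly {old_mostly} -> {new_mostly}"
            )
    return flagged
```

Run against the suite file on the main branch and the one in the PR, a non-empty list is a prompt for the reviewer to ask why the tolerance moved, not an automatic rejection.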
