Great Expectations and Unity Catalog: Data Quality Across the Metastore

Unity Catalog changed the Databricks data governance story significantly: a single metastore, fine-grained access controls, cross-workspace data sharing, and lineage tracking. It also changed the infrastructure assumptions for Great Expectations integrations, since Unity Catalog tables are addressed through a three-level namespace (catalog.schema.table) rather than the Hive Metastore names that earlier GE Databricks integrations were designed for.

Here's how the integration works with Unity Catalog, and where the current friction points are.

Reading Unity Catalog Tables for Validation

The cleanest path is to read the Unity Catalog table as a Spark DataFrame and pass it to GE via the RuntimeBatchRequest pattern — the same approach as validating any Spark DataFrame:

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# Unity Catalog table reference: catalog.schema.table
uc_df = spark.table("storm_catalog.silver.storm_events")

batch_request = RuntimeBatchRequest(
    datasource_name="spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="storm_events_silver",
    runtime_parameters={"batch_data": uc_df},
    batch_identifiers={"run_id": "2022-03-14-daily"}
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="storm_silver_suite"
)

# Run expectations as usual
validator.expect_column_values_to_not_be_null("event_id")
validator.expect_column_values_to_be_between("magnitude", 0.0, 12.0, mostly=0.99)
result = validator.validate()
print(result.success)
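
The batch request above assumes that a datasource named spark_datasource, with a data connector named runtime_connector, is already registered on the context. A minimal sketch of that configuration follows, assuming the block-config datasource style used by the GE releases that support RuntimeBatchRequest (roughly 0.13 through 0.15). The two helper function names are illustrative and not part of GE.

```python
def build_spark_runtime_datasource_config(
    datasource_name="spark_datasource",
    connector_name="runtime_connector",
):
    """Return a block-style datasource config for validating in-memory Spark DataFrames."""
    return {
        "name": datasource_name,
        "class_name": "Datasource",
        # Run expectations on the cluster's Spark session
        "execution_engine": {"class_name": "SparkDFExecutionEngine"},
        "data_connectors": {
            connector_name: {
                "class_name": "RuntimeDataConnector",
                # Must list the keys passed as batch_identifiers in the batch request
                "batch_identifiers": ["run_id"],
            }
        },
    }


def register_spark_datasource(context, config):
    """Register the config on an existing GE data context (hypothetical helper)."""
    return context.add_datasource(**config)
```

Calling register_spark_datasource(context, build_spark_runtime_datasource_config()) once, before building the batch request, should be enough; the names only need to match the ones the batch request uses.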

Storing Validation Results

With Unity Catalog, you can store GE validation results in a Delta table in the catalog itself, making them queryable and auditable alongside the data they describe:

from pyspark.sql import Row
import json
from datetime import datetime

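def save_validation_result_to_delta(result, catalog, schema, table):
    # With the batch request API used above, the suite name and batch
    # identifiers are carried in the result's meta block, not as
    # top-level attributes on the result object.
    result_dict = result.to_json_dict()
    meta = result_dict.get("meta") or {}
    batch_ids = (meta.get("active_batch_definition") or {}).get("batch_identifiers") or {}
    result_row = Row(
        # run_id is the batch identifier supplied in the RuntimeBatchRequest
        run_id=str(batch_ids.get("run_id", "")),
        suite_name=str(meta.get("expectation_suite_name", "")),
        run_time=datetime.now(),
        success=result.success,
        evaluated=result.statistics["evaluated_expectations"],
        successful=result.statistics["successful_expectations"],
        unsuccessful=result.statistics["unsuccessful_expectations"],
        # Keep the full result so individual expectations can be inspected later
        result_json=json.dumps(result_dict)
    )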
    result_df = spark.createDataFrame([result_row])
    result_df.write.format("delta").mode("append").saveAsTable(
        f"{catalog}.{schema}.{table}"
    )

save_validation_result_to_delta(
    result,
    catalog="storm_catalog",
    schema="data_quality",
    table="ge_validation_results"
)

Now your validation history is queryable from SQL:

SELECT run_time, suite_name, success, unsuccessful
FROM storm_catalog.data_quality.ge_validation_results
WHERE suite_name = 'storm_silver_suite'
ORDER BY run_time DESC
LIMIT 30;
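
The summary table says whether a run passed, but not which expectation failed. One possible extension is to flatten the per-expectation entries into rows of their own. The sketch below works on the plain dictionary returned by result.to_json_dict() and assumes the usual layout of that dictionary (a results list whose items carry expectation_config, success, and result). The function name and output column names are hypothetical.

```python
def flatten_expectation_results(result_dict, run_id):
    """Turn a validation result dict into one flat row per expectation."""
    rows = []
    for item in result_dict.get("results", []):
        config = item.get("expectation_config") or {}
        kwargs = config.get("kwargs") or {}
        detail = item.get("result") or {}
        rows.append(
            {
                "run_id": run_id,
                "expectation_type": config.get("expectation_type"),
                # Table-level expectations have no column, so this may be None
                "column_name": kwargs.get("column"),
                "success": bool(item.get("success")),
                # Not every expectation reports these, so they may be None
                "unexpected_count": detail.get("unexpected_count"),
                "unexpected_percent": detail.get("unexpected_percent"),
            }
        )
    return rows
```

Because some fields can be None, passing these rows to spark.createDataFrame with an explicit schema is likely safer than relying on type inference before appending them to a second Delta table.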

Unity Catalog Access Controls and GE

One thing to be explicit about: GE runs in the context of the Databricks cluster's service principal or user credential. The cluster needs SELECT access to any table it's validating. With Unity Catalog's fine-grained access controls, this means your validation jobs need to run under a service principal with appropriate grants — the same access model you'd apply to any data processing job.
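
To make "appropriate grants" concrete, here is a rough sketch that builds the GRANT statements a validation principal would likely need: read access to the table being validated, plus the ability to create the results table in the data quality schema. The principal name and the helper function are hypothetical, and the privilege names (USE CATALOG, USE SCHEMA, CREATE TABLE) follow the current Unity Catalog privilege model; older workspaces may use different names, so check before running these.

```python
def build_validation_grants(principal, catalog, source_schema, source_table, results_schema):
    """Return the GRANT statements for a least-privilege validation principal."""
    grantee = f"`{principal}`"  # backticks allow names with dashes or @
    return [
        # Needed to reach anything inside the catalog
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {grantee}",
        # Read access to the table under validation
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{source_schema} TO {grantee}",
        f"GRANT SELECT ON TABLE {catalog}.{source_schema}.{source_table} TO {grantee}",
        # Write access for the validation results table
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{results_schema} TO {grantee}",
        f"GRANT CREATE TABLE ON SCHEMA {catalog}.{results_schema} TO {grantee}",
    ]


statements = build_validation_grants(
    principal="ge-validation-sp",  # hypothetical service principal name
    catalog="storm_catalog",
    source_schema="silver",
    source_table="storm_events",
    results_schema="data_quality",
)
```

In a notebook, someone with the right to manage grants on those objects could then run each statement with spark.sql(statement).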

Don't run validation jobs under an admin service principal to work around permission issues. The permission model is telling you something about whether your validation job should have access to that data. Respect it.
