Great Expectations with Databricks: Validating Delta Lake Tables

The last year has been the Databricks year for me. New client, new stack — Spark instead of pandas, Delta Lake instead of SQL Server, notebooks instead of SSDT. And the first question I had to answer was whether the data quality discipline I'd built around Great Expectations would carry over.

The short answer: yes, with some adaptation. Great Expectations supports Spark datasources and can validate Delta Lake tables through a Datasource configured with the SparkDFExecutionEngine. The expectation API is the same; the execution engine changes.

Setting Up the Spark Datasource

On Databricks, install Great Expectations either as a cluster library (via PyPI) or with a notebook-scoped %pip install, then configure a DataContext. In a Databricks notebook:

%pip install great_expectations

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    InMemoryStoreBackendDefaults
)

# Use an ephemeral in-memory context for notebook-based validation
context = BaseDataContext(
    project_config=DataContextConfig(
        store_backend_defaults=InMemoryStoreBackendDefaults()
    )
)

# Add a Spark datasource
context.add_datasource(
    name="spark_delta",
    class_name="Datasource",
    execution_engine={
        "class_name": "SparkDFExecutionEngine",
        "force_reuse_spark_context": True
    },
    data_connectors={
        "runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"]
        }
    }
)

Validating a Delta Table

# Read the Delta table as a Spark DataFrame
delta_df = spark.read.format("delta").load("/mnt/datalake/storm_events/silver")

# Create a batch request pointing at the DataFrame
batch_request = RuntimeBatchRequest(
    datasource_name="spark_delta",
    data_connector_name="runtime_connector",
    data_asset_name="storm_events_silver",
    runtime_parameters={"batch_data": delta_df},
    batch_identifiers={"run_id": "2021-03-08"}
)

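# Create the suite first; get_validator looks it up by name in the context
context.create_expectation_suite(
    expectation_suite_name="storm_events_silver",
    overwrite_existing=True
)

# Get a validator bound to the batch and the suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="storm_events_silver"
)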

# Define expectations — same API as pandas
validator.expect_column_to_exist("event_id")
validator.expect_column_values_to_not_be_null("event_id")
validator.expect_column_values_to_not_be_null("event_date")
validator.expect_column_values_to_be_between("magnitude",
    min_value=0.0, max_value=12.0, mostly=0.99)
validator.expect_table_row_count_to_be_between(
    min_value=10000, max_value=50000000)

# Save the suite to the in-memory store
validator.save_expectation_suite(discard_failed_expectations=False)

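# Register a checkpoint in the context; run_checkpoint looks it up by name
context.add_checkpoint(
    name="storm_silver_checkpoint",
    config_version=1.0,
    class_name="SimpleCheckpoint"
)

# Validate the runtime batch against the saved suite
checkpoint_result = context.run_checkpoint(
    checkpoint_name="storm_silver_checkpoint",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "storm_events_silver"
    }]
)
print(f"Validation passed: {checkpoint_result.success}")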
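
The mostly=0.99 argument on the magnitude check deserves a note, since it decides whether a handful of outliers fails the run. As I understand the column-level expectations, nulls are set aside first, and the expectation passes when the share of the remaining values that satisfy the condition is at least mostly. Here is a minimal plain-Python sketch of that arithmetic. passes_mostly is a hypothetical helper of mine, not part of Great Expectations, and it mirrors the rule rather than the library's actual implementation:

```python
from typing import Iterable, Optional


def passes_mostly(
    values: Iterable[Optional[float]],
    min_value: float,
    max_value: float,
    mostly: float = 1.0,
) -> bool:
    """Hypothetical helper: mimic the 'mostly' rule for a between-check."""
    # Nulls are excluded before the ratio is computed.
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing left to check, so there is nothing to violate.
        return True
    in_range = sum(1 for v in non_null if min_value <= v <= max_value)
    return in_range / len(non_null) >= mostly


# 99 plausible magnitudes, one outlier, and a null that is ignored
sample = [5.0] * 99 + [40.0, None]
one_outlier_ok = passes_mostly(sample, 0.0, 12.0, mostly=0.99)

# Same threshold, but two outliers in a hundred non-null rows
two_outliers_ok = passes_mostly([5.0] * 98 + [40.0, 41.0], 0.0, 12.0, mostly=0.99)
```

With mostly=0.99, one bad row in a hundred passes and two do not. Leave mostly off for columns where a single violation should fail the run, which is how the null checks above are written.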

What Runs on Spark vs. What Doesn't

Not every GE expectation translates efficiently to Spark. Most column-level expectations (null checks, value ranges, value sets) push down cleanly to Spark operations. Expectations that require ordering or row-by-row comparison — like expect_column_values_to_be_increasing — may trigger a full sort that's expensive on large datasets.

The practical rule: test your expectation suite against a representative sample first and check the Spark query plan if a validation is taking longer than expected. A poorly chosen expectation on a billion-row Delta table can turn a 30-second validation into a 30-minute job.
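
To make the difference concrete, here is a toy model in plain Python, with lists standing in for Spark partitions. It is not Spark or Great Expectations code, and the function names are made up for illustration. A null count can be computed on each partition separately and summed, while an ordering check has to see neighboring rows across partition boundaries, which is why it forces far more data movement:

```python
from typing import List, Optional

Partition = List[Optional[float]]


def count_nulls(partitions: List[Partition]) -> int:
    # Each partition is counted on its own; the partial counts simply add up.
    return sum(sum(1 for v in part if v is None) for part in partitions)


def ordered_within_each_partition(partitions: List[List[float]]) -> bool:
    # Naive check: only compares neighbors inside a partition,
    # so it misses violations that straddle a partition boundary.
    return all(
        all(a <= b for a, b in zip(part, part[1:])) for part in partitions
    )


def ordered_globally(partitions: List[List[float]]) -> bool:
    # Correct check: every row has to be lined up in one sequence first.
    rows = [v for part in partitions for v in part]
    return all(a <= b for a, b in zip(rows, rows[1:]))


null_total = count_nulls([[1.0, None], [None, 3.0]])

# Each partition is sorted, but 3.0 is followed by 2.0 across the boundary
parts = [[1.0, 2.0, 3.0], [2.0, 5.0]]
local_answer = ordered_within_each_partition(parts)
global_answer = ordered_globally(parts)
```

The per-partition answer and the global answer disagree on the second dataset. That gap is the part Spark has to close by moving rows around, and it is the cost to look for in the query plan.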

Delta Lake and GE Together

Delta Lake already enforces schema — it won't let you write a string into an integer column. Great Expectations sits above that layer and enforces business rules that Delta schema enforcement can't express: value ranges, null policies, cardinality constraints, cross-column relationships. The two are complementary, not redundant. Delta guarantees your data fits; GE guarantees your data makes sense. As always, I'm here to help.

Shannon Lowder
