The last year has been the Databricks year for me. New client, new stack — Spark instead of pandas, Delta Lake instead of SQL Server, notebooks instead of SSDT. And the first question I had to answer was whether the data quality discipline I'd built around Great Expectations would carry over.
The short answer: yes, with some adaptation. Great Expectations supports Spark and can validate Delta Lake tables once they are loaded as Spark DataFrames, using a Datasource backed by the SparkDFExecutionEngine (the legacy API exposed this as SparkDFDatasource). The expectation API is the same; the execution engine changes.
Setting Up the Spark Datasource
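On Databricks, you need Great Expectations available on the cluster, either installed as a cluster library (via PyPI) or, as here, with %pip at the top of the notebook. Run the %pip line in its own cell, since it resets the notebook's Python state. Then configure a DataContext. In a Databricks notebook: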
%pip install great_expectations

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    InMemoryStoreBackendDefaults,
)

# Use an ephemeral in-memory context for notebook-based validation
context = BaseDataContext(
    project_config=DataContextConfig(
        store_backend_defaults=InMemoryStoreBackendDefaults()
    )
)

# Add a Spark datasource
context.add_datasource(
    name="spark_delta",
    class_name="Datasource",
    execution_engine={
        "class_name": "SparkDFExecutionEngine",
        "force_reuse_spark_context": True,
    },
    data_connectors={
        "runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],
        }
    },
)
Validating a Delta Table
# Read the Delta table as a Spark DataFrame
delta_df = spark.read.format("delta").load("/mnt/datalake/storm_events/silver")

# Create a batch request pointing at the DataFrame
batch_request = RuntimeBatchRequest(
    datasource_name="spark_delta",
    data_connector_name="runtime_connector",
    data_asset_name="storm_events_silver",
    runtime_parameters={"batch_data": delta_df},
    batch_identifiers={"run_id": "2021-03-08"},
)

# The suite has to exist before a validator can be attached to it
context.create_expectation_suite(
    "storm_events_silver", overwrite_existing=True
)

# Get a validator with an expectation suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="storm_events_silver",
)

# Define expectations (same API as pandas)
validator.expect_column_to_exist("event_id")
validator.expect_column_values_to_not_be_null("event_id")
validator.expect_column_values_to_not_be_null("event_date")
validator.expect_column_values_to_be_between(
    "magnitude", min_value=0.0, max_value=12.0, mostly=0.99
)
validator.expect_table_row_count_to_be_between(
    min_value=10000, max_value=50000000
)

# Save the suite to the in-memory store
validator.save_expectation_suite(discard_failed_expectations=False)

# Register the checkpoint first; run_checkpoint looks it up by name.
# SimpleCheckpoint may not exist in older GE releases, so check your version.
context.add_checkpoint(
    name="storm_silver_checkpoint",
    config_version=1,
    class_name="SimpleCheckpoint",
)

# Validate
checkpoint_result = context.run_checkpoint(
    checkpoint_name="storm_silver_checkpoint",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "storm_events_silver",
    }],
)
print(f"Validation passed: {checkpoint_result.success}")
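A single pass/fail flag doesn't say which rule broke. Here is a minimal sketch for pulling the failed expectations out of a validation result once it has been converted to a plain dictionary, for example with validator.validate().to_json_dict(). The helper name failed_expectations is mine, and the dictionary layout (results, expectation_config, expectation_type, kwargs, unexpected_percent) is assumed from the serialized result format, so check it against your GE version. The example dictionary is hand-written, not real validation output.

```python
from typing import Any, Dict, List


def failed_expectations(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one small summary dict per failed expectation.

    `result` is assumed to be the dictionary form of a GE validation
    result; missing keys are tolerated and come back as None.
    """
    failures = []
    for item in result.get("results", []):
        if item.get("success"):
            continue  # only failures are of interest here
        config = item.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        failures.append({
            "expectation": config.get("expectation_type"),
            "column": kwargs.get("column"),  # None for table-level checks
            "unexpected_percent": item.get("result", {}).get("unexpected_percent"),
        })
    return failures


# Hand-written example in the assumed layout (illustrative values only)
example = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_to_exist",
                "kwargs": {"column": "event_id"},
            },
            "result": {},
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "event_date"},
            },
            "result": {"unexpected_percent": 2.5},
        },
    ],
}

for failure in failed_expectations(example):
    print(failure)
```

In a notebook, the same helper could feed a dbutils.notebook.exit message or a job alert, so the failing column shows up in the run output instead of a bare False.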
What Runs on Spark vs. What Doesn't
Not every GE expectation translates efficiently to Spark. Most column-level expectations (null checks, value ranges, value sets) push down cleanly to Spark operations. Expectations that require ordering or row-by-row comparison — like expect_column_values_to_be_increasing — may trigger a full sort that's expensive on large datasets.
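One cheap guard is to scan a suite for order-dependent expectations before pointing it at a large table. The sketch below works on a plain list of expectation type names. The ORDER_DEPENDENT set is a hand-maintained assumption of mine: only expect_column_values_to_be_increasing is named above, and the decreasing variant is added by analogy. Extend it with whatever you find to be slow on your own data.

```python
from typing import Iterable, List

# Assumption: a hand-maintained list, not something GE publishes.
ORDER_DEPENDENT = frozenset({
    "expect_column_values_to_be_increasing",
    "expect_column_values_to_be_decreasing",
})


def flag_order_dependent(expectation_types: Iterable[str]) -> List[str]:
    """Return the expectation types that are likely to need ordered data."""
    return [name for name in expectation_types if name in ORDER_DEPENDENT]


# Illustrative suite contents; the first three match the expectations used above
suite_types = [
    "expect_column_to_exist",
    "expect_column_values_to_not_be_null",
    "expect_column_values_to_be_between",
    "expect_column_values_to_be_increasing",
]

print(flag_order_dependent(suite_types))
# → ['expect_column_values_to_be_increasing']
```

With a live validator, the type names can probably be read from the suite itself, along the lines of [e.expectation_type for e in validator.get_expectation_suite(discard_failed_expectations=False).expectations], though the exact accessor may differ between GE versions.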
The practical rule: test your expectation suite against a representative sample first and check the Spark query plan if a validation is taking longer than expected. A poorly chosen expectation on a billion-row Delta table can turn a 30-second validation into a 30-minute job.
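A minimal sketch of the sampling step, assuming a target of about a million rows is representative enough for a dry run. The helper names and the default target are my own choices. The code only relies on the DataFrame having a sample(withReplacement, fraction, seed) method, which PySpark DataFrames provide, so it imports nothing from Spark.

```python
def sample_fraction(total_rows: int, target_rows: int) -> float:
    """Fraction of the table needed to get roughly `target_rows` rows."""
    if total_rows <= 0 or target_rows <= 0:
        raise ValueError("total_rows and target_rows must be positive")
    return min(1.0, target_rows / total_rows)


def representative_sample(df, total_rows: int, target_rows: int = 1_000_000, seed: int = 42):
    """Return a random sample of `df`, or `df` itself if it is already small.

    `df` is expected to be a Spark DataFrame. Spark's sample() is
    approximate, so the row count will be close to the target, not exact.
    """
    fraction = sample_fraction(total_rows, target_rows)
    if fraction >= 1.0:
        return df  # small table: validate it as-is
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)


# A billion-row table sampled down to about a million rows
print(sample_fraction(1_000_000_000, 1_000_000))
```

In the notebook this might be used as sample_df = representative_sample(delta_df, total_rows=delta_df.count()), with sample_df passed as batch_data in the RuntimeBatchRequest. Calling sample_df.explain() prints the plan for the sampled read; the plans for the jobs GE itself triggers are easier to find in the Spark UI while the validation runs.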
Delta Lake and GE Together
Delta Lake already enforces schema — it won't let you write a string into an integer column. Great Expectations sits above that layer and enforces business rules that Delta schema enforcement can't express: value ranges, null policies, cardinality constraints, cross-column relationships. The two are complementary, not redundant. Delta guarantees your data fits; GE guarantees your data makes sense. As always, I'm here to help.