Great Expectations Profiler: Auto-Generating Expectations from Your Data

One of the friction points when adopting Great Expectations is the blank-page problem: you have a new dataset, you know you should write expectations for it, and you're not sure where to start. What ranges are reasonable for this column? What's the actual null rate? What values does this categorical column actually contain?

The GE Profiler exists to answer those questions automatically. It runs a statistical analysis of a sample dataset and generates an initial expectation suite based on what it observes. You get a starting point rather than a blank file.

How the Profiler Works

The Profiler samples your dataset (or a representative slice of it) and generates expectations based on observed statistics:

For numeric columns: expect_column_min_to_be_between, expect_column_max_to_be_between, and matching expectations on the mean, median, and quantiles, with bounds derived from the observed statistics (with a tolerance buffer)
For string columns with low cardinality: expect_column_values_to_be_in_set with the observed unique values
For columns with no nulls in the sample: expect_column_values_to_not_be_null
For the table: expect_table_columns_to_match_ordered_list (which covers column existence) and expect_table_row_count_to_be_between

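import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

context = ge.get_context()

# Get a batch to profile. silver_df is assumed to be a Spark DataFrame
# defined earlier. get_validator() expects a batch request object,
# not a plain dict, so build a RuntimeBatchRequest.
batch_request = RuntimeBatchRequest(
    datasource_name="spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="storm_events_silver",
    runtime_parameters={"batch_data": silver_df},
    batch_identifiers={"run_id": "profiling_run"},
)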

validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="storm_silver_profiled"
)

# Configure the profiler
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=["expect_column_quantile_values_to_be_between"],
    ignored_columns=["_ingestion_timestamp", "_source_file"],
    semantic_types_dict={
        "numeric": ["magnitude", "injuries_direct", "deaths_direct"],
        "value_set": ["event_type", "state", "event_status"]
    }
)

# build_suite() returns the generated expectation suite (a single object, not a tuple)
suite = profiler.build_suite()
context.save_expectation_suite(suite)
print(f"Generated {len(suite.expectations)} expectations")
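
To make the kind of rules listed above concrete, here is a plain-Python sketch that derives expectation-like dicts from a small sample. It is not how Great Expectations implements its profiler: the function name profile_sample, its parameters, and the sample rows are all made up for illustration, and numeric bounds are simplified to a single expect_column_values_to_be_between entry.

```python
def profile_sample(rows, max_cardinality=10, buffer=0.1, row_tolerance=0.1):
    """Derive expectation-like dicts from a list of row dicts (illustrative only)."""
    n = len(rows)
    # Table-level rule: row count within +/- row_tolerance of the sample size.
    expectations = [{
        "expectation_type": "expect_table_row_count_to_be_between",
        "kwargs": {"min_value": int(n * (1 - row_tolerance)),
                   "max_value": int(n * (1 + row_tolerance))},
    }]
    columns = list(rows[0]) if rows else []
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        # No nulls observed in the sample: assume the column is never null.
        if len(present) == n:
            expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": col},
            })
        if not present:
            continue
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            # Observed min/max, widened by a buffer proportional to the spread.
            lo, hi = min(present), max(present)
            pad = (hi - lo) * buffer
            expectations.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": col, "min_value": lo - pad, "max_value": hi + pad},
            })
        elif all(isinstance(v, str) for v in present):
            unique = sorted(set(present))
            # Low-cardinality strings: pin the observed value set.
            if len(unique) <= max_cardinality:
                expectations.append({
                    "expectation_type": "expect_column_values_to_be_in_set",
                    "kwargs": {"column": col, "value_set": unique},
                })
    return expectations


# Hypothetical two-row sample, just to exercise the rules.
sample = [
    {"event_type": "Hail", "magnitude": 1.0, "state": "TX"},
    {"event_type": "Tornado", "magnitude": 3.0, "state": None},
]
generated = profile_sample(sample)
for exp in generated:
    print(exp["expectation_type"], exp["kwargs"])
```

Note that state gets no not-null rule because the sample contains a null, while its value set is pinned to the single value that happened to appear. That is exactly the kind of sample-driven result that needs a human look.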

What the Profiler Gets Right and What It Doesn't

The Profiler is a starting point, not a finished product. What it gets right:

Column existence expectations — always correct
Null expectations on columns that genuinely can't be null — usually correct if you profile clean data
Value set expectations for stable categoricals — usually correct

What needs review:

Numeric ranges are derived from the sample. If your sample happens to include an outlier, the range will be wider than you want. If your sample is from a low-volume period, the range may be tighter than production reality.
Row count bounds are specific to the sample size and won't generalize to different-sized batches without adjustment.
Business rules the profiler has no way to know about — like "event end must be after event start" — won't appear at all.
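
The three review items above can be shown with a few lines of plain Python. This is a sketch under assumptions, not Great Expectations code: the helper names, the sample values, and the begin_time / end_time keys are hypothetical and chosen only to make each point visible.

```python
from datetime import datetime


def observed_range(values):
    """Min and max exactly as seen in the sample."""
    return min(values), max(values)


def rescale_row_bounds(min_value, max_value, sample_rows, expected_rows):
    """Scale row-count bounds from the profiled sample size to a production batch size."""
    factor = expected_rows / sample_rows
    return int(min_value * factor), int(max_value * factor)


def end_not_after_start(rows, start_key, end_key):
    """Rows that break the business rule 'end must be after start'."""
    return [r for r in rows if r[end_key] <= r[start_key]]


# 1. A single outlier in the sample stretches the range.
typical = [0.75, 1.0, 1.75, 2.5]
clean_range = observed_range(typical)
outlier_range = observed_range(typical + [75.0])
print(clean_range, outlier_range)

# 2. Bounds profiled on a 1,000-row sample do not fit a 50,000-row batch as-is.
prod_bounds = rescale_row_bounds(900, 1100, sample_rows=1000, expected_rows=50000)
print(prod_bounds)

# 3. A rule no statistic reveals: each event must end after it starts.
events = [
    {"begin_time": datetime(2024, 5, 1, 14, 0), "end_time": datetime(2024, 5, 1, 15, 30)},
    {"begin_time": datetime(2024, 5, 2, 9, 0), "end_time": datetime(2024, 5, 2, 8, 45)},
]
violations = end_not_after_start(events, "begin_time", "end_time")
print(len(violations))
```

None of this replaces the generated suite. It only shows why each of these three items deserves a manual pass before the suite goes anywhere near production.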

The Workflow

Profile a representative sample → get an initial suite
Review each generated expectation: delete ones that are wrong, tighten ranges that are too loose, and widen ranges that are too tight
Add business rule expectations the profiler can't derive
Commit the final suite to source control
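
The review steps above can be sketched as a small script that works on a suite exported as a plain dict. Treat the details as assumptions: the dict shape mirrors the JSON that Great Expectations writes for a suite (a suite name plus a list of expectation_type / kwargs entries), the helper review_suite is hypothetical, the column names begin_time and end_time are invented, and the column-pair expectation name should be checked against the version you have installed.

```python
import json


def review_suite(suite, drop=(), overrides=None, extra=()):
    """Return a reviewed copy of a suite dict without mutating the original.

    drop:      (expectation_type, column) pairs to delete
    overrides: {(expectation_type, column): kwargs to overwrite}
    extra:     hand-written expectations to append
    """
    overrides = overrides or {}
    drop = set(drop)
    reviewed = []
    for exp in suite["expectations"]:
        key = (exp["expectation_type"], exp["kwargs"].get("column"))
        if key in drop:
            continue  # judged wrong during review
        kwargs = {**exp["kwargs"], **overrides.get(key, {})}
        reviewed.append({**exp, "kwargs": kwargs})
    reviewed.extend(extra)
    return {**suite, "expectations": reviewed}


# Hypothetical profiler output, already exported to a dict.
profiled = {
    "expectation_suite_name": "storm_silver_profiled",
    "expectations": [
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "magnitude", "min_value": 0, "max_value": 500}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "injuries_direct"}},
    ],
}

# Assumed name and kwargs for a column-pair rule; verify before relying on it.
business_rule = {
    "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
    "kwargs": {"column_A": "end_time", "column_B": "begin_time"},
}

final = review_suite(
    profiled,
    drop=[("expect_column_values_to_not_be_null", "injuries_direct")],
    overrides={("expect_column_values_to_be_between", "magnitude"): {"max_value": 10}},
    extra=[business_rule],
)

# Sorted keys and fixed indentation keep diffs small in source control.
serialized = json.dumps(final, indent=2, sort_keys=True)
print(serialized)
```

Writing serialized to a file inside your repository is the last step. Stable key order means a later review shows only the expectations that actually changed.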

The Profiler cuts the blank-page time from about an hour to about 10 minutes, and those 10 minutes are spent applying domain knowledge to the generated suite. That's a good division of labor: let the tool do the statistical work, then apply judgment to the result. As always, I'm here to help.

Shannon Lowder
