Great Expectations Profiler: Auto-Generating Expectations from Your Data

One of the friction points when adopting Great Expectations is the blank-page problem: you have a new dataset, you know you should write expectations for it, and you're not sure where to start. What ranges are reasonable for this column? What's the actual null rate? What values does this categorical column actually contain?

The GE Profiler exists to answer those questions automatically. It runs a statistical analysis of a sample dataset and generates an initial expectation suite based on what it observes. You get a starting point rather than a blank file.

How the Profiler Works

The Profiler samples your dataset (or a representative slice of it) and generates expectations based on observed statistics. The exact set depends on your GE version and profiler configuration, but it typically includes:

  • For numeric columns: expect_column_values_to_be_between with min/max derived from observed quantiles (with a tolerance buffer)
  • For string columns with low cardinality: expect_column_values_to_be_in_set with the observed unique values
  • For columns with no nulls in the sample: expect_column_values_to_not_be_null
  • For all columns: expect_column_to_exist
  • For the table: expect_table_row_count_to_be_between
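
A typical profiling run looks like this:

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

context = ge.get_context()

# Get a batch to profile. silver_df is assumed to be an existing Spark DataFrame,
# and the datasource/connector names must match your project's configuration.
# get_validator expects a batch request object rather than a plain dict.
batch_request = RuntimeBatchRequest(
    datasource_name="spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="storm_events_silver",
    runtime_parameters={"batch_data": silver_df},
    batch_identifiers={"run_id": "profiling_run"},
)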

validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="storm_silver_profiled"
)

# Configure the profiler
profiler = UserConfigurableProfiler(
    profile_dataset=validator,
    excluded_expectations=["expect_column_quantile_values_to_be_between"],
    ignored_columns=["_ingestion_timestamp", "_source_file"],
    semantic_types_dict={
        "numeric": ["magnitude", "injuries_direct", "deaths_direct"],
        "value_set": ["event_type", "state", "event_status"]
    }
)

# build_suite() returns the generated ExpectationSuite
suite = profiler.build_suite()
context.save_expectation_suite(suite)
print(f"Generated {len(suite.expectations)} expectations")
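
To make the derivation rules listed above more concrete, here is a minimal pure-Python sketch of the idea. It is not GE's implementation: the function name profile_column, the max_categories threshold, and the 10% buffer are all assumptions chosen for illustration, and it pads the observed min/max rather than using quantiles.

```python
def profile_column(values, max_categories=10, buffer=0.1):
    """Derive candidate expectations for one column from sampled values.

    Illustrative only: thresholds and rules are assumptions, not GE internals.
    """
    expectations = []
    non_null = [v for v in values if v is not None]

    # No nulls observed in the sample -> propose a not-null expectation.
    if len(non_null) == len(values):
        expectations.append(("expect_column_values_to_not_be_null", {}))

    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        # Numeric column: pad the observed min/max by a relative buffer.
        lo, hi = min(non_null), max(non_null)
        pad = (hi - lo) * buffer
        expectations.append(
            ("expect_column_values_to_be_between",
             {"min_value": lo - pad, "max_value": hi + pad})
        )
    elif non_null and len(set(non_null)) <= max_categories:
        # Low-cardinality column: propose the observed values as the allowed set.
        expectations.append(
            ("expect_column_values_to_be_in_set",
             {"value_set": sorted(set(non_null))})
        )
    return expectations


# Hypothetical sample values for two columns
print(profile_column([0.5, 1.0, 2.5, 4.0]))
print(profile_column(["Hail", "Tornado", "Hail", None]))
```

Everything the sketch proposes comes from the sample alone, so the resulting suite can only be as representative as the data that was profiled.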

What the Profiler Gets Right and What It Doesn't

The Profiler is a starting point, not a finished product. What it gets right:

  • Column existence expectations — always correct
  • Null expectations on columns that genuinely can't be null — usually correct if you profile clean data
  • Value set expectations for stable categoricals — usually correct

What needs review:

  • Numeric ranges are derived from the sample. If your sample happens to include an outlier, the range will be wider than you want. If your sample is from a low-volume period, the range may be tighter than production reality.
  • Row count bounds are specific to the sample size and won't generalize to different-sized batches without adjustment.
  • Business rules the profiler has no way to know about — like "event end must be after event start" — won't appear at all.
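
The outlier problem is easy to reproduce. The sketch below uses made-up numbers (99 plausible readings plus one bad record) and two hypothetical helper functions to compare a range taken from the raw extremes with one taken from the 1st and 99th percentiles. Percentile bounds are one possible way to tighten a range during review, not something the profiler is claimed to do for you.

```python
from statistics import quantiles


def minmax_range(values):
    """Range from the raw extremes: a single outlier sets the bound."""
    return min(values), max(values)


def percentile_range(values, lower_pct=1, upper_pct=99):
    """Range from percentiles: values beyond the cut points are ignored."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


# Hypothetical sample: 99 plausible readings plus one bad record.
sample = [float(i) for i in range(1, 100)] + [500.0]

wide = minmax_range(sample)
tight = percentile_range(sample)
print("min/max:", wide)
print("p1/p99: ", tight)
```

The min/max upper bound is set entirely by the single bad record, while the percentile bound stays close to the bulk of the data. Which one is right for a given column is a judgment call that depends on whether the extreme value is legitimate.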

The Workflow

  1. Profile a representative sample → get an initial suite
  2. Review each generated expectation: delete the ones that are wrong, tighten ranges that are too loose, and widen ranges that are too tight for production data
  3. Add business rule expectations the profiler can't derive
  4. Commit the final suite to source control
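
Steps 2 through 4 can be sketched on plain dictionaries shaped like the expectation_type/kwargs entries of a serialized suite. The generated list, the review helper, the tightened bounds, and the event_end/event_start column names are all hypothetical. The column-pair expectation name is spelled as in older GE releases (newer ones lowercase the A and B), so check it against your installed version.

```python
import json

# Hypothetical profiler output, as expectation_type/kwargs dicts.
generated = [
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "magnitude"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "magnitude", "min_value": -5.0, "max_value": 500.0}},
    {"expectation_type": "expect_table_row_count_to_be_between",
     "kwargs": {"min_value": 950, "max_value": 1050}},
]


def review(expectations, drop_types, overrides):
    """Step 2: drop unwanted expectation types, override kwargs by (type, column)."""
    kept = []
    for exp in expectations:
        if exp["expectation_type"] in drop_types:
            continue
        key = (exp["expectation_type"], exp["kwargs"].get("column"))
        kwargs = {**exp["kwargs"], **overrides.get(key, {})}
        kept.append({"expectation_type": exp["expectation_type"], "kwargs": kwargs})
    return kept


final = review(
    generated,
    drop_types={"expect_table_row_count_to_be_between"},  # sample-specific bounds
    overrides={
        ("expect_column_values_to_be_between", "magnitude"):
            {"min_value": 0.0, "max_value": 10.0},
    },
)

# Step 3: add a business rule the profiler cannot infer (column names assumed).
final.append({
    "expectation_type": "expect_column_pair_values_A_to_be_greater_than_B",
    "kwargs": {"column_A": "event_end", "column_B": "event_start"},
})

# Step 4: serialize deterministically so diffs in source control stay readable.
suite_json = json.dumps(
    {"expectation_suite_name": "storm_silver_profiled", "expectations": final},
    indent=2,
    sort_keys=True,
)
print(suite_json)
```

Keeping the review decisions in code rather than hand-editing the JSON means they can be re-applied after the next profiling run.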

The Profiler can cut the blank-page time from roughly an hour to about 10 minutes, and those 10 minutes are spent applying domain knowledge to the generated suite. That's a good division of labor: let the tool do the statistical work, then apply judgment to the result.
