Consumer Group Lag: The Metric That Tells You When Your Streaming Pipeline Is in Trouble

Consumer group lag is the number of messages between the consumer group's last committed offset and the current end of the partition (the log end offset). Zero lag means you are caught up: every message that has been published has been consumed. Rising lag means you are falling behind. Lag at or near the retention window means you are about to start losing messages permanently.

It is the single most important operational metric for a Kafka-based streaming pipeline, and it is frequently the last thing people set up monitoring for.

Understanding Lag Per Partition

Lag is reported per consumer group per partition. A topic with 8 partitions has 8 independent lag values. Overall pipeline health is the sum, but the per-partition view tells you whether the problem is a single hot partition (partition skew, bad partition key design) or a global slowdown affecting all partitions equally.

# Kafka CLI: check consumer group lag
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker-01:9092 \
  --group sensor-raw-ingest \
  --describe

# Sample output:
# GROUP               TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# sensor-raw-ingest   sensor_readings    0          12483920        12483920        0
# sensor-raw-ingest   sensor_readings    1          12401883        12403201        1318
# sensor-raw-ingest   sensor_readings    2          12389447        12389447        0
# sensor-raw-ingest   sensor_readings    3          12501203        12534921        33718

# Partition 3 has 33,718 messages of lag; if that number keeps growing between runs, something is wrong

Uneven lag across partitions (partitions 1 and 3 lagging, partitions 0 and 2 at zero) usually means one of two things: partition skew (a bad partition key is sending a disproportionate share of messages to certain partitions) or a consumer that processes some partitions more slowly than others (thread imbalance, or a slow UDF applied to certain record types).
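One way to tell a hot partition from a global slowdown is to compare the worst partition against the median. The sketch below is illustrative only: the function name classify_lag and both thresholds (skew_ratio, min_lag) are assumptions, not values from any Kafka tooling, and should be tuned to your own traffic.

```python
from statistics import median

def classify_lag(lag_by_partition, skew_ratio=10.0, min_lag=1000):
    """Classify a {partition: lag} mapping as 'healthy', 'skewed', or 'global'.

    skew_ratio and min_lag are illustrative thresholds, not recommended values.
    """
    lags = list(lag_by_partition.values())
    if not lags or max(lags) < min_lag:
        return "healthy"
    worst = max(lags)
    typical = median(lags)
    # Worst partition far above the median: a few partitions are the problem.
    if worst > skew_ratio * max(typical, 1):
        return "skewed"
    # Otherwise every partition is lagging by a similar amount.
    return "global"

# Lag values taken from the sample CLI output above
sample = {0: 0, 1: 1318, 2: 0, 3: 33718}
print(classify_lag(sample))
```

With the sample output, partition 3 sits far above the median, so the result points at skew rather than a cluster-wide slowdown.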

Monitoring Lag in Databricks Structured Streaming

Databricks exposes streaming query progress as a JSON object accessible from the query object or from the Spark UI. For Kafka sources, the progress includes input rows per second, processed rows per second, and the Kafka offset ranges of the most recent batch — from which you can compute lag.

# Access streaming query progress metrics
query = (
    raw_landed.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="5 minutes")
    .start(output_path)
)

# After a few batches, check progress (lastProgress is None until the first batch completes)
progress = query.lastProgress
if progress is not None:
    print(f"Input rows/sec: {progress['inputRowsPerSecond']}")
    print(f"Processed rows/sec: {progress['processedRowsPerSecond']}")

    # The key ratio: if processed < input, you are falling behind
    if progress['processedRowsPerSecond'] < progress['inputRowsPerSecond'] * 0.9:
        print("WARNING: pipeline is not keeping up with input rate")
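The rate ratio is a proxy. The offsets in the same progress object give lag directly. The sketch below assumes the progress is dict-like and that each Kafka source entry carries endOffset and latestOffset as nested {topic: {partition: offset}} mappings, which is how recent Spark 3.x versions report them; verify the field names against your runtime. The helper name kafka_lag_from_progress and the sample dictionary are hypothetical.

```python
def kafka_lag_from_progress(progress):
    """Return {(topic, partition): lag} computed as latestOffset - endOffset.

    Assumes a dict-like progress with 'sources' entries shaped like a
    Kafka source's progress; entries without both offset maps are skipped.
    """
    lag = {}
    for source in progress.get("sources", []):
        end = source.get("endOffset")
        latest = source.get("latestOffset")
        if not isinstance(end, dict) or not isinstance(latest, dict):
            continue
        for topic, partitions in latest.items():
            for partition, latest_offset in partitions.items():
                consumed = end.get(topic, {}).get(partition)
                if consumed is not None:
                    lag[(topic, int(partition))] = max(latest_offset - consumed, 0)
    return lag

# Hand-written sample shaped like a progress dict (not captured from a real run)
sample_progress = {
    "sources": [{
        "endOffset": {"sensor_readings": {"0": 12483920, "3": 12501203}},
        "latestOffset": {"sensor_readings": {"0": 12483920, "3": 12534921}},
    }]
}
print(kafka_lag_from_progress(sample_progress))
```

In a running job you would pass query.lastProgress (after checking it is not None) instead of the sample dictionary.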

For production monitoring, export streaming query metrics to a monitoring system. Older Databricks runtimes expose cluster metrics through Ganglia (newer runtimes replace it with built-in compute metrics), and streaming metrics can be pushed to Azure Monitor or Datadog, for example from a StreamingQueryListener. Set alerts on two thresholds: lag growing for more than N consecutive intervals (early warning), and lag exceeding X% of the retention window (critical: data loss is imminent).

Lag Thresholds That Actually Matter

Two thresholds worth alerting on:

Sustained growth threshold. Lag growing for three consecutive intervals (whatever your monitoring interval is). This is the early warning. The pipeline might catch up on its own if the source had a traffic spike, but sustained growth means the consumer is not keeping pace with the source rate. Investigate before it gets worse.
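A minimal sketch of the sustained growth check, assuming you already collect one total-lag sample per monitoring interval. The function name lag_growing is hypothetical, and the default of three intervals simply mirrors the threshold described above.

```python
def lag_growing(samples, intervals=3):
    """True if lag rose strictly across each of the last `intervals` intervals.

    `samples` is a chronological list of lag readings, one per monitoring
    interval; `intervals` consecutive increases need intervals + 1 readings.
    """
    if len(samples) < intervals + 1:
        return False
    recent = samples[-(intervals + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

print(lag_growing([100, 200, 400, 900]))   # three consecutive increases
print(lag_growing([100, 200, 150, 300]))   # growth interrupted by a dip
```

A strict comparison keeps a flat but nonzero lag from firing the early warning; whether flat lag deserves its own alert is a separate decision.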

Retention proximity threshold. Lag in time (lag in messages divided by the rate at which messages are being produced to the topic) approaching 50% of your topic retention window. If your topic retains 7 days and your lag represents 3.5 days of messages, the oldest unconsumed message is 3.5 days from deletion, so you may have as little as 3.5 days to fix the problem before you start permanently losing data if the consumer stalls completely. This is the critical alert.
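The arithmetic can be sketched as follows. The function names, the 100 messages per second produce rate, and the lag figure are made-up illustrative values chosen to reproduce the 7-day, 3.5-day example; only the 50% threshold comes from the text above.

```python
def lag_in_seconds(lag_messages, produce_rate_per_sec):
    """Convert lag in messages to lag in time using the topic's produce rate."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce rate must be positive")
    return lag_messages / produce_rate_per_sec

def retention_fraction(lag_messages, produce_rate_per_sec, retention_seconds):
    """Fraction of the retention window that the current lag represents."""
    return lag_in_seconds(lag_messages, produce_rate_per_sec) / retention_seconds

RETENTION_SECONDS = 7 * 24 * 3600   # 7-day topic retention
CRITICAL_FRACTION = 0.5             # alert at 50% of the retention window

# Illustrative numbers: 30,240,000 messages behind at 100 messages/sec
fraction = retention_fraction(30_240_000, 100, RETENTION_SECONDS)
print(f"Lag is {fraction:.0%} of retention")
if fraction >= CRITICAL_FRACTION:
    print("CRITICAL: lag has reached the retention proximity threshold")
```

Use a produce rate averaged over a window long enough to smooth out bursts; an instantaneous rate during a spike will understate the lag in time.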

Lag as an Architecture Signal

Sustained high lag that does not respond to adding cluster resources is the signal I described in the previous post about monolithic consumers: you have an architecture problem, not a capacity problem. Each message is too expensive to process, and horizontal scaling is not fixing it because the bottleneck is per-record CPU, not cluster throughput.

If you have been running a monolithic pipeline and watching lag climb despite adding workers, this is the moment to consider splitting into separate ingest, deserialize, and merge jobs. The raw ingest job almost never has lag problems — it does so little per message that it can keep up with almost any topic throughput. Get the ingest right first; then optimize deserialize and merge independently. I am here to help.
