1,000 Kafka Topics, One Databricks Job, Zero Success: A Case Study in Streaming Architecture

A client came to me with a straightforward-sounding problem. They had a Kafka publisher producing over 1,000 topics — one per data source in their manufacturing network — and they needed to ingest all of them into their Databricks environment and make the data available in gold layer Delta tables for analytics. The architecture they proposed: a single Databricks Structured Streaming job that subscribed to all 1,000 topics, deserialized messages inline, computed diffs against the current gold table state, and merged.

Simple, right?

It was not. And the path from "why isn't this working?" to "here is what actually works" is worth documenting in detail.

The Setup

The publisher was a SCADA system feeding telemetry from over 1,000 individual monitoring points — pumps, valves, meters, sensors — each publishing to its own Kafka topic at varying rates. Some topics produced 10 messages per second. Some produced 1 message per minute. The aggregate throughput was substantial but not extraordinary: roughly 200,000 messages per minute at peak.

The Databricks cluster was configured on the most powerful general-purpose instance type available in 2022: Standard_L32s_v2 nodes (32 vCores, 256 GB RAM each). They started with 4 workers, which gave them 128 vCores and 1 TB of RAM. More than enough, by most measures, for 200,000 messages per minute.

What Happened

The job started. Consumer group lag immediately began rising. They added workers — 8, 12, 16. Lag kept rising. At 20 workers (640 vCores), they were barely keeping pace with the incoming message rate. During peak hours, they fell behind. During off-peak hours, they caught up slightly, then fell behind again when peak resumed.

CPU utilization was pegged at 85–95% across all workers, constantly. Memory was fine. Network was fine. Disk I/O was elevated but not the bottleneck. The cluster was pure CPU bound.

The monitoring told the story clearly. 200,000 messages per minute at 640 vCores should be trivially easy — that is 311 messages per core per minute, or about 5 messages per core per second. Even the most complex deserialization and merge logic should not take 200 milliseconds per message per core. But it was.

The Root Cause

The monolithic job was doing three expensive operations on every message in a single pass:

  1. Deserialization: each message was an Avro record. The Avro deserializer was doing a Schema Registry lookup (with caching, but still) plus deserialization on every record.
  2. Schema union computation: subscribing to 1,000+ topics meant the job was trying to handle 1,000+ different schemas in a single DataFrame operation. Spark was computing the schema union on every micro-batch — an O(n) operation over 1,000 schemas.
  3. Merge against the gold table: every micro-batch triggered a MERGE against a gold Delta table that now had 1,000+ distinct entity types. The merge predicate was complex. The shuffle was enormous.

Each individual operation was reasonable. The combination of all three, at 1,000x scale, for every message, was not. The CPU cost was multiplicative, not additive.

Adding more workers made each worker responsible for fewer partitions — fewer topics per core — but the per-message CPU cost did not decrease. You cannot scale out of a fundamental per-record overhead problem by adding more of the same units that are doing the expensive work.

The Numbers

At 20 workers (640 vCores), the cluster cost approximately $8,400/month on on-demand instances plus Databricks DBUs. They were spending over $100,000 per year on a pipeline that was still struggling to keep pace. The architecture was not just slow — it was expensive to be slow.

The next post covers the fix. The short version: three separate jobs, each doing one thing, on clusters sized for that one thing's actual resource requirement. The total cost of the fixed architecture: under $40,000 per year. Same data, same latency, 80% CPU reduction per stage, total cost less than half. I am here to help if you are staring at a similar problem today.

Read more