I have written about MQTT, Kafka, and Delta Lake separately. This post is the end-to-end: what the full IoT ingestion architecture looks like when you wire them together, what each hop adds, and why the complexity is justified.
The scenario: sensors in a manufacturing facility publish readings via MQTT. The target: a queryable Delta Lake table in Databricks, updated every 15 minutes, with full history available for trending and anomaly detection. Somewhere in between, we need durability, schema management, and the ability to recover from any component failure without data loss.
Hop 1: Sensor to MQTT Broker
The sensors speak MQTT — this is not negotiable, it is what the hardware vendor provides. They publish to topics like plant/line3/press01/temperature at QoS 1 (at-least-once). The MQTT broker (Mosquitto or Azure IoT Hub, depending on whether you are self-hosted or cloud-native) receives the messages and holds them until a subscriber picks them up.
What this hop adds: decoupling the sensors from your data infrastructure. Sensors do not care what happens after they publish. They do not know Kafka exists. You can change your entire downstream pipeline without touching the sensors.
What this hop does not add: durability beyond what the broker holds. MQTT brokers are not designed for long-term retention. They hold messages until a subscriber consumes them, plus a configurable persistent session window.
Hop 2: MQTT Broker to Kafka
The MQTT-to-Kafka bridge (Kafka Connect MQTT source connector, or a custom Python subscriber-producer as described earlier in this series) subscribes to the MQTT broker and publishes each message to a Kafka topic as it arrives.
# Kafka Connect: MQTT to Kafka bridge configuration
{
"name": "plant-sensors-mqtt-source",
"config": {
"connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
"mqtt.server.uri": "tcp://mqtt-broker.plant.local:1883",
"mqtt.topics": "plant/#",
"kafka.topic": "plant_sensor_readings_raw",
"mqtt.qos": "1",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"transforms": "AddTimestamp",
"transforms.AddTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.AddTimestamp.timestamp.field": "bridge_ts_ms"
}
}
What this hop adds: durability. Kafka retains messages for the configured retention window (14 days in this architecture) regardless of whether downstream consumers are running. If the Databricks pipeline goes down for a weekend, the messages are still in Kafka waiting. The MQTT broker can be restarted, the bridge can fail and recover, and the Kafka topic holds the authoritative record.
What this hop also adds: fanout. Multiple consumer groups can read the same Kafka topic independently — the analytics pipeline, a real-time alerting system, a separate archive process — without interfering with each other.
Hop 3: Kafka to Raw Delta (Databricks Job 1)
The Databricks raw ingest job reads from the Kafka topic and writes to a Delta table in ADLS. Raw payload stored as a string, Kafka metadata preserved (topic, partition, offset, Kafka timestamp). No parsing.
What this hop adds: queryable, long-term raw storage. The Kafka topic retains 14 days. The raw Delta table retains indefinitely (subject to your storage budget). This is where you can go back 90 days to investigate an anomaly that you did not know to look for until today.
Hop 4: Raw Delta to Bronze Delta (Databricks Job 2)
The deserialize job reads from raw Delta, parses the MQTT payload JSON, validates the sensor reading values, and writes structured records to the bronze Delta table. Bad records go to a dead letter table.
# MQTT payload for plant sensor reading
# {"sensor_id": "press01", "measurement": "temperature", "value": 87.3, "unit": "C", "ts": 1625000000}
SENSOR_SCHEMA = StructType([
StructField("sensor_id", StringType(), nullable=False),
StructField("measurement", StringType(), nullable=False),
StructField("value", DoubleType(), nullable=False),
StructField("unit", StringType(), nullable=False),
StructField("ts", LongType(), nullable=False) # Unix epoch seconds
])
# Parse and extract the original MQTT topic from the Kafka message key
parsed = raw_df .withColumn("mqtt_topic", col("topic")) .withColumn("parsed", from_json(col("raw_payload"), SENSOR_SCHEMA)) .filter(col("parsed").isNotNull())
Hop 5: Bronze Delta to Gold Delta (Databricks Job 3)
The merge job reads from bronze, deduplicates (sensors occasionally send duplicate readings at QoS 1), and merges into the gold table keyed on sensor_id + measurement + ts. The gold table is the queryable endpoint for reporting and anomaly detection.
Why Each Hop Is Justified
Every component in this chain adds something distinct:
- MQTT broker: device protocol abstraction
- Kafka: durable, replayable buffer with fanout
- Raw Delta: long-term raw storage with time travel
- Bronze Delta: validated, structured, with dead letter tracking
- Gold Delta: queryable, deduplicated, merge-ready serving layer
Remove any one of these hops and you lose one of these properties. Remove Kafka and your MQTT broker becomes a single point of failure for durability. Skip the raw zone and you lose the ability to reprocess from original messages. Skip the bronze layer and schema bugs go directly to gold. Each hop is earning its complexity.
I have run versions of this architecture for multiple manufacturing and industrial clients. The component list looks intimidating up front. Three months in, it is the last time anyone questions whether the complexity was worth it. As always, I am here to help.