Compound AI: The Architecture Behind Real GenAI Systems
One term that kept surfacing at Data + AI Summit (DAIS) 2024 was "compound AI systems." Databricks' research arm has been writing about the concept, and it ran through several keynote segments. The idea is that the AI applications enterprises actually need aren't single model calls but pipelines of models, retrievers, validators, and tools chained together.
This is an architectural pattern I've been thinking about a lot, and the Databricks framing is worth pulling apart.
Why a Single Model Call Isn't Enough
The simple mental model for a GenAI application is: send a prompt, get a response. That works for demos. It doesn't work for production enterprise AI applications for a straightforward reason: the tasks enterprises need to automate are too complex, too conditional, and too subject to domain-specific constraints for a single model call to handle reliably.
A customer support AI that answers questions about account balances, policy terms, and billing disputes isn't one model — it's a classifier that routes the question, a retriever that fetches the relevant policy documents, a model that synthesizes an answer from those documents, and a validator that checks whether the answer cites a real policy provision. Each step has its own failure mode. Each step can be tested and improved independently.
Compound AI systems formalize this pattern. The system is a graph of components: models, retrievers, validators, tools, guardrails. The sophistication lives in the architecture, not in a single model's parameter count.
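To make the support example concrete, here is a minimal sketch of that four-step graph in Python. Everything in it is an assumption for illustration: the function names, the keyword-based router, and the tiny in-memory POLICY_DOCS store all stand in for real models and a real retrieval index.

```python
from dataclasses import dataclass

# Hypothetical in-memory policy store; a real system would query a search index.
POLICY_DOCS = {
    "billing": ["Section 4.2: Disputed charges are reviewed within 10 business days."],
    "policy": ["Section 2.1: Coverage begins on the effective date."],
}


@dataclass
class Answer:
    text: str
    citations: list
    valid: bool


def classify(question: str) -> str:
    """Route the question to a topic. Stand-in for a classifier model."""
    q = question.lower()
    return "billing" if ("charge" in q or "bill" in q) else "policy"


def retrieve(topic: str) -> list:
    """Fetch the documents for a topic. Stand-in for a retriever."""
    return POLICY_DOCS.get(topic, [])


def synthesize(question: str, docs: list) -> tuple:
    """Build an answer from the retrieved documents. Stand-in for an LLM call."""
    if not docs:
        return "I could not find a relevant policy.", []
    section, body = docs[0].split(": ", 1)
    return f"According to {section}: {body}", [section]


def validate(citations: list, docs: list) -> bool:
    """Accept the answer only if every cited section exists in the retrieved docs."""
    known = {doc.split(":")[0] for doc in docs}
    return bool(citations) and all(c in known for c in citations)


def answer_question(question: str) -> Answer:
    """Wire the four components together; each one can be tested on its own."""
    topic = classify(question)
    docs = retrieve(topic)
    text, citations = synthesize(question, docs)
    return Answer(text, citations, validate(citations, docs))
```

Calling answer_question("Why was I charged twice on my bill?") runs all four steps in order. Because each step is a plain function, the validator can be tested against a fabricated citation without touching the model step at all.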
How This Aligns With Data Engineering Instincts
If you've been building data pipelines for any length of time, this pattern is intuitive. Pipelines are composed of steps. Each step has inputs, outputs, and testable behavior. Failures are isolated. Observability is built in at the step level. The fact that some of the steps are now LLM calls rather than SQL transformations doesn't fundamentally change the architectural discipline.
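A rough sketch of what step-level isolation and observability can look like, regardless of whether a step is a SQL transformation or an LLM call. The StepTrace record and run_pipeline helper are hypothetical names, not part of any framework mentioned here.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepTrace:
    name: str
    ok: bool
    seconds: float
    error: Optional[str] = None


def run_pipeline(steps: list, value: Any) -> tuple:
    """Run (name, function) steps in order, recording one trace per step.

    Stops at the first failing step so the failure is attributed to that
    step alone instead of surfacing somewhere downstream.
    """
    traces = []
    for name, fn in steps:
        start = time.perf_counter()
        try:
            value = fn(value)
        except Exception as exc:
            traces.append(StepTrace(name, False, time.perf_counter() - start, repr(exc)))
            return None, traces
        traces.append(StepTrace(name, True, time.perf_counter() - start))
    return value, traces


def normalize(text: str) -> str:
    return text.strip().lower()


def broken_step(text: str) -> str:
    raise ValueError("simulated model timeout")


result, traces = run_pipeline([("normalize", normalize), ("llm_call", broken_step)], "  Hello ")
for trace in traces:
    print(trace.name, "ok" if trace.ok else f"failed: {trace.error}")
```

The trace list is the part that carries over from data engineering: it records which step failed and how long each step took, which is the minimum needed to debug a multi-step system.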
What it does change is the failure mode vocabulary. A SQL transformation that's wrong produces wrong data consistently — you can detect it, measure it, debug it. An LLM step that's wrong produces wrong data probabilistically and in ways that aren't always detectable with a simple row count. That requires a different monitoring approach: output sampling, factual consistency checks, human review loops.
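As a sketch of how those three monitoring pieces fit together, assume each logged record is a dict with an "answer" and the "source" text it was generated from. The consistency check below is deliberately naive (it only flags numbers that appear in the answer but not in the source), and all function names are hypothetical. A production check would more likely use an entailment model or an LLM judge.

```python
import random
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def sample_outputs(records: list, rate: float, seed: int = 0) -> list:
    """Output sampling: keep roughly `rate` of the records, reproducibly."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]


def unsupported_numbers(answer: str, source: str) -> set:
    """Naive factual consistency check: numbers in the answer that the source never mentions."""
    return set(NUMBER.findall(answer)) - set(NUMBER.findall(source))


def review_queue(records: list) -> list:
    """Human review loop: collect the records that fail the consistency check."""
    return [r for r in records if unsupported_numbers(r["answer"], r["source"])]


logged = [
    {"answer": "Disputes are reviewed within 10 business days.",
     "source": "Section 4.2: Disputed charges are reviewed within 10 business days."},
    {"answer": "Disputes are reviewed within 5 business days.",
     "source": "Section 4.2: Disputed charges are reviewed within 10 business days."},
]

flagged = review_queue(sample_outputs(logged, rate=1.0))
print(len(flagged), "record(s) flagged for human review")
```

Both records have the same shape and would pass any row-count or schema check. Only a content-level comparison against the source separates them.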
Questions I Still Have About Databricks' Implementation
Mosaic AI Agent Framework was demoed in early preview, and the vision is compelling: define agent components in code, wire them together, track inputs and outputs through MLflow, govern the whole thing through Unity Catalog. In principle, this is exactly the right architecture.
What I'm watching for: how opinionated is the framework about component interfaces? The value of a framework for compound AI systems comes from standardization — if every agent component follows the same interface contract, you can swap out the retriever or the model without rewriting the orchestration logic. If the framework is too flexible (anything goes), you get the same fragmentation problem you started with, just at a higher level of abstraction.
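To illustrate what I mean by an interface contract, here is a sketch using Python's typing.Protocol. This is not the Mosaic AI Agent Framework API, which I haven't seen in enough detail to reproduce. The Retriever and Generator protocols and the classes implementing them are hypothetical.

```python
from typing import Protocol


class Retriever(Protocol):
    def retrieve(self, query: str) -> list: ...


class Generator(Protocol):
    def generate(self, query: str, context: list) -> str: ...


class KeywordRetriever:
    """Returns documents that share at least one word with the query."""

    def __init__(self, docs: list):
        self.docs = docs

    def retrieve(self, query: str) -> list:
        words = set(query.lower().split())
        return [d for d in self.docs if words & set(d.lower().split())]


class StaticRetriever:
    """Returns every document regardless of the query."""

    def __init__(self, docs: list):
        self.docs = docs

    def retrieve(self, query: str) -> list:
        return list(self.docs)


class EchoGenerator:
    """Stand-in for a model call; reports how much context it received."""

    def generate(self, query: str, context: list) -> str:
        return f"{query} -> {len(context)} document(s)"


def orchestrate(query: str, retriever: Retriever, generator: Generator) -> str:
    """Orchestration depends only on the two contracts, not on concrete classes."""
    return generator.generate(query, retriever.retrieve(query))


docs = ["refund policy applies", "shipping takes time"]
print(orchestrate("refund", KeywordRetriever(docs), EchoGenerator()))
print(orchestrate("refund", StaticRetriever(docs), EchoGenerator()))
```

Swapping the retriever changes one argument and nothing in orchestrate. That is the property I'd want a framework to enforce rather than leave to convention.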
I'll be building on Mosaic AI as it matures and I'll report back on what the production constraints actually look like. Right now this is directionally right, with implementation details still being worked out.