None of Them Work Out of the Box
The model evaluation was useful, but it produced a conclusion that might seem discouraging: none of the alternatives I tested solved the problem I was trying to solve. Let me be more precise about what that means, because "they don't work" and "none of them work out of the box for my specific requirements" are different claims, and only the second one is true.
What Each Candidate Actually Does Well
Claude is the strongest instruction-following model I tested, at any parameter count. The quality of reasoning on complex technical problems is consistently better than comparable alternatives. If I were building a system where cloud API dependency and data egress were acceptable, Claude would be the first-choice model for the conversational and review layers of the stack. The limitation is the cloud API dependency, not the model quality.
Pi Coder is fine for simple, well-scoped coding tasks on a local machine: code completion, function generation from a signature, docstring creation. The task has to match the model's training distribution, meaning it has to look like the code completion tasks the model was fine-tuned on. Outside that zone, the quality drops fast. That's not a criticism; it's a scoped tool, and it should be used with scoped expectations.
Hermes performs better than Pi Coder on instruction-following tasks and has better instruction retention across long sessions. For agentic use cases — tasks that require the model to follow a multi-step procedure and maintain state about what has been done — Hermes is the stronger local model. The ceiling is raw reasoning quality on novel domain problems, where the larger cloud models have a significant advantage.
Why "Out of the Box" Is the Wrong Frame
The frame I was using during the evaluation — does this model solve my problem without additional engineering? — is not the right question. No model solves the orchestration problem. Models generate text. They don't maintain persistent state. They don't enforce output contracts. They don't route between subtasks based on outcome. They don't retry on failure.
Every one of those things is a system property, not a model property. Looking for a model that "works out of the box" for agentic workflows is like looking for a database engine that ships with your application's data already in it. The engine is one piece. The system is the thing that matters.
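To make the model-versus-system distinction concrete, here is a minimal sketch of two of those system properties, an output contract and retry on failure, wrapped around a model call. Everything in it is an assumption for illustration: the `parse_review` contract, the `call_with_contract` helper, and the stand-in model are hypothetical names, not part of any real provider SDK.

```python
import json
from typing import Callable


class ContractError(ValueError):
    """Raised when model output does not satisfy the expected shape."""


def parse_review(raw: str) -> dict:
    """Hypothetical output contract: a JSON object whose 'verdict' is 'pass' or 'fail'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or data.get("verdict") not in ("pass", "fail"):
        raise ContractError("missing or invalid 'verdict' field")
    return data


def call_with_contract(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate the reply, and retry with feedback if validation fails."""
    last_error = None
    for _ in range(max_attempts):
        raw = model(prompt)
        try:
            return parse_review(raw)
        except ContractError as exc:
            last_error = exc
            # Feed the rejection reason back so the next attempt can correct itself.
            prompt = f"{prompt}\n\nPrevious reply rejected ({exc}). Reply with JSON only."
    raise ContractError(f"gave up after {max_attempts} attempts: {last_error}")


def make_flaky_model() -> Callable[[str], str]:
    """Stand-in for a real model: first reply is chatty prose, second is valid JSON."""
    replies = iter(["Sure! Looks good.", '{"verdict": "pass"}'])
    return lambda prompt: next(replies)


result = call_with_contract(make_flaky_model(), "Review this diff.")
```

The model function here only maps text to text. The validation, the retry budget, and the feedback loop all live in the wrapper, which is the point: swapping in a stronger model changes how often the retry fires, not whether the system needs it.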
The evaluation was not wasted. It produced a clear model hierarchy for different task types, which I could use to make routing decisions in the orchestration layer. Claude for reasoning-heavy tasks where data sensitivity allowed. Hermes for local inference on sensitive content. Pi Coder for autocomplete-style tasks. No single model for everything — a routing decision based on task type and data sensitivity requirements.
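As a rough sketch, that routing decision can be written as a small function over task type and data sensitivity. The model labels are placeholder strings rather than real API identifiers, the `TaskType` categories are my own simplification, and the fallback for non-sensitive agentic tasks is an assumption (the hierarchy above doesn't specify it).

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    REASONING = "reasoning"        # reasoning-heavy conversational or review work
    AGENTIC = "agentic"            # multi-step procedures that need instruction retention
    AUTOCOMPLETE = "autocomplete"  # completion, signatures, docstrings


@dataclass(frozen=True)
class Task:
    kind: TaskType
    sensitive: bool  # True if the content must not leave the local machine


def choose_model(task: Task) -> str:
    """Pick a model label from task type and data sensitivity."""
    if task.kind is TaskType.AUTOCOMPLETE:
        return "pi-coder"  # local, scoped to completion-style work
    if task.sensitive:
        return "hermes"    # sensitive content stays on local inference
    if task.kind is TaskType.REASONING:
        return "claude"    # cloud model, only when data egress is acceptable
    return "hermes"        # assumption: agentic work defaults to the local model
```

For example, `choose_model(Task(TaskType.REASONING, sensitive=True))` returns `"hermes"`: sensitivity overrides the preference for the stronger cloud model.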
The Framework Question
With the model landscape mapped, the remaining question was the framework. I needed something that could express multi-step workflows as code, manage state between steps, handle conditional branching (success paths vs. failure paths), integrate tool use (knowledge retrieval, code execution, external API calls), and work with multiple model providers without a separate integration layer for each.
The two serious contenders were LangChain and LangGraph. I had written about LangChain before — the early versions had a "plumbing everywhere" quality that made complex workflows hard to reason about and even harder to debug. LangGraph was the newer addition: a graph-based extension of LangChain that let you express workflows as nodes and edges rather than chained function calls.
The graph model turned out to be important. Not because graphs are inherently better than chains, but because the workflows I needed to express were genuinely graph-shaped: not linear sequences, but directed graphs (mostly acyclic, cyclic where retry loops come in) with conditional branching, parallel subgraphs, and multiple entry points. Forcing that into a chain model produces code that fights the shape of the workflow.
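To show what "graph-shaped" means in practice, here is a toy workflow runner in plain Python with named nodes, a conditional branch, and a retry cycle. It is deliberately not LangGraph's API; the `Workflow` class and the node functions are hypothetical, and parallel subgraphs and multiple entry points are left out to keep it short.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]   # does work, returns the updated state
Router = Callable[[State], str]   # inspects the state, names the next node

END = "__end__"


class Workflow:
    """Minimal directed-graph runner: each node has a router that picks its outgoing edge."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.routers: Dict[str, Router] = {}

    def add_node(self, name: str, fn: Node, router: Router) -> None:
        self.nodes[name] = fn
        self.routers[name] = router

    def run(self, start: str, state: State, max_steps: int = 20) -> State:
        current, steps = start, 0
        while current != END:
            if steps >= max_steps:
                raise RuntimeError("step budget exhausted; possible unbounded cycle")
            state = self.nodes[current](state)
            current = self.routers[current](state)
            steps += 1
        return state


def generate(state: State) -> State:
    # Stand-in for a model call: pretend the draft only passes on the third attempt.
    attempts = state["attempts"] + 1
    return {**state, "attempts": attempts, "draft": f"draft-{attempts}", "ok": attempts >= 3}


def after_generate(state: State) -> str:
    if state["ok"]:
        return "publish"    # success path
    if state["attempts"] >= state["max_attempts"]:
        return "give_up"    # failure path
    return "generate"       # edge back to the same node: the retry cycle


wf = Workflow()
wf.add_node("generate", generate, after_generate)
wf.add_node("publish", lambda s: {**s, "status": "published"}, lambda s: END)
wf.add_node("give_up", lambda s: {**s, "status": "failed"}, lambda s: END)

final = wf.run("generate", {"attempts": 0, "max_attempts": 5})
```

The retry loop is just an edge that points back at `generate`, and the success and failure paths are two more edges out of the same node. Expressing the same thing as a linear chain would mean burying that branching inside one of the steps.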
Month eight was the start of the LangGraph adoption. It didn't go smoothly, but it went.