Researching the Alternatives: Claude, Pi Coder, and Hermes

When you've decided the current tool isn't doing the job, the responsible move is to evaluate the alternatives before building from scratch. I spent the better part of three months doing that evaluation. The alternatives were Claude, Pi Coder, and Hermes — each with a different set of design assumptions, each promising something the others lacked. The conclusion was less about which one won and more about what none of them provided.

What I Was Actually Evaluating

The failure modes from the persona work had defined the requirements clearly. I needed a model or model-plus-framework combination that could:

  1. Maintain behavioral consistency across a session without prompt drift
  2. Respect scope constraints reliably under conversational pressure
  3. Produce output in a verifiable format that a downstream system could parse and validate
  4. Run locally or on infrastructure I controlled — no mandatory cloud API dependency
  5. Integrate with the knowledge retrieval system I had built

Requirement 4 was the hard filter. Most commercial model APIs failed it immediately. Requirement 3 was the evaluative discriminator once I had a candidate that passed 4.

Claude

Anthropic's Claude was the strongest performer on the conversational quality and instruction-following dimensions. In my testing, the Constitutional AI training approach seemed to produce a model that was better at maintaining stated constraints than models trained with pure RLHF: Claude held persona scope more consistently and for longer than the Copilot Chat baseline.

The limitation was architectural, not a matter of model quality. Claude is a cloud API. All inference happens on Anthropic's infrastructure, so data sent to Claude leaves my control. For personal projects that's a tradeoff I can evaluate; for client project data, it's a harder line. The model quality was compelling enough that I kept Claude in the mix for tasks where the data sensitivity was low: architecture discussions, general reasoning, code review on non-sensitive examples. But it couldn't be the primary layer for production work on client projects.

Pi Coder and Hermes

Pi Coder and Hermes represented the local model path. Both could run on local hardware without data leaving the machine. Both had been specifically tuned or fine-tuned for code-related tasks. Both fell short in ways that were specific and consistent enough to be useful data.

Pi Coder's instruction following was adequate for simple, single-step tasks. For complex, multi-turn interactions with a detailed persona prompt, it struggled to maintain coherence across the conversation. The persona instructions were present in the context window, but the model's ability to apply them consistently degraded as the conversation lengthened. This is not unique to Pi Coder, since most smaller local models had this property, but the degradation was faster and more pronounced than I had hoped.

Hermes had better instruction-following characteristics than Pi Coder at comparable parameter counts, which made it the stronger local model candidate for persona-driven work. The limitation was in reasoning depth on complex domain problems. For tasks that required multi-step logical reasoning about data modeling tradeoffs or pipeline architecture decisions, Hermes produced outputs that were plausible but shallow. It could pattern-match to similar problems it had seen in training but couldn't reason through novel combinations of constraints.

The Common Thread

Testing all three candidates against the real workloads I needed them to handle produced a clear picture: none of them solved the orchestration problem, because none of them were designed to. They were all models. What I needed was a model plus an orchestration layer that enforced the behavioral contracts, managed state across steps, and handled the verification of outputs.

A better model gets you better inference. It doesn't get you enforced scope, verified output formats, or automatic retry when the output fails validation. Those properties live in the layer that wraps the model — and that layer has to be built.
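To make the wrapping layer concrete, here is a minimal sketch of validate-then-retry around a model call. It is an illustration, not the implementation I ended up with: the `REQUIRED_KEYS` contract, the `in_scope` flag, and the `call_with_retry` helper are all hypothetical names, and the "model" is a stand-in function that returns a canned string.

```python
import json
from typing import Callable

# Hypothetical output contract: every reply must be a JSON object with these keys.
REQUIRED_KEYS = {"answer", "in_scope"}


def validate(raw: str) -> dict:
    """Parse model output and check it against the contract; raise ValueError on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["in_scope"] is not True:
        raise ValueError("response is out of scope")
    return data


def call_with_retry(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model until its output validates or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        # Feed the previous validation error back so the model can correct itself.
        full_prompt = prompt if last_error is None else (
            f"{prompt}\n\nPrevious output was rejected: {last_error}"
        )
        try:
            return validate(model(full_prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")


def make_flaky_model() -> Callable[[str], str]:
    """Stand-in for a real model: the first reply is malformed, the second is valid."""
    replies = iter(["not json", '{"answer": "use a star schema", "in_scope": true}'])
    return lambda prompt: next(replies)


result = call_with_retry(make_flaky_model(), "Suggest a data model.")
print(result["answer"])
```

The point of the sketch is that none of this logic depends on which model sits behind the `model` callable. Swapping in a stronger model changes how often the retry fires, not whether the contract is enforced.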

The question was what to build it with. I had been looking at frameworks in parallel with the model evaluation. One kept coming back in the research: LangGraph. The graph-based approach to workflow orchestration matched the mental model I had developed for what the orchestration layer needed to do. Plan steps as nodes. State transitions as edges. Side effects — API calls, tool use, knowledge retrieval — as node operations. Verification as a conditional edge that loops back on failure.
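That mental model can be sketched in a few lines of plain Python. To be clear, this is not LangGraph's API: `run_graph`, the node functions, and the `END` marker are hypothetical names I'm using to show the shape of the idea, and the generate step fakes a model call that fails once before succeeding.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]    # a node does work and returns updated state
Router = Callable[[State], str]    # an edge picks the next node from the state

END = "END"


def run_graph(nodes: Dict[str, Node], edges: Dict[str, Router],
              start: str, state: State, max_steps: int = 20) -> State:
    """Run nodes one at a time, letting each node's router choose the next one."""
    current = start
    steps = 0
    while current != END:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without hitting END")
        state = nodes[current](state)
        current = edges[current](state)
        steps += 1
    return state


def plan(state: State) -> State:
    return {**state, "plan": ["draft schema"], "attempts": 0}


def generate(state: State) -> State:
    attempts = state["attempts"] + 1
    # Stand-in for a model call: the first attempt is deliberately malformed.
    output = "???" if attempts == 1 else "CREATE TABLE orders (id INTEGER)"
    return {**state, "attempts": attempts, "output": output}


def verify(state: State) -> State:
    return {**state, "valid": state["output"].startswith("CREATE TABLE")}


nodes = {"plan": plan, "generate": generate, "verify": verify}
edges = {
    "plan": lambda s: "generate",
    "generate": lambda s: "verify",
    # Conditional edge: loop back to generate on failure, finish on success.
    "verify": lambda s: END if s["valid"] else "generate",
}

final = run_graph(nodes, edges, "plan", {})
print(final["attempts"], final["output"])
```

The `max_steps` guard is there because a verification loop with no bound will spin forever on a model that never produces valid output.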

The model evaluation had confirmed the requirement. The framework evaluation was pointing at the implementation. Putting them together was the next step. If you've evaluated different models for similar use cases and reached different conclusions, I'd like to hear about it: the evaluation methodology matters as much as the results.
