ForgeAI in Practice: Early Wins and Rough Edges
Running a system in production against real work is different from running it against test cases you designed to make it succeed. The production gap — the difference between what you built and what you needed — shows up in ways that are difficult to anticipate and impossible to fully test for. Here's what the first few months of production ForgeAI use actually looked like.
The Early Wins
The plan generation step was the most reliable part of the system from the start. Given a well-scoped Forgejo issue — clear acceptance criteria, defined scope, reference to relevant existing code — the orchestrator consistently produced plans that were architecturally sound and sequenced correctly. Not plans I would follow exactly, but plans that reflected genuine understanding of the project structure and the problem requirements.
The knowledge retrieval integration showed up clearly in the plan quality. Issues that touched areas with good knowledge coverage (well-documented domain rules, clear architectural decisions captured in the knowledge base) produced better plans than issues in areas where the knowledge base was thin. This was expected but useful to confirm: the system's output quality tracked the quality of the knowledge it could access.
Output format consistency improved significantly compared to raw model calls. The output verification step caught format deviations before they propagated to downstream steps. When the verification caught a failure, the retry logic could fix it automatically in most cases — either by re-prompting the model with more explicit format instructions or by post-processing the output to coerce it into the expected shape.
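As a rough illustration of the verify-then-retry loop described above, here is a minimal Python sketch. It is not ForgeAI's actual code: the `REQUIRED_KEYS` schema, the JSON output format, and the `call_model` callable are all assumptions made for the example.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: the keys a plan output is assumed to need.
REQUIRED_KEYS = {"summary", "steps"}


def coerce(raw: str) -> Optional[dict]:
    """Post-process: pull the outermost JSON object out of surrounding chatter."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def verify(parsed: Optional[dict]) -> bool:
    """Verification: the output parsed and carries every required key."""
    return parsed is not None and REQUIRED_KEYS.issubset(parsed)


def generate_verified(call_model: Callable[[str], str], prompt: str,
                      max_attempts: int = 3) -> dict:
    """Call the model; on a format failure, re-prompt with explicit instructions."""
    current = prompt
    for _ in range(max_attempts):
        parsed = coerce(call_model(current))
        if verify(parsed):
            return parsed
        # Retry with more explicit format instructions appended to the prompt.
        current = (prompt + "\n\nReturn only a JSON object with keys: "
                   + ", ".join(sorted(REQUIRED_KEYS)) + ".")
    raise ValueError(f"output failed verification after {max_attempts} attempts")
```

Coercion runs first because it is cheap. The re-prompt only happens when post-processing cannot produce a valid shape.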
The Rough Edges
The implementation step was the weakest link. Generating code from a plan is harder than generating a plan from requirements, because code has to work, not just read coherently. The orchestrator could generate code that was syntactically correct, followed the project's conventions (thanks to knowledge retrieval), and implemented the stated plan, yet still contained subtle logic errors that only showed up in testing.
These weren't random errors. They clustered around specific patterns: boundary conditions that the plan hadn't specified explicitly, edge cases in data transformation that required domain knowledge the model inferred incorrectly, and integration points where the generated code assumed an API contract that didn't quite match the actual interface.
The right response to implementation errors is not "generate better code." The right response is "improve the plan so it specifies the things the model is getting wrong." A plan that explicitly calls out boundary conditions and API contracts produces implementation code that handles them correctly. The failure was in plan quality, not code generation quality — and plan quality was improvable by making the specification more precise.
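One way to make "a more precise plan" concrete is to give each plan step explicit slots for boundary conditions and API contracts, then flag steps that leave them empty before implementation starts. The sketch below is hypothetical: `PlanStep` and `underspecified` are illustrative names, not the orchestrator's real data model.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """Hypothetical plan step with explicit slots for what the model tends to get wrong."""
    description: str
    boundary_conditions: list[str] = field(default_factory=list)
    api_contracts: list[str] = field(default_factory=list)


def underspecified(steps: list[PlanStep]) -> list[str]:
    """Return descriptions of steps missing boundary conditions or API contracts."""
    return [
        s.description
        for s in steps
        if not s.boundary_conditions or not s.api_contracts
    ]
```

A check like this does not make a plan correct. It only surfaces the steps where the implementation model would otherwise have to guess.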
The Review Step: Better Than Expected
The review step surprised me. I had expected it to catch obvious issues and miss subtle ones. It caught subtle ones more often than I expected: specifically, issues that required understanding the project's domain constraints rather than just code correctness in the abstract. What made the difference was knowledge retrieval supplying domain context to the reviewer.
A code review that has access to "the fiscal year starts October 1 and Q1 closes December 31" will catch a date boundary calculation that would look correct to a reviewer who assumed calendar quarters. The review model wasn't more capable than the implementation model. It was seeing the same domain context and applying it as a verification layer rather than a generation layer. Same context, different cognitive task, and the verification task turned out to be well-suited to what the model does well.
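The fiscal-year example can be worked through in a few lines of Python. This is a sketch that assumes only the rule quoted above (fiscal year starts October 1, so Q1 runs October through December). The function names are illustrative.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 10  # October, per the domain rule quoted above


def fiscal_quarter(d: date) -> int:
    """Quarter (1-4) within a fiscal year that starts October 1."""
    return ((d.month - FISCAL_YEAR_START_MONTH) % 12) // 3 + 1


def calendar_quarter(d: date) -> int:
    """Quarter (1-4) within the calendar year, the assumption a reviewer might default to."""
    return (d.month - 1) // 3 + 1


# December 31 is the last day of fiscal Q1 but falls in calendar Q4.
year_end = date(2024, 12, 31)
print(fiscal_quarter(year_end), calendar_quarter(year_end))  # prints "1 4"
```

Code written against `calendar_quarter` looks entirely reasonable in isolation. It is only wrong relative to the domain rule, which is why the reviewer needs that rule in context.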
The Workflow Stability Problem
The stability issue that emerged over the first few months of production use was subtle: the workflow behaved differently depending on how busy the model providers were. Under normal load, latency was acceptable and output quality was consistent. Under high load — model provider capacity constraints, which became more visible as usage scaled — latency increased, and the increase was uneven across steps. Sometimes a step timed out. Sometimes the model returned a response that was truncated mid-output because a token limit was reached under pressure.
This is the operational reality of building on shared cloud infrastructure. The upstream service has capacity constraints you don't control. The right engineering response is timeouts, circuit breakers, and fallback paths — not assuming the upstream is always available and always performing at specification.
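For readers who have not built one, a circuit breaker can be quite small. The following Python sketch is an illustration under assumptions, not the code ForgeAI runs: the threshold, the cooldown, and the injectable `clock` parameter are all choices made for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; permit a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call to the upstream should be attempted right now."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker once the threshold is hit."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The caller checks `allow()` before each upstream request and reports the outcome afterwards. While the breaker is open, requests can go straight to a fallback path instead of waiting on a timeout.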
Adding provider fallback logic (routing to a secondary provider when the primary is slow or returning errors) was the immediate fix. It worked. It also highlighted the value of the provider-agnostic architecture I had built: swapping in a fallback provider required changing a configuration value, not rewriting integration code. The abstraction had paid off in operational resilience.
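A configuration-driven fallback of this kind might look roughly like the sketch below. The names are hypothetical, and it assumes each provider is wrapped as a plain callable that enforces its own timeout and raises on failure. The real provider abstraction is not shown in this post.

```python
from typing import Callable, Mapping, Sequence

# Assumed shape of a provider: takes a prompt, returns text, raises on timeout or error.
Provider = Callable[[str], str]


def call_with_fallback(prompt: str, providers: Mapping[str, Provider],
                       order: Sequence[str]) -> tuple[str, str]:
    """Try providers in configured order; return (name, response) from the first success."""
    errors = []
    for name in order:
        try:
            return name, providers[name](prompt)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In this shape, `order` is the only thing that changes when a fallback is added or the primary is swapped, so it can come straight from configuration.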
If you're running a similar orchestrated workflow in production and have patterns for handling upstream model provider variability, I'd like to hear how you've approached it.