Two Years In: The Real State of My AI Stack
Two years changes the question. In the first year, the question was "does this work?" The answer was yes, with significant caveats. In the second year, the question became "is this sustainable?" — and the answer to that required being honest about what the system had cost to build and what it was actually delivering.
The Stack Today
Two years of accumulated engineering work has produced a specific stack:
Cortex Forge: the knowledge layer. PostgreSQL with pgvector, running locally. About 800 entries across eight active and archived projects. Semantic and full-text hybrid retrieval. An MCP server that exposes retrieval to AI tools. A CLI for fast human-driven ingestion. This is the most mature piece of the stack and the one that has paid forward most consistently.
ForgeAI: the orchestration layer. LangGraph-based workflows for the plan-implement-review loop. Multi-provider routing with Anthropic, OpenAI, and Ollama as providers. Output verification at each step. Cortex Forge integration at the context retrieval step. Running against a self-hosted Forgejo instance for issue tracking and PR management. This is functional and improving; it is not yet at the reliability level I want for unattended operation.
Personas library: a collection of system prompts that define specialized roles. Not enforced by an external system — applied as configuration at the orchestration layer level, not as raw chat prompts. The orchestration layer controls what each model call can do; the persona provides the behavioral framing. This addresses the enforcement gap that made raw persona prompts unreliable.
What the Stack Delivers
For well-scoped, well-documented tasks in projects with mature knowledge coverage: the orchestrator produces first-pass implementations that are close to merge-ready. The plan-implement-review loop catches most implementation issues before I see the output. The knowledge retrieval means the implementation uses project conventions correctly without explicit instruction in each prompt.
For ambiguous tasks, new domains, and projects without knowledge coverage: the stack provides less value. The orchestrator needs good inputs to produce good outputs. Tasks that require significant domain clarification before the implementation can start are better handled with direct model interaction — the overhead of the full orchestration loop isn't worth it for exploratory work.
The tool is better suited to implementation work than to discovery work. That's a real limitation, not a failure mode. Design the system for what it does well; don't blame it for what it doesn't do.
What It Cost
The honest accounting: two years of part-time engineering investment, spread across knowledge system design, pgvector integration, MCP server implementation, orchestration layer development, provider routing, output verification, and ongoing maintenance. Not nights-and-weekends-only — a significant fraction of actual working hours.
That's a real cost. The return needs to exceed it, and the time horizon matters. If you measure the return over two months, the investment doesn't pay back. Measured over two years on repeat patterns across multiple client projects, it has — but "it pays back over two years" is a different value proposition than what the AI tools marketing implies.
The Conclusions That Hold
LLMs are genuinely useful tools that extend what a solo engineer can accomplish. They are not autonomous agents that replace engineering judgment. The most productive framing I've found: they're a very capable autocomplete at the pattern layer and a good first-pass generator at the implementation layer, with the important caveat that the first pass requires review and the patterns have to be appropriate to your domain.
The infrastructure investment — knowledge systems, orchestration layers, provider abstractions — is what makes the tools reliable enough to build on. Without that infrastructure, you're using powerful tools inconsistently, which produces inconsistent results. The infrastructure is where the "AI development" work actually lives, and it's unglamorous, maintenance-heavy, and necessary.
The category of engineer who benefits most from this stack: someone with enough domain expertise to evaluate model output critically, enough infrastructure comfort to build and maintain the surrounding system, and enough patience to invest in tooling that pays back over months rather than days. That's not everyone. It might not even be most engineers. It is, deliberately, where I've put my effort — and as we head into the third year, the stack is capable enough to justify continuing to improve it.
If you've been on a similar journey and landed somewhere different, I'd genuinely like to understand why. As always, I'm here to help.