You Don't Need Fable. You Need a Router.

A multi-direction signpost — routing each task to the right model
Photo: “Riverside Path signpost directions in Northwich” by kitchenkraft, licensed under CC BY 2.0.

The performance gap between open-weight models and closed frontier models has spent the last year collapsing faster than anyone predicted. Epoch AI's tracking puts open weights at roughly a three-to-four-month lag behind state-of-the-art closed models on average. For coding tasks, the gap has effectively closed — DeepSeek V3.2 and MiMo V2 Pro sit within striking distance of Opus 4.8 on real-world workloads. For complex reasoning, the closed frontier still holds a meaningful edge.

That remaining gap matters less than people think, for a reason that's easy to miss: the tasks in your pipeline are not uniformly hard. Most of them are nowhere near the frontier. And if you're routing every request to a frontier model because "it's the best," you're paying frontier prices for work that a well-prompted 7B model handles correctly — while also handing your data to a vendor you can't audit, on infrastructure you don't control.

The mature architecture isn't "pick the best model." It's build a routing layer.

What the Performance Landscape Actually Looks Like

The wave of MoE (mixture-of-experts) open-weight models that landed in the first half of this year changed the economics more than any single benchmark result. Models like DeepSeek V4-Pro, Qwen 3.6-35B-A3B, and Mistral Small 4 achieve very high active-parameter efficiency — only a fraction of total parameters activate per token, which means they run fast on modest hardware while delivering quality that rivals much larger dense models.

The result is a bifurcated market. For routine tasks — classification, extraction, structured generation, code templating — open-weight models are now the volume leaders. For the hardest reasoning, long-context synthesis, and nuanced generation, the closed frontier still earns its premium. The right response to this landscape is not to pick a side. It's to route.

The Stack I'm Running

I've been building toward a multi-provider routing architecture, and after spending time testing different approaches, I landed on OmniRoute as the gateway layer. It's an OpenAI-compatible endpoint that routes across 200+ providers — closed APIs, local inference endpoints, everything — with 15 routing strategies, 4-tier auto-fallback, and a prompt compression pipeline that cuts token counts 15-75% per request before the model ever sees them.

The compression piece matters more than I initially gave it credit for. Fewer input tokens means lower cost at every provider, lower latency everywhere, and meaningfully better performance on smaller models that struggle under bloated prompts.

Behind OmniRoute I'm running about a dozen providers. The interesting one for this post is the local tier: models running on a Mac Mini via MLX and Ollama. MLX became the dominant Apple Silicon inference backend after Ollama switched its Metal backend to MLX — it's 30-60% faster than the previous llama.cpp approach, and 3-4x faster on prompt processing on M4 hardware. On a Mac Mini M4 Pro with 64GB unified memory, a MoE model like Qwen3-Coder-30B runs at around 130 tokens per second — fast enough for real pipeline work, not just demos.

That local tier covers three things nothing in the cloud can: zero per-token cost at any volume, full data sovereignty (the payload never leaves the machine), and offline operation when the cluster is down.

The Three-Tier Routing Model

The routing logic I've arrived at isn't complicated, but it has to be explicit. Here's the actual decision tree:

Local models (Mac Mini / self-hosted): Anything involving sensitive client data, high-volume routine tasks where per-token cost accumulates, anything that needs to work offline, and anything where I want absolute certainty the payload doesn't leave my network. These run at zero marginal cost and with complete sovereignty.

Mid-tier cloud (Mistral Small 4, DeepSeek V3.2, Haiku 4.5, open-weight providers): Tasks that need more quality than local models reliably deliver, but don't need frontier reasoning — complex extraction, multi-step code generation, structured analysis. Cost is a fraction of frontier, latency is acceptable, and quality meets the bar.

Frontier cloud: Reserved for tasks where the quality difference is real and worth paying for — complex multi-step reasoning, high-stakes decision points in agent pipelines, content where prose quality visibly matters. Maybe 5-10% of total request volume in a mature pipeline. Which specific frontier provider fills this slot at any given moment is, as it turns out, not something you can assume in advance.

The routing decision itself runs on a fast small model. Task classification is cheap; sending the wrong task to the wrong tier is expensive.

Provider Risk Is Not Theoretical

The argument for multi-provider routing used to feel like defensive engineering — sensible in principle but unlikely to matter in practice. Then the US government issued an export-control directive requiring Anthropic to immediately suspend access to Fable 5 and Mythos 5 for all foreign nationals, citing national security concerns over a reported jailbreak. Anthropic couldn't segment foreign nationals from US-based users across a hundreds-of-millions user base on same-day notice, so they pulled both models for everyone — US customers included.

Teams whose workflows depended on Fable 5 had no warning and no graceful fallback. Those who had been routing through a gateway with multiple frontier options configured fell over to the next available provider automatically, with no human intervention and no pipeline downtime.

The shutdown wasn't caused by a technical failure, a pricing decision, or a vendor relationship gone bad. It was a regulatory action that the vendor had no choice but to comply with. That's a category of risk that doesn't show up in SLA discussions or uptime metrics, and it's one that a single-provider dependency can't hedge against.

I'm not going to speculate on the merits of the directive or predict when or whether access is restored. What the situation makes undeniable is the architecture point: if your pipeline has a hard dependency on any specific closed model, you're exposed to every kind of availability risk that model's provider faces — technical, commercial, and regulatory. A routing layer with multiple frontier providers configured doesn't eliminate that exposure. It makes recovery automatic instead of manual.

The Sovereignty and Cost Math

Running this architecture for several months, the economics are instructive. The local tier absorbs most of the volume for high-frequency pipeline tasks. The mid-tier cloud handles the work that needs more quality than local provides. The frontier tier handles a small fraction of requests. The blended cost across all three tiers is dramatically lower than routing everything to a frontier API — and the data exposure surface is dramatically smaller because the sensitive volume stays on-prem.

The other dimension people undercount is exactly what the Fable situation illustrated: provider resilience. A routing architecture that runs across a dozen providers with auto-fallback absorbs outages, pricing changes, model deprecations, and regulatory shutdowns, without a pipeline change or an on-call incident.

What This Is Not

This is not a recommendation to avoid frontier models or to treat any specific provider as unreliable. The closed frontier produces genuinely better results on hard reasoning tasks, and a well-configured routing setup should absolutely include frontier options. The point is that which frontier provider fills that slot should be a routing decision, not an architectural dependency.

The goal is calibration: the right model for the right task, routed automatically, with fallback when something fails — for whatever reason. Build the router first. Then figure out where each tier actually earns its slot in your pipeline. As always, I'm here to help.

Read more