Open-Weight Models in Production: The Deployment Reality Check

Shannon Lowder

28 Jan 2026 — 2 min read

Rows of server racks in a data center — managed inference you don't have to operate — Photo: “Virginia Tech - data center” by cbowns, licensed under CC BY-SA 2.0.

Llama 4, DeepSeek V3.2, Mistral Large 3 — the open-weight model landscape at the end of 2025 is genuinely competitive with closed frontier models for a wide range of tasks. That's the good news. The deployment reality has some friction that's worth understanding before you commit to an on-premises or self-hosted inference strategy.

The Hardware Requirement Is Not Trivial

Running Llama 4 Maverick (400B total parameters, 17B active per token in MoE) in a configuration that produces acceptable throughput for production workloads requires meaningful GPU infrastructure. The MoE architecture means active compute per token is much lower than the parameter count suggests, but you still need the full model loaded across multiple GPUs for the router to function correctly. Verify your hardware budget before you commit to self-hosted Maverick.

Scout (109B total, 17B active) is more accessible, and for most pipeline tasks the quality gap between Scout and Maverick is smaller than the infrastructure cost gap.

Databricks Foundation Model APIs as the Middle Ground

For teams on Databricks, the Foundation Model APIs are the pragmatic on-ramp: you get open-weight model inference without managing the GPU infrastructure yourself. The models run in Databricks' infrastructure, your data doesn't leave the platform, and you pay per token. It's not as cheap as true self-hosted, but the operational overhead is near zero compared to managing your own GPU cluster.

Decision: default to managed inference (Databricks Foundation Model APIs); self-host only for data sovereignty, very high volume, or owned fine-tuned weights — Default to managed inference; reach for self-hosting only when sovereignty, volume economics, or owned weights demand it.

When Self-Hosted Makes Sense

Data sovereignty requirements that prohibit even Databricks-managed inference. Very high-volume, cost-sensitive workloads where the per-token savings of owned hardware cover the ops cost. Fine-tuned models where you own the weights and can't load them on a shared platform. For most teams, managed inference on Databricks Foundation Models is the right call until volume justifies the infrastructure investment. I'm here to help work through the economics for your specific workload.

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

I wrote recently about Azure Agent Mesh and OpenSharing — two infrastructure layers that between them cover how enterprises register, discover, share, and execute agents. Between them, they address a lot of the plumbing that has been missing from the enterprise agent stack. But there's a gap neither of

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

Unity AI Gateway, announced at DAIS this week, is the feature I've been waiting for since Agent Bricks shipped last year. It's a centralized governance layer for model access in Databricks — you configure which models are approved for use in your environment, who can call them,

You Don't Need Fable. You Need a Router.

The performance gap between open-weight models and closed frontier models has spent the last year collapsing faster than anyone predicted. Epoch AI's tracking puts open weights at roughly a three-to-four-month lag behind state-of-the-art closed models on average. For coding tasks, the gap has effectively closed — DeepSeek V3.2

DAIS 2026: Genie One and the Context Problem Databricks Is Solving

The central message from DAIS this week, delivered by Ali Ghodsi in the opening keynote, was direct: AI doesn't have an intelligence problem, it has a context problem. If your CFO can't get an AI system to explain why margins changed, that's not a