Introduction
The most common realization we see from teams who’ve started down the self-hosting path is this: inference engine ≠ inference stack.
What often begins as a simple pilot - maybe using vLLM or NIMs - quickly grows in complexity. Before long, there’s a growing list of dependencies, performance bottlenecks, and open questions around scale, security, and internal visibility. The result? Something that looks less like a model deployment - and a lot more like a full-blown platform.

But because the space is still so new, there’s no playbook. Teams are learning as they go - comparing notes in Slack threads, chasing down GitHub issues, pulling fragments of advice from talks, blog posts, and internal experiments. Everyone is feeling their way forward, building in real time.

As the team behind a self-hosted inference platform, working closely with some of the most forward-thinking enterprises out there, we’re in a unique position: we get to see the patterns. The pain points. The common missteps - and the practices that actually work. And we’ve started to distill that knowledge into a reference architecture that reflects what it really takes to self-host at scale.
Not just in theory - but in production, under load, with real users and real business requirements.
Self-Hosted Inference Stack
When we map out a modern inference stack, it’s clear that hardware and inference engines are just the foundation. Without them, nothing works - but with just them, nothing scales. You need orchestration layers that know how to keep your models healthy and performant. That means autoscaling tuned for LLMs, CI/CD that integrates with model lifecycles, and scheduling systems that don’t blindly assume every job is the same.
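To make the autoscaling point concrete, here’s a minimal sketch of an LLM-aware scaling signal that keys off request queue depth rather than CPU. It assumes a vLLM-style Prometheus /metrics endpoint; the metric name and the target-per-replica threshold are assumptions you’d swap for whatever your engine actually exposes.

```python
# Minimal sketch of an LLM-aware autoscaling signal: scale on request queue
# depth rather than CPU, which is a poor proxy for GPU-bound LLM load.
# Assumes a vLLM-style Prometheus /metrics endpoint; the metric name below
# is an assumption and may differ across engines and versions.
import math
import urllib.request

def read_metric(metrics_text: str, name: str) -> float:
    """Pull a single gauge value out of Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):  # handles both "name value" and "name{labels} value"
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def desired_replicas(metrics_url: str, current: int,
                     target_waiting_per_replica: int = 4) -> int:
    """Suggest a replica count from the number of requests waiting in the queue."""
    text = urllib.request.urlopen(metrics_url, timeout=5).read().decode()
    waiting = read_metric(text, "vllm:num_requests_waiting")
    needed = max(1, math.ceil(waiting / target_waiting_per_replica))
    return max(current, needed)  # scale-up only; a real policy also handles scale-down

# e.g. desired_replicas("http://llm-0:8000/metrics", current=2)
```

The same idea generalizes: the signals that matter for LLM serving (queue depth, KV cache pressure, time-to-first-token) are rarely the ones a generic autoscaler watches by default.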
Then there’s the security layer, which becomes especially important as usage increases across teams and surfaces. API gateways, RBAC, and fine-grained authentication aren't just boxes to check - they’re the difference between controlled usage and chaos.

Observability is another area that catches teams off guard. When latency spikes or throughput drops, do you know why? Can you trace it to a specific model, endpoint, or hardware node? If not, you're flying blind.

There’s also the management layer, which often gets built ad hoc. Teams realize too late that model approval workflows, UI layers, or chargeback systems aren’t “nice-to-haves” - they’re necessary to keep operations clean and auditable, particularly in cross-functional teams.

And of course, you’ll need broad model support. You may start with a single open-weight LLM, but needs evolve fast. Suddenly you’re managing fine-tuned variants, auxiliary models, or entirely new architectures. Your platform has to flex with that, or you’ll be rebuilding it six months from now.
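On the observability point above, here’s a minimal sketch of per-model, per-endpoint latency tracing using the prometheus_client library. The metric name, label set, and bucket boundaries are our own illustrative choices, not a prescribed standard - the point is that labelling by model and endpoint is what turns “latency is up” into something you can actually act on.

```python
# Minimal sketch: per-model, per-endpoint latency tracing with Prometheus,
# so a latency spike can be attributed to a specific model or route.
# The metric name, labels, and buckets are illustrative assumptions.
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_seconds",
    "End-to-end inference latency",
    ["model", "endpoint"],
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10, 30),
)

def timed_inference(model: str, endpoint: str, call):
    """Wrap any inference call so its latency is recorded under the right labels."""
    start = time.perf_counter()
    try:
        return call()
    finally:
        REQUEST_LATENCY.labels(model=model, endpoint=endpoint).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for your Prometheus scraper to pick up
    # timed_inference("my-finetuned-llm", "/v1/chat/completions", lambda: client.chat(...))
```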
Optimizations like LLM-aware load balancing, quantization, caching, and GPU sharding are where things go from working… to working well. They’re how you turn a proof of concept into something that can serve enterprise workloads without burning cash. And finally, there’s the application layer - where models become useful. Think agents, vector DBs, document intelligence, evaluation tools, guardrails, multi-stage pipelines. This is the layer that makes the whole stack matter to the business - but it only works as well as the foundations it’s built on.
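As one illustration of what “LLM-aware” load balancing can mean, here’s a sketch of a router that sends each request to the backend with the least outstanding token work instead of round-robin. The reservation heuristic (prompt tokens plus max new tokens) is a simplifying assumption; real routers also weigh prefix-cache affinity and KV-cache pressure.

```python
# Sketch of LLM-aware load balancing: route each request to the backend with
# the least estimated outstanding token work, rather than round-robin.
from dataclasses import dataclass, field

@dataclass
class TokenAwareBalancer:
    # backend URL -> estimated tokens currently reserved on that backend
    in_flight: dict[str, int] = field(default_factory=dict)

    def add_backend(self, url: str) -> None:
        self.in_flight.setdefault(url, 0)

    def pick(self, prompt_tokens: int, max_new_tokens: int) -> tuple[str, int]:
        """Choose the least-loaded backend and reserve capacity for this request."""
        url = min(self.in_flight, key=self.in_flight.get)
        reservation = prompt_tokens + max_new_tokens  # pessimistic token estimate
        self.in_flight[url] += reservation
        return url, reservation

    def complete(self, url: str, reservation: int) -> None:
        """Release the reservation once the request has finished streaming."""
        self.in_flight[url] -= reservation

# Usage:
# lb = TokenAwareBalancer()
# lb.add_backend("http://llm-0:8000"); lb.add_backend("http://llm-1:8000")
# url, res = lb.pick(prompt_tokens=1200, max_new_tokens=256)
# ... send the request to `url`, then:
# lb.complete(url, res)
```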
Conclusion
One of the biggest lessons? Everything is interconnected. Swap out a model, and it could ripple through your engine, your caching logic, even your orchestration setup. There are no isolated changes when you’re dealing with production AI infrastructure.
Over the coming weeks and months, the team and I will be sharing what we’ve learned in the Behind the Stack series - patterns, anti-patterns, architectural trade-offs, and examples from the field. If you’re trying to move from demo to deployment, or prototype to platform, we hope this gives you a clearer picture of what’s ahead - and a head start on what to build.
Behind the Stack Series
This post is the first in our Behind the Stack series, where we’ll be publishing deep dives each week on the real-world challenges and design decisions behind self-hosted inference platforms. As new episodes are released, we’ll link them all here - so you can follow along, revisit key topics, and explore the full picture as it unfolds. Check back each week to catch the latest installment.
- Episode 1: What Should I Be Observing in My LLM Stack?
- In Ep. 1 of our Behind the Stack series, Chief Scientist Jamie Dborin dives into observability and talks through the three key signals enterprise teams should be watching when productionizing their AI: KV cache utilization, sequence length, and user feedback loops.
- Episode 2: How Many Users Can My GPU Serve?
- In Ep. 2 of our Behind the Stack series, Chief Scientist Jamie Dborin breaks down how to calculate KV cache size from model and GPU specs, how many users you can realistically support at different context lengths, and strategies to increase that capacity - a back-of-the-envelope version of the calculation is sketched below. This is vital for anyone scaling up AI projects: it lets you plan deployments, budget capacity, and avoid invisible ceilings.
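As a teaser, here is a rough version of that sizing calculation in Python. The model configuration assumes a Llama-3-8B-style setup (32 layers, 8 KV heads via GQA, head dim 128, fp16 weights) and ignores activation and framework overhead, so treat the outputs as orders of magnitude rather than the episode’s exact figures.

```python
# Back-of-the-envelope KV-cache sizing. Model figures assume a
# Llama-3-8B-style configuration in fp16 and ignore activation/framework
# overhead - illustrative inputs only.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(f"{per_token / 1024:.0f} KiB per token")  # 128 KiB

gpu_memory_gib = 80   # e.g. a single 80 GB accelerator
weights_gib = 16      # ~8B params in fp16
kv_budget = (gpu_memory_gib - weights_gib) * 1024**3  # bytes left for KV cache

context_len = 8192
concurrent_users = kv_budget // (per_token * context_len)
print(f"~{concurrent_users} users at {context_len}-token context")  # roughly 64
```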