Introduction
The most common realization we see from teams who’ve started down the self-hosting path is this: inference engine ≠ inference stack.
What often begins as a simple pilot - maybe using vLLM or NIMs - quickly grows in complexity. Before long, there’s a growing list of dependencies, performance bottlenecks, and open questions around scale, security, and internal visibility. The result? Something that looks less like a model deployment - and a lot more like a full-blown platform.

But because the space is still so new, there’s no playbook. Teams are learning as they go - comparing notes in Slack threads, chasing down GitHub issues, pulling fragments of advice from talks, blog posts, and internal experiments. Everyone is feeling their way forward, building in real time.

As the team behind a self-hosted inference platform, working closely with some of the most forward-thinking enterprises out there, we’re in a unique position: we get to see the patterns. The pain points. The common missteps - and the practices that actually work. And we’ve started to distill that knowledge into a reference architecture that reflects what it really takes to self-host at scale.
Not just in theory - but in production, under load, with real users and real business requirements.
Self-Hosted Inference Stack
When we map out a modern inference stack, it’s clear that hardware and inference engines are just the foundation. Without them, nothing works - but with just them, nothing scales. You need orchestration layers that know how to keep your models healthy and performant. That means autoscaling tuned for LLMs, CI/CD that integrates with model lifecycles, and scheduling systems that don’t blindly assume every job is the same.
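To make the autoscaling point concrete, here’s a minimal sketch of an LLM-aware scaling signal that keys off request queue depth rather than CPU. It assumes a vLLM-style Prometheus /metrics endpoint; the metric name and the target-per-replica threshold are assumptions you’d swap for whatever your engine actually exposes.

```python
# Minimal sketch of an LLM-aware autoscaling signal: scale on request queue
# depth rather than CPU, which is a poor proxy for GPU-bound LLM load.
# Assumes a vLLM-style Prometheus /metrics endpoint; the metric name below
# is an assumption and may differ across engines and versions.
import math
import urllib.request

def read_metric(metrics_text: str, name: str) -> float:
    """Pull a single gauge value out of Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):  # handles both "name value" and "name{labels} value"
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def desired_replicas(metrics_url: str, current: int,
                     target_waiting_per_replica: int = 4) -> int:
    """Suggest a replica count from the number of requests waiting in the queue."""
    text = urllib.request.urlopen(metrics_url, timeout=5).read().decode()
    waiting = read_metric(text, "vllm:num_requests_waiting")
    needed = max(1, math.ceil(waiting / target_waiting_per_replica))
    return max(current, needed)  # scale-up only; a real policy also handles scale-down

# e.g. desired_replicas("http://llm-0:8000/metrics", current=2)
```

The same idea generalizes: the signals that matter for LLM serving (queue depth, KV cache pressure, time-to-first-token) are rarely the ones a generic autoscaler watches by default.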
Then there’s the security layer, which becomes especially important as usage increases across teams and surfaces. API gateways, RBAC, and fine-grained authentication aren't just boxes to check - they’re the difference between controlled usage and chaos.

Observability is another area that catches teams off guard. When latency spikes or throughput drops, do you know why? Can you trace it to a specific model, endpoint, or hardware node? If not, you're flying blind.

There’s also the management layer, which often gets built ad hoc. Teams realize too late that model approval workflows, UI layers, or chargeback systems aren’t “nice-to-haves” - they’re necessary to keep operations clean and auditable, particularly in cross-functional teams.

And of course, you’ll need broad model support. You may start with a single open-weight LLM, but needs evolve fast. Suddenly you’re managing fine-tuned variants, auxiliary models, or entirely new architectures. Your platform has to flex with that, or you’ll be rebuilding it six months from now.
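On the observability point above, here’s a minimal sketch of per-model, per-endpoint latency tracing using the prometheus_client library. The metric name, label set, and bucket boundaries are our own illustrative choices, not a prescribed standard - the point is that labelling by model and endpoint is what turns “latency is up” into something you can actually act on.

```python
# Minimal sketch: per-model, per-endpoint latency tracing with Prometheus,
# so a latency spike can be attributed to a specific model or route.
# The metric name, labels, and buckets are illustrative assumptions.
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_seconds",
    "End-to-end inference latency",
    ["model", "endpoint"],
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10, 30),
)

def timed_inference(model: str, endpoint: str, call):
    """Wrap any inference call so its latency is recorded under the right labels."""
    start = time.perf_counter()
    try:
        return call()
    finally:
        REQUEST_LATENCY.labels(model=model, endpoint=endpoint).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for your Prometheus scraper to pick up
    # timed_inference("my-finetuned-llm", "/v1/chat/completions", lambda: client.chat(...))
```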
Optimizations like LLM-aware load balancing, quantization, caching, and GPU sharding are where things go from working… to working well. They’re how you turn a proof of concept into something that can serve enterprise workloads without burning cash. And finally, there’s the application layer - where models become useful. Think agents, vector DBs, document intelligence, evaluation tools, guardrails, multi-stage pipelines. This is the layer that makes the whole stack matter to the business - but it only works as well as the foundations it’s built on.
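As one illustration of what “LLM-aware” load balancing can mean, here’s a sketch of a router that sends each request to the backend with the least outstanding token work instead of round-robin. The reservation heuristic (prompt tokens plus max new tokens) is a simplifying assumption; real routers also weigh prefix-cache affinity and KV-cache pressure.

```python
# Sketch of LLM-aware load balancing: route each request to the backend with
# the least estimated outstanding token work, rather than round-robin.
from dataclasses import dataclass, field

@dataclass
class TokenAwareBalancer:
    # backend URL -> estimated tokens currently reserved on that backend
    in_flight: dict[str, int] = field(default_factory=dict)

    def add_backend(self, url: str) -> None:
        self.in_flight.setdefault(url, 0)

    def pick(self, prompt_tokens: int, max_new_tokens: int) -> tuple[str, int]:
        """Choose the least-loaded backend and reserve capacity for this request."""
        url = min(self.in_flight, key=self.in_flight.get)
        reservation = prompt_tokens + max_new_tokens  # pessimistic token estimate
        self.in_flight[url] += reservation
        return url, reservation

    def complete(self, url: str, reservation: int) -> None:
        """Release the reservation once the request has finished streaming."""
        self.in_flight[url] -= reservation

# Usage:
# lb = TokenAwareBalancer()
# lb.add_backend("http://llm-0:8000"); lb.add_backend("http://llm-1:8000")
# url, res = lb.pick(prompt_tokens=1200, max_new_tokens=256)
# ... send the request to `url`, then:
# lb.complete(url, res)
```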
Conclusion
One of the biggest lessons? Everything is interconnected. Swap out a model, and it could ripple through your engine, your caching logic, even your orchestration setup. There are no isolated changes when you’re dealing with production AI infrastructure.
Over the coming weeks and months, the team and I will be sharing what we’ve learned in the Behind the Stack series - patterns, anti-patterns, architectural trade-offs, and examples from the field. If you’re trying to move from demo to deployment, or prototype to platform, we hope this gives you a clearer picture of what’s ahead - and a head start on what to build.
Behind the Stack Series
This post is the first in our Behind the Stack series, where we’ll be publishing deep dives each week on the real-world challenges and design decisions behind self-hosted inference platforms. As new episodes are released, we’ll link them all here - so you can follow along, revisit key topics, and explore the full picture as it unfolds. Check back each week to catch the latest installment.
- Episode 1: What Should I Be Observing in My LLM Stack?
- In Ep. 1 of our Behind the Stack series, Chief Scientist Jamie Dborin dives into observability and talks through the three key signals enterprise teams should be watching when productionizing their AI: KV cache utilization, sequence length, and user feedback loops.
- Episode 2: How Many Users Can My GPU Serve?
- In Ep. 2 of our Behind the Stack series, Chief Scientist Jamie Dborin breaks down how to calculate KV cache size from model and GPU specs, how many users you can realistically support at different context lengths, and strategies to increase that capacity - a back-of-the-envelope version of the calculation is sketched below. This is vital for anyone scaling up AI projects: it lets you plan deployments, budget capacity, and avoid invisible ceilings.
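As a teaser, here is a rough version of that sizing calculation in Python. The model configuration assumes a Llama-3-8B-style setup (32 layers, 8 KV heads via GQA, head dim 128, fp16 weights) and ignores activation and framework overhead, so treat the outputs as orders of magnitude rather than the episode’s exact figures.

```python
# Back-of-the-envelope KV-cache sizing. Model figures assume a
# Llama-3-8B-style configuration in fp16 and ignore activation/framework
# overhead - illustrative inputs only.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(f"{per_token / 1024:.0f} KiB per token")  # 128 KiB

gpu_memory_gib = 80   # e.g. a single 80 GB accelerator
weights_gib = 16      # ~8B params in fp16
kv_budget = (gpu_memory_gib - weights_gib) * 1024**3  # bytes left for KV cache

context_len = 8192
concurrent_users = kv_budget // (per_token * context_len)
print(f"~{concurrent_users} users at {context_len}-token context")  # roughly 64
```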