Behind the Stack: Episode 2 - How Many Users Can My GPU Serve?
Introduction
When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”
It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.
In this video, we break down the calculation that gives you a usable estimate of your system's capacity - grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU size), you can forecast how many users your setup can realistically support - and see how to grow that number.
GPU Memory: What's Actually Using It?
At inference time, your GPU memory gets divided among three major components:
- Model Weights - a fixed chunk, based on parameter count and precision
- Activations - temporary tensors created during forward passes (often small and engine-managed)
- KV Cache - memory that stores every token currently active in the system
For real-time or multi-user workloads, the KV cache is often the limiting factor. It's what determines whether a new user’s request can be served without delay, regardless of what your GPU utilization says.
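To make that split concrete, here is a minimal Python sketch that treats the GPU as a memory budget: weights come off the top, activations are treated as a small engine-managed margin (set to zero here, as in the walkthrough below), and whatever remains is the KV cache budget. The figures are purely illustrative.

```python
def vram_breakdown_gb(total_vram_gb, params_billions, bytes_per_param, activation_margin_gb=0.0):
    """Split GPU memory into weights, an activation margin, and the leftover KV cache budget."""
    weights_gb = params_billions * bytes_per_param            # e.g. 8B params x 2 bytes (FP16) = 16 GB
    kv_budget_gb = total_vram_gb - weights_gb - activation_margin_gb
    return weights_gb, kv_budget_gb

# Illustrative: an 8B-parameter model in FP16 on an 80 GB GPU
weights_gb, kv_budget_gb = vram_breakdown_gb(80, 8, 2)
print(f"weights ≈ {weights_gb:.0f} GB, KV cache budget ≈ {kv_budget_gb:.0f} GB")
```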
The Core Calculation
Let’s say you’re running LLaMA 8B in FP16 on an 80GB A100. The numbers break down roughly as:
- Model Weights ≈ 16GB (8B params × 2 bytes per param)
- Remaining VRAM for KV cache ≈ 64GB
- Each token uses ≈ 0.00013 GB (based on head size, KV heads, layers, and precision)
That gives you:
64 GB / 0.00013 GB per token ≈ 492,000 tokens total
Now, assume each user sends 8K tokens of input and expects a 2K token output:
492,000 tokens / 10,000 tokens per user ≈ 49 users
This gives you a rough upper bound on concurrent users - based entirely on memory, not compute throughput.
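Here is the same arithmetic as a short Python sketch. The config values (32 layers, 8 KV heads, head dimension 128, FP16 cache) are the assumed Llama-8B-style numbers behind the 0.00013 GB-per-token figure; substitute your own model's config. It lands on the same ballpark answer, with slightly different rounding.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Per-token KV cache cost: one key and one value vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # 2 = key + value

def concurrent_users(kv_budget_gb, per_token_bytes, tokens_per_user):
    """Total token capacity of the KV cache budget, and the users that capacity supports."""
    total_tokens = kv_budget_gb * 1e9 / per_token_bytes
    return int(total_tokens), int(total_tokens // tokens_per_user)

# Assumed Llama-8B-style config in FP16: 32 layers, 8 KV heads, head_dim 128
per_token_bytes = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)   # 131,072 bytes ≈ 0.00013 GB
total_tokens, users = concurrent_users(64, per_token_bytes, tokens_per_user=8_000 + 2_000)
print(f"≈ {total_tokens:,} tokens of KV cache ≈ {users} concurrent users")
```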
Scaling That Number
Once you understand the math, there are three main ways to increase your capacity:
1. Quantize the Model (and/or the KV Cache)
Reducing model precision shrinks the memory footprint of weights - and can sometimes reduce KV cache size if supported by your inference engine.
KV cache quantization is less common in production but can double or quadruple token capacity if supported. The tradeoff is increased decoding latency unless fused dequantization kernels are available.
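To see the memory effect in isolation, the sketch below reuses the helpers from the calculation above and assumes your engine supports an 8-bit KV cache (some engines expose this as a KV-cache dtype option); the latency tradeoff is not modeled at all.

```python
# FP16 KV cache (2 bytes per value) vs. an assumed 8-bit KV cache (1 byte per value)
per_token_fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_value=2)
per_token_int8 = kv_bytes_per_token(32, 8, 128, bytes_per_value=1)

_, users_fp16 = concurrent_users(64, per_token_fp16, tokens_per_user=10_000)
_, users_int8 = concurrent_users(64, per_token_int8, tokens_per_user=10_000)
print(f"FP16 KV cache: ~{users_fp16} users, 8-bit KV cache: ~{users_int8} users")   # roughly 2x
```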
2. Increase Available VRAM
You can scale up or out:
- Vertical scaling: Upgrade to higher VRAM GPUs (e.g., 24GB → 80GB → 128GB)
- Horizontal scaling: Distribute the model across multiple GPUs using tensor parallelism or pipeline parallelism
More VRAM gives you a larger KV cache - and therefore more tokens to work with. Horizontal scaling introduces some duplication overhead and infrastructure complexity, but it’s often necessary at larger scale.
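A rough sketch of the horizontal case, assuming weights and KV cache shard evenly under tensor parallelism and ignoring per-GPU engine overhead (which will eat into the gain in practice):

```python
def kv_budget_tensor_parallel_gb(n_gpus, vram_per_gpu_gb, weights_gb):
    """Total KV cache budget when the model is sharded across n_gpus with tensor parallelism."""
    # Weights are split across the GPUs, so they are paid once in aggregate;
    # real deployments also lose a few GB per GPU to activations and engine overhead.
    return n_gpus * vram_per_gpu_gb - weights_gb

single = kv_budget_tensor_parallel_gb(1, 80, 16)   # 64 GB  -> ~48 users in the earlier sketch
dual   = kv_budget_tensor_parallel_gb(2, 80, 16)   # 144 GB -> more than double the KV headroom
print(f"1 GPU: {single} GB of KV budget, 2 GPUs: {dual} GB")
```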
3. Offload the KV Cache
Some engines allow you to offload older KV layers to CPU or even disk, or keep only the last few layers on GPU. This can reduce GPU KV cache usage by 90%+.
The catch is latency. Unless your inference engine overlaps data movement with computation efficiently, you’ll see increased response times - so this is best used in workloads that prioritize token capacity over speed.
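The memory side of that tradeoff is easy to sketch, again reusing the helpers above. The 90% offload fraction is just an assumed figure, and the latency cost is not modeled:

```python
# Assume 90% of each token's KV cache lives in CPU memory or on disk,
# so only 10% of the per-token cost stays GPU-resident.
offload_fraction = 0.9
gpu_resident_bytes = per_token_bytes * (1 - offload_fraction)

_, users_offloaded = concurrent_users(64, gpu_resident_bytes, tokens_per_user=10_000)
print(f"~{users_offloaded} users fit in GPU memory (vs ~48 without offloading)")
```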
Other Considerations
These calculations give you a strong first estimate, but real-world behavior varies depending on:
- Inference engine - whether it supports paged attention, chunked prefill, quantized cache, etc.
- Workload shape - are requests long, short, bursty, or streaming?
- Fragmentation - fixed-size page allocation can leave some KV cache space unused
- Decode behavior - token generation is more memory-bound than prefill, so reducing the load won't necessarily improve response times
If you’re tuning for production, these second-order factors can shift your real limits by 10–30%.
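A simple way to fold those factors into the estimate is a flat headroom factor on the idealized number; the 0.85 below is just an assumed planning value (some engines expose a similar GPU-memory-utilization knob).

```python
def planning_estimate(ideal_users, headroom=0.85):
    """Discount the memory-only estimate for fragmentation, engine overhead, and bursty traffic."""
    return int(ideal_users * headroom)

print(planning_estimate(48))   # ~40 users is a safer target than the raw memory-only figure
```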
Conclusion
If you’re self-hosting LLMs and need to hit concurrency or latency targets, it’s critical to move from intuition to calculation. With just a few inputs - model size, VRAM, context length - you can:
- Estimate concurrency limits
- Choose the right model precision
- Plan upgrades and scaling strategies
- Tune memory settings per engine
This is what determines whether you can serve 10 users - or 100.