June 4, 2025

Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

Jamie Dborin

Introduction

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

It’s a question rooted in system design, not just intuition. While it's easy to watch GPU utilization or rely on batch size as a proxy, neither gives you a reliable measure of how far your hardware can actually stretch under real-world loads.

In this video, we break down the calculation that gives you a usable estimate of your system's capacity - grounded in memory constraints and model architecture. With just a few known quantities (model config, token usage, GPU size), you can forecast how many users your setup can realistically support, and how to grow that number.

GPU Memory: What's Actually Using It?

At inference time, your GPU memory gets divided among three major components:

  • Model Weights - a fixed chunk, based on parameter count and precision

  • Activations - temporary tensors created during forward passes (often small and engine-managed)

  • KV Cache - memory that stores every token currently active in the system

For real-time or multi-user workloads, the KV cache is often the limiting factor. It's what determines whether a new user’s request can be served without delay, regardless of what your GPU utilization says.
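As a quick sanity check, you can derive the per-token KV cache size directly from the model config. The sketch below assumes a Llama-3-8B-style architecture - 32 layers, 8 KV heads (GQA), head dimension 128, FP16 - so swap in your own config values.

```python
# Rough per-token KV cache size, derived from model config values.
# Assumed config: Llama-3-8B-style (32 layers, 8 KV heads, head dim 128, FP16).
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # The factor of 2 covers storing both a key and a value per head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token / 1e9)  # ~0.00013 GB per token
```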

The Core Calculation

Let’s say you’re running LLaMA 8B in FP16 on an 80GB A100. The numbers break down roughly as:

  • Model Weights ≈ 16GB (8B params × 2 bytes per param)

  • Remaining VRAM for KV cache ≈ 64GB

  • Each token uses ≈ 0.00013 GB (based on head size, KV heads, layers, and precision)

That gives you:

64 GB / 0.00013 GB per token ≈ 492,000 tokens total

Now, assume each user sends 8K tokens of input and expects a 2K token output:

492,000 tokens / 10,000 tokens per user ≈ 49 users

This gives you a rough upper bound on concurrent users - based entirely on memory, not compute throughput.
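If you want this as a reusable back-of-envelope script, the sketch below simply re-runs the same arithmetic - the constants are the worked-example values above, not numbers measured from a live engine.

```python
# Back-of-envelope concurrency estimate: Llama 8B in FP16 on one 80GB A100.
# Real engines also reserve memory for activations and overhead, so treat
# the result as an upper bound, not a guarantee.
gpu_vram_gb = 80.0
weights_gb = 8e9 * 2 / 1e9        # 8B params x 2 bytes per param (FP16) = 16 GB
kv_gb_per_token = 0.00013         # from head size, KV heads, layers, precision
tokens_per_user = 8_000 + 2_000   # 8K input + 2K output per request

kv_budget_gb = gpu_vram_gb - weights_gb        # ~64 GB left for the KV cache
total_tokens = kv_budget_gb / kv_gb_per_token  # ~492,000 tokens in flight
max_users = total_tokens / tokens_per_user     # ~49 concurrent users

print(f"{total_tokens:,.0f} tokens -> ~{max_users:.0f} concurrent users")
```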

Scaling That Number

Once you understand the math, there are three main ways to increase your capacity:

1. Quantize the Model (and/or the KV Cache)

Reducing model precision shrinks the memory footprint of weights - and can sometimes reduce KV cache size if supported by your inference engine.

KV cache quantization is less common in production but can double or quadruple token capacity if supported. The tradeoff is increased decoding latency unless fused dequantization kernels are available.
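As a rough illustration, suppose INT8 weights and an FP8 KV cache, each halving its FP16 footprint (an assumption - actual savings depend on the quantization scheme and engine support). Re-running the earlier estimate:

```python
# Same 80GB A100 estimate with quantization applied (illustrative assumption:
# INT8 weights and FP8 KV cache, each ~half their FP16 size).
gpu_vram_gb = 80.0
weights_gb = 16.0 / 2             # INT8 weights: ~8 GB instead of 16 GB
kv_gb_per_token = 0.00013 / 2     # FP8 KV cache: ~half the bytes per token
tokens_per_user = 10_000

total_tokens = (gpu_vram_gb - weights_gb) / kv_gb_per_token
print(f"~{total_tokens / tokens_per_user:.0f} concurrent users")  # ~111, up from ~49
```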

2. Increase Available VRAM

You can scale up or out:

  • Vertical scaling: Upgrade to higher VRAM GPUs (e.g., 24GB → 80GB → 128GB)

  • Horizontal scaling: Distribute the model across multiple GPUs using tensor parallelism or pipeline parallelism

More VRAM gives you a larger KV cache - and therefore more tokens to work with. Horizontal scaling introduces some duplication overhead and infrastructure complexity, but it’s often necessary at larger scale.
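To get a feel for the effect, here's the same estimate with the model sharded across two 80GB GPUs via tensor parallelism - a simplified sketch that ignores duplication and activation overhead.

```python
# Illustrative sketch: Llama 8B (FP16) sharded across two 80GB GPUs with
# tensor parallelism. Weights are split, so most of the extra VRAM becomes
# KV cache budget; duplication and activation overhead are ignored here.
n_gpus = 2
gpu_vram_gb = 80.0
weights_gb = 16.0                 # FP16 weights, split across the GPUs
kv_gb_per_token = 0.00013
tokens_per_user = 10_000

kv_budget_gb = n_gpus * gpu_vram_gb - weights_gb   # ~144 GB for the KV cache
total_tokens = kv_budget_gb / kv_gb_per_token      # ~1.1M tokens
print(f"~{total_tokens / tokens_per_user:.0f} concurrent users")  # ~111
```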

3. Offload the KV Cache

Some engines allow you to offload older KV layers to CPU or even disk, or keep only the last few layers on GPU. This can reduce GPU KV cache usage by 90%+.

The catch is latency. Unless your inference engine overlaps data movement with computation efficiently, you’ll see increased response times - so this is best used in workloads that prioritize token capacity over speed.
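To see where a 90%+ reduction can come from, here's a toy calculation assuming only the KV cache for the last 3 of 32 layers stays on the GPU (the layer split is an assumption for illustration, not a tuned setting):

```python
# Toy layer-offloading calculation: keep the KV cache for only the last
# k of n layers on the GPU, offloading the rest to CPU or disk.
n_layers, k_on_gpu = 32, 3          # assumed split, for illustration only
kv_gb_per_token = 0.00013

gpu_kv_per_token = kv_gb_per_token * k_on_gpu / n_layers
reduction = 1 - k_on_gpu / n_layers
print(f"GPU KV per token: {gpu_kv_per_token:.6f} GB ({reduction:.0%} reduction)")
# -> GPU KV per token: 0.000012 GB (91% reduction)
```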

Other Considerations

These calculations give you a strong first estimate, but real-world behavior varies depending on:

  • Inference engine - whether it supports paged attention, chunked prefill, quantized cache, etc.

  • Workload shape - are requests long, short, bursty, or streaming?

  • Fragmentation - fixed-size page allocation can leave some KV cache space unused

  • Decode behavior - token generation is more memory-bound than prefill, so decreasing the load won't necessarily improve response times

If you’re tuning for production, these second-order factors can shift your real limits by 10–30%.

Conclusion

If you’re self-hosting LLMs and need to hit concurrency or latency targets, it’s critical to move from intuition to calculation. With just a few inputs - model size, VRAM, context length - you can:

  • Estimate concurrency limits
  • Choose the right model precision
  • Plan upgrades and scaling strategies
  • Tune memory settings per engine

This is what determines whether you can serve 10 users - or 100.

