Introduction
If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.
This blog explains:
- Why traditional load balancing breaks for language models
- Why standard metrics like GPU utilization are misleading
- How KV cache utilization provides a better signal
- How to go further with prefix-aware routing to reduce latency
- What’s required to implement an LLM-aware balancer in practice
Why Traditional Load Balancing Falls Short
In typical backend systems, load balancers assume that each request costs roughly the same. This assumption holds when you're serving small, homogeneous workloads - but LLM inference isn’t that.
In practice, one user might send:
- A 5-token query: “What’s your name?”
- A 10,000-token retrieval-augmented generation (RAG) prompt with long context and multi-paragraph output
If both requests are routed blindly in round-robin order, you’ll overload one node while others sit mostly idle - leading to unpredictable latencies and wasted compute.
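To make that concrete, here's a toy comparison (the token counts are made up) between round-robin assignment and a simple token-aware assignment across two nodes:
# Toy comparison: round-robin vs. token-aware assignment across two nodes.
# Request sizes (in tokens) are made up to mimic mixed chat + RAG traffic.
requests = [5, 10_000, 8, 12_000, 7, 9_500]

round_robin = [0, 0]
token_aware = [0, 0]

for i, tokens in enumerate(requests):
    round_robin[i % 2] += tokens                                # alternate blindly
    token_aware[token_aware.index(min(token_aware))] += tokens  # send to lightest node

print(round_robin)  # [20, 31500]  - one node swamped, the other idle
print(token_aware)  # [12013, 19507] - far more even
Both nodes receive exactly three requests in each case - the imbalance is entirely in tokens, which is what actually drives memory use and latency.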
The Metrics That Mislead You
Most load balancers rely on system-level metrics like:
- CPU utilization
- GPU utilization (from nvidia-smi)
- VRAM usage
These are easy to collect but fail to reflect the actual load of LLM inference:
- CPU usage - inference is GPU-bound, so the CPU mostly idles
- GPU utilization - shows that the GPU is busy, not how much of its capacity is actually used
- VRAM usage - most memory is pre-allocated up front, so it always appears full
A GPU can report 95% utilization with just one small request in flight - while still having room for dozens more.
The Right Signal: KV Cache Utilization
Every request to an LLM allocates memory in the KV cache, which stores the attention keys and values for each token processed during inference. This cache grows with:
- Input + output length (in tokens)
- Model size (number of layers/heads)
- Precision (e.g., FP16 vs INT8)
Example:
Say you’re running a 7B model on an 80 GB A100. After allocating space for the model weights, maybe 64 GB is left for the KV cache.
If each token uses roughly 0.00013 GB (~130 KB) of memory:
64 GB / 0.00013 GB per token ≈ 492,000 tokens available
A user sending 8K input tokens and expecting 2K output consumes 10K tokens - meaning your GPU can handle ~49 users at once. KV cache usage is a much more direct reflection of actual system load than GPU utilization (check out my previous blog on this).
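The same back-of-the-envelope arithmetic as a tiny helper - the 0.00013 GB-per-token figure is just the assumption from this example, so swap in whatever your model and precision actually cost:
# Back-of-the-envelope capacity check. per_token_gb is the assumption from
# the example above and changes with model size, attention config, and precision.
def max_concurrent_users(free_vram_gb, tokens_per_user, per_token_gb=0.00013):
    capacity_tokens = free_vram_gb / per_token_gb   # ~492,000 tokens for 64 GB
    return int(capacity_tokens // tokens_per_user)

print(max_concurrent_users(free_vram_gb=64, tokens_per_user=8_000 + 2_000))  # 49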
Load Balancing with KV Cache Metrics
Modern inference engines can expose real-time KV cache usage:
{"capacity_tokens": 500000,"used_tokens": 350000, "utilization": 0.70}
A smarter load balancer uses this metric to:
- Route requests to the least loaded node (based on token capacity)
- Avoid sending requests to saturated replicas
- Predict when to shed load or autoscale
This ensures more even distribution and reduces queueing.
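A minimal version of that routing decision might look like the sketch below. It assumes every replica reports metrics in the JSON shape above; the field names and the 0.9 saturation cutoff are illustrative, not any particular engine's API.
# Minimal routing sketch, assuming each replica reports metrics in the
# shape shown above. Field names and the 0.9 saturation threshold are
# illustrative choices, not a specific engine's interface.
def pick_replica(replica_metrics, request_tokens, saturation=0.9):
    best, best_free = None, -1
    for name, m in replica_metrics.items():
        free_tokens = m["capacity_tokens"] - m["used_tokens"]
        if m["utilization"] >= saturation or free_tokens < request_tokens:
            continue  # skip saturated replicas entirely
        if free_tokens > best_free:
            best, best_free = name, free_tokens
    return best  # None means queue, shed load, or scale out

replicas = {
    "node-a": {"capacity_tokens": 500_000, "used_tokens": 350_000, "utilization": 0.70},
    "node-b": {"capacity_tokens": 500_000, "used_tokens": 120_000, "utilization": 0.24},
}
print(pick_replica(replicas, request_tokens=10_000))  # node-b
When pick_replica returns None, that's your signal to queue the request, shed load, or add a replica.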
Going Further: Prefix-Aware Routing
Many production systems prepend a long, shared system prompt to every request - explaining task rules, tools, or formatting instructions.
If multiple users share that same prefix, a sophisticated engine can reuse that data in memory, avoiding redundant computation.
How the Balancer Can Help
If the load balancer inspects the incoming request’s prefix, it can:
- Compare it to cached prefixes on each replica
- Route to the node with the most overlap
- Save both compute and KV cache space
This is especially powerful in high-volume applications with long static prompts.
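One way to approximate this - and it's only a sketch - is to hash fixed-size chunks of the prompt prefix and route to the replica whose set of cached chunk hashes overlaps the most. The chunk size, the hash, and the idea that each replica can advertise a set of cached chunks are all assumptions here, not a specific engine's interface.
import hashlib

# Illustrative prefix affinity: hash fixed-size chunks of the prompt and pick
# the replica whose advertised cached-chunk set overlaps most. Real engines
# typically hash token blocks, not characters; this just shows the shape of the idea.
def prefix_chunks(prompt, chunk_chars=256):
    return {
        hashlib.sha1(prompt[i:i + chunk_chars].encode()).hexdigest()
        for i in range(0, len(prompt), chunk_chars)
    }

def pick_by_prefix(prompt, replica_cached_chunks):
    wanted = prefix_chunks(prompt)
    return max(replica_cached_chunks, key=lambda r: len(wanted & replica_cached_chunks[r]))
In practice you'd tokenize before hashing and combine this with the KV capacity check above, so prefix affinity never wins over a saturated replica.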
Architecture Overview
To implement this kind of LLM-aware routing, you need two key components:
1. Inference Engine
The engine should expose:
- Live KV cache usage per model replica
- (Optional) Active prefix fingerprints for overlap detection
Examples of engines that support this (or can be extended to):
- Doubleword Inference Stack
- TensorRT-LLM
- vLLM (partial support)
- SGLang (limited support)
2. Load Balancer
Your routing layer (Envoy, custom proxy, or orchestration logic) should:
- Poll these metrics regularly
- Route requests based on available token capacity
- Optionally, include logic for prefix matching and routing by overlap
You may need a sidecar or middleware service to perform this logic outside of traditional load balancer frameworks.
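As a rough sketch, the sidecar's polling half could be as simple as this - the /kv_metrics path and the replica addresses are hypothetical, so point it at whatever metrics endpoint your engine actually exposes:
import json
import time
import urllib.request

# Hypothetical replica addresses and metrics path; substitute your own.
REPLICAS = ["http://node-a:8000", "http://node-b:8000"]
latest_metrics = {}

def poll_once(timeout=1.0):
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(f"{base}/kv_metrics", timeout=timeout) as resp:
                latest_metrics[base] = json.loads(resp.read())
        except OSError:
            latest_metrics.pop(base, None)  # unreachable replicas drop out of rotation

while True:
    poll_once()
    time.sleep(2)  # poll interval; tune to your traffic and latency budget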
Bonus: Estimating Cost Without Metrics
If your engine doesn’t expose KV metrics, you can still build a basic heuristic:
def estimated_token_cost(input_len, output_len, layers=32):
    # Rough proxy for KV cache pressure: every token is cached once per layer.
    return (input_len + output_len) * layers
While not as accurate as tracking real memory usage, this gives a rough proxy for balancing token-heavy vs. token-light requests.
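A hypothetical way to use it: keep a running per-node total and route each request to the smallest one (the node names are placeholders, and a real balancer would decrement the totals as requests finish):
# Hypothetical usage: greedy routing on estimated in-flight cost.
in_flight = {"node-a": 0, "node-b": 0}

def route(input_len, output_len):
    node = min(in_flight, key=in_flight.get)
    in_flight[node] += estimated_token_cost(input_len, output_len)
    return node

print(route(8_000, 2_000))  # heavy RAG request -> node-a
print(route(5, 50))         # light chat request -> node-b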
Beyond Load Balancing: Autoscaling and Admission Control
Once you have reliable metrics for actual model load, you can drive other systems off them:
- Autoscaling based on KV cache pressure
- Request queuing or admission control to preserve latency SLAs
- Rate limiting when capacity is tight
This transforms your inference deployment from static infrastructure to a responsive, data-driven system.
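A simple version of that control loop, reusing the same per-replica metrics - the 0.8 and 0.95 thresholds here are illustrative defaults, not recommendations:
# Threshold sketch: drive admission control and autoscaling off aggregate
# KV cache utilization. The 0.8/0.95 thresholds are illustrative defaults.
def control_decision(replica_metrics, scale_up_at=0.8, reject_at=0.95):
    used = sum(m["used_tokens"] for m in replica_metrics.values())
    capacity = sum(m["capacity_tokens"] for m in replica_metrics.values())
    utilization = used / capacity
    return {
        "utilization": round(utilization, 2),
        "scale_up": utilization >= scale_up_at,         # add a replica
        "admit_new_requests": utilization < reject_at,  # else queue or rate-limit
    }

print(control_decision({
    "node-a": {"capacity_tokens": 500_000, "used_tokens": 480_000},
    "node-b": {"capacity_tokens": 500_000, "used_tokens": 410_000},
}))
# {'utilization': 0.89, 'scale_up': True, 'admit_new_requests': True}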
Conclusion
Serving LLMs at scale requires more than just spinning up more replicas. To maintain performance and cost-efficiency, your load balancing logic needs to be:
- Token-aware, not just connection-aware
- Memory-aware, not just utilization-aware
- Prefix-aware, when sharing context is possible
KV cache utilization gives you a high-fidelity signal for real system load - and using it unlocks smarter orchestration, better concurrency, and lower latency across the board.