Introduction
If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.
This blog explains:
- Why traditional load balancing breaks for language models
- Why standard metrics like GPU utilization are misleading
- How KV cache utilization provides a better signal
- How to go further with prefix-aware routing to reduce latency
- What’s required to implement an LLM-aware balancer in practice
Why Traditional Load Balancing Falls Short
In typical backend systems, load balancers assume that each request costs roughly the same. This assumption holds when you're serving small, homogeneous workloads - but LLM inference isn’t that.
In practice, one user might send:
- A 5-token query: “What’s your name?”
- A 10,000-token retrieval-augmented generation (RAG) prompt with long context and multi-paragraph output
If both requests are routed blindly in round-robin order, you’ll overload one node while others sit mostly idle - leading to unpredictable latencies and wasted compute.
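To make that concrete, here's a toy comparison (the token counts are made up) between round-robin assignment and a simple token-aware assignment across two nodes:
# Toy comparison: round-robin vs. token-aware assignment across two nodes.
# Request sizes (in tokens) are made up to mimic mixed chat + RAG traffic.
requests = [5, 10_000, 8, 12_000, 7, 9_500]

round_robin = [0, 0]
token_aware = [0, 0]

for i, tokens in enumerate(requests):
    round_robin[i % 2] += tokens                                # alternate blindly
    token_aware[token_aware.index(min(token_aware))] += tokens  # send to lightest node

print(round_robin)  # [20, 31500]  - one node swamped, the other idle
print(token_aware)  # [12013, 19507] - far more even
Both nodes receive exactly three requests in each case - the imbalance is entirely in tokens, which is what actually drives memory use and latency.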
The Metrics That Mislead You
Most load balancers rely on system-level metrics like:
- CPU utilization
- GPU utilization (from nvidia-smi)
- VRAM usage
These are easy to collect but fail to reflect the actual load of LLM inference:
- CPU usage - inference is GPU-bound, so the CPU mostly idles
- GPU utilization - shows that the GPU is busy, not how much of its capacity is actually used
- VRAM usage - most memory is pre-allocated up front, so it always appears full
A GPU can report 95% utilization with just one small request in flight - while still having room for dozens more.
The Right Signal: KV Cache Utilization
Every request to an LLM allocates memory in the KV cache, which stores the attention keys and values for each token processed during inference. This cache grows with:
- Input + output length (in tokens)
- Model size (number of layers/heads)
- Precision (e.g., FP16 vs INT8)
Example:
Say you’re running a 7B model on an 80 GB A100. After allocating space for the model weights, maybe 64 GB is left for the KV cache.
If each token uses roughly 0.00013 GB (~130 KB) of memory:
64 GB / 0.00013 GB per token ≈ 492,000 tokens available
A user sending 8K input tokens and expecting 2K output consumes 10K tokens - meaning your GPU can handle ~49 users at once. KV cache usage is a much more direct reflection of actual system load than GPU utilization (check out my previous blog on this).
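The same back-of-the-envelope arithmetic as a tiny helper - the 0.00013 GB-per-token figure is just the assumption from this example, so swap in whatever your model and precision actually cost:
# Back-of-the-envelope capacity check. per_token_gb is the assumption from
# the example above and changes with model size, attention config, and precision.
def max_concurrent_users(free_vram_gb, tokens_per_user, per_token_gb=0.00013):
    capacity_tokens = free_vram_gb / per_token_gb   # ~492,000 tokens for 64 GB
    return int(capacity_tokens // tokens_per_user)

print(max_concurrent_users(free_vram_gb=64, tokens_per_user=8_000 + 2_000))  # 49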
Load Balancing with KV Cache Metrics
Modern inference engines can expose real-time KV cache usage:
{"capacity_tokens": 500000,"used_tokens": 350000, "utilization": 0.70}
A smarter load balancer uses this metric to:
- Route requests to the least loaded node (based on token capacity)
- Avoid sending requests to saturated replicas
- Predict when to shed load or autoscale
This ensures more even distribution and reduces queueing.
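A minimal version of that routing decision might look like the sketch below. It assumes every replica reports metrics in the JSON shape above; the field names and the 0.9 saturation cutoff are illustrative, not any particular engine's API.
# Minimal routing sketch, assuming each replica reports metrics in the
# shape shown above. Field names and the 0.9 saturation threshold are
# illustrative choices, not a specific engine's interface.
def pick_replica(replica_metrics, request_tokens, saturation=0.9):
    best, best_free = None, -1
    for name, m in replica_metrics.items():
        free_tokens = m["capacity_tokens"] - m["used_tokens"]
        if m["utilization"] >= saturation or free_tokens < request_tokens:
            continue  # skip saturated replicas entirely
        if free_tokens > best_free:
            best, best_free = name, free_tokens
    return best  # None means queue, shed load, or scale out

replicas = {
    "node-a": {"capacity_tokens": 500_000, "used_tokens": 350_000, "utilization": 0.70},
    "node-b": {"capacity_tokens": 500_000, "used_tokens": 120_000, "utilization": 0.24},
}
print(pick_replica(replicas, request_tokens=10_000))  # node-b
When pick_replica returns None, that's your signal to queue the request, shed load, or add a replica.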
Going Further: Prefix-Aware Routing
Many production systems prepend a long, shared system prompt to every request - explaining task rules, tools, or formatting instructions.
If multiple users share that same prefix, a sophisticated engine can reuse that data in memory, avoiding redundant computation.
How the Balancer Can Help
If the load balancer inspects the incoming request’s prefix, it can:
- Compare it to cached prefixes on each replica
- Route to the node with the most overlap
- Save both compute and KV cache space
This is especially powerful in high-volume applications with long static prompts.
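One way to approximate this - and it's only a sketch - is to hash fixed-size chunks of the prompt prefix and route to the replica whose set of cached chunk hashes overlaps the most. The chunk size, the hash, and the idea that each replica can advertise a set of cached chunks are all assumptions here, not a specific engine's interface.
import hashlib

# Illustrative prefix affinity: hash fixed-size chunks of the prompt and pick
# the replica whose advertised cached-chunk set overlaps most. Real engines
# typically hash token blocks, not characters; this just shows the shape of the idea.
def prefix_chunks(prompt, chunk_chars=256):
    return {
        hashlib.sha1(prompt[i:i + chunk_chars].encode()).hexdigest()
        for i in range(0, len(prompt), chunk_chars)
    }

def pick_by_prefix(prompt, replica_cached_chunks):
    wanted = prefix_chunks(prompt)
    return max(replica_cached_chunks, key=lambda r: len(wanted & replica_cached_chunks[r]))
In practice you'd tokenize before hashing and combine this with the KV capacity check above, so prefix affinity never wins over a saturated replica.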
Architecture Overview
To implement this kind of LLM-aware routing, you need two key components:
1. Inference Engine
The engine should expose:
- Live KV cache usage per model replica
- (Optional) Active prefix fingerprints for overlap detection
Examples of engines that support this (or can be extended to):
- Doubleword Inference Stack
- TensorRT-LLM
- vLLM (partial support)
- SGLang (limited support)
2. Load Balancer
Your routing layer (Envoy, custom proxy, or orchestration logic) should:
- Poll these metrics regularly
- Route requests based on available token capacity
- Optionally, include logic for prefix matching and routing by overlap
You may need a sidecar or middleware service to perform this logic outside of traditional load balancer frameworks.
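As a rough sketch, the sidecar's polling half could be as simple as this - the /kv_metrics path and the replica addresses are hypothetical, so point it at whatever metrics endpoint your engine actually exposes:
import json
import time
import urllib.request

# Hypothetical replica addresses and metrics path; substitute your own.
REPLICAS = ["http://node-a:8000", "http://node-b:8000"]
latest_metrics = {}

def poll_once(timeout=1.0):
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(f"{base}/kv_metrics", timeout=timeout) as resp:
                latest_metrics[base] = json.loads(resp.read())
        except OSError:
            latest_metrics.pop(base, None)  # unreachable replicas drop out of rotation

while True:
    poll_once()
    time.sleep(2)  # poll interval; tune to your traffic and latency budget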
Bonus: Estimating Cost Without Metrics
If your engine doesn’t expose KV metrics, you can still build a basic heuristic:
def estimated_token_cost(input_len, output_len, layers=32):
    # Rough proxy for KV cache pressure: every token is cached once per layer.
    return (input_len + output_len) * layers
While not as accurate as tracking real memory usage, this gives a rough proxy for balancing token-heavy vs. token-light requests.
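A hypothetical way to use it: keep a running per-node total and route each request to the smallest one (the node names are placeholders, and a real balancer would decrement the totals as requests finish):
# Hypothetical usage: greedy routing on estimated in-flight cost.
in_flight = {"node-a": 0, "node-b": 0}

def route(input_len, output_len):
    node = min(in_flight, key=in_flight.get)
    in_flight[node] += estimated_token_cost(input_len, output_len)
    return node

print(route(8_000, 2_000))  # heavy RAG request -> node-a
print(route(5, 50))         # light chat request -> node-b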
Beyond Load Balancing: Autoscaling and Admission Control
Once you have reliable metrics for actual model load, you can drive other systems off them:
- Autoscaling based on KV cache pressure
- Request queuing or admission control to preserve latency SLAs
- Rate limiting when capacity is tight
This transforms your inference deployment from static infrastructure to a responsive, data-driven system.
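A simple version of that control loop, reusing the same per-replica metrics - the 0.8 and 0.95 thresholds here are illustrative defaults, not recommendations:
# Threshold sketch: drive admission control and autoscaling off aggregate
# KV cache utilization. The 0.8/0.95 thresholds are illustrative defaults.
def control_decision(replica_metrics, scale_up_at=0.8, reject_at=0.95):
    used = sum(m["used_tokens"] for m in replica_metrics.values())
    capacity = sum(m["capacity_tokens"] for m in replica_metrics.values())
    utilization = used / capacity
    return {
        "utilization": round(utilization, 2),
        "scale_up": utilization >= scale_up_at,         # add a replica
        "admit_new_requests": utilization < reject_at,  # else queue or rate-limit
    }

print(control_decision({
    "node-a": {"capacity_tokens": 500_000, "used_tokens": 480_000},
    "node-b": {"capacity_tokens": 500_000, "used_tokens": 410_000},
}))
# {'utilization': 0.89, 'scale_up': True, 'admit_new_requests': True}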
Conclusion
Serving LLMs at scale requires more than just spinning up more replicas. To maintain performance and cost-efficiency, your load balancing logic needs to be:
- Token-aware, not just connection-aware
- Memory-aware, not just utilization-aware
- Prefix-aware, when sharing context is possible
KV cache utilization gives you a high-fidelity signal for real system load - and using it unlocks smarter orchestration, better concurrency, and lower latency across the board.