June 18, 2025

Behind the Stack, Ep 4: Making Your Load Balancer LLM-Aware

Jamie Dborin
Introduction

If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.

This blog explains:

  • Why traditional load balancing breaks for language models
  • Why standard metrics like GPU utilization are misleading
  • How KV cache utilization provides a better signal
  • How to go further with prefix-aware routing to reduce latency
  • What’s required to implement an LLM-aware balancer in practice
Why Traditional Load Balancing Falls Short

In typical backend systems, load balancers assume that each request costs roughly the same. This assumption holds when you’re serving small, homogeneous workloads - but LLM inference is neither small nor homogeneous.

In practice, one user might send:

  • A 5-token query: “What’s your name?”
  • A 10,000-token retrieval-augmented generation (RAG) prompt with long context and multi-paragraph output

If both requests are routed blindly in round-robin order, you’ll overload one node while others sit mostly idle - leading to unpredictable latencies and wasted compute.

The Metrics That Mislead You

Most load balancers rely on system-level metrics like:

  • CPU utilization
  • GPU utilization (from nvidia-smi)
  • VRAM usage

These are easy to collect but fail to reflect the actual load of LLM inference:

  • CPU usage - inference is GPU-bound; the CPU mostly idles
  • GPU utilization - shows that the GPU is busy, not how much of its capacity is in use
  • VRAM usage - most memory is pre-allocated, always appears full

A GPU can report 95% utilization with just one small request in flight - while still having room for dozens more.

The Right Signal: KV Cache Utilization

Every request to an LLM allocates memory in the KV cache, which stores the attention keys and values for every token processed during inference. This cache grows with:

  • Input + output length (in tokens)
  • Model size (number of layers/heads)
  • Precision (e.g., FP16 vs INT8)
Example: say you’re running a 7B model on an 80GB A100. After allocating space for the model weights, maybe 64GB is left for the KV cache.

If each token uses ~0.00013GB of memory, that gives 64 / 0.00013 ≈ 492,000 tokens of capacity.

A user sending 8K input tokens and expecting 2K output consumes 10K tokens - meaning your GPU can handle ~49 such users at once. KV cache usage is a much more direct reflection of actual system load than GPU utilization (check out my previous blog on this).
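
Spelled out in code, the same back-of-the-envelope math looks like this (a minimal sketch using the rough figures assumed above, not measured values):

# Figures from the example above - rough assumptions, not measurements.
kv_budget_gb = 64            # memory left for the KV cache after loading weights
gb_per_token = 0.00013       # approximate KV cache cost per token

capacity_tokens = kv_budget_gb / gb_per_token           # ~492,000 tokens
tokens_per_request = 8_000 + 2_000                      # 8K input + 2K expected output
max_concurrent = capacity_tokens // tokens_per_request  # ~49 requests in flight

print(f"{capacity_tokens:,.0f} tokens of KV capacity -> ~{max_concurrent:.0f} concurrent users")
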
Load Balancing with KV Cache Metrics

Modern inference engines can expose real-time KV cache usage:

{"capacity_tokens": 500000,"used_tokens": 350000, "utilization": 0.70}

A smarter load balancer uses this metric to:

  • Route requests to the least loaded node (based on token capacity)
  • Avoid sending requests to saturated replicas
  • Predict when to shed load or autoscale

This ensures more even distribution and reduces queueing.
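
To make that concrete, here’s a minimal routing sketch against metrics shaped like the JSON above - the replica addresses and the /kv_metrics endpoint are placeholders for whatever your engine actually exposes:

import requests  # assumes each replica serves the JSON metrics shown above over HTTP

REPLICAS = ["http://llm-0:8000", "http://llm-1:8000"]  # placeholder addresses

def pick_replica(estimated_tokens: int) -> str | None:
    """Route to the replica with the most free KV cache capacity that can still
    fit the request; return None if every replica is saturated."""
    best_url, best_free = None, -1
    for url in REPLICAS:
        metrics = requests.get(f"{url}/kv_metrics", timeout=0.5).json()  # placeholder endpoint
        free = metrics["capacity_tokens"] - metrics["used_tokens"]
        if free >= estimated_tokens and free > best_free:
            best_url, best_free = url, free
    return best_url  # None -> queue, shed load, or scale out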

Going Further: Prefix-Aware Routing

Many production systems prepend a long, shared system prompt to every request - explaining task rules, tools, or formatting instructions.

If multiple users share that same prefix, a sophisticated engine can reuse the prefix’s KV cache entries across requests, avoiding redundant computation.

How the Balancer Can Help

If the load balancer inspects the incoming request’s prefix, it can:

  1. Compare it to cached prefixes on each replica
  2. Route to the node with the most overlap
  3. Save both compute and KV cache space

This is especially powerful in high-volume applications with long static prompts.
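
As a simplified illustration, the router can fingerprint the leading chunk of each prompt and prefer a replica that already holds it. Real engines hash fixed-size token blocks and score partial overlap; the fingerprint length and the replica_prefixes table below are assumptions made for the sketch.

import hashlib

# Illustrative view of which prefix fingerprints each replica currently holds in
# its KV cache - in practice reported by the engine or tracked by the router.
replica_prefixes: dict[str, set[str]] = {
    "http://llm-0:8000": set(),
    "http://llm-1:8000": set(),
}

def prefix_fingerprint(prompt: str, prefix_chars: int = 2048) -> str:
    """Hash the leading chunk of the prompt (e.g. a shared system prompt)."""
    return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

def route_by_prefix(prompt: str, fallback: str) -> str:
    fp = prefix_fingerprint(prompt)
    for url, cached in replica_prefixes.items():
        if fp in cached:
            return url                   # warm KV cache: reuse the shared prefix
    replica_prefixes[fallback].add(fp)   # remember where this prefix will now live
    return fallback                      # no overlap: fall back to capacity-based routing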

Architecture Overview

To implement this kind of LLM-aware routing, you need two key components:

1. Inference Engine

The engine should expose:

  • Live KV cache usage per model replica
  • (Optional) Active prefix fingerprints for overlap detection

Examples of engines that support this (or can be extended to do so):

  • Doubleword Inference Stack
  • TensorRT-LLM
  • vLLM (partial support)
  • SGLang (limited support)

2. Load Balancer

Your routing layer (Envoy, custom proxy, or orchestration logic) should:

  • Poll these metrics regularly
  • Route requests based on available token capacity
  • Optionally, include logic for prefix matching and routing by overlap

You may need a sidecar or middleware service to perform this logic outside of traditional load balancer frameworks.
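
For example, a small sidecar can keep a routing table fresh by polling each replica - a sketch under the same assumptions as above (placeholder addresses, hypothetical /kv_metrics endpoint):

import threading
import time

import requests  # assumes replicas expose KV metrics over HTTP, as sketched earlier

REPLICAS = ["http://llm-0:8000", "http://llm-1:8000"]  # placeholder addresses
kv_utilization: dict[str, float] = {}   # replica URL -> last observed KV utilization

def poll_metrics(interval_s: float = 1.0) -> None:
    """Sidecar loop: refresh every replica's KV utilization on a fixed interval."""
    while True:
        for url in REPLICAS:
            try:
                metrics = requests.get(f"{url}/kv_metrics", timeout=0.5).json()
                kv_utilization[url] = metrics["utilization"]
            except requests.RequestException:
                kv_utilization[url] = 1.0  # treat unreachable replicas as saturated
        time.sleep(interval_s)

threading.Thread(target=poll_metrics, daemon=True).start()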

Bonus: Estimating Cost Without Metrics

If your engine doesn’t expose KV metrics, you can still build a basic heuristic:

def estimated_token_cost(input_len, output_len, layers=32):
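    # Rough proxy for KV footprint: total tokens scaled by model depth (layer count).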
    return (input_len + output_len) * layers

While not as accurate as tracking real memory usage, this gives a rough proxy for balancing token-heavy vs. token-light requests.

Beyond Load Balancing: Autoscaling and Admission Control

Once you have reliable metrics for actual model load, you can drive other systems off them:

  • Autoscaling based on KV cache pressure
  • Request queuing or admission control to preserve latency SLAs
  • Rate limiting when capacity is tight

This transforms your inference deployment from static infrastructure to a responsive, data-driven system.
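
As a sketch, the same utilization signal can drive simple threshold logic - the numbers below are illustrative, not recommendations:

SCALE_UP_ABOVE = 0.80   # fleet-wide KV utilization that should trigger autoscaling
ADMIT_BELOW = 0.95      # past this, queue or reject new requests to protect latency SLAs

def control_signals(per_replica_utilization: list[float]) -> dict[str, bool]:
    """Turn per-replica KV utilization into autoscaling and admission decisions."""
    fleet_avg = sum(per_replica_utilization) / len(per_replica_utilization)
    return {
        "scale_up": fleet_avg > SCALE_UP_ABOVE,
        "admit_new_requests": min(per_replica_utilization) < ADMIT_BELOW,
    }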

Conclusion

Serving LLMs at scale requires more than just spinning up more replicas. To maintain performance and cost-efficiency, your load balancing logic needs to be:

  • Token-aware, not just connection-aware
  • Memory-aware, not just utilization-aware
  • Prefix-aware, when sharing context is possible

KV cache utilization gives you a high-fidelity signal for real system load - and using it unlocks smarter orchestration, better concurrency, and lower latency across the board.
