Behind the Stack: Episode 1 - What Should I be Observing in my LLM Stack?
Introduction
It’s easy to default to GPU or CPU utilization to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workloads and fall short in LLM deployments. They can look perfectly healthy while your model silently hits capacity, leading to missed scaling signals and degraded performance.
What Should I be Observing?
KV Cache
The real lever? KV Cache Utilization (if you’re unsure what KV Cache is, check out this blog from Hamish Hall where he breaks it down). It captures what matters: how many tokens are actively being processed across requests. It scales with context length and generation, not just batch count. When the cache is full, your system’s full - no matter what nvidia-smi tells you. Unlike traditional metrics, KV cache usage directly reflects whether a new request can be processed without delay. If there isn’t enough space in the cache for a user’s token load, the request has to wait. This is the constraint that actually determines throughput.
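To make “full” concrete, here’s a minimal back-of-the-envelope sketch (in Python, with illustrative model parameters rather than any particular deployment) of how many tokens a given memory budget can actually hold in the KV cache:

```python
# Back-of-the-envelope KV cache capacity estimate.
# All parameters below are illustrative assumptions - substitute your
# own model config and GPU memory budget.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache consumed by one token: keys + values,
    stored per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: a hypothetical 8B-class model with grouped-query attention.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Memory left for the KV cache after weights and activations (assumed: 20 GB).
cache_budget_bytes = 20 * 1024**3

max_cached_tokens = cache_budget_bytes // per_token
print(f"~{per_token} bytes/token -> roughly {max_cached_tokens:,} tokens in cache")
```

Divide that token budget across your typical concurrent requests and sequence lengths and you get a realistic ceiling on concurrency, independent of how busy the GPU looks.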
By contrast, GPU utilization - as reported by tools like nvidia-smi - only reflects whether something is running on the GPU at a given time slice. A single batch-size-one request can show high GPU utilization, even if your hardware is massively underutilized in terms of throughput capacity. That’s why KV cache usage is the far more actionable signal.
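In practice you don’t need to compute this by hand - most serving engines export the number directly. As a rough sketch, assuming a vLLM-style deployment that exposes Prometheus metrics (the endpoint, gauge name, and threshold below are assumptions to adapt to your setup), you could watch cache utilization like this:

```python
# Poll the serving engine's Prometheus endpoint for KV cache utilization.
# The metric name and URL are assumptions based on a vLLM-style deployment;
# adjust them to whatever your engine actually exposes.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"   # assumed endpoint
CACHE_METRIC = "vllm:gpu_cache_usage_perc"      # assumed gauge name
ALERT_THRESHOLD = 0.90                          # assumed alerting threshold

def kv_cache_usage() -> float | None:
    """Return KV cache usage as a fraction (0.0-1.0), or None if not found."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    pattern = rf"^{re.escape(CACHE_METRIC)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)"
    match = re.search(pattern, body, re.MULTILINE)
    return float(match.group(1)) if match else None

while True:
    usage = kv_cache_usage()
    if usage is not None and usage >= ALERT_THRESHOLD:
        print(f"KV cache at {usage:.0%} - new requests will start queuing")
    time.sleep(15)
```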
Sequence Length
Sequence length stats are another often-missed insight. Engineers spend time fine-tuning models, but without knowing typical input/output lengths, they’re guessing at optimal configuration. These stats affect not just performance but core architectural choices: for instance, long input sequences may favor models with windowed attention or linear attention mechanisms, while generation-heavy workloads benefit from memory-efficient decoding optimizations.
Sequence lengths also inform how you allocate resources between prefill and decode stages. Prefill is compute-bound, while decode is memory-bandwidth-bound. Get this balance wrong, and you’re wasting capacity where it matters most. Tools like NVIDIA’s Dynamo allow you to split prefill and decode workers - but only if you’ve measured your workload ratios accurately.
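Getting those ratios is mostly a matter of logging token counts per request. As a sketch, assuming a JSON-lines request log with hypothetical prompt_tokens and completion_tokens fields, you can pull out the distributions that matter:

```python
# Summarise input/output token lengths from request logs.
# The log format (JSON lines with "prompt_tokens" / "completion_tokens")
# is a hypothetical example - map it onto whatever your gateway records.
import json
import statistics

def summarise(path: str) -> None:
    prompt_lens, output_lens = [], []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            prompt_lens.append(record["prompt_tokens"])
            output_lens.append(record["completion_tokens"])

    def pct(data, p):
        # p-th percentile from 99 cut points
        return statistics.quantiles(data, n=100)[p - 1]

    print(f"prompt  p50={pct(prompt_lens, 50):.0f}  p95={pct(prompt_lens, 95):.0f}")
    print(f"output  p50={pct(output_lens, 50):.0f}  p95={pct(output_lens, 95):.0f}")

    # Rough prefill:decode token ratio - a first input when sizing
    # disaggregated prefill/decode workers (e.g. with NVIDIA Dynamo).
    ratio = sum(prompt_lens) / max(sum(output_lens), 1)
    print(f"prefill:decode token ratio ~ {ratio:.1f}:1")

summarise("requests.jsonl")  # hypothetical log file
```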
User Feedback
And then there’s user feedback. It’s not a nice-to-have - it’s the ground truth. Metrics like thumbs up/down, continuation rates, or even copy/paste behavior offer real signals of model quality. This feedback is critical not only for evaluating existing deployments, but for training better models over time - through supervised fine-tuning, DPO (Direct Preference Optimization), or full reinforcement learning workflows. It's also the only way to answer cost-impacting questions like "Can we switch to an 8B quantized model without degrading UX?"
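A lightweight way to make that question answerable is to log feedback events tagged with the serving model variant and compare acceptance rates. The sketch below uses an illustrative event schema and made-up variant names - treat it as a starting point, not a fixed API:

```python
# Aggregate explicit feedback signals per model variant so that questions
# like "can we switch to an 8B quantized model?" become measurable.
from collections import defaultdict

# Hypothetical event stream: (model_variant, signal)
feedback_events = [
    ("llama-70b-fp16", "thumbs_up"),
    ("llama-8b-int4", "thumbs_up"),
    ("llama-8b-int4", "thumbs_down"),
    ("llama-70b-fp16", "copied"),
]

POSITIVE = {"thumbs_up", "copied"}  # which signals count as acceptance

def acceptance_rates(events):
    """Fraction of positive signals per model variant."""
    totals, positives = defaultdict(int), defaultdict(int)
    for variant, signal in events:
        totals[variant] += 1
        positives[variant] += signal in POSITIVE
    return {v: positives[v] / totals[v] for v in totals}

for variant, rate in acceptance_rates(feedback_events).items():
    print(f"{variant}: {rate:.0%} positive feedback")
```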
Conclusion
The takeaway: good observability isn’t just about visibility - it’s about knowing which levers actually correlate with system capacity, user satisfaction, and cost-efficiency. If you’re only watching traditional metrics, you’re probably flying blind.