May 28, 2025

Behind the Stack, Ep 1: What Should I Be Observing in my LLM Stack?

Jamie Dborin

Introduction 

It’s easy to default to GPU or CPU utilization to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workloads and fall short in LLM deployments. They can stay flat while your model silently hits capacity, leading to missed scaling signals and degraded performance.

What Should I Be Observing?

KV Cache

The real lever? KV cache utilization (if you’re unsure what the KV cache is, check out this blog from Hamish Hall, where he breaks it down). It captures what matters: how many tokens are actively being processed across requests. It scales with context length and generation, not just batch count. When the cache is full, your system is full - no matter what nvidia-smi tells you. Unlike traditional metrics, KV cache usage directly reflects whether a new request can be processed without delay. If there isn’t enough space in the cache for a user’s token load, the request has to wait. This is the constraint that actually determines throughput.
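As a concrete starting point, here is a minimal sketch of polling that signal from a serving engine’s Prometheus-style /metrics endpoint. The endpoint URL and the metric name (vLLM’s vllm:gpu_cache_usage_perc gauge) are assumptions - other engines expose an equivalent gauge under a different name, so check what your server actually reports.

```python
# Minimal sketch: poll KV cache utilization from an inference server's
# Prometheus-style /metrics endpoint. The URL and metric name are assumptions
# (the gauge below is vLLM's); swap in whatever your engine exposes.
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"   # hypothetical server address
KV_METRIC = "vllm:gpu_cache_usage_perc"         # 0.0-1.0 gauge in vLLM

def kv_cache_utilization() -> float:
    """Return current KV cache usage as a fraction, or -1.0 if not found."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(KV_METRIC):
                return float(line.rsplit(" ", 1)[-1])  # value is the last field
    return -1.0

if __name__ == "__main__":
    while True:
        usage = kv_cache_utilization()
        print(f"KV cache utilization: {usage:.1%}")
        if usage > 0.9:  # example threshold: treat >90% as "at capacity"
            print("KV cache nearly full - new requests will queue; consider scaling.")
        time.sleep(5)
```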

By contrast, GPU utilization - as reported by tools like nvidia-smi - only reflects whether something is running on the GPU at a given time slice. A single batch-size-one request can show high GPU utilization, even if your hardware is massively underutilized in terms of throughput capacity. That’s why KV cache usage is the far more actionable signal.
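For contrast, here is an equally small sketch that reads the utilization counter nvidia-smi reports, via NVML. The nvidia-ml-py dependency and the device index are assumptions; the point is that this number can sit near 100% while the KV cache gauge above shows plenty of headroom.

```python
# Minimal sketch: read GPU utilization from NVML (the same counter nvidia-smi
# shows). Requires the nvidia-ml-py package; device index 0 is an assumption.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

# util.gpu is the share of the sample window in which at least one kernel was
# executing - a single batch-size-one request can push it towards 100% even
# though the KV cache (and real throughput headroom) is nearly empty.
print(f"GPU utilization: {util.gpu}%  memory activity: {util.memory}%")
pynvml.nvmlShutdown()
```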

Sequence Length

Sequence length stats are another often-missed insight. Engineers spend time fine-tuning models, but without knowing typical input/output lengths, they’re guessing at optimal configuration. These stats affect not just performance but core architectural choices: for instance, long input sequences may favor models with windowed attention or linear attention mechanisms, while generation-heavy workloads benefit from memory-efficient decoding optimizations.
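If your gateway or logging layer records token counts per request, a few lines are enough to get that picture. The log shape below (dicts with prompt_tokens and completion_tokens fields) is an assumption - adapt the field names to whatever you actually capture.

```python
# Minimal sketch: summarise input/output sequence lengths from request logs.
# The field names are assumptions about what your gateway records.
import statistics

def summarise_lengths(requests: list[dict]) -> dict:
    prompts = [r["prompt_tokens"] for r in requests]
    outputs = [r["completion_tokens"] for r in requests]

    def pct(xs, q):  # q-th percentile via 100-quantile cut points
        return statistics.quantiles(xs, n=100)[q - 1]

    return {
        "prompt_p50": pct(prompts, 50), "prompt_p95": pct(prompts, 95),
        "output_p50": pct(outputs, 50), "output_p95": pct(outputs, 95),
    }

# Illustrative workload: long prompts, short generations - a profile that
# points towards prefill-heavy tuning and long-context-friendly models.
sample = [{"prompt_tokens": 4000, "completion_tokens": 150}] * 20 \
       + [{"prompt_tokens": 800, "completion_tokens": 60}] * 80
print(summarise_lengths(sample))
```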

Sequence lengths also inform how you allocate resources between prefill and decode stages. Prefill is compute-bound, while decoding is memory-bound. Get this balance wrong, and you’re wasting capacity where it matters most. Tools like NVIDIA’s Dynamo allow you to split prefill and decode workers - but only if you’ve measured your workload ratios accurately.
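Building on the same kind of log, a rough prefill:decode token ratio is a reasonable first input to that sizing exercise. Treating a token ratio as a worker ratio is a simplification - prefill and decode tokens don’t cost the same - so benchmark both stages before committing to a split; the sketch below only illustrates the measurement.

```python
# Minimal sketch: aggregate prefill vs decode token volume from request logs.
# Field names are assumptions; the ratio is a starting point for sizing
# disaggregated prefill/decode workers, not a final answer.
def prefill_decode_ratio(requests: list[dict]) -> float:
    prefill_tokens = sum(r["prompt_tokens"] for r in requests)
    decode_tokens = sum(r["completion_tokens"] for r in requests)
    return prefill_tokens / decode_tokens

log = [
    {"prompt_tokens": 4000, "completion_tokens": 150},
    {"prompt_tokens": 800, "completion_tokens": 60},
]
print(f"Prefill tokens per decode token: {prefill_decode_ratio(log):.1f}")
```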

User Feedback 

And then there’s user feedback. It’s not a nice-to-have - it’s the ground truth. Metrics like thumbs up/down, continuation rates, or even copy/paste behavior offer real signals of model quality. This feedback is critical not only for evaluating existing deployments, but for training better models over time — through supervised fine-tuning, DPO (Direct Preference Optimization), or full reinforcement learning workflows. It's also the only way to answer cost-impacting questions like "Can we switch to an 8B quantized model without degrading UX?"
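Even a simple aggregation over explicit feedback events makes that kind of comparison concrete. The event shape and model names below are hypothetical; the useful part is grouping feedback by model variant so a proposed switch can be judged on data rather than anecdote.

```python
# Minimal sketch: per-model thumbs-up rate from logged feedback events.
# The event fields and model names are made up for illustration.
from collections import defaultdict

def thumbs_up_rate(events: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    ups: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["model"]] += 1
        ups[e["model"]] += int(e["thumbs_up"])
    return {model: ups[model] / totals[model] for model in totals}

events = [
    {"model": "llama-70b", "thumbs_up": True},
    {"model": "llama-8b-int4", "thumbs_up": True},
    {"model": "llama-8b-int4", "thumbs_up": False},
]
print(thumbs_up_rate(events))  # compare variants before switching models
```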

Conclusion

The takeaway: good observability isn’t just about visibility - it’s about knowing which levers actually correlate with system capacity, user satisfaction, and cost-efficiency. If you’re only watching traditional metrics, you’re probably flying blind.
