July 8, 2025

Behind the Stack, Ep 7 - Choosing the Right Quantization for Self-Hosted LLMs

Jamie Dborin

Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical

Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.

At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.
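To make that concrete, here's a minimal sketch of symmetric 8-bit quantization of a single weight matrix in PyTorch. Production formats such as AWQ or GPTQ use per-group scales and calibration data, but the core operation - mapping floats onto a small integer grid plus a scale - is the same.

```python
import torch

# Toy example: quantize one fp16 weight matrix to int8 with a single
# per-tensor scale, then dequantize it for use in a matmul.
w = torch.randn(4096, 4096, dtype=torch.float16)

scale = w.abs().max() / 127.0                 # map the largest magnitude to 127
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_dequant = w_int8.to(torch.float16) * scale  # approximate reconstruction

orig_mib = w.nelement() * w.element_size() / 2**20
quant_mib = w_int8.nelement() * w_int8.element_size() / 2**20
print(f"{orig_mib:.0f} MiB -> {quant_mib:.0f} MiB, "
      f"max error {(w - w_dequant).abs().max().item():.4f}")
```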

It typically applies to two things:

  • Weights: the learned, static parameters of the model

  • Activations: the dynamic, intermediate values produced at each layer as the model processes input

Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.

And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.

This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.

Understanding LLM Quantization: The Basics

Most LLMs are released in 16-bit precision, typically using bf16 or fp16. This format offers a balance between performance and accuracy, but large models can still eat up massive amounts of GPU memory.

Quantization solves this by reducing the number of bits used to store weights and/or activations. For example:

  • 4-bit quantization cuts weight memory by roughly 75% relative to 16-bit
  • 8-bit formats halve memory while staying closer to full-precision accuracy
  • FP4 and FP8 are emerging as new standards for high-speed inference
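For a rough sense of scale, the back-of-the-envelope arithmetic below estimates weight memory for a 7B-parameter model at different bit widths (weights only - the KV cache and activations come on top of this):

```python
params = 7e9  # a 7B-parameter model

for fmt, bits in [("fp16/bf16", 16), ("int8/fp8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{fmt:>9}: ~{gib:.1f} GiB of weights")

# fp16/bf16: ~13.0 GiB   int8/fp8: ~6.5 GiB   int4: ~3.3 GiB
```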

However, memory savings aren’t the only goal. The real payoff is inference speed, and this is where quant format matters most.

Two Broad Types of Quantization: When to Use Each

Weight-Only Quantization (e.g. AWQ, GPTQ, GGUF)

This approach compresses the model weights while keeping activations at 16-bit precision. It's ideal for workloads that are decode-heavy - where the model is generating long sequences of text.

Examples of suitable use cases:

  • Agent-like chains that reason over many steps
  • Chatbots that generate long responses
  • Reasoning models that produce long chains of thought

In these scenarios, weight-only quantization can reduce latency dramatically - especially when dequantization is fused inside the matmul kernel. Libraries like AWQ and GemLite support this and can deliver up to 4× speedups in decoding.

But beware: some libraries, such as bitsandbytes, perform dequantization outside the kernel, which adds overhead and reduces the benefit.
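As a concrete starting point, the sketch below serves an AWQ-quantized checkpoint with vLLM, which uses fused dequantize-inside-the-kernel paths during decoding. The model name is just an example of a publicly available AWQ checkpoint; swap in whatever you actually deploy.

```python
from vllm import LLM, SamplingParams

# Weight-only (AWQ) serving sketch: 4-bit weights, 16-bit activations.
# "TheBloke/Llama-2-7B-Chat-AWQ" is only an illustrative checkpoint name.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Draft a step-by-step plan for migrating a service to Kubernetes."],
    params,
)
print(outputs[0].outputs[0].text)
```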

Weight + Activation Quantization (e.g. FP8, SmoothQuant)

This approach compresses both the model weights and activations, leading to smaller memory footprint and potential speedups in prefill-heavy workloads - those where most of the time is spent analyzing long inputs, not generating long outputs.

Use cases where this shines:

  • Long-context question answering
  • RAG systems with huge document payloads
  • Chatbots with long chat history

When the input is large but the output is short (think: "read 10k tokens and summarize"), this quantization approach reduces compute cost during the prefill stage.

That said, these formats are heavily hardware-dependent, so the quantization choice has to be made alongside the hardware choice - a constraint that barely applies to weight-only quantization.
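One low-friction way to try this, assuming Hopper/Ada/Blackwell-class hardware, is vLLM's FP8 mode, which quantizes weights at load time and applies dynamic activation quantization. The checkpoint name below is illustrative.

```python
from vllm import LLM, SamplingParams

# Weight + activation (FP8) serving sketch - requires FP8 tensor cores.
# Validate accuracy on your own workload before rolling this out.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

long_document = "..."  # imagine ~10k tokens of retrieved context here
prompt = f"{long_document}\n\nSummarize the document above in two sentences."
out = llm.generate([prompt], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```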

Hardware Compatibility: Know What Your GPU Supports

Your quantization strategy is only as good as your hardware allows.

For instance:

  • Older GPUs like A100 or consumer cards may only support INT8 tensor cores
  • Hopper and Blackwell GPUs support FP8, unlocking activation quantization benefits
  • CPUs can use formats like GGUF for edge inference (e.g., via llama.cpp)

You should check whether your hardware supports the desired format at the kernel level. A perfectly chosen quant format won't help if your GPU can't accelerate it.
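A quick first check, assuming PyTorch and a CUDA GPU are available, is to read the device's compute capability: FP8 tensor cores arrive with capability 8.9 (Ada) and 9.0 (Hopper), while Ampere-class parts below that top out at INT8 acceleration.

```python
import torch

# Rough capability check before committing to a quantization format.
major, minor = torch.cuda.get_device_capability()
cc = major + minor / 10

print(f"{torch.cuda.get_device_name()} - compute capability {major}.{minor}")
if cc >= 8.9:
    print("FP8 weight + activation quantization is hardware-accelerated here.")
elif cc >= 7.5:
    print("INT8 is accelerated; weight-only formats (AWQ/GPTQ) are the safe bet.")
else:
    print("Older architecture: stick to weight-only quantization.")
```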

Performance Trade-Offs in the Real World

Here’s the high-level summary in plain terms:

  • If your model is decode-heavy (long generations, agents, chat), go with weight-only quantization. It’s simple, fast, and widely supported.
  • If your model is prefill-heavy (large inputs, short answers), explore weight + activation quantization, especially if you're on FP8-ready hardware.
  • If you want max portability or run on mixed hardware, stick to weight-only. Formats like GPTQ or AWQ work almost everywhere.
  • If you’re chasing speed on modern GPUs, test FP8 + SmoothQuant or use Marlin/Machete kernels to push the limits.
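If it helps to encode that as a starting point, here is a deliberately simple, hypothetical rule-of-thumb helper (the 4:1 prefill-to-decode threshold is arbitrary - tune it against your own benchmarks):

```python
def suggest_quantization(avg_prompt_tokens: int,
                         avg_output_tokens: int,
                         has_fp8_tensor_cores: bool) -> str:
    """Hypothetical rule of thumb, not a library API."""
    prefill_heavy = avg_prompt_tokens > 4 * avg_output_tokens  # arbitrary cutoff
    if prefill_heavy and has_fp8_tensor_cores:
        return "weight + activation quantization (e.g. FP8 / SmoothQuant)"
    return "weight-only quantization (e.g. AWQ / GPTQ)"

# RAG-style traffic on an H100: 8k-token prompts, ~200-token answers
print(suggest_quantization(8000, 200, has_fp8_tensor_cores=True))
```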

Accuracy Considerations: Not All Formats Are Equal

Weight + activation quantization is more lossy than weight-only. Activations have a wider dynamic range, with outlier channels that are hard to quantize without degrading model quality.

Approaches like SmoothQuant mitigate this by rescaling activation channels and folding the inverse scales into the weights before quantization. Even then, models may need calibration, fine-tuning, or quantization-aware training for the best results.
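A tiny numerical sketch of that idea: divide each activation channel by a per-channel scale and fold the same scale into the matching weight rows, so the matmul result is unchanged while the activation outliers are flattened before quantization (α = 0.5 is the default migration strength in the SmoothQuant paper).

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 512) * torch.rand(512) * 10   # activations with outlier channels
W = torch.randn(512, 512)                         # linear layer weight (in x out)

alpha = 0.5
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)

X_smooth = X / s                  # flattened activations, easier to quantize
W_smooth = W * s.unsqueeze(1)     # scale folded into the weight rows

# The matmul is mathematically unchanged; only the quantization difficulty moved.
assert torch.allclose(X @ W, X_smooth @ W_smooth, rtol=1e-3, atol=1e-3)
```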

So while full quantization can offer better performance for certain workloads, be sure to validate accuracy if your application is sensitive (e.g., healthcare, legal, finance).
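One way to do that validation is to score the quantized checkpoint and its full-precision baseline on a benchmark close to your domain, for instance with lm-evaluation-harness; the model names and task below are placeholders, not a recommendation.

```python
import lm_eval

# Compare a quantized checkpoint against its fp16 baseline on a representative
# task. Substitute your own models and a task that matches your application.
for name in [
    "meta-llama/Llama-3.1-8B-Instruct",                    # fp16 baseline
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # quantized variant
]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={name}",
        tasks=["gsm8k"],
        batch_size=8,
    )
    print(name, results["results"]["gsm8k"])
```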

Final Takeaways

Quantization isn’t a one-size-fits-all optimization. The right choice depends on your workload, model size, hardware stack, and latency goals.

Go with weight-only quantization if:

  • You prioritize latency during decoding
  • Your GPU doesn’t support low-bit activations
  • You want maximum compatibility

Use weight + activation quantization if:

  • You’re compute-bound during prefill
  • You’re targeting FP8-native hardware (like H100s or B200s)
  • You can tolerate or mitigate accuracy loss with techniques like SmoothQuant

Whether you're scaling agents, serving long documents, or tuning transformer performance - quantization is no longer optional. It's infrastructure-level strategy.


Want help benchmarking quantization formats or optimizing your inference stack? Reach out or follow along in the Behind the Stack series for more deep dives.
