Introduction: Quantization Isn’t Just About Memory - It’s About Making LLMs Practical
Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.
At its core, quantization is the process of reducing the precision of numerical values in a model - from 16-bit floats to 8-bit, 4-bit, or even lower. This seemingly simple change has huge implications: lower memory usage, faster inference, and reduced costs.
It typically applies to two things:
- Weights: the learned, static parameters of the model
- Activations: the dynamic, intermediate values produced at each layer as the model processes input
Activations vary with every inference and can consume significant memory - especially for long prompts - while weights remain fixed. Compressing either (or both) can bring efficiency gains, but with different trade-offs.
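To make that concrete, here is a minimal back-of-the-envelope sketch of how the KV cache - the dominant per-request activation state during inference - grows with prompt length. The dimensions are assumptions roughly matching a 7B-parameter, Llama-2-style architecture, not measurements from any specific deployment:

```python
# Rough KV-cache estimate: 2 (keys + values) x layers x KV heads x head dim
# x sequence length x bytes per value. Dimensions assume a Llama-2-7B-style
# model (32 layers, 32 KV heads, head dim 128) with an fp16 cache.

def kv_cache_gb(seq_len: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

for seq_len in (1_000, 8_000, 32_000):
    print(f"{seq_len:>6} tokens -> ~{kv_cache_gb(seq_len):.1f} GB per request")
# ~0.5 GB at 1k tokens, ~4 GB at 8k, ~17 GB at 32k
```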
And here’s the catch: not all quantization methods benefit all workloads equally. Choosing between weight-only quantization and full weight+activation quantization isn’t just a technical decision - it’s a strategic one that depends on your model architecture, input/output patterns, and the hardware you’re running on.
This blog walks through how to choose the right quantization strategy for your specific use case - so you can cut costs and improve performance without falling into common traps.
Understanding LLM Quantization: The Basics
Most LLMs are released in 16-bit precision, typically using bf16 or fp16. This format offers a balance between performance and accuracy, but large models can still consume tens to hundreds of gigabytes of GPU memory for their weights alone.
Quantization solves this by reducing the number of bits used to store weights and/or activations. For example (a quick sketch of the arithmetic follows this list):
- 4-bit quantization cuts weight memory by roughly 75% relative to 16-bit
- 8-bit formats halve memory and generally offer a better speed–accuracy balance
- FP4 and FP8 are emerging as new standards for high-speed inference
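To put those percentages in context, here is a minimal sketch of the weight-storage arithmetic; the parameter counts are illustrative, and real deployments also need headroom for activations, KV cache, and framework overhead:

```python
# Weight-storage estimate: parameter count x bits per weight.

def weight_memory_gb(num_params: float, bits: int) -> float:
    return num_params * bits / 8 / 1e9

for name, params in (("7B", 7e9), ("70B", 70e9)):
    row = ", ".join(f"{bits}-bit: {weight_memory_gb(params, bits):6.1f} GB"
                    for bits in (16, 8, 4))
    print(f"{name}: {row}")
# 70B: 16-bit: 140.0 GB, 8-bit: 70.0 GB, 4-bit: 35.0 GB
```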
However, memory savings aren’t the only goal. The real payoff is inference speed, and this is where quant format matters most.
Two Broad Types of Quantization: When to Use Each
Weight-Only Quantization (e.g. AWQ, GPTQ, GGUF)
This approach compresses the model weights while keeping activations at 16-bit precision. It's ideal for workloads that are decode-heavy - where the model is generating long sequences of text.
Examples of suitable use cases:
- Agent-like chains that reason over many steps
- Chatbots that generate long responses
- Reasoning models that produce long chains of thought
In these scenarios, weight-only quantization can reduce latency dramatically - especially if you’re using kernels that fuse dequantization into the compute kernel itself. Libraries like AWQ and GemLite support this and can deliver up to 4× speedups in decoding.
But beware: some libraries, such as bitsandbytes, perform dequantization outside the compute kernel, which adds overhead and reduces the benefit.
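As a starting point, here is a minimal sketch of serving a weight-only quantized checkpoint with vLLM. The model ID is a placeholder, and recent vLLM versions usually detect the quantization method from the checkpoint config on their own:

```python
# Decode-heavy serving sketch with a weight-only (AWQ) checkpoint in vLLM.
# The model ID is a placeholder - substitute any AWQ-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",               # explicit; often auto-detected
)

params = SamplingParams(max_tokens=1024, temperature=0.7)
outputs = llm.generate(["Walk me through your reasoning step by step."], params)
print(outputs[0].outputs[0].text)
```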
Weight + Activation Quantization (e.g. FP8, SmoothQuant)
This approach compresses both the model weights and activations, leading to smaller memory footprint and potential speedups in prefill-heavy workloads - those where most of the time is spent analyzing long inputs, not generating long outputs.
Use cases where this shines:
- Long-context question answering
- RAG systems with huge document payloads
- Chatbots with long chat history
When the input is large but the output is short (think: "read 10k tokens and summarize"), this quantization approach reduces compute cost during the prefill stage.
That said, these formats are heavily hardware-dependent, so the quantization choice has to be made alongside the hardware choice - a constraint that matters far less for weight-only quantization.
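As one example, vLLM exposes an on-the-fly FP8 mode on hardware that supports it. The sketch below assumes an FP8-capable GPU; the model ID and input file are placeholders, and the exact activation-quantization behavior varies by vLLM version and hardware:

```python
# Prefill-heavy sketch: on-the-fly FP8 quantization in vLLM on FP8-capable
# hardware (e.g., Hopper). Model ID and input file are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model",  # placeholder 16-bit checkpoint
    quantization="fp8",           # dynamic FP8 quantization
    max_model_len=16384,          # long inputs, short outputs
)

document = open("report.txt").read()  # placeholder long document
prompt = f"{document}\n\nSummarize the key findings in five bullet points."
outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```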
Hardware Compatibility: Know What Your GPU Supports
Your quantization strategy is only as good as your hardware allows.
For instance:
- Ampere-generation GPUs such as the A100, along with many consumer cards, support INT8 tensor cores but not FP8
- Hopper and Blackwell GPUs support FP8 (Blackwell also adds FP4), unlocking activation quantization benefits
- CPUs can use formats like GGUF for edge inference (e.g., via llama.cpp)
You should check whether your hardware supports the desired format at the kernel level. A perfectly chosen quant format won't help if your GPU can't accelerate it.
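A quick way to check from Python is to read the GPU’s compute capability; the thresholds below follow NVIDIA’s published tensor-core support (FP8 from compute capability 8.9, INT8 from 7.5):

```python
# Quick capability probe with PyTorch. FP8 tensor cores arrive with compute
# capability 8.9 (Ada) and 9.0 (Hopper); INT8 tensor cores go back to 7.5.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    print("FP8 tensor cores :", "yes" if (major, minor) >= (8, 9) else "no")
    print("INT8 tensor cores:", "yes" if (major, minor) >= (7, 5) else "no")
else:
    print("No CUDA GPU detected - consider CPU-friendly formats like GGUF via llama.cpp.")
```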
Performance Trade-Offs in the Real World
Here’s the high-level summary in plain terms (a toy decision helper is sketched after this list):
- If your workload is decode-heavy (long generations, agents, chat), go with weight-only quantization. It’s simple, fast, and widely supported.
- If your workload is prefill-heavy (large inputs, short answers), explore weight + activation quantization, especially if you’re on FP8-ready hardware.
- If you want max portability or run on mixed hardware, stick to weight-only. Formats like GPTQ or AWQ work almost everywhere.
- If you’re chasing speed on modern GPUs, test FP8 + SmoothQuant or use Marlin/Machete kernels to push the limits.
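Here is the toy decision helper mentioned above. The 4:1 prefill-to-decode threshold is an illustrative assumption rather than a measured cutoff - profile your own workload before committing:

```python
# Toy decision helper encoding the rules of thumb above.
# The 4:1 prompt-to-output token ratio is an illustrative assumption.

def suggest_quantization(avg_prompt_tokens: int,
                         avg_output_tokens: int,
                         fp8_capable_gpu: bool) -> str:
    prefill_heavy = avg_prompt_tokens > 4 * avg_output_tokens
    if prefill_heavy and fp8_capable_gpu:
        return "weight + activation quantization (e.g., FP8 / SmoothQuant)"
    return "weight-only quantization (e.g., AWQ, GPTQ)"

# RAG-style workload: read 10k tokens, write a short answer
print(suggest_quantization(10_000, 300, fp8_capable_gpu=True))
# Agent loop: short prompts, long multi-step generations
print(suggest_quantization(800, 4_000, fp8_capable_gpu=True))
```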
Accuracy Considerations: Not All Formats Are Equal
Weight + activation quantization is lossier than weight-only. Activations have wide, outlier-prone dynamic ranges, which makes them harder to quantize without degrading quality.
Approaches like SmoothQuant help mitigate this by shifting part of the quantization difficulty from activations to weights through per-channel scaling before quantization. But even then, models may need retraining or fine-tuning for best results.
So while full quantization can offer better performance for certain workloads, be sure to validate accuracy if your application is sensitive (e.g., healthcare, legal, finance).
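A lightweight sanity check is to compare perplexity on a held-out sample before and after quantization. The model IDs and eval file below are placeholders (loading an AWQ checkpoint through transformers also requires the matching backend, e.g. autoawq, to be installed), and a proper benchmark suite on domain-relevant tasks is still recommended for sensitive applications:

```python
# Compare perplexity of a baseline and a quantized checkpoint on a small
# held-out text sample. Lower is better; a large gap is a red flag.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str, device: str = "cuda") -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map=device
    )
    enc = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = open("holdout.txt").read()[:8000]  # placeholder eval text
print("baseline :", perplexity("your-org/your-model", sample))
print("quantized:", perplexity("your-org/your-model-AWQ", sample))
```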
Final Takeaways
Quantization isn’t a one-size-fits-all optimization. The right choice depends on your workload, model size, hardware stack, and latency goals.
Go with weight-only quantization if:
- You prioritize latency during decoding
- Your GPU doesn’t support low-bit activations
- You want maximum compatibility
Use weight + activation quantization if:
- You’re compute-bound during prefill
- You’re targeting FP8-native hardware (like H100s or B200s)
- You can tolerate or mitigate accuracy loss with techniques like SmoothQuant
Whether you're scaling agents, serving long documents, or tuning transformer performance - quantization is no longer optional. It's infrastructure-level strategy.
Want help benchmarking quantization formats or optimizing your inference stack? Reach out or follow along in the Behind the Stack series for more deep dives.