Large language models have transformed how we build AI systems - but they’re still notoriously slow and expensive to run. Over the last few years, one optimization technique has been attracting serious attention for its potential to make inference faster and more efficient: speculative decoding.
In this episode of Behind the Stack, we explore what speculative decoding is, how it accelerates model inference, and what’s still holding it back.
What is Speculative Decoding?
To understand speculative decoding, we first need to understand how language model inference actually works.
Inference can be divided into two phases:
- Prefill (or encoding) - when the model processes your entire prompt.
- Decoding - when the model starts generating output tokens one by one.
During prefill, everything happens in parallel. You can feed thousands of words into the model at once, and it efficiently processes them to produce the first token.
But once decoding begins, things slow down dramatically. That’s because large language models are auto-regressive - they take each output token, feed it back in as input, and then generate the next one. This happens sequentially, meaning each new token requires another full model pass.
So while prefill benefits from GPU parallelism and amortized compute costs, decoding is bottlenecked by memory bandwidth - every single step requires the model weights to be reloaded into GPU compute units. That’s why output tokens are often far more expensive (and slower) than input tokens in API pricing.
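Here's a minimal sketch of the two phases. The `toy_model` function is a made-up stand-in for a real forward pass - it just returns one "next-token" prediction per input position - but the control flow mirrors how real inference engines behave:

```python
import random

VOCAB = list(range(100))  # hypothetical tiny vocabulary

def toy_model(token_ids):
    """Stand-in for a real forward pass: one next-token prediction per position."""
    random.seed(sum(token_ids))            # deterministic toy behaviour
    return [random.choice(VOCAB) for _ in token_ids]

def generate(prompt_ids, max_new_tokens=8):
    # Prefill: a single pass over the whole prompt, all positions in parallel.
    preds = toy_model(prompt_ids)
    tokens = list(prompt_ids) + [preds[-1]]     # first generated token

    # Decode: one full forward pass per new token - strictly sequential,
    # with the weights streamed from memory at every single step.
    for _ in range(max_new_tokens - 1):
        preds = toy_model(tokens)
        tokens.append(preds[-1])                # feed the output back in
    return tokens[len(prompt_ids):]

print(generate([5, 17, 42]))
```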
Speculative decoding aims to fix that.
Adding Parallelism Back Into Decoding
Speculative decoding introduces a form of parallelism back into the decoding process.
The idea comes from speculative execution in CPUs - where processors “guess” future operations to save time, only discarding results if the guess was wrong.
Similarly, speculative decoding makes guesses about what the model will generate next. Rather than waiting for each token one by one, we feed in several draft tokens - potential next words - and ask the model to verify them in a single pass.
If the model agrees with those guesses, we can skip multiple decoding steps at once.
In theory, this can make inference several times faster, since we’re effectively generating multiple tokens per forward pass.
Even though verification adds compute, decoding is usually bound by memory bandwidth rather than arithmetic - so doing more work per pass often costs little extra time, while saving significant time overall.
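To make the verify step concrete, here’s a minimal sketch. Published speculative decoding schemes use a rejection-sampling acceptance rule so that sampled outputs keep the target model’s distribution; the sketch below assumes plain greedy decoding to keep the logic easy to follow, and `target_model` is a hypothetical stand-in that returns one next-token prediction per position in a single pass.

```python
def verify_drafts(target_model, context, draft_tokens):
    """Check several draft tokens against the target model in one forward pass."""
    candidate = context + draft_tokens
    preds = target_model(candidate)          # one parallel pass over everything

    accepted = []
    for i, draft in enumerate(draft_tokens):
        # preds[len(context) + i - 1] is the model's own choice for the
        # position that the i-th draft token is trying to fill.
        if preds[len(context) + i - 1] == draft:
            accepted.append(draft)
        else:
            break                            # first mismatch: stop accepting

    # The same pass also yields one guaranteed-correct token for free:
    # the model's own prediction right after the accepted prefix.
    bonus = preds[len(context) + len(accepted) - 1]
    return accepted, bonus
```

If all the drafts are accepted, we have advanced several tokens for the price of one forward pass; if none are, we still get the one token the model would have produced anyway.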
The Catch: Where Do Draft Tokens Come From?
Here’s the tricky part: to speculate correctly, we need to guess what the model will say next - before it actually says it.
The tokens we guess are called draft tokens, and finding good ones is the core challenge of speculative decoding.
The original speculative decoding paper solved this by using a smaller model to generate draft tokens. The small model runs faster and produces multiple guesses at once. The large model then checks those guesses, accepts the ones it agrees with, and discards the rest.
This can dramatically speed up inference - if the small model is good enough to make accurate guesses.
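Putting the two models together, the decode loop looks roughly like the sketch below - again assuming greedy decoding, with `draft_model` and `target_model` as hypothetical stand-ins that each return one next-token prediction per position.

```python
def speculative_generate(target_model, draft_model, prompt, max_new=64, k=4):
    """Sketch of draft-then-verify decoding with a small proposer model."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. The small model proposes k draft tokens, one cheap step at a time.
        drafts = []
        for _ in range(k):
            preds = draft_model(tokens + drafts)
            drafts.append(preds[-1])

        # 2. The large model checks all k drafts in a single forward pass.
        preds = target_model(tokens + drafts)
        n_accepted = 0
        for i, d in enumerate(drafts):
            if preds[len(tokens) + i - 1] == d:
                n_accepted += 1
            else:
                break

        # 3. Keep the accepted prefix, plus the token the large model produced
        #    itself at the first mismatch (or after the last draft), so every
        #    iteration makes progress even if all drafts are rejected.
        tokens += drafts[:n_accepted]
        tokens.append(preds[len(tokens) - 1])

    return tokens[len(prompt):len(prompt) + max_new]
```

The speedup depends entirely on how many drafts survive step 2: high acceptance rates skip many sequential steps, while low acceptance rates leave you paying for the draft model with little to show for it.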
However, this approach comes with trade-offs:
- Smaller models are less capable. They often produce incorrect tokens, meaning the larger model rejects many of them.
- It adds overhead. The small model must still run before the large one, introducing latency.
- It consumes GPU memory. The small model has its own weights and KV cache, potentially eating into resources that could have been used for the main model.
- Tokenizers must match. This approach only works if the draft and target models use the exact same tokenizer - which isn’t always the case.
So while the “small + big model” method can work well for certain architectures, it’s not always practical or efficient in real-world deployments.
Alternative Sources of Draft Tokens
Researchers have explored several other approaches to generating draft tokens:
- Training additional model heads.
Some methods add extra “heads” to the language model that predict future tokens (e.g., the next 4 or 5) during a single forward pass. These can then be used as speculative guesses. This approach ensures shared tokenization but requires additional training.
- Using past outputs or user suggestions.
APIs like OpenAI’s support passing in suggested continuations - known phrases or past completions - as speculative guesses. This can work well if you have strong prior data, but performance depends on how good your guesses are.
- Prefix trees of past outputs (Doubleword’s approach).
At Doubleword, we’ve developed our own speculative decoding technique that uses a weighted prefix tree built from previous model outputs.
As the model generates text, the system walks down this tree to find past completions that share a prefix with the current context - and speculatively decodes from those continuations.
Over time, as the system observes more real-world requests, the prefix tree adapts - prioritizing the most frequent or probable completions. This means the more you use it, the faster it gets, automatically tuning itself to your workload distribution.
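To illustrate the general shape of the idea - this is a simplified sketch, not Doubleword’s actual implementation - here is a toy weighted prefix tree that records past outputs and proposes draft tokens by following the most frequently observed continuation:

```python
from collections import defaultdict

class PrefixTree:
    """Toy weighted prefix tree over past outputs, used to propose draft tokens."""

    def __init__(self):
        self.children = defaultdict(PrefixTree)  # token -> subtree
        self.count = 0                           # how often this path was seen

    def insert(self, tokens):
        """Record a past model output so it can seed future drafts."""
        node = self
        for t in tokens:
            node = node.children[t]
            node.count += 1

    def propose(self, suffix, k=4):
        """Walk the tree along the recent context, then follow the most
        frequently observed continuation to produce up to k draft tokens."""
        node = self
        for t in suffix:
            if t not in node.children:
                return []                        # no matching prefix observed
            node = node.children[t]

        drafts = []
        for _ in range(k):
            if not node.children:
                break
            # Greedily pick the most frequently observed next token.
            t, node = max(node.children.items(), key=lambda kv: kv[1].count)
            drafts.append(t)
        return drafts

tree = PrefixTree()
tree.insert([10, 11, 12, 13, 14])
tree.insert([10, 11, 12, 13, 14])
tree.insert([10, 11, 12, 99])
print(tree.propose([10, 11]))   # -> [12, 13, 14]: the more frequent path wins
```

A production version would also need eviction and weighting policies, but even this toy version shows why the approach adapts to your workload: the continuations you actually see most often are exactly the ones it proposes first.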
The Future of Speculative Decoding
Speculative decoding is one of the most promising directions for accelerating LLM inference. By introducing controlled parallelism into the decoding phase, it can unlock significant speed gains - especially when combined with smart draft-token strategies that adapt over time.
However, it’s not a one-size-fits-all solution. For certain workloads or architectures, the overhead may outweigh the benefits. For others - especially those with repetitive, predictable output distributions - it can deliver major efficiency wins.
At Doubleword, we’re continuing to explore these trade-offs as part of our broader mission to optimize and govern inference at scale - across any model, on any infrastructure.


