At Doubleword, our research team have recently been wrestling with the problem of batched inference: producing as many tokens as possible for the least spend, without worrying about the latency of any single request. Think of the OpenAI Batch API as an example of this kind of service.
While working on this, we noticed a parallel with the idea of Comparative Advantage from classical economics.
What is Comparative Advantage?
Here's an example to illustrate. Michael Jordan is obviously much better than the average person at basketball. But he is also almost certainly better than the average person at painting the walls in his house, being much taller and more athletic. Where it might take an average person 4 hours to paint a wall, it might take Michael Jordan 2 hours. Does this mean Michael Jordan should paint the walls of his own house, because he is faster? No! He is clearly better off spending those two hours playing basketball and paying someone else to paint the house: the two hours of basketball earn him far more than the painter costs, and both parties are better off as a result.
So even though Michael Jordan has an absolute advantage in both basketball and house painting, it's in his interest to specialise in the task he is comparatively better at, which is basketball, and let others focus on the task they are comparatively better at, like painting walls. In doing so, both Michael Jordan and the painter end up better off than if Michael Jordan had split his time.
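To make the arithmetic concrete, here is a toy opportunity-cost calculation. The hourly figures are made up purely for illustration and aren't from anywhere in particular:

```python
# Toy opportunity-cost calculation for the example above.
# The hourly figures are made up purely for illustration.

jordan_basketball_per_hour = 50_000  # hypothetical earnings per hour of basketball
painter_rate_per_hour = 25           # hypothetical painter's hourly rate

jordan_paint_hours = 2               # Jordan is faster at painting...
painter_paint_hours = 4              # ...than the painter, in absolute terms

# Option A: Jordan paints the wall himself, giving up 2 hours of basketball earnings.
cost_if_jordan_paints = jordan_paint_hours * jordan_basketball_per_hour

# Option B: Jordan keeps playing basketball and pays the painter for 4 hours of work.
cost_if_painter_paints = painter_paint_hours * painter_rate_per_hour

print(f"Opportunity cost if Jordan paints himself: ${cost_if_jordan_paints:,}")
print(f"Cost of hiring the painter:                ${cost_if_painter_paints:,}")
# Despite Jordan's absolute advantage at painting, specialising and paying
# the painter is vastly cheaper for him, and the painter gets paid work.
```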
You can see a nice worked example with numbers on the Wikipedia page for Comparative Advantage.
How does this relate to GPUs and LLM Inference?
Well, there is a nice symmetry between the setup above and the problem you see in the LLM inference landscape.
LLM inference can be split into two phases: prefill and decode. The prefill phase is compute bound: you are limited by how much maths the GPU can perform. The decode phase is memory bandwidth bound: you are limited by how quickly the GPU can move data from its memory to the units that do the maths.
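As a rough illustration of why, here is a standard back-of-envelope roofline estimate. The model size, prompt length and GPU figures below are all assumptions chosen for illustration, not measurements:

```python
# Back-of-envelope roofline estimate showing why prefill is compute bound
# and decode is bandwidth bound. All numbers are rough assumptions.

params = 70e9            # parameters in a hypothetical ~70B model
bytes_per_param = 2      # fp16/bf16 weights
prompt_tokens = 2048     # prompt length handled in the prefill

peak_flops = 990e12      # ~H100 SXM dense fp16 tensor FLOP/s (approximate)
peak_bandwidth = 3.35e12 # ~H100 SXM memory bandwidth in bytes/s (approximate)

# Prefill: roughly 2 FLOPs per parameter per prompt token, all in one compute-heavy pass.
prefill_time = (2 * params * prompt_tokens) / peak_flops

# Decode at small batch sizes: each new token has to stream roughly all the
# weights from memory once, so memory bandwidth sets the pace.
decode_time_per_token = (params * bytes_per_param) / peak_bandwidth

print(f"Prefill: ~{prefill_time * 1e3:.0f} ms for {prompt_tokens} prompt tokens (compute bound)")
print(f"Decode:  ~{decode_time_per_token * 1e3:.0f} ms per output token (bandwidth bound)")
```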
GPUs themselves differ here: some are relatively good at maths (high FLOP/s), and some are relatively good at memory movement (high memory bandwidth).
Here are some rough numbers on this (approximate datasheet figures): an H100 SXM offers around 990 TFLOP/s of dense FP16 tensor compute and about 3.35 TB/s of memory bandwidth; an A100 80GB offers around 312 TFLOP/s and about 2 TB/s; an A10 offers around 125 TFLOP/s and about 0.6 TB/s.
Using economics terminology, you would say that the H100 has an absolute advantage over the A100 in both compute and memory bandwidth, but the A100 has a comparative advantage in memory bandwidth: the compute gap is much larger than the bandwidth gap.
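Putting the same approximate datasheet numbers into code makes the comparative advantage explicit (again, ballpark figures that vary by SKU and form factor):

```python
# Absolute vs comparative advantage between GPU types, using approximate
# datasheet numbers (exact figures vary by SKU and form factor).

gpus = {
    # name:      (~dense fp16 tensor TFLOP/s, ~memory bandwidth in TB/s)
    "H100 SXM":  (990, 3.35),
    "A100 80GB": (312, 2.0),
    "A10":       (125, 0.6),
}

h100_flops, h100_bw = gpus["H100 SXM"]
for name, (flops, bw) in gpus.items():
    if name == "H100 SXM":
        continue
    print(f"H100 vs {name}: {h100_flops / flops:.1f}x the compute, "
          f"but only {h100_bw / bw:.1f}x the bandwidth")

# The H100 wins on both axes (absolute advantage), but the compute gap is much
# bigger than the bandwidth gap, so the older GPUs hold a comparative advantage
# in memory bandwidth.
```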
So let's say you had a mixed set of GPUs, with H100s and A10s, which might be a typical case for enterprises in a cloud environment. If you wanted to run a language model in this environment, the comparative advantage principle says that to maximise the total throughput of the system you would run the prefills (the FLOP-heavy part of the workload) on your H100s, and the decodes (the bandwidth-heavy part) on the A10s. You can do this using disaggregated prefill (https://docs.nvidia.com/dynamo/latest/architecture/disagg_serving.html), which is available in most inference engines.
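A minimal sketch of what that assignment looks like (hypothetical scheduler logic we made up for illustration, not the API of Dynamo, vLLM, or any particular engine):

```python
# Hypothetical sketch of phase-aware routing in a disaggregated setup: the
# compute-bound prefill phase goes to the high-FLOP pool, the bandwidth-bound
# decode phase goes to the pool with the comparative advantage in bandwidth.
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str
    tflops: float         # approximate dense fp16 tensor TFLOP/s per GPU
    bandwidth_tbs: float  # approximate memory bandwidth in TB/s per GPU

PREFILL_POOL = GpuPool("H100 pool", tflops=990, bandwidth_tbs=3.35)
DECODE_POOL = GpuPool("A10 pool", tflops=125, bandwidth_tbs=0.6)

def route(phase: str) -> GpuPool:
    """Assign each phase of a request to the pool it is comparatively suited to."""
    if phase == "prefill":
        return PREFILL_POOL   # FLOP-heavy: use the GPUs with the compute advantage
    if phase == "decode":
        return DECODE_POOL    # bandwidth-heavy: free the H100s for more prefills
    raise ValueError(f"unknown phase: {phase}")

for phase in ("prefill", "decode"):
    print(f"{phase:>7} -> {route(phase).name}")
```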
This would hurt your latency, since you are using a lower-memory-bandwidth GPU for decoding, but in exchange you are pushing a lot more FLOPs through your FLOP-hungry H100s, so you should expect that this sort of heterogeneous setup could outperform the same set of GPUs with vLLM running independently on each one.
This is just our intuition on the matter and something that we are going to be exploring at Doubleword going forward, so stay tuned to find out if LLM inference can learn a thing or two from classical economics!
We are hiring excellent inference and infrastructure folks in the UK & are open to collaborations with researchers and labs who are also interested in this topic.
N.B. There are lots of technical difficulties hidden in the assumption that you can run inference across different GPU types. Not all GPUs support the same data types: an H100 can run in FP8 and an A10 cannot, for example. You would also likely use different parallelisation strategies for H100 and A10 inference, since they have different memory sizes, so your engine needs to be able to reshape the KV cache as it moves between GPUs. But putting all that aside, you might find there is some merit to the idea of these heterogeneous inference systems. It was demonstrated to be the case in this paper, for example: https://arxiv.org/abs/2502.09334
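As a small concrete example of the data type issue, a scheduler for a heterogeneous fleet would at least need to probe which GPUs have FP8 tensor cores. With PyTorch you can read the compute capability, where 8.9 (Ada) is the cutoff for FP8 tensor cores, so an H100 at 9.0 passes and an A10 at 8.6 does not:

```python
# Probe which visible GPUs can run FP8: FP8 tensor cores need compute
# capability >= 8.9 (Ada/Hopper). An H100 reports (9, 0); an A10 reports (8, 6).
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    capability = torch.cuda.get_device_capability(i)
    supports_fp8 = capability >= (8, 9)
    print(f"GPU {i}: {name}, compute capability {capability}, FP8 capable: {supports_fp8}")
```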