Behind the Stack, Ep 3 - Serving Hundreds of Fine-Tuned Models on a Single GPU (Without Cold Starts)
Introduction
In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.
Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFT adapters) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.
What Is a LoRA (and Why Use One)?
LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.
- Only a small fraction of parameters are updated
- Training uses much less memory
- The resulting adapters are tiny (often <1% of the model size)
These benefits make LoRA a go-to method for use cases where you want to:
- Customize a base model per task or domain
- Run many fine-tunes without retraining or duplicating the base model
- Stay compatible with quantized or frozen weights
At inference time, a LoRA can either be merged into the base model (for zero added overhead) or kept separate to allow swapping between fine-tunes.
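To make the adapter idea concrete, here’s a minimal PyTorch sketch of a LoRA-augmented linear layer. It’s illustrative only - libraries like Hugging Face PEFT handle dropout, scaling conventions, and target-module selection for you - but it shows both the low-rank update and the merge-vs-swap choice:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W + (alpha / r) * (B @ A)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                       # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)    # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))          # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path plus low-rank correction; only A and B are trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the adapter into the base weights for zero-overhead inference.
        self.base.weight += self.scale * (self.B @ self.A)
```

For a 4096×4096 projection at rank r = 8, the adapter adds about 65K parameters against roughly 16.8M in the frozen matrix - which is where the “<1% of the model size” figure comes from.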
Why One LoRA Per GPU Doesn’t Scale
The simplest serving pattern is:
1 model + 1 LoRA = 1 GPU.
This works when each model serves high, steady traffic. But it quickly breaks down when:
- Models have sporadic usage
- You have more fine-tunes than GPUs
- You want to scale elastically without cold starts
In these cases, many GPUs end up underutilized or idle while others are overworked. Auto-scaling helps, but it brings slow spin-up times, race conditions, and resource contention.
The Better Pattern: Batched LoRA Serving
Modern inference engines (like vLLM, SGLang, and the Doubleword Inference Stack) support a more efficient approach:
Serve multiple LoRAs concurrently on a single GPU.
You load the base model once, keep multiple LoRA adapters in memory, and dynamically batch incoming requests - even across LoRAs (see the sketch at the end of this section).
This enables:
- High GPU utilization
- Shared compute across low-traffic apps
- Consistent latency, no matter which fine-tune is called
For many teams, this setup alone unlocks significant cost and complexity savings.
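Here’s a rough sketch of what this looks like with vLLM’s offline API. The base model name, adapter names, IDs, and paths below are placeholders - check your vLLM version’s docs for the exact LoRA arguments:

```python
# Sketch: one base model, several LoRA adapters, served from the same engine.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model, loaded once
    enable_lora=True,
    max_loras=8,        # adapters resident on the GPU at any one time
    max_lora_rank=16,
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Placeholder adapters; each request can target a different fine-tune.
adapters = [
    LoRARequest("support-bot", 1, "/adapters/support-bot"),
    LoRARequest("sql-gen", 2, "/adapters/sql-gen"),
]

for lora in adapters:
    out = llm.generate(["Summarise this ticket: ..."], params, lora_request=lora)
    print(lora.lora_name, "->", out[0].outputs[0].text)
```

The loop is just for illustration - in the online server (vllm serve with --enable-lora and --lora-modules), requests targeting different adapters are batched into the same forward passes automatically.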
Scaling Further: Hot-Swapping LoRAs
What if you have hundreds of LoRAs? They won’t all fit on a single GPU. So we extend the idea: keep a subset in GPU memory, and hot-swap others on demand.
How?
- Store inactive LoRAs in RAM, disk, or object storage (they’re tiny)
- When a request for an “inactive” fine-tune comes in, quickly load its LoRA weights onto the GPU
- Resume batching and serve as usual
Since LoRAs are small (e.g., <100MB), swap times are fast:
- <70ms from CPU to GPU in our tests with LLaMA 8B
- No need to reload the base model
- No containers or full model provisioning delays
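The mechanism is simple enough to sketch in plain PyTorch: a bounded, least-recently-used cache of adapters on the GPU, backed by cheaper storage. The class and paths below are made up for illustration - engines like vLLM implement this kind of adapter management internally:

```python
# Illustrative sketch of adapter hot-swapping: keep a bounded set of LoRA
# adapters resident on the GPU and pull others up from CPU RAM or disk on demand.
from collections import OrderedDict
import torch

class AdapterCache:
    def __init__(self, max_on_gpu: int = 8, device: str = "cuda"):
        self.max_on_gpu = max_on_gpu
        self.device = device
        self.gpu = OrderedDict()                 # adapter name -> GPU state dict (LRU order)

    def get(self, name: str, path: str) -> dict:
        if name in self.gpu:                     # hot: already resident on the GPU
            self.gpu.move_to_end(name)
            return self.gpu[name]
        if len(self.gpu) >= self.max_on_gpu:     # evict the least recently used adapter
            self.gpu.popitem(last=False)
        cpu_weights = torch.load(path, map_location="cpu")                # tiny: tens of MB
        gpu_weights = {k: v.to(self.device, non_blocking=True) for k, v in cpu_weights.items()}
        self.gpu[name] = gpu_weights             # a <100MB copy takes on the order of milliseconds
        return gpu_weights

cache = AdapterCache(max_on_gpu=8)
weights = cache.get("support-bot", "/adapters/support-bot/adapter_model.bin")
```

Recent vLLM versions also expose runtime adapter loading and unloading through the OpenAI-compatible server (gated behind an environment flag), so new fine-tunes can be registered without restarting the engine - check your version’s docs for the exact endpoints.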
What You Get from This Architecture
This setup effectively gives you “serverless” LoRA serving, with none of the traditional cold-start penalties. For more on this, check out a previous blog I wrote on the serverless LoRA framework we use at Doubleword.
Benefits:
- Serve dozens (or hundreds) of models per GPU
- Eliminate idle capacity
- Respond to dynamic usage patterns without spinning up full nodes
- Centralize scheduling and load balancing
And since LoRAs are non-invasive and often task-specific, this scales cleanly across applications, domains, and internal teams.
Summary
As LLM use cases diversify, so will the number of fine-tunes you support. Using LoRA and other PEFTs doesn’t just save cost at training time; it also unlocks more scalable and responsive deployment options.
With:
- Batched multi-LoRA serving
- Fast adapter hot-swapping
- Persistent base models
…you can support hundreds of fine-tuned models with just a handful of GPUs - and none of the cold-start drag.