Behind the Stack, Ep 3 - Serving Hundreds of Fine-Tuned Models on a Single GPU (Without Cold Starts)
Introduction
In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.
Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFT adapters) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.
What Is a LoRA (and Why Use One)?
LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.
- Only a small fraction of parameters are updated
- Training uses much less memory
- The resulting adapters are tiny (often <1% of the model size)
These benefits make LoRA a go-to method for use cases where you want to:
- Customize a base model per task or domain
- Run many fine-tunes without retraining or duplicating the base model
- Stay compatible with quantized or frozen weights
At inference time, a LoRA can either be merged into the base model (for zero added overhead) or kept separate to allow swapping between fine-tunes.
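To make the adapter idea concrete, here’s a minimal PyTorch sketch of a LoRA-augmented linear layer. It’s illustrative only - libraries like Hugging Face PEFT handle dropout, scaling conventions, and target-module selection for you - but it shows both the low-rank update and the merge-vs-swap choice:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W + (alpha / r) * (B @ A)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                       # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)    # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))          # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path plus low-rank correction; only A and B are trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the adapter into the base weights for zero-overhead inference.
        self.base.weight += self.scale * (self.B @ self.A)
```

For a 4096×4096 projection at rank r = 8, the adapter adds about 65K parameters against roughly 16.8M in the frozen matrix - which is where the “<1% of the model size” figure comes from.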
Why One LoRA Per GPU Doesn’t Scale
The simplest serving pattern is:
1 model + 1 LoRA = 1 GPU.
This works when each model serves high, steady traffic. But it quickly breaks down when:
- Models have sporadic usage
- You have more fine-tunes than GPUs
- You want to scale elastically without cold starts
In these cases, many GPUs end up underutilized or idle while others are overworked. Auto-scaling helps, but it brings slow spin-up times, race conditions, and resource contention.
The Better Pattern: Batched LoRA Serving
Modern inference engines (like vLLM, SGLang, and the Doubleword Inference Stack) support a more efficient approach:
Serve multiple LoRAs concurrently on a single GPU.
You load the base model once, keep multiple LoRA adapters in memory, and dynamically batch incoming requests - even across LoRAs (see the sketch at the end of this section).
This enables:
- High GPU utilization
- Shared compute across low-traffic apps
- Consistent latency, no matter which fine-tune is called
For many teams, this setup alone unlocks significant cost and complexity savings.
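Here’s a rough sketch of what this looks like with vLLM’s offline API. The base model name, adapter names, IDs, and paths below are placeholders - check your vLLM version’s docs for the exact LoRA arguments:

```python
# Sketch: one base model, several LoRA adapters, served from the same engine.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model, loaded once
    enable_lora=True,
    max_loras=8,        # adapters resident on the GPU at any one time
    max_lora_rank=16,
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Placeholder adapters; each request can target a different fine-tune.
adapters = [
    LoRARequest("support-bot", 1, "/adapters/support-bot"),
    LoRARequest("sql-gen", 2, "/adapters/sql-gen"),
]

for lora in adapters:
    out = llm.generate(["Summarise this ticket: ..."], params, lora_request=lora)
    print(lora.lora_name, "->", out[0].outputs[0].text)
```

The loop is just for illustration - in the online server (vllm serve with --enable-lora and --lora-modules), requests targeting different adapters are batched into the same forward passes automatically.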
Scaling Further: Hot-Swapping LoRAs
What if you have hundreds of LoRAs? They won’t all fit on a single GPU. So we extend the idea: keep a subset in GPU memory, and hot-swap others on demand.
How?
- Store inactive LoRAs in RAM, disk, or object storage (they’re tiny)
- When a request for an “inactive” fine-tune comes in, quickly load its LoRA weights onto the GPU
- Resume batching and serve as usual
Since LoRAs are small (e.g., <100MB), swap times are fast:
- <70ms from CPU to GPU in our tests with LLaMA 8B
- No need to reload the base model
- No containers or full model provisioning delays
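The mechanism is simple enough to sketch in plain PyTorch: a bounded, least-recently-used cache of adapters on the GPU, backed by cheaper storage. The class and paths below are made up for illustration - engines like vLLM implement this kind of adapter management internally:

```python
# Illustrative sketch of adapter hot-swapping: keep a bounded set of LoRA
# adapters resident on the GPU and pull others up from CPU RAM or disk on demand.
from collections import OrderedDict
import torch

class AdapterCache:
    def __init__(self, max_on_gpu: int = 8, device: str = "cuda"):
        self.max_on_gpu = max_on_gpu
        self.device = device
        self.gpu = OrderedDict()                 # adapter name -> GPU state dict (LRU order)

    def get(self, name: str, path: str) -> dict:
        if name in self.gpu:                     # hot: already resident on the GPU
            self.gpu.move_to_end(name)
            return self.gpu[name]
        if len(self.gpu) >= self.max_on_gpu:     # evict the least recently used adapter
            self.gpu.popitem(last=False)
        cpu_weights = torch.load(path, map_location="cpu")                # tiny: tens of MB
        gpu_weights = {k: v.to(self.device, non_blocking=True) for k, v in cpu_weights.items()}
        self.gpu[name] = gpu_weights             # a <100MB copy takes on the order of milliseconds
        return gpu_weights

cache = AdapterCache(max_on_gpu=8)
weights = cache.get("support-bot", "/adapters/support-bot/adapter_model.bin")
```

Recent vLLM versions also expose runtime adapter loading and unloading through the OpenAI-compatible server (gated behind an environment flag), so new fine-tunes can be registered without restarting the engine - check your version’s docs for the exact endpoints.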
What You Get from This Architecture
This setup effectively gives you “serverless” LoRA serving, with none of the traditional cold-start penalties. For more on this, check out a previous blog I wrote on the serverless LoRA framework we use at Doubleword.
Benefits:
- Serve dozens (or hundreds) of models per GPU
- Eliminate idle capacity
- Respond to dynamic usage patterns without spinning up full nodes
- Centralize scheduling and load balancing
And since LoRAs are non-invasive and often task-specific, this scales cleanly across applications, domains, and internal teams.
Summary
As LLM use cases diversify, so will the number of fine-tunes you support. Using LoRA and other PEFTs doesn’t just save cost at training time; it also unlocks more scalable and responsive deployment options.
With:
- Batched multi-LoRA serving
- Fast adapter hot-swapping
- Persistent base models
…you can support hundreds of fine-tuned models with just a handful of GPUs - and none of the cold-start drag.