June 10, 2025

Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts

Jamie Dborin

Introduction

In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you've got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.

Training these models is relatively lightweight. Serving them efficiently is a much harder problem. In this video, I break down how to serve many LoRAs (or other PEFT adapters) on a single GPU, support dynamic load patterns, and avoid the high cost and latency of traditional serverless setups.

What Is a LoRA (and Why Use One)?

LoRA (Low-Rank Adaptation) is a popular form of parameter-efficient fine-tuning. Instead of updating full weight matrices, LoRA inserts small trainable adapters at key layers.

  • Only a small fraction of parameters are updated
  • Training uses much less memory
  • The resulting adapters are tiny (often <1% of the model size)

These benefits make LoRA a go-to method for use cases where you want to:

  • Customize a base model per task or domain
  • Run many fine-tunes without retraining or duplicating the base model
  • Stay compatible with quantized or frozen weights

At inference time, LoRA can either be merged into the model (for zero overhead), or kept separate to allow swapping between fine-tunes.
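To make that concrete, here's roughly what attaching LoRA adapters looks like with the Hugging Face PEFT library. This is a minimal sketch: the base model, rank, and target modules are illustrative placeholders, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model - substitute whichever checkpoint you fine-tune.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Small low-rank adapters on the attention projections; rank/targets are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters

# After training, either save just the adapter (tens of MB) so it stays swappable
# at serving time...
model.save_pretrained("adapters/my-task")
# ...or merge it into the base weights for zero-overhead, single-fine-tune inference.
merged = model.merge_and_unload()
```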

Why One LoRA Per GPU Doesn’t Scale

The simplest serving pattern is:
1 model + 1 LoRA = 1 GPU.

This works when each model serves high, steady traffic. But it quickly breaks down when:

  • Models have sporadic usage
  • You have more fine-tunes than GPUs
  • You want to scale elastically without cold starts

In this case, many GPUs end up underutilized or idle, while others are overworked. Auto-scaling helps - but brings slow spin-up times, race conditions, and resource contention.

The Better Pattern: Batched LoRA Serving

Modern inference engines (like vLLM, SGLang, and the Doubleword Inference Stack) support a more efficient approach:

Serve multiple LoRAs concurrently on a single GPU.

You load the base model once, keep multiple LoRA adapters in memory, and dynamically batch incoming requests - even across LoRAs.
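Here's a minimal sketch of that pattern using vLLM's multi-LoRA support (the same idea applies to SGLang and the Doubleword Inference Stack). The model name, adapter names, and paths are placeholders, and exact arguments can vary between vLLM versions.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model once, with room for several resident adapters.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=8,       # adapters kept hot on the GPU at once
    max_lora_rank=16,
)
params = SamplingParams(temperature=0.0, max_tokens=128)

# Each request names the adapter it needs; under continuous batching, concurrent
# requests share the base model's forward pass even across different adapters.
support = llm.generate(
    ["Summarise this support ticket."],
    params,
    lora_request=LoRARequest("support-summariser", 1, "adapters/support-summariser"),
)
legal = llm.generate(
    ["List the parties named in this contract."],
    params,
    lora_request=LoRARequest("contract-extractor", 2, "adapters/contract-extractor"),
)
```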

This enables:

  • High GPU utilization
  • Shared compute across low-traffic apps
  • Consistent latency, no matter which fine-tune is called

For many teams, this setup alone unlocks significant cost and complexity savings.
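In server mode, the same idea surfaces at the API boundary: each adapter is registered under its own model name, so callers pick a fine-tune per request while sharing one GPU. A sketch with the OpenAI-compatible client, assuming a local endpoint and purely illustrative adapter names:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible inference server (e.g. vLLM launched with LoRA
# support) running locally; the URL, key, and adapter names are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Two different fine-tunes, one GPU: the adapter is selected by model name.
for adapter in ["support-summariser", "contract-extractor"]:
    response = client.chat.completions.create(
        model=adapter,
        messages=[{"role": "user", "content": "Give me a one-line status summary."}],
    )
    print(adapter, "->", response.choices[0].message.content)
```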

Scaling Further: Hot-Swapping LoRAs

What if you have hundreds of LoRAs? They won’t all fit on a single GPU. So we extend the idea: keep a subset in GPU memory, and hot-swap others on demand.

How?

  • Store inactive LoRAs in RAM, disk, or object storage (they’re tiny)
  • When a request for an “inactive” fine-tune comes in, quickly load its LoRA weights onto the GPU
  • Resume batching and serve as usual

Since LoRAs are small (e.g., <100MB), swap times are fast:

  • <70ms from CPU to GPU in our tests with LLaMA 8B
  • No need to reload the base model
  • No containers or full model provisioning delays
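The bookkeeping behind hot-swapping is essentially an LRU cache over GPU adapter slots. Here's a hypothetical sketch, assuming your inference engine exposes hooks to load and evict adapters - the load_to_gpu and evict_from_gpu callables below are placeholders for those hooks, not a real API.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep a fixed number of LoRA adapters resident on the GPU, evicting the
    least recently used one when a cold adapter is requested."""

    def __init__(self, max_gpu_adapters, load_to_gpu, evict_from_gpu):
        self.max_gpu_adapters = max_gpu_adapters
        self.load_to_gpu = load_to_gpu        # placeholder: copy adapter weights to the GPU
        self.evict_from_gpu = evict_from_gpu  # placeholder: free an adapter's GPU slot
        self._resident = OrderedDict()        # adapter name -> engine handle

    def get(self, name, source_path):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as most recently used
            return self._resident[name]
        if len(self._resident) >= self.max_gpu_adapters:
            victim, handle = self._resident.popitem(last=False)  # evict the LRU adapter
            self.evict_from_gpu(victim, handle)
        # Cold path: adapters are small (<100MB), so this load takes tens of
        # milliseconds, not the seconds-to-minutes of provisioning a full replica.
        handle = self.load_to_gpu(name, source_path)
        self._resident[name] = handle
        return handle
```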

What You Get from This Architecture

This setup effectively gives you “serverless” LoRA serving, with none of the traditional cold start penalties. For more on this, check out a previous blog I wrote on the serverless LoRA framework that we use at Doubleword.

Benefits:

  • Serve dozens (or hundreds) of models per GPU
  • Eliminate idle capacity
  • Respond to dynamic usage patterns without spinning up full nodes
  • Centralize scheduling and load balancing

And since LoRAs are non-invasive and often task-specific, this scales cleanly across applications, domains, and internal teams.

Summary

As LLM use cases diversify, so will the number of fine-tunes you support. Using LoRA and other PEFTs doesn't just save cost at training time; it also unlocks more scalable and responsive deployment options.

With:

  • Batched multi-LoRA serving
  • Fast adapter hot-swapping
  • Persistent base models

…you can support hundreds of fine-tuned models with just a handful of GPUs - and none of the cold-start drag.
