One GPU, multiple models. Here's how to do it without buying more hardware.
When to use multiple models
Not every task needs the same model. Common multi-model setups:
- Autocomplete + chat: A small fast model (7-9B) for tab completions, a larger model (22-27B) for chat interactions
- Code + general reasoning: A code-specialized model for implementation, a general model for architecture discussions
- Draft + review: A cheap model generates code, a better model reviews it
- Different languages: Specialized models for different programming languages or natural languages
The key insight: you rarely need two models loaded simultaneously. Most workflows are sequential: you use one model, then switch to another.
VRAM sharing strategies
Understanding how much VRAM each model needs is the foundation of multi-model serving:
| Strategy | VRAM usage | Latency | Best for |
|---|---|---|---|
| Sequential swapping | 1 model at a time | 2-5s swap | Single developer |
| Concurrent loading | Sum of all models | None | Multi-user serving |
| Partial offloading | GPU + CPU split | Variable | Budget setups |
| LoRA adapters | Base + tiny adapters | <100ms swap | Fine-tuned variants |
| Speculative decoding | Draft + verify model | None | Speed optimization |
The golden rule: if your models fit in VRAM together, load them together. If they don't, swap them.
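As a back-of-envelope check, a quantized model's weights take roughly params × bits / 8 bytes, plus room for KV cache and runtime buffers. A minimal sketch of that rule (the ~20% overhead factor is an assumption; actual usage depends on context length and runtime):

```python
def fits_together(models_b_params, bits=4, overhead=1.2, vram_gb=24):
    """Rough check: do these models fit in VRAM together at a given quantization?

    models_b_params: parameter counts in billions, e.g. [22, 9]
    bits: quantization width (4 for Q4)
    overhead: fudge factor for KV cache and runtime buffers (assumed ~20%)
    """
    total_gb = sum(p * bits / 8 * overhead for p in models_b_params)
    return total_gb, total_gb <= vram_gb

size, ok = fits_together([22, 9])  # Codestral 22B Q4 + Qwen 9B Q4
# 13.2 GB + 5.4 GB = 18.6 GB, so the pair fits on a 24 GB card
```

The same arithmetic says a 27B Q4 model (~16 GB with overhead) plus a 22B Q4 model will not fit on 24 GB, which is when swapping takes over.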
Option 1: Model swapping (Ollama)
Ollama automatically loads and unloads models on demand. When you switch models, the previous one is evicted from VRAM.
# Model A loads into VRAM
ollama run qwen3.5:27b "Fix this bug"
# Model A unloaded, Model B loads
ollama run codestral:22b "Complete this function"
Tradeoff: 2-5 second swap time. Fine for single-user dev, not for multi-user serving.
Best for: Local development with Continue.dev (chat model + autocomplete model).
Ollama multi-model serving
Ollama can keep multiple models loaded simultaneously if you have enough VRAM. Configure this with environment variables:
# Keep models loaded for 30 minutes (default: 5 minutes)
OLLAMA_KEEP_ALIVE=30m ollama serve
# Or set per-request keep-alive
curl http://localhost:11434/api/generate -d '{
"model": "codestral:22b",
"prompt": "hello",
"keep_alive": "30m"
}'
To load two models concurrently on a 24GB GPU:
# Load a small autocomplete model (5GB)
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:9b-q4", "keep_alive": "60m", "prompt": ""}'
# Load a code model alongside it (12GB)
curl http://localhost:11434/api/generate -d '{"model": "codestral:22b-q4", "keep_alive": "60m", "prompt": ""}'
Check what's loaded:
ollama ps
# NAME SIZE PROCESSOR
# qwen3.5:9b-q4 5.2GB 100% GPU
# codestral:22b-q4 12.1GB 100% GPU
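The same information is available programmatically from Ollama's GET /api/ps endpoint, which returns JSON describing each running model. A small helper to summarize VRAM usage (the sample response below is illustrative, not captured from a real server):

```python
def vram_summary(ps_response: dict) -> dict:
    """Map each loaded model to its VRAM footprint in GB, from GET /api/ps JSON."""
    return {m["name"]: round(m["size_vram"] / 1e9, 1)
            for m in ps_response.get("models", [])}

# Example shape of an /api/ps response (sizes are illustrative)
sample = {"models": [
    {"name": "qwen3.5:9b-q4", "size_vram": 5_200_000_000},
    {"name": "codestral:22b-q4", "size_vram": 12_100_000_000},
]}
print(vram_summary(sample))
# {'qwen3.5:9b-q4': 5.2, 'codestral:22b-q4': 12.1}
```

Useful for a dashboard or a pre-flight check before loading a third model.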
Option 2: Multi-LoRA (vLLM)
Load one base model and multiple LoRA adapters. Each adapter adds <1GB of VRAM. vLLM supports serving multiple LoRA adapters simultaneously.
python -m vllm.entrypoints.openai.api_server \
--model base-model \
--enable-lora \
--lora-modules adapter1=/path/to/lora1 adapter2=/path/to/lora2
Tradeoff: All adapters share the base model's quality ceiling. Only works for fine-tuned variants, not completely different models.
Best for: Serving multiple fine-tuned versions (one per customer, one per task).
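To target a specific adapter, pass its registered name (from --lora-modules) as the `model` field of an ordinary OpenAI-style completion request; vLLM applies that adapter on top of the shared base model. A minimal payload builder, using the adapter names from the command above:

```python
def lora_payload(adapter: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/completions payload targeting one LoRA adapter.

    The adapter name must match a name registered via vLLM's --lora-modules flag.
    """
    return {"model": adapter, "prompt": prompt, "max_tokens": max_tokens}

# POST this to http://localhost:8000/v1/completions with httpx or requests
payload = lora_payload("adapter1", "def fibonacci(n):")
```

Switching adapters is just switching the `model` string; no weights are reloaded, which is where the <100ms swap figure comes from.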
Option 3: vLLM model routing
For serving genuinely different models, run separate vLLM instances on the same GPU, each capped to a fraction of GPU memory, and route requests between them. This follows the multi-model architecture pattern:
# Run two vLLM instances on the same GPU with memory limits
# Instance 1: code model (55% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
--model codestral-22b \
--gpu-memory-utilization 0.55 \
--port 8001
# Instance 2: chat model (35% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
--model qwen3.5-9b \
--gpu-memory-utilization 0.35 \
--port 8002
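--gpu-memory-utilization is a fraction of total GPU memory, and each instance's share must cover its weights plus KV cache. A sketch of how the 0.55/0.35 split above could be derived on a 24GB card (the KV headroom figures are assumptions chosen to reproduce those flags):

```python
def utilization(weights_gb: float, kv_headroom_gb: float, total_gb: float = 24) -> float:
    """Fraction of GPU memory to request via --gpu-memory-utilization."""
    return round((weights_gb + kv_headroom_gb) / total_gb, 2)

code_frac = utilization(12, 1.2)  # Codestral 22B Q4 weights + modest KV headroom
chat_frac = utilization(5, 3.4)   # Qwen 9B Q4 weights + larger KV budget
# 0.55 and 0.35: together they request 90% of the card, leaving
# headroom for the CUDA context and driver overhead
```

If the two fractions sum to 1.0 or more, the second instance will fail to allocate, so always leave a margin.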
Then route requests with a simple proxy:
import httpx

# Names must match the --model flags of the two vLLM instances above
MODELS = {8001: "codestral-22b", 8002: "qwen3.5-9b"}

async def route_request(task: str, prompt: str):
    # Autocomplete goes to the code model; everything else to the chat model
    port = 8001 if task == "autocomplete" else 8002
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"http://localhost:{port}/v1/completions", json={
            "model": MODELS[port],
            "prompt": prompt,
            "max_tokens": 256
        })
    return resp.json()
Option 4: Smaller models that fit together
Choose models that fit in VRAM simultaneously:
| GPU | Combo that fits |
|---|---|
| RTX 4090 (24GB) | Codestral 22B Q4 (12GB) + Qwen 9B Q4 (5GB) |
| Mac M4 32GB | Devstral Small 24B Q4 (14GB) + Qwen 9B Q4 (5GB) |
| 2x RTX 4090 (48GB) | Qwen 27B Q4 (16GB) + Codestral 22B Q4 (12GB) |
See our GPU memory planning guide for exact calculations.
Option 5: CPU offloading
Load part of the model on GPU, part on CPU RAM. Slower but fits larger models.
# llama.cpp: offload 30 of 40 layers to GPU, rest on CPU
./llama-server -m model.gguf -ngl 30
Tradeoff: 2-5x slower than full GPU. Only viable for light usage.
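Picking the -ngl value is simple arithmetic: divide the VRAM you can spare by the per-layer weight size. A rough helper (the reserve figure is an assumption; check llama.cpp's load log for actual per-layer sizes):

```python
def gpu_layers(vram_gb: float, n_layers: int, model_gb: float,
               reserve_gb: float = 2.0) -> int:
    """How many transformer layers fit on the GPU; the rest stay on CPU.

    reserve_gb: VRAM held back for KV cache and compute buffers (assumed value).
    """
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# e.g. a hypothetical 40-layer, 25 GB model on a card with 20 GB usable:
gpu_layers(20, 40, 25)  # -> 28, i.e. ./llama-server -m model.gguf -ngl 28
```

Start a few layers below the computed value and raise it until you hit an out-of-memory error, since KV cache grows with context length.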
Practical examples
Developer workstation (RTX 4090, 24GB)
The most common setup for a solo developer using Continue.dev:
{
"models": [
{"provider": "ollama", "model": "qwen3.5:27b-q4", "title": "Chat"}
],
"tabAutocompleteModel": {
"provider": "ollama",
"model": "codestral:22b-q4",
"title": "Autocomplete"
}
}
Ollama swaps between them automatically. The autocomplete model loads in ~2 seconds when you start typing, and the chat model loads when you open the chat panel.
Team server (A100 80GB)
For a small team sharing one GPU:
# Run vLLM with a single large model
python -m vllm.entrypoints.openai.api_server \
    --model qwen3.5-72b-awq \
--gpu-memory-utilization 0.90 \
--max-num-seqs 8 \
--port 8000
One powerful model serving multiple users is more efficient than multiple smaller models. Route all tasks (code, chat, review) to the same model.
The practical recommendation
For most developers: use Ollama with model swapping. The 2-5 second swap is barely noticeable, and you get access to any model without VRAM planning.
For teams: use vLLM with one primary model. Route different tasks to the same model rather than running multiple models.
For power users who need zero-latency switching: pick two models that fit in VRAM together (check the table above) and keep both loaded with OLLAMA_KEEP_ALIVE.
Related: How Much VRAM for AI · Ollama Complete Guide · Serve LLMs with vLLM · Multi-Model Architecture · GPU Memory Planning · Best AI Models Under 16GB VRAM · Quantization Trade-offs · CPU vs GPU LLM Inference