
How to Run Multiple Models on One GPU


One GPU, multiple models. Here’s how to do it without buying more hardware.

When to use multiple models

Not every task needs the same model. Common multi-model setups:

  • Autocomplete + chat: A small fast model (7-9B) for tab completions, a larger model (22-27B) for chat interactions
  • Code + general reasoning: A code-specialized model for implementation, a general model for architecture discussions
  • Draft + review: A cheap model generates code, a better model reviews it
  • Different languages: Specialized models for different programming languages or natural languages

The key insight: you rarely need two models loaded simultaneously. Most workflows are sequential β€” you use one model, then switch to another.

VRAM sharing strategies

Understanding how much VRAM each model needs is the foundation of multi-model serving:

Strategy               VRAM usage              Latency        Best for
Sequential swapping    1 model at a time       2-5s swap      Single developer
Concurrent loading     Sum of all models       None           Multi-user serving
Partial offloading     GPU + CPU split         Variable       Budget setups
LoRA adapters          Base + tiny adapters    <100ms swap    Fine-tuned variants
Speculative decoding   Draft + verify model    None           Speed optimization

The golden rule: if your models fit in VRAM together, load them together. If they don’t, swap them.
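The rule can be written down as a quick check. This is a sketch, not a precise model: the 1.5 GB headroom figure is an assumption standing in for KV cache and runtime overhead, which vary with context length.

```python
def plan_loading(vram_gb, model_sizes_gb, headroom_gb=1.5):
    """Decide whether a set of models can stay resident together.

    headroom_gb reserves space for KV cache and runtime overhead
    (1.5 GB is an assumption; raise it for long contexts).
    """
    if sum(model_sizes_gb) + headroom_gb <= vram_gb:
        return "load concurrently"
    return "swap sequentially"

# 24 GB card: Codestral 22B Q4 (~12 GB) + Qwen 9B Q4 (~5 GB) fit together
print(plan_loading(24, [12, 5]))    # load concurrently
print(plan_loading(24, [16, 12]))   # swap sequentially
```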

Option 1: Model swapping (Ollama)

Ollama automatically loads and unloads models on demand. When you switch models, the previous one is evicted from VRAM.

# Model A loads into VRAM
ollama run qwen3.5:27b "Fix this bug"

# Model A unloaded, Model B loads
ollama run codestral:22b "Complete this function"

Tradeoff: 2-5 second swap time. Fine for single-user dev, not for multi-user serving.

Best for: Local development with Continue.dev (chat model + autocomplete model).

Ollama multi-model serving

Ollama can keep multiple models loaded simultaneously if you have enough VRAM. Two environment variables control this: OLLAMA_KEEP_ALIVE (how long an idle model stays in VRAM) and OLLAMA_MAX_LOADED_MODELS (how many models may be resident at once):

# Keep idle models loaded for 30 minutes (default: 5 minutes)
# and allow two models to stay resident together
OLLAMA_KEEP_ALIVE=30m OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Or set per-request keep-alive
curl http://localhost:11434/api/generate -d '{
  "model": "codestral:22b",
  "prompt": "hello",
  "keep_alive": "30m"
}'

To load two models concurrently on a 24GB GPU:

# Load a small autocomplete model (5GB)
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:9b-q4", "keep_alive": "60m", "prompt": ""}'

# Load a code model alongside it (12GB)
curl http://localhost:11434/api/generate -d '{"model": "codestral:22b-q4", "keep_alive": "60m", "prompt": ""}'

Check what’s loaded:

ollama ps
# NAME              SIZE    PROCESSOR
# qwen3.5:9b-q4    5.2GB   100% GPU
# codestral:22b-q4 12.1GB  100% GPU

Option 2: Multi-LoRA (vLLM)

Load one base model and multiple LoRA adapters. Each adapter adds <1GB of VRAM. vLLM supports serving multiple LoRA adapters simultaneously.

python -m vllm.entrypoints.openai.api_server \
  --model base-model \
  --enable-lora \
  --lora-modules adapter1=/path/to/lora1 adapter2=/path/to/lora2

Tradeoff: All adapters share the base model’s quality ceiling. Only works for fine-tuned variants, not completely different models.

Best for: Serving multiple fine-tuned versions (one per customer, one per task).
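Each adapter is addressed by the name given in --lora-modules, so a client selects one per request via the model field of an ordinary OpenAI-style completion call. A minimal payload builder, assuming the adapter names from the command above:

```python
def lora_payload(adapter: str, prompt: str, max_tokens: int = 128) -> dict:
    # The "model" field selects the LoRA adapter by its registered name
    # (adapter1/adapter2 above); vLLM applies it on top of the shared base.
    return {"model": adapter, "prompt": prompt, "max_tokens": max_tokens}

payload = lora_payload("adapter1", "def fibonacci(n):")
# POST this JSON to http://localhost:8000/v1/completions to run the adapter
```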

Option 3: vLLM model routing

For serving genuinely different models, vLLM supports multi-model serving with request routing. This follows the multi-model architecture pattern:

# Run two vLLM instances on the same GPU with memory limits
# Instance 1: code model (55% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
  --model codestral-22b \
  --gpu-memory-utilization 0.55 \
  --port 8001

# Instance 2: chat model (35% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
  --model qwen3.5-9b \
  --gpu-memory-utilization 0.35 \
  --port 8002

Then route requests with a simple proxy:

import httpx

MODELS = {
    "autocomplete": ("http://localhost:8001", "codestral-22b"),  # code model
    "chat": ("http://localhost:8002", "qwen3.5-9b"),             # chat model
}

async def route_request(task: str, prompt: str):
    base_url, model = MODELS.get(task, MODELS["chat"])
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{base_url}/v1/completions", json={
            "model": model,  # vLLM expects the served model's name here
            "prompt": prompt,
            "max_tokens": 256
        })
    return resp.json()

Option 4: Smaller models that fit together

Choose models that fit in VRAM simultaneously:

GPU                   Combo that fits
RTX 4090 (24GB)       Codestral 22B Q4 (12GB) + Qwen 9B Q4 (5GB)
Mac M4 32GB           Devstral Small 24B Q4 (14GB) + Qwen 9B Q4 (5GB)
2x RTX 4090 (48GB)    Qwen 27B Q4 (16GB) + Codestral 22B Q4 (12GB)

See our GPU memory planning guide for exact calculations.
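The combos above follow from a rough rule of thumb: weight memory is parameters times bits-per-weight divided by 8, plus roughly 20% for KV cache and runtime overhead. A hedged estimator (the 1.2 overhead factor is an assumption, not a measured constant):

```python
def est_vram_gb(params_billions: float, bits_per_weight: int,
                overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% runtime overhead."""
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

print(est_vram_gb(22, 4))  # ~13.2 GB, in line with the 12 GB quoted for a 22B Q4 model
print(est_vram_gb(9, 4))   # ~5.4 GB
```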

Option 5: CPU offloading

Load part of the model on GPU, part on CPU RAM. Slower but fits larger models.

# llama.cpp: offload 30 of 40 layers to GPU, rest on CPU
./llama-server -m model.gguf -ngl 30

Tradeoff: 2-5x slower than full GPU. Only viable for light usage.
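Picking the -ngl value by trial and error works, but you can estimate it: divide the VRAM you are willing to spend by the per-layer size of the model. A sketch, where the 2 GB reserve for KV cache and scratch buffers is an assumption:

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """Estimate llama.cpp's -ngl: how many layers fit in available VRAM.

    reserve_gb keeps room for the KV cache and scratch buffers
    (2 GB is an assumption; raise it for long contexts).
    """
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# A 20 GB, 40-layer model with 17 GB of VRAM available -> offload 30 layers
print(layers_on_gpu(20, 40, 17))  # 30
```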

Practical examples

Developer workstation (RTX 4090, 24GB)

The most common setup for a solo developer using Continue.dev:

{
  "models": [
    {"provider": "ollama", "model": "qwen3.5:27b-q4", "title": "Chat"}
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b-q4",
    "title": "Autocomplete"
  }
}

Ollama swaps between them automatically. The autocomplete model loads in ~2 seconds when you start typing, and the chat model loads when you open the chat panel.

Team server (A100 80GB)

For a small team sharing one GPU:

# Run vLLM with a single large model
python -m vllm.entrypoints.openai.api_server \
  --model qwen3.5-72b-awq \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --port 8000

One powerful model serving multiple users is more efficient than multiple smaller models. Route all tasks (code, chat, review) to the same model.
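Team members then point their editors at the shared endpoint instead of a local Ollama. A Continue.dev config sketch: the hostname and port are assumptions for your network, and Continue's "openai" provider works against any OpenAI-compatible server such as vLLM:

```json
{
  "models": [
    {
      "provider": "openai",
      "apiBase": "http://gpu-server:8000/v1",
      "model": "qwen3.5-72b-awq",
      "title": "Team Chat"
    }
  ]
}
```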

The practical recommendation

For most developers: use Ollama with model swapping. The 2-5 second swap is barely noticeable, and you get access to any model without VRAM planning.

For teams: use vLLM with one primary model. Route different tasks to the same model rather than running multiple models.

For power users who need zero-latency switching: pick two models that fit in VRAM together (check the table above) and keep both loaded with OLLAMA_KEEP_ALIVE.

Related: How Much VRAM for AI Β· Ollama Complete Guide Β· Serve LLMs with vLLM Β· Multi-Model Architecture Β· GPU Memory Planning Β· Best AI Models Under 16GB VRAM Β· Quantization Trade-offs Β· CPU vs GPU LLM Inference