
How to Run Multiple Models on One GPU


One GPU, multiple models. Here’s how to do it without buying more hardware.

When to use multiple models

Not every task needs the same model. Common multi-model setups:

  • Autocomplete + chat: A small fast model (7-9B) for tab completions, a larger model (22-27B) for chat interactions
  • Code + general reasoning: A code-specialized model for implementation, a general model for architecture discussions
  • Draft + review: A cheap model generates code, a better model reviews it
  • Different languages: Specialized models for different programming languages or natural languages

The key insight: you rarely need two models loaded simultaneously. Most workflows are sequential β€” you use one model, then switch to another.

VRAM sharing strategies

Understanding how much VRAM each model needs is the foundation of multi-model serving:

Strategy               VRAM usage              Latency        Best for
Sequential swapping    1 model at a time       2-5s swap      Single developer
Concurrent loading     Sum of all models       None           Multi-user serving
Partial offloading     GPU + CPU split         Variable       Budget setups
LoRA adapters          Base + tiny adapters    <100ms swap    Fine-tuned variants
Speculative decoding   Draft + verify model    None           Speed optimization

The golden rule: if your models fit in VRAM together, load them together. If they don’t, swap them.
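The rule can be written down as a quick check. This is a sketch, not a precise model: the 1.5 GB headroom figure is an assumption standing in for KV cache and runtime overhead, which vary with context length.

```python
def plan_loading(vram_gb, model_sizes_gb, headroom_gb=1.5):
    """Decide whether a set of models can stay resident together.

    headroom_gb reserves space for KV cache and runtime overhead
    (1.5 GB is an assumption; raise it for long contexts).
    """
    if sum(model_sizes_gb) + headroom_gb <= vram_gb:
        return "load concurrently"
    return "swap sequentially"

# 24 GB card: Codestral 22B Q4 (~12 GB) + Qwen 9B Q4 (~5 GB) fit together
print(plan_loading(24, [12, 5]))    # load concurrently
print(plan_loading(24, [16, 12]))   # swap sequentially
```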

Option 1: Model swapping (Ollama)

Ollama automatically loads and unloads models on demand. When you switch models, the previous one is evicted from VRAM.

# Model A loads into VRAM
ollama run qwen3.5:27b "Fix this bug"

# Model A unloaded, Model B loads
ollama run codestral:22b "Complete this function"

Tradeoff: 2-5 second swap time. Fine for single-user dev, not for multi-user serving.

Best for: Local development with Continue.dev (chat model + autocomplete model).

Ollama multi-model serving

Ollama can keep multiple models loaded simultaneously if you have enough VRAM. Two environment variables control this: OLLAMA_KEEP_ALIVE (how long an idle model stays in VRAM) and OLLAMA_MAX_LOADED_MODELS (how many models may be resident at once):

# Keep idle models loaded for 30 minutes (default: 5 minutes)
# and allow two models to stay resident together
OLLAMA_KEEP_ALIVE=30m OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Or set per-request keep-alive
curl http://localhost:11434/api/generate -d '{
  "model": "codestral:22b",
  "prompt": "hello",
  "keep_alive": "30m"
}'

To load two models concurrently on a 24GB GPU:

# Load a small autocomplete model (5GB)
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:9b-q4", "keep_alive": "60m", "prompt": ""}'

# Load a code model alongside it (12GB)
curl http://localhost:11434/api/generate -d '{"model": "codestral:22b-q4", "keep_alive": "60m", "prompt": ""}'

Check what’s loaded:

ollama ps
# NAME              SIZE    PROCESSOR
# qwen3.5:9b-q4    5.2GB   100% GPU
# codestral:22b-q4 12.1GB  100% GPU

Option 2: Multi-LoRA (vLLM)

Load one base model and multiple LoRA adapters. Each adapter adds <1GB of VRAM. vLLM supports serving multiple LoRA adapters simultaneously.

python -m vllm.entrypoints.openai.api_server \
  --model base-model \
  --enable-lora \
  --lora-modules adapter1=/path/to/lora1 adapter2=/path/to/lora2

Tradeoff: All adapters share the base model’s quality ceiling. Only works for fine-tuned variants, not completely different models.

Best for: Serving multiple fine-tuned versions (one per customer, one per task).
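Each adapter is addressed by the name given in --lora-modules, so a client selects one per request via the model field of an ordinary OpenAI-style completion call. A minimal payload builder, assuming the adapter names from the command above:

```python
def lora_payload(adapter: str, prompt: str, max_tokens: int = 128) -> dict:
    # The "model" field selects the LoRA adapter by its registered name
    # (adapter1/adapter2 above); vLLM applies it on top of the shared base.
    return {"model": adapter, "prompt": prompt, "max_tokens": max_tokens}

payload = lora_payload("adapter1", "def fibonacci(n):")
# POST this JSON to http://localhost:8000/v1/completions to run the adapter
```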

Option 3: vLLM model routing

For serving genuinely different models, vLLM supports multi-model serving with request routing. This follows the multi-model architecture pattern:

# Run two vLLM instances on the same GPU with memory limits
# Instance 1: code model (55% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
  --model codestral-22b \
  --gpu-memory-utilization 0.55 \
  --port 8001

# Instance 2: chat model (35% of GPU memory)
python -m vllm.entrypoints.openai.api_server \
  --model qwen3.5-9b \
  --gpu-memory-utilization 0.35 \
  --port 8002

Then route requests with a simple proxy:

import httpx

MODELS = {
    "autocomplete": ("http://localhost:8001", "codestral-22b"),  # code model
    "chat": ("http://localhost:8002", "qwen3.5-9b"),             # chat model
}

async def route_request(task: str, prompt: str):
    base_url, model = MODELS.get(task, MODELS["chat"])
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{base_url}/v1/completions", json={
            "model": model,  # vLLM expects the served model's name here
            "prompt": prompt,
            "max_tokens": 256
        })
    return resp.json()

Option 4: Smaller models that fit together

Choose models that fit in VRAM simultaneously:

GPU                   Combo that fits
RTX 4090 (24GB)       Codestral 22B Q4 (12GB) + Qwen 9B Q4 (5GB)
Mac M4 32GB           Devstral Small 24B Q4 (14GB) + Qwen 9B Q4 (5GB)
2x RTX 4090 (48GB)    Qwen 27B Q4 (16GB) + Codestral 22B Q4 (12GB)

See our GPU memory planning guide for exact calculations.
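The combos above follow from a rough rule of thumb: weight memory is parameters times bits-per-weight divided by 8, plus roughly 20% for KV cache and runtime overhead. A hedged estimator (the 1.2 overhead factor is an assumption, not a measured constant):

```python
def est_vram_gb(params_billions: float, bits_per_weight: int,
                overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% runtime overhead."""
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb * overhead, 1)

print(est_vram_gb(22, 4))  # ~13.2 GB, in line with the 12 GB quoted for a 22B Q4 model
print(est_vram_gb(9, 4))   # ~5.4 GB
```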

Option 5: CPU offloading

Load part of the model on GPU, part on CPU RAM. Slower but fits larger models.

# llama.cpp: offload 30 of 40 layers to GPU, rest on CPU
./llama-server -m model.gguf -ngl 30

Tradeoff: 2-5x slower than full GPU. Only viable for light usage.
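Picking the -ngl value by trial and error works, but you can estimate it: divide the VRAM you are willing to spend by the per-layer size of the model. A sketch, where the 2 GB reserve for KV cache and scratch buffers is an assumption:

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """Estimate llama.cpp's -ngl: how many layers fit in available VRAM.

    reserve_gb keeps room for the KV cache and scratch buffers
    (2 GB is an assumption; raise it for long contexts).
    """
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# A 20 GB, 40-layer model with 17 GB of VRAM available -> offload 30 layers
print(layers_on_gpu(20, 40, 17))  # 30
```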

Practical examples

Developer workstation (RTX 4090, 24GB)

The most common setup for a solo developer using Continue.dev:

{
  "models": [
    {"provider": "ollama", "model": "qwen3.5:27b-q4", "title": "Chat"}
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b-q4",
    "title": "Autocomplete"
  }
}

Ollama swaps between them automatically. The autocomplete model loads in ~2 seconds when you start typing, and the chat model loads when you open the chat panel.

Team server (A100 80GB)

For a small team sharing one GPU:

# Run vLLM with a single large model
python -m vllm.entrypoints.openai.api_server \
  --model qwen3.5-72b-awq \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 8 \
  --port 8000

One powerful model serving multiple users is more efficient than multiple smaller models. Route all tasks (code, chat, review) to the same model.
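Team members then point their editors at the shared endpoint instead of a local Ollama. A Continue.dev config sketch: the hostname and port are assumptions for your network, and Continue's "openai" provider works against any OpenAI-compatible server such as vLLM:

```json
{
  "models": [
    {
      "provider": "openai",
      "apiBase": "http://gpu-server:8000/v1",
      "model": "qwen3.5-72b-awq",
      "title": "Team Chat"
    }
  ]
}
```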

The practical recommendation

For most developers: use Ollama with model swapping. The 2-5 second swap is barely noticeable, and you get access to any model without VRAM planning.

For teams: use vLLM with one primary model. Route different tasks to the same model rather than running multiple models.

For power users who need zero-latency switching: pick two models that fit in VRAM together (check the table above) and keep both loaded with OLLAMA_KEEP_ALIVE.

Related: How Much VRAM for AI Β· Ollama Complete Guide Β· Serve LLMs with vLLM Β· Multi-Model Architecture Β· GPU Memory Planning Β· Best AI Models Under 16GB VRAM Β· Quantization Trade-offs Β· CPU vs GPU LLM Inference