Jun 7, 2026 · 4 min read

How to Run Devstral 2 Locally: Setup Guide for Mistral's Coding Model (2026)

Devstral 2 is Mistral AI’s open-weight coding model — purpose-built for software engineering tasks with strong performance on SWE-bench and deep understanding of code semantics. It is fully open-weight, available on Hugging Face, and runs locally on consumer hardware.

Unlike general-purpose models that also do coding, Devstral 2 was trained specifically for code — meaning it excels at code completion, refactoring, bug fixing, and explaining code patterns. This guide covers everything you need to run it locally.

Hardware requirements

Devstral 2 is estimated at ~50B parameters. Here are the requirements by quantization:

Quantization	Memory needed	Hardware	Speed (est.)
FP16	~100GB	2× A100, Mac Studio 128GB	15-25 t/s
Q8	~50GB	1× A100, Mac Studio 64GB	20-35 t/s
Q6_K	~38GB	RTX 5090 (32GB + overflow), Mac Studio 64GB	25-40 t/s
Q4_K_M	~28GB	RTX 4090 (24GB + some CPU), Mac Studio 36GB+	30-50 t/s
Q3_K	~22GB	RTX 4090 24GB	35-55 t/s

The sweet spot is Q4_K_M at ~28GB — fits on a single RTX 4090 or comfortably on a Mac with 36GB+ unified memory.

Setup with Ollama (easiest)

# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Devstral 2
ollama pull devstral2

# Run interactively
ollama run devstral2

# Or start as a server
ollama serve
# Then use API at http://localhost:11434

Ollama handles quantization, memory management, and GPU offloading automatically. It will use the Q4_K_M quantization by default.

For Ollama setup details, see our Ollama complete guide.

Setup with llama.cpp (more control)

# Download GGUF from Hugging Face
huggingface-cli download mistralai/Devstral-2-GGUF \
  devstral-2-Q4_K_M.gguf \
  --local-dir ./models/

# Run server
./llama-server \
  -m ./models/devstral-2-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --port 8080

For GPU-specific optimizations, see our Ollama vs llama.cpp vs vLLM comparison.

Setup with vLLM (production serving)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Devstral-2 \
  --max-model-len 32768 \
  --port 8000

vLLM provides better throughput for multiple concurrent requests and automatic batching.

Integration with coding tools

With Aider

# Via Ollama
aider --model ollama/devstral2

# Via llama.cpp server
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/devstral-2

With Continue (VS Code)

Add to .continue/config.json:

{
  "models": [{
    "title": "Devstral 2 (Local)",
    "provider": "ollama",
    "model": "devstral2"
  }]
}

With OpenCode

opencode run -m ollama/devstral2 --dangerously-skip-permissions "fix the bug in auth.ts"

Devstral 2 vs other local coding models

Model	Size	Memory (Q4)	Coding strength	Best for
Devstral 2	~50B	~28GB	Strong (Mistral coding DNA)	Code-specific tasks
Qwen 3.6 27B	27B	~16GB	Strong (general + coding)	All-around local model
Qwen 3.7 27B	27B	~16GB	Stronger (latest Qwen)	Latest quality, less RAM
Granite 4.1 34B	34B	~20GB	Good (enterprise/tool use)	Enterprise workflows
Mistral Medium 3.5	~40B	~24GB	Strong (Mistral general)	General + coding
Llama 4 Scout	109B (17B active)	~60GB	Good (broad knowledge)	Large knowledge base

Devstral 2’s advantage: it is purpose-built for code, not a general model that also codes. This means better code completion, more idiomatic suggestions, and stronger understanding of code patterns — at the cost of weaker general knowledge and conversation.

When to use Devstral 2 locally vs API models

Scenario	Local Devstral 2	API (DeepSeek/MiMo)
Privacy (no data leaves machine)	✅	❌
Cost (>$50/mo API spend)	✅ Cheaper long-term	Pay per token
Quality (complex tasks)	Good	Better (larger models)
Speed (short context)	✅ Fast locally	Network latency
Offline use	✅	❌
Setup complexity	Medium	Easy (one API key)

For privacy-sensitive codebases or offline environments, local Devstral 2 is excellent. For maximum coding quality, API models like DeepSeek V4-Pro (80.6% SWE-bench) or Claude Opus 4.8 (69.2% SWE-bench Pro) are stronger.

Performance tips

GPU offloading — Always use -ngl 99 (llama.cpp) to offload all layers to GPU. CPU inference is 5-10× slower.
Context window — Devstral 2 supports up to 128K context but performance degrades. Keep context under 32K for best speed.
Temperature — Use 0.1-0.3 for code generation (lower = more deterministic). Use 0.6-0.8 for creative coding suggestions.
System prompt — A coding-focused system prompt (“You are a senior software engineer”) improves output quality vs generic chat.

FAQ

Can I run Devstral 2 on a MacBook Pro?

With 36GB+ unified memory: yes at Q4_K_M. With 16GB: no (model won’t fit). With 24GB: tight but possible at Q3_K with limited context.

How does it compare to running Qwen 3.6 27B locally?

Qwen 3.6 27B needs only 16GB at Q4 and is faster (smaller). Devstral 2 needs 28GB and is slower but may produce better code-specific output due to its specialized training. Qwen is the safer all-around choice; Devstral is the specialist.

Is Devstral 2 the same as Devstral Small 2?

No. Devstral Small 2 is a smaller variant (~14B) that runs on laptops with 8GB+ RAM. Devstral 2 is the full-size model (~50B) with better quality but higher hardware requirements.

Can I use it on RTX Spark?

Yes. RTX Spark (128GB unified memory) will run Devstral 2 at FP16 with room to spare. It is well within RTX Spark’s capability. See best LLMs for RTX Spark.

What’s the Mistral API alternative?

If you don’t want to self-host, Devstral 2 is available via Mistral’s API and OpenRouter. The API provides faster inference and larger context. See our Devstral 2 API guide.