How to Run Devstral 2 Locally: Setup Guide for Mistral's Coding Model (2026)
Devstral 2 is Mistral AIβs open-weight coding model β purpose-built for software engineering tasks with strong performance on SWE-bench and deep understanding of code semantics. It is fully open-weight, available on Hugging Face, and runs locally on consumer hardware.
Unlike general-purpose models that also do coding, Devstral 2 was trained specifically for code β meaning it excels at code completion, refactoring, bug fixing, and explaining code patterns. This guide covers everything you need to run it locally.
Hardware requirements
Devstral 2 is estimated at ~50B parameters. Here are the requirements by quantization:
| Quantization | Memory needed | Hardware | Speed (est.) |
|---|---|---|---|
| FP16 | ~100GB | 2Γ A100, Mac Studio 128GB | 15-25 t/s |
| Q8 | ~50GB | 1Γ A100, Mac Studio 64GB | 20-35 t/s |
| Q6_K | ~38GB | RTX 5090 (32GB + overflow), Mac Studio 64GB | 25-40 t/s |
| Q4_K_M | ~28GB | RTX 4090 (24GB + some CPU), Mac Studio 36GB+ | 30-50 t/s |
| Q3_K | ~22GB | RTX 4090 24GB | 35-55 t/s |
The sweet spot is Q4_K_M at ~28GB β fits on a single RTX 4090 or comfortably on a Mac with 36GB+ unified memory.
Setup with Ollama (easiest)
# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh
# Pull Devstral 2
ollama pull devstral2
# Run interactively
ollama run devstral2
# Or start as a server
ollama serve
# Then use API at http://localhost:11434
Ollama handles quantization, memory management, and GPU offloading automatically. It will use the Q4_K_M quantization by default.
For Ollama setup details, see our Ollama complete guide.
Setup with llama.cpp (more control)
# Download GGUF from Hugging Face
huggingface-cli download mistralai/Devstral-2-GGUF \
devstral-2-Q4_K_M.gguf \
--local-dir ./models/
# Run server
./llama-server \
-m ./models/devstral-2-Q4_K_M.gguf \
-c 32768 \
-ngl 99 \
--port 8080
For GPU-specific optimizations, see our Ollama vs llama.cpp vs vLLM comparison.
Setup with vLLM (production serving)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Devstral-2 \
--max-model-len 32768 \
--port 8000
vLLM provides better throughput for multiple concurrent requests and automatic batching.
Integration with coding tools
With Aider
# Via Ollama
aider --model ollama/devstral2
# Via llama.cpp server
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/devstral-2
With Continue (VS Code)
Add to .continue/config.json:
{
"models": [{
"title": "Devstral 2 (Local)",
"provider": "ollama",
"model": "devstral2"
}]
}
With OpenCode
opencode run -m ollama/devstral2 --dangerously-skip-permissions "fix the bug in auth.ts"
Devstral 2 vs other local coding models
| Model | Size | Memory (Q4) | Coding strength | Best for |
|---|---|---|---|---|
| Devstral 2 | ~50B | ~28GB | Strong (Mistral coding DNA) | Code-specific tasks |
| Qwen 3.6 27B | 27B | ~16GB | Strong (general + coding) | All-around local model |
| Qwen 3.7 27B | 27B | ~16GB | Stronger (latest Qwen) | Latest quality, less RAM |
| Granite 4.1 34B | 34B | ~20GB | Good (enterprise/tool use) | Enterprise workflows |
| Mistral Medium 3.5 | ~40B | ~24GB | Strong (Mistral general) | General + coding |
| Llama 4 Scout | 109B (17B active) | ~60GB | Good (broad knowledge) | Large knowledge base |
Devstral 2βs advantage: it is purpose-built for code, not a general model that also codes. This means better code completion, more idiomatic suggestions, and stronger understanding of code patterns β at the cost of weaker general knowledge and conversation.
When to use Devstral 2 locally vs API models
| Scenario | Local Devstral 2 | API (DeepSeek/MiMo) |
|---|---|---|
| Privacy (no data leaves machine) | β | β |
| Cost (>$50/mo API spend) | β Cheaper long-term | Pay per token |
| Quality (complex tasks) | Good | Better (larger models) |
| Speed (short context) | β Fast locally | Network latency |
| Offline use | β | β |
| Setup complexity | Medium | Easy (one API key) |
For privacy-sensitive codebases or offline environments, local Devstral 2 is excellent. For maximum coding quality, API models like DeepSeek V4-Pro (80.6% SWE-bench) or Claude Opus 4.8 (69.2% SWE-bench Pro) are stronger.
Performance tips
- GPU offloading β Always use
-ngl 99(llama.cpp) to offload all layers to GPU. CPU inference is 5-10Γ slower. - Context window β Devstral 2 supports up to 128K context but performance degrades. Keep context under 32K for best speed.
- Temperature β Use 0.1-0.3 for code generation (lower = more deterministic). Use 0.6-0.8 for creative coding suggestions.
- System prompt β A coding-focused system prompt (βYou are a senior software engineerβ) improves output quality vs generic chat.
FAQ
Can I run Devstral 2 on a MacBook Pro?
With 36GB+ unified memory: yes at Q4_K_M. With 16GB: no (model wonβt fit). With 24GB: tight but possible at Q3_K with limited context.
How does it compare to running Qwen 3.6 27B locally?
Qwen 3.6 27B needs only 16GB at Q4 and is faster (smaller). Devstral 2 needs 28GB and is slower but may produce better code-specific output due to its specialized training. Qwen is the safer all-around choice; Devstral is the specialist.
Is Devstral 2 the same as Devstral Small 2?
No. Devstral Small 2 is a smaller variant (~14B) that runs on laptops with 8GB+ RAM. Devstral 2 is the full-size model (~50B) with better quality but higher hardware requirements.
Can I use it on RTX Spark?
Yes. RTX Spark (128GB unified memory) will run Devstral 2 at FP16 with room to spare. It is well within RTX Sparkβs capability. See best LLMs for RTX Spark.
Whatβs the Mistral API alternative?
If you donβt want to self-host, Devstral 2 is available via Mistralβs API and OpenRouter. The API provides faster inference and larger context. See our Devstral 2 API guide.