πŸ“ Tutorials
Β· 4 min read

How to Run Devstral 2 Locally: Setup Guide for Mistral's Coding Model (2026)


Devstral 2 is Mistral AI’s open-weight coding model β€” purpose-built for software engineering tasks with strong performance on SWE-bench and deep understanding of code semantics. It is fully open-weight, available on Hugging Face, and runs locally on consumer hardware.

Unlike general-purpose models that also do coding, Devstral 2 was trained specifically for code β€” meaning it excels at code completion, refactoring, bug fixing, and explaining code patterns. This guide covers everything you need to run it locally.

Hardware requirements

Devstral 2 is estimated at ~50B parameters. Here are the requirements by quantization:

QuantizationMemory neededHardwareSpeed (est.)
FP16~100GB2Γ— A100, Mac Studio 128GB15-25 t/s
Q8~50GB1Γ— A100, Mac Studio 64GB20-35 t/s
Q6_K~38GBRTX 5090 (32GB + overflow), Mac Studio 64GB25-40 t/s
Q4_K_M~28GBRTX 4090 (24GB + some CPU), Mac Studio 36GB+30-50 t/s
Q3_K~22GBRTX 4090 24GB35-55 t/s

The sweet spot is Q4_K_M at ~28GB β€” fits on a single RTX 4090 or comfortably on a Mac with 36GB+ unified memory.

Setup with Ollama (easiest)

# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Devstral 2
ollama pull devstral2

# Run interactively
ollama run devstral2

# Or start as a server
ollama serve
# Then use API at http://localhost:11434

Ollama handles quantization, memory management, and GPU offloading automatically. It will use the Q4_K_M quantization by default.

For Ollama setup details, see our Ollama complete guide.

Setup with llama.cpp (more control)

# Download GGUF from Hugging Face
huggingface-cli download mistralai/Devstral-2-GGUF \
  devstral-2-Q4_K_M.gguf \
  --local-dir ./models/

# Run server
./llama-server \
  -m ./models/devstral-2-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --port 8080

For GPU-specific optimizations, see our Ollama vs llama.cpp vs vLLM comparison.

Setup with vLLM (production serving)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Devstral-2 \
  --max-model-len 32768 \
  --port 8000

vLLM provides better throughput for multiple concurrent requests and automatic batching.

Integration with coding tools

With Aider

# Via Ollama
aider --model ollama/devstral2

# Via llama.cpp server
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
aider --model openai/devstral-2

With Continue (VS Code)

Add to .continue/config.json:

{
  "models": [{
    "title": "Devstral 2 (Local)",
    "provider": "ollama",
    "model": "devstral2"
  }]
}

With OpenCode

opencode run -m ollama/devstral2 --dangerously-skip-permissions "fix the bug in auth.ts"

Devstral 2 vs other local coding models

ModelSizeMemory (Q4)Coding strengthBest for
Devstral 2~50B~28GBStrong (Mistral coding DNA)Code-specific tasks
Qwen 3.6 27B27B~16GBStrong (general + coding)All-around local model
Qwen 3.7 27B27B~16GBStronger (latest Qwen)Latest quality, less RAM
Granite 4.1 34B34B~20GBGood (enterprise/tool use)Enterprise workflows
Mistral Medium 3.5~40B~24GBStrong (Mistral general)General + coding
Llama 4 Scout109B (17B active)~60GBGood (broad knowledge)Large knowledge base

Devstral 2’s advantage: it is purpose-built for code, not a general model that also codes. This means better code completion, more idiomatic suggestions, and stronger understanding of code patterns β€” at the cost of weaker general knowledge and conversation.

When to use Devstral 2 locally vs API models

ScenarioLocal Devstral 2API (DeepSeek/MiMo)
Privacy (no data leaves machine)βœ…βŒ
Cost (>$50/mo API spend)βœ… Cheaper long-termPay per token
Quality (complex tasks)GoodBetter (larger models)
Speed (short context)βœ… Fast locallyNetwork latency
Offline useβœ…βŒ
Setup complexityMediumEasy (one API key)

For privacy-sensitive codebases or offline environments, local Devstral 2 is excellent. For maximum coding quality, API models like DeepSeek V4-Pro (80.6% SWE-bench) or Claude Opus 4.8 (69.2% SWE-bench Pro) are stronger.

Performance tips

  1. GPU offloading β€” Always use -ngl 99 (llama.cpp) to offload all layers to GPU. CPU inference is 5-10Γ— slower.
  2. Context window β€” Devstral 2 supports up to 128K context but performance degrades. Keep context under 32K for best speed.
  3. Temperature β€” Use 0.1-0.3 for code generation (lower = more deterministic). Use 0.6-0.8 for creative coding suggestions.
  4. System prompt β€” A coding-focused system prompt (β€œYou are a senior software engineer”) improves output quality vs generic chat.

FAQ

Can I run Devstral 2 on a MacBook Pro?

With 36GB+ unified memory: yes at Q4_K_M. With 16GB: no (model won’t fit). With 24GB: tight but possible at Q3_K with limited context.

How does it compare to running Qwen 3.6 27B locally?

Qwen 3.6 27B needs only 16GB at Q4 and is faster (smaller). Devstral 2 needs 28GB and is slower but may produce better code-specific output due to its specialized training. Qwen is the safer all-around choice; Devstral is the specialist.

Is Devstral 2 the same as Devstral Small 2?

No. Devstral Small 2 is a smaller variant (~14B) that runs on laptops with 8GB+ RAM. Devstral 2 is the full-size model (~50B) with better quality but higher hardware requirements.

Can I use it on RTX Spark?

Yes. RTX Spark (128GB unified memory) will run Devstral 2 at FP16 with room to spare. It is well within RTX Spark’s capability. See best LLMs for RTX Spark.

What’s the Mistral API alternative?

If you don’t want to self-host, Devstral 2 is available via Mistral’s API and OpenRouter. The API provides faster inference and larger context. See our Devstral 2 API guide.