Apr 21, 2026

Last updated on Apr 19, 2026

Devstral Small 2 Guide — Mistral's 24B Coding Model You Can Run Locally

📢 Update: Mistral Medium 3.5 has replaced Devstral 2 as the default model in Vibe CLI. See the Medium 3.5 complete guide and Vibe 2.0 remote agents guide.

Devstral Small 2 is the consumer-friendly version of Devstral 2. At 24B parameters, it runs on a single RTX 4090 or a Mac with 32GB RAM while keeping the massive 256K context window. It’s designed specifically for agentic coding — multi-file edits, refactoring, and autonomous task completion.

Specs

Spec	Devstral Small 2	Devstral 2
Parameters	24B	123B
Context	256K	256K
SWE-bench Verified	~58%	72.2%
HumanEval	~82%	~88%
VRAM (Q4)	~14GB	~65GB
VRAM (Q8)	~26GB	~130GB
License	Modified MIT	Modified MIT

Benchmarks in detail

Devstral Small 2 punches well above its weight class for a 24B model:

SWE-bench Verified: ~58%. The key agentic benchmark, measuring whether the model can autonomously fix real GitHub issues. For comparison, GPT-4o scored ~38% and Claude 3.5 Sonnet scored ~49% at launch, so 58% from a 24B local model is remarkable.
HumanEval: ~82%. A standard code generation benchmark. Solid, but not the primary strength: this model is optimized for multi-step agent tasks, not single-function generation.
Multi-file editing: Strong. The 256K context window lets it hold entire project structures in context. Combined with its agentic training, it handles cross-file refactors that smaller-context models struggle with.
Instruction following: Very good. Trained specifically to follow the tool-use patterns and structured output formats that coding agents require.

The gap between Devstral Small 2 (~58% on SWE-bench Verified) and the full Devstral 2 (72.2%) is significant but expected given the roughly 5x parameter difference (24B vs 123B). For local use without GPU clusters, Small 2 has the highest SWE-bench Verified score of the local models compared in this guide.

How to run locally with Ollama

The easiest way to run Devstral Small 2 is with Ollama:

# Pull the model (downloads ~14GB for Q4 quantization)
ollama pull devstral-small:24b

# Run interactively
ollama run devstral-small:24b

# Or start as a server for tool integrations
ollama serve

Hardware requirements

Setup	Quantization	VRAM needed	Performance
RTX 4090 (24GB)	Q4_K_M	~14GB	Fast, room for KV cache
RTX 3090 (24GB)	Q4_K_M	~14GB	Good, slightly slower
Mac M2/M3/M4 (32GB)	Q4_K_M	~14GB unified	Good, uses Metal
Mac M2/M3/M4 (64GB)	Q8_0	~26GB unified	Best quality
2x RTX 3090	Q8_0	~26GB split	Best quality, fast

For the best balance of quality and speed, use Q4_K_M quantization on a 24GB GPU. If you have 64GB unified memory on a Mac, go with Q8_0 for noticeably better output quality.
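The VRAM figures in the table follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming about 4.5 effective bits per weight for Q4-class quantization and 8.5 for Q8_0. Both values are approximations, the function name is hypothetical, and the estimate covers weights only, so KV cache and runtime overhead come on top.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed effective bits per weight; real quantized files vary by tensor mix.
ASSUMED_BITS = {"Q4": 4.5, "Q8_0": 8.5}

for name, bits in ASSUMED_BITS.items():
    gb = estimate_weight_memory_gb(24e9, bits)
    print(f"24B at {name}: ~{gb:.1f} GB of weights")
```

The results land close to the ~14GB and ~26GB figures quoted above, which is why a 24GB card is comfortable at Q4 but cannot hold Q8.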

With vLLM (higher throughput)

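pip install vllm
# Serves an OpenAI-compatible API on port 8000 by default.
# --max-model-len caps context at 64K tokens to limit KV-cache memory.
# --quantization awq requires an AWQ-quantized checkpoint; omit it otherwise.
vllm serve mistralai/Devstral-Small-2 \
  --max-model-len 65536 \
  --tensor-parallel-size 1 \
  --quantization awq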

vLLM gives you an OpenAI-compatible API endpoint with better batching and throughput than Ollama, useful if you’re running multiple agent sessions.
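To show what talking to that endpoint looks like, here is a small sketch using only the Python standard library. It assumes vLLM's default port 8000 and the model name passed to `vllm serve` above. The helper names `build_chat_request` and `send` are hypothetical, and the request is built but not sent, so nothing here needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Assumption: vLLM is listening on its default port on this machine.
request = build_chat_request(
    "http://localhost:8000/v1",
    "mistralai/Devstral-Small-2",
    "Explain what this repository's Makefile does.",
)
```

Calling `send(request)` once the server is up returns the reply text. The same builder should work against Ollama by swapping the base URL for `http://localhost:11434/v1` and using the Ollama model tag.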

API setup (Mistral platform)

If you don’t want to run locally, use the Mistral API:

curl https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-small-latest",
    "messages": [{"role": "user", "content": "Refactor this function to use async/await"}],
    "max_tokens": 4096
  }'

API pricing

Model	Input	Output
Devstral Small 2	$0.10/1M tokens	$0.30/1M tokens
Devstral 2 (full)	$2.00/1M tokens	$6.00/1M tokens
Codestral	$0.30/1M tokens	$0.90/1M tokens

Via the API, Devstral Small 2 costs 20x less than the full Devstral 2 on both input and output tokens.
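To turn those per-token prices into a budget, a quick sketch like the one below can help. The prices are copied from the table above, the dictionary keys are just labels for this example (not API model IDs), and the 2M-input / 200K-output workload is a made-up illustration.

```python
# Prices in USD per 1M tokens, copied from the pricing table.
PRICES = {
    "devstral-small-2": (0.10, 0.30),
    "devstral-2": (2.00, 6.00),
    "codestral": (0.30, 0.90),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload given its input and output token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agent session: 2M tokens read, 200K tokens written.
for model in PRICES:
    print(f"{model}: ${request_cost_usd(model, 2_000_000, 200_000):.2f}")
```

Because the input and output prices differ by the same factor, the ratio between the two Devstral models stays at 20x whatever the input/output mix.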

Integration with coding tools

With Aider

# Using Ollama (local)
aider --model ollama/devstral-small:24b

# Using Mistral API
export MISTRAL_API_KEY=your-key
aider --model mistral/devstral-small-latest

With Continue.dev

{
  "models": [
    {
      "provider": "ollama",
      "model": "devstral-small:24b",
      "title": "Devstral Small 2"
    }
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b",
    "title": "Codestral"
  }
}

With any OpenAI-compatible client

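// Requires the official SDK (npm install openai) and an ES module for top-level await
import OpenAI from 'openai';

// Point the client at Ollama's OpenAI-compatible endpoint
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // placeholder; the SDK requires a non-empty value
});

const response = await client.chat.completions.create({
  model: 'devstral-small:24b',
  messages: [{ role: 'user', content: 'Fix the bug in src/auth.ts' }],
  max_tokens: 8192,
});

// Print the model's reply
console.log(response.choices[0].message.content);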

Devstral Small 2 vs Codestral — which to use?

They’re complementary, not competing:

Aspect	Devstral Small 2	Codestral
Primary task	Agentic coding (multi-step)	Code completion (single-step)
FIM support	No	Yes (native)
Multi-file edits	Excellent	Not designed for this
Autocomplete	Mediocre	Best-in-class
Context window	256K	256K
Parameters	24B	22B
VRAM	~14GB	~12GB

Use Devstral Small 2 when you need an agent that can plan, edit multiple files, run commands, and iterate.

Use Codestral when you need fast, accurate tab completions in your IDE with native Fill-in-the-Middle support.

The ideal local setup: run both. Use Devstral Small 2 for your coding agent and Codestral for autocomplete. At Q4 they need about 26GB combined (~14GB + ~12GB), so on a single 24GB GPU you have to swap between them; with 48GB or more you can keep both loaded at once.
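Whether you swap or run both at once comes down to whether the two sets of weights, plus some working memory, fit together. Here is a rough sketch using the Q4 figures from the comparison table. The 4GB headroom for KV cache is an assumed placeholder rather than a measured value, and the function name is hypothetical.

```python
def can_run_together(vram_gb: float, model_sizes_gb: list[float],
                     headroom_gb: float = 4.0) -> bool:
    """True if all models plus KV-cache headroom fit in memory at once."""
    return sum(model_sizes_gb) + headroom_gb <= vram_gb


MODELS_GB = [14.0, 12.0]  # Devstral Small 2 and Codestral at Q4, per the table

for vram in (24, 48):
    mode = "simultaneously" if can_run_together(vram, MODELS_GB) else "by swapping"
    print(f"{vram}GB: run both {mode}")
```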

Devstral Small 2 vs other local coding models

Model	VRAM (Q4)	Context	SWE-bench	Best for
Devstral Small 2 24B	14GB	256K	~58%	Agentic coding
Codestral 22B	12GB	256K	N/A	FIM/tab completion
Qwen 3.5 27B	16GB	128K	~45%	General + coding
Gemma 4 27B	16GB	128K	~40%	General + coding

Devstral Small 2's advantages: the highest SWE-bench score in this comparison, a 256K context window (double that of Qwen and Gemma), and purpose-built agentic training.

Tips for best results

Use Q4_K_M quantization — best speed/quality tradeoff for 24GB GPUs
Set temperature to 0.1-0.3 for coding tasks
Provide full file context — the 256K window is there, use it
Use structured prompts — Devstral responds best to clear task descriptions with file paths
Pair with Codestral for autocomplete — they complement each other perfectly
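As an illustration of the temperature and structured-prompt tips, the sketch below assembles a task description with explicit file paths and wraps it in a chat-completions style payload. The helper names, the `--- path ---` delimiter, and the sample files are all hypothetical; the payload shape matches the API examples earlier in this guide.

```python
def build_task_prompt(task: str, files: dict[str, str]) -> str:
    """Combine a task description with full file contents, keyed by path."""
    sections = [f"Task: {task}", "", "Files:"]
    for path, content in files.items():
        sections.append(f"--- {path} ---")
        sections.append(content.rstrip())
    return "\n".join(sections)


def build_payload(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Chat-completions style payload with a low temperature for coding."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


prompt = build_task_prompt(
    "Rename get_user to fetch_user and update all call sites.",
    {
        "src/users.py": "def get_user(user_id):\n    return db[user_id]\n",
        "src/api.py": "from users import get_user\n",
    },
)
payload = build_payload("devstral-small-latest", prompt)
```

Keeping the task on the first line and labeling each file with its path gives the model an unambiguous target for every edit it proposes.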

FAQ

Is Devstral Small 2 free?

Yes, for local use. Devstral Small 2 is released under a Modified MIT license that allows commercial use. You can download and run it via Ollama, vLLM, or any compatible inference engine at no cost. If you prefer not to run it locally, the Mistral API charges $0.10/$0.30 per million input/output tokens — one of the cheapest coding model APIs available.

Can I run it locally?

Yes, and it’s specifically designed for local deployment. You need approximately 14GB of VRAM for Q4 quantization, which fits on an RTX 4090, RTX 3090, or any Mac with 32GB+ unified memory. Install Ollama, run ollama pull devstral-small:24b, and you’re ready. For higher quality output, use Q8 quantization with ~26GB VRAM.

How does it compare to Codestral?

They serve different purposes. Devstral Small 2 is an agentic coding model — it excels at multi-step tasks like refactoring across files, fixing bugs autonomously, and following complex instructions. Codestral is a code completion model with native Fill-in-the-Middle (FIM) support — it’s optimized for fast, accurate autocomplete in your IDE. The ideal setup is running both: Devstral Small for your coding agent and Codestral for tab completions.

What’s the context window?

256K tokens, one of the largest context windows available in a local model. At roughly 370-510 tokens per file, that is about 500-700 small source files in context at once; larger files reduce the count proportionally. For comparison, most competing models in the 24B range offer 128K or less. The large context is critical for agentic coding, where the model needs to understand relationships across an entire codebase to make correct edits.
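How many files actually fit depends heavily on the average file size. The quick sketch below makes that arithmetic explicit. It treats "256K" as 256,000 tokens (the exact limit may differ), and the per-file token counts are assumed examples, not measurements.

```python
CONTEXT_TOKENS = 256_000  # "256K" taken as 256,000; the exact limit may differ


def files_that_fit(avg_tokens_per_file: int, reserved_tokens: int = 0) -> int:
    """Number of whole files that fit after reserving room for the reply."""
    return (CONTEXT_TOKENS - reserved_tokens) // avg_tokens_per_file


for avg in (400, 1_000, 3_000):
    print(f"{avg} tokens/file: {files_that_fit(avg)} files")
```

At 400 tokens per file this gives 640 files, in line with the 500-700 estimate; at 3,000 tokens per file the count drops to 85, so larger codebases still need some file selection.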