Devstral Small 2 Guide: Mistral's 24B Coding Model You Can Run Locally
Update: Mistral Medium 3.5 has replaced Devstral 2 as the default model in Vibe CLI. See the Medium 3.5 complete guide and Vibe 2.0 remote agents guide.
Devstral Small 2 is the consumer-friendly version of Devstral 2. At 24B parameters, it runs on a single RTX 4090 or a Mac with 32GB RAM while keeping the 256K context window. It's designed specifically for agentic coding: multi-file edits, refactoring, and autonomous task completion.
Specs
| Spec | Devstral Small 2 | Devstral 2 |
|---|---|---|
| Parameters | 24B | 123B |
| Context | 256K | 256K |
| SWE-bench Verified | ~58% | 72.2% |
| HumanEval | ~82% | ~88% |
| VRAM (Q4) | ~14GB | ~65GB |
| VRAM (Q8) | ~26GB | ~130GB |
| License | Modified MIT | Modified MIT |
Benchmarks in detail
Devstral Small 2 punches well above its weight class for a 24B model:
- SWE-bench Verified: ~58%. The key agentic benchmark, measuring whether the model can autonomously fix real GitHub issues. For comparison, GPT-4o scores ~38% and Claude 3.5 Sonnet scored ~49% at launch. A 24B local model hitting 58% is remarkable.
- HumanEval: ~82%. Standard code generation benchmark. Solid but not the primary strength: this model is optimized for multi-step agent tasks, not single-function generation.
- Multi-file editing: Strong. The 256K context window means it can hold entire project structures in context. Combined with its agentic training, it handles cross-file refactors that smaller-context models struggle with.
- Instruction following: Very good. Trained specifically to follow the tool-use patterns and structured output formats that coding agents require.
The gap between Devstral Small 2 (~58% SWE-bench Verified) and the full Devstral 2 (72.2%) is significant but expected given the roughly 5x parameter difference (123B vs 24B). For local use without GPU clusters, Small 2 is the strongest agentic coding option among the models compared in this guide.
How to run locally with Ollama
The easiest way to run Devstral Small 2 is with Ollama:
# Pull the model (downloads ~14GB for Q4 quantization)
ollama pull devstral-small:24b
# Run interactively
ollama run devstral-small:24b
# Or start as a server for tool integrations
ollama serve
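Before pointing a tool at the server, it can help to confirm that the model was actually pulled. The sketch below assumes Ollama is listening on its default port 11434 and uses its `GET /api/tags` endpoint, which lists locally available models; the helper names are hypothetical.

```typescript
// Relevant part of the response from Ollama's GET /api/tags endpoint.
interface TagsResponse {
  models: { name: string }[];
}

// Returns true if `tag` appears among the locally pulled models.
function hasModel(tags: TagsResponse, tag: string): boolean {
  return tags.models.some((m) => m.name === tag);
}

// Asks a running Ollama server whether `tag` is available.
// The default base URL assumes an unmodified local install.
async function isModelPulled(
  tag: string,
  baseUrl = "http://localhost:11434",
): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama responded with HTTP ${res.status}`);
  return hasModel((await res.json()) as TagsResponse, tag);
}
```

Calling `isModelPulled("devstral-small:24b")` should resolve to `true` once the pull has finished.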
Hardware requirements
| Setup | Quantization | VRAM needed | Performance |
|---|---|---|---|
| RTX 4090 (24GB) | Q4_K_M | ~14GB | Fast, room for KV cache |
| RTX 3090 (24GB) | Q4_K_M | ~14GB | Good, slightly slower |
| Mac M2/M3/M4 (32GB) | Q4_K_M | ~14GB unified | Good, uses Metal |
| Mac M2/M3/M4 (64GB) | Q8_0 | ~26GB unified | Best quality |
| 2x RTX 3090 | Q8_0 | ~26GB split | Best quality, fast |
For the best balance of quality and speed, use Q4_K_M quantization on a 24GB GPU. If you have 64GB unified memory on a Mac, go with Q8_0 for noticeably better output quality.
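The VRAM figures in the table can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below assumes average bit widths of about 4.8 for Q4_K_M and 8.5 for Q8_0; these are approximations for llama.cpp-style quantization, not numbers published by Mistral, and real usage adds KV cache and runtime overhead on top.

```typescript
// Assumed average bits stored per weight for each quantization format.
const BITS_PER_WEIGHT = { Q4_K_M: 4.8, Q8_0: 8.5 } as const;
type Quant = keyof typeof BITS_PER_WEIGHT;

// Approximate weight memory in GB (10^9 bytes) for a model
// with `paramsBillions` billion parameters.
function weightGb(paramsBillions: number, quant: Quant): number {
  return (paramsBillions * BITS_PER_WEIGHT[quant]) / 8;
}

// Memory left on the device for KV cache and activations.
function headroomGb(deviceGb: number, paramsBillions: number, quant: Quant): number {
  return deviceGb - weightGb(paramsBillions, quant);
}

console.log(weightGb(24, "Q4_K_M").toFixed(1)); // 14.4
console.log(weightGb(24, "Q8_0").toFixed(1)); // 25.5
console.log(headroomGb(24, 24, "Q4_K_M").toFixed(1)); // 9.6
```

Under these assumptions a 24GB card keeps a little under 10GB free at Q4_K_M, while Q8_0 does not fit on a single 24GB GPU at all, which is consistent with the table recommending two cards or 64GB of unified memory for Q8_0.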
With vLLM (higher throughput)
pip install vllm
vllm serve mistralai/Devstral-Small-2 \
  --max-model-len 65536 \
  --tensor-parallel-size 1 \
  --quantization awq
vLLM gives you an OpenAI-compatible API endpoint with better batching and throughput than Ollama, useful if you're running multiple agent sessions.
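As a rough illustration of what "multiple agent sessions" looks like from the client side, the sketch below sends several independent chat requests concurrently and lets the server batch them. It assumes vLLM is listening on its default port 8000 and that the model name matches the one passed to `vllm serve`; the function names are hypothetical.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
}

// One request body per task. `model` must match the name given to `vllm serve`.
function buildRequests(model: string, tasks: string[], maxTokens = 4096): ChatRequest[] {
  return tasks.map((task): ChatRequest => ({
    model,
    messages: [{ role: "user", content: task }],
    max_tokens: maxTokens,
  }));
}

// Fires all requests at once; the server decides how to batch them.
// The base URL assumes vLLM's default port.
async function runSessions(
  tasks: string[],
  baseUrl = "http://localhost:8000/v1",
): Promise<string[]> {
  const bodies = buildRequests("mistralai/Devstral-Small-2", tasks);
  return Promise.all(
    bodies.map(async (body) => {
      const res = await fetch(`${baseUrl}/chat/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
      if (!res.ok) throw new Error(`vLLM responded with HTTP ${res.status}`);
      const data = (await res.json()) as { choices: { message: { content: string } }[] };
      return data.choices[0]?.message.content ?? "";
    }),
  );
}
```

`Promise.all` rejects as soon as any one request fails, so a production agent runner would probably want per-task error handling instead.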
API setup (Mistral platform)
If you don't want to run locally, use the Mistral API:
curl https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-small-latest",
    "messages": [{"role": "user", "content": "Refactor this function to use async/await"}],
    "max_tokens": 4096
  }'
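The same request can be made from application code. The following is a minimal sketch of the curl call above using `fetch`, with the request builder split out so it can be inspected without hitting the network. The function names are hypothetical, and the response shape assumes the standard chat-completions format with a `choices` array.

```typescript
interface ChatCompletion {
  choices: { message: { role: string; content: string } }[];
}

// Builds the same request as the curl command: URL, headers, and JSON body.
function buildMistralRequest(apiKey: string, prompt: string) {
  return {
    url: "https://api.mistral.ai/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "devstral-small-latest",
        messages: [{ role: "user", content: prompt }],
        max_tokens: 4096,
      }),
    },
  };
}

// Pulls the assistant text out of a chat-completions response.
function extractReply(data: ChatCompletion): string {
  const first = data.choices[0];
  if (!first) throw new Error("response contained no choices");
  return first.message.content;
}

// Sends the request and returns the model's reply.
async function askDevstral(apiKey: string, prompt: string): Promise<string> {
  const { url, init } = buildMistralRequest(apiKey, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Mistral API returned HTTP ${res.status}`);
  return extractReply((await res.json()) as ChatCompletion);
}
```

Passing the key in as an argument, rather than reading it inside the function, keeps the sketch independent of how your runtime exposes environment variables.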
API pricing
| Model | Input | Output |
|---|---|---|
| Devstral Small 2 | $0.10/1M tokens | $0.30/1M tokens |
| Devstral 2 (full) | $2.00/1M tokens | $6.00/1M tokens |
| Codestral | $0.30/1M tokens | $0.90/1M tokens |
Devstral Small 2 via API is extremely cheap: 20x less than the full Devstral 2 on both input and output tokens.
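To see what these rates mean for a real workload, the sketch below turns the pricing table into a small cost calculator. The prices are copied from the table above; the token counts in the example are hypothetical.

```typescript
// USD per one million tokens, taken from the pricing table.
const PRICE_PER_MILLION = {
  "Devstral Small 2": { input: 0.1, output: 0.3 },
  "Devstral 2": { input: 2.0, output: 6.0 },
  Codestral: { input: 0.3, output: 0.9 },
} as const;
type PricedModel = keyof typeof PRICE_PER_MILLION;

// Cost in USD for a given number of input and output tokens.
function costUsd(model: PricedModel, inputTokens: number, outputTokens: number): number {
  const price = PRICE_PER_MILLION[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Hypothetical agent session: 2M tokens read, 200K tokens written.
console.log(costUsd("Devstral Small 2", 2_000_000, 200_000).toFixed(2)); // 0.26
console.log(costUsd("Devstral 2", 2_000_000, 200_000).toFixed(2)); // 5.20
```

Because both input and output prices differ by the same factor, the 20x ratio holds regardless of how a session splits between reading and writing.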
Integration with coding tools
With Aider
# Using Ollama (local)
aider --model ollama/devstral-small:24b
# Using Mistral API
export MISTRAL_API_KEY=your-key
aider --model mistral/devstral-small-latest
With Continue.dev
{
  "models": [
    {
      "provider": "ollama",
      "model": "devstral-small:24b",
      "title": "Devstral Small 2"
    }
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b",
    "title": "Codestral"
  }
}
With any OpenAI-compatible client
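import OpenAI from 'openai';

// Point the OpenAI SDK at Ollama's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // placeholder value; a local Ollama server does not check it
});

const response = await client.chat.completions.create({
  model: 'devstral-small:24b',
  messages: [{ role: 'user', content: 'Fix the bug in src/auth.ts' }],
  max_tokens: 8192,
});
console.log(response.choices[0].message.content);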
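Since Ollama, vLLM, and the Mistral platform all expose a chat-completions style endpoint, switching between them mostly comes down to three settings. The sketch below collects them in one place. The vLLM entry assumes its default port 8000 and uses `EMPTY` as a placeholder key, and the `Backend` type and `clientConfig` helper are hypothetical names.

```typescript
type Backend = "ollama" | "vllm" | "mistral";

interface ClientConfig {
  baseURL: string;
  apiKey: string;
  model: string;
}

// Returns the connection settings for each way of serving Devstral Small 2
// described in this guide.
function clientConfig(backend: Backend, mistralApiKey = ""): ClientConfig {
  switch (backend) {
    case "ollama":
      return {
        baseURL: "http://localhost:11434/v1",
        apiKey: "ollama", // placeholder, not checked by a local server
        model: "devstral-small:24b",
      };
    case "vllm":
      return {
        baseURL: "http://localhost:8000/v1", // assumes vLLM's default port
        apiKey: "EMPTY", // placeholder unless the server was started with a key
        model: "mistralai/Devstral-Small-2",
      };
    case "mistral":
      return {
        baseURL: "https://api.mistral.ai/v1",
        apiKey: mistralApiKey,
        model: "devstral-small-latest",
      };
  }
}
```

The `baseURL` and `apiKey` fields can be passed straight to an OpenAI-compatible client constructor, with `model` supplied on each request.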
Devstral Small 2 vs Codestral: which to use?
They're complementary, not competing:
| Aspect | Devstral Small 2 | Codestral |
|---|---|---|
| Primary task | Agentic coding (multi-step) | Code completion (single-step) |
| FIM support | No | Yes (native) |
| Multi-file edits | Excellent | Not designed for this |
| Autocomplete | Mediocre | Best-in-class |
| Context window | 256K | 256K |
| Parameters | 24B | 22B |
| VRAM | ~14GB | ~12GB |
Use Devstral Small 2 when you need an agent that can plan, edit multiple files, run commands, and iterate.
Use Codestral when you need fast, accurate tab completions in your IDE with native Fill-in-the-Middle support.
The ideal local setup: Run both. Devstral Small for your coding agent, Codestral for autocomplete. They fit on a single 24GB GPU if you swap between them, or run simultaneously on 48GB+.
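The swap-versus-simultaneous rule follows from adding up the two Q4 footprints in the table (~14GB and ~12GB). The sketch below encodes that arithmetic; the `reserveGb` parameter is a hypothetical allowance for KV cache and runtime overhead, which is why more than the bare 26GB is advisable in practice.

```typescript
// Approximate Q4 weight footprints from the comparison table, in GB.
const DEVSTRAL_SMALL_GB = 14;
const CODESTRAL_GB = 12;

type Placement = "simultaneous" | "swap" | "too-small";

// Decides whether both models can stay loaded at once on a device.
// `reserveGb` is an assumed allowance for KV cache and overhead.
function placement(vramGb: number, reserveGb = 0): Placement {
  const usable = vramGb - reserveGb;
  if (usable >= DEVSTRAL_SMALL_GB + CODESTRAL_GB) return "simultaneous";
  if (usable >= Math.max(DEVSTRAL_SMALL_GB, CODESTRAL_GB)) return "swap";
  return "too-small";
}

console.log(placement(24)); // swap
console.log(placement(48)); // simultaneous
console.log(placement(12)); // too-small
```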
Devstral Small vs other local coding models
| Model | VRAM (Q4) | Context | SWE-bench | Best for |
|---|---|---|---|---|
| Devstral Small 24B | 14GB | 256K | ~58% | Agentic coding |
| Codestral 22B | 12GB | 256K | N/A | FIM/tab completion |
| Qwen 3.5 27B | 16GB | 128K | ~45% | General + coding |
| Gemma 4 27B | 16GB | 128K | ~40% | General + coding |
Devstral Small's advantages: highest SWE-bench in its class, 256K context (double Qwen/Gemma), and purpose-built agentic training.
Tips for best results
- Use Q4_K_M quantization: best speed/quality tradeoff for 24GB GPUs
- Set temperature to 0.1-0.3 for coding tasks
- Provide full file context: the 256K window is there, use it
- Use structured prompts: Devstral responds best to clear task descriptions with file paths
- Pair with Codestral for autocomplete: they complement each other perfectly
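Several of these tips can be applied in a single request. The sketch below builds a body for Ollama's native `/api/chat` endpoint with a low temperature, an explicit context size via the `num_ctx` option, and a prompt that labels each file with its path. The `### path` layout and the 32768 default are assumptions chosen for illustration, not a format the model requires.

```typescript
interface SourceFile {
  path: string;
  source: string;
}

interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream: boolean;
  options: { temperature: number; num_ctx: number };
}

// Builds a request body for POST http://localhost:11434/api/chat.
// `numCtx` should be raised when more file context is included.
function buildCodingRequest(
  task: string,
  files: SourceFile[],
  numCtx = 32768,
): OllamaChatRequest {
  // Label every file with its path so the model can refer to it by name.
  const context = files.map((f) => `### ${f.path}\n${f.source}`).join("\n\n");
  return {
    model: "devstral-small:24b",
    messages: [{ role: "user", content: `${task}\n\n${context}` }],
    stream: false,
    options: { temperature: 0.2, num_ctx: numCtx },
  };
}
```

A larger `num_ctx` increases KV cache memory, so the value is worth tuning against the VRAM you have left after loading the weights.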
FAQ
Is Devstral Small 2 free?
Yes, for local use. Devstral Small 2 is released under a Modified MIT license that allows commercial use. You can download and run it via Ollama, vLLM, or any compatible inference engine at no cost. If you prefer not to run it locally, the Mistral API charges $0.10/$0.30 per million input/output tokens, which makes it one of the cheapest coding model APIs available.
Can I run it locally?
Yes, and it's specifically designed for local deployment. You need approximately 14GB of VRAM for Q4 quantization, which fits on an RTX 4090, RTX 3090, or any Mac with 32GB+ unified memory. Install Ollama, run ollama pull devstral-small:24b, and you're ready. For higher quality output, use Q8 quantization with ~26GB VRAM.
How does it compare to Codestral?
They serve different purposes. Devstral Small 2 is an agentic coding model: it excels at multi-step tasks like refactoring across files, fixing bugs autonomously, and following complex instructions. Codestral is a code completion model with native Fill-in-the-Middle (FIM) support, optimized for fast, accurate autocomplete in your IDE. The ideal setup is running both: Devstral Small for your coding agent and Codestral for tab completions.
What's the context window?
256K tokens, one of the largest context windows available in any local model. This means Devstral Small 2 can hold roughly 500-700 files of typical source code in its context simultaneously; that estimate works out to about 370-510 tokens per file, so it assumes fairly short files. For comparison, most competing models in the 24B range offer 128K or less. The large context is critical for agentic coding, where the model needs to understand relationships across an entire codebase to make correct edits.
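How many files actually fit depends on their size, so it can be worth budgeting before building a prompt. The sketch below uses the common rule of thumb of about four characters per token, which is an assumption rather than the model's real tokenizer, and treats 256K as 256,000 tokens. The helper names are hypothetical.

```typescript
const CONTEXT_TOKENS = 256_000; // "256K", taken here as 256,000
const CHARS_PER_TOKEN = 4; // assumed rule of thumb, not the real tokenizer

interface SourceText {
  path: string;
  source: string;
}

// Rough token estimate for a piece of text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Greedily keeps files, in the given order, while they fit in the budget.
// `reserveForOutput` leaves room for the model's reply.
function selectFiles(files: SourceText[], reserveForOutput = 8192): string[] {
  let remaining = CONTEXT_TOKENS - reserveForOutput;
  const kept: string[] = [];
  for (const file of files) {
    const cost = estimateTokens(file.source);
    if (cost > remaining) continue; // skip files that would overflow the window
    kept.push(file.path);
    remaining -= cost;
  }
  return kept;
}
```

Ordering the input by relevance before calling `selectFiles` means the most important files claim the budget first.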
Related: Devstral 2 Complete Guide · Best AI Models for Coding Locally · Ollama Complete Guide