
Devstral Small 2 Guide: Mistral's 24B Coding Model You Can Run Locally


📢 Update: Mistral Medium 3.5 has replaced Devstral 2 as the default model in Vibe CLI. See the Medium 3.5 complete guide and Vibe 2.0 remote agents guide.

Devstral Small 2 is the consumer-friendly version of Devstral 2. At 24B parameters, it runs on a single RTX 4090 or a Mac with 32GB RAM while keeping the massive 256K context window. It’s designed specifically for agentic coding: multi-file edits, refactoring, and autonomous task completion.

Specs

| Spec | Devstral Small 2 | Devstral 2 |
|---|---|---|
| Parameters | 24B | 123B |
| Context | 256K | 256K |
| SWE-bench Verified | ~58% | 72.2% |
| HumanEval | ~82% | ~88% |
| VRAM (Q4) | ~14GB | ~65GB |
| VRAM (Q8) | ~26GB | ~130GB |
| License | Modified MIT | Modified MIT |

Benchmarks in detail

Devstral Small 2 punches well above its weight class for a 24B model:

  • SWE-bench Verified: ~58%. The key agentic benchmark, which measures whether the model can autonomously fix real GitHub issues. For comparison, GPT-4o scores ~38% and Claude 3.5 Sonnet scored ~49% at launch. A 24B local model hitting 58% is remarkable.
  • HumanEval: ~82%. The standard code generation benchmark. Solid but not the primary strength: this model is optimized for multi-step agent tasks, not single-function generation.
  • Multi-file editing: Strong. The 256K context window means it can hold entire project structures in context. Combined with its agentic training, it handles cross-file refactors that smaller-context models struggle with.
  • Instruction following: Very good. Trained specifically to follow the tool-use patterns and structured output formats that coding agents require.

The gap between Devstral Small 2 (~58% SWE-bench Verified) and the full Devstral 2 (72.2%) is significant but expected given the roughly 5x parameter difference (24B vs 123B). For local use without a GPU cluster, Small 2 has the highest SWE-bench score of the local models compared in this guide.

How to run locally with Ollama

The easiest way to run Devstral Small 2 is with Ollama:

# Pull the model (downloads ~14GB for Q4 quantization)
ollama pull devstral-small:24b

# Run interactively
ollama run devstral-small:24b

# Or start as a server for tool integrations
ollama serve

Hardware requirements

| Setup | Quantization | VRAM needed | Performance |
|---|---|---|---|
| RTX 4090 (24GB) | Q4_K_M | ~14GB | Fast, room for KV cache |
| RTX 3090 (24GB) | Q4_K_M | ~14GB | Good, slightly slower |
| Mac M2/M3/M4 (32GB) | Q4_K_M | ~14GB unified | Good, uses Metal |
| Mac M2/M3/M4 (64GB) | Q8_0 | ~26GB unified | Best quality |
| 2x RTX 3090 | Q8_0 | ~26GB split | Best quality, fast |

For the best balance of quality and speed, use Q4_K_M quantization on a 24GB GPU. If you have 64GB unified memory on a Mac, go with Q8_0 for noticeably better output quality.
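The VRAM figures in the table follow from simple arithmetic on parameter count and bits per weight. Here is a minimal sketch of that calculation, assuming average widths of roughly 4.8 bits per weight for Q4_K_M and 8.5 for Q8_0. Those two numbers are approximations of the GGUF formats, not figures published by Mistral, and the function name is illustrative. The result covers weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed average bit widths per weight (approximate, not official figures).
ASSUMED_BITS = {"Q4_K_M": 4.8, "Q8_0": 8.5}

PARAMS_24B = 24e9  # Devstral Small 2 parameter count from the specs table

for name, bits in ASSUMED_BITS.items():
    size = weight_memory_gb(PARAMS_24B, bits)
    print(f"{name}: about {size:.1f} GB for weights")
```

Under these assumptions the weights come to about 14.4 GB and 25.5 GB, in line with the ~14GB and ~26GB in the table. Whatever remains on the card is what the KV cache and runtime can use.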

With vLLM (higher throughput)

pip install vllm
vllm serve mistralai/Devstral-Small-2 \
  --max-model-len 65536 \
  --tensor-parallel-size 1 \
  --quantization awq

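vLLM gives you an OpenAI-compatible API endpoint with better batching and throughput than Ollama, useful if you’re running multiple agent sessions. Two flags in the command are worth checking. --max-model-len 65536 caps the context at 64K tokens rather than the full 256K, which keeps KV-cache memory manageable. --quantization awq expects a checkpoint that has already been quantized with AWQ, so drop the flag or point at an AWQ build if the repository holds full-precision weights.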
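To see why a context cap like --max-model-len matters, it helps to estimate how much memory the KV cache needs per sequence. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The architecture values (40 layers, 8 KV heads, head dimension 128, 16-bit cache) are hypothetical placeholders for illustration, not confirmed numbers for Devstral Small 2; read the real ones from the model's config before relying on the result.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for a single sequence of the given length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


# Hypothetical architecture values, for illustration only.
ASSUMED_ARCH = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for ctx in (65_536, 262_144):
    size = kv_cache_gib(ctx, **ASSUMED_ARCH)
    print(f"{ctx} tokens: about {size:.0f} GiB of KV cache")
```

With these placeholder values, a 64K context needs about 10 GiB of cache and a full 256K context about 40 GiB, on top of the weights. That is the kind of gap that makes a reduced context length a sensible default on a single 24GB card.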

API setup (Mistral platform)

If you don’t want to run locally, use the Mistral API:

curl https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral-small-latest",
    "messages": [{"role": "user", "content": "Refactor this function to use async/await"}],
    "max_tokens": 4096
  }'
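The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl call above: the endpoint, model name, and payload come from that command, while the helper names build_request and send are illustrative. It assumes the response follows the OpenAI-style shape with the text under choices[0].message.content.

```python
import json
import os
import urllib.request

API_URL = "https://api.mistral.ai/v1/chat/completions"


def build_request(prompt: str, api_key: str,
                  model: str = "devstral-small-latest",
                  max_tokens: int = 4096) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (assumed shape)."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    key = os.environ.get("MISTRAL_API_KEY")
    if key:  # only call the API when a key is configured
        print(send(build_request("Refactor this function to use async/await", key)))
```

Separating request construction from sending keeps the payload easy to inspect and log before any tokens are spent.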

API pricing

| Model | Input | Output |
|---|---|---|
| Devstral Small 2 | $0.10/1M tokens | $0.30/1M tokens |
| Devstral 2 (full) | $2.00/1M tokens | $6.00/1M tokens |
| Codestral | $0.30/1M tokens | $0.90/1M tokens |

Devstral Small 2 via API is extremely cheap: 20x less than the full Devstral 2 on both input and output tokens.
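To turn the per-million prices into a per-request figure, multiply each token count by its rate. The sketch below does that for a hypothetical agent request of 50,000 input tokens and 4,000 output tokens; the request size is an assumption chosen for illustration, and the prices are the ones listed in the table.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in USD, given prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# (input, output) prices in USD per 1M tokens, taken from the pricing table.
PRICES = {
    "Devstral Small 2": (0.10, 0.30),
    "Devstral 2 (full)": (2.00, 6.00),
    "Codestral": (0.30, 0.90),
}

# Hypothetical request size for illustration.
INPUT_TOKENS, OUTPUT_TOKENS = 50_000, 4_000

for name, (p_in, p_out) in PRICES.items():
    cost = request_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, p_in, p_out)
    print(f"{name}: ${cost:.4f} per request")
```

For that request size, Devstral Small 2 comes to about $0.0062 and the full Devstral 2 to about $0.124, which is the same 20x ratio as the listed prices.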

Integration with coding tools

With Aider

# Using Ollama (local)
aider --model ollama/devstral-small:24b

# Using Mistral API
export MISTRAL_API_KEY=your-key
aider --model mistral/devstral-small-latest

With Continue.dev

{
  "models": [
    {
      "provider": "ollama",
      "model": "devstral-small:24b",
      "title": "Devstral Small 2"
    }
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "codestral:22b",
    "title": "Codestral"
  }
}

With any OpenAI-compatible client

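import OpenAI from 'openai';

// Point the client at Ollama's OpenAI-compatible endpoint; the key is a placeholder.
const client = new OpenAI({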
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const response = await client.chat.completions.create({
  model: 'devstral-small:24b',
  messages: [{ role: 'user', content: 'Fix the bug in src/auth.ts' }],
  max_tokens: 8192,
});

Devstral Small 2 vs Codestral: which to use?

They’re complementary, not competing:

| Aspect | Devstral Small 2 | Codestral |
|---|---|---|
| Primary task | Agentic coding (multi-step) | Code completion (single-step) |
| FIM support | No | Yes (native) |
| Multi-file edits | Excellent | Not designed for this |
| Autocomplete | Mediocre | Best-in-class |
| Context window | 256K | 256K |
| Parameters | 24B | 22B |
| VRAM | ~14GB | ~12GB |

Use Devstral Small 2 when you need an agent that can plan, edit multiple files, run commands, and iterate.

Use Codestral when you need fast, accurate tab completions in your IDE with native Fill-in-the-Middle support.

The ideal local setup: run both. Use Devstral Small 2 for your coding agent and Codestral for autocomplete. At Q4 the two need about 26GB together (~14GB + ~12GB), so on a single 24GB GPU you have to swap between them; with 48GB+ they can stay loaded simultaneously.
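The "plan, edit, run, iterate" behaviour that separates an agent from a completion model comes down to a loop: the model either asks for a tool or returns a final answer, and each tool result is fed back in. The sketch below shows that loop in its simplest form. Everything in it is hypothetical: the step format, the read_file tool, and the scripted stand-in for the model are illustrations, not Devstral's actual tool-calling protocol.

```python
from typing import Callable


def run_agent(model: Callable[[list], dict],
              tools: dict,
              task: str, max_steps: int = 8) -> str:
    """Run a minimal tool-use loop until the model returns a final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(messages)
        if step["type"] == "final":
            return step["content"]
        # Execute the requested tool and feed its result back to the model.
        result = tools[step["tool"]](**step["args"])
        messages.append({"role": "assistant", "content": f"call {step['tool']}"})
        messages.append({"role": "tool", "content": result})
    return "stopped: step limit reached"


def scripted_model(messages: list) -> dict:
    """Stand-in for a real model call: read one file, then finish."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool", "tool": "read_file",
                "args": {"path": "src/auth.ts"}}
    size = len(tool_results[-1]["content"])
    return {"type": "final", "content": f"Read {size} characters; patch ready."}


# Hypothetical in-memory "file system" and tool set.
files = {"src/auth.ts": "export const login = () => {};"}
tools = {"read_file": lambda path: files[path]}

print(run_agent(scripted_model, tools, "Fix the bug in src/auth.ts"))
```

In a real setup the scripted function would be replaced by a call to the model's chat endpoint, and the step limit guards against loops that never converge.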

Devstral Small 2 vs other local coding models

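| Model | VRAM (Q4) | Context | SWE-bench | Best for |
|---|---|---|---|---|
| Devstral Small 2 (24B) | 14GB | 256K | ~58% | Agentic coding |
| Codestral 22B | 12GB | 256K | N/A | FIM/tab completion |
| Qwen 3.5 27B | 16GB | 128K | ~45% | General + coding |
| Gemma 4 27B | 16GB | 128K | ~40% | General + coding |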

Devstral Small 2’s advantages: the highest SWE-bench score in this comparison, a 256K context (double that of Qwen and Gemma), and purpose-built agentic training.

Tips for best results

  1. Use Q4_K_M quantization: best speed/quality tradeoff for 24GB GPUs
  2. Set temperature to 0.1-0.3 for coding tasks
  3. Provide full file context: the 256K window is there, use it
  4. Use structured prompts: Devstral responds best to clear task descriptions with file paths
  5. Pair with Codestral for autocomplete: the two complement each other well
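Several of these tips can be applied in one place when building a request for a local Ollama server. The sketch below assembles a body for Ollama's /api/chat endpoint with a low temperature, an explicit context length (num_ctx), and a prompt that lists each file under its path. The prompt layout and the helper name are assumptions for illustration, and the 65,536 context value is only an example; pick one that fits your memory.

```python
import json


def build_ollama_chat(task: str, files: dict,
                      model: str = "devstral-small:24b",
                      temperature: float = 0.2,
                      num_ctx: int = 65_536) -> dict:
    """Build a request body for Ollama's /api/chat endpoint."""
    # Structured prompt: the task first, then each file under its path.
    sections = [f"Task: {task}"]
    for path, source in files.items():
        sections.append(f"File: {path}\n{source}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": "\n\n".join(sections)}],
        "stream": False,
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


body = build_ollama_chat(
    "Rename getUser to fetchUser everywhere",
    {"src/auth.ts": "export function getUser() {}"},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as JSON in a POST to http://localhost:11434/api/chat with any HTTP client. Setting num_ctx explicitly matters because the server's default context length may be far smaller than the model's maximum.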

FAQ

Is Devstral Small 2 free?

Yes, for local use. Devstral Small 2 is released under a Modified MIT license that allows commercial use. You can download and run it via Ollama, vLLM, or any compatible inference engine at no cost. If you prefer not to run it locally, the Mistral API charges $0.10/$0.30 per million input/output tokens, one of the cheapest coding model APIs available.

Can I run it locally?

Yes, and it’s specifically designed for local deployment. You need approximately 14GB of VRAM for Q4 quantization, which fits on an RTX 4090, RTX 3090, or any Mac with 32GB+ unified memory. Install Ollama, run ollama pull devstral-small:24b, and you’re ready. For higher quality output, use Q8 quantization with ~26GB VRAM.

How does it compare to Codestral?

They serve different purposes. Devstral Small 2 is an agentic coding model: it excels at multi-step tasks like refactoring across files, fixing bugs autonomously, and following complex instructions. Codestral is a code completion model with native Fill-in-the-Middle (FIM) support, optimized for fast, accurate autocomplete in your IDE. The ideal setup is running both: Devstral Small 2 for your coding agent and Codestral for tab completions.

What’s the context window?

256K tokens, one of the largest context windows available in any local model. This means Devstral Small 2 can hold roughly 500-700 small source files, at a few hundred tokens each, in its context simultaneously. For comparison, most competing models in the 24B range offer 128K or less. The large context is critical for agentic coding, where the model needs to understand relationships across an entire codebase to make correct edits.
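A quick way to check what actually fits is to budget tokens before sending files. The sketch below uses a rough heuristic of 4 characters per token, which is an assumption rather than the model's real tokenizer, and treats "256K" as 256,000 tokens. The function names and the output reserve are illustrative.

```python
CONTEXT_TOKENS = 256_000   # "256K" taken as a round number
CHARS_PER_TOKEN = 4        # assumed heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def files_that_fit(files: dict, reserve_for_output: int = 8_192) -> list:
    """Greedily pick files, in order, that fit in the remaining budget."""
    budget = CONTEXT_TOKENS - reserve_for_output
    chosen = []
    for path, source in files.items():
        cost = estimate_tokens(source)
        if cost > budget:
            continue  # too large for what is left; keep trying later files
        chosen.append(path)
        budget -= cost
    return chosen


# Average tokens per file implied by holding 500 or 700 files at once.
for n_files in (500, 700):
    print(f"{n_files} files: about {CONTEXT_TOKENS // n_files} tokens each")
```

By this arithmetic, 500 to 700 files in one context means an average of roughly 365 to 512 tokens per file, so larger codebases still need some selection of which files to include.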
