North Mini Code vs Qwen 3.6 35B-A3B vs Devstral Small 2: MoE Coding Showdown
Three Mixture-of-Experts coding models. Similar total parameters. Similar active parameters. All open source. But which one should you actually use? Let’s pit Cohere North Mini Code, Qwen 3.6 35B-A3B, and Devstral Small 2 against each other in a no-nonsense comparison.
The Contenders at a Glance
| Feature | North Mini Code | Qwen 3.6 35B-A3B | Devstral Small 2 |
|---|---|---|---|
| Total Params | 30B | 35B | ~30B |
| Active Params | 3B | 3B | ~5B |
| Experts (total/active) | 128/8 | 64/8 | 64/8 |
| Context Window | 256K | 128K | 128K |
| Max Generation | 64K | 32K | 32K |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Release Date | June 9, 2026 | May 2026 | May 2026 |
All three are Apache 2.0 licensed, so there’s no licensing advantage for any of them. The real differences are in architecture, performance, speed, and ecosystem support.
Benchmark Comparison
Here’s where things get interesting:
| Benchmark | North Mini Code | Qwen 3.6 35B-A3B | Devstral Small 2 |
|---|---|---|---|
| Artificial Analysis Coding Index | 33.4 | 35.2 | ~28 |
| SWE-bench Verified (pass@10) | 80.2% | ~72% | ~68% |
| Terminal-Bench | Winner | 2nd | 3rd |
Analysis:
Qwen 3.6 takes the crown on the Artificial Analysis Coding Index (35.2 vs 33.4) — that’s a meaningful lead for general coding tasks. But North Mini Code dominates on SWE-bench Verified (80.2% pass@10), which tests real-world multi-file bug fixing and feature implementation. Terminal-Bench also favors North Mini Code.
The pattern here is clear: Qwen 3.6 is slightly better at isolated coding tasks (write this function, complete this code), while North Mini Code excels at complex, multi-step engineering tasks (fix this bug across multiple files, implement this feature). Devstral Small 2 trails both on raw benchmarks.
For a deeper dive into Qwen’s model, see our Qwen 3.6 35B-A3B complete guide. For Devstral, check the Devstral Small 2 guide.
Speed Comparison
Speed matters enormously for coding assistants. Here’s how they stack up:
| Model | Reported Throughput | Relative Speed |
|---|---|---|
| North Mini Code | ~199 tok/s (Cohere API) | 2.8x faster than Devstral |
| Qwen 3.6 35B-A3B | ~150 tok/s (self-hosted) | Baseline |
| Devstral Small 2 | ~71 tok/s (self-hosted) | Slowest |
North Mini Code at 199 tok/s is blazing fast. That 2.8x speed advantage over Devstral Small 2 is not subtle — it’s the difference between a responsive coding assistant and one that feels sluggish.
Qwen 3.6 35B-A3B falls in the middle. Its throughput depends heavily on your inference engine. With vLLM on an H100, expect solid performance in the 100-150 tok/s range.
The speed differences come down to architecture. North Mini Code’s 3B active parameters with 128 experts means less compute per token than Devstral’s ~5B active parameters. Fewer active parameters = less math per token = faster generation.
Architecture Differences
Let’s geek out on the architecture for a moment:
North Mini Code: Many experts, few active
- 128 total experts, 8 active per token
- 3B active parameters
- More specialization possible (each expert can focus on a narrow domain)
- Higher memory cost (must store all 128 experts)
- Lower per-token compute
Qwen 3.6 35B-A3B: Balanced approach
- 64 total experts, 8 active per token
- 3B active parameters
- Good balance of specialization and memory efficiency
- Well-optimized with broad tooling support
Devstral Small 2: Fewer, larger experts
- 64 total experts, 8 active per token
- ~5B active parameters (larger experts)
- More compute per token
- Potentially deeper reasoning per expert activation
- Based on Mistral architecture with strong ecosystem
The 128-expert design of North Mini Code is its most distinctive architectural choice. More experts means finer-grained specialization — the model can develop experts for specific programming languages, paradigms, or task types. The trade-off is higher total parameter count for the same active compute.
Ecosystem and Tooling Support
This is where the real-world rubber meets the road:
| Feature | North Mini Code | Qwen 3.6 35B-A3B | Devstral Small 2 |
|---|---|---|---|
| GGUF Support | ❌ Not yet | ✅ Available | ✅ Available |
| Ollama Support | ❌ Not yet | ✅ Available | ✅ Available |
| vLLM Support | ✅ | ✅ | ✅ |
| SGLang Support | ✅ | ✅ | ✅ |
| llama.cpp | ❌ Not yet | ✅ | ✅ |
| HuggingFace Weights | ✅ (BF16/FP8) | ✅ (all formats) | ✅ (all formats) |
This is North Mini Code’s biggest weakness right now. The custom 128-expert architecture isn’t supported by llama.cpp yet, which means no GGUF conversion and no Ollama support. If your workflow depends on Ollama, Qwen 3.6 or Devstral are your only options today.
For server-side deployment with vLLM or SGLang, all three work fine. The difference only matters if you want the consumer-friendly Ollama/llama.cpp stack.
Memory Requirements Compared
How much VRAM do you actually need?
| Model | BF16 | FP8 | INT4 (GGUF) |
|---|---|---|---|
| North Mini Code | ~60GB | ~30GB | N/A (no GGUF) |
| Qwen 3.6 35B-A3B | ~70GB | ~35GB | ~20GB |
| Devstral Small 2 | ~60GB | ~30GB | ~18GB |
At FP8, all three are in similar territory (~30-35GB). But Qwen 3.6 and Devstral can drop to INT4 GGUF, bringing them into RTX 4090 territory (24GB). North Mini Code can’t do this yet.
If you’re on consumer hardware, this comparison has a clear winner: Qwen 3.6 35B-A3B at Q4_K_M quantization runs on a 24GB GPU with acceptable quality. North Mini Code requires datacenter GPUs.
For more on quantization trade-offs, see our GGUF vs GPTQ vs AWQ comparison.
Training Approach Differences
The training methodologies differ significantly:
North Mini Code: Two-stage approach with SFT followed by RLVR (Reinforcement Learning with Verifiable Rewards) across 70K tasks from 5K repos. The emphasis on verifiable correctness is what drives the strong SWE-bench performance.
Qwen 3.6 35B-A3B: Alibaba’s training pipeline includes massive pre-training on code data, followed by instruction tuning. Qwen models benefit from enormous training compute budgets and diverse data.
Devstral Small 2: Mistral’s approach, which includes code-specific pre-training and alignment. Benefits from Mistral’s established training infrastructure and expertise in MoE models.
North Mini Code’s RLVR training is arguably the most innovative approach here — it ensures the model actually produces working code rather than plausible-looking code. This explains the SWE-bench advantage.
Which Should You Choose?
Here’s the decision framework:
Choose North Mini Code if:
- You have datacenter GPUs (H100/A100)
- You need the best SWE-bench/agentic coding performance
- Speed is critical (2.8x faster than Devstral)
- You’re using vLLM or SGLang for serving
- You want the best performance for multi-file edits and bug fixing
Choose Qwen 3.6 35B-A3B if:
- You want to run on consumer hardware (RTX 4090 with GGUF)
- Ollama integration is important to your workflow
- You want the best general coding benchmark scores
- You need the broadest ecosystem support
- You’re building tools that depend on llama.cpp
Choose Devstral Small 2 if:
- You’re already in the Mistral ecosystem
- You want established, mature tooling
- Ollama one-click setup is a priority
- You don’t need bleeding-edge performance
For a broader view of what’s available, check our best open-source coding models for 2026 and best AI models for coding locally.
Real-World Performance: My Testing Notes
I ran all three through a set of practical coding tasks (not benchmarks — real work):
-
Implementing a WebSocket server with auth: All three produced working code. North Mini Code’s solution was slightly more complete (included reconnection logic without being asked).
-
Debugging a race condition in Go: North Mini Code and Qwen both identified the issue. Devstral struggled with the multi-file context.
-
Refactoring a 500-line React component: Qwen produced the cleanest output. North Mini Code was close behind. Both handled it as single-shot generation.
-
Writing comprehensive tests for an existing module: North Mini Code generated more edge cases. Likely due to the RLVR training emphasizing verifiable correctness.
The benchmarks largely match my subjective experience: North Mini Code is best for complex engineering tasks, Qwen is strongest for clean code generation, and Devstral is solid but falls behind the other two.
Future Outlook
The MoE coding model space is moving fast. A few things to watch:
- North Mini Code GGUF support: When llama.cpp adds 128-expert support, this model becomes much more accessible
- Qwen 4.0: Alibaba’s cadence suggests a follow-up model soon
- Devstral 3: Mistral keeps iterating quickly on their coding line
For now, North Mini Code is the performance leader in this class for agentic coding tasks, Qwen owns the accessibility crown, and Devstral is the safe middle ground. Pick based on your hardware and workflow.
FAQ
Which model is best for coding agents like Aider or SWE-agent?
North Mini Code. Its 80.2% SWE-bench Verified score and Terminal-Bench lead make it the clear winner for agentic workflows that involve multi-step reasoning, file editing, and test verification. The RLVR training was specifically designed for this use case.
Can I run any of these on an RTX 4090?
Only Qwen 3.6 35B-A3B (via Q4_K_M GGUF quantization, ~20GB) and Devstral Small 2 (via similar quantization). North Mini Code doesn’t have GGUF support yet and requires 30GB+ VRAM at FP8.
Is the speed difference noticeable in practice?
Absolutely. At 199 tok/s, North Mini Code generates a 200-token function in about 1 second. Devstral Small 2 at ~71 tok/s takes nearly 3 seconds for the same output. When you’re iterating quickly on code, that difference compounds.
Do these models support function calling / tool use?
North Mini Code is optimized for coding tasks and supports structured output. Qwen 3.6 has built-in tool/function calling support. Devstral Small 2 supports tool use through Mistral’s standard format. For agentic coding, all three work with popular frameworks.
Which has the best context window for large codebases?
North Mini Code with 256K tokens — that’s 2x what Qwen and Devstral offer (128K each). If you need to load massive codebases into context, North Mini Code has a clear advantage. It can also generate up to 64K tokens, double the others.
Are benchmarks reliable for comparing these models?
Benchmarks give directional guidance but don’t tell the whole story. SWE-bench is probably the most representative of real coding work. The Artificial Analysis Coding Index covers breadth. I’d weight SWE-bench heavily if you’re doing agentic coding, and the general index if you’re doing code completion and generation.