Jun 10, 2026 · 8 min read

North Mini Code vs Qwen 3.6 35B-A3B vs Devstral Small 2: MoE Coding Showdown

Three Mixture-of-Experts coding models. Similar total parameters. Similar active parameters. All open source. But which one should you actually use? Let’s pit Cohere North Mini Code, Qwen 3.6 35B-A3B, and Devstral Small 2 against each other in a no-nonsense comparison.

The Contenders at a Glance

Feature	North Mini Code	Qwen 3.6 35B-A3B	Devstral Small 2
Total Params	30B	35B	~30B
Active Params	3B	3B	~5B
Experts (total/active)	128/8	64/8	64/8
Context Window	256K	128K	128K
Max Generation	64K	32K	32K
License	Apache 2.0	Apache 2.0	Apache 2.0
Release Date	June 9, 2026	May 2026	May 2026

All three are Apache 2.0 licensed, so there’s no licensing advantage for any of them. The real differences are in architecture, performance, speed, and ecosystem support.

Benchmark Comparison

Here’s where things get interesting:

Benchmark	North Mini Code	Qwen 3.6 35B-A3B	Devstral Small 2
Artificial Analysis Coding Index	33.4	35.2	~28
SWE-bench Verified (pass@10)	80.2%	~72%	~68%
Terminal-Bench	Winner	2nd	3rd

Analysis:

Qwen 3.6 takes the crown on the Artificial Analysis Coding Index (35.2 vs 33.4) — that’s a meaningful lead for general coding tasks. But North Mini Code dominates on SWE-bench Verified (80.2% pass@10), which tests real-world multi-file bug fixing and feature implementation. Terminal-Bench also favors North Mini Code.

The pattern here is clear: Qwen 3.6 is slightly better at isolated coding tasks (write this function, complete this code), while North Mini Code excels at complex, multi-step engineering tasks (fix this bug across multiple files, implement this feature). Devstral Small 2 trails both on raw benchmarks.

For a deeper dive into Qwen’s model, see our Qwen 3.6 35B-A3B complete guide. For Devstral, check the Devstral Small 2 guide.

Speed Comparison

Speed matters enormously for coding assistants. Here’s how they stack up:

Model	Reported Throughput	Relative Speed
North Mini Code	~199 tok/s (Cohere API)	2.8x faster than Devstral
Qwen 3.6 35B-A3B	~150 tok/s (self-hosted)	Baseline
Devstral Small 2	~71 tok/s (self-hosted)	Slowest

North Mini Code at 199 tok/s is blazing fast. That 2.8x speed advantage over Devstral Small 2 is not subtle — it’s the difference between a responsive coding assistant and one that feels sluggish.

Qwen 3.6 35B-A3B falls in the middle. Its throughput depends heavily on your inference engine. With vLLM on an H100, expect solid performance in the 100-150 tok/s range.

The speed differences come down to architecture. North Mini Code’s 3B active parameters with 128 experts means less compute per token than Devstral’s ~5B active parameters. Fewer active parameters = less math per token = faster generation.

Architecture Differences

Let’s geek out on the architecture for a moment:

North Mini Code: Many experts, few active

128 total experts, 8 active per token
3B active parameters
More specialization possible (each expert can focus on a narrow domain)
Higher memory cost (must store all 128 experts)
Lower per-token compute

Qwen 3.6 35B-A3B: Balanced approach

64 total experts, 8 active per token
3B active parameters
Good balance of specialization and memory efficiency
Well-optimized with broad tooling support

Devstral Small 2: Fewer, larger experts

64 total experts, 8 active per token
~5B active parameters (larger experts)
More compute per token
Potentially deeper reasoning per expert activation
Based on Mistral architecture with strong ecosystem

The 128-expert design of North Mini Code is its most distinctive architectural choice. More experts means finer-grained specialization — the model can develop experts for specific programming languages, paradigms, or task types. The trade-off is higher total parameter count for the same active compute.

Ecosystem and Tooling Support

This is where the real-world rubber meets the road:

Feature	North Mini Code	Qwen 3.6 35B-A3B	Devstral Small 2
GGUF Support	❌ Not yet	✅ Available	✅ Available
Ollama Support	❌ Not yet	✅ Available	✅ Available
vLLM Support	✅	✅	✅
SGLang Support	✅	✅	✅
llama.cpp	❌ Not yet	✅	✅
HuggingFace Weights	✅ (BF16/FP8)	✅ (all formats)	✅ (all formats)

This is North Mini Code’s biggest weakness right now. The custom 128-expert architecture isn’t supported by llama.cpp yet, which means no GGUF conversion and no Ollama support. If your workflow depends on Ollama, Qwen 3.6 or Devstral are your only options today.

For server-side deployment with vLLM or SGLang, all three work fine. The difference only matters if you want the consumer-friendly Ollama/llama.cpp stack.

Memory Requirements Compared

How much VRAM do you actually need?

Model	BF16	FP8	INT4 (GGUF)
North Mini Code	~60GB	~30GB	N/A (no GGUF)
Qwen 3.6 35B-A3B	~70GB	~35GB	~20GB
Devstral Small 2	~60GB	~30GB	~18GB

At FP8, all three are in similar territory (~30-35GB). But Qwen 3.6 and Devstral can drop to INT4 GGUF, bringing them into RTX 4090 territory (24GB). North Mini Code can’t do this yet.

If you’re on consumer hardware, this comparison has a clear winner: Qwen 3.6 35B-A3B at Q4_K_M quantization runs on a 24GB GPU with acceptable quality. North Mini Code requires datacenter GPUs.

For more on quantization trade-offs, see our GGUF vs GPTQ vs AWQ comparison.

Training Approach Differences

The training methodologies differ significantly:

North Mini Code: Two-stage approach with SFT followed by RLVR (Reinforcement Learning with Verifiable Rewards) across 70K tasks from 5K repos. The emphasis on verifiable correctness is what drives the strong SWE-bench performance.

Qwen 3.6 35B-A3B: Alibaba’s training pipeline includes massive pre-training on code data, followed by instruction tuning. Qwen models benefit from enormous training compute budgets and diverse data.

Devstral Small 2: Mistral’s approach, which includes code-specific pre-training and alignment. Benefits from Mistral’s established training infrastructure and expertise in MoE models.

North Mini Code’s RLVR training is arguably the most innovative approach here — it ensures the model actually produces working code rather than plausible-looking code. This explains the SWE-bench advantage.

Which Should You Choose?

Here’s the decision framework:

Choose North Mini Code if:

You have datacenter GPUs (H100/A100)
You need the best SWE-bench/agentic coding performance
Speed is critical (2.8x faster than Devstral)
You’re using vLLM or SGLang for serving
You want the best performance for multi-file edits and bug fixing

Choose Qwen 3.6 35B-A3B if:

You want to run on consumer hardware (RTX 4090 with GGUF)
Ollama integration is important to your workflow
You want the best general coding benchmark scores
You need the broadest ecosystem support
You’re building tools that depend on llama.cpp

Choose Devstral Small 2 if:

You’re already in the Mistral ecosystem
You want established, mature tooling
Ollama one-click setup is a priority
You don’t need bleeding-edge performance

For a broader view of what’s available, check our best open-source coding models for 2026 and best AI models for coding locally.

Real-World Performance: My Testing Notes

I ran all three through a set of practical coding tasks (not benchmarks — real work):

Implementing a WebSocket server with auth: All three produced working code. North Mini Code’s solution was slightly more complete (included reconnection logic without being asked).
Debugging a race condition in Go: North Mini Code and Qwen both identified the issue. Devstral struggled with the multi-file context.
Refactoring a 500-line React component: Qwen produced the cleanest output. North Mini Code was close behind. Both handled it as single-shot generation.
Writing comprehensive tests for an existing module: North Mini Code generated more edge cases. Likely due to the RLVR training emphasizing verifiable correctness.

The benchmarks largely match my subjective experience: North Mini Code is best for complex engineering tasks, Qwen is strongest for clean code generation, and Devstral is solid but falls behind the other two.

Future Outlook

The MoE coding model space is moving fast. A few things to watch:

North Mini Code GGUF support: When llama.cpp adds 128-expert support, this model becomes much more accessible
Qwen 4.0: Alibaba’s cadence suggests a follow-up model soon
Devstral 3: Mistral keeps iterating quickly on their coding line

For now, North Mini Code is the performance leader in this class for agentic coding tasks, Qwen owns the accessibility crown, and Devstral is the safe middle ground. Pick based on your hardware and workflow.

FAQ

Which model is best for coding agents like Aider or SWE-agent?

North Mini Code. Its 80.2% SWE-bench Verified score and Terminal-Bench lead make it the clear winner for agentic workflows that involve multi-step reasoning, file editing, and test verification. The RLVR training was specifically designed for this use case.

Can I run any of these on an RTX 4090?

Only Qwen 3.6 35B-A3B (via Q4_K_M GGUF quantization, ~20GB) and Devstral Small 2 (via similar quantization). North Mini Code doesn’t have GGUF support yet and requires 30GB+ VRAM at FP8.

Is the speed difference noticeable in practice?

Absolutely. At 199 tok/s, North Mini Code generates a 200-token function in about 1 second. Devstral Small 2 at ~71 tok/s takes nearly 3 seconds for the same output. When you’re iterating quickly on code, that difference compounds.

Do these models support function calling / tool use?

North Mini Code is optimized for coding tasks and supports structured output. Qwen 3.6 has built-in tool/function calling support. Devstral Small 2 supports tool use through Mistral’s standard format. For agentic coding, all three work with popular frameworks.

Which has the best context window for large codebases?

North Mini Code with 256K tokens — that’s 2x what Qwen and Devstral offer (128K each). If you need to load massive codebases into context, North Mini Code has a clear advantage. It can also generate up to 64K tokens, double the others.

Are benchmarks reliable for comparing these models?

Benchmarks give directional guidance but don’t tell the whole story. SWE-bench is probably the most representative of real coding work. The Artificial Analysis Coding Index covers breadth. I’d weight SWE-bench heavily if you’re doing agentic coding, and the general index if you’re doing code completion and generation.