📝 Tutorials
· 7 min read

Cohere North Mini Code Complete Guide: 30B MoE for Local Coding (2026)


Cohere just dropped a bomb on the open-source coding model landscape. North Mini Code 1.0, released on June 9, 2026, is a 30B parameter Mixture-of-Experts model that only activates 3B parameters per token — and it’s beating models with 4x its active compute. Let’s break down everything you need to know.

What Is Cohere North Mini Code?

North Mini Code is Cohere’s first fully open-source coding model, released under the Apache 2.0 license. That means no strings attached — use it commercially, modify it, deploy it wherever you want. No usage restrictions, no phone-home requirements.

The headline numbers are impressive: 30B total parameters, but only 3B active per forward pass thanks to its Mixture-of-Experts architecture. This gives you the knowledge capacity of a much larger model with the inference cost of a small one.

It supports a massive 256K token context window and can generate up to 64K tokens in a single response. That’s enough to process entire codebases and generate complete implementations in one shot.

Architecture Deep Dive

The MoE architecture is what makes North Mini Code special. Here’s how it works:

  • Total parameters: 30 billion
  • Active parameters per token: 3 billion
  • Expert count: 128 experts
  • Active experts per token: 8
  • Context window: 256K tokens
  • Max generation length: 64K tokens

For each token that flows through the model, a router network selects 8 of the 128 available experts. This means different parts of the network specialize in different types of code — some experts might activate for Python, others for TypeScript, others for low-level systems code. The result is a model that has the breadth of knowledge you’d expect from a 30B model but runs at the speed of a 3B model.

If you’re familiar with how Qwen 3.6 35B-A3B works, the concept is similar — both use MoE to pack more capability into fewer active parameters. But North Mini Code pushes the expert count significantly higher (128 vs Qwen’s architecture).

Training Methodology

Cohere used a two-stage training approach:

  1. Supervised Fine-Tuning (SFT): Traditional instruction tuning on high-quality coding data.
  2. Reinforcement Learning with Verifiable Rewards (RLVR): The model was trained across 70,000 verifiable tasks drawn from approximately 5,000 real repositories.

That RLVR stage is crucial. Instead of relying on human preferences (which are noisy and expensive), Cohere used tasks where correctness can be objectively verified — tests pass, code compiles, outputs match. This is the same training philosophy that made DeepSeek’s models so strong at coding.

Benchmark Performance

Let’s talk numbers. Here’s how North Mini Code stacks up:

BenchmarkNorth Mini CodeQwen 3.6 35B-A3BGLM-4.7-Flash
Artificial Analysis Coding Index33.435.225.9
SWE-bench Verified (pass@10)80.2%

The Artificial Analysis Coding Index of 33.4 puts it just below Qwen 3.6 35B-A3B (35.2) but dramatically above GLM-4.7-Flash (25.9). On SWE-bench Verified, it achieves an 80.2% pass@10 rate, which is remarkable for a model in this size class.

On Terminal-Bench, North Mini Code outperforms both Devstral Small 2 and Gemma 4. But the really stunning comparison is against much larger models:

  • Nemotron 3 Super (120B total, 12B active): North Mini Code wins
  • Mistral Small 4 (119B total, 6B active): North Mini Code wins
  • Devstral 2 (123B total): North Mini Code wins

A model with 3B active parameters outperforming models with 6-12B active parameters. That’s the power of a well-trained MoE architecture combined with RLVR.

Speed and Inference Performance

Speed is where North Mini Code really shines for practical use:

  • ~199 tokens/second on the Cohere API
  • 2.8x faster output than Devstral Small 2

When you’re using a coding assistant, latency matters. Waiting 30 seconds for a response breaks your flow. At nearly 200 tokens per second, North Mini Code generates a typical function implementation in under a second. That’s fast enough for real-time coding assistance.

For comparison, check our benchmark of inference engines to see how different serving solutions affect throughput.

Hardware Requirements

Here’s what you need to run North Mini Code locally:

Full Precision (BF16):

  • ~60GB VRAM (30B parameters × 2 bytes)
  • 1x H100 80GB or 2x A100 40GB

FP8 Quantization:

  • ~30GB VRAM
  • 1x H100 80GB (comfortable) or 1x A100 80GB

Lower Quantizations (when available):

  • INT4/INT8 would bring this into 16-24GB territory
  • Note: GGUF support is TBD due to custom MoE architecture

The good news is that because only 3B parameters are active per token, the actual compute is modest. The challenge is memory — you still need to load all 30B parameters (and the 128 expert weights) into VRAM, even though only a fraction activates at inference time.

For a deeper understanding of memory requirements, check our guide on how much VRAM AI models need.

Where to Get North Mini Code

The model is available through multiple channels:

  1. HuggingFace: Both BF16 and FP8 variants available for download
  2. Cohere API: Managed inference with ~199 tok/s throughput
  3. OpenRouter: Multi-provider access with pay-per-token pricing

For local deployment, grab the weights from HuggingFace and serve with vLLM or SGLang. For details on local setup, see our guide to running North Mini Code locally.

Comparison to the Competition

The sub-5B active parameter coding model space is getting crowded. Here’s how North Mini Code compares:

vs Qwen 3.6 35B-A3B: Qwen 3.6 scores slightly higher on the Artificial Analysis Coding Index (35.2 vs 33.4), but North Mini Code wins on SWE-bench and Terminal-Bench. Both are Apache 2.0 licensed. Qwen has better ecosystem support (GGUF available). Read our full comparison.

vs Devstral Small 2: North Mini Code is 2.8x faster and outperforms on Terminal-Bench. Devstral has better Ollama integration. See our Devstral Small 2 guide for details.

vs DeepSeek V4 Flash (API): Different category — DeepSeek V4 Flash is an API-only model that’s extremely cheap to use. If you want self-hosted and free, North Mini Code wins. If you want the cheapest per-token cost and don’t care about privacy, DeepSeek V4 Flash is worth considering.

For a broader overview, see our best open-source coding models for 2026.

When to Use North Mini Code

Great for:

  • Agentic coding tasks (SWE-bench style multi-file edits)
  • Long-context code understanding (256K context)
  • Self-hosted coding assistants where privacy matters
  • Teams that want Apache 2.0 with no restrictions
  • Situations where you need fast inference with good quality

Less ideal for:

  • Consumer hardware without high-end GPUs (still needs ~30GB+ VRAM)
  • Ollama/GGUF workflows (not yet supported)
  • General chat and non-coding tasks (it’s a specialized coding model)

Practical Tips for Getting the Best Results

  1. Use long context wisely: Feed it entire files and project context. The 256K window is there for a reason.
  2. Leverage the 64K generation: Don’t break large generation tasks into chunks. Let it generate complete implementations.
  3. Pair with agentic workflows: The SWE-bench scores suggest it excels when given tools and multi-step tasks.
  4. FP8 is your friend: The official FP8 weights maintain quality while halving memory requirements.

FAQ

Is North Mini Code really free to use commercially?

Yes. It’s released under Apache 2.0, which is one of the most permissive open-source licenses available. You can use it commercially, modify it, redistribute it, and deploy it in products without any restrictions or royalties.

How does North Mini Code compare to GPT-4 or Claude for coding?

North Mini Code is in a different weight class — it’s optimized for efficiency and local deployment. For pure coding quality, frontier models like GPT-4 and Claude still lead on most benchmarks. But North Mini Code offers something they can’t: full local deployment with no API costs and complete data privacy. For many coding tasks, it’s more than good enough.

Can I run North Mini Code on a gaming GPU like the RTX 4090?

Not at full precision — a 4090 has 24GB VRAM and the model needs ~30GB minimum at FP8. You might be able to run aggressive INT4 quantizations when they become available, but quality may suffer. For consumer hardware, Qwen 3.6 35B-A3B with GGUF quantization is currently a better fit.

Why can’t I use North Mini Code with Ollama?

North Mini Code uses a custom MoE architecture with 128 experts that doesn’t yet have GGUF format support. The llama.cpp and Ollama teams need to implement support for this specific architecture. This is expected to come, but there’s no timeline yet. For now, use vLLM or SGLang for local serving.

What’s the difference between 30B total and 3B active parameters?

The model stores 30B parameters across 128 expert networks. For each token, a router selects only 8 experts (3B parameters) to process that token. This means the model has the knowledge capacity of 30B parameters but the inference speed of a 3B model. You still need enough VRAM to store all 30B parameters, but computation per token is minimal.

Is North Mini Code better than Devstral Small 2?

On benchmarks, yes — North Mini Code outperforms Devstral Small 2 on Terminal-Bench and is 2.8x faster. However, Devstral has better tooling support (GGUF, Ollama) and a more mature ecosystem. If you need ease of setup, Devstral wins. If you need raw performance, North Mini Code wins.