Jun 10, 2026 · 7 min read

How to Run Cohere North Mini Code Locally (2026 Guide)

Want to run Cohere’s new North Mini Code model on your own hardware? This guide covers everything from downloading the weights to serving the model with vLLM and SGLang. Fair warning: this isn’t an Ollama one-liner (yet). But if you’ve got the GPU, the results are worth it.

Prerequisites

Before we start, let’s make sure you have what you need:

GPU: Minimum 1x H100 80GB (FP8) or 2x A100 40GB (BF16)
System RAM: 64GB+ recommended
Storage: ~60GB for BF16 weights, ~30GB for FP8
Python: 3.10+
CUDA: 12.1+

If you’re not sure whether your hardware is sufficient, check our guide on how much VRAM AI models need.
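A quick pre-flight check can save a failed download. The sketch below uses only the standard library and the approximate storage figures from the list above. The helper names are made up for illustration, and it does not check GPU or CUDA.

```python
import shutil
import sys

# Approximate weight sizes in GB, taken from the Storage line above.
WEIGHTS_GB = {"bf16": 60, "fp8": 30}


def python_ok(version=sys.version_info):
    """True if the interpreter meets the 3.10+ requirement."""
    return tuple(version[:2]) >= (3, 10)


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def enough_disk(fmt, free_gb):
    """True if `free_gb` covers the approximate weight size for `fmt`."""
    return free_gb >= WEIGHTS_GB[fmt]


if __name__ == "__main__":
    free = free_disk_gb(".")
    print(f"Python 3.10+: {python_ok()}")
    for fmt, need in WEIGHTS_GB.items():
        print(f"{fmt}: need ~{need}GB, have {free:.0f}GB -> {enough_disk(fmt, free)}")
```

Run it from the directory where you plan to store the weights, since free space is measured on that filesystem.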

Step 1: Download the Model from HuggingFace

North Mini Code is available in two formats on HuggingFace:

BF16 (unquantized):

pip install huggingface_hub
huggingface-cli download CohereForAI/North-Mini-Code-1.0 --local-dir ./north-mini-code-bf16

FP8 (recommended for single GPU):

huggingface-cli download CohereForAI/North-Mini-Code-1.0-FP8 --local-dir ./north-mini-code-fp8

The FP8 variant is recommended for most users. It halves the memory requirement with negligible quality loss and is the format Cohere optimized for deployment.

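Pro tip: If the download gets interrupted, add the --resume-download flag to pick up where you left off. Recent huggingface_hub releases resume partial downloads automatically and treat the flag as deprecated, so you may not need it.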

huggingface-cli download CohereForAI/North-Mini-Code-1.0-FP8 \
  --local-dir ./north-mini-code-fp8 \
  --resume-download

Step 2: Serving with vLLM

vLLM is currently the best option for serving North Mini Code locally. It has native support for MoE architectures and handles the 128-expert routing efficiently.

Install vLLM:

pip install "vllm>=0.8.0"

Launch the server (FP8):

vllm serve ./north-mini-code-fp8 \
  --served-model-name north-mini-code \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --trust-remote-code \
  --dtype auto \
  --port 8000

Launch the server (BF16, multi-GPU):

vllm serve ./north-mini-code-bf16 \
  --served-model-name north-mini-code \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000

Key flags explained:

--served-model-name: The name clients pass as "model" in API requests. Without it, vLLM registers the model under the path given to vllm serve, and requests for north-mini-code are rejected.
--tensor-parallel-size: Number of GPUs to split the model across
--max-model-len: Maximum sequence length. The model supports up to 256K tokens, but a lower limit saves KV cache memory. 65536 is a good balance.
--trust-remote-code: Required for custom MoE architecture code

Once running, vLLM exposes an OpenAI-compatible API at http://localhost:8000/v1/. You can use it with any OpenAI SDK or tool that supports custom endpoints.
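Before wiring up any tools, it helps to confirm which model name the server actually accepts. OpenAI-compatible servers list their models at /v1/models. vLLM reports the value of --served-model-name there, or the model path if that flag was not set. Here is a minimal standard-library sketch (the function names are illustrative):

```python
import json
import urllib.request


def extract_model_ids(payload):
    """Pull the model IDs out of a parsed /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_ids(base_url="http://localhost:8000/v1", timeout=5.0):
    """Ask the running server which model names it will accept."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Call list_model_ids() once the server has finished loading. Whatever it returns is the string to use in the "model" field of your requests.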

For a detailed comparison of inference engines, see our vLLM vs Ollama vs llama.cpp vs TGI guide.

Step 3: Serving with SGLang

SGLang is another excellent option, particularly if you need advanced features like constrained decoding or RadixAttention for prompt caching.

Install SGLang:

pip install "sglang[all]>=0.4.0"

Launch the server:

python -m sglang.launch_server \
  --model-path ./north-mini-code-fp8 \
  --served-model-name north-mini-code \
  --tp 1 \
  --port 8000 \
  --trust-remote-code \
  --context-length 65536

SGLang also exposes an OpenAI-compatible endpoint. The RadixAttention feature is particularly useful for coding tasks where you’re repeatedly sending the same file context with different prompts — it caches the KV values and skips recomputation.
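To see why shared context is cheap the second time around, here is a toy model of prefix reuse. It is not SGLang's implementation: real RadixAttention stores KV tensors in a radix tree with eviction, while this sketch only counts tokens and uses a whitespace split in place of a tokenizer. The class and function names are invented for the example.

```python
class PrefixCache:
    """Toy stand-in for a prefix cache: a trie keyed by tokens."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Number of leading tokens already present in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a processed sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def tokens_to_compute(cache, prompt):
    """Return how many tokens still need prefill, then cache the prompt."""
    tokens = prompt.split()  # whitespace split stands in for a real tokenizer
    reused = cache.match_len(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = PrefixCache()
file_context = "def add(a, b): return a + b"
first = tokens_to_compute(cache, file_context + " Explain this function")
second = tokens_to_compute(cache, file_context + " Write tests for it")
print(first, second)  # 10 4
```

The second request only pays for the four new tokens because the seven-token file context was already cached by the first.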

Step 4: Testing Your Deployment

Once your server is running, test it with a simple curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "north-mini-code",
    "messages": [
      {"role": "user", "content": "Write a Python function that implements binary search on a sorted list. Include type hints and docstring."}
    ],
    "max_tokens": 1024,
    "temperature": 0.1
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="north-mini-code",
    messages=[
        {"role": "user", "content": "Implement a thread-safe LRU cache in Python"}
    ],
    max_tokens=2048,
    temperature=0.1
)

print(response.choices[0].message.content)

Memory Requirements Breakdown

Let’s get specific about memory:

Format	Model Size on Disk	VRAM Required	GPU Configuration
BF16	~60GB	~65GB	2x A100 40GB or 1x H100 80GB
FP8	~30GB	~35GB	1x H100 80GB or 1x A100 80GB
INT4 (TBD)	~15GB	~20GB	Potentially 1x RTX 4090 24GB

The VRAM numbers include overhead for KV cache and activations. If you increase max-model-len, you’ll need proportionally more VRAM for the KV cache.
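The disk sizes in the table follow from simple arithmetic: parameter count times bytes per parameter. As a rough sketch, taking the 30B parameter figure mentioned in the FAQ and ignoring file-format overhead:

```python
def weight_size_gb(params_billion, bits_per_param):
    """Approximate weight footprint in decimal GB, ignoring file overhead."""
    return params_billion * bits_per_param / 8


for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_size_gb(30, bits):.0f}GB")
# BF16: ~60GB
# FP8: ~30GB
# INT4: ~15GB
```

The VRAM column sits a few GB above these numbers because the runtime also needs room for activations and the KV cache.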

Context length vs memory trade-off:

16K context: minimal overhead
65K context: ~4-8GB additional KV cache
256K context: ~16-32GB additional KV cache (at FP8 this totals roughly 51-67GB by the figures above, so it overflows a 40GB card and leaves little headroom on a single 80GB GPU)

For most coding tasks, 65K context is more than enough. You rarely need to load 256K tokens of code into a single prompt.
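If you want to estimate the KV cache for your own context length, the usual formula is 2 (keys and values) x layers x KV heads x head dimension x bytes per value x tokens. The architecture values below are HYPOTHETICAL, picked only so the arithmetic is easy to follow. Read the real ones from the model's config.json, and note that this sketch assumes every layer uses full attention.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """KV cache size in GiB for one sequence of `tokens` tokens.

    The leading 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


# HYPOTHETICAL architecture values, not North Mini Code's real config.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (16_384, 65_536, 262_144):
    bf16 = kv_cache_gib(context, LAYERS, KV_HEADS, HEAD_DIM, 2)  # 16-bit cache
    fp8 = kv_cache_gib(context, LAYERS, KV_HEADS, HEAD_DIM, 1)   # 8-bit cache
    print(f"{context:>7} tokens: {fp8:.0f}-{bf16:.0f} GiB")
```

The cost grows linearly with context length, which is why quadrupling the context from 65K to 256K quadruples the cache.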

Quantization Options

Currently available quantization formats:

FP8 (official):

Provided by Cohere on HuggingFace
Best quality-to-size ratio
Native support in vLLM and SGLang
Recommended for production use

GPTQ/AWQ (community):

Community quantizations may appear on HuggingFace
Check TheBloke or other quantization providers
Quality depends on calibration data used
See our GGUF vs GPTQ vs AWQ comparison

GGUF (not available yet):

North Mini Code uses a custom MoE architecture with 128 experts
llama.cpp doesn’t yet support this specific architecture
GGUF conversion is not possible until upstream support is added
This means no Ollama support for now

This is an important limitation. If your workflow depends on Ollama, you’ll need to wait for llama.cpp to add support for the 128-expert architecture, or use a different model like Qwen 3.6 35B-A3B which already has full GGUF support.

Using with Coding Tools

Once you have North Mini Code running with an OpenAI-compatible API, you can connect it to most coding tools:

Continue.dev (VS Code):

{
  "models": [{
    "title": "North Mini Code",
    "provider": "openai",
    "model": "north-mini-code",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}

Aider:

aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed --model openai/north-mini-code

Cursor (custom model): Point the OpenAI-compatible endpoint in Cursor’s settings to http://localhost:8000/v1.

Performance Tuning Tips

Enable prefix caching: Both vLLM and SGLang support automatic prefix caching. This dramatically speeds up repeated prompts with shared context (like sending the same file repeatedly).
Tune batch size: If you’re the only user, set --max-num-seqs 1 in vLLM to allocate all memory to a single sequence with maximum context.
Use speculative decoding: vLLM supports speculative decoding, which can speed up generation for coding tasks, mainly when you are serving a small number of concurrent requests.
Chunk long prefills: In vLLM, consider --enable-chunked-prefill for long contexts alongside --enable-prefix-caching. Recent vLLM versions may already enable both by default.
Monitor GPU utilization: Use nvidia-smi -l 1 to watch GPU memory and utilization. You want consistent high utilization during generation.
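If you would rather log GPU stats than watch a terminal, nvidia-smi can emit CSV that is easy to parse. This is a small sketch using its --query-gpu and --format options. The function names are illustrative, and it assumes every field is numeric (some GPUs report [N/A] for certain fields, which would raise a ValueError here).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV rows into one dict per GPU (memory in MiB)."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(field) for field in line.split(","))
        stats.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)
```

Call read_gpu_stats() in a loop with a short sleep while a generation is running to get the same view as nvidia-smi -l 1 in a form you can log or plot.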

Cohere API Alternative

If local deployment isn’t feasible, the Cohere API offers North Mini Code with ~199 tokens/second throughput. That’s blazing fast and saves you the GPU infrastructure:

import cohere

co = cohere.ClientV2(api_key="your-key-here")

response = co.chat(
    model="north-mini-code-1.0",
    messages=[
        {"role": "user", "content": "Implement a Redis-backed rate limiter in Go"}
    ]
)
print(response.message.content[0].text)

The trade-off is obvious: API costs money per token and sends your code to Cohere’s servers. For sensitive codebases, self-hosting is the way to go. For personal projects or non-sensitive work, the API is faster to get started with.

GPU Comparison for Running North Mini Code

Not all GPUs are equal. Here’s a practical comparison for this specific model:

GPU	VRAM	Can Run FP8?	Can Run BF16?	Notes
H100 80GB	80GB	✅ Comfortable	✅ Tight	Best single-GPU option
A100 80GB	80GB	✅ Comfortable	✅ Tight	Good alternative
A100 40GB	40GB	✅ Tight	❌ Need 2x	Budget multi-GPU
RTX 4090	24GB	❌	❌	Wait for INT4
RTX 5090	32GB	❌ (just under the ~35GB needed)	❌	Might work with INT4+offload

For a broader discussion of GPU options, see our GPU vs CPU for AI inference guide.
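The table boils down to comparing total VRAM against the approximate requirement for each format. Here is a rough sketch of that check using the ~65GB and ~35GB figures from this guide. The 20GB "tight" cutoff is an assumption chosen to match the table, and summing VRAM across GPUs ignores the per-GPU overhead of tensor parallelism.

```python
# Approximate VRAM needs in GB, copied from the memory table in this guide.
VRAM_NEEDED_GB = {"bf16": 65, "fp8": 35}

# Assumed cutoff: less than this much spare VRAM counts as "tight".
TIGHT_HEADROOM_GB = 20


def fit(fmt, vram_per_gpu_gb, gpu_count=1):
    """Classify a GPU setup as 'no', 'tight', or 'comfortable' for `fmt`."""
    headroom = vram_per_gpu_gb * gpu_count - VRAM_NEEDED_GB[fmt]
    if headroom < 0:
        return "no"
    return "tight" if headroom < TIGHT_HEADROOM_GB else "comfortable"


print(fit("fp8", 80))      # comfortable
print(fit("bf16", 40, 2))  # tight
print(fit("fp8", 24))      # no
```

Treat "tight" as a warning that long contexts will not fit, not as a guarantee that the model loads.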

What About Cloud GPUs?

If you don’t own the hardware, cloud GPU providers offer H100s on demand:

RunPod: H100 from ~$3.50/hr
Lambda Labs: H100 from ~$3.00/hr
AWS (p5 instances): H100 available, higher cost but more features
Vast.ai: Community GPUs, cheapest option but less reliable

For occasional use, cloud GPUs are much cheaper than buying hardware. For regular daily use, the math starts favoring ownership.
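To put a number on that, divide the purchase price by the hourly rental rate. The $30,000 hardware price below is a HYPOTHETICAL placeholder, not a quote. The $3.00/hr rate is the Lambda Labs figure listed above, and the sketch ignores power, hosting, and resale value.

```python
def break_even_hours(hardware_cost, cloud_rate_per_hour):
    """Hours of cloud rental that cost as much as buying the hardware."""
    return hardware_cost / cloud_rate_per_hour


def break_even_months(hardware_cost, cloud_rate_per_hour, hours_per_day, days_per_month=30):
    """Months of steady use before ownership becomes the cheaper option."""
    hours = break_even_hours(hardware_cost, cloud_rate_per_hour)
    return hours / (hours_per_day * days_per_month)


PRICE = 30_000  # HYPOTHETICAL purchase price in dollars
print(break_even_hours(PRICE, 3.00))  # 10000.0
print(f"{break_even_months(PRICE, 3.00, hours_per_day=8):.1f} months at 8 hours/day")
```

Plug in your own price and daily usage. At an hour or two a day the break-even point moves out by years, which is the case where renting wins.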

FAQ

Why can’t I use Ollama with North Mini Code?

North Mini Code uses a custom 128-expert MoE architecture that isn’t yet supported by llama.cpp (which Ollama is built on). Support needs to be added upstream. Until then, use vLLM or SGLang. For Ollama-compatible alternatives in the same class, try Qwen 3.6 35B-A3B.

What’s the minimum hardware I need?

The absolute minimum is a single GPU with 35GB+ VRAM (for FP8). Practically, that means an H100 80GB or A100 80GB. Consumer GPUs like the RTX 4090 (24GB) cannot run this model at any currently available precision.

Is FP8 quality significantly worse than BF16?

No. Cohere specifically optimized the FP8 variant, and benchmarks show negligible quality difference. FP8 is the recommended format for deployment. You’re halving your memory requirement with essentially no quality loss.

How does the speed compare to running via the Cohere API?

The Cohere API achieves ~199 tok/s, which is very fast. Self-hosted performance depends heavily on your hardware and serving configuration. On a single H100 with vLLM, expect 80-150 tok/s for single requests. The API will generally be faster due to Cohere’s optimized infrastructure, but self-hosting gives you privacy and no per-token costs.
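To see where your own setup lands, you can time a single request and divide by the completion token count the server reports. This is a minimal standard-library sketch. The function names are illustrative, the model name assumes your server accepts north-mini-code, and the timing includes prompt processing, so it slightly understates pure generation speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Average generation speed for one request."""
    return completion_tokens / elapsed_seconds


def measure(base_url="http://localhost:8000/v1", model="north-mini-code"):
    """Send one chat request and return tokens/second from its usage block."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Write quicksort in Python."}],
        "max_tokens": 512,
        "temperature": 0.1,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)
```

Run measure() a few times and average the results, since the first request after startup is often slower than the rest.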

Can I fine-tune North Mini Code?

Yes — it’s Apache 2.0 licensed, so there are no restrictions on fine-tuning. However, fine-tuning a 30B MoE model requires significant compute. You’ll need multiple H100s and a framework that supports MoE fine-tuning (like Megatron-LM or specialized forks of DeepSpeed). For most use cases, prompt engineering with the base model is sufficient.

How do I choose between vLLM and SGLang?

Both work well. vLLM is more mature and has broader community support. SGLang offers RadixAttention (great for repeated context in coding workflows) and constrained generation. If you’re unsure, start with vLLM — it’s simpler to set up and has more documentation available.