Want to run Cohere’s new North Mini Code model on your own hardware? This guide covers everything from downloading the weights to serving the model with vLLM and SGLang. Fair warning: this isn’t an Ollama one-liner (yet). But if you’ve got the GPU, the results are worth it.
Prerequisites
Before we start, let’s make sure you have what you need:
- GPU: Minimum 1x H100 80GB (FP8) or 2x A100 40GB (BF16)
- System RAM: 64GB+ recommended
- Storage: ~60GB for BF16 weights, ~30GB for FP8
- Python: 3.10+
- CUDA: 12.1+
If you’re not sure whether your hardware is sufficient, check our guide on how much VRAM AI models need.
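Two of these prerequisites are easy to check from a script before starting a large download. The sketch below uses only the standard library; the size table copies the approximate figures from the list above, and the names `WEIGHTS_GB`, `python_ok`, and `disk_ok` are illustrative, not part of any tool.

```python
import shutil
import sys

# Approximate weight sizes from the prerequisites list, in GB.
WEIGHTS_GB = {"bf16": 60, "fp8": 30}


def python_ok(version=sys.version_info) -> bool:
    """True if the interpreter is Python 3.10 or newer."""
    return tuple(version[:2]) >= (3, 10)


def disk_ok(variant: str, path: str = ".") -> bool:
    """True if the filesystem holding `path` has room for the chosen weights."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= WEIGHTS_GB[variant]


# Usage: print(python_ok(), disk_ok("fp8"))
```

This does not check the GPU or CUDA version; `nvidia-smi` remains the simplest way to confirm those.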
Step 1: Download the Model from HuggingFace
North Mini Code is available in two formats on HuggingFace:
BF16 (unquantized):
pip install huggingface_hub
huggingface-cli download CohereForAI/North-Mini-Code-1.0 --local-dir ./north-mini-code-bf16
FP8 (recommended for single GPU):
huggingface-cli download CohereForAI/North-Mini-Code-1.0-FP8 --local-dir ./north-mini-code-fp8
The FP8 variant is recommended for most users. It halves the memory requirement with negligible quality loss and is the format Cohere optimized for deployment.
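Pro tip: If your connection is slow, pass the --resume-download flag so an interrupted download picks up where it left off. Recent huggingface_hub releases resume by default, so on newer versions the flag may be unnecessary or reported as deprecated.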
huggingface-cli download CohereForAI/North-Mini-Code-1.0-FP8 \
--local-dir ./north-mini-code-fp8 \
--resume-download
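If you would rather script the download, the same step can be sketched in Python with `snapshot_download` from `huggingface_hub`. The repo IDs are the ones used in the commands above; `REPOS` and `download_weights` are illustrative names, and the local directory naming simply mirrors the CLI examples.

```python
# Sketch: download the weights from Python instead of the CLI.
# REPOS and download_weights are illustrative names, not library APIs.
REPOS = {
    "bf16": "CohereForAI/North-Mini-Code-1.0",
    "fp8": "CohereForAI/North-Mini-Code-1.0-FP8",
}


def download_weights(variant: str = "fp8") -> str:
    """Download one variant into ./north-mini-code-<variant> and return the path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPOS[variant],
        local_dir=f"./north-mini-code-{variant}",
    )


# Usage: path = download_weights("fp8")
```

Calling the function again after an interruption should reuse the files already on disk rather than starting over.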
Step 2: Serving with vLLM
vLLM is currently the best option for serving North Mini Code locally. It has native support for MoE architectures and handles the 128-expert routing efficiently.
Install vLLM:
pip install "vllm>=0.8.0"
Launch the server (FP8):
vllm serve ./north-mini-code-fp8 \
--served-model-name north-mini-code \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--trust-remote-code \
--dtype auto \
--port 8000
Launch the server (BF16, multi-GPU):
vllm serve ./north-mini-code-bf16 \
--served-model-name north-mini-code \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--trust-remote-code \
--dtype bfloat16 \
--port 8000
Key flags explained:
- --served-model-name: The name clients pass in the model field. Without it, vLLM registers the model under the path given to vllm serve, and requests for north-mini-code are rejected.
- --tensor-parallel-size: Number of GPUs to split the model across.
- --max-model-len: Maximum sequence length. The model supports 256K, but setting it lower saves memory. 65536 is a good balance.
- --trust-remote-code: Required for custom MoE architecture code.
Once running, vLLM exposes an OpenAI-compatible API at http://localhost:8000/v1/. You can use it with any OpenAI SDK or tool that supports custom endpoints.
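A quick way to confirm the server is up, and to see the exact model name it expects in requests, is to query the `/v1/models` route of the OpenAI-compatible API. The following is a minimal standard-library sketch; `model_ids` and `list_served_models` are illustrative helper names, and the base URL assumes the port used in the launch commands.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query the running server and return the model names it accepts."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return model_ids(json.load(resp))


# Usage (with the server running): print(list_served_models())
```

Whatever names this returns are the values to put in the `model` field of chat requests.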
For a detailed comparison of inference engines, see our vLLM vs Ollama vs llama.cpp vs TGI guide.
Step 3: Serving with SGLang
SGLang is another excellent option, particularly if you need advanced features like constrained decoding or RadixAttention for prompt caching.
Install SGLang:
pip install "sglang[all]>=0.4.0"
Launch the server:
python -m sglang.launch_server \
--model-path ./north-mini-code-fp8 \
--tp 1 \
--port 8000 \
--trust-remote-code \
--context-length 65536
SGLang also exposes an OpenAI-compatible endpoint. The RadixAttention feature is particularly useful for coding tasks where you’re repeatedly sending the same file context with different prompts — it caches the KV values and skips recomputation.
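Prefix caching only helps when the repeated material is a literal prefix of the request, so the order of messages matters. The sketch below shows one way to arrange a request so the file context comes first and only the final message changes; the message layout and the system prompt wording are assumptions made for illustration.

```python
def build_messages(file_context: str, question: str) -> list[dict]:
    """Put the unchanging file context first so every request shares a prefix."""
    return [
        {"role": "system", "content": f"You are reviewing this file:\n\n{file_context}"},
        {"role": "user", "content": question},
    ]


def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Count the leading messages two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


source = "def add(a, b):\n    return a + b\n"
first = build_messages(source, "Add type hints.")
second = build_messages(source, "Write unit tests.")
print(shared_prefix_len(first, second))  # the system message is shared
```

Putting the question before the file, or editing the file between requests, would change the prefix and force the server to recompute it.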
Step 4: Testing Your Deployment
Once your server is running, test it with a simple curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "north-mini-code",
  "messages": [
    {"role": "user", "content": "Write a Python function that implements binary search on a sorted list. Include type hints and docstring."}
  ],
  "max_tokens": 1024,
  "temperature": 0.1
}'
Or with the OpenAI Python SDK:
from openai import OpenAI

# The SDK requires a non-empty key; the local server does not check it unless you configured one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="north-mini-code",
    messages=[
        {"role": "user", "content": "Implement a thread-safe LRU cache in Python"}
    ],
    max_tokens=2048,
    temperature=0.1,
)
print(response.choices[0].message.content)
Memory Requirements Breakdown
Let’s get specific about memory:
| Format | Model Size on Disk | VRAM Required | GPU Configuration |
|---|---|---|---|
| BF16 | ~60GB | ~65GB | 2x A100 40GB or 1x H100 80GB |
| FP8 | ~30GB | ~35GB | 1x H100 80GB or 1x A100 80GB |
| INT4 (TBD) | ~15GB | ~20GB | Potentially 1x RTX 4090 24GB |
The VRAM figures include overhead for KV cache and activations. If you raise --max-model-len, the KV cache needs proportionally more VRAM.
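The on-disk sizes in the table follow from simple arithmetic: parameter count times bytes per parameter. A small sketch, assuming roughly 30B total parameters (the figure this guide uses in the FAQ) and ignoring any layers kept at higher precision:

```python
def weight_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB: parameters x bytes per parameter."""
    return params_billion * bits_per_param / 8


for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_size_gb(30, bits):.0f} GB")
```

The VRAM column is a few GB higher than these values because the runtime also needs room for the KV cache and activations.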
Context length vs memory trade-off:
- 16K context: minimal overhead
- 65K context: ~4-8GB additional KV cache
- 256K context: ~16-32GB additional KV cache (likely needs multi-GPU even at FP8)
For most coding tasks, 65K context is more than enough. You rarely need to load 256K tokens of code into a single prompt.
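The linear growth in those ranges comes from the standard KV cache estimate: two tensors (keys and values) per layer, per KV head, per token. The sketch below applies that formula with hypothetical architecture values chosen only for illustration; they are not North Mini Code's published configuration, so read the real numbers from the model's config.json before relying on the result. Models that use sliding-window or other attention variants will deviate from this estimate.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence: keys + values, per layer, per KV head."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 2**30


# HYPOTHETICAL architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for tokens in (16_384, 65_536, 262_144):
    size = kv_cache_gib(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>7} tokens: {size:.1f} GiB")
```

With these placeholder values the estimate is 6 GiB at 65,536 tokens and 24 GiB at 262,144 tokens, which sits inside the ranges listed above. Serving several sequences at once multiplies the figure accordingly.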
Quantization Options
Currently available quantization formats:
FP8 (official):
- Provided by Cohere on HuggingFace
- Best quality-to-size ratio
- Native support in vLLM and SGLang
- Recommended for production use
GPTQ/AWQ (community):
- Community quantizations may appear on HuggingFace
- Check TheBloke or other quantization providers
- Quality depends on calibration data used
- See our GGUF vs GPTQ vs AWQ comparison
GGUF (not available yet):
- North Mini Code uses a custom MoE architecture with 128 experts
- llama.cpp doesn’t yet support this specific architecture
- GGUF conversion is not possible until upstream support is added
- This means no Ollama support for now
This is an important limitation. If your workflow depends on Ollama, you’ll need to wait for llama.cpp to add support for the 128-expert architecture, or use a different model like Qwen 3.6 35B-A3B which already has full GGUF support.
Using with Coding Tools
Once you have North Mini Code running with an OpenAI-compatible API, you can connect it to most coding tools:
Continue.dev (VS Code):
{
  "models": [{
    "title": "North Mini Code",
    "provider": "openai",
    "model": "north-mini-code",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }]
}
Aider:
aider --openai-api-base http://localhost:8000/v1 --openai-api-key not-needed --model openai/north-mini-code
Cursor (custom model):
Point the OpenAI-compatible endpoint in Cursor’s settings to http://localhost:8000/v1.
Performance Tuning Tips
- Enable prefix caching: Both vLLM and SGLang support automatic prefix caching. This dramatically speeds up repeated prompts with shared context (like sending the same file repeatedly).
- Tune batch size: If you’re the only user, set --max-num-seqs 1 in vLLM to allocate all memory to a single sequence with maximum context.
- Use speculative decoding: vLLM supports speculative decoding, which can further speed up generation for coding tasks.
- Set the vLLM flags: Pass --enable-prefix-caching, and consider --enable-chunked-prefill for long contexts.
- Monitor GPU utilization: Use nvidia-smi -l 1 to watch GPU memory and utilization. You want consistently high utilization during generation.
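For logging or alerting, the same numbers can be read programmatically through nvidia-smi's CSV query mode. This is a minimal sketch; `parse_gpu_line` and `read_gpus` are illustrative names, and the parser assumes every field is numeric (a GPU that reports `[N/A]` for a field would need extra handling).

```python
import subprocess

QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    used_mib, total_mib, util_pct = (int(field.strip()) for field in line.split(","))
    return {"used_mib": used_mib, "total_mib": total_mib, "util_pct": util_pct}


def read_gpus() -> list[dict]:
    """Return one dict per GPU; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Usage (on a machine with an NVIDIA driver): print(read_gpus())
```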
Cohere API Alternative
If local deployment isn’t feasible, the Cohere API offers North Mini Code with ~199 tokens/second throughput. That’s blazing fast and saves you the GPU infrastructure:
import cohere

co = cohere.ClientV2(api_key="your-key-here")
response = co.chat(
    model="north-mini-code-1.0",
    messages=[
        {"role": "user", "content": "Implement a Redis-backed rate limiter in Go"}
    ],
)
print(response.message.content[0].text)
The trade-off is obvious: API costs money per token and sends your code to Cohere’s servers. For sensitive codebases, self-hosting is the way to go. For personal projects or non-sensitive work, the API is faster to get started with.
GPU Comparison for Running North Mini Code
Not all GPUs are equal. Here’s a practical comparison for this specific model:
| GPU | VRAM | Can Run FP8? | Can Run BF16? | Notes |
|---|---|---|---|---|
| H100 80GB | 80GB | ✅ Comfortable | ✅ Tight | Best single-GPU option |
| A100 80GB | 80GB | ✅ Comfortable | ✅ Tight | Good alternative |
| A100 40GB | 40GB | ✅ Tight | ❌ Need 2x | Budget multi-GPU |
| RTX 4090 | 24GB | ❌ | ❌ | Wait for INT4 |
| RTX 5090 | 32GB | ❌ (barely) | ❌ | Might work with INT4+offload |
For a broader discussion of GPU options, see our GPU vs CPU for AI inference guide.
What About Cloud GPUs?
If you don’t own the hardware, cloud GPU providers offer H100s on demand:
- RunPod: H100 from ~$3.50/hr
- Lambda Labs: H100 from ~$3.00/hr
- AWS (p5 instances): H100 available, higher cost but more features
- Vast.ai: Community GPUs, cheapest option but less reliable
For occasional use, cloud GPUs are much cheaper than buying hardware. For regular daily use, the math starts favoring ownership.
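The break-even point is a one-line calculation: purchase price divided by the hourly cloud rate. The sketch below uses the ~$3.00/hr figure from the list above and a hypothetical purchase price; substitute a real quote. It also ignores electricity, hosting, and resale value, all of which shift the result.

```python
def break_even_hours(hardware_cost_usd: float, cloud_rate_usd_per_hr: float) -> float:
    """GPU-hours of cloud rental that cost as much as buying the hardware."""
    return hardware_cost_usd / cloud_rate_usd_per_hr


HARDWARE_COST_USD = 30_000  # HYPOTHETICAL purchase price, for illustration only
CLOUD_RATE_USD_PER_HR = 3.00  # Lambda Labs figure quoted above

hours = break_even_hours(HARDWARE_COST_USD, CLOUD_RATE_USD_PER_HR)
print(f"Break-even after {hours:,.0f} GPU-hours")
for hours_per_day in (2, 8, 24):
    print(f"  at {hours_per_day} h/day: {hours / hours_per_day / 365:.1f} years")
```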
FAQ
Why can’t I use Ollama with North Mini Code?
North Mini Code uses a custom 128-expert MoE architecture that isn’t yet supported by llama.cpp (which Ollama is built on). Support needs to be added upstream. Until then, use vLLM or SGLang. For Ollama-compatible alternatives in the same class, try Qwen 3.6 35B-A3B.
What’s the minimum hardware I need?
The absolute minimum is a single GPU with 35GB+ VRAM (for FP8). Practically, that means an H100 80GB or A100 80GB. Consumer GPUs like the RTX 4090 (24GB) cannot run this model at any currently available precision.
Is FP8 quality significantly worse than BF16?
No. Cohere specifically optimized the FP8 variant, and benchmarks show negligible quality difference. FP8 is the recommended format for deployment. You’re halving your memory requirement with essentially no quality loss.
How does the speed compare to running via the Cohere API?
The Cohere API achieves ~199 tok/s, which is very fast. Self-hosted performance depends heavily on your hardware and serving configuration. On a single H100 with vLLM, expect 80-150 tok/s for single requests. The API will generally be faster due to Cohere’s optimized infrastructure, but self-hosting gives you privacy and no per-token costs.
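To see where your own setup falls, one option is to time a single request and divide the generated token count by the elapsed time. The sketch below does this against the local endpoint with the standard library; it assumes the server returns the usual OpenAI-style `usage.completion_tokens` field, and the prompt and function names are illustrative. The result is an end-to-end figure that includes prompt processing and HTTP overhead, so it will read slightly lower than pure decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    return completion_tokens / elapsed_s


def measure(base_url: str = "http://localhost:8000/v1",
            model: str = "north-mini-code") -> float:
    """Send one chat request and return end-to-end tokens per second."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Write quicksort in Python."}],
        "max_tokens": 512,
        "temperature": 0.1,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=300) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


# Usage (with the server running): print(f"{measure():.0f} tok/s")
```

Run it a few times and discard the first result, since the first request often includes warm-up cost.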
Can I fine-tune North Mini Code?
Yes — it’s Apache 2.0 licensed, so there are no restrictions on fine-tuning. However, fine-tuning a 30B MoE model requires significant compute. You’ll need multiple H100s and a framework that supports MoE fine-tuning (like Megatron-LM or specialized forks of DeepSpeed). For most use cases, prompt engineering with the base model is sufficient.
How do I choose between vLLM and SGLang?
Both work well. vLLM is more mature and has broader community support. SGLang offers RadixAttention (great for repeated context in coding workflows) and constrained generation. If you’re unsure, start with vLLM — it’s simpler to set up and has more documentation available.