Apr 30, 2026 · 5 min read

Last updated on Apr 20, 2026

How to Run Mistral Large 2 Locally — Setup Guide (2026)

At 123B parameters, Mistral Large 2 is about the largest model you can realistically run on a single high-end GPU, and only with 4-bit quantization. Here’s how to set it up.

Hardware requirements

Running a 123B parameter model locally demands serious hardware. The amount of VRAM you need depends entirely on the precision and quantization format you choose.

Precision	Memory	Hardware	Tokens/sec
FP16	~250GB	4x A100 80GB	30-40
INT8	~125GB	2x A100 80GB	25-35
Q5_K_M (GGUF)	~85GB	2x RTX 4090 (with CPU offload) or Mac Ultra 192GB	8-15
Q4_K_M (GGUF)	~65GB	1x H100 or Mac Ultra 192GB	10-18
Q4 (GPTQ)	~65GB	1x H100 or Mac Ultra 192GB	15-25
Q3_K_M (GGUF)	~52GB	Mac Ultra 128GB	5-8
Q2_K (GGUF)	~42GB	2x RTX 3090	3-6

Minimum system RAM: 64GB (for model loading overhead). Recommended: 128GB+ if using CPU offloading.
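
The memory column follows from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate for the weights alone, assuming 1 GB = 1e9 bytes; the function name is illustrative. For FP16 it gives about 246 GB, in line with the ~250GB figure in the table. Real quantized files differ somewhat because formats such as Q4_K_M store more than their nominal bits per weight, and the KV cache and runtime buffers come on top.

```python
def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 123_000_000_000  # Mistral Large 2 parameter count, as stated in this guide

# Nominal bit widths only; effective bits per weight of real formats are an
# assumption you should check against the actual file size you download.
for label, bits in [("FP16", 16), ("INT8", 8), ("nominal 4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```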

If you don’t have multi-GPU hardware at home, cloud GPU providers offer H100 and multi-A100 instances that can run Mistral Large 2 at full speed for a few dollars per hour.

Quantization options explained

Choosing the right quantization format is critical for balancing quality and performance at this model size.

GGUF (llama.cpp / Ollama):

Best for: CPU+GPU hybrid inference, Mac systems
Q4_K_M offers the best quality-to-size ratio
Q5_K_M is nearly lossless but requires more memory
Q3_K_M is usable but shows noticeable quality degradation on complex reasoning

GPTQ:

Best for: Pure GPU inference with vLLM or text-generation-inference
4-bit is the standard choice
Requires calibration dataset during quantization
Slightly faster than GGUF on pure GPU setups

AWQ:

Best for: vLLM deployments with activation-aware quantization
Better quality than GPTQ at the same bit width for most tasks
Supported natively by vLLM with --quantization awq

For Mistral Large 2 specifically, AWQ 4-bit through vLLM gives the best speed-to-quality ratio on NVIDIA hardware. On Mac, GGUF is the only practical format, and Q4_K_M through Ollama is the simplest starting point.
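
As a rough aid for the trade-off described above, the sketch below picks the highest-precision format whose approximate size fits a memory budget. The sizes are copied from the hardware table in this guide; the function name and the 8 GB headroom default (for KV cache and runtime buffers) are assumptions for illustration. It only checks memory and does not know which backends support which format.

```python
# (format, approximate memory in GB) from the hardware table, highest precision first
QUANT_OPTIONS = [
    ("FP16", 250),
    ("INT8", 125),
    ("Q5_K_M", 85),
    ("Q4_K_M", 65),
    ("Q3_K_M", 52),
    ("Q2_K", 42),
]


def pick_quant(available_gb: float, headroom_gb: float = 8.0):
    """Return the highest-precision format that fits in the budget, or None."""
    budget = available_gb - headroom_gb  # headroom_gb is an assumed safety margin
    for name, size_gb in QUANT_OPTIONS:
        if size_gb <= budget:
            return name
    return None


print(pick_quant(80))  # single 80GB GPU
print(pick_quant(48))  # 2x 24GB GPUs without CPU offload
```

A result of None means nothing in the table fits entirely in that memory, so CPU offloading or a smaller model is needed.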

Option 1: Ollama (easiest)

ollama pull mistral-large:123b-q4
ollama serve

Then use with Aider:

aider --model ollama/mistral-large:123b-q4

Or Continue.dev:

{"models": [{"provider": "ollama", "model": "mistral-large:123b-q4"}]}

To check that the model loaded correctly and see memory usage:

ollama ps
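
If you want to call the model from your own scripts rather than through Aider or Continue.dev, Ollama also exposes a local REST API. The following is a minimal sketch using only the Python standard library, assuming Ollama is listening on its default port 11434 and that the model tag matches the one pulled above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("mistral-large:123b-q4", "Summarize this diff: ...") returns the completion as a string. The generous timeout is deliberate, since a 123B model can take a while to load on first use.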

Option 2: vLLM (fastest)

pip install vllm
# --quantization awq expects --model to point at an AWQ-quantized checkpoint
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2411 \
  --tensor-parallel-size 2 \
  --quantization awq \
  --port 8000

For maximum throughput with batched requests, add:

  --max-num-batched-tokens 8192 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

Option 3: llama.cpp (most flexible)

For fine-grained control over GPU layer offloading:

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

./build/bin/llama-server \
  -m Mistral-Large-2-123B-Q4_K_M.gguf \
  -ngl 60 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

Adjust -ngl (number of GPU layers) based on your available VRAM. With 24GB VRAM, you can offload roughly 20-25 layers; the rest runs on CPU.
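
A starting value for -ngl can be estimated by dividing usable VRAM by the average size of one layer. The sketch below assumes the ~65GB Q4_K_M size from the hardware table, 88 transformer layers (an assumption; check the model's config), equally sized layers, and 6 GB reserved for KV cache and buffers. All of these are rough, so treat the result as a first guess and adjust after watching actual VRAM usage.

```python
import math


def gpu_layers_that_fit(vram_gb: float, model_gb: float = 65.0,
                        n_layers: int = 88, reserve_gb: float = 6.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers          # average size of one layer
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # reserve_gb is an assumed margin
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


for vram in (24, 48, 80):
    print(f"{vram} GB VRAM -> try -ngl {gpu_layers_that_fit(vram)}")
```

With these assumptions a 24GB card lands at 24 layers, inside the 20-25 range quoted above.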

Option 4: Mac Studio Ultra

The Mac Studio Ultra with 192GB unified memory can run Q4 Mistral Large 2:

ollama pull mistral-large:123b-q4
ollama run mistral-large:123b-q4

Expect ~5-8 tokens/second: slow, but usable for code review and analysis. See our best AI models for Mac guide.

With Q5_K_M you’ll get slightly better quality at ~4-6 tokens/second, which is still acceptable for non-interactive tasks.

Performance benchmarks

Real-world performance measured on common hardware configurations:

Setup	Quant	Context	Tokens/sec	Time to first token
2x A100 80GB	AWQ 4-bit	4096	22 t/s	1.2s
1x H100 80GB	AWQ 4-bit	4096	28 t/s	0.8s
Mac Ultra 192GB	Q4_K_M	4096	7 t/s	3.5s
2x RTX 4090	Q4_K_M	4096	12 t/s	2.1s
RTX 4090 + CPU offload	Q4_K_M	4096	4 t/s	6.0s

Context length significantly impacts performance. At 32K context, expect roughly 40-50% slower generation compared to 4K context.
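
Part of why long contexts cost more is the KV cache, which grows linearly with context length and has to be read for every generated token. The sketch below estimates its size. The architecture values (88 layers, 8 KV heads, head dimension 128) and the FP16 cache are assumptions that should be verified against the model's config.json; the function name is illustrative.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int = 88, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


for ctx in (4096, 32768):
    print(f"{ctx} tokens: {kv_cache_bytes(ctx) / 1e9:.2f} GB")
```

Under these assumptions the cache is about 1.5 GB at 4K context and eight times that at 32K, which also eats into the VRAM available for model layers.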

Troubleshooting

Out of memory errors:

Reduce -ngl layers (llama.cpp) or lower --gpu-memory-utilization (vLLM)
Use a smaller quantization: Q3_K_M instead of Q4_K_M
Close other GPU-consuming applications
Check actual free VRAM with nvidia-smi
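
To check free VRAM from a script before loading the model, you can read nvidia-smi's CSV output. This is a minimal sketch, assuming nvidia-smi is on the PATH; the function names are illustrative.

```python
import subprocess


def parse_free_mib(csv_text: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_free_mib() -> list[int]:
    """Ask nvidia-smi for free memory per GPU, in MiB (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_free_mib(result.stdout)
```

Calling query_free_mib() returns one number per GPU, which you can sum and compare against the model size before choosing -ngl or --gpu-memory-utilization.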

Slow generation speed:

Ensure you’re offloading enough layers to GPU; layers left on the CPU run roughly 10x slower
Reduce context length if you don’t need it: for example, -c 4096 instead of -c 8192
On Mac, ensure you’re not running other memory-intensive apps

Model fails to load:

Verify file integrity: run sha256sum on the file and compare the result against the published hash
Ensure sufficient system RAM (not just VRAM) — the model needs RAM during initial loading
For vLLM tensor parallelism, ensure NCCL is properly installed
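
The integrity check from the first item can also be done in Python, which is handy on systems without the checksum command-line tools. This sketch reads the file in chunks so a 65GB model does not need to fit in RAM; the function name is illustrative, and the algorithm should match whatever hash the model's publisher provides.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Compare the returned string with the published hash; any mismatch means the download is corrupt or incomplete and should be repeated.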

Garbled or low-quality output:

Q2_K quantization loses significant quality at 123B — upgrade to Q4_K_M minimum
Check that your prompt template matches Mistral’s expected format
Ensure temperature isn’t set too high for code tasks (use 0.1-0.3)

Ollama shows “model not found”:

Run ollama list to verify available models
Check disk space — the Q4 model requires ~65GB of free disk
Try ollama pull again; downloads can silently fail on unstable connections
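
The first two checks can be scripted. The sketch below lists installed models through Ollama's /api/tags endpoint and checks free disk space with the standard library. It assumes Ollama on its default port 11434; the function names are illustrative, and the path you check should be wherever your Ollama models are stored.

```python
import json
import shutil
import urllib.request


def installed_models(tags_json: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.loads(resp.read())


def has_free_disk(path: str, required_gb: float) -> bool:
    """True if the filesystem holding `path` has at least required_gb free."""
    return shutil.disk_usage(path).free >= required_gb * 1e9
```

For example, has_free_disk(".", 65) checks the current directory's filesystem against the ~65GB the Q4 model needs, and installed_models(fetch_tags()) shows what the server can actually see.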

Practical alternatives

If 123B is too large for your hardware:

Model	Size	VRAM	Quality
Qwen 3.5 72B	Q4: 40GB	2x RTX 4090	Very good
Qwen 3.5 27B	Q4: 16GB	1x RTX 4090	Good
Devstral Small 24B	Q4: 14GB	1x RTX 4070	Good for coding
Gemma 4 27B	Q4: 16GB	1x RTX 4090	Good

For most developers, a 27B-72B model at Q4 provides 85-90% of Mistral Large 2’s quality at a fraction of the hardware cost. Only run the full 123B if you specifically need its multilingual capabilities or long-context reasoning.
