How to Run Mistral Large 2 Locally β Setup Guide (2026)
Mistral Large 2 at 123B parameters is the largest model you can realistically run on a single high-end GPU. Hereβs how to set it up.
Hardware requirements
Running a 123B parameter model locally demands serious hardware. The amount of VRAM you need depends entirely on the precision and quantization format you choose.
| Precision | Memory | Hardware | Tokens/sec |
|---|---|---|---|
| FP16 | ~250GB | 4x A100 80GB | 30-40 |
| INT8 | ~125GB | 2x A100 80GB | 25-35 |
| Q5_K_M (GGUF) | ~85GB | 2x RTX 4090 or Mac Ultra 192GB | 8-15 |
| Q4_K_M (GGUF) | ~65GB | 1x H100 or Mac Ultra 192GB | 10-18 |
| Q4 (GPTQ) | ~65GB | 1x H100 or Mac Ultra 192GB | 15-25 |
| Q3_K_M (GGUF) | ~52GB | Mac Ultra 128GB | 5-8 |
| Q2_K (GGUF) | ~42GB | 2x RTX 3090 | 3-6 |
Minimum system RAM: 64GB (for model loading overhead). Recommended: 128GB+ if using CPU offloading.
If you donβt have multi-GPU hardware at home, cloud GPU providers offer H100 and multi-A100 instances that can run Mistral Large 2 at full speed for a few dollars per hour.
Quantization options explained
Choosing the right quantization format is critical for balancing quality and performance at this model size.
GGUF (llama.cpp / Ollama):
- Best for: CPU+GPU hybrid inference, Mac systems
- Q4_K_M offers the best quality-to-size ratio
- Q5_K_M is nearly lossless but requires more memory
- Q3_K_M is usable but shows noticeable quality degradation on complex reasoning
GPTQ:
- Best for: Pure GPU inference with vLLM or text-generation-inference
- 4-bit is the standard choice
- Requires calibration dataset during quantization
- Slightly faster than GGUF on pure GPU setups
AWQ:
- Best for: vLLM deployments with activation-aware quantization
- Better quality than GPTQ at the same bit width for most tasks
- Supported natively by vLLM with
--quantization awq
For Mistral Large 2 specifically, AWQ 4-bit through vLLM gives the best speed-to-quality ratio on NVIDIA hardware. On Mac, GGUF Q4_K_M through Ollama is your only practical option.
Option 1: Ollama (easiest)
ollama pull mistral-large:123b-q4
ollama serve
Then use with Aider:
aider --model ollama/mistral-large:123b-q4
Or Continue.dev:
{"models": [{"provider": "ollama", "model": "mistral-large:123b-q4"}]}
To check that the model loaded correctly and see memory usage:
ollama ps
Option 2: vLLM (fastest)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Large-Instruct-2411 \
--tensor-parallel-size 2 \
--quantization awq \
--port 8000
For maximum throughput with batched requests, add:
--max-num-batched-tokens 8192 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.92
Option 3: llama.cpp (most flexible)
For fine-grained control over GPU layer offloading:
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make LLAMA_CUDA=1 -j$(nproc)
./llama-server \
-m Mistral-Large-2-123B-Q4_K_M.gguf \
-ngl 60 \
-c 8192 \
--host 0.0.0.0 \
--port 8080
Adjust -ngl (number of GPU layers) based on your available VRAM. With 24GB VRAM, you can offload roughly 20-25 layers; the rest runs on CPU.
Option 3: Mac Studio Ultra
The Mac Studio Ultra with 192GB unified memory can run Q4 Mistral Large 2:
ollama pull mistral-large:123b-q4
ollama run mistral-large:123b-q4
Expect ~5-8 tokens/second β slow but usable for code review and analysis. See our best AI models for Mac guide.
With Q5_K_M youβll get slightly better quality at ~4-6 tokens/second, which is still acceptable for non-interactive tasks like code review.
Performance benchmarks
Real-world performance measured on common hardware configurations:
| Setup | Quant | Context | Tokens/sec | Time to first token |
|---|---|---|---|---|
| 2x A100 80GB | AWQ 4-bit | 4096 | 22 t/s | 1.2s |
| 1x H100 80GB | AWQ 4-bit | 4096 | 28 t/s | 0.8s |
| Mac Ultra 192GB | Q4_K_M | 4096 | 7 t/s | 3.5s |
| 2x RTX 4090 | Q4_K_M | 4096 | 12 t/s | 2.1s |
| RTX 4090 + CPU offload | Q4_K_M | 4096 | 4 t/s | 6.0s |
Context length significantly impacts performance. At 32K context, expect roughly 40-50% slower generation compared to 4K context.
Troubleshooting
Out of memory errors:
- Reduce
-ngllayers (llama.cpp) or lower--gpu-memory-utilization(vLLM) - Use a smaller quantization: Q3_K_M instead of Q4_K_M
- Close other GPU-consuming applications
- Check actual free VRAM with
nvidia-smi
Slow generation speed:
- Ensure youβre offloading enough layers to GPU β CPU-bound layers are 10x slower
- Reduce context length if you donβt need it:
-c 4096instead of default - On Mac, ensure youβre not running other memory-intensive apps
Model fails to load:
- Verify file integrity:
md5sumagainst the published hash - Ensure sufficient system RAM (not just VRAM) β the model needs RAM during initial loading
- For vLLM tensor parallelism, ensure NCCL is properly installed
Garbled or low-quality output:
- Q2_K quantization loses significant quality at 123B β upgrade to Q4_K_M minimum
- Check that your prompt template matches Mistralβs expected format
- Ensure temperature isnβt set too high for code tasks (use 0.1-0.3)
Ollama shows βmodel not foundβ:
- Run
ollama listto verify available models - Check disk space β the Q4 model requires ~65GB of free disk
- Try
ollama pullagain; downloads can silently fail on unstable connections
Practical alternatives
If 123B is too large for your hardware:
| Model | Size | VRAM | Quality |
|---|---|---|---|
| Qwen 3.5 72B | Q4: 40GB | 2x RTX 4090 | Very good |
| Qwen 3.5 27B | Q4: 16GB | 1x RTX 4090 | Good |
| Devstral Small 24B | Q4: 14GB | 1x RTX 4070 | Good for coding |
| Gemma 4 27B | Q4: 16GB | 1x RTX 4090 | Good |
For most developers, a 27B-72B model at Q4 provides 85-90% of Mistral Large 2βs quality at a fraction of the hardware cost. Only run the full 123B if you specifically need its multilingual capabilities or long-context reasoning.
Related: Mistral Large 2 Complete Guide Β· Ollama Complete Guide 2026 Β· How Much VRAM for AI Β· GGUF vs GPTQ vs AWQ Quantization Formats Β· Best GPU for AI Locally Β· Cheapest Way to Run AI Locally Β· How To Run Kimi K2 5 Locally