
How to Run Mistral Large 2 Locally: Setup Guide (2026)


At 123B parameters, Mistral Large 2 is about the largest model you can realistically run on a single high-end GPU, and only when quantized to roughly 4 bits. Here's how to set it up.

Hardware requirements

Running a 123B parameter model locally demands serious hardware. The amount of VRAM you need depends entirely on the precision and quantization format you choose.

| Precision | Memory | Hardware | Tokens/sec |
|---|---|---|---|
| FP16 | ~250GB | 4x A100 80GB | 30-40 |
| INT8 | ~125GB | 2x A100 80GB | 25-35 |
| Q5_K_M (GGUF) | ~85GB | 2x RTX 4090 or Mac Ultra 192GB | 8-15 |
| Q4_K_M (GGUF) | ~65GB | 1x H100 or Mac Ultra 192GB | 10-18 |
| Q4 (GPTQ) | ~65GB | 1x H100 or Mac Ultra 192GB | 15-25 |
| Q3_K_M (GGUF) | ~52GB | Mac Ultra 128GB | 5-8 |
| Q2_K (GGUF) | ~42GB | 2x RTX 3090 | 3-6 |

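Minimum system RAM: 64GB (for model loading overhead). Recommended: 128GB+ if using CPU offloading. Note that two 24GB cards provide 48GB of VRAM in total, so any configuration whose memory figure exceeds that (for example Q5_K_M on 2x RTX 4090) implies offloading the remainder to system RAM.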
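As a rough sanity check on the memory figures, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name is made up for illustration, and the bit widths are nominal. Real quantized files also store scales and other metadata, and inference needs extra memory for the KV cache, so actual usage lands above these numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths (assumption): real formats add per-block scale metadata.
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(123, bits):.1f} GB")
```

For 123B parameters this gives about 246 GB at 16 bits, 123 GB at 8 bits, and 61.5 GB at a plain 4 bits, in line with the first rows of the table once overhead is added.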

If you don't have multi-GPU hardware at home, cloud GPU providers offer H100 and multi-A100 instances that can run Mistral Large 2 at full speed for a few dollars per hour.

Quantization options explained

Choosing the right quantization format is critical for balancing quality and performance at this model size.

GGUF (llama.cpp / Ollama):

  • Best for: CPU+GPU hybrid inference, Mac systems
  • Q4_K_M offers the best quality-to-size ratio
  • Q5_K_M is nearly lossless but requires more memory
  • Q3_K_M is usable but shows noticeable quality degradation on complex reasoning

GPTQ:

  • Best for: Pure GPU inference with vLLM or text-generation-inference
  • 4-bit is the standard choice
  • Requires calibration dataset during quantization
  • Slightly faster than GGUF on pure GPU setups

AWQ:

  • Best for: vLLM deployments with activation-aware quantization
  • Better quality than GPTQ at the same bit width for most tasks
  • Supported natively by vLLM with --quantization awq

For Mistral Large 2 specifically, AWQ 4-bit through vLLM gives the best speed-to-quality ratio on NVIDIA hardware. On Mac, GGUF Q4_K_M (through Ollama or llama.cpp) is the practical option.

Option 1: Ollama (easiest)

ollama pull mistral-large:123b-q4
ollama serve

Then use with Aider:

aider --model ollama/mistral-large:123b-q4

Or Continue.dev:

{"models": [{"provider": "ollama", "model": "mistral-large:123b-q4"}]}

To check that the model loaded correctly and see memory usage:

ollama ps
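Once the server is running, you can also talk to it from a script. The sketch below uses only the Python standard library and assumes Ollama's default local address (port 11434) and its `/api/generate` endpoint with streaming turned off; the helper names are illustrative, and the model tag should be whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server):
# print(generate("mistral-large:123b-q4", "Reply with the single word: ready"))
```

The long timeout is deliberate: the first request after a pull has to load tens of gigabytes into memory before any token is produced.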

Option 2: vLLM (fastest)

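pip install vllm
# Note: --quantization awq expects weights that are already AWQ-quantized,
# so point --model at an AWQ checkpoint of the model rather than a full-precision repo.
# --tensor-parallel-size should match the number of GPUs you are splitting across.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2411 \
  --tensor-parallel-size 2 \
  --quantization awq \
  --port 8000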

For maximum throughput with batched requests, add:

  --max-num-batched-tokens 8192 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

Option 3: llama.cpp (most flexible)

For fine-grained control over GPU layer offloading:

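git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
# Recent llama.cpp versions build with CMake; older checkouts used `make LLAMA_CUDA=1`
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve the GGUF file over HTTP: -ngl sets the number of GPU layers, -c the context size
./build/bin/llama-server \
  -m Mistral-Large-2-123B-Q4_K_M.gguf \
  -ngl 60 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080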

Adjust -ngl (number of GPU layers) based on your available VRAM. With 24GB VRAM, you can offload roughly 20-25 layers; the rest runs on CPU.
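A starting value for -ngl can be estimated by dividing the model file size evenly across its layers and seeing how many fit in VRAM after reserving headroom. The sketch below is a heuristic, not llama.cpp's own logic: the 88-layer count is an assumption to check against the model's config, the 6GB reserve for KV cache and compute buffers is a guess, and layers are treated as equal in size.

```python
def estimate_gpu_layers(vram_gb: float, model_file_gb: float, n_layers: int,
                        reserve_gb: float = 6.0) -> int:
    """Estimate how many layers fit on the GPU, leaving reserve_gb free."""
    per_layer_gb = model_file_gb / n_layers      # assumes equally sized layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)   # headroom for KV cache, buffers
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed values: ~65GB Q4_K_M file, 88 transformer layers, one 24GB card.
print(estimate_gpu_layers(vram_gb=24, model_file_gb=65, n_layers=88))
```

Treat the result as a first guess, then raise or lower -ngl while watching nvidia-smi until the model loads without running out of memory.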

Option 4: Mac Studio Ultra

The Mac Studio Ultra with 192GB unified memory can run Q4 Mistral Large 2:

ollama pull mistral-large:123b-q4
ollama run mistral-large:123b-q4

Expect ~5-8 tokens/second: slow, but usable for code review and analysis. See our best AI models for Mac guide.

With Q5_K_M you'll get slightly better quality at ~4-6 tokens/second, which is still acceptable for non-interactive tasks like code review.

Performance benchmarks

Real-world performance measured on common hardware configurations:

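| Setup | Quant | Context | Tokens/sec | Time to first token |
|---|---|---|---|---|
| 2x A100 80GB | AWQ 4-bit | 4096 | 22 t/s | 1.2s |
| 1x H100 80GB | AWQ 4-bit | 4096 | 28 t/s | 0.8s |
| Mac Ultra 192GB | Q4_K_M | 4096 | 7 t/s | 3.5s |
| 2x RTX 4090 | Q4_K_M | 4096 | 12 t/s | 2.1s |
| RTX 4090 + CPU offload | Q4_K_M | 4096 | 4 t/s | 6.0s |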

Context length significantly impacts performance. At 32K context, expect roughly 40-50% slower generation compared to 4K context.
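One reason long contexts cost more is the KV cache, which grows linearly with the number of tokens held in context. The sketch below estimates its size under assumed architecture values (88 layers, 8 key/value heads, head dimension 128, 16-bit cache entries); check these against the model's published config before relying on the numbers. It illustrates the memory side only and does not model the slowdown itself.

```python
def kv_cache_gib(context_tokens: int, n_layers: int = 88, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor 2 covers one key vector and one value vector per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


for ctx in (4096, 32768):
    print(f"{ctx} tokens: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Under these assumptions the cache is about 1.4 GiB at 4K context and 11 GiB at 32K, which has to fit in VRAM alongside the weights.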

Troubleshooting

Out of memory errors:

  • Reduce -ngl layers (llama.cpp) or lower --gpu-memory-utilization (vLLM)
  • Use a smaller quantization: Q3_K_M instead of Q4_K_M
  • Close other GPU-consuming applications
  • Check actual free VRAM with nvidia-smi

Slow generation speed:

  • Ensure you're offloading enough layers to the GPU; layers left on the CPU can run roughly 10x slower
  • Reduce context length if you don't need it: -c 4096 instead of the default
  • On Mac, ensure you're not running other memory-intensive apps

Model fails to load:

  • Verify file integrity: compare a checksum of the file (e.g. sha256sum or md5sum, whichever the publisher lists) against the published hash
  • Ensure sufficient system RAM (not just VRAM); the model needs RAM during initial loading
  • For vLLM tensor parallelism, ensure NCCL is properly installed
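The integrity check can also be scripted, which is handy for multi-part downloads. This is a minimal sketch assuming the publisher lists a SHA-256 hash (swap in `hashlib.md5` if only an MD5 is given). The function names are illustrative, and the file is read in chunks so a 65GB model never has to fit in RAM.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published hash, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call `matches_published_hash` with the path to your GGUF file and the hash string copied from the download page; a `False` result means the file should be downloaded again.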

Garbled or low-quality output:

  • Q2_K quantization loses significant quality at 123B; use Q4_K_M at minimum
  • Check that your prompt template matches Mistral's expected format
  • Ensure temperature isn't set too high for code tasks (use 0.1-0.3)

Ollama shows "model not found":

  • Run ollama list to verify available models
  • Check disk space: the Q4 model requires ~65GB of free disk
  • Try ollama pull again; downloads can silently fail on unstable connections

Practical alternatives

If 123B is too large for your hardware:

| Model | Size at Q4 | Hardware | Quality |
|---|---|---|---|
| Qwen 3.5 72B | 40GB | 2x RTX 4090 | Very good |
| Qwen 3.5 27B | 16GB | 1x RTX 4090 | Good |
| Devstral Small 24B | 14GB | 1x RTX 4070 | Good for coding |
| Gemma 4 27B | 16GB | 1x RTX 4090 | Good |

For most developers, a 27B-72B model at Q4 provides 85-90% of Mistral Large 2's quality at a fraction of the hardware cost. Only run the full 123B if you specifically need its multilingual capabilities or long-context reasoning.

Related: Mistral Large 2 Complete Guide · Ollama Complete Guide 2026 · How Much VRAM for AI · GGUF vs GPTQ vs AWQ Quantization Formats · Best GPU for AI Locally · Cheapest Way to Run AI Locally · How To Run Kimi K2 5 Locally