🤖 AI Tools
· 3 min read

How to Run AI Without a GPU — CPU-Only Inference Guide (2026)


You don’t need a GPU to run AI models locally. CPU-only inference is slower, but with the right model and quantization, it’s usable for many tasks. Here’s how to do it and what to expect.

Realistic speed expectations

CPU inference speed depends on your processor, RAM speed, and model size. Here are real-world numbers:

| CPU | Model | Quantization | Speed |
| --- | --- | --- | --- |
| Intel i7-12700 | Qwen3.5-9B | Q4_K_M | ~5-8 tok/s |
| Intel i5-10400 | Qwen3.5-4B | Q4_K_M | ~8-12 tok/s |
| AMD Ryzen 7 5800X | Qwen3.5-9B | Q4_K_M | ~6-10 tok/s |
| Apple M4 (CPU only) | Qwen3.5-9B | Q4_K_M | ~15-20 tok/s |
| Intel i9-9900K | Qwen3.5-9B | Q4_K_M | ~4-6 tok/s |
| Any modern CPU | Qwen3.5-0.8B | Q4_K_M | ~15-25 tok/s |

For comparison, a GPU (RTX 4090) runs the same Qwen3.5-9B at ~45 tok/s. CPU is 5-10x slower, but 5-8 tok/s is still readable — about the speed of someone typing.

Apple Silicon is the exception. Even in CPU-only mode, M-series chips are significantly faster than Intel/AMD for AI inference thanks to their memory bandwidth.
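The bandwidth point is worth making concrete: token generation is usually memory-bandwidth bound, because producing each token reads essentially every weight once. A back-of-envelope sketch (the bandwidth figures and the ~5.5 GB size for a Q4_K_M 9B model are ballpark assumptions, not measurements):

```python
# Upper-bound estimate of CPU generation speed. Each generated token streams
# all model weights through memory once, so tok/s is capped by
# memory bandwidth divided by model size.

def estimate_tok_per_s(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Theoretical ceiling: tokens/sec = bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_size_gb

# Assumed size for a Q4_K_M 9B model: roughly 5.5 GB.
model_gb = 5.5

for name, bw in [("DDR4 dual-channel (~50 GB/s)", 50),
                 ("DDR5 dual-channel (~80 GB/s)", 80),
                 ("Apple M4 unified memory (~120 GB/s)", 120)]:
    print(f"{name}: ~{estimate_tok_per_s(model_gb, bw):.0f} tok/s ceiling")
```

Real numbers land below these ceilings, but the ranking matches the table above: the M4's wide memory bus, not raw core speed, is why it pulls ahead.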

Setup with Ollama

Ollama automatically uses CPU if no GPU is detected:

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:4b

That’s it. No GPU drivers, no CUDA, no configuration. It just works.
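Once the model is pulled, Ollama also serves a local HTTP API on port 11434, CPU or not. A minimal standard-library sketch (it assumes `ollama serve` or the desktop app is running and the model has been pulled):

```python
import json
import urllib.request

# Build a request against Ollama's local /api/generate endpoint.
# "stream": False asks for a single JSON response instead of a token stream.

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires a running Ollama instance):
# req = build_generate_request("qwen3.5:4b", "Explain quantization in one line.")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```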

Setup with llama.cpp (more control)

# Build (CPU-only is the default; current llama.cpp uses CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download a model
huggingface-cli download Qwen/Qwen3.5-4B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with all CPU threads
./build/bin/llama-server -m ./models/Qwen3.5-4B-Q4_K_M.gguf \
  --threads $(nproc) --ctx-size 4096 --port 8080

Key flags for CPU performance:

  • --threads $(nproc) — use all CPU cores
  • --ctx-size 4096 — smaller context = faster
  • --batch-size 512 — optimize batch processing
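llama-server also exposes an OpenAI-compatible endpoint, so any OpenAI-style client can talk to it. A standard-library sketch, assuming the server command above is running on port 8080:

```python
import json
import urllib.request

# Build an OpenAI-style chat request against llama-server's
# /v1/chat/completions endpoint.

def build_chat_request(prompt: str, port: int = 8080) -> urllib.request.Request:
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode("utf-8")
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# To send it (requires llama-server running):
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```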

Best models for CPU-only

| RAM available | Best model | Why |
| --- | --- | --- |
| 4GB | Qwen3.5-2B | Best quality at this size |
| 8GB | Qwen3.5-4B | Good balance |
| 16GB | Qwen3.5-9B | Best quality-per-resource |
| 32GB | Qwen3.5-27B (Q4) | Strong but slow (~2-4 tok/s) |

Stick to models where the quantized size fits comfortably in RAM with room to spare. If the model barely fits, it’ll use swap and become unusably slow.
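One rule of thumb that matches the table above: let the quantized model take at most about half of total RAM, leaving the rest for the KV cache, the OS, and everything else. The size figures below are approximate Q4_K_M file sizes (my assumption):

```python
# Check whether a quantized model fits in RAM with comfortable headroom.
# "Comfortable" here means the model file is at most ~half of total RAM.

def fits_comfortably(model_size_gb: float, ram_gb: float) -> bool:
    """True if the model occupies no more than about half of total RAM."""
    return model_size_gb * 2 <= ram_gb

# Approximate Q4_K_M sizes (assumed, rounded):
for model, size in [("Qwen3.5-4B", 2.5), ("Qwen3.5-9B", 5.5), ("Qwen3.5-27B", 16.0)]:
    print(f"{model} on 16GB RAM: {'ok' if fits_comfortably(size, 16) else 'too tight'}")
```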

When CPU-only makes sense

Good for:

  • Batch processing (speed doesn’t matter, just cost)
  • Background tasks (summarize documents overnight)
  • Simple Q&A and chat (5-8 tok/s is fine for conversation)
  • Servers with lots of RAM but no GPU
  • Learning and experimentation

Not good for:

  • Real-time coding assistance (too slow for autocomplete)
  • Interactive applications needing fast responses
  • Models larger than 14B (too slow to be useful)
  • Concurrent users (CPU saturates quickly)

Speed optimization tips

  1. Use all cores. Set --threads to your CPU core count.
  2. Use Q4_K_M quantization. Best speed/quality balance for CPU.
  3. Reduce context size. 2048-4096 is enough for most tasks.
  4. Close other applications. CPU inference uses all cores — other apps compete.
  5. Use fast RAM. DDR5 beats DDR4: inference is memory-bandwidth bound, so faster RAM directly raises tok/s.
  6. Consider a used GPU. A $200 RTX 3060 12GB will be 5-10x faster than any CPU.
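To see how these tips play out on your own hardware, a rough throughput harness is enough. Whitespace-splitting the output is only a proxy for real tokenizer counts (it slightly under-counts), but it's fine for before/after comparisons:

```python
import time

# Time any generation callable and report approximate tokens per second.
# Useful for A/B testing thread counts, quantization levels, or context sizes.

def measure_tok_per_s(generate, prompt: str) -> float:
    """Approximate throughput: whitespace-token count / wall-clock seconds."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(text.split()) / elapsed

# Example: pass in a function that wraps your backend (e.g. the Ollama API)
# and compare the result against the table at the top of this guide.
```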

The honest take

CPU-only inference works, but it’s a compromise. If you’re generating a few responses per day for personal use, it’s fine. If you need AI as part of your daily workflow, invest in a GPU or Apple Silicon Mac. The speed difference is transformative.

The best “no GPU” option in 2026 is actually a Mac Mini M4 ($599). Its unified memory architecture means the GPU and CPU share RAM, so even the base model runs AI faster than most discrete CPU setups.