You don’t need a GPU to run AI models locally. CPU-only inference is slower, but with the right model and quantization, it’s usable for many tasks. Here’s how to do it and what to expect.
## Realistic speed expectations
CPU inference speed depends on your processor, RAM speed, and model size. Here are real-world numbers:
| CPU | Model | Quantization | Speed |
|---|---|---|---|
| Intel i7-12700 | Qwen3.5-9B | Q4_K_M | ~5-8 tok/s |
| Intel i5-10400 | Qwen3.5-4B | Q4_K_M | ~8-12 tok/s |
| AMD Ryzen 7 5800X | Qwen3.5-9B | Q4_K_M | ~6-10 tok/s |
| Apple M4 (CPU only) | Qwen3.5-9B | Q4_K_M | ~15-20 tok/s |
| Intel i9-9900K | Qwen3.5-9B | Q4_K_M | ~4-6 tok/s |
| Any modern CPU | Qwen3.5-0.8B | Q4_K_M | ~15-25 tok/s |
For comparison, a GPU (RTX 4090) runs the same Qwen3.5-9B at ~45 tok/s. CPU is 5-10x slower, but 5-8 tok/s is still usable — roughly the pace at which most people read, so text appears about as fast as you can follow it.
Apple Silicon is the exception. Even in CPU-only mode, M-series chips are significantly faster than Intel/AMD for AI inference thanks to their high unified memory bandwidth.
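The bandwidth point generalizes into a useful back-of-envelope rule: token generation is typically memory-bound, so throughput is capped near memory bandwidth divided by the model's in-RAM size. A rough sketch — both figures below are illustrative assumptions, not measurements:

```sh
# Rule of thumb: tok/s upper bound ≈ memory bandwidth (GB/s) / model size (GB),
# because every generated token streams the full set of weights through RAM.
bandwidth_gbs=50    # dual-channel DDR4-3200, roughly (assumption)
model_gb=5          # a 9B model quantized to Q4_K_M, roughly (assumption)
est=$(awk -v bw="$bandwidth_gbs" -v sz="$model_gb" 'BEGIN { printf "%.0f", bw / sz }')
echo "~${est} tok/s upper bound"
```

Plug in an M-series chip's ~100-400 GB/s and the same model, and the Apple numbers in the table stop looking surprising.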
## Setup with Ollama
Ollama automatically uses CPU if no GPU is detected:
```sh
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:4b
```
That’s it. No GPU drivers, no CUDA, no configuration. It just works.
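Beyond the interactive REPL, Ollama also runs a local HTTP server (default port 11434) that scripts and other tools can call. A minimal sketch, guarded so it degrades gracefully when the server isn't running — the model name follows the pull above, and the prompt is just a placeholder:

```sh
# Ollama's /api/generate returns a single JSON object when "stream" is false.
if curl -sf http://localhost:11434/api/version > /dev/null; then
    reply=$(curl -s http://localhost:11434/api/generate -d \
      '{"model": "qwen3.5:4b", "prompt": "Why is the sky blue?", "stream": false}')
else
    reply="ollama server not running"
fi
echo "$reply"
```

The same endpoint is what most editor plugins and chat UIs talk to under the hood.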
## Setup with llama.cpp (more control)
```sh
# Build without GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a model
huggingface-cli download Qwen/Qwen3.5-4B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with all CPU threads
./llama-server -m ./models/Qwen3.5-4B-Q4_K_M.gguf \
  --threads $(nproc) --ctx-size 4096 --port 8080
```
Key flags for CPU performance:
- `--threads $(nproc)` — use all CPU cores
- `--ctx-size 4096` — smaller context = faster
- `--batch-size 512` — optimize batch processing
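Once it's up, `llama-server` exposes an OpenAI-compatible API at `/v1/chat/completions` on the chosen port. A quick smoke test — the payload is validated locally here, and the `curl` line is left commented because it needs the server from above actually running:

```sh
# Minimal OpenAI-style chat payload (content is a placeholder).
payload='{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
ok=$(echo "$payload" | python3 -m json.tool > /dev/null 2>&1 && echo yes)
echo "payload valid: $ok"
# With llama-server running on port 8080:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```

Because the API shape matches OpenAI's, most existing client libraries work by just pointing their base URL at `localhost:8080`.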
## Best models for CPU-only
| RAM available | Best model | Why |
|---|---|---|
| 4GB | Qwen3.5-2B | Best quality at this size |
| 8GB | Qwen3.5-4B | Good balance |
| 16GB | Qwen3.5-9B | Best quality-per-resource |
| 32GB | Qwen3.5-27B (Q4) | Strong but slow (~2-4 tok/s) |
Stick to models where the quantized size fits comfortably in RAM with room to spare. If the model barely fits, it’ll use swap and become unusably slow.
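A rough fit check can be sketched numerically. Q4_K_M works out to roughly 0.6 GB per billion parameters, plus a couple of GB of headroom for the KV cache and the OS — both constants here are rough assumptions, not exact figures:

```sh
# Does a quantized model fit comfortably in RAM?
params_b=9    # model size in billions of parameters (example)
ram_gb=16     # machine RAM (example)
verdict=$(awk -v p="$params_b" -v r="$ram_gb" \
  'BEGIN { need = p * 0.6 + 2; print (need < r ? "fits" : "too tight") }')
echo "a ${params_b}B model at Q4_K_M in ${ram_gb} GB RAM: $verdict"
```

If the verdict is "too tight", drop to the next smaller model rather than fighting swap.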
## When CPU-only makes sense
Good for:
- Batch processing (speed doesn’t matter, just cost)
- Background tasks (summarize documents overnight)
- Simple Q&A and chat (5-8 tok/s is fine for conversation)
- Servers with lots of RAM but no GPU
- Learning and experimentation
Not good for:
- Real-time coding assistance (too slow for autocomplete)
- Interactive applications needing fast responses
- Models larger than 14B (too slow to be useful)
- Concurrent users (CPU saturates quickly)
## Speed optimization tips
- Use all cores. Set `--threads` to your CPU core count.
- Use Q4_K_M quantization. Best speed/quality balance for CPU.
- Reduce context size. 2048-4096 is enough for most tasks.
- Close other applications. CPU inference uses all cores — other apps compete.
- Use fast RAM. DDR5 > DDR4. Higher MHz = faster inference.
- Consider a used GPU. A $200 RTX 3060 12GB will be 5-10x faster than any CPU.
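Two of the tips above, made concrete. Hyperthreading rarely helps llama.cpp, so the physical core count is usually the right starting point for `--threads`; from there, measure rather than guess — llama.cpp ships a `llama-bench` tool that reports tok/s per setting. The binary and model paths in the comment are assumptions from the earlier setup:

```sh
# Count physical cores on Linux (lscpu ships with util-linux);
# fall back to logical cores where lscpu is unavailable.
if command -v lscpu > /dev/null; then
    physical=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)
else
    physical=$(nproc)
fi
echo "threads to try first: $physical"
# Then benchmark around that value with a real build:
# ./llama-bench -m ./models/Qwen3.5-4B-Q4_K_M.gguf -t "$physical" -n 64
```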
## The honest take
CPU-only inference works, but it’s a compromise. If you’re generating a few responses per day for personal use, it’s fine. If you need AI as part of your daily workflow, invest in a GPU or Apple Silicon Mac. The speed difference is transformative.
The best “no GPU” option in 2026 is actually a Mac Mini M4 ($599). Its unified memory architecture means the GPU and CPU share RAM, so even the base model runs AI faster than most CPU-only desktop setups.