You don't need a GPU to run AI models locally. CPU-only inference is slower, but with the right model and quantization, it's usable for many tasks. Here's how to do it and what to expect.
Realistic speed expectations
CPU inference speed depends on your processor, RAM speed, and model size. Here are real-world numbers:
| CPU | Model | Quantization | Speed |
|---|---|---|---|
| Intel i7-12700 | Qwen3.5-9B | Q4_K_M | ~5-8 tok/s |
| Intel i5-10400 | Qwen3.5-4B | Q4_K_M | ~8-12 tok/s |
| AMD Ryzen 7 5800X | Qwen3.5-9B | Q4_K_M | ~6-10 tok/s |
| Apple M4 (CPU only) | Qwen3.5-9B | Q4_K_M | ~15-20 tok/s |
| Intel i9-9900K | Qwen3.5-9B | Q4_K_M | ~4-6 tok/s |
| Any modern CPU | Qwen3.5-0.8B | Q4_K_M | ~15-25 tok/s |
For comparison, a GPU (RTX 4090) runs the same Qwen3.5-9B at ~45 tok/s. CPU is 5-10x slower, but 5-8 tok/s is still readable, about the speed of someone typing.
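To put those rates in concrete terms, the wait for a full answer is just the token count divided by tokens per second. A small sketch, assuming a 400-token reply (an illustrative length, not a figure from the benchmarks above); the helper name `gen_seconds` is made up for this example:

```shell
#!/bin/sh
# Seconds needed to generate N tokens at a given tokens-per-second rate.
# Usage: gen_seconds <tokens> <tok_per_s>
gen_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

# 400 tokens is an assumed reply length, chosen only for illustration.
echo "CPU at 5 tok/s:  $(gen_seconds 400 5) s"
echo "CPU at 8 tok/s:  $(gen_seconds 400 8) s"
echo "GPU at 45 tok/s: $(gen_seconds 400 45) s"
```

Under that assumption the same reply takes roughly 50 to 80 seconds on CPU and about 9 seconds on the GPU.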
Apple Silicon is the exception. Even in CPU-only mode, M-series chips are significantly faster than Intel/AMD for AI inference thanks to their memory bandwidth.
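A rough way to see why bandwidth matters: generating each token means reading essentially all of the model's weights from memory, so memory bandwidth divided by model size gives an approximate ceiling on tokens per second. The sketch below applies that rule of thumb. The 5.5 GB model size is an assumption for a 9B model at Q4_K_M, and the bandwidth figures are theoretical peaks (dual-channel DDR4-3200 and a base M4) that real systems do not fully reach.

```shell
#!/bin/sh
# Rule-of-thumb ceiling: tokens/s <= memory bandwidth / model size.
# Usage: max_tok_per_s <bandwidth_GB_per_s> <model_size_GB>
max_tok_per_s() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

# Assumed inputs: 5.5 GB model file, theoretical peak bandwidths.
echo "Dual-channel DDR4-3200 (51.2 GB/s): $(max_tok_per_s 51.2 5.5) tok/s ceiling"
echo "Apple M4 (120 GB/s):                $(max_tok_per_s 120 5.5) tok/s ceiling"
```

This prints ceilings of 9.3 and 21.8 tok/s, which is in the same range as the measured numbers in the table.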
Setup with Ollama
Ollama automatically uses CPU if no GPU is detected:
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:4b
That's it. No GPU drivers, no CUDA, no configuration. It just works.
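Ollama also serves a local HTTP API (port 11434 by default), which is handy for scripts. A minimal sketch, assuming the `qwen3.5:4b` tag from above is installed and the Ollama server is running. The helper names `ollama_payload` and `ask_ollama` are made up for this example, and the payload helper does no JSON escaping, so keep prompts free of quotes and backslashes.

```shell
#!/bin/sh
# Build the JSON body for Ollama's /api/generate endpoint.
# Hypothetical helper: it does not escape quotes or backslashes in the prompt.
ollama_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Send the request. 11434 is Ollama's default port, and "stream": false
# asks for a single JSON object instead of a token-by-token stream.
ask_ollama() {
  curl -s http://localhost:11434/api/generate -d "$(ollama_payload "$1" "$2")"
}

# Print the request body; call ask_ollama with the same arguments to send it.
ollama_payload qwen3.5:4b "Explain swap memory in one sentence."
```

With the server running, `ask_ollama qwen3.5:4b "Explain swap memory in one sentence."` returns a JSON object whose `response` field holds the generated text.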
Setup with llama.cpp (more control)
# Build without GPU support (the default CMake build is CPU-only)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Download a model
huggingface-cli download Qwen/Qwen3.5-4B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models
# Run with all CPU threads (binaries are placed in build/bin)
./build/bin/llama-server -m ./models/Qwen3.5-4B-Q4_K_M.gguf \
  --threads $(nproc) --ctx-size 4096 --port 8080
Key flags for CPU performance:
- --threads $(nproc): use all CPU cores
- --ctx-size 4096: smaller context = faster
- --batch-size 512: optimize batch processing
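Note that `nproc` is a GNU coreutils tool and is not available by default on macOS, so the thread flag needs a small change there. One possible sketch of a portable helper (the function name `cpu_threads` is an assumption of this example):

```shell
#!/bin/sh
# Print the number of logical CPUs, trying Linux and macOS tools in turn.
cpu_threads() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                        # GNU coreutils (most Linux systems)
  elif sysctl -n hw.ncpu >/dev/null 2>&1; then
    sysctl -n hw.ncpu            # macOS and the BSDs
  else
    getconf _NPROCESSORS_ONLN    # fallback available on many Unix systems
  fi
}

echo "Detected $(cpu_threads) logical CPUs"
# Pass the value to the server with: --threads "$(cpu_threads)"
```

All three commands count logical CPUs. On processors with hyperthreading it can be worth trying the physical core count as well, since the extra hardware threads often add little for this workload.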
Best models for CPU-only
| RAM available | Best model | Why |
|---|---|---|
| 4GB | Qwen3.5-2B | Best quality at this size |
| 8GB | Qwen3.5-4B | Good balance |
| 16GB | Qwen3.5-9B | Best quality-per-resource |
| 32GB | Qwen3.5-27B (Q4) | Strong but slow (~2-4 tok/s) |
Stick to models where the quantized size fits comfortably in RAM with room to spare. If the model barely fits, it'll use swap and become unusably slow.
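The check itself is simple arithmetic: the model file size plus some headroom for the context cache and the operating system must stay below the available RAM. A sketch, where the 2 GB default headroom is an assumed safety margin rather than a measured requirement, and `fits_in_ram` is a name invented for this example:

```shell
#!/bin/sh
# Succeeds if a model of <model_gb> fits in <ram_gb> with <headroom_gb> spare.
# Usage: fits_in_ram <model_gb> <ram_gb> [headroom_gb]   (headroom defaults to 2)
fits_in_ram() {
  awk -v m="$1" -v r="$2" -v h="${3:-2}" 'BEGIN { exit !(m + h <= r) }'
}

# Example sizes are assumptions; check the real .gguf size with `ls -lh`.
if fits_in_ram 5.5 16; then echo "5.5 GB model on 16 GB: fits"; fi
if ! fits_in_ram 5.5 6; then echo "5.5 GB model on 6 GB: too tight"; fi
```

On Linux, `awk '/MemAvailable/ {print $2/1048576}' /proc/meminfo` prints the currently available memory in GiB, which is a better input for the second argument than the installed total.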
When CPU-only makes sense
Good for:
- Batch processing (speed doesn't matter, just cost)
- Background tasks (summarize documents overnight)
- Simple Q&A and chat (5-8 tok/s is fine for conversation)
- Servers with lots of RAM but no GPU
- Learning and experimentation
Not good for:
- Real-time coding assistance (too slow for autocomplete)
- Interactive applications needing fast responses
- Models larger than 14B (too slow to be useful)
- Concurrent users (CPU saturates quickly)
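For the batch and overnight cases above, a plain shell loop is often enough. The following is a sketch under stated assumptions: documents are `.txt` files small enough to fit in the model's context, and the `qwen3.5:4b` tag is installed. `summarize_dir` and the `LLM_CMD` override are hypothetical names introduced here so the loop can be tried without a model.

```shell
#!/bin/sh
# Summarize every .txt file in a directory, one at a time.
# LLM_CMD is a hypothetical override hook; by default the loop
# calls `ollama run qwen3.5:4b` with the prompt as its argument.
summarize_dir() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue                  # no matching files: nothing to do
    out="$out_dir/$(basename "$f" .txt).summary.txt"
    [ -s "$out" ] && continue                # already summarized: skip on re-run
    # Left unquoted on purpose so the command splits into program + arguments.
    ${LLM_CMD:-ollama run qwen3.5:4b} "Summarize this document: $(cat "$f")" > "$out"
  done
}
```

Running `summarize_dir ./docs ./summaries` processes the files sequentially, and because finished files are skipped, an interrupted run can simply be started again.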
Speed optimization tips
- Use all cores. Set --threads to your CPU core count.
- Use Q4_K_M quantization. Best speed/quality balance for CPU.
- Reduce context size. 2048-4096 is enough for most tasks.
- Close other applications. CPU inference uses all cores, so other apps compete for them.
- Use fast RAM. DDR5 > DDR4. Higher MHz means more memory bandwidth, which is the main bottleneck for inference.
- Consider a used GPU. A $200 RTX 3060 12GB will be 5-10x faster than any CPU.
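To check whether a tip actually helped, measure before and after. Ollama's API responses report `eval_count` (generated tokens) and `eval_duration` (in nanoseconds), so tokens per second is their ratio. A small sketch; the helper name `tok_per_s` and the numbers in the example call are made up:

```shell
#!/bin/sh
# Tokens per second from Ollama's eval_count and eval_duration (nanoseconds).
# Usage: tok_per_s <eval_count> <eval_duration_ns>
tok_per_s() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Made-up values: 120 tokens generated in 20 seconds.
tok_per_s 120 20000000000
```

The example prints 6.0. For a quicker look, `ollama run qwen3.5:4b --verbose` reports an eval rate after each reply, and llama.cpp ships a `llama-bench` tool for comparing settings.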
The honest take
CPU-only inference works, but it's a compromise. If you're generating a few responses per day for personal use, it's fine. If you need AI as part of your daily workflow, invest in a GPU or Apple Silicon Mac. The speed difference is transformative.
The best "no GPU" option in 2026 is actually a Mac Mini M4 ($599). Its unified memory architecture means the GPU and CPU share RAM, so even the base model runs AI faster than most CPU-only desktop setups.
Related
- Best Self-Hosted AI Models in 2026
- Best AI Models Under 4GB RAM
- Best GPU for Running AI Models Locally in 2026
- Ollama vs llama.cpp vs vLLM: Which Should You Use?