You don’t need a GPU to run AI models locally. CPU-only inference is slower, but with the right model and quantization, it’s usable for many tasks. Here’s how to do it and what to expect.
## Realistic speed expectations
CPU inference speed depends on your processor, RAM speed, and model size. Here are real-world numbers:
| CPU | Model | Quantization | Speed |
|---|---|---|---|
| Intel i7-12700 | Qwen3.5-9B | Q4_K_M | ~5-8 tok/s |
| Intel i5-10400 | Qwen3.5-4B | Q4_K_M | ~8-12 tok/s |
| AMD Ryzen 7 5800X | Qwen3.5-9B | Q4_K_M | ~6-10 tok/s |
| Apple M4 (CPU only) | Qwen3.5-9B | Q4_K_M | ~15-20 tok/s |
| Intel i9-9900K | Qwen3.5-9B | Q4_K_M | ~4-6 tok/s |
| Any modern CPU | Qwen3.5-0.8B | Q4_K_M | ~15-25 tok/s |
For comparison, a GPU (RTX 4090) runs the same Qwen3.5-9B at ~45 tok/s. CPU is 5-10x slower, but 5-8 tok/s is still usable — roughly the pace at which most people read, so text appears about as fast as you can follow it.
Apple Silicon is the exception. Even in CPU-only mode, M-series chips are significantly faster than Intel/AMD for AI inference thanks to their high unified memory bandwidth.
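The bandwidth point generalizes into a useful back-of-envelope rule: token generation is typically memory-bound, so throughput is capped near memory bandwidth divided by the model's in-RAM size. A rough sketch — both figures below are illustrative assumptions, not measurements:

```sh
# Rule of thumb: tok/s upper bound ≈ memory bandwidth (GB/s) / model size (GB),
# because every generated token streams the full set of weights through RAM.
bandwidth_gbs=50    # dual-channel DDR4-3200, roughly (assumption)
model_gb=5          # a 9B model quantized to Q4_K_M, roughly (assumption)
est=$(awk -v bw="$bandwidth_gbs" -v sz="$model_gb" 'BEGIN { printf "%.0f", bw / sz }')
echo "~${est} tok/s upper bound"
```

Plug in an M-series chip's ~100-400 GB/s and the same model, and the Apple numbers in the table stop looking surprising.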
## Setup with Ollama
Ollama automatically uses CPU if no GPU is detected:
```sh
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3.5:4b
```
That’s it. No GPU drivers, no CUDA, no configuration. It just works.
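Beyond the interactive REPL, Ollama also runs a local HTTP server (default port 11434) that scripts and other tools can call. A minimal sketch, guarded so it degrades gracefully when the server isn't running — the model name follows the pull above, and the prompt is just a placeholder:

```sh
# Ollama's /api/generate returns a single JSON object when "stream" is false.
if curl -sf http://localhost:11434/api/version > /dev/null; then
    reply=$(curl -s http://localhost:11434/api/generate -d \
      '{"model": "qwen3.5:4b", "prompt": "Why is the sky blue?", "stream": false}')
else
    reply="ollama server not running"
fi
echo "$reply"
```

The same endpoint is what most editor plugins and chat UIs talk to under the hood.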
## Setup with llama.cpp (more control)
```sh
# Build without GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a model
huggingface-cli download Qwen/Qwen3.5-4B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run with all CPU threads
./llama-server -m ./models/Qwen3.5-4B-Q4_K_M.gguf \
  --threads $(nproc) --ctx-size 4096 --port 8080
```
Key flags for CPU performance:
- `--threads $(nproc)` — use all CPU cores
- `--ctx-size 4096` — smaller context = faster
- `--batch-size 512` — optimize batch processing
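Once it's up, `llama-server` exposes an OpenAI-compatible API at `/v1/chat/completions` on the chosen port. A quick smoke test — the payload is validated locally here, and the `curl` line is left commented because it needs the server from above actually running:

```sh
# Minimal OpenAI-style chat payload (content is a placeholder).
payload='{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
ok=$(echo "$payload" | python3 -m json.tool > /dev/null 2>&1 && echo yes)
echo "payload valid: $ok"
# With llama-server running on port 8080:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```

Because the API shape matches OpenAI's, most existing client libraries work by just pointing their base URL at `localhost:8080`.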
## Best models for CPU-only
| RAM available | Best model | Why |
|---|---|---|
| 4GB | Qwen3.5-2B | Best quality at this size |
| 8GB | Qwen3.5-4B | Good balance |
| 16GB | Qwen3.5-9B | Best quality-per-resource |
| 32GB | Qwen3.5-27B (Q4) | Strong but slow (~2-4 tok/s) |
Stick to models where the quantized size fits comfortably in RAM with room to spare. If the model barely fits, it’ll use swap and become unusably slow.
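A rough fit check can be sketched numerically. Q4_K_M works out to roughly 0.6 GB per billion parameters, plus a couple of GB of headroom for the KV cache and the OS — both constants here are rough assumptions, not exact figures:

```sh
# Does a quantized model fit comfortably in RAM?
params_b=9    # model size in billions of parameters (example)
ram_gb=16     # machine RAM (example)
verdict=$(awk -v p="$params_b" -v r="$ram_gb" \
  'BEGIN { need = p * 0.6 + 2; print (need < r ? "fits" : "too tight") }')
echo "a ${params_b}B model at Q4_K_M in ${ram_gb} GB RAM: $verdict"
```

If the verdict is "too tight", drop to the next smaller model rather than fighting swap.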
## When CPU-only makes sense
Good for:
- Batch processing (speed doesn’t matter, just cost)
- Background tasks (summarize documents overnight)
- Simple Q&A and chat (5-8 tok/s is fine for conversation)
- Servers with lots of RAM but no GPU
- Learning and experimentation
Not good for:
- Real-time coding assistance (too slow for autocomplete)
- Interactive applications needing fast responses
- Models larger than 14B (too slow to be useful)
- Concurrent users (CPU saturates quickly)
## Speed optimization tips
- Use all cores. Set `--threads` to your CPU core count.
- Use Q4_K_M quantization. Best speed/quality balance for CPU.
- Reduce context size. 2048-4096 is enough for most tasks.
- Close other applications. CPU inference uses all cores — other apps compete.
- Use fast RAM. DDR5 > DDR4. Higher MHz = faster inference.
- Consider a used GPU. A $200 RTX 3060 12GB will be 5-10x faster than any CPU.
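Two of the tips above, made concrete. Hyperthreading rarely helps llama.cpp, so the physical core count is usually the right starting point for `--threads`; from there, measure rather than guess — llama.cpp ships a `llama-bench` tool that reports tok/s per setting. The binary and model paths in the comment are assumptions from the earlier setup:

```sh
# Count physical cores on Linux (lscpu ships with util-linux);
# fall back to logical cores where lscpu is unavailable.
if command -v lscpu > /dev/null; then
    physical=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)
else
    physical=$(nproc)
fi
echo "threads to try first: $physical"
# Then benchmark around that value with a real build:
# ./llama-bench -m ./models/Qwen3.5-4B-Q4_K_M.gguf -t "$physical" -n 64
```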
## The honest take
CPU-only inference works, but it’s a compromise. If you’re generating a few responses per day for personal use, it’s fine. If you need AI as part of your daily workflow, invest in a GPU or Apple Silicon Mac. The speed difference is transformative.
The best “no GPU” option in 2026 is actually a Mac Mini M4 ($599). Its unified memory architecture means the GPU and CPU share RAM, so even the base model runs AI faster than most CPU-only desktop setups.