GPU inference is faster. But CPU inference is cheaper, simpler, and sometimes good enough. The right choice depends on model size, traffic volume, latency requirements, and budget. This guide breaks down when each makes sense so you avoid overspending on GPU compute you do not need.
The speed difference
The performance gap widens as model size increases:
| Model size | CPU (modern server) | GPU (RTX 4090) | Difference |
|---|---|---|---|
| 3B parameters | 15–20 tok/s | 60–80 tok/s | 4x |
| 7B parameters | 5–10 tok/s | 40–50 tok/s | 5–8x |
| 13B parameters | 2–5 tok/s | 25–35 tok/s | 7–10x |
| 27B parameters | 1–2 tok/s | 20–30 tok/s | 15–20x |
At 3B, CPU inference is fast enough for many use cases. At 27B, it is unusable for interactive applications.
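These numbers follow from a simple fact: decode speed is largely memory-bandwidth-bound, because every generated token streams the full set of quantized weights through memory. A back-of-envelope upper bound is bandwidth divided by weight size. The bandwidth and size figures below are illustrative assumptions, not benchmarks:

```python
# Rough upper bound on decode speed: tok/s ~ memory bandwidth / weight bytes.
# Real throughput is lower due to compute overhead and KV-cache traffic.

def est_tok_per_s(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate of decode tokens per second."""
    weight_gb = params_billion * bytes_per_param  # Q4 ~ 0.5 bytes/param
    return bandwidth_gb_s / weight_gb

# Assumed: dual-channel DDR5 server (~80 GB/s) vs RTX 4090 (~1000 GB/s)
for size in (3, 7, 13, 27):
    cpu = est_tok_per_s(size, 0.5, 80)
    gpu = est_tok_per_s(size, 0.5, 1000)
    print(f"{size}B Q4: CPU ~{cpu:.0f} tok/s, GPU ~{gpu:.0f} tok/s upper bound")
```

The estimate explains why the gap widens with size: a 27B model saturates CPU memory bandwidth long before it saturates a GPU's.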
When CPU is enough
Small models at 7B or below run well on modern CPUs. A server-grade CPU handles Qwen 3.5 4B at 15–20 tokens per second, fast enough for batch processing and simple chat.
Low-volume workloads are a sweet spot. At fewer than 100 requests per day, a $5/month VPS is dramatically cheaper than any GPU option.
Edge and embedded deployments often have no GPU. Raspberry Pi, IoT devices, and laptops without discrete GPUs rely on CPU inference. llama.cpp is optimized for these environments. See our guide on running AI without a GPU for practical setups.
CI/CD pipelines running AI checks in GitHub Actions have no GPU available. CPU is the only option, and for small models it works.
Cost optimization is the final argument. A $50/month CPU server running a 7B model around the clock costs less than equivalent API calls for the same workload.
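The arithmetic behind that claim is a simple break-even calculation. All prices and volumes below are hypothetical placeholders; substitute your provider's real rates:

```python
# Back-of-envelope: dedicated CPU server vs per-token API pricing.
# Every figure here is an assumed placeholder, not a quoted price.

server_cost_per_month = 50.0       # $/month for an always-on CPU box
api_price_per_m_tokens = 5.0       # $ per million tokens (assumed rate)
tokens_per_day = 500_000           # ~6 tok/s sustained, within CPU range for 7B

api_cost_per_month = tokens_per_day * 30 / 1_000_000 * api_price_per_m_tokens
break_even_tokens = server_cost_per_month / api_price_per_m_tokens * 1_000_000

print(f"API: ${api_cost_per_month:.2f}/mo vs server: ${server_cost_per_month:.2f}/mo")
print(f"Server wins above {break_even_tokens/1e6:.0f}M tokens/month")
```

Note the capacity check in the comment: the server only wins if your CPU can actually sustain the token volume you feed it.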
When you need GPU
Interactive applications demand GPU. Coding assistants and chat interfaces need at least 15 tokens per second. Only GPU delivers this for 13B+ models. Check our guide on the best GPUs for local AI for hardware recommendations.
Multi-user serving is where GPU shines. Continuous batching serves 10–100x more concurrent users because the GPU processes multiple requests simultaneously.
Large models above 13B are impractically slow on CPU. If you need a 70B model, GPU is not optional. See our VRAM planning guide to understand memory requirements.
High throughput workloads processing thousands of requests per hour need GPU parallelism.
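The batching advantage comes from amortizing weight reads: one pass through the weights serves every request in the batch, so the marginal cost of an extra user is small. A toy model of the effect, with purely illustrative latency figures:

```python
# Toy model of continuous batching: a decode step's dominant cost (streaming
# weights) is paid once per step, not once per request in the batch.
# Latency numbers are illustrative assumptions, not benchmarks.

step_ms_batch1 = 25.0       # one decode step, single request
step_ms_per_extra = 0.5     # marginal cost of each additional batched request

def throughput_tok_s(batch: int) -> float:
    """Aggregate tokens/second across all requests in the batch."""
    step_ms = step_ms_batch1 + step_ms_per_extra * (batch - 1)
    return batch * 1000.0 / step_ms

for b in (1, 8, 32):
    print(f"batch {b}: ~{throughput_tok_s(b):.0f} tok/s aggregate")
```

With these assumed numbers, aggregate throughput at batch 32 is roughly 20x the single-request figure, which is why GPU serving scales where CPU serving stalls.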
The best CPU inference stack
llama.cpp is the gold standard for CPU inference. It uses AVX2/AVX-512 SIMD instructions, GGUF quantization, and memory-mapped files. It powers Ollama under the hood.
# Pure CPU inference with llama.cpp
./llama-server -m model-q4.gguf -c 4096 --host 0.0.0.0 -ngl 0
The -ngl 0 flag forces pure CPU execution. Q4 quantization cuts memory by 75% with minimal quality loss. For best results, use Q4_K_M quantization and ensure your CPU supports AVX2 at minimum.
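Once llama-server is running, it exposes an OpenAI-compatible HTTP API on port 8080 by default, so any standard client works. A minimal stdlib-only sketch (the prompt and helper names are mine):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """One user turn in OpenAI chat-completions format."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, host: str = "http://127.0.0.1:8080") -> str:
    """Send one chat turn to a running llama-server instance."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the server from the command above to be running:
# print(chat("Summarize: CPU inference trades speed for cost."))
```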
Hybrid: CPU plus GPU
If you have a small GPU with 8 GB VRAM, split the model between GPU and CPU:
# 20 layers on GPU, rest on CPU
./llama-server -m model-q4.gguf -ngl 20
Slower than full GPU but faster than pure CPU. Useful for models slightly larger than your VRAM allows.
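How many layers to offload is easy to estimate: divide usable VRAM by the per-layer weight size. A rough sketch, where the layer count and overhead reserve are assumptions you should check against your model's metadata:

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  overhead_gb: float = 1.5) -> int:
    """Estimate a -ngl value: how many transformer layers fit in VRAM.

    overhead_gb reserves room for KV cache and runtime buffers (assumed figure).
    """
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(usable / per_layer_gb))

# 13B Q4 model (~8 GB weights, 40 layers assumed) on an 8 GB GPU:
print(layers_on_gpu(model_gb=8.0, n_layers=40, vram_gb=8.0))
```

Start from the estimate, then nudge the value down if you hit out-of-memory errors with long contexts.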
Apple Silicon as a middle ground
Apple's M-series chips blur the line between CPU and GPU inference. Unified memory means CPU and GPU share the same pool, eliminating the VRAM bottleneck. An M4 Pro with 48 GB runs a 70B Q4 model entirely in memory.
Performance falls between traditional CPU and discrete GPU. A 7B model runs at 25–35 tokens per second on M4, faster than an x86 CPU but slower than an RTX 4090. For local development, Apple Silicon offers an excellent balance of performance and efficiency.
Decision framework
| Question | CPU | GPU |
|---|---|---|
| Model ≤7B? | ✅ Works well | Overkill |
| Need >15 tok/s on 13B+? | ❌ Too slow | ✅ Required |
| Budget under $50/month? | ✅ Fits easily | ❌ Not possible |
| Serving multiple users? | ⚠️ Limited | ✅ Scales well |
| Edge deployment? | ✅ Only option | Usually unavailable |
| Apple Silicon? | ✅ Great middle ground | N/A |
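The table collapses into a few lines of logic. A sketch encoding its rows (the function name and argument set are mine; thresholds come from the table):

```python
def pick_backend(model_params_b: float, interactive: bool, multi_user: bool,
                 monthly_budget_usd: float, edge: bool) -> str:
    """Mirror the decision table: return 'cpu' or 'gpu' for a workload."""
    if edge:
        return "cpu"                   # usually the only option on edge hardware
    if model_params_b >= 13 and interactive:
        return "gpu"                   # >15 tok/s on 13B+ needs GPU
    if multi_user:
        return "gpu"                   # continuous batching scales with users
    if model_params_b <= 7 and monthly_budget_usd < 50:
        return "cpu"                   # small model, small budget
    return "gpu" if interactive else "cpu"

print(pick_backend(7, False, False, 20, False))   # low-volume 7B batch job
```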
FAQ
Can I run LLMs on CPU?
Yes. Models up to 7B run well on modern CPUs using llama.cpp or Ollama. A server-grade CPU delivers 5–10 tokens per second on a 7B Q4 model, adequate for batch processing, low-volume chat, and background tasks. Above 13B, CPU becomes impractically slow for interactive use.
How much faster is GPU than CPU?
GPU is 4–20x faster depending on model size. For 3B models the gap is about 4x; for 27B models it widens to 15–20x. Larger models benefit more from GPU parallelism and high memory bandwidth.
Which GPU is best for LLMs?
For local use, the NVIDIA RTX 4090 with 24 GB of VRAM offers the best performance per dollar. It runs 13B models at full speed and 27B models with quantization. For production, the A100 (80 GB) and H100 are standard. On a budget, a used RTX 3090 (24 GB) still performs well.
Is Apple Silicon good for LLMs?
Yes. Unified memory lets you run models that would require expensive GPUs on other platforms. An M4 Pro with 48 GB handles 70B quantized models. Performance is 25–35 tokens per second on 7B models, between CPU and discrete GPU speeds. The main limitation is the lack of support from production serving frameworks like vLLM.