
When to Use CPU vs GPU for LLM Inference


GPU inference is faster. But CPU inference is cheaper, simpler, and sometimes good enough. The right choice depends on model size, traffic volume, latency requirements, and budget. This guide breaks down when each makes sense so you avoid overspending on GPU compute you do not need.

The speed difference

The performance gap widens as model size increases:

| Model size | CPU (modern server) | GPU (RTX 4090) | Difference |
|---|---|---|---|
| 3B parameters | 15–20 tok/s | 60–80 tok/s | 4x |
| 7B parameters | 5–10 tok/s | 40–50 tok/s | 5–8x |
| 13B parameters | 2–5 tok/s | 25–35 tok/s | 7–10x |
| 27B parameters | 1–2 tok/s | 20–30 tok/s | 15–20x |

At 3B, CPU inference is fast enough for many use cases. At 27B, it is unusable for interactive applications.
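To make those numbers concrete, a quick back-of-envelope check (illustrative integer math, using decode speeds from the table above) shows how tokens per second translates into wait time for a typical 200-token reply:

```shell
# Rough feel for responsiveness: time to produce a 200-token reply
# at a few decode speeds from the table above (integer math, illustrative)
for tps in 2 15 40; do
  echo "${tps} tok/s -> ~$(( 200 / tps ))s for a 200-token reply"
done
```

At 2 tok/s a single reply takes over a minute and a half, which is why CPU inference on 13B+ models is fine for background jobs but painful for anything interactive.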

When CPU is enough

Small models at 7B or below run well on modern CPUs. A server-grade CPU handles a 4B model like Qwen3 at 15–20 tokens per second β€” fast enough for batch processing and simple chat.

Low-volume workloads are a sweet spot. At fewer than 100 requests per day, a $5/month VPS is dramatically cheaper than any GPU option.

Edge and embedded deployments often have no GPU. Raspberry Pi, IoT devices, and laptops without discrete GPUs rely on CPU inference. llama.cpp is optimized for these environments. See our guide on running AI without a GPU for practical setups.

CI/CD pipelines running AI checks in GitHub Actions have no GPU available. CPU is the only option, and for small models it works.

Cost optimization is the final argument. A $50/month CPU server running a 7B model around the clock costs less than equivalent API calls for the same workload.

When you need GPU

Interactive applications demand GPU. Coding assistants and chat interfaces need at least 15 tokens per second. Only GPU delivers this for 13B+ models. Check our guide on the best GPUs for local AI for hardware recommendations.

Multi-user serving is where GPU shines. Continuous batching serves 10–100x more concurrent users because the GPU processes multiple requests simultaneously.

Large models above 13B are impractically slow on CPU. If you need a 70B model, GPU is not optional. See our VRAM planning guide to understand memory requirements.

High throughput workloads processing thousands of requests per hour need GPU parallelism.

The best CPU inference stack

llama.cpp is the gold standard for CPU inference. It uses AVX2/AVX-512 SIMD instructions, GGUF quantization, and memory-mapped files. It powers Ollama under the hood.

```shell
# Pure CPU inference with llama.cpp
./llama-server -m model-q4.gguf -c 4096 --host 0.0.0.0 -ngl 0
```

The `-ngl 0` flag forces pure CPU execution. Q4 quantization cuts memory by 75% with minimal quality loss. For best results, use `Q4_K_M` quantization and ensure your CPU supports AVX2 at minimum.
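The 75% memory saving makes sizing easy to estimate. As a rough rule of thumb (an assumption, not an exact figure for any specific GGUF file: Q4 weights take about half a byte per parameter, plus roughly 1 GB for KV cache and buffers), you can sketch RAM needs with shell arithmetic:

```shell
# Rough RAM estimate for a Q4 model: ~0.5 bytes/param + ~1 GB overhead
# (rule of thumb; actual GGUF file sizes vary by quant variant)
params_b=7                          # model size in billions of parameters
ram_gb=$(( params_b / 2 + 1 ))      # integer math: 7/2 = 3, plus ~1 GB overhead
echo "a ${params_b}B Q4 model needs roughly ${ram_gb} GB of RAM"
```

That puts a 7B Q4 model around 4 GB β€” comfortably inside an 8 GB VPS, which is what makes the cheap-CPU-server math work.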

Hybrid: CPU plus GPU

If you have a small GPU with 8 GB VRAM, split the model between GPU and CPU:

```shell
# 20 layers on GPU, rest on CPU
./llama-server -m model-q4.gguf -ngl 20
```

Slower than full GPU but faster than pure CPU. Useful for models slightly larger than your VRAM allows.
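Picking a value for `-ngl` is mostly a memory calculation. A hypothetical sizing sketch (all numbers assumed for illustration: a 13B Q4 model with 40 layers and ~8 GB of weights, keeping ~1.5 GB of the 8 GB VRAM free for KV cache and buffers):

```shell
# Hypothetical -ngl sizing for a 13B Q4 model on an 8 GB GPU
total_layers=40       # assumed layer count
model_mb=8000         # assumed total weight size in MB
vram_budget_mb=6500   # 8 GB minus headroom for KV cache and buffers
per_layer_mb=$(( model_mb / total_layers ))
ngl=$(( vram_budget_mb / per_layer_mb ))
[ "$ngl" -gt "$total_layers" ] && ngl=$total_layers   # cap at full offload
echo "try -ngl ${ngl} (about ${per_layer_mb} MB per layer)"
```

In practice you would start from an estimate like this, watch VRAM usage, and nudge `-ngl` up or down until the model loads without out-of-memory errors.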

Apple Silicon as a middle ground

Apple’s M-series chips blur the line between CPU and GPU inference. Unified memory means CPU and GPU share the same pool, eliminating the VRAM bottleneck. An M4 Pro with 48 GB runs a 70B Q4 model entirely in memory.

Performance falls between traditional CPU and discrete GPU. A 7B model runs at 25–35 tokens per second on M4 β€” faster than x86 CPU, slower than RTX 4090. For local development, Apple Silicon offers an excellent balance of performance and efficiency.
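The middle-ground framing follows from memory bandwidth: token generation reads every weight once per token, so decode speed is bounded by bandwidth divided by model size. A back-of-envelope check with assumed figures (M4 Pro at ~273 GB/s, a 7B Q4 model at ~4 GB):

```shell
# Decode speed upper bound ~ memory bandwidth / model size
# (assumed figures; real throughput is lower due to compute and overhead)
bw_gbs=273     # M4 Pro unified memory bandwidth in GB/s
model_gb=4     # 7B model at Q4 quantization
echo "upper bound: ~$(( bw_gbs / model_gb )) tok/s"
```

The observed 25–35 tok/s sits well under that ceiling, which is expected: the bound ignores compute time and cache overhead, but it explains why Apple Silicon lands between x86 CPUs (~100 GB/s) and an RTX 4090 (~1000 GB/s).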

Decision framework

| Question | CPU | GPU |
|---|---|---|
| Model ≀7B? | βœ… Works well | Overkill |
| Need >15 tok/s on 13B+? | ❌ Too slow | βœ… Required |
| Budget under $50/month? | βœ… Fits easily | ❌ Not possible |
| Serving multiple users? | ❌ Limited | βœ… Scales well |
| Edge deployment? | βœ… Only option | Usually unavailable |
| Apple Silicon? | βœ… Great middle ground | N/A |

FAQ

Can I run LLMs on CPU?

Yes. Models up to 7B run well on modern CPUs using llama.cpp or Ollama. A server-grade CPU delivers 5–10 tokens per second on a 7B Q4 model β€” adequate for batch processing, low-volume chat, and background tasks. Above 13B, CPU becomes impractically slow for interactive use.

How much faster is GPU than CPU?

GPU is 4–20x faster depending on model size. For 3B models the gap is about 4x. For 27B models it widens to 15–20x. Larger models benefit more from GPU parallelism and high memory bandwidth.

Which GPU is best for LLMs?

For local use, the NVIDIA RTX 4090 with 24 GB VRAM offers the best performance per dollar. It runs 13B models at full speed and 27B with quantization. For production, the A100 (80 GB) and H100 are standard. On a budget, a used RTX 3090 (24 GB) still performs well.

Is Apple Silicon good for LLMs?

Yes. Unified memory lets you run models that would require expensive GPUs on other platforms. An M4 Pro with 48 GB handles 70B quantized models. Performance is 25–35 tokens per second on 7B β€” between x86 CPU and discrete GPU. The main limitation is the lack of support from production serving frameworks like vLLM.