GPU inference is faster. But CPU inference is cheaper, simpler, and sometimes good enough. The right choice depends on model size, traffic volume, latency requirements, and budget. This guide breaks down when each makes sense so you avoid overspending on GPU compute you do not need.
The speed difference
The performance gap widens as model size increases:
| Model size | CPU (modern server) | GPU (RTX 4090) | Difference |
|---|---|---|---|
| 3B parameters | 15–20 tok/s | 60–80 tok/s | 4x |
| 7B parameters | 5–10 tok/s | 40–50 tok/s | 5–8x |
| 13B parameters | 2–5 tok/s | 25–35 tok/s | 7–10x |
| 27B parameters | 1–2 tok/s | 20–30 tok/s | 15–20x |
At 3B, CPU inference is fast enough for many use cases. At 27B, it is unusable for interactive applications.
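These numbers follow from a simple fact: decode speed is largely memory-bandwidth-bound, because every generated token streams the full set of quantized weights through memory. A back-of-envelope upper bound is bandwidth divided by weight size. The bandwidth and size figures below are illustrative assumptions, not benchmarks:

```python
# Rough upper bound on decode speed: tok/s ~ memory bandwidth / weight bytes.
# Real throughput is lower due to compute overhead and KV-cache traffic.

def est_tok_per_s(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate of decode tokens per second."""
    weight_gb = params_billion * bytes_per_param  # Q4 ~ 0.5 bytes/param
    return bandwidth_gb_s / weight_gb

# Assumed: dual-channel DDR5 server (~80 GB/s) vs RTX 4090 (~1000 GB/s)
for size in (3, 7, 13, 27):
    cpu = est_tok_per_s(size, 0.5, 80)
    gpu = est_tok_per_s(size, 0.5, 1000)
    print(f"{size}B Q4: CPU ~{cpu:.0f} tok/s, GPU ~{gpu:.0f} tok/s upper bound")
```

The estimate explains why the gap widens with size: a 27B model saturates CPU memory bandwidth long before it saturates a GPU's.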
When CPU is enough
Small models at 7B or below run well on modern CPUs. A server-grade CPU handles Qwen 3.5 4B at 15–20 tokens per second, fast enough for batch processing and simple chat.
Low-volume workloads are a sweet spot. At fewer than 100 requests per day, a $5/month VPS is dramatically cheaper than any GPU option.
Edge and embedded deployments often have no GPU. Raspberry Pi, IoT devices, and laptops without discrete GPUs rely on CPU inference. llama.cpp is optimized for these environments. See our guide on running AI without a GPU for practical setups.
CI/CD pipelines running AI checks in GitHub Actions have no GPU available. CPU is the only option, and for small models it works.
Cost optimization is the final argument. A $50/month CPU server running a 7B model around the clock costs less than equivalent API calls for the same workload.
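The arithmetic behind that claim is a simple break-even calculation. All prices and volumes below are hypothetical placeholders; substitute your provider's real rates:

```python
# Back-of-envelope: dedicated CPU server vs per-token API pricing.
# Every figure here is an assumed placeholder, not a quoted price.

server_cost_per_month = 50.0       # $/month for an always-on CPU box
api_price_per_m_tokens = 5.0       # $ per million tokens (assumed rate)
tokens_per_day = 500_000           # ~6 tok/s sustained, within CPU range for 7B

api_cost_per_month = tokens_per_day * 30 / 1_000_000 * api_price_per_m_tokens
break_even_tokens = server_cost_per_month / api_price_per_m_tokens * 1_000_000

print(f"API: ${api_cost_per_month:.2f}/mo vs server: ${server_cost_per_month:.2f}/mo")
print(f"Server wins above {break_even_tokens/1e6:.0f}M tokens/month")
```

Note the capacity check in the comment: the server only wins if your CPU can actually sustain the token volume you feed it.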
When you need GPU
Interactive applications demand GPU. Coding assistants and chat interfaces need at least 15 tokens per second. Only GPU delivers this for 13B+ models. Check our guide on the best GPUs for local AI for hardware recommendations.
Multi-user serving is where GPU shines. Continuous batching serves 10–100x more concurrent users because the GPU processes multiple requests simultaneously.
Large models above 13B are impractically slow on CPU. If you need a 70B model, GPU is not optional. See our VRAM planning guide to understand memory requirements.
High throughput workloads processing thousands of requests per hour need GPU parallelism.
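The batching advantage comes from amortizing weight reads: one pass through the weights serves every request in the batch, so the marginal cost of an extra user is small. A toy model of the effect, with purely illustrative latency figures:

```python
# Toy model of continuous batching: a decode step's dominant cost (streaming
# weights) is paid once per step, not once per request in the batch.
# Latency numbers are illustrative assumptions, not benchmarks.

step_ms_batch1 = 25.0       # one decode step, single request
step_ms_per_extra = 0.5     # marginal cost of each additional batched request

def throughput_tok_s(batch: int) -> float:
    """Aggregate tokens/second across all requests in the batch."""
    step_ms = step_ms_batch1 + step_ms_per_extra * (batch - 1)
    return batch * 1000.0 / step_ms

for b in (1, 8, 32):
    print(f"batch {b}: ~{throughput_tok_s(b):.0f} tok/s aggregate")
```

With these assumed numbers, aggregate throughput at batch 32 is roughly 20x the single-request figure, which is why GPU serving scales where CPU serving stalls.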
The best CPU inference stack
llama.cpp is the gold standard for CPU inference. It uses AVX2/AVX-512 SIMD instructions, GGUF quantization, and memory-mapped files. It powers Ollama under the hood.
# Pure CPU inference with llama.cpp
./llama-server -m model-q4.gguf -c 4096 --host 0.0.0.0 -ngl 0
The -ngl 0 flag forces pure CPU execution. Q4 quantization cuts memory by 75% with minimal quality loss. For best results, use Q4_K_M quantization and ensure your CPU supports AVX2 at minimum.
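Once llama-server is running, it exposes an OpenAI-compatible HTTP API on port 8080 by default, so any standard client works. A minimal stdlib-only sketch (the prompt and helper names are mine):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """One user turn in OpenAI chat-completions format."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, host: str = "http://127.0.0.1:8080") -> str:
    """Send one chat turn to a running llama-server instance."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the server from the command above to be running:
# print(chat("Summarize: CPU inference trades speed for cost."))
```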
Hybrid: CPU plus GPU
If you have a small GPU with 8 GB VRAM, split the model between GPU and CPU:
# 20 layers on GPU, rest on CPU
./llama-server -m model-q4.gguf -ngl 20
Slower than full GPU but faster than pure CPU. Useful for models slightly larger than your VRAM allows.
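How many layers to offload is easy to estimate: divide usable VRAM by the per-layer weight size. A rough sketch, where the layer count and overhead reserve are assumptions you should check against your model's metadata:

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  overhead_gb: float = 1.5) -> int:
    """Estimate a -ngl value: how many transformer layers fit in VRAM.

    overhead_gb reserves room for KV cache and runtime buffers (assumed figure).
    """
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(usable / per_layer_gb))

# 13B Q4 model (~8 GB weights, 40 layers assumed) on an 8 GB GPU:
print(layers_on_gpu(model_gb=8.0, n_layers=40, vram_gb=8.0))
```

Start from the estimate, then nudge the value down if you hit out-of-memory errors with long contexts.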
Apple Silicon as a middle ground
Apple's M-series chips blur the line between CPU and GPU inference. Unified memory means CPU and GPU share the same pool, eliminating the VRAM bottleneck. An M4 Pro with 48 GB runs a 70B Q4 model entirely in memory.
Performance falls between traditional CPU and discrete GPU. A 7B model runs at 25–35 tokens per second on M4, faster than an x86 CPU but slower than an RTX 4090. For local development, Apple Silicon offers an excellent balance of performance and efficiency.
Decision framework
| Question | CPU | GPU |
|---|---|---|
| Model ≤7B? | ✅ Works well | Overkill |
| Need >15 tok/s on 13B+? | ❌ Too slow | ✅ Required |
| Budget under $50/month? | ✅ Fits easily | ❌ Not possible |
| Serving multiple users? | ⚠️ Limited | ✅ Scales well |
| Edge deployment? | ✅ Only option | Usually unavailable |
| Apple Silicon? | ✅ Great middle ground | N/A |
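The table collapses into a few lines of logic. A sketch encoding its rows (the function name and argument set are mine; thresholds come from the table):

```python
def pick_backend(model_params_b: float, interactive: bool, multi_user: bool,
                 monthly_budget_usd: float, edge: bool) -> str:
    """Mirror the decision table: return 'cpu' or 'gpu' for a workload."""
    if edge:
        return "cpu"                   # usually the only option on edge hardware
    if model_params_b >= 13 and interactive:
        return "gpu"                   # >15 tok/s on 13B+ needs GPU
    if multi_user:
        return "gpu"                   # continuous batching scales with users
    if model_params_b <= 7 and monthly_budget_usd < 50:
        return "cpu"                   # small model, small budget
    return "gpu" if interactive else "cpu"

print(pick_backend(7, False, False, 20, False))   # low-volume 7B batch job
```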
FAQ
Can I run LLMs on CPU?
Yes. Models up to 7B run well on modern CPUs using llama.cpp or Ollama. A server-grade CPU delivers 5–10 tokens per second on a 7B Q4 model, adequate for batch processing, low-volume chat, and background tasks. Above 13B, CPU becomes impractically slow for interactive use.
How much faster is GPU than CPU?
GPU is 4–20x faster depending on model size. For 3B models the gap is about 4x; for 27B models it widens to 15–20x. Larger models benefit more from GPU parallelism and high memory bandwidth.
Which GPU is best for LLMs?
For local use, the NVIDIA RTX 4090 with 24 GB of VRAM offers the best performance per dollar. It runs 13B models at full speed and 27B models with quantization. For production, the A100 (80 GB) and H100 are standard. On a budget, a used RTX 3090 (24 GB) still performs well.
Is Apple Silicon good for LLMs?
Yes. Unified memory lets you run models that would require expensive GPUs on other platforms. An M4 Pro with 48 GB handles 70B quantized models. Performance is 25–35 tokens per second on 7B models, between CPU and discrete GPU speeds. The main limitation is the lack of support from production serving frameworks like vLLM.