
Run AI on a Raspberry Pi — Yes, It Actually Works (2026)


Can you run a large language model on an $80 single-board computer the size of a credit card? Yes. Should you expect ChatGPT-level speed? Absolutely not. But if you calibrate your expectations correctly, running AI on a Raspberry Pi is genuinely useful — and honestly kind of magical.

Let’s walk through exactly how to do it, which models actually fit, and what you can realistically build with it.

What You Need (Hardware)

The Raspberry Pi 5 with 8GB RAM is the sweet spot here. You can try a 4GB model, but you’ll be limited to the tiniest models and swapping to disk constantly. Not fun.

Here’s the recommended setup:

  • Raspberry Pi 5 (8GB RAM) — the only Pi worth doing this on
  • A good microSD card (64GB+, A2 rated) — or better yet, an NVMe SSD via the Pi 5’s PCIe slot. Model loading from SD is painfully slow
  • Active cooling — a fan or heatsink case. Inference will push the CPU hard, and thermal throttling kills performance (there's a quick way to check for throttling below)
  • A decent power supply (27W USB-C) — the Pi 5 is hungrier than its predecessors

Optional but nice: an NVMe HAT with a 256GB SSD. Model files are large, and loading them from NVMe vs. microSD is a night-and-day difference.
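
To check whether your cooling is actually keeping up, Raspberry Pi OS ships vcgencmd, which reports the SoC temperature and whether the firmware has throttled:

vcgencmd measure_temp
vcgencmd get_throttled   # 0x0 means no under-voltage or throttling has occurred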

Installing Ollama on the Pi

Ollama has had solid ARM64 support for a while now, and installation on Raspberry Pi OS (64-bit) is a one-liner:

curl -fsSL https://ollama.com/install.sh | sh

That’s it. Seriously. It detects the ARM architecture and installs the right binary.

Verify it’s running:

ollama --version

Start the service if it didn’t auto-start:

sudo systemctl start ollama

Now pull a small model to test:

ollama pull qwen2.5:0.5b

And run it:

ollama run qwen2.5:0.5b

If you see a chat prompt, congratulations — you’re running an LLM on a Raspberry Pi.
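
You can also pass a prompt directly instead of opening the interactive chat, which is handy for quick checks over SSH:

ollama run qwen2.5:0.5b "Explain what a Raspberry Pi is in one sentence."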

Best Models That Actually Fit

This is the critical part. On 8GB of RAM, you need models under ~4GB in size. Here’s what works:

Model             | Size (Q4) | Speed (Pi 5 8GB) | Quality | Best For
------------------|-----------|------------------|---------|---------------------------------
Qwen2.5 0.5B      | ~0.4 GB   | ~15 tok/s        | Basic   | Fast responses, simple tasks
TinyLlama 1.1B    | ~0.6 GB   | ~10 tok/s        | Decent  | Chat, learning, experimentation
Qwen2.5 1.5B      | ~1.0 GB   | ~8 tok/s         | Good    | General assistant, summarization
Gemma 2B          | ~1.5 GB   | ~5 tok/s         | Good    | Instruction following, Q&A
Phi-3 Mini (3.8B) | ~2.3 GB   | ~3 tok/s         | Great   | Best quality that fits
Qwen2.5 3B        | ~2.0 GB   | ~3-4 tok/s       | Great   | Coding help, reasoning

Speeds are approximate with Q4_K_M quantization. Your mileage will vary.
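
To benchmark your own setup, add the --verbose flag and Ollama prints timing stats after the response, including the eval rate in tokens per second:

ollama run qwen2.5:1.5b --verbose "Write one sentence about single-board computers."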

The sweet spot? Qwen2.5 1.5B gives you surprisingly good responses at a readable speed. Phi-3 Mini is the best quality you can squeeze in, but at 3 tokens per second you’ll be waiting a bit.

Anything above 4B parameters? Don’t bother. It’ll either not fit or swap so aggressively that you’ll get one token every few seconds.

Realistic Performance Expectations

Let’s be brutally honest here. A Raspberry Pi 5 has a quad-core Cortex-A76 running at 2.4GHz. There’s no GPU acceleration for inference — this is pure CPU work.

What that means in practice:

  • Short answers (a sentence or two): totally fine, feels responsive
  • A paragraph of text: takes 10-30 seconds depending on the model
  • Long-form generation: go make coffee. Seriously
  • First token latency: 2-5 seconds for smaller models, up to 15 seconds for Phi-3 Mini

This is not a speed demon. But it’s a private, offline, always-on AI running on hardware that costs less than one month of a cloud API subscription. That trade-off is worth it for many use cases.

For more on running AI without a GPU, we’ve got a deeper dive on CPU inference optimization.

What Can You Actually Do With It?

Here’s where it gets fun. A Pi running a small LLM is surprisingly practical for:

🏠 Home Assistant Integration
Hook it up to Home Assistant as a local AI for natural language commands. “Turn off the living room lights and set the thermostat to 68” — parsed locally, no cloud needed.

🔒 Private Chatbot
A personal assistant that never sends your data anywhere. Journal prompts, brainstorming, writing help — all on your local network.
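
One caveat on the local-network part: Ollama binds to localhost by default. If you want other devices on your LAN to reach it, one option (on a trusted network only) is to point the OLLAMA_HOST environment variable at all interfaces, for example:

# stop the packaged service so the port is free, then listen on all interfaces
sudo systemctl stop ollama
OLLAMA_HOST=0.0.0.0:11434 ollama serve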

📡 IoT Edge Inference
Classify sensor data, generate alerts, or summarize logs at the edge. Perfect for situations where you can’t (or don’t want to) phone home to an API.
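
As a rough sketch of the log-summarization idea (the log source and model here are just examples), you can pipe text into ollama run and it becomes part of the prompt:

# summarize recent system logs locally, nothing leaves the Pi
journalctl -n 200 --no-pager | ollama run qwen2.5:1.5b "Summarize any errors or warnings in this log:"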

📚 Learning Tool
Honestly one of the best uses. Want to understand how LLMs work? Having one running on your desk that you can poke, prod, and experiment with is invaluable.

🧪 API Prototyping
Ollama exposes a REST API. Build your app against localhost:11434 on the Pi, then swap in a bigger model on beefier hardware later. Same API, different backend.
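
A minimal call to the generate endpoint looks like this (use whatever model you've pulled, and the Pi's IP instead of localhost when calling from another machine):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Give me three weekend project ideas for a Raspberry Pi.",
  "stream": false
}'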

Tips for Getting the Most Out of It

  1. Use quantized models (Q4_K_M or Q4_K_S). The quality difference from full precision is minimal, and the size/speed savings are massive.

  2. Boot from NVMe, not microSD. Model loading time drops from “go make a sandwich” to “take a sip of coffee.”

  3. Keep the Pi cool. Sustained inference will thermal throttle without active cooling. A $10 fan case solves this completely.

  4. Set num_ctx lower. The default context window eats RAM. For a Pi, try num_ctx 1024 or 2048 instead of the default. Set it inside an ollama run session with /set:

    ollama run qwen2.5:1.5b
    >>> /set parameter num_ctx 2048

  5. Run it headless. Don’t waste RAM on a desktop environment. SSH in or access Ollama’s API over the network.

  6. Run Ollama on boot. The install script sets it up as a systemd service; enable it so the API is always ready when you need it (see the snippet after this list).

  7. Try the smallest model first. Start with Qwen2.5 0.5B to verify everything works, then scale up.
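
For tip 6: the install script already registers a systemd unit (the same one you started earlier), so making sure it comes up on every boot is a single command:

sudo systemctl enable --now ollama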

The Bottom Line

A Raspberry Pi 5 running Ollama won’t replace your M-series Mac or your cloud GPU instance. But it will give you a private, always-on, zero-cost-per-query AI that fits in your palm. For home automation, learning, prototyping, and privacy-first use cases, it’s genuinely hard to beat at this price point.

It’s also the cheapest way to run AI locally if you already have a Pi sitting in a drawer — and let’s be honest, most of us do.

Now go dig that Pi out of the drawer and give it a real job.