
How to Run Gemma 4 Locally — Complete Setup Guide (2026)


Google’s Gemma 4 family includes four models that run on everything from a Raspberry Pi to a single GPU. This guide covers three ways to run them locally: Ollama (easiest), llama.cpp (most control), and vLLM (best for serving).

Before you start: pick your model

| Model | Min RAM (Q4) | Min VRAM | Speed | Quality |
|---|---|---|---|---|
| E2B | 2 GB | 2 GB | ⚡⚡⚡⚡⚡ | ⭐⭐ |
| E4B | 4 GB | 4 GB | ⚡⚡⚡⚡ | ⭐⭐⭐ |
| 26B (MoE) | 8 GB | 8 GB | ⚡⚡⚡ | ⭐⭐⭐⭐ |
| 31B (Dense) | 16 GB | 24 GB | ⚡⚡ | ⭐⭐⭐⭐⭐ |

Most people should start with the 26B MoE model. It activates only 3.8B parameters per token, so it runs at near-8B speeds while delivering near-30B quality. If you have a laptop with 8 GB of RAM, you can run it.
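The speed difference comes down to how many weights get read per generated token: decoding is memory-bound, so throughput is roughly memory bandwidth divided by active bytes per token. A back-of-envelope sketch (the 100 GB/s laptop bandwidth and the purely memory-bound model are simplifying assumptions):

```python
# Back-of-envelope decode speed: generation is memory-bound, so tokens/sec
# is roughly memory bandwidth divided by bytes read per token, and a model
# reads its *active* parameters once per generated token.

def tokens_per_sec(active_params_b: float, bytes_per_weight: float,
                   bandwidth_gb_s: float) -> float:
    """Approximate decode throughput, ignoring KV cache and overhead."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 100.0  # GB/s: a typical laptop's memory bandwidth (assumption)

moe = tokens_per_sec(3.8, 0.56, BW)     # 26B MoE: 3.8B active, ~4.5 bits/weight at Q4
dense = tokens_per_sec(31.0, 0.56, BW)  # 31B dense reads every weight

print(f"26B MoE:   ~{moe:.0f} tok/s")
print(f"31B dense: ~{dense:.0f} tok/s")
```

The roughly 8x gap between the two estimates is why the MoE model feels like a small model despite its 26B total parameters.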

If you’re on constrained hardware (Raspberry Pi, old laptop), the E2B model at Q4 quantization fits in 2 GB. See our guide on running AI on a Raspberry Pi for more on that setup.

Method 1: Ollama (easiest)

Ollama is the fastest way to get Gemma 4 running: one command to install, one command to run.

Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download from ollama.com

Run Gemma 4

# 26B MoE — best balance of speed and quality
ollama run gemma4:26b

# Edge model — fastest, runs on anything
ollama run gemma4:e2b

# Dense model — highest quality
ollama run gemma4:31b

# With specific quantization
ollama run gemma4:26b-q4_K_M

Use as an API

Ollama exposes an OpenAI-compatible API on port 11434:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Explain MoE architecture in 3 sentences"}]
  }'

This works with any tool that supports the OpenAI API format — including Continue.dev for VS Code integration and most AI coding tools.
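If you would rather call the endpoint from Python than curl, here is a minimal stdlib-only sketch (no openai package needed; the model name and port match the examples above):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, model: str = "gemma4:26b",
         base_url: str = "http://localhost:11434") -> str:
    """POST to Ollama's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires Ollama running locally:
# print(chat("Explain MoE architecture in 3 sentences"))
```

Swap `base_url` for your llama.cpp or vLLM server address and the same code works unchanged, since all three expose the same API shape.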

Method 2: llama.cpp (most control)

If you need fine-grained control over quantization, context length, or batch size, llama.cpp gives you direct access.

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# With CUDA support (NVIDIA GPUs)
make -j$(nproc) GGML_CUDA=1

# With Metal support (Apple Silicon; note macOS has no nproc,
# so use sysctl -n hw.ncpu for the job count)
make -j$(sysctl -n hw.ncpu) GGML_METAL=1

Download the model

# Download a GGUF quantized model from Hugging Face into the current directory
# Q4_K_M is the best balance of size and quality
huggingface-cli download google/gemma-4-26b-GGUF gemma-4-26b-Q4_K_M.gguf --local-dir .

Run inference

./llama-cli \
  -m gemma-4-26b-Q4_K_M.gguf \
  -c 8192 \
  -n 512 \
  --temp 0.7 \
  -p "Write a Python function that validates email addresses"

Run as a server

./llama-server \
  -m gemma-4-26b-Q4_K_M.gguf \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

This exposes an OpenAI-compatible API, just like Ollama but with more configuration options.

Method 3: vLLM (best for serving)

For production serving with high throughput, vLLM handles batching and memory management automatically.

pip install vllm

# Serve the model
vllm serve google/gemma-4-26b \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

vLLM requires a GPU with enough VRAM for the full model. For the 26B model at FP16, that's about 52 GB for the weights alone, which means an 80 GB A100 or multiple consumer GPUs with tensor parallelism. For quantized inference on smaller hardware, use llama.cpp instead.
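The VRAM requirement is easy to sanity-check: FP16 stores two bytes per parameter, so weights-only memory is simply twice the parameter count in billions (KV cache and activations come on top of this):

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Weights-only memory at FP16: 2 bytes per parameter (KV cache extra)."""
    return params_billion * 2.0

print(fp16_weight_gb(26))  # 52.0 -> GB of weights for the 26B model
```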

Quantization options

Quantization reduces model size and memory usage at the cost of some quality. Here’s how the 26B model performs at different quantization levels:

| Quantization | Size | RAM needed | Quality loss | Speed |
|---|---|---|---|---|
| FP16 | 52 GB | 52 GB | None | Baseline |
| Q8_0 | 26 GB | 28 GB | Minimal | ~Same |
| Q5_K_M | 18 GB | 20 GB | Very small | Faster |
| Q4_K_M | 14 GB | 16 GB | Small | Faster |
| Q3_K_M | 11 GB | 13 GB | Noticeable | Fastest |
| Q2_K | 8 GB | 10 GB | Significant | Fastest |

Q4_K_M is the sweet spot for most users. The quality loss is barely noticeable in conversation and coding tasks, while memory usage drops by over 70% compared to FP16.

If you’re running without a GPU, Q4_K_M on CPU is still usable — expect 5-15 tokens per second on a modern laptop.
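As a rule of thumb, GGUF file size is parameter count times effective bits per weight, divided by 8. K-quants mix precisions internally, so the bits-per-weight figures below are approximations rather than exact spec values:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size from effective bits per weight."""
    return params_billion * bits_per_weight / 8

# 26B model at common quant levels (effective bits/weight are approximate)
print(f"Q8_0   ~{gguf_size_gb(26, 8.5):.1f} GB")
print(f"Q4_K_M ~{gguf_size_gb(26, 4.5):.1f} GB")
print(f"Q2_K   ~{gguf_size_gb(26, 2.6):.1f} GB")
```

The estimates land within a gigabyte or two of the table above; the exact file size depends on which tensors each quant level keeps at higher precision.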

Performance tips

Use GPU offloading

If you have a GPU but not enough VRAM for the full model, offload some layers to GPU and keep the rest in RAM:

# llama.cpp: offload 20 layers to GPU
./llama-cli -m gemma-4-26b-Q4_K_M.gguf -ngl 20 -p "Hello"

# Ollama: a Modelfile needs a FROM line; create one and build a variant
printf 'FROM gemma4:26b\nPARAMETER num_gpu 20\n' > Modelfile
ollama create gemma4-gpu20 -f Modelfile
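To pick a layer count, divide your free VRAM by the per-layer share of the model file, leaving headroom for the KV cache. A sketch with illustrative numbers (the 48-layer count is a made-up assumption, not a published spec):

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """How many transformer layers fit in VRAM, reserving room for KV cache."""
    per_layer_gb = model_gb / n_layers
    fit = int((free_vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# e.g. a 14 GB Q4_K_M file, a hypothetical 48 layers, and an 8 GB GPU
print(layers_that_fit(14.0, 48, 8.0))  # -> 22, a reasonable -ngl starting point
```

Start there and nudge the number up until you hit out-of-memory errors, then back off by a couple of layers.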

Adjust context length

The default 256K context window uses a lot of memory. If you don’t need it, reduce it:

# llama.cpp: use 4K context (saves ~2GB RAM)
./llama-cli -m gemma-4-26b-Q4_K_M.gguf -c 4096

# Ollama: set it in a Modelfile (needs a FROM line) and build a variant
printf 'FROM gemma4:26b\nPARAMETER num_ctx 4096\n' > Modelfile
ollama create gemma4-4k -f Modelfile
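Most of the memory saved here is KV cache, which grows linearly with context length. A hedged estimate (the layer count, KV-head count, and head dimension below are illustrative assumptions, not Gemma 4's published architecture):

```python
def kv_cache_gb(ctx: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Illustrative config: 48 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
for ctx in (4096, 32768, 262144):
    print(f"ctx={ctx:>6}: {kv_cache_gb(ctx, 48, 8, 128):6.2f} GB")
```

Under these assumptions a 4K context costs well under a gigabyte, while the full 256K window runs into tens of gigabytes, which is why trimming `num_ctx` is the single biggest memory lever.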

Use Flash Attention

Both llama.cpp and vLLM support Flash Attention, which significantly reduces memory usage for long contexts:

# llama.cpp
./llama-cli -m gemma-4-26b-Q4_K_M.gguf -fa

# vLLM enables it automatically

Docker setup

For a reproducible environment:

# Dockerfile
FROM ollama/ollama:latest
# "ollama pull" talks to the server, so start it in the background during the build
RUN ollama serve & sleep 5 && ollama pull gemma4:26b
EXPOSE 11434
# The base image's entrypoint is already the ollama binary, so pass only the subcommand
CMD ["serve"]

Build and run the image:

docker build -t gemma4-local .
docker run -d --gpus all -p 11434:11434 gemma4-local

Comparing local AI runtimes

Not sure which runtime to use? Here’s the quick version:

| | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Setup time | 2 min | 10 min | 5 min |
| GPU required | No | No | Yes |
| Quantization | Auto | Full control | Limited |
| API | OpenAI-compat | OpenAI-compat | OpenAI-compat |
| Best for | Getting started | Tweaking | Production |

For a deeper comparison, see our Ollama vs llama.cpp vs vLLM guide.

What’s next

Once you have Gemma 4 running locally, it works with any tool that speaks the OpenAI API, from Continue.dev in VS Code to chat frontends and coding assistants.

Gemma 4 26B running locally on a laptop is genuinely useful for daily coding and writing tasks. The MoE architecture makes it feel like a much larger model than the hardware suggests. If you’ve been waiting for open models to be “good enough” — they are now.