How to Run IBM Granite 4.1 Locally: Ollama, vLLM, and llama.cpp Setup (2026)
Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Granite 4.1 is one of the easiest high-quality models to run locally. The 8B instruct variant needs about 5 GB of VRAM at Q4, fits on any modern GPU or Apple Silicon Mac, and scores 87.2 on HumanEval, matching models four times its size. The 3B runs on a Raspberry Pi. The 30B fits on a single RTX 4090 with 4-bit quantization.
This guide covers every local deployment option: Ollama for quick setup, vLLM for production serving, and llama.cpp for maximum hardware flexibility. Plus cloud GPU alternatives when local hardware is not enough.
For a full overview of what Granite 4.1 is and how it compares to competitors, start with the Granite 4.1 complete guide.
Hardware requirements
Before choosing a deployment method, check what you need:
| Model | Parameters | VRAM (FP16) | VRAM (FP8) | VRAM (Q4) | RAM (CPU) | Context |
|---|---|---|---|---|---|---|
| Granite 4.1 3B | 3B | ~6 GB | ~3 GB | ~2 GB | 4 GB | 128K |
| Granite 4.1 8B | 8B | ~16 GB | ~8 GB | ~5 GB | 10 GB | 512K |
| Granite 4.1 30B | 30B | ~60 GB | ~30 GB | ~18 GB | 36 GB | 512K |
These are model weight sizes. Actual memory usage increases with context length; the KV cache for 512K tokens adds significant overhead. For practical use:
- 3B - Any laptop, any Mac, most phones. 4 GB total RAM is enough.
- 8B at Q4 - RTX 3060 (12 GB), RTX 4060 (8 GB), any Apple Silicon Mac with 8 GB+.
- 8B at FP16 - RTX 4090 (24 GB), Mac with 16 GB+, or any GPU with 16 GB+ VRAM.
- 30B at FP8 - at ~30 GB it does not fully fit on an RTX 4090 (24 GB); an RTX 5090 (32 GB) is comfortable, as is a Mac with 32 GB+.
- 30B at Q4 - RTX 4090 fits it. Mac with 32 GB unified memory works well.
For a deeper dive into VRAM planning, see our VRAM requirements guide.
Option 1: Ollama (recommended for most users)
Ollama is the fastest path from zero to running Granite 4.1. One command to install, one command to pull the model, one command to chat.
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com for macOS app / Windows
Pull and run Granite 4.1
# 8B instruct (recommended default)
ollama pull granite4.1:8b
ollama run granite4.1:8b
# 3B for lighter hardware
ollama pull granite4.1:3b
ollama run granite4.1:3b
# 30B for maximum quality
ollama pull granite4.1:30b
ollama run granite4.1:30b
That is it. Ollama handles quantization, memory management, and GPU offloading automatically. The 8B model downloads as roughly 5 GB and starts generating in seconds.
Configure context length
By default, Ollama uses a 2048-token context window. Granite 4.1 supports up to 512K. To increase it:
# Set context to 32K tokens
ollama run granite4.1:8b --ctx-size 32768
# Or create a Modelfile for persistent config
cat > Modelfile << 'EOF'
FROM granite4.1:8b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF
ollama create granite4.1-32k -f Modelfile
ollama run granite4.1-32k
Larger context windows consume more memory. At 32K context, the 8B model uses roughly 8–10 GB total. At 128K, expect 16–20 GB. Going to the full 512K requires 40+ GB and is only practical on high-memory machines.
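If you drive Ollama from Python instead of the CLI, the official ollama client library accepts per-request options, so you can raise num_ctx without a Modelfile. A minimal sketch, assuming the granite4.1:8b tag pulled above and a recent ollama package:
# pip install ollama
import ollama

response = ollama.chat(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Explain the difference between Q4 and Q8 quantization in two sentences."}],
    options={"num_ctx": 32768},  # per-request context window, same knob as PARAMETER num_ctx
)
print(response["message"]["content"])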
Use the Ollama API
Ollama exposes an OpenAI-compatible API on port 11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "granite4.1:8b",
"messages": [{"role": "user", "content": "Write a Python function to parse CSV files"}]
}'
This works with any OpenAI SDK client. Point your base_url to http://localhost:11434/v1 and use granite4.1:8b as the model name.
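For example, with the official openai Python package (a minimal sketch; Ollama ignores the API key, but the client requires some value):
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
)
print(reply.choices[0].message.content)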
For a comparison of Ollama against other inference engines, see Ollama vs llama.cpp vs vLLM.
Option 2: vLLM (production serving)
vLLM is the standard for high-throughput production inference. It supports continuous batching, PagedAttention, and tensor parallelism, all of which matter when serving Granite 4.1 to multiple users.
Install vLLM
pip install vllm
Serve Granite 4.1
# 8B model, single GPU
vllm serve ibm-granite/granite-4.1-8b-instruct \
--max-model-len 32768 \
--port 8000
# 30B model, 2 GPUs with tensor parallelism
vllm serve ibm-granite/granite-4.1-30b-instruct \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--port 8000
# FP8 quantized 30B for single GPU
vllm serve ibm-granite/granite-4.1-30b-instruct \
--quantization fp8 \
--max-model-len 32768 \
--port 8000
vLLM downloads the model from HuggingFace automatically. The server exposes an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibm-granite/granite-4.1-8b-instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention in two sentences."}],
"max_tokens": 256
}'
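The same endpoint works from the openai Python package, including streaming, which is usually what you want for user-facing serving. A brief sketch against the server started above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="ibm-granite/granite-4.1-8b-instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
print()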
vLLM performance tips
- Start with --max-model-len 32768 and increase only if you need longer context. Lower values use less GPU memory and allow more concurrent requests.
- Use --gpu-memory-utilization 0.9 to let vLLM allocate 90% of available VRAM for weights plus KV cache.
- Enable chunked prefill with --enable-chunked-prefill for better latency on long inputs.
- Use FP8 (--quantization fp8) to halve memory usage with minimal quality loss. IBM provides official FP8 variants. The sketch after this list shows the same settings through vLLM's offline Python API.
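The same knobs are exposed as constructor arguments in vLLM's offline Python API. A minimal sketch, assuming the argument names of a recent vLLM release (worth checking against your installed version):
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-4.1-8b-instruct",
    max_model_len=32768,            # mirrors --max-model-len
    gpu_memory_utilization=0.9,     # mirrors --gpu-memory-utilization
    enable_chunked_prefill=True,    # mirrors --enable-chunked-prefill
)
outputs = llm.generate(
    ["Explain continuous batching in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)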
Option 3: llama.cpp (maximum flexibility)
llama.cpp gives you the most control over quantization, memory layout, and hardware targeting. It runs on CPUs, GPUs, and mixed configurations.
Install llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # Use -DGGML_METAL=ON for Mac
cmake --build build --config Release -j
Download GGUF models
Granite 4.1 GGUF files are available on HuggingFace. Look for community quantizations:
# Example: download Q4_K_M quantization of the 8B model
huggingface-cli download ibm-granite/granite-4.1-8b-instruct-GGUF \
granite-4.1-8b-instruct-Q4_K_M.gguf \
--local-dir ./models
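The same download can be scripted from Python with huggingface_hub; the repo and file names below simply mirror the CLI example above and will differ for other community quantizations:
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ibm-granite/granite-4.1-8b-instruct-GGUF",
    filename="granite-4.1-8b-instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(path)  # local path to the GGUF file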
Run inference
# Interactive chat (the chat template embedded in the GGUF is applied automatically)
./build/bin/llama-cli \
-m ./models/granite-4.1-8b-instruct-Q4_K_M.gguf \
-c 32768 \
-ngl 99
# Server mode (OpenAI-compatible API)
./build/bin/llama-server \
-m ./models/granite-4.1-8b-instruct-Q4_K_M.gguf \
-c 32768 \
-ngl 99 \
--port 8080
The -ngl 99 flag offloads all layers to GPU. Reduce this number to split between GPU and CPU if you do not have enough VRAM.
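If you would rather stay in Python than shell out to the CLI, the llama-cpp-python bindings wrap the same engine with equivalent options. A rough sketch, assuming a recent llama-cpp-python build with GPU support:
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/granite-4.1-8b-instruct-Q4_K_M.gguf",
    n_ctx=32768,       # same role as -c
    n_gpu_layers=99,   # same role as -ngl
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])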
Quantization options
| Quantization | Size (8B) | Quality | Speed | Use case |
|---|---|---|---|---|
| Q2_K | ~3 GB | Low | Fastest | Testing only |
| Q4_K_M | ~5 GB | Good | Fast | Best balance for most users |
| Q5_K_M | ~6 GB | Very good | Medium | Quality-focused |
| Q6_K | ~7 GB | Near-FP16 | Slower | When quality matters most |
| Q8_0 | ~8.5 GB | Excellent | Slower | Near-lossless |
| FP16 | ~16 GB | Perfect | Slowest | Reference quality |
For the 8B model, Q4_K_M is the sweet spot. You get 95%+ of FP16 quality in a 5 GB package. For the 30B, Q4_K_M brings it down to ~18 GB, fitting on an RTX 4090.
Cloud GPU alternatives
If your local hardware cannot handle the 30B model, or you need the full 512K context window, cloud GPUs are the practical solution.
RunPod
RunPod offers on-demand GPU instances starting at $0.20/hour for an RTX 4090. For Granite 4.1:
- 8B model - Any GPU with 8+ GB VRAM. An RTX 3090 instance at ~$0.30/hour works well.
- 30B model - An A100 40 GB instance at ~$1.00/hour gives you room for the model plus a generous context window.
- 512K context - You need 80+ GB VRAM. An A100 80 GB or H100 instance handles it.
RunPod supports vLLM templates, so you can deploy Granite 4.1 as a serverless endpoint or a persistent pod with a few clicks.
For a broader comparison of cloud GPU providers, see our best cloud GPU providers guide.
Other cloud options
- Replicate - Granite 4.1 is available as a hosted model. Pay per prediction.
- HuggingFace Inference Endpoints - Deploy directly from the ibm-granite org.
- watsonx.ai - IBM's managed platform with enterprise SLAs.
When to use 3B vs 8B vs 30B
The choice depends on your hardware and quality requirements:
Pick the 3B when:
- You are deploying to edge devices, mobile, or embedded systems
- You need the fastest possible inference speed
- Your tasks are simple: summarization, classification, short Q&A
- You have less than 8 GB of RAM available
- Latency matters more than quality
The 3B scores 79.27 on HumanEval and 86.88 on GSM8K. That is strong for a 3B model, better than many 7B models from a year ago. But it drops off on complex reasoning and long-form generation.
Pick the 8B when:
- You want the best quality-to-resource ratio
- You have a modern GPU (8+ GB VRAM) or Apple Silicon Mac
- You need 512K context for long documents or codebases
- You are building coding assistants, chatbots, or API services
- You want one model that handles most tasks well
The 8B is the default recommendation. It matches the previous 32B MoE model, fits on consumer hardware, and handles 512K context. Unless you have a specific reason to go smaller or larger, start here.
Pick the 30B when:
- You need maximum quality, especially for tool calling (73.68 BFCL V3)
- You are running complex coding tasks (82.7 EvalPlus)
- You have 24+ GB VRAM or are using cloud GPUs
- You are serving multiple users and can justify the hardware cost
- Enterprise compliance requires the highest-quality output
The 30B is the ceiling of the Granite 4.1 family. It leads open-source models in tool calling and scores near 90 on HumanEval. The cost is ~3.5× the memory of the 8B.
Performance expectations
Real-world inference speed depends on your hardware, quantization, context length, and batch size. Here are rough expectations for single-user interactive use:
| Setup | Model | Tokens/sec (generation) |
|---|---|---|
| M2 MacBook Air 16 GB | 8B Q4 | ~25–35 tok/s |
| RTX 4060 8 GB | 8B Q4 | ~40–60 tok/s |
| RTX 4090 24 GB | 8B FP16 | ~80–120 tok/s |
| RTX 4090 24 GB | 30B Q4 | ~20–35 tok/s |
| A100 80 GB | 30B FP16 | ~50–80 tok/s |
These are approximate generation speeds for short context (under 4K tokens). Longer context windows reduce throughput due to KV cache overhead. Prompt processing (prefill) is typically 5–10× faster than generation.
For the 8B model on a Mac or mid-range GPU, expect a responsive conversational experience, fast enough for interactive coding assistance and chat.
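To check throughput on your own hardware rather than trusting the table, time a single generation and divide the completion tokens by wall-clock time. A sketch using the OpenAI-compatible endpoint (shown for Ollama; swap base_url and model for vLLM or llama-server), with the caveat that elapsed time also includes prefill:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
start = time.time()
resp = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Write a 300-word explanation of KV caching."}],
    max_tokens=512,
)
elapsed = time.time() - start
# completion_tokens counts generated tokens; prefill time is included in elapsed
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")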
Troubleshooting common issues
Model downloads slowly - Ollama and HuggingFace downloads depend on your internet connection. For large models, use huggingface-cli download with --resume-download to handle interruptions.
Out of memory errors - Reduce --max-model-len in vLLM, reduce -c in llama.cpp, or use a more aggressive quantization. Q4_K_M is usually the right tradeoff.
Slow generation on CPU - Granite 4.1 is designed for GPU inference. CPU-only generation for the 8B model will be 2–5 tok/s, which is usable but not interactive. Use the 3B for CPU-only setups.
Context window errors - If you request more context than your memory supports, the server will crash or refuse the request. Start with 4K–8K context and increase gradually until you find your hardware's limit.
Apple Silicon memory pressure - macOS shares memory between CPU and GPU. If you see memory pressure warnings, close other applications or reduce the context window. Ollama handles this automatically by reducing GPU layers.
FAQ
Can I run Granite 4.1 8B on a laptop without a dedicated GPU?
Yes. The 8B model at Q4 quantization needs about 5 GB of RAM. Any Apple Silicon Mac (M1 or later) runs it well using the Metal GPU. On Intel/AMD laptops without a dedicated GPU, it runs on CPU at 2–5 tokens per second, usable for batch processing but not interactive chat. For the best laptop experience, use the 3B model on CPU-only hardware.
Which quantization should I use?
Q4_K_M for most users. It reduces the 8B model to ~5 GB with minimal quality loss (roughly 95% of FP16 quality). If you have extra VRAM, Q6_K or Q8_0 get you closer to full precision. Avoid Q2_K for anything beyond testing; the quality drop is noticeable. IBM also provides official FP8 variants, which are the best option if your GPU supports FP8 natively (RTX 4090, A100, H100).
How much VRAM do I need for the full 512K context?
A lot. The KV cache for 512K tokens at FP16 on the 8B model requires roughly 32–40 GB on top of the model weights. Practically, you need an A100 80 GB or H100 to use the full 512K context with the 8B model. For the 30B at 512K, you need multiple GPUs. Most local users should stick to 32K–128K context, which is still far more than most competitors offer.
Is vLLM faster than Ollama for Granite 4.1?
For single-user interactive chat, they are comparable. vLLM's advantage shows up with multiple concurrent users: continuous batching and PagedAttention let it serve 5–10× more users on the same hardware. If you are building an API that serves multiple clients, use vLLM. If you are running Granite 4.1 for personal use, Ollama is simpler and just as fast. See our Ollama vs llama.cpp vs vLLM comparison for detailed benchmarks.
Can I fine-tune Granite 4.1 locally?
Yes. The Apache 2.0 license allows unrestricted fine-tuning. Tools like Unsloth, Axolotl, and HuggingFace TRL all support Granite 4.1. The 3B and 8B models can be fine-tuned on a single consumer GPU using QLoRA (4-bit quantized LoRA). The 30B requires at least 2× 24 GB GPUs or a cloud instance. IBM's training pipeline used 15 trillion tokens, but effective fine-tuning for specific tasks can work with as few as 1,000–10,000 high-quality examples.
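As a sketch of what that setup looks like (the model id matches the vLLM examples above; the LoRA hyperparameters are illustrative, not IBM's recipe, and this assumes your installed transformers version supports the Granite 4.1 architecture), loading the 8B in 4-bit and attaching a LoRA adapter:
# pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "ibm-granite/granite-4.1-8b-instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small LoRA adapters train; base weights stay 4-bit
# Train from here with your preferred trainer (TRL's SFTTrainer, Axolotl, Unsloth, ...)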
What is the difference between Ollama, vLLM, and llama.cpp for Granite 4.1?
Ollama is the easiest to set up: one command to install, one to run. It handles quantization and GPU offloading automatically. llama.cpp gives you the most control over quantization formats, memory layout, and mixed CPU/GPU inference. vLLM is built for production serving with high throughput and concurrent users. For personal use, start with Ollama. For production APIs, use vLLM. For custom hardware configurations or maximum flexibility, use llama.cpp.