How to Run IBM Granite 4.1 Locally: Ollama, vLLM, and llama.cpp Setup (2026)
Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Granite 4.1 is one of the easiest high-quality models to run locally. The 8B instruct variant needs about 5 GB of VRAM at Q4, fits on any modern GPU or Apple Silicon Mac, and scores 87.2 on HumanEval, matching models four times its size. The 3B runs on a Raspberry Pi. The 30B fits on a single RTX 4090 with 4-bit quantization.
This guide covers every local deployment option: Ollama for quick setup, vLLM for production serving, and llama.cpp for maximum hardware flexibility. Plus cloud GPU alternatives when local hardware is not enough.
For a full overview of what Granite 4.1 is and how it compares to competitors, start with the Granite 4.1 complete guide.
Hardware requirements
Before choosing a deployment method, check what you need:
| Model | Parameters | VRAM (FP16) | VRAM (FP8) | VRAM (Q4) | RAM (CPU) | Context |
|---|---|---|---|---|---|---|
| Granite 4.1 3B | 3B | ~6 GB | ~3 GB | ~2 GB | 4 GB | 128K |
| Granite 4.1 8B | 8B | ~16 GB | ~8 GB | ~5 GB | 10 GB | 512K |
| Granite 4.1 30B | 30B | ~60 GB | ~30 GB | ~18 GB | 36 GB | 512K |
These are model weight sizes. Actual memory usage increases with context length; the KV cache for 512K tokens adds significant overhead. For practical use:
- 3B - Any laptop, any Mac, most phones. 4 GB total RAM is enough.
- 8B at Q4 - RTX 3060 (12 GB), RTX 4060 (8 GB), any Apple Silicon Mac with 8 GB+.
- 8B at FP16 - RTX 4090 (24 GB), Mac with 16 GB+, or any GPU with 16 GB+ VRAM.
- 30B at FP8 - at ~30 GB it does not fully fit on an RTX 4090 (24 GB); an RTX 5090 (32 GB) is comfortable, as is a Mac with 32 GB+.
- 30B at Q4 - RTX 4090 fits it. Mac with 32 GB unified memory works well.
For a deeper dive into VRAM planning, see our VRAM requirements guide.
Option 1: Ollama (recommended for most users)
Ollama is the fastest path from zero to running Granite 4.1. One command to install, one command to pull the model, one command to chat.
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com for macOS app / Windows
Pull and run Granite 4.1
# 8B instruct (recommended default)
ollama pull granite4.1:8b
ollama run granite4.1:8b
# 3B for lighter hardware
ollama pull granite4.1:3b
ollama run granite4.1:3b
# 30B for maximum quality
ollama pull granite4.1:30b
ollama run granite4.1:30b
That is it. Ollama handles quantization, memory management, and GPU offloading automatically. The 8B model downloads as roughly 5 GB and starts generating in seconds.
Configure context length
By default, Ollama uses a 2048-token context window. Granite 4.1 supports up to 512K. To increase it:
# Set context to 32K tokens
ollama run granite4.1:8b --ctx-size 32768
# Or create a Modelfile for persistent config
cat > Modelfile << 'EOF'
FROM granite4.1:8b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
EOF
ollama create granite4.1-32k -f Modelfile
ollama run granite4.1-32k
Larger context windows consume more memory. At 32K context, the 8B model uses roughly 8–10 GB total. At 128K, expect 16–20 GB. Going to the full 512K requires 40+ GB and is only practical on high-memory machines.
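If you drive Ollama from Python instead of the CLI, the official ollama client library accepts per-request options, so you can raise num_ctx without a Modelfile. A minimal sketch, assuming the granite4.1:8b tag pulled above and a recent ollama package:
# pip install ollama
import ollama

response = ollama.chat(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Explain the difference between Q4 and Q8 quantization in two sentences."}],
    options={"num_ctx": 32768},  # per-request context window, same knob as PARAMETER num_ctx
)
print(response["message"]["content"])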
Use the Ollama API
Ollama exposes an OpenAI-compatible API on port 11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "granite4.1:8b",
"messages": [{"role": "user", "content": "Write a Python function to parse CSV files"}]
}'
This works with any OpenAI SDK client. Point your base_url to http://localhost:11434/v1 and use granite4.1:8b as the model name.
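For example, with the official openai Python package (a minimal sketch; Ollama ignores the API key, but the client requires some value):
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
)
print(reply.choices[0].message.content)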
For a comparison of Ollama against other inference engines, see Ollama vs llama.cpp vs vLLM.
Option 2: vLLM (production serving)
vLLM is the standard for high-throughput production inference. It supports continuous batching, PagedAttention, and tensor parallelism, all of which matter when serving Granite 4.1 to multiple users.
Install vLLM
pip install vllm
Serve Granite 4.1
# 8B model, single GPU
vllm serve ibm-granite/granite-4.1-8b-instruct \
--max-model-len 32768 \
--port 8000
# 30B model, 2 GPUs with tensor parallelism
vllm serve ibm-granite/granite-4.1-30b-instruct \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--port 8000
# FP8 quantized 30B for single GPU
vllm serve ibm-granite/granite-4.1-30b-instruct \
--quantization fp8 \
--max-model-len 32768 \
--port 8000
vLLM downloads the model from HuggingFace automatically. The server exposes an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibm-granite/granite-4.1-8b-instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention in two sentences."}],
"max_tokens": 256
}'
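The same endpoint works from the openai Python package, including streaming, which is usually what you want for user-facing serving. A brief sketch against the server started above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="ibm-granite/granite-4.1-8b-instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
print()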
vLLM performance tips
- Start with --max-model-len 32768 and increase only if you need longer context. Lower values use less GPU memory and allow more concurrent requests.
- Use --gpu-memory-utilization 0.9 to let vLLM allocate 90% of available VRAM for weights plus KV cache.
- Enable chunked prefill with --enable-chunked-prefill for better latency on long inputs.
- Use FP8 (--quantization fp8) to halve memory usage with minimal quality loss. IBM provides official FP8 variants. The sketch after this list shows the same settings through vLLM's offline Python API.
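The same knobs are exposed as constructor arguments in vLLM's offline Python API. A minimal sketch, assuming the argument names of a recent vLLM release (worth checking against your installed version):
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-4.1-8b-instruct",
    max_model_len=32768,            # mirrors --max-model-len
    gpu_memory_utilization=0.9,     # mirrors --gpu-memory-utilization
    enable_chunked_prefill=True,    # mirrors --enable-chunked-prefill
)
outputs = llm.generate(
    ["Explain continuous batching in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)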
Option 3: llama.cpp (maximum flexibility)
llama.cpp gives you the most control over quantization, memory layout, and hardware targeting. It runs on CPUs, GPUs, and mixed configurations.
Install llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # Use -DGGML_METAL=ON for Mac
cmake --build build --config Release -j
Download GGUF models
Granite 4.1 GGUF files are available on HuggingFace. Look for community quantizations:
# Example: download Q4_K_M quantization of the 8B model
huggingface-cli download ibm-granite/granite-4.1-8b-instruct-GGUF \
granite-4.1-8b-instruct-Q4_K_M.gguf \
--local-dir ./models
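The same download can be scripted from Python with huggingface_hub; the repo and file names below simply mirror the CLI example above and will differ for other community quantizations:
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ibm-granite/granite-4.1-8b-instruct-GGUF",
    filename="granite-4.1-8b-instruct-Q4_K_M.gguf",
    local_dir="./models",
)
print(path)  # local path to the GGUF file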
Run inference
# Interactive chat (the chat template embedded in the GGUF is applied automatically)
./build/bin/llama-cli \
-m ./models/granite-4.1-8b-instruct-Q4_K_M.gguf \
-c 32768 \
-ngl 99
# Server mode (OpenAI-compatible API)
./build/bin/llama-server \
-m ./models/granite-4.1-8b-instruct-Q4_K_M.gguf \
-c 32768 \
-ngl 99 \
--port 8080
The -ngl 99 flag offloads all layers to GPU. Reduce this number to split between GPU and CPU if you do not have enough VRAM.
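If you would rather stay in Python than shell out to the CLI, the llama-cpp-python bindings wrap the same engine with equivalent options. A rough sketch, assuming a recent llama-cpp-python build with GPU support:
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/granite-4.1-8b-instruct-Q4_K_M.gguf",
    n_ctx=32768,       # same role as -c
    n_gpu_layers=99,   # same role as -ngl
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])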
Quantization options
| Quantization | Size (8B) | Quality | Speed | Use case |
|---|---|---|---|---|
| Q2_K | ~3 GB | Low | Fastest | Testing only |
| Q4_K_M | ~5 GB | Good | Fast | Best balance for most users |
| Q5_K_M | ~6 GB | Very good | Medium | Quality-focused |
| Q6_K | ~7 GB | Near-FP16 | Slower | When quality matters most |
| Q8_0 | ~8.5 GB | Excellent | Slower | Near-lossless |
| FP16 | ~16 GB | Perfect | Slowest | Reference quality |
For the 8B model, Q4_K_M is the sweet spot. You get 95%+ of FP16 quality in a 5 GB package. For the 30B, Q4_K_M brings it down to ~18 GB, fitting on an RTX 4090.
Cloud GPU alternatives
If your local hardware cannot handle the 30B model, or you need the full 512K context window, cloud GPUs are the practical solution.
RunPod
RunPod offers on-demand GPU instances starting at $0.20/hour for an RTX 4090. For Granite 4.1:
- 8B model - Any GPU with 8+ GB VRAM. An RTX 3090 instance at ~$0.30/hour works well.
- 30B model - An A100 40 GB instance at ~$1.00/hour gives you room for the model plus a generous context window.
- 512K context - You need 80+ GB VRAM. An A100 80 GB or H100 instance handles it.
RunPod supports vLLM templates, so you can deploy Granite 4.1 as a serverless endpoint or a persistent pod with a few clicks.
For a broader comparison of cloud GPU providers, see our best cloud GPU providers guide.
Other cloud options
- Replicate - Granite 4.1 is available as a hosted model. Pay per prediction.
- HuggingFace Inference Endpoints - Deploy directly from the ibm-granite org.
- watsonx.ai - IBM's managed platform with enterprise SLAs.
When to use 3B vs 8B vs 30B
The choice depends on your hardware and quality requirements:
Pick the 3B when:
- You are deploying to edge devices, mobile, or embedded systems
- You need the fastest possible inference speed
- Your tasks are simple: summarization, classification, short Q&A
- You have less than 8 GB of RAM available
- Latency matters more than quality
The 3B scores 79.27 on HumanEval and 86.88 on GSM8K. That is strong for a 3B model, better than many 7B models from a year ago. But it drops off on complex reasoning and long-form generation.
Pick the 8B when:
- You want the best quality-to-resource ratio
- You have a modern GPU (8+ GB VRAM) or Apple Silicon Mac
- You need 512K context for long documents or codebases
- You are building coding assistants, chatbots, or API services
- You want one model that handles most tasks well
The 8B is the default recommendation. It matches the previous 32B MoE model, fits on consumer hardware, and handles 512K context. Unless you have a specific reason to go smaller or larger, start here.
Pick the 30B when:
- You need maximum quality, especially for tool calling (73.68 BFCL V3)
- You are running complex coding tasks (82.7 EvalPlus)
- You have 24+ GB VRAM or are using cloud GPUs
- You are serving multiple users and can justify the hardware cost
- Enterprise compliance requires the highest-quality output
The 30B is the ceiling of the Granite 4.1 family. It leads open-source models in tool calling and scores near 90 on HumanEval. The cost is ~3.5× the memory of the 8B.
Performance expectations
Real-world inference speed depends on your hardware, quantization, context length, and batch size. Here are rough expectations for single-user interactive use:
| Setup | Model | Tokens/sec (generation) |
|---|---|---|
| M2 MacBook Air 16 GB | 8B Q4 | ~25–35 tok/s |
| RTX 4060 8 GB | 8B Q4 | ~40–60 tok/s |
| RTX 4090 24 GB | 8B FP16 | ~80–120 tok/s |
| RTX 4090 24 GB | 30B Q4 | ~20–35 tok/s |
| A100 80 GB | 30B FP16 | ~50–80 tok/s |
These are approximate generation speeds for short context (under 4K tokens). Longer context windows reduce throughput due to KV cache overhead. Prompt processing (prefill) is typically 5–10× faster than generation.
For the 8B model on a Mac or mid-range GPU, expect a responsive conversational experience, fast enough for interactive coding assistance and chat.
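To check throughput on your own hardware rather than trusting the table, time a single generation and divide the completion tokens by wall-clock time. A sketch using the OpenAI-compatible endpoint (shown for Ollama; swap base_url and model for vLLM or llama-server), with the caveat that elapsed time also includes prefill:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
start = time.time()
resp = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[{"role": "user", "content": "Write a 300-word explanation of KV caching."}],
    max_tokens=512,
)
elapsed = time.time() - start
# completion_tokens counts generated tokens; prefill time is included in elapsed
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")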
Troubleshooting common issues
Model downloads slowly - Ollama and HuggingFace downloads depend on your internet connection. For large models, use huggingface-cli download with --resume-download to handle interruptions.
Out of memory errors - Reduce --max-model-len in vLLM, reduce -c in llama.cpp, or use a more aggressive quantization. Q4_K_M is usually the right tradeoff.
Slow generation on CPU - Granite 4.1 is designed for GPU inference. CPU-only generation for the 8B model will be 2–5 tok/s, which is usable but not interactive. Use the 3B for CPU-only setups.
Context window errors - If you request more context than your memory supports, the server will crash or refuse the request. Start with 4K–8K context and increase gradually until you find your hardware's limit.
Apple Silicon memory pressure - macOS shares memory between CPU and GPU. If you see memory pressure warnings, close other applications or reduce the context window. Ollama handles this automatically by reducing GPU layers.
FAQ
Can I run Granite 4.1 8B on a laptop without a dedicated GPU?
Yes. The 8B model at Q4 quantization needs about 5 GB of RAM. Any Apple Silicon Mac (M1 or later) runs it well using the Metal GPU. On Intel/AMD laptops without a dedicated GPU, it runs on CPU at 2–5 tokens per second, usable for batch processing but not interactive chat. For the best laptop experience, use the 3B model on CPU-only hardware.
Which quantization should I use?
Q4_K_M for most users. It reduces the 8B model to ~5 GB with minimal quality loss (roughly 95% of FP16 quality). If you have extra VRAM, Q6_K or Q8_0 get you closer to full precision. Avoid Q2_K for anything beyond testing; the quality drop is noticeable. IBM also provides official FP8 variants, which are the best option if your GPU supports FP8 natively (RTX 4090, A100, H100).
How much VRAM do I need for the full 512K context?
A lot. The KV cache for 512K tokens at FP16 on the 8B model requires roughly 32–40 GB on top of the model weights. Practically, you need an A100 80 GB or H100 to use the full 512K context with the 8B model. For the 30B at 512K, you need multiple GPUs. Most local users should stick to 32K–128K context, which is still far more than most competitors offer.
Is vLLM faster than Ollama for Granite 4.1?
For single-user interactive chat, they are comparable. vLLM's advantage shows up with multiple concurrent users: continuous batching and PagedAttention let it serve 5–10× more users on the same hardware. If you are building an API that serves multiple clients, use vLLM. If you are running Granite 4.1 for personal use, Ollama is simpler and just as fast. See our Ollama vs llama.cpp vs vLLM comparison for detailed benchmarks.
Can I fine-tune Granite 4.1 locally?
Yes. The Apache 2.0 license allows unrestricted fine-tuning. Tools like Unsloth, Axolotl, and HuggingFace TRL all support Granite 4.1. The 3B and 8B models can be fine-tuned on a single consumer GPU using QLoRA (4-bit quantized LoRA). The 30B requires at least 2× 24 GB GPUs or a cloud instance. IBM's training pipeline used 15 trillion tokens, but effective fine-tuning for specific tasks can work with as few as 1,000–10,000 high-quality examples.
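As a sketch of what that setup looks like (the model id matches the vLLM examples above; the LoRA hyperparameters are illustrative, not IBM's recipe, and this assumes your installed transformers version supports the Granite 4.1 architecture), loading the 8B in 4-bit and attaching a LoRA adapter:
# pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "ibm-granite/granite-4.1-8b-instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small LoRA adapters train; base weights stay 4-bit
# Train from here with your preferred trainer (TRL's SFTTrainer, Axolotl, Unsloth, ...)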
What is the difference between Ollama, vLLM, and llama.cpp for Granite 4.1?
Ollama is the easiest to set up: one command to install, one to run. It handles quantization and GPU offloading automatically. llama.cpp gives you the most control over quantization formats, memory layout, and mixed CPU/GPU inference. vLLM is built for production serving with high throughput and concurrent users. For personal use, start with Ollama. For production APIs, use vLLM. For custom hardware configurations or maximum flexibility, use llama.cpp.