IBM Granite 4.1 is one of the strongest open-weight coding models available in 2026. The 8B model scores 87.2 on HumanEval and the 30B leads BFCL V3 tool calling at 73.68 — both under Apache 2.0. But raw benchmarks don’t matter if you can’t plug the model into your actual workflow. Here’s how to set up Granite 4.1 with Aider, Continue.dev, and other popular coding tools.
Prerequisites
Before configuring any tool, you need Granite 4.1 running somewhere. You have three options:
Option 1: Local with Ollama (recommended for development)
# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh
# Pull Granite 4.1 (pick your size)
ollama pull granite4.1:3b # ~2 GB, runs on any machine
ollama pull granite4.1:8b # ~5 GB, needs 8+ GB RAM
ollama pull granite4.1:30b # ~18 GB quantized, needs 32+ GB RAM
Ollama serves models on http://localhost:11434 by default with an OpenAI-compatible API.
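To sanity-check the endpoint before wiring up any tool, you can send a request directly (model tag as pulled above):
# Quick test of Ollama's OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite4.1:8b",
    "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}]
  }'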
Option 2: Local with LM Studio
- Download LM Studio from lmstudio.ai
- Search for “granite 4.1” in the model library
- Download your preferred size
- Start the local server (default: http://localhost:1234/v1)
Option 3: Cloud API
For the 30B model without local hardware, use a cloud provider:
- OpenRouter: ibm-granite/granite-4.1-30b-instruct — pay per token
- watsonx.ai: IBM's managed platform — enterprise SLAs available
- Replicate: on-demand inference — no GPU management
Each provider gives you an OpenAI-compatible endpoint and API key.
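For example, with OpenRouter (model slug as listed above; confirm the exact id in the provider's catalog):
# Example chat request through OpenRouter's OpenAI-compatible API
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "ibm-granite/granite-4.1-30b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'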
Setting up Aider with Granite 4.1
Aider is a terminal-based AI coding assistant that edits files directly in your repository. It works with any OpenAI-compatible API, which makes Granite 4.1 integration straightforward.
Install Aider
pip install aider-chat
Configure for local Ollama
Create or edit ~/.aider.conf.yml:
# Granite 4.1 via Ollama
model: ollama/granite4.1:8b
Or pass it directly:
aider --model ollama/granite4.1:8b
Aider auto-detects Ollama on localhost:11434. No API key needed.
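If Ollama runs on a different host or port, point Aider at it explicitly:
# Only needed when Ollama is not on the default localhost:11434
export OLLAMA_API_BASE=http://192.168.1.50:11434  # example address, substitute your own
aider --model ollama/granite4.1:8b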
Configure for OpenRouter
export OPENROUTER_API_KEY=your-key-here
aider --model openrouter/ibm-granite/granite-4.1-30b-instruct
Or in ~/.aider.conf.yml:
model: openrouter/ibm-granite/granite-4.1-30b-instruct
Configure for a generic OpenAI-compatible endpoint
If you’re running Granite 4.1 via vLLM, LM Studio, or any other OpenAI-compatible server:
export OPENAI_API_BASE=http://localhost:1234/v1
export OPENAI_API_KEY=not-needed
aider --model openai/granite-4.1-8b-instruct
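The name after openai/ must match a model id your server actually reports; list them first if unsure:
# Check which model ids the server exposes
curl http://localhost:1234/v1/models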
Aider performance tips for Granite 4.1
Use the 8B for most tasks. Granite 4.1 8B scores 87.2 on HumanEval — that’s strong enough for most code editing tasks. The 30B is better for complex multi-file refactors but slower locally.
Enable the editor model pattern. Use the 3B as a fast editor model and the 8B or 30B as the main model:
aider --model ollama/granite4.1:8b --editor-model ollama/granite4.1:3b
The 3B handles simple edits quickly while the 8B tackles complex reasoning.
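The same pairing can live in ~/.aider.conf.yml (config keys mirror the CLI flags):
# ~/.aider.conf.yml
model: ollama/granite4.1:8b
editor-model: ollama/granite4.1:3b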
Set appropriate context. Granite 4.1 8B supports 512K tokens, so you can add many files to context:
aider --model ollama/granite4.1:8b
# Then in Aider:
/add src/**/*.py
Use map-tokens for large repos. Aider’s repo map helps Granite understand your codebase structure:
# ~/.aider.conf.yml
model: ollama/granite4.1:8b
map-tokens: 2048
Setting up Continue.dev with Granite 4.1
Continue.dev is an open-source AI coding assistant that runs inside VS Code and JetBrains. It supports local models through Ollama and remote APIs.
Install Continue.dev
- Open VS Code
- Go to Extensions (Ctrl+Shift+X)
- Search “Continue” and install
- Or install from the marketplace: continue.continue
Configure for local Ollama
Edit your Continue config file at ~/.continue/config.json:
{
"models": [
{
"title": "Granite 4.1 8B",
"provider": "ollama",
"model": "granite4.1:8b",
"contextLength": 32768
}
],
"tabAutocompleteModel": {
"title": "Granite 4.1 3B",
"provider": "ollama",
"model": "granite4.1:3b",
"contextLength": 4096
}
}
This configuration uses the 8B for chat/edit tasks and the 3B for fast tab autocomplete.
Configure for OpenRouter
{
"models": [
{
"title": "Granite 4.1 30B",
"provider": "openrouter",
"model": "ibm-granite/granite-4.1-30b-instruct",
"apiKey": "your-openrouter-key"
}
]
}
Configure for a custom OpenAI-compatible endpoint
{
"models": [
{
"title": "Granite 4.1 8B",
"provider": "openai",
"model": "granite-4.1-8b-instruct",
"apiBase": "http://localhost:1234/v1",
"apiKey": "not-needed"
}
]
}
Configure for watsonx.ai
{
"models": [
{
"title": "Granite 4.1 30B (watsonx)",
"provider": "openai",
"model": "ibm/granite-4-1-30b-instruct",
"apiBase": "https://us-south.ml.cloud.ibm.com/ml/v1/text/generation",
"apiKey": "your-watsonx-api-key",
"requestOptions": {
"headers": {
"Content-Type": "application/json"
}
}
}
]
}
Continue.dev performance tips for Granite 4.1
Use the 3B for autocomplete. Tab completion needs to be fast — under 200ms ideally. Granite 4.1 3B is small enough to deliver near-instant completions while still being accurate (79.27 on HumanEval).
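If completions still feel laggy, Continue also exposes autocomplete tuning options; a minimal sketch (option names per Continue's config reference, so verify against your installed version):
{
  "tabAutocompleteOptions": {
    "debounceDelay": 250,
    "maxPromptTokens": 1024
  }
}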
Set context length appropriately. While Granite supports up to 512K tokens, Continue.dev works best with a focused context window. Set contextLength to 32768 for chat and 4096 for autocomplete to keep responses fast.
Enable codebase indexing. Continue.dev can index your codebase for retrieval-augmented generation:
{
"contextProviders": [
{
"name": "codebase",
"params": {
"nRetrieve": 15,
"nFinal": 5
}
}
]
}
This helps Granite understand your project structure without loading everything into context.
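To use it, reference @codebase in the chat input (Continue's @-mention syntax) and ask a question about your project; Continue retrieves the most relevant indexed snippets and passes them to Granite.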
Use slash commands. Continue.dev supports slash commands that work well with Granite’s instruction-following capabilities (IFEval 89.65 for the 30B):
- /edit — edit selected code
- /comment — add comments to code
- /test — generate tests for selected code
- /fix — fix errors in selected code
Setting up other coding tools
vLLM (high-throughput serving)
For team deployments or production use, vLLM provides optimized inference:
pip install vllm
vllm serve ibm-granite/granite-4.1-8b-instruct \
--max-model-len 32768 \
--gpu-memory-utilization 0.9
This serves an OpenAI-compatible API on http://localhost:8000/v1 that any tool can connect to.
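Any of the tools above can then target it. For example, pointing Aider at the vLLM server (the model name matches the Hugging Face id passed to vllm serve):
# Aider against a local vLLM server
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=dummy  # vLLM ignores the key unless you configured one
aider --model openai/ibm-granite/granite-4.1-8b-instruct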
For the 30B with FP8 quantization, split across two GPUs (--tensor-parallel-size 2 shards the model over two cards):
vllm serve ibm-granite/granite-4.1-30b-instruct \
--quantization fp8 \
--max-model-len 32768 \
--tensor-parallel-size 2
Open Interpreter
pip install open-interpreter
# With Ollama
interpreter --model ollama/granite4.1:8b
# With OpenRouter
export OPENROUTER_API_KEY=your-key
interpreter --model openrouter/ibm-granite/granite-4.1-30b-instruct
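Open Interpreter can also target a generic OpenAI-compatible server (flag name per its CLI; run interpreter --help to confirm):
# With LM Studio or any other OpenAI-compatible server
interpreter --model openai/granite-4.1-8b-instruct --api_base http://localhost:1234/v1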
Cursor / Windsurf (custom model)
Both Cursor and Windsurf support custom OpenAI-compatible endpoints. Point them at your Ollama, vLLM, or LM Studio server:
- API Base: http://localhost:11434/v1 (Ollama) or http://localhost:8000/v1 (vLLM)
- Model: granite4.1:8b
- API Key: not-needed (for local)
AnythingLLM
AnythingLLM supports Granite 4.1 directly:
- Open AnythingLLM settings
- Select Ollama as the LLM provider
- Choose granite4.1:8b from the model list
- Set your workspace preferences
Which Granite 4.1 size for which tool?
| Tool | Recommended size | Why |
|---|---|---|
| Aider (main model) | 8B or 30B | Needs strong code editing; 8B is the sweet spot |
| Aider (editor model) | 3B | Fast simple edits |
| Continue.dev (chat) | 8B | Good balance of quality and speed |
| Continue.dev (autocomplete) | 3B | Speed is critical for tab completion |
| vLLM (team server) | 30B | Maximum quality for shared deployment |
| Open Interpreter | 8B | General-purpose coding tasks |
| Quick prototyping | 3B | Fast iteration, good enough quality |
The 8B is the sweet spot for most individual developer workflows. It scores 87.2 on HumanEval, 80.2 on EvalPlus, and 68.27 on BFCL V3 tool calling — all while fitting in 16 GB of RAM with quantization.
The 30B is worth the hardware cost for team deployments or when you need maximum quality for complex tasks. It leads BFCL V3 at 73.68 and scores 89.63 on HumanEval.
The 3B is ideal for autocomplete and fast editing. At 79.27 on HumanEval, it’s remarkably capable for its size and runs on virtually any hardware.
Performance expectations
Here’s what to expect in terms of speed and quality:
Local inference speed (approximate)
| Model | Hardware | Tokens/sec | Latency feel |
|---|---|---|---|
| 3B (Q4) | MacBook Air M2 8GB | ~40-60 | Instant |
| 3B (Q4) | RTX 3060 12GB | ~50-70 | Instant |
| 8B (Q4) | MacBook Pro M3 16GB | ~25-40 | Fast |
| 8B (Q4) | RTX 4070 12GB | ~30-50 | Fast |
| 30B (Q4) | Mac Studio M2 Ultra 64GB | ~10-20 | Acceptable |
| 30B (FP8) | A100 80GB | ~30-50 | Fast |
Quality expectations by task
| Task | 3B | 8B | 30B |
|---|---|---|---|
| Single function generation | Good | Excellent | Excellent |
| Multi-file refactoring | Fair | Good | Excellent |
| Test generation | Good | Excellent | Excellent |
| Bug fixing | Good | Excellent | Excellent |
| API integration / tool calling | Fair | Good | Excellent |
| Code explanation | Fair | Good | Excellent |
| Complex architecture decisions | Poor | Fair | Good |
Troubleshooting common issues
“Model not found” in Aider
Make sure Ollama is running and the model is pulled:
ollama list # Check available models
ollama pull granite4.1:8b # Pull if missing
Slow responses in Continue.dev
Reduce contextLength in your config. Large context windows require more memory and compute:
{
"models": [
{
"title": "Granite 4.1 8B",
"provider": "ollama",
"model": "granite4.1:8b",
"contextLength": 16384
}
]
}
Out of memory errors
Switch to a smaller model or enable quantization:
# Use the 3B instead
ollama pull granite4.1:3b
# Or use a quantized variant
ollama pull granite4.1:8b-q4_0
Tool calling not working
Granite 4.1 supports tool calling natively, but the tool needs to format the request correctly. Make sure your tool is sending the tools parameter in the OpenAI-compatible format. Aider and Continue.dev handle this automatically.
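If you're testing a custom integration, a minimal request looks like this (the function definition is purely illustrative):
# Minimal tool-calling request in the OpenAI-compatible format
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite4.1:8b",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'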
Connection refused errors
Check that your local server is running:
# For Ollama
curl http://localhost:11434/api/tags
# For vLLM
curl http://localhost:8000/v1/models
# For LM Studio
curl http://localhost:1234/v1/models
Granite 4.1 vs other local coding models in these tools
How does Granite 4.1 compare to other popular choices for Aider and Continue.dev?
| Model | HumanEval | Tool calling | Context | Best for |
|---|---|---|---|---|
| Granite 4.1 8B | 87.2 | 68.27 (BFCL V3) | 512K | Tool calling, enterprise |
| Granite 4.1 30B | 89.63 | 73.68 (BFCL V3) | 512K | Maximum coding quality |
| Qwen 3-Coder 30B | ~88 | ~65 | 128K | General coding |
| Gemma 4 27B | ~83 | ~72.7 | 256K | Multimodal coding |
| Phi-4 14B | ~82 | ~60 | 128K | Compact reasoning |
Granite 4.1 8B offers the best coding performance per GB of VRAM. The 30B leads on tool calling. For Aider specifically, Granite’s strong instruction following (IFEval 89.65) translates to reliable code edits — it follows Aider’s edit format consistently.
For more details on the model itself, see our Granite 4.1 complete guide. For API configuration options, check the Granite 4.1 API guide. For Aider setup with other models, see our Aider complete guide, and for Continue.dev, the Continue.dev complete guide.
FAQ
Which Granite 4.1 size should I use with Aider?
The 8B is the best default. It scores 87.2 on HumanEval, fits in 16 GB RAM with quantization, and generates tokens fast enough for interactive use. Use the 30B for complex multi-file refactors if you have the hardware (32+ GB RAM). Use the 3B as an editor model for fast simple edits alongside the 8B as the main model.
Does Granite 4.1 work with Continue.dev autocomplete?
Yes. Use the 3B model for tab autocomplete — it’s fast enough for sub-200ms completions while scoring 79.27 on HumanEval. Configure it as the tabAutocompleteModel in your Continue config with a short context length (4096 tokens) for maximum speed.
Can I use Granite 4.1 with Aider without an API key?
Yes, when running locally through Ollama or LM Studio. No API key is needed — Aider connects directly to the local server. For cloud providers (OpenRouter, watsonx.ai), you’ll need an API key from that provider.
How does Granite 4.1 compare to Qwen 3-Coder in Aider?
Granite 4.1 8B and Qwen 3-Coder are close on raw coding benchmarks, but Granite has two advantages in Aider: stronger instruction following (IFEval 89.65) means it follows Aider’s edit format more reliably, and better tool calling (BFCL V3 68.27) means it handles structured operations better. Granite also offers 512K context vs Qwen’s 128K.
Is the 30B model worth the extra hardware for coding tools?
It depends on your tasks. The 30B scores 89.63 on HumanEval vs 87.2 for the 8B — a modest improvement. Where the 30B shines is tool calling (73.68 vs 68.27 on BFCL V3) and complex multi-file operations. If you’re doing simple function generation and bug fixes, the 8B is sufficient. For complex refactoring, API integration, or team deployments, the 30B justifies the hardware.
Can I run Granite 4.1 and another model simultaneously?
Yes, with Ollama. Ollama can serve multiple models — it loads them on demand and keeps recently used models in memory. You can configure Aider to use Granite 4.1 8B as the main model and a different model (or Granite 3B) as the editor model. Continue.dev also supports multiple model configurations for different tasks.
What’s the minimum hardware for a usable Granite 4.1 coding setup?
A MacBook with 8 GB RAM or any machine with 8 GB VRAM can run the 3B model comfortably. For the 8B (recommended), you need 16 GB RAM/VRAM. A MacBook Pro M2/M3 with 16 GB unified memory runs the 8B well for Aider and Continue.dev. The 30B needs 32+ GB — a Mac Studio or a GPU like the RTX 4090 or A6000.
Related: Granite 4.1 complete guide · Granite 4.1 API guide · Aider complete guide · Continue.dev complete guide