
IBM Granite 4.1 API Guide β€” watsonx, HuggingFace, and Ollama Endpoints (2026)


Granite 4.1 is available through multiple API providers: IBM watsonx.ai for enterprise deployments, HuggingFace Inference Endpoints for flexible cloud hosting, Ollama for local API serving, OpenRouter as a unified gateway, and Replicate for serverless inference. All expose OpenAI-compatible endpoints, so switching between providers requires changing a URL and an API key β€” not rewriting your application.

This guide covers authentication, chat completions, streaming, function calling, and the 512K context window across every major provider. All code examples are in Python.

For background on the model itself, see the Granite 4.1 complete guide. For local deployment without an API layer, see how to run Granite 4.1 locally.

API options at a glance

| Provider | Model sizes | Auth | Pricing | Best for |
|---|---|---|---|---|
| watsonx.ai | 3B, 8B, 30B | IBM Cloud IAM | Per-token (IBM pricing) | Enterprise, compliance |
| HuggingFace Inference | 3B, 8B, 30B | HF API token | Per-hour (dedicated) or per-request (serverless) | Flexible cloud hosting |
| Ollama (local) | 3B, 8B, 30B | None (localhost) | Free (your hardware) | Development, privacy |
| OpenRouter | 8B, 30B | OpenRouter API key | Per-token (varies) | Multi-model routing |
| Replicate | 8B, 30B | Replicate API token | Per-second | Serverless, burst workloads |

Every provider uses the OpenAI chat completions format. If your code works with one, it works with all of them.
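
In practice, that means the provider can live entirely in configuration. A minimal sketch, assuming you keep the values in environment variables (the GRANITE_* names are illustrative, not a standard):

import os
from openai import OpenAI

# Swap providers by changing two environment variables; the request code stays identical.
# GRANITE_BASE_URL, GRANITE_API_KEY, and GRANITE_MODEL are illustrative names, not a standard.
client = OpenAI(
    base_url=os.environ.get("GRANITE_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.environ.get("GRANITE_API_KEY", "ollama")
)

response = client.chat.completions.create(
    model=os.environ.get("GRANITE_MODEL", "granite4.1:8b"),
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64
)
print(response.choices[0].message.content)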

Ollama local API

The simplest way to get a Granite 4.1 API running. No authentication, no billing, no rate limits.

Setup

# Install Ollama and pull the model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull granite4.1:8b

Ollama starts a server automatically on port 11434. The API is OpenAI-compatible.

Chat completions

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required by SDK but not validated
)

response = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that validates email addresses using regex."}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[
        {"role": "user", "content": "Explain how DNS resolution works step by step."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Adjusting context length

By default, Ollama uses 2048 tokens of context. For longer conversations or documents:

# Create a model variant with 32K context
cat > Modelfile << 'EOF'
FROM granite4.1:8b
PARAMETER num_ctx 32768
EOF

ollama create granite4.1-32k -f Modelfile

Then use granite4.1-32k as your model name in API calls.
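
For example, reusing the OpenAI client from the chat completions example above (long_report.txt is a placeholder path):

# Load a document that would overflow the default 2048-token context
with open("long_report.txt", "r") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="granite4.1-32k",  # the variant created above
    messages=[
        {"role": "user", "content": f"Summarize the following report:\n\n{long_document}"}
    ],
    max_tokens=2048
)
print(response.choices[0].message.content)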

watsonx.ai API

IBM’s managed platform. Enterprise SLAs, governance tools, and integration with IBM’s AI ecosystem.

Authentication

watsonx uses IBM Cloud IAM tokens. You need a watsonx.ai project ID and an IBM Cloud API key.

import requests

# Get IAM token
iam_response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data={
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": "YOUR_IBM_CLOUD_API_KEY"
    }
)
access_token = iam_response.json()["access_token"]

Chat completions

import requests

url = "https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2024-05-01"

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json",
    "Accept": "application/json"
}

payload = {
    "model_id": "ibm/granite-4-1-8b-instruct",
    "project_id": "YOUR_PROJECT_ID",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of dense transformer architectures?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])

Using the watsonx Python SDK

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="YOUR_IBM_CLOUD_API_KEY"
)

model = ModelInference(
    model_id="ibm/granite-4-1-8b-instruct",
    credentials=credentials,
    project_id="YOUR_PROJECT_ID"
)

response = model.chat(
    messages=[
        {"role": "user", "content": "Write a SQL query to find duplicate records in a users table."}
    ],
    max_tokens=512
)

print(response["choices"][0]["message"]["content"])

watsonx provides additional features not available through other providers: model governance, prompt templates, AI factsheet tracking, and integration with Granite Guardian models for safety filtering.

HuggingFace Inference API

Two options: the free Inference API (rate-limited, shared infrastructure) and dedicated Inference Endpoints (your own GPU).

Free Inference API

from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1",
    api_key="YOUR_HF_TOKEN"
)

response = client.chat.completions.create(
    model="ibm-granite/granite-4.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain the difference between REST and GraphQL."}
    ],
    max_tokens=512
)

print(response.choices[0].message.content)

The free API has rate limits and may queue requests during peak usage. For production workloads, use dedicated endpoints.

Dedicated Inference Endpoints

Deploy your own instance through the HuggingFace UI or API:

from huggingface_hub import InferenceClient

# After creating an endpoint in the HuggingFace UI
client = InferenceClient(
    model="https://YOUR_ENDPOINT_URL.endpoints.huggingface.cloud",
    token="YOUR_HF_TOKEN"
)

response = client.chat_completion(
    messages=[
        {"role": "user", "content": "Write a Dockerfile for a Python FastAPI application."}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

Dedicated endpoints give you a persistent GPU with predictable latency. Pricing depends on the GPU type β€” an A10G instance for the 8B model runs roughly $1–2/hour.

OpenRouter

OpenRouter acts as a unified gateway to multiple model providers. One API key, one endpoint, access to Granite 4.1 alongside hundreds of other models.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY"
)

response = client.chat.completions.create(
    model="ibm-granite/granite-4.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Compare async/await in Python vs JavaScript."}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

OpenRouter pricing varies by provider and model. Check their pricing page for current Granite 4.1 rates. The advantage is flexibility β€” you can switch between Granite, Qwen, Llama, and proprietary models without changing your code.
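
A small sketch of that flexibility, reusing the client above (the non-Granite model identifiers below are illustrative; verify exact names in the OpenRouter model catalog):

# Same request against different models; only the model string changes.
# Non-Granite identifiers are illustrative examples, not verified catalog names.
for model in [
    "ibm-granite/granite-4.1-8b-instruct",
    "meta-llama/llama-3.1-8b-instruct",
    "qwen/qwen-2.5-7b-instruct",
]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize REST vs GraphQL in two sentences."}],
        max_tokens=256
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)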

Replicate

Replicate offers serverless inference β€” you pay per prediction with no idle costs.

import replicate

output = replicate.run(
    "ibm-granite/granite-4.1-8b-instruct",
    input={
        "prompt": "Write a Python script that monitors disk usage and sends alerts.",
        "max_tokens": 1024,
        "temperature": 0.7
    }
)

print("".join(output))

Replicate bills per second of compute time. Good for burst workloads where you do not want to maintain a persistent server.

Function calling

Granite 4.1 supports function calling (tool use) natively. The 30B model leads BFCL V3 at 73.68, and the 8B scores 68.27. Function calling works through the standard OpenAI tool-use format.

Defining tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., 'San Francisco, CA'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search a database table with a query",
            "parameters": {
                "type": "object",
                "properties": {
                    "table": {"type": "string", "description": "Table name"},
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results", "default": 10}
                },
                "required": ["table", "query"]
            }
        }
    }
]

Making a tool-use request

response = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": "What's the weather in Tokyo and find recent orders from the sales table?"}
    ],
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message

if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Handling tool responses

import json

# After executing the tool calls, feed results back
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    message,  # The assistant's tool call message
    {
        "role": "tool",
        "tool_call_id": message.tool_calls[0].id,
        "content": json.dumps({"temperature": 22, "unit": "celsius", "condition": "partly cloudy"})
    }
]

final_response = client.chat.completions.create(
    model="granite4.1:8b",
    messages=messages,
    tools=tools
)

print(final_response.choices[0].message.content)

Function calling works across all providers that support the OpenAI tools format β€” Ollama, vLLM, OpenRouter, and watsonx all handle it.
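
Putting the pieces together, a minimal dispatch loop looks like the sketch below. It reuses the client and tools objects from above; the get_weather and search_database implementations are stand-ins for your real weather API and database.

import json

# Stand-in implementations of the tools defined earlier.
def get_weather(location, unit="celsius"):
    return {"temperature": 22, "unit": unit, "condition": "partly cloudy"}

def search_database(table, query, limit=10):
    return {"table": table, "query": query, "rows": []}

TOOL_REGISTRY = {"get_weather": get_weather, "search_database": search_database}

def run_with_tools(messages):
    response = client.chat.completions.create(
        model="granite4.1:8b", messages=messages, tools=tools, tool_choice="auto"
    )
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content

    # Execute each requested tool and append the results as tool messages.
    messages = messages + [message]
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = TOOL_REGISTRY[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })

    # Second pass: the model turns tool results into a final answer.
    final = client.chat.completions.create(
        model="granite4.1:8b", messages=messages, tools=tools
    )
    return final.choices[0].message.content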

Working with the 512K context window

Granite 4.1’s 512K context window (8B and 30B models) lets you process entire codebases, long documents, or extended conversation histories in a single request. Here is how to use it effectively.

Sending long documents

# Read a large file
with open("large_codebase.txt", "r") as f:
    codebase = f.read()

response = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[
        {"role": "system", "content": "You are a code review assistant."},
        {"role": "user", "content": f"Review this codebase for security issues:\n\n{codebase}"}
    ],
    max_tokens=4096
)

Context window considerations

  • Token counting β€” 512K tokens is roughly 380K–400K words of English text, or about 1.5–2 million characters of code. Use tiktoken or the model’s tokenizer to count precisely.
  • Cost β€” Longer contexts mean more compute. On pay-per-token providers, a 512K input costs 64Γ— more than an 8K input.
  • Quality β€” Performance degrades at extreme context lengths. IBM’s RULER benchmark shows the 8B scoring 83.6 at 32K but dropping to 73.0 at 128K. Use the minimum context needed for your task.
  • Memory β€” The KV cache for 512K tokens requires significant VRAM. On local deployments, ensure your hardware can handle it.

For strategies to manage API costs with large context windows, see our guide on how to reduce LLM API costs.

Pricing comparison

Approximate costs as of April 2026 (prices change frequently):

| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Ollama (local) | $0 | $0 | Your hardware costs |
| watsonx.ai | ~$0.30–1.00 | ~$0.60–2.00 | Varies by plan |
| HuggingFace (serverless) | ~$0.10–0.30 | ~$0.30–0.60 | Rate-limited |
| HuggingFace (dedicated) | ~$1–2/hour | ~$1–2/hour | Fixed GPU cost |
| OpenRouter | ~$0.05–0.20 | ~$0.15–0.40 | Varies by upstream |
| Replicate | Per-second billing | Per-second billing | ~$0.001/sec |

For development and testing, Ollama is free. For production with predictable costs, dedicated HuggingFace endpoints or watsonx give you fixed pricing. For variable workloads, OpenRouter or Replicate’s per-use pricing avoids idle costs.
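
For a rough sanity check on per-request cost, multiply token counts by the per-token rates. The rates in this sketch are illustrative mid-range values from the table above, not quoted prices:

# Rough per-request cost estimate from approximate per-token rates (USD per 1M tokens).
# Rates are illustrative mid-range values, not quoted provider prices.
def estimate_cost(input_tokens, output_tokens, input_rate=0.50, output_rate=1.00):
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate

# A 512K-token input with a 4K-token answer vs. an 8K-token input:
print(f"512K context: ~${estimate_cost(512_000, 4_096):.3f}")
print(f"  8K context: ~${estimate_cost(8_000, 4_096):.3f}")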

Error handling and best practices

Retry logic

import time
from openai import OpenAI, APIError, RateLimitError

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="granite4.1:8b",
                messages=messages,
                max_tokens=1024
            )
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except APIError as e:
            if attempt == max_retries - 1:
                raise
            print(f"API error: {e}. Retrying...")
            time.sleep(1)
    raise Exception("Max retries exceeded")

Token counting

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.1-8b-instruct")

text = "Your input text here..."
tokens = tokenizer.encode(text)
print(f"Token count: {len(tokens)}")

# Check if it fits in context
max_context = 512_000
if len(tokens) > max_context - 4096:  # Leave room for output
    print("Input too long, truncating...")
    text = tokenizer.decode(tokens[:max_context - 4096])

Structured output

response = client.chat.completions.create(
    model="granite4.1:8b",
    messages=[
        {"role": "system", "content": "Always respond with valid JSON."},
        {"role": "user", "content": "List 3 Python web frameworks with their key features."}
    ],
    response_format={"type": "json_object"},
    max_tokens=1024
)

import json
data = json.loads(response.choices[0].message.content)

Note: JSON mode support depends on the provider. Ollama and vLLM support it natively. Check your provider’s documentation.
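
If your provider does not support response_format, a common fallback is to validate the output yourself and retry on a parse failure. A minimal sketch, reusing the client from earlier:

import json

def chat_json(messages, retries=1):
    for attempt in range(retries + 1):
        response = client.chat.completions.create(
            model="granite4.1:8b",
            messages=messages,
            max_tokens=1024
        )
        content = response.choices[0].message.content
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            # Ask the model to correct its own output and try again.
            messages = messages + [
                {"role": "assistant", "content": content},
                {"role": "user", "content": "That was not valid JSON. Respond again with only valid JSON."}
            ]
    raise ValueError("Model did not return valid JSON")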

FAQ

Which API provider should I use for Granite 4.1?

For development and testing, use Ollama locally β€” it is free, fast, and requires no authentication. For production with enterprise requirements (SLAs, compliance, governance), use watsonx.ai. For flexible cloud hosting without vendor lock-in, use HuggingFace Inference Endpoints. For multi-model applications where you want to switch between providers easily, use OpenRouter. For burst workloads with no idle costs, use Replicate.

Does Granite 4.1 function calling work with all API providers?

Function calling works with any provider that supports the OpenAI tools format. Ollama, vLLM, OpenRouter, and watsonx all support it. The quality of function calling depends on the model size β€” the 30B scores 73.68 on BFCL V3, the 8B scores 68.27, and the 3B scores 60.8. For production tool-use applications, the 8B or 30B instruct variants are recommended.

How do I switch between API providers without rewriting my code?

All providers use the OpenAI-compatible format. Use the OpenAI Python SDK and change only the base_url and api_key. Your message format, tool definitions, and streaming code stay the same. Store the base URL and API key in environment variables so switching is a config change, not a code change.

Can I use the full 512K context through the API?

Yes, but with caveats. The 512K context is supported by the 8B and 30B models (the 3B caps at 128K). On local Ollama, you need to configure the context size explicitly. On cloud providers, the maximum context may be limited by the provider’s configuration. watsonx and HuggingFace dedicated endpoints typically support the full context. OpenRouter and Replicate may have lower limits. Always check the provider’s documentation for maximum supported context length.

Is the Granite 4.1 API free?

The model weights are free under Apache 2.0. Running it locally via Ollama costs nothing beyond your electricity. Cloud API providers charge for compute β€” watsonx, HuggingFace, and Replicate all have per-token or per-hour pricing. OpenRouter sometimes offers free tiers for new models. The HuggingFace free Inference API is rate-limited but does not charge per request.

How does Granite 4.1 API latency compare to proprietary models?

On local Ollama with the 8B model, first-token latency is typically under 500ms and generation runs at 25–60 tokens per second depending on hardware. Cloud providers add network latency β€” expect 200–500ms additional round-trip time. Compared to proprietary APIs like GPT-4 or Claude, Granite 4.1 on dedicated infrastructure is often faster because you control the hardware and there is no shared queue. The dense architecture also helps β€” no MoE routing overhead means more predictable latency per token.