Claude Fable 5 API Guide: Authentication, Pricing, and First Request (2026)
So you want to hit the Claude Fable 5 API directly. Maybe you’re building an AI-powered feature, running batch evaluations, or integrating it into your own coding tool. This guide gets you from zero to working API calls in both Python and TypeScript, with real examples for streaming, extended thinking, batch processing, and handling those annoying safeguard blocks.
The Essentials
Before we write any code, here’s what you need to know:
| Property | Value |
|---|---|
| Model ID | claude-fable-5 |
| Context window | 1,000,000 tokens |
| Max output | 128,000 tokens |
| Input pricing | $10 / million tokens |
| Output pricing | $50 / million tokens |
| Batch input | $5 / million tokens |
| Batch output | $25 / million tokens |
| Extended thinking | Supported |
| Data retention | 30 days |
For a full pricing comparison across models, check our AI API pricing comparison for 2026.
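To make the table concrete, here is a minimal sketch of a per-request cost estimator built from the prices listed above. The helper name `estimate_cost_usd` and its parameters are illustrative and not part of any SDK.

```python
# Prices in USD per million tokens, copied from the table above.
PRICES = {
    "standard": {"input": 10.0, "output": 50.0},
    "batch": {"input": 5.0, "output": 25.0},
}


def estimate_cost_usd(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the cost of one request (hypothetical helper, not an SDK call)."""
    tier = PRICES["batch" if batch else "standard"]
    return (input_tokens * tier["input"] + output_tokens * tier["output"]) / 1_000_000


# Example: 20K tokens in, 2K tokens out
print(f"standard: ${estimate_cost_usd(20_000, 2_000):.2f}")              # standard: $0.30
print(f"batch:    ${estimate_cost_usd(20_000, 2_000, batch=True):.2f}")  # batch:    $0.15
```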
Authentication Setup
You’ll need an API key from the Anthropic Console. Once you have it:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
The API uses the x-api-key header for authentication and requires an anthropic-version header. If you’re using the official SDKs, these are handled for you.
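If you are curious what the SDK does for you, here is a rough sketch that builds (but does not send) a raw request with both headers set, using only the standard library. The endpoint URL and the version string `2023-06-01` are assumptions on my part, so check them against the official API reference before relying on them. `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"  # assumed endpoint
API_VERSION = "2023-06-01"  # assumed value for the anthropic-version header


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a Messages API request with explicit headers."""
    body = {
        "model": "claude-fable-5",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": API_VERSION,
            "content-type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` with a real key would send it. In practice the SDK is the easier route.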
Your First Request — Python
Install the SDK:
pip install anthropic
Basic completion:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-fable-5",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between a mutex and a semaphore. Include a practical example in Rust."
        }
    ]
)

print(message.content[0].text)
That’s the simplest possible call. The SDK picks up your ANTHROPIC_API_KEY from the environment automatically.
Your First Request — TypeScript
Install the SDK:
npm install @anthropic-ai/sdk
Basic completion:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-fable-5",
  max_tokens: 4096,
  messages: [
    {
      role: "user",
      content: "Explain the difference between a mutex and a semaphore. Include a practical example in Rust.",
    },
  ],
});

console.log(message.content[0].text);
Same pattern — API key from environment, model ID set to claude-fable-5.
Streaming Responses
For long outputs (and Fable 5 can produce long outputs at 128K max), streaming is essential for good UX. Nobody wants to stare at a blank screen for 30 seconds.
Python Streaming
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-fable-5",
    max_tokens=8192,
    messages=[
        {
            "role": "user",
            "content": "Write a comprehensive guide to implementing a B-tree in Python with full insert, delete, and search operations."
        }
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
TypeScript Streaming
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-fable-5",
  max_tokens: 8192,
  messages: [
    {
      role: "user",
      content: "Write a comprehensive guide to implementing a B-tree in TypeScript with full insert, delete, and search operations.",
    },
  ],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}
Streaming doesn’t change pricing — you pay the same per token regardless of whether you stream or wait for the full response.
Extended Thinking
This is where Fable 5 gets interesting. Extended thinking lets the model reason through complex problems step-by-step before producing its final answer. The thinking happens in a separate content block that you can inspect (or discard).
Python with Extended Thinking
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-fable-5",
    max_tokens=16384,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {
            "role": "user",
            "content": "I have a race condition in my Go service. Two goroutines are reading and writing to a shared map concurrently. Sometimes the service panics with 'concurrent map writes'. What's the best fix considering I need high read throughput?"
        }
    ]
)

for block in message.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking}")
    elif block.type == "text":
        print(f"\n[Response]: {block.text}")
TypeScript with Extended Thinking
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-fable-5",
  max_tokens: 16384,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "I have a race condition in my Go service. Two goroutines are reading and writing to a shared map concurrently. Sometimes the service panics with 'concurrent map writes'. What's the best fix considering I need high read throughput?",
    },
  ],
});

for (const block of message.content) {
  if (block.type === "thinking") {
    console.log(`[Thinking]: ${block.thinking}`);
  } else if (block.type === "text") {
    console.log(`\n[Response]: ${block.text}`);
  }
}
The budget_tokens parameter caps how many tokens the model spends on thinking. Set it based on problem complexity — 5K for straightforward questions, 10-20K for complex debugging, up to 50K+ for deep architectural analysis.
Important: Thinking tokens count toward your output token costs at $50/M. A 10K thinking budget adds up to $0.50 per request if fully used. Be deliberate about when you enable it.
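The arithmetic behind that figure is easy to check. This small sketch computes the worst-case thinking cost for a given budget and sanity-checks the budget against `max_tokens`. Both helper names are hypothetical, and the rule that the budget must stay below `max_tokens` is an assumption about how the API validates requests.

```python
OUTPUT_PRICE_PER_MTOK = 50.0  # USD per million output tokens, from the pricing table


def thinking_cost_ceiling(budget_tokens: int) -> float:
    """Worst-case USD cost of thinking alone if the whole budget is used."""
    return budget_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000


def check_budget(budget_tokens: int, max_tokens: int) -> None:
    """Reject budgets that leave no room for the final answer.

    Assumption: the thinking budget must be smaller than max_tokens.
    """
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")


for budget in (5_000, 10_000, 50_000):
    print(f"{budget:>6} thinking tokens -> up to ${thinking_cost_ceiling(budget):.2f}")
```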
For understanding how reasoning impacts your overall costs, see our guide on reducing LLM API costs.
Batch API
If you’re processing large volumes — running evaluations, generating documentation, analyzing codebases — the Batch API halves your costs: $5/M input and $25/M output.
Python Batch Example
import time

import anthropic

client = anthropic.Anthropic()

# Submit both requests as a single batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "review-1",
            "params": {
                "model": "claude-fable-5",
                "max_tokens": 4096,
                "messages": [
                    {"role": "user", "content": "Review this code for security issues: ..."}
                ]
            }
        },
        {
            "custom_id": "review-2",
            "params": {
                "model": "claude-fable-5",
                "max_tokens": 4096,
                "messages": [
                    {"role": "user", "content": "Review this code for performance issues: ..."}
                ]
            }
        }
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll until the batch has finished processing
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    time.sleep(10)

# Each result has its own outcome; only succeeded ones carry a message
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}...")
    else:
        print(f"{result.custom_id}: {result.result.type}")
Batch requests don’t stream and may take minutes to hours depending on load. Perfect for overnight processing jobs, CI/CD pipelines, or any workload that doesn’t need real-time responses.
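Writing each batch entry by hand gets tedious past a handful of prompts. Here is a minimal sketch that generates the `requests` list from plain strings. `build_batch_requests` is a hypothetical helper, and the dictionary shape simply mirrors the entries in the example above.

```python
def build_batch_requests(prompts, prefix="review", max_tokens=4096):
    """Turn a list of prompt strings into batch request entries.

    Each entry gets a unique custom_id so results can be matched back
    to the prompt that produced them.
    """
    return [
        {
            "custom_id": f"{prefix}-{i}",
            "params": {
                "model": "claude-fable-5",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts, start=1)
    ]


requests = build_batch_requests([
    "Review this code for security issues: ...",
    "Review this code for performance issues: ...",
])
print([r["custom_id"] for r in requests])  # ['review-1', 'review-2']
```

The resulting list can be passed as the `requests` argument when creating the batch.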
Error Handling: Safeguard Blocks
Fable 5 has content safety filters that can trigger mid-response. When they do, you get a response with a stop_reason of "end_turn" but potentially incomplete content, or in some cases a 400 error with a specific error type.
Here’s how to handle them gracefully:
Python Error Handling
import anthropic

client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-fable-5",
        max_tokens=4096,
        messages=[
            {"role": "user", "content": "Your prompt here"}
        ]
    )
    # Check stop reason
    if message.stop_reason == "max_tokens":
        print("Warning: Response was truncated. Consider increasing max_tokens.")
    print(message.content[0].text)
except anthropic.BadRequestError as e:
    if "content_policy" in str(e):
        print("Request was blocked by content safety filters.")
        print("Try rephrasing your request.")
    else:
        raise
except anthropic.RateLimitError:
    print("Rate limited. Implement exponential backoff.")
    # Implement retry logic
except anthropic.APIStatusError as e:
    # APIStatusError always carries an HTTP status code; the base APIError
    # also covers connection failures, which have none
    print(f"API error: {e.status_code} - {e.message}")
TypeScript Error Handling
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: "claude-fable-5",
    max_tokens: 4096,
    messages: [{ role: "user", content: "Your prompt here" }],
  });
  if (message.stop_reason === "max_tokens") {
    console.warn("Response was truncated. Consider increasing max_tokens.");
  }
  console.log(message.content[0].text);
} catch (error) {
  if (error instanceof Anthropic.BadRequestError) {
    // A 400 can also mean a malformed request, so surface the message
    console.error(`Bad request (possibly blocked by content safety filters): ${error.message}`);
  } else if (error instanceof Anthropic.RateLimitError) {
    console.error("Rate limited. Implement exponential backoff.");
  } else if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}
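Both handlers above leave the retry logic as an exercise. Here is one way it could look: a sketch of exponential backoff with "full jitter", written against plain callables so it runs without the SDK. The names `backoff_delays` and `call_with_retry` are hypothetical, and the defaults (five retries, 1 second base, 30 second cap) are arbitrary starting points rather than recommended values.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0, rng=random.random):
    """Yield one delay per retry: a random fraction of min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, retry_on=(Exception,), max_retries=5, sleep=time.sleep):
    """Call fn(), sleeping and retrying on the given exception types.

    Makes up to max_retries + 1 attempts; the last failure propagates.
    """
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except retry_on:
            sleep(delay)
    return fn()  # final attempt, no more retries
```

With the Python SDK you would pass something like `retry_on=(anthropic.RateLimitError,)` and wrap the `client.messages.create(...)` call in a lambda. The official SDKs also retry some failures on their own, so check their retry settings before layering this on top.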
System Prompts and Context Engineering
With a 1M token context window, you can feed Fable 5 enormous amounts of context. But more context isn’t always better — it costs more and can dilute the signal. Here’s the pattern I use:
# Assumes client is set up as in the earlier examples, and that
# source_code and test_output are strings you have already loaded
message = client.messages.create(
    model="claude-fable-5",
    max_tokens=8192,
    system="You are a senior backend engineer specializing in distributed systems. Be concise. Prioritize correctness over completeness.",
    messages=[
        {
            "role": "user",
            "content": f"Here's the relevant source code:\n\n{source_code}\n\nThe test failure is:\n\n{test_output}\n\nIdentify the root cause and provide a fix."
        }
    ]
)
Key principles for context engineering with Fable 5:
- Put the most important information first
- Use clear delimiters between different pieces of context
- Tell the model what you want before giving it the context to analyze
- Use the system prompt for persistent instructions, user messages for task-specific content
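The principles above can be sketched as a small prompt builder that states the task first and wraps each piece of context in its own delimiter. The XML-style tags are just one delimiter convention I find readable, and `build_prompt` is a hypothetical helper, not something the SDK provides.

```python
def build_prompt(task: str, sections: dict[str, str]) -> str:
    """Put the instruction first, then each piece of context in its own tag."""
    parts = [task.strip(), ""]
    for name, body in sections.items():
        # One clearly delimited block per piece of context
        parts.append(f"<{name}>\n{body.strip()}\n</{name}>")
    return "\n".join(parts)


prompt = build_prompt(
    "Identify the root cause of the test failure and provide a fix.",
    {
        "source_code": "def add(a, b):\n    return a - b",
        "test_output": "AssertionError: add(2, 2) returned 0, expected 4",
    },
)
print(prompt)
```

The returned string goes into the user message, while persistent instructions stay in the system prompt.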
Prompt Caching for Cost Reduction
If you’re making multiple requests with the same system prompt or context prefix, prompt caching can dramatically reduce costs. Cached tokens are billed at 10% of the normal input rate.
message = client.messages.create(
    model="claude-fable-5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant. Here are the project coding standards: ...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this pull request: ..."}
    ]
)
The cache_control marker tells the API to cache everything up to that point. Subsequent requests with the same prefix hit the cache. This is covered in detail in our prompt caching explained guide.
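To get a feel for the savings, here is a back-of-the-envelope sketch comparing the input cost of a shared prefix with and without caching. Two assumptions are baked in and labeled in the code: the first request pays a cache-write premium (the 1.25x multiplier is my guess, since this guide only states the 10% read rate), and every later request hits the cache before it expires. `prefix_cost_usd` is a hypothetical helper.

```python
INPUT_PRICE_PER_MTOK = 10.0    # USD per million input tokens, from the pricing table
CACHE_READ_MULTIPLIER = 0.10   # "10% of the normal input rate", per the text above
CACHE_WRITE_MULTIPLIER = 1.25  # ASSUMPTION: write premium, not stated in this guide


def prefix_cost_usd(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Input cost of sending the same prefix across a number of requests."""
    base = prefix_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    if not cached or requests == 0:
        return base * requests
    # First request writes the cache; the remaining ones are assumed to read it
    return base * CACHE_WRITE_MULTIPLIER + base * CACHE_READ_MULTIPLIER * (requests - 1)


uncached = prefix_cost_usd(50_000, 100, cached=False)
cached = prefix_cost_usd(50_000, 100, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

This covers only the shared prefix. The per-request user message and all output tokens are billed as usual.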
Pricing Strategies
For production usage, here’s how to keep costs sane:
- Use the Batch API for anything that doesn’t need real-time responses (50% savings)
- Enable prompt caching for repeated system prompts (up to 90% savings on cached tokens)
- Set appropriate max_tokens — don’t default to 128K if you expect 2K responses
- Use extended thinking selectively — only enable it for problems that genuinely benefit from reasoning
- Implement request-level budgets — track spend per API call and alert on anomalies
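For the last point, a request-level budget can start as something very small. This sketch records the cost of each call and flags any that exceed a per-request limit. `SpendTracker` is a hypothetical class, and the default prices are the standard rates from the table at the top of this guide.

```python
from dataclasses import dataclass, field


@dataclass
class SpendTracker:
    """Track spend per request and flag calls above a cost limit."""

    per_request_limit_usd: float
    input_price: float = 10.0   # USD per million input tokens
    output_price: float = 50.0  # USD per million output tokens
    total_usd: float = 0.0
    flagged: list = field(default_factory=list)

    def record(self, request_id: str, input_tokens: int, output_tokens: int) -> float:
        """Add one request to the running total and return its cost."""
        cost = (input_tokens * self.input_price + output_tokens * self.output_price) / 1_000_000
        self.total_usd += cost
        if cost > self.per_request_limit_usd:
            self.flagged.append((request_id, cost))
        return cost


tracker = SpendTracker(per_request_limit_usd=0.25)
tracker.record("summarize-1", input_tokens=10_000, output_tokens=1_000)
tracker.record("refactor-7", input_tokens=10_000, output_tokens=10_000)
print(tracker.total_usd, tracker.flagged)
```

In a real integration the token counts would come from the `usage` field of each response (`message.usage.input_tokens` and `message.usage.output_tokens`).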
For a comprehensive cost management strategy, see how to reduce LLM API costs and monitoring AI API spending.
Comparing Direct API vs. OpenRouter
You can also access Fable 5 through OpenRouter, which provides unified access and fallback routing. The trade-off: slightly higher latency and potentially a small markup, but you gain multi-provider resilience. See our OpenRouter vs Direct API comparison for the full breakdown.
Frequently Asked Questions
What’s the difference between claude-fable-5 and claude-opus-4 for API usage?
Fable 5 scores higher on benchmarks (95% on SWE-bench Verified), has a larger max output (128K tokens), and supports extended thinking. Pricing is $10/$50 per million input/output tokens, so compare that against Opus 4’s current rates before deciding. For coding and reasoning tasks, Fable 5 is the stronger choice. For a detailed comparison, see our Claude Opus 4 guide.
How does the 30-day data retention affect my API usage?
All requests and responses sent through the Claude API are retained for 30 days by Anthropic. This applies to all traffic regardless of plan. If you’re handling sensitive code or proprietary data, factor this into your compliance decisions.
Can I use extended thinking with streaming?
Yes. When streaming with extended thinking enabled, you’ll receive thinking content blocks before text content blocks. The thinking streams first, then the final response streams. This gives you real-time visibility into the model’s reasoning process.
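As a rough illustration of that ordering, this sketch separates thinking from answer text while consuming a stream of events. It runs against stand-in objects rather than the SDK, and the delta type names (`thinking_delta`, `text_delta`) and attribute names are my assumption about what the SDK emits, so verify them against a real stream. `split_stream` and `fake_delta` are hypothetical helpers.

```python
from types import SimpleNamespace


def split_stream(events):
    """Accumulate thinking and answer text separately from streamed events."""
    thinking, answer = [], []
    for event in events:
        if event.type != "content_block_delta":
            continue  # ignore start/stop and other bookkeeping events
        if event.delta.type == "thinking_delta":
            thinking.append(event.delta.thinking)
        elif event.delta.type == "text_delta":
            answer.append(event.delta.text)
    return "".join(thinking), "".join(answer)


def fake_delta(kind, **fields):
    """Stand-in for an SDK event object, for illustration only."""
    return SimpleNamespace(type="content_block_delta", delta=SimpleNamespace(type=kind, **fields))


events = [
    fake_delta("thinking_delta", thinking="A sync.RWMutex allows "),
    fake_delta("thinking_delta", thinking="many concurrent readers."),
    fake_delta("text_delta", text="Use sync.RWMutex."),
]
thinking, answer = split_stream(events)
print(f"[Thinking]: {thinking}")
print(f"[Response]: {answer}")
```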
What happens when I hit rate limits?
Anthropic enforces rate limits based on your plan tier. When hit, you’ll receive a 429 status code. Implement exponential backoff with jitter. For high-throughput workloads, consider the Batch API which has separate, higher limits.
Is the Batch API suitable for time-sensitive workloads?
No. Batch processing can take minutes to hours depending on queue depth. Use it for overnight jobs, bulk processing, evaluation runs, or any workload where latency doesn’t matter. The 50% cost savings make it worth restructuring workloads around when possible.
How do I estimate costs before making requests?
Count your input tokens (system prompt + messages + context) and estimate output tokens based on task complexity. The SDK exposes a token-counting call that takes the same arguments as a normal request: client.messages.count_tokens(model=..., messages=[...]). It returns the input token count without running the model. For budget planning across multiple models, see our AI API pricing comparison.
What’s Next
You’ve got the fundamentals. From here, you can integrate Fable 5 into your own tools, CI pipelines, or applications. The model’s combination of 1M context, 128K output, and extended thinking makes it capable of tasks that simply weren’t possible with smaller models — full codebase analysis, end-to-end feature implementation, and deep architectural reasoning.
Start with the free tier through June 22 if you’re on a Pro, Max, Team, or Enterprise plan. Once you’ve calibrated your usage patterns, you’ll know exactly where Fable 5’s quality-to-cost ratio makes sense for your workflow.