
Claude Fable 5 API Guide: Authentication, Pricing, and First Request (2026)


So you want to hit the Claude Fable 5 API directly. Maybe you’re building an AI-powered feature, running batch evaluations, or integrating it into your own coding tool. This guide gets you from zero to working API calls in both Python and TypeScript, with real examples for streaming, extended thinking, batch processing, and handling those annoying safeguard blocks.

The Essentials

Before we write any code, here’s what you need to know:

| Property | Value |
| --- | --- |
| Model ID | claude-fable-5 |
| Context window | 1,000,000 tokens |
| Max output | 128,000 tokens |
| Input pricing | $10 / million tokens |
| Output pricing | $50 / million tokens |
| Batch input | $5 / million tokens |
| Batch output | $25 / million tokens |
| Extended thinking | Supported |
| Data retention | 30 days |

For a full pricing comparison across models, check our AI API pricing comparison for 2026.

Authentication Setup

You’ll need an API key from the Anthropic Console. Once you have it:

export ANTHROPIC_API_KEY="sk-ant-your-key-here"

The API uses the x-api-key header for authentication and requires an anthropic-version header. If you’re using the official SDKs, these are handled for you.
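
If you call the API without an SDK, those two headers have to be set by hand. Here is a minimal sketch of what that looks like. It assumes the `https://api.anthropic.com/v1/messages` endpoint and the `2023-06-01` version string, neither of which appears in this guide, so check both against the current API reference. `build_request` is a hypothetical helper that only prepares the request and sends nothing.

```python
import json
import os
from typing import Optional

API_URL = "https://api.anthropic.com/v1/messages"  # assumed endpoint, verify before use
API_VERSION = "2023-06-01"  # assumed version string, verify before use


def build_request(prompt: str, api_key: Optional[str] = None) -> dict:
    """Return the URL, headers, and JSON body for one Messages API call."""
    key = api_key or os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise ValueError("No API key: set ANTHROPIC_API_KEY or pass api_key")
    headers = {
        "x-api-key": key,                  # authentication header
        "anthropic-version": API_VERSION,  # required version header
        "content-type": "application/json",
    }
    body = {
        "model": "claude-fable-5",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    return {"url": API_URL, "headers": headers, "body": json.dumps(body)}
```

The returned dictionary can be handed to whichever HTTP client you already use. Keeping request construction separate from sending also makes it easy to unit test without touching the network.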

Your First Request — Python

Install the SDK:

pip install anthropic

Basic completion:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-fable-5",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between a mutex and a semaphore. Include a practical example in Rust."
        }
    ]
)

print(message.content[0].text)

That’s the simplest possible call. The SDK picks up your ANTHROPIC_API_KEY from the environment automatically.

Your First Request — TypeScript

Install the SDK:

npm install @anthropic-ai/sdk

Basic completion:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-fable-5",
  max_tokens: 4096,
  messages: [
    {
      role: "user",
      content: "Explain the difference between a mutex and a semaphore. Include a practical example in Rust.",
    },
  ],
});

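// Content blocks are a union type, so narrow to a text block before reading .text
const first = message.content[0];
if (first.type === "text") {
  console.log(first.text);
}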

Same pattern — API key from environment, model ID set to claude-fable-5.

Streaming Responses

For long outputs (and Fable 5 can produce long outputs at 128K max), streaming is essential for good UX. Nobody wants to stare at a blank screen for 30 seconds.

Python Streaming

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-fable-5",
    max_tokens=8192,
    messages=[
        {
            "role": "user",
            "content": "Write a comprehensive guide to implementing a B-tree in Python with full insert, delete, and search operations."
        }
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

TypeScript Streaming

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-fable-5",
  max_tokens: 8192,
  messages: [
    {
      role: "user",
      content: "Write a comprehensive guide to implementing a B-tree in TypeScript with full insert, delete, and search operations.",
    },
  ],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}

Streaming doesn’t change pricing — you pay the same per token regardless of whether you stream or wait for the full response.

Extended Thinking

This is where Fable 5 gets interesting. Extended thinking lets the model reason through complex problems step-by-step before producing its final answer. The thinking happens in a separate content block that you can inspect (or discard).

Python with Extended Thinking

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-fable-5",
    max_tokens=16384,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {
            "role": "user",
            "content": "I have a race condition in my Go service. Two goroutines are reading and writing to a shared map concurrently. Sometimes the service panics with 'concurrent map writes'. What's the best fix considering I need high read throughput?"
        }
    ]
)

for block in message.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking}")
    elif block.type == "text":
        print(f"\n[Response]: {block.text}")

TypeScript with Extended Thinking

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-fable-5",
  max_tokens: 16384,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "I have a race condition in my Go service. Two goroutines are reading and writing to a shared map concurrently. Sometimes the service panics with 'concurrent map writes'. What's the best fix considering I need high read throughput?",
    },
  ],
});

for (const block of message.content) {
  if (block.type === "thinking") {
    console.log(`[Thinking]: ${block.thinking}`);
  } else if (block.type === "text") {
    console.log(`\n[Response]: ${block.text}`);
  }
}

The budget_tokens parameter caps how many tokens the model spends on thinking. Set it based on problem complexity — 5K for straightforward questions, 10-20K for complex debugging, up to 50K+ for deep architectural analysis.

Important: Thinking tokens count toward your output token costs at $50/M. A 10K thinking budget adds up to $0.50 per request if fully used. Be deliberate about when you enable it.
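
The arithmetic behind that figure, as a small sketch. The $50 rate comes from the pricing table. `worst_case_thinking_spend` is a hypothetical helper that assumes every request uses its full budget, which real traffic usually will not.

```python
OUTPUT_PRICE_PER_MTOK = 50.00  # USD per million output tokens, from the pricing table


def thinking_cost_usd(thinking_tokens: int) -> float:
    """Cost of thinking tokens, which are billed at the output rate."""
    return thinking_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000


def worst_case_thinking_spend(budget_tokens: int, requests: int) -> float:
    """Upper bound on thinking spend if every request exhausts its budget."""
    return thinking_cost_usd(budget_tokens) * requests
```

For a 10,000-token budget, `thinking_cost_usd(10_000)` gives $0.50, and 1,000 such requests cap out at $500. Treat that as a ceiling for planning, not a forecast.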

For understanding how reasoning impacts your overall costs, see our guide on reducing LLM API costs.

Batch API

If you’re processing large volumes — running evaluations, generating documentation, analyzing codebases — the Batch API halves your costs: $5/M input and $25/M output.

Python Batch Example

import time

import anthropic

client = anthropic.Anthropic()

# Message batches live under client.messages.batches in the Python SDK
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "review-1",
            "params": {
                "model": "claude-fable-5",
                "max_tokens": 4096,
                "messages": [
                    {"role": "user", "content": "Review this code for security issues: ..."}
                ]
            }
        },
        {
            "custom_id": "review-2",
            "params": {
                "model": "claude-fable-5",
                "max_tokens": 4096,
                "messages": [
                    {"role": "user", "content": "Review this code for performance issues: ..."}
                ]
            }
        }
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll until the batch has finished processing
while batch.processing_status != "ended":
    time.sleep(10)
    batch = client.messages.batches.retrieve(batch.id)

# Each request succeeds or fails on its own, so check the result type first
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}...")
    else:
        print(f"{result.custom_id}: {result.result.type}")

Batch requests don’t stream and may take minutes to hours depending on load. Perfect for overnight processing jobs, CI/CD pipelines, or any workload that doesn’t need real-time responses.
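
For real workloads you will usually generate the request list instead of writing it by hand. Here is a minimal sketch, where `build_batch_requests` is a hypothetical helper (not part of the SDK) that produces the same request shape as the example above:

```python
def build_batch_requests(
    prompts: dict[str, str],
    model: str = "claude-fable-5",
    max_tokens: int = 4096,
) -> list[dict]:
    """Turn {custom_id: prompt} into a list of batch request entries."""
    requests = []
    for custom_id, prompt in prompts.items():
        requests.append(
            {
                "custom_id": custom_id,  # used to match results back to inputs
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
        )
    return requests
```

Keying the input by `custom_id` guarantees the IDs are unique. It also means you can match each result back to its prompt by ID, so nothing depends on the order results come back in.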

Error Handling: Safeguard Blocks

Fable 5 has content safety filters that can trigger mid-response. When they do, you get a response with a stop_reason of "end_turn" but potentially incomplete content, or in some cases a 400 error with a specific error type.

Here’s how to handle them gracefully:

Python Error Handling

import anthropic

client = anthropic.Anthropic()

try:
    message = client.messages.create(
        model="claude-fable-5",
        max_tokens=4096,
        messages=[
            {"role": "user", "content": "Your prompt here"}
        ]
    )
    
    # Check stop reason
    if message.stop_reason == "max_tokens":
        print("Warning: Response was truncated. Consider increasing max_tokens.")
    
    print(message.content[0].text)

except anthropic.BadRequestError as e:
    if "content_policy" in str(e):
        print("Request was blocked by content safety filters.")
        print("Try rephrasing your request.")
    else:
        raise

except anthropic.RateLimitError:
    print("Rate limited. Implement exponential backoff.")
    # Implement retry logic

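except anthropic.APIConnectionError as e:
    # Network-level failures carry no HTTP status code
    print(f"Connection error: {e}")

except anthropic.APIStatusError as e:
    # Any other non-2xx response from the API
    print(f"API error: {e.status_code} - {e.message}")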

TypeScript Error Handling

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

try {
  const message = await client.messages.create({
    model: "claude-fable-5",
    max_tokens: 4096,
    messages: [{ role: "user", content: "Your prompt here" }],
  });

  if (message.stop_reason === "max_tokens") {
    console.warn("Response was truncated. Consider increasing max_tokens.");
  }

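  // Content blocks are a union type, so narrow to a text block before reading .text
  const first = message.content[0];
  if (first.type === "text") {
    console.log(first.text);
  }
} catch (error) {
  if (error instanceof Anthropic.BadRequestError) {
    // Not every 400 is a safety block, so log the message to tell them apart
    console.error(`Bad request (possibly a content safety block): ${error.message}`);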
  } else if (error instanceof Anthropic.RateLimitError) {
    console.error("Rate limited. Implement exponential backoff.");
  } else if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else {
    throw error;
  }
}

System Prompts and Context Engineering

With a 1M token context window, you can feed Fable 5 enormous amounts of context. But more context isn’t always better — it costs more and can dilute the signal. Here’s the pattern I use:

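import anthropic

client = anthropic.Anthropic()

# Placeholders: in practice, load these from your repository and test run
source_code = "..."
test_output = "..."

message = client.messages.create(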
    model="claude-fable-5",
    max_tokens=8192,
    system="You are a senior backend engineer specializing in distributed systems. Be concise. Prioritize correctness over completeness.",
    messages=[
        {
            "role": "user",
            "content": f"Here's the relevant source code:\n\n{source_code}\n\nThe test failure is:\n\n{test_output}\n\nIdentify the root cause and provide a fix."
        }
    ]
)

Key principles for context engineering with Fable 5:

  • Put the most important information first
  • Use clear delimiters between different pieces of context
  • Tell the model what you want before giving it the context to analyze
  • Use the system prompt for persistent instructions, user messages for task-specific content
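
One way to apply the first three principles is a small prompt builder. This is a sketch under assumptions: `build_prompt` is a hypothetical helper, and the XML-style tags are one common delimiter convention, not something the API requires.

```python
def build_prompt(task: str, sections: dict[str, str]) -> str:
    """State the task first, then wrap each piece of context in a named tag."""
    parts = [task.strip(), ""]
    for name, text in sections.items():
        # Each section gets its own clearly delimited block
        parts.append(f"<{name}>\n{text.strip()}\n</{name}>")
    return "\n".join(parts)


prompt = build_prompt(
    "Identify the root cause of the test failure and provide a fix.",
    {"source_code": "def add(a, b):\n    return a - b", "test_output": "assert add(1, 2) == 3 failed"},
)
```

Sections are emitted in the order you pass them, so list the most relevant context first. The persistent instructions still belong in the system prompt, and the built string goes in the user message.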

Prompt Caching for Cost Reduction

If you’re making multiple requests with the same system prompt or context prefix, prompt caching can cut input costs substantially: tokens read from the cache are billed at 10% of the normal input rate.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-fable-5",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant. Here are the project coding standards: ...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Review this pull request: ..."}
    ]
)

The cache_control marker tells the API to cache everything up to that point. Subsequent requests with the same prefix hit the cache. This is covered in detail in our prompt caching explained guide.

Pricing Strategies

For production usage, here’s how to keep costs sane:

  1. Use the Batch API for anything that doesn’t need real-time responses (50% savings)
  2. Enable prompt caching for repeated system prompts (up to 90% savings on cached tokens)
  3. Set appropriate max_tokens — don’t default to 128K if you expect 2K responses
  4. Use extended thinking selectively — only enable it for problems that genuinely benefit from reasoning
  5. Implement request-level budgets — track spend per API call and alert on anomalies
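
The last item can start very small. Below is a sketch of a per-request spend tracker that uses the prices from the table in this guide. `request_cost` and `SpendTracker` are hypothetical names, and the sketch ignores prompt-caching discounts for simplicity.

```python
from dataclasses import dataclass, field

# USD per million tokens, taken from the pricing table in this guide
PRICES = {
    "standard": {"input": 10.00, "output": 50.00},
    "batch": {"input": 5.00, "output": 25.00},
}


def request_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Cost in USD of one request at the given pricing tier."""
    price = PRICES[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


@dataclass
class SpendTracker:
    per_request_limit_usd: float
    total_usd: float = 0.0
    alerts: list = field(default_factory=list)

    def record(self, request_id: str, input_tokens: int, output_tokens: int,
               tier: str = "standard") -> float:
        """Add one request to the running total and flag it if it exceeds the limit."""
        cost = request_cost(input_tokens, output_tokens, tier)
        self.total_usd += cost
        if cost > self.per_request_limit_usd:
            self.alerts.append((request_id, cost))
        return cost
```

With the SDK, the token counts would typically come from the `usage` field on each response (`message.usage.input_tokens` and `message.usage.output_tokens`). In production, send the alerts to your existing monitoring rather than keeping them in a list.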

For a comprehensive cost management strategy, see how to reduce LLM API costs and monitoring AI API spending.

Comparing Direct API vs. OpenRouter

You can also access Fable 5 through OpenRouter, which provides unified access and fallback routing. The trade-off: slightly higher latency and potentially a small markup, but you gain multi-provider resilience. See our OpenRouter vs Direct API comparison for the full breakdown.

Frequently Asked Questions

What’s the difference between claude-fable-5 and claude-opus-4 for API usage?

Fable 5 scores higher on benchmarks (95% on SWE-bench Verified), has a larger maximum output (128K tokens), and supports extended thinking. It costs $10 per million input tokens and $50 per million output tokens, so check that against Opus 4’s current rates for your workload. For coding and reasoning tasks, Fable 5 is the stronger choice. For a detailed comparison, see our Claude Opus 4 guide.

How does the 30-day data retention affect my API usage?

All requests and responses sent through the Claude API are retained for 30 days by Anthropic. This applies to all traffic regardless of plan. If you’re handling sensitive code or proprietary data, factor this into your compliance decisions.

Can I use extended thinking with streaming?

Yes. When streaming with extended thinking enabled, you’ll receive thinking content blocks before text content blocks. The thinking streams first, then the final response streams. This gives you real-time visibility into the model’s reasoning process.
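
Here is a sketch of how to separate the two kinds of content while consuming a stream. It assumes thinking arrives as `content_block_delta` events whose delta has type `thinking_delta` and a `thinking` attribute, mirroring the `text_delta` shape used in the streaming example. Verify those names against the current streaming reference. The stand-in events let the sketch run offline.

```python
from types import SimpleNamespace


def split_stream(events):
    """Collect thinking and text deltas from an iterable of stream events."""
    thinking, text = [], []
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # skip message_start, content_block_stop, and similar events
        delta = event.delta
        if delta.type == "thinking_delta":  # assumed event shape
            thinking.append(delta.thinking)
        elif delta.type == "text_delta":
            text.append(delta.text)
    return "".join(thinking), "".join(text)


# Stand-in events with the assumed shape, in place of a live stream
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="thinking_delta", thinking="Reads dominate, ")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="thinking_delta", thinking="so a read-write lock fits.")),
    SimpleNamespace(type="content_block_delta",
                    delta=SimpleNamespace(type="text_delta", text="Use sync.RWMutex.")),
]

thinking_text, answer_text = split_stream(fake_events)
```

In a real UI you would render each delta as it arrives instead of buffering it. The same type checks apply either way.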

What happens when I hit rate limits?

Anthropic enforces rate limits based on your plan tier. When you hit one, you’ll receive a 429 status code. Implement exponential backoff with jitter. For high-throughput workloads, consider the Batch API, which has separate, higher limits.
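
Exponential backoff with jitter can be sketched as a generic retry wrapper. `retry_with_backoff` is a hypothetical helper and the default delays are illustrative. It uses the "full jitter" variant, sleeping for a random time between zero and the current exponential cap.

```python
import random
import time


def retry_with_backoff(call, is_retryable, max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(), retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on non-retryable errors or once attempts are exhausted
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter: anywhere in [0, cap]
```

With the SDK, the call might look like `retry_with_backoff(lambda: client.messages.create(...), lambda e: isinstance(e, anthropic.RateLimitError))`. The injectable `sleep` argument exists so the wrapper can be tested without waiting.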

Is the Batch API suitable for time-sensitive workloads?

No. Batch processing can take minutes to hours depending on queue depth. Use it for overnight jobs, bulk processing, evaluation runs, or any workload where latency doesn’t matter. The 50% cost savings make it worth restructuring workloads around when possible.

How do I estimate costs before making requests?

Count your input tokens (system prompt + messages + context) and estimate output tokens based on task complexity. The SDK exposes a token-counting call, client.messages.count_tokens(model=..., messages=[...]), which returns the input token count for a request without running it. For budget planning across multiple models, see our AI API pricing comparison.

What’s Next

You’ve got the fundamentals. From here, you can integrate Fable 5 into your own tools, CI pipelines, or applications. The model’s combination of 1M context, 128K output, and extended thinking makes it capable of tasks that simply weren’t possible with smaller models — full codebase analysis, end-to-end feature implementation, and deep architectural reasoning.

Start with the free tier through June 22 if you’re on a Pro, Max, Team, or Enterprise plan. Once you’ve calibrated your usage patterns, you’ll know exactly where Fable 5’s quality-to-cost ratio makes sense for your workflow.