
DeepSeek V4 API Guide: Setup, Pricing, Thinking Modes, and Code Examples (2026)


DeepSeek V4 ships two API models: deepseek-v4-pro and deepseek-v4-flash. Both are OpenAI SDK-compatible, support a 1M-token context window, and offer three thinking modes. This guide covers everything you need to start building with the V4 API, from pricing and caching to code examples and migration from V3.

For model-specific deep dives, see the V4 Pro guide and the V4 Flash guide.

Model IDs and Endpoints

The DeepSeek V4 API is served at https://api.deepseek.com. Both models use the same base URL and authentication method. The only difference is the model ID you pass in your request.

| Model | Model ID | Best For |
|---|---|---|
| V4 Pro | deepseek-v4-pro | Complex reasoning, long-context analysis, code generation |
| V4 Flash | deepseek-v4-flash | Fast responses, high-volume workloads, cost-sensitive apps |

You can also access both models through OpenRouter and other compatible providers.

Pricing

All prices are per 1M tokens. DeepSeek uses a cache hit/miss pricing model for input tokens, which can dramatically reduce costs for repetitive workloads. For a broader comparison, see AI API pricing compared.

| Model | Input (Cache Hit) | Input (Cache Miss) | Output |
|---|---|---|---|
| V4-Flash | $0.028 | $0.14 | $0.28 |
| V4-Pro | $0.145 | $1.74 | $3.48 |

V4-Flash is one of the cheapest capable models on the market. V4-Pro costs more but delivers stronger reasoning performance, especially in Think Max mode.

How Cache Hits Work

DeepSeek caches the KV representations of your input tokens on their servers. When you send a request that shares a prefix with a recent request (same system prompt, same conversation history up to a point), those shared tokens are served from cache at the lower β€œcache hit” price.

In practice, this means:

  • Repeated system prompts across requests in the same session are cached automatically.
  • Multi-turn conversations benefit because earlier turns are already cached.
  • Batch processing with identical instructions sees up to 90% savings on input costs.

You do not need to enable caching. It happens automatically. The API response includes a cache_hit_tokens field so you can track your savings.

{
  "usage": {
    "prompt_tokens": 4500,
    "cache_hit_tokens": 4000,
    "completion_tokens": 320
  }
}

In this example, 4000 of 4500 input tokens were served from cache. Only 500 tokens were charged at the cache miss rate. For tips on maximizing cache hits, read how to reduce LLM API costs.
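
To see what caching saves, here is a quick sketch that prices that usage block with the V4-Flash rates from the pricing table above (all rates are per 1M tokens):

```python
# Price a single request from its usage block, using the V4-Flash
# rates from the pricing table (all rates are per 1M tokens).
CACHE_HIT, CACHE_MISS, OUTPUT = 0.028, 0.14, 0.28

def request_cost(usage: dict) -> float:
    hit = usage["cache_hit_tokens"]
    miss = usage["prompt_tokens"] - hit          # tokens billed at the miss rate
    out = usage["completion_tokens"]
    return (hit * CACHE_HIT + miss * CACHE_MISS + out * OUTPUT) / 1_000_000

usage = {"prompt_tokens": 4500, "cache_hit_tokens": 4000, "completion_tokens": 320}
print(f"${request_cost(usage):.6f}")  # prints $0.000272
```

Without caching, the input alone would cost 4500 Γ— $0.14 / 1M β‰ˆ $0.00063; here the cached prefix cuts input cost by roughly 71% on this one call.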

Thinking Modes

Both V4-Pro and V4-Flash support three thinking modes that control how much internal reasoning the model performs before responding. For a full breakdown, see the V4 thinking modes guide.

| Mode | Behavior | Token Overhead | When to Use |
|---|---|---|---|
| Non-think | No chain-of-thought; direct answer. | None | Simple lookups, formatting, low-latency needs |
| Think High | Moderate internal reasoning. | ~2x output tokens | Most coding and analysis tasks |
| Think Max | Extended reasoning with self-verification. | ~4x output tokens | Math proofs, complex debugging, research |

You control the thinking mode by setting the thinking parameter in your request.

Non-think Mode

Use this when you need fast, direct answers without internal reasoning.

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={"thinking": "none"}
)

print(response.choices[0].message.content)

Think High Mode

Good default for most tasks. The model reasons internally before answering.

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are an expert Python developer."},
        {"role": "user", "content": "Write a function that finds all prime factors of a number."}
    ],
    extra_body={"thinking": "high"}
)

print(response.choices[0].message.content)

Think Max Mode

Maximum reasoning depth. The model spends significantly more tokens on internal chain-of-thought and self-checks. Best for hard problems.

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a mathematics researcher."},
        {"role": "user", "content": "Prove that there are infinitely many primes of the form 4k+3."}
    ],
    extra_body={"thinking": "max"}
)

print(response.choices[0].message.content)

Important: Think Max can consume a large number of tokens for its internal reasoning. DeepSeek recommends reserving at least 384K tokens of context for Think Max output on complex problems. If your input is already long, consider using Think High instead.

Python Setup (OpenAI SDK)

DeepSeek V4 is fully compatible with the OpenAI Python SDK. Install it and point the client at the DeepSeek base URL.

pip install openai

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

Streaming Example

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python decorators with examples."}
    ],
    stream=True,
    extra_body={"thinking": "high"}
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Multi-turn Conversation

messages = [
    {"role": "system", "content": "You are a senior software architect."},
    {"role": "user", "content": "Design a rate limiter for a REST API."},
]

# First turn
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    extra_body={"thinking": "high"}
)

assistant_reply = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_reply})

# Second turn (system prompt + first turn cached)
messages.append({"role": "user", "content": "Now add Redis-based distributed support."})

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    extra_body={"thinking": "high"}
)

print(response.choices[0].message.content)

The second request benefits from cache hits on the system prompt and the first user/assistant turn.

Context Window

Both V4-Pro and V4-Flash support a 1M-token context window. This is large enough for entire codebases, long documents, or extended multi-turn sessions.

Practical recommendations:

  • Non-think mode: You can use close to the full 1M tokens for input.
  • Think High mode: Reserve roughly 100K tokens for reasoning and output.
  • Think Max mode: Reserve at least 384K tokens. This means your input should stay under roughly 600K tokens for best results.

If you exceed the context window, the API returns a 400 error with a context_length_exceeded code.
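
The reservations above can be turned into a rough pre-flight check before you send a request. This sketch assumes roughly 4 characters per token, which is a heuristic rather than DeepSeek's actual tokenizer:

```python
# Rough pre-flight budget check for the 1M-token context window.
# Per-mode reserves follow the recommendations above; the
# 4-characters-per-token ratio is only an approximation.
CONTEXT_WINDOW = 1_000_000
RESERVE = {"none": 0, "high": 100_000, "max": 384_000}

def fits_budget(prompt: str, thinking: str = "high", chars_per_token: int = 4) -> bool:
    """Estimate input tokens and verify there is enough headroom
    left for the chosen thinking mode's reasoning and output."""
    est_tokens = len(prompt) // chars_per_token
    return est_tokens + RESERVE[thinking] <= CONTEXT_WINDOW
```

If the check fails for Think Max, dropping to Think High or chunking the input avoids the context_length_exceeded error entirely.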

Migrating from V3

DeepSeek has announced that the V3 model IDs deepseek-chat and deepseek-reasoner will be retired on July 24, 2026. After that date, requests using those model IDs will return errors.

Migration steps

  1. Replace model IDs. Change deepseek-chat to deepseek-v4-flash (or deepseek-v4-pro for heavier tasks). Change deepseek-reasoner to deepseek-v4-pro with thinking set to high or max.
  2. Update thinking parameters. V3’s reasoner used a single reasoning mode. V4 gives you three levels. Start with Think High and upgrade to Think Max only where needed.
  3. Test output quality. V4 models produce different output distributions than V3. Run your evaluation suite before switching production traffic.
  4. Check pricing. V4-Flash is cheaper than V3 for most workloads. V4-Pro is priced higher but delivers better results. See the pricing table above.

| V3 Model ID | V4 Replacement | Notes |
|---|---|---|
| deepseek-chat | deepseek-v4-flash | Direct replacement for general tasks |
| deepseek-chat | deepseek-v4-pro | Use for tasks needing stronger reasoning |
| deepseek-reasoner | deepseek-v4-pro + Think High/Max | Set thinking mode explicitly |
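
As a sketch, this mapping can be applied mechanically to existing request parameters. The migrate_request helper and its default of Think High for deepseek-reasoner traffic are illustrative choices, not an official tool:

```python
# Map retired V3 model IDs to V4 replacements. Defaulting reasoner
# traffic to Think High is an illustrative choice; use "max" where needed.
V3_TO_V4 = {
    "deepseek-chat": ("deepseek-v4-flash", None),
    "deepseek-reasoner": ("deepseek-v4-pro", "high"),
}

def migrate_request(params: dict) -> dict:
    """Return a copy of V3-style request params rewritten for V4."""
    model, thinking = V3_TO_V4.get(params["model"], (params["model"], None))
    out = {**params, "model": model}
    if thinking is not None:
        extra = dict(out.get("extra_body", {}))
        extra.setdefault("thinking", thinking)  # keep any explicit setting
        out["extra_body"] = extra
    return out
```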

Rate Limits and Best Practices

DeepSeek applies rate limits based on your API tier. Free-tier accounts have lower limits than paid accounts.

| Tier | Requests/min | Tokens/min | Concurrent Requests |
|---|---|---|---|
| Free | 10 | 40,000 | 5 |
| Paid (Standard) | 300 | 2,000,000 | 50 |
| Paid (Enterprise) | Custom | Custom | Custom |

Best practices

  • Reuse system prompts across requests to maximize cache hits.
  • Use V4-Flash for simple tasks and reserve V4-Pro for complex reasoning. This keeps costs low without sacrificing quality where it matters.
  • Set the right thinking mode. Non-think for lookups, Think High for most work, Think Max only for hard problems. Overusing Think Max wastes tokens and money.
  • Stream responses for user-facing applications to reduce perceived latency.
  • Handle rate limit errors with exponential backoff. The API returns HTTP 429 when you hit limits.
  • Monitor cache_hit_tokens in responses to verify your caching strategy is working.
  • Keep inputs under 600K tokens if you plan to use Think Max, to leave room for reasoning.
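
The backoff advice can be sketched as a small retry wrapper. The retry count and delays here are illustrative defaults, not DeepSeek recommendations; with the OpenAI SDK, a RateLimitError exposes status_code == 429, so it works unchanged:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on HTTP 429, doubling the delay each attempt with jitter.
    `call` is any zero-argument function that performs the API request and
    raises an exception exposing a `status_code` attribute on failure."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as e:
            if getattr(e, "status_code", None) != 429 or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Usage: wrap the request as `with_backoff(lambda: client.chat.completions.create(...))`.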

FAQ

Can I use the DeepSeek V4 API with the OpenAI SDK in other languages?

Yes. Any OpenAI-compatible SDK works, including the official Node.js, Go, and Java libraries. Set the base URL to https://api.deepseek.com and use your DeepSeek API key. The model IDs and parameters are the same.

What happens if I do not specify a thinking mode?

The API defaults to Think High. If you want Non-think behavior, you must explicitly set "thinking": "none". If you want Think Max, set "thinking": "max".

Is there a way to see the model’s internal reasoning tokens?

By default, internal reasoning tokens are not included in the response. You can request them by setting "show_thinking": true in the extra_body. The reasoning will appear in a separate thinking_content field on the message object. Note that this increases response size and is mainly useful for debugging.
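
As a sketch, here is a small helper that separates the reasoning trace from the final answer; the show_thinking flag and thinking_content field follow the description above and are worth verifying against the live API:

```python
# Separate the optional reasoning trace from the final answer.
# Field names follow the description above; verify against the live API.
def split_reply(message):
    """Return (reasoning trace or None, final answer) for a message object."""
    return getattr(message, "thinking_content", None), message.content

# Usage with the client from the setup section:
# response = client.chat.completions.create(
#     model="deepseek-v4-pro",
#     messages=[{"role": "user", "content": "Why is the sky blue?"}],
#     extra_body={"thinking": "high", "show_thinking": True},
# )
# reasoning, answer = split_reply(response.choices[0].message)
```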

Next Steps