
How to Use DeepSeek Vision API: Python Tutorial with Examples


DeepSeek’s V4 models now understand images, and the best part is you don’t need to learn a new SDK. The API is OpenAI-compatible, so if you’ve used the OpenAI Python library before, you already know 90% of what you need. This tutorial walks through everything from your first API call to production-ready batch processing.

If you’re looking for a broader overview of what DeepSeek Vision can do, check out our complete guide. This article is purely hands-on code.

Prerequisites

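You’ll need Python 3.9 or later (the examples use built-in generic type hints such as list[str]) and a DeepSeek API key.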

Install the dependency:

pip install openai

Set your API key as an environment variable:

export DEEPSEEK_API_KEY="your-key-here"

Basic Setup

Since DeepSeek uses an OpenAI-compatible API, you configure the standard OpenAI client with a different base URL:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)

That’s it. Every example below uses this same client instance. You’ve got two model choices:

  • deepseek-v4-flash - Fast and cheap ($0.14 input / $0.28 output per million tokens). Great for straightforward tasks.
  • deepseek-v4-pro - More capable ($1.74 input / $3.48 output per million tokens). Better for complex reasoning about images.

Example 1: Basic Image Description

Let’s start simple. Send an image URL and ask the model to describe it:

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)

Sample response:

The image shows a golden retriever sitting on a wooden dock overlooking a
calm lake at sunset. The dog is facing away from the camera, looking out
over the water. The sky has shades of orange and pink reflected on the
lake surface. Pine trees line the far shore.

The content field takes a list of parts. You can mix text and images in any order. The model sees them all as a single prompt.
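To make the "list of parts" shape concrete, here is a minimal sketch of a helper that assembles it. The name build_content is hypothetical (it is not part of the OpenAI SDK or the DeepSeek API); it only produces the same dictionaries used in the example above.

```python
def build_content(text: str, image_urls: list[str]) -> list[dict]:
    """Return a content list: one text part followed by one image part per URL.

    Hypothetical helper for illustration; the part layout mirrors the
    request shown in Example 1.
    """
    parts = [{"type": "text", "text": text}]
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return parts


# The resulting list goes straight into the "content" field of a user message.
messages = [
    {
        "role": "user",
        "content": build_content(
            "Describe this image in detail.",
            ["https://example.com/photo.jpg"],
        ),
    }
]
```

The messages list can then be passed to client.chat.completions.create exactly as in the example above.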

Example 2: Using Base64-Encoded Images

For local files, encode them as base64:

import base64

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

image_data = encode_image("receipt.jpg")

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text from this image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                }
            ]
        }
    ],
    max_tokens=1000
)

print(response.choices[0].message.content)

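Both JPEG and PNG work fine. If you send a PNG, change the data URL prefix to data:image/png so the declared type matches the file. For most use cases, JPEG is preferred since smaller file sizes mean faster uploads and lower costs.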
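Since the data URL should declare the right media type, one option is to derive it from the file name instead of hardcoding it. The sketch below uses only the standard library; to_data_url is a hypothetical name, and whether the API tolerates a mismatched prefix is not something this article establishes.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL with a MIME type guessed from its name."""
    mime, _ = mimetypes.guess_type(image_path)
    # Note: older Python versions may not know ".webp" unless the system
    # MIME database lists it; in that case this raises ValueError.
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used directly as the "url" value of an image_url part.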

Example 3: OCR Text Extraction

OCR is where DeepSeek Vision really shines for its price. Here’s a structured extraction example:

def extract_document_text(image_path: str) -> str:
    image_data = encode_image(image_path)

    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract all text from this document image.
Return it as JSON with the following structure:
{
  "document_type": "invoice/receipt/form/other",
  "extracted_text": "full text content",
  "key_fields": {"field_name": "value"}
}"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}"
                        }
                    }
                ]
            }
        ],
        max_tokens=2000,
        temperature=0.1
    )

    return response.choices[0].message.content

Setting temperature=0.1 keeps OCR output consistent across runs. You don’t want creative interpretation of text on an invoice.

Sample response for a receipt:

{
  "document_type": "receipt",
  "extracted_text": "WHOLE FOODS MARKET\n123 Main St\nAustin, TX 78701\n\nOrganic Bananas  $1.99\nAlmond Milk      $4.49\nSourdough Bread  $5.99\n\nSubtotal: $12.47\nTax: $0.62\nTotal: $13.09\n\nVISA ***1234\n06/15/2026 14:32",
  "key_fields": {
    "store": "Whole Foods Market",
    "total": "$13.09",
    "date": "06/15/2026",
    "payment_method": "VISA ***1234"
  }
}
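
The API returns this JSON as a plain string, so it still has to be parsed. Models sometimes wrap JSON in a Markdown code fence, so a tolerant parser can help. This is a minimal sketch; parse_model_json is a hypothetical helper, and the fence-stripping is a defensive assumption rather than documented DeepSeek behavior.

```python
import json
import re


def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    text = raw.strip()
    # Remove an opening fence (three backticks plus optional language tag)
    text = re.sub(r"^`{3}[a-zA-Z]*\s*", "", text)
    # Remove a closing fence (three backticks at the very end)
    text = re.sub(r"\s*`{3}$", "", text)
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data
```

A caller would typically catch json.JSONDecodeError and either retry the request or store the raw text for manual review.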

For a deeper dive into OCR pipelines, check out our DeepSeek Vision OCR guide.

Example 4: Multiple Images in One Request

You can send multiple images in a single message. This is useful for comparing documents or processing related images together:

def compare_images(image_paths: list[str], question: str) -> str:
    content = [{"type": "text", "text": question}]

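    for path in image_paths:
        image_data = encode_image(path)
        # Use the MIME type that matches the file (the usage below passes PNGs)
        mime = "image/png" if path.lower().endswith(".png") else "image/jpeg"
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:{mime};base64,{image_data}"
            }
        })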

    response = client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[{"role": "user", "content": content}],
        max_tokens=1500
    )

    return response.choices[0].message.content

# Compare two versions of a design
result = compare_images(
    ["design_v1.png", "design_v2.png"],
    "What are the differences between these two UI designs?"
)
print(result)

Each image only uses about 90 KV cache entries in DeepSeek’s architecture, so you can fit many images within the 1M context window without running into limits.
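
As a rough back-of-the-envelope check, the figures above can be turned into an upper bound. This sketch assumes each KV cache entry counts as one position against the context window, which the article does not state explicitly, so treat the result as an order-of-magnitude estimate only.

```python
CONTEXT_WINDOW = 1_000_000  # context size quoted in this article
ENTRIES_PER_IMAGE = 90      # approximate per-image figure quoted in this article


def max_images(reserved_for_text: int = 0) -> int:
    """Upper bound on images per request, ASSUMING one KV entry = one context position."""
    available = max(CONTEXT_WINDOW - reserved_for_text, 0)
    return available // ENTRIES_PER_IMAGE


print(max_images())                           # 11111
print(max_images(reserved_for_text=100_000))  # 10000
```

In practice the request payload size is likely to become the limiting factor well before this bound.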

Example 5: Streaming Responses

For longer outputs (like detailed image descriptions or large OCR jobs), streaming gives you results as they generate:

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe everything in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/complex-scene.jpg"}
                }
            ]
        }
    ],
    max_tokens=2000,
    stream=True
)

full_response = ""
for chunk in stream:
    if chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        full_response += content

print()  # newline at the end

Streaming is especially helpful in web applications where you want to show progress to users instead of making them wait for the entire response.
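
If you only need the final text, the accumulation loop can be factored into a small function. The sketch below also skips chunks whose choices list is empty; some OpenAI-compatible servers send such chunks (for example, usage-only chunks), and whether DeepSeek does is an assumption here. collect_stream is a hypothetical name, and the fake chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:  # defensively skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # delta content can be None or empty
            pieces.append(delta)
    return "".join(pieces)


def _fake_chunk(text):
    # Stand-in object shaped like a streaming chunk, for offline demonstration
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [_fake_chunk("Hello"), SimpleNamespace(choices=[]), _fake_chunk(None), _fake_chunk(" world")]
print(collect_stream(demo))  # Hello world
```

With a real request, pass the stream object returned by client.chat.completions.create(..., stream=True) instead of demo.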

Example 6: Batch Processing Multiple Images

Here’s a practical pattern for processing a folder of images:

import os
import json
import time
from pathlib import Path

def batch_process_images(
    folder: str,
    prompt: str,
    model: str = "deepseek-v4-flash",
    delay: float = 0.5
) -> list[dict]:
    results = []
    image_extensions = {".jpg", ".jpeg", ".png", ".webp"}

    image_files = [
        f for f in Path(folder).iterdir()
        if f.suffix.lower() in image_extensions
    ]

    print(f"Processing {len(image_files)} images...")

    for i, image_path in enumerate(image_files):
        try:
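            image_data = encode_image(str(image_path))
            # Match the MIME type to the file extension (.jpg/.jpeg -> jpeg)
            suffix = image_path.suffix.lower().lstrip(".")
            mime = "image/jpeg" if suffix in {"jpg", "jpeg"} else f"image/{suffix}"

            response = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:{mime};base64,{image_data}"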
                                }
                            }
                        ]
                    }
                ],
                max_tokens=1500,
                temperature=0.1
            )

            results.append({
                "file": image_path.name,
                "result": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens
            })

            print(f"  [{i+1}/{len(image_files)}] {image_path.name} - OK")

        except Exception as e:
            results.append({
                "file": image_path.name,
                "error": str(e)
            })
            print(f"  [{i+1}/{len(image_files)}] {image_path.name} - ERROR: {e}")

        time.sleep(delay)  # rate limiting

    return results

# Usage
results = batch_process_images(
    folder="./invoices",
    prompt="Extract the invoice number, date, total amount, and vendor name as JSON."
)

# Save results
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

The delay parameter prevents hitting rate limits. For production workloads, you’d want proper retry logic, which brings us to the next section.

Example 7: Robust Error Handling

Production code needs to handle API errors gracefully:

from openai import (
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
)
import time

def call_with_retry(
    messages: list,
    model: str = "deepseek-v4-flash",
    max_retries: int = 3,
    max_tokens: int = 1500
) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens,
                timeout=60.0
            )
            return response.choices[0].message.content

        except RateLimitError as e:
            wait_time = 2 ** attempt * 5  # 5s, 10s, 20s
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)

        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}. Retrying...")
            time.sleep(2)

        except APIConnectionError as e:
            print(f"Connection error: {e}. Retrying in 5s...")
            time.sleep(5)

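        except APIError as e:
            # Only APIStatusError subclasses carry status_code, so read it defensively
            status = getattr(e, "status_code", None)
            if status is not None and status >= 500:
                print(f"Server error ({status}). Retrying...")
                time.sleep(3)
            else:
                raise  # client errors shouldn't be retried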

    raise RuntimeError(f"Failed after {max_retries} attempts")

This handles the most common failure modes: rate limits, timeouts, connection issues, and server errors. Client errors (4xx) are raised immediately since retrying won’t fix them.
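
The rate-limit branch uses exponential backoff (5s, 10s, 20s). A common refinement is to cap the delay and add random jitter so that many workers do not retry at the same instant. The sketch below is one way to compute such a schedule; backoff_delay and its parameters are hypothetical and not part of the OpenAI SDK.

```python
import random


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Exponential backoff: base * 2**attempt, capped, plus up to jitter*delay extra."""
    delay = min(base * (2 ** attempt), cap)
    # With jitter=0.0 the random term is exactly 0.0, so the schedule is deterministic
    return delay + random.uniform(0, jitter * delay)


print([backoff_delay(a) for a in range(5)])  # [5.0, 10.0, 20.0, 40.0, 60.0]
```

Inside call_with_retry, time.sleep(backoff_delay(attempt, jitter=0.25)) could replace the fixed wait_time calculation.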

Cost Tracking

Keep track of what you’re spending:

class CostTracker:
    def __init__(self, model: str = "deepseek-v4-flash"):
        self.model = model
        self.total_input_tokens = 0
        self.total_output_tokens = 0

        # Prices per million tokens
        self.prices = {
            "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
            "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
        }

    def add_usage(self, usage):
        self.total_input_tokens += usage.prompt_tokens
        self.total_output_tokens += usage.completion_tokens

    @property
    def total_cost(self) -> float:
        prices = self.prices[self.model]
        input_cost = (self.total_input_tokens / 1_000_000) * prices["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * prices["output"]
        return input_cost + output_cost

    def report(self) -> str:
        return (
            f"Tokens: {self.total_input_tokens:,} in / "
            f"{self.total_output_tokens:,} out\n"
            f"Cost: ${self.total_cost:.4f}"
        )

# Usage
tracker = CostTracker("deepseek-v4-flash")

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[...],
    max_tokens=1000
)
tracker.add_usage(response.usage)

print(tracker.report())
# Tokens: 1,245 in / 387 out
# Cost: $0.0003

At $0.14 per million input tokens with V4-Flash, you can process thousands of images for pennies.
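
To put a number on that claim, the arithmetic can be sketched as below. It assumes every request uses the token counts from the sample output above (1,245 input and 387 output tokens); real usage varies with image content and prompt length. estimate_cost is a hypothetical helper, and the prices are the ones quoted in this article.

```python
# Prices per million tokens, as quoted in this article
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}


def estimate_cost(n_requests: int, input_tokens: int, output_tokens: int,
                  model: str = "deepseek-v4-flash") -> float:
    """Estimated USD cost of n_requests with the same per-request token counts."""
    price = PRICES[model]
    total_in = n_requests * input_tokens
    total_out = n_requests * output_tokens
    return (total_in / 1_000_000) * price["input"] + (total_out / 1_000_000) * price["output"]


print(f"${estimate_cost(1000, 1245, 387):.2f}")                     # $0.28
print(f"${estimate_cost(1000, 1245, 387, 'deepseek-v4-pro'):.2f}")  # $3.51
```

Under these assumptions, a thousand image requests on V4-Flash cost roughly 28 cents.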

Tips and Best Practices

Choose the right model. Use V4-Flash for straightforward tasks like basic OCR, image labeling, and simple descriptions. Switch to V4-Pro when you need the model to reason about what it sees, like comparing documents or interpreting complex diagrams.

Keep images reasonable. While the API accepts large images, resizing to 800x800 or smaller usually gives the same quality at lower cost. The model’s visual primitives compress the information anyway.
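
Resizing should preserve the aspect ratio, which comes down to scaling by the longest side. The following sketch computes the target dimensions only; fit_within is a hypothetical helper, and the 800-pixel default simply follows the suggestion above.

```python
def fit_within(width: int, height: int, max_side: int = 800) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # (800, 600)
print(fit_within(640, 480))    # (640, 480)
```

For the actual pixel work, an imaging library such as Pillow can do the same thing in one call (its Image.thumbnail method shrinks an image in place while keeping the aspect ratio).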

Use low temperature for extraction. When you want consistent, factual output (OCR, data extraction), set temperature to 0.1 or even 0. Save higher temperatures for creative descriptions.

Batch wisely. Sending multiple images in one request is cheaper than separate requests because the shared prompt text is only sent, and billed, once. But don’t overdo it: if one image in a batch causes an error, you lose the whole response.

For a comparison of how DeepSeek Vision stacks up against GPT-4o and Gemini on these tasks, see our detailed benchmark comparison.

FAQ

What image formats does DeepSeek Vision support?

JPEG, PNG, WebP, and GIF (first frame only). Both URLs and base64-encoded images work. For most use cases, JPEG gives you the best quality-to-size ratio.

Is there a maximum image size?

The API accepts images up to 20MB. However, images larger than 2048x2048 pixels are automatically resized. For cost efficiency, pre-resize your images to around 800x800 before sending them.

Can I use the async OpenAI client?

Yes. The AsyncOpenAI class works identically. Just pass the same base_url and api_key, then use await client.chat.completions.create(...). This is the recommended approach for web applications and high-throughput pipelines.
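
A minimal sketch of the async pattern, with a semaphore to cap concurrency so you stay under your rate limit. The helper names (gather_limited, describe, main) and the limit of 5 are assumptions for illustration; only AsyncOpenAI and chat.completions.create come from the SDK.

```python
import asyncio


async def gather_limited(worker, items, limit: int = 5) -> list:
    """Run worker(item) for every item, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(run(item) for item in items))


async def describe(client, url: str, model: str = "deepseek-v4-flash") -> str:
    """Ask the model to describe one image URL using an async client."""
    response = await client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content


async def main(urls: list[str]) -> list[str]:
    import os
    # Imported here so the helpers above can be used without the SDK installed
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )
    return await gather_limited(lambda u: describe(client, u), urls, limit=5)


# Example (requires network access and an API key):
# results = asyncio.run(main(["https://example.com/photo.jpg"]))
```

Because gather_limited takes any async worker, the same function can wrap a retrying call or a base64-based request instead of describe.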

How does the 1M context window work with images?

Each image uses approximately 90 KV cache entries regardless of resolution (after internal resizing). That means you could theoretically include thousands of images in a single conversation. In practice, you’re more likely limited by the base64 payload size and API timeout settings.

What’s the rate limit?

DeepSeek’s rate limits vary by plan. Free tier accounts get lower throughput. Paid accounts typically get 60 requests per minute and 1M tokens per minute. Check your dashboard at platform.deepseek.com for your specific limits.

Can I use this with LangChain or LlamaIndex?

Yes to both. Since the API is OpenAI-compatible, any framework that supports custom OpenAI endpoints works out of the box. Just set the base URL to https://api.deepseek.com in your framework’s configuration.
