DeepSeek just made its V4 models multimodal. Both V4-Pro and V4-Flash can now process images alongside text, turning them into direct competitors to GPT-4o and Gemini for visual understanding tasks. The kicker: pricing starts at $0.14 per million input tokens. That’s roughly 9x cheaper than the next-cheapest model compared here (Gemini 3.5 Pro at $1.25 per million input tokens).
Here’s what DeepSeek Vision can do, how to set it up, and where it actually makes sense to use it.
What is DeepSeek Vision?
DeepSeek Vision is the multimodal capability built into DeepSeek V4-Pro and V4-Flash. It uses a novel architecture that processes images with just 90 KV cache entries per 800x800 image, compared to roughly 870 for Claude and 1,100 for Gemini. That efficiency translates directly into lower costs per image-heavy request.
The technical approach: instead of encoding images into thousands of visual tokens (the standard method), DeepSeek uses “visual primitives” that compress visual information more aggressively while maintaining understanding accuracy.
Key specs:
- Models: deepseek-v4-pro (reasoning) and deepseek-v4-flash (fast/cheap)
- Input: text + images (URLs or base64)
- Context: 1M tokens (including visual tokens)
- API: OpenAI-compatible (drop-in replacement)
- Pricing (input/output, per million tokens): V4-Flash $0.14/$0.28, V4-Pro $1.74/$3.48
Pricing breakdown
| Model | Input | Output | Image cost (per 800x800) |
|---|---|---|---|
| DeepSeek V4-Flash | $0.14/M | $0.28/M | ~$0.000013 |
| DeepSeek V4-Pro | $1.74/M | $3.48/M | ~$0.000157 |
| GPT-4o | $2.50/M | $10.00/M | ~$0.002175 |
| Gemini 3.5 Pro | $1.25/M | $5.00/M | ~$0.001375 |
| Claude Opus 4.8 | $15.00/M | $75.00/M | ~$0.013050 |
By the figures in the table, DeepSeek V4-Flash processes an image for roughly 1/170th the cost of GPT-4o and about 1/1,000th the cost of Claude Opus. Even V4-Pro is about 14x cheaper than GPT-4o for the same image understanding task.
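The image-cost column can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the per-image cost is simply the visual-token count multiplied by the input price; the token counts and prices are the ones quoted in this article, and the function names are made up for illustration.

```python
# Visual-token counts and input prices (USD per million tokens) as quoted in
# this article. Assumption: per-image cost = visual tokens x input price.
MODELS = {
    "DeepSeek V4-Flash": (90, 0.14),
    "DeepSeek V4-Pro": (90, 1.74),
    "GPT-4o": (870, 2.50),
    "Gemini 3.5 Pro": (1100, 1.25),
    "Claude Opus 4.8": (870, 15.00),
}


def image_cost(visual_tokens: int, usd_per_million: float) -> float:
    """Cost in USD of one image's worth of input tokens."""
    return visual_tokens * usd_per_million / 1_000_000


def cost_ratio(cheaper: str, pricier: str) -> float:
    """How many times more the pricier model costs per image."""
    return image_cost(*MODELS[pricier]) / image_cost(*MODELS[cheaper])


if __name__ == "__main__":
    for name, (tokens, price) in MODELS.items():
        print(f"{name:<20} ${image_cost(tokens, price):.6f}")
```

Calling `cost_ratio("DeepSeek V4-Pro", "GPT-4o")` gives the per-image price gap between any two rows, which makes it easy to re-check the comparison if a provider changes its pricing.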
What it can do
Based on benchmarks and community testing:
Strong:
- OCR and document text extraction
- Chart and graph understanding
- Screenshot analysis and UI description
- General image description and visual Q&A
- Multi-image comparison
- Table extraction from images
- Handwriting recognition
Decent:
- Diagram and flowchart interpretation
- Code screenshot to text
- Object detection and counting
- Spatial reasoning (“what’s to the left of X?”)
Weaker than GPT-4o/Gemini:
- Complex multi-step visual reasoning
- Very fine-grained image detail (tiny text, subtle differences)
- Video frame analysis (not supported yet)
How to use DeepSeek Vision (API setup)
The API is OpenAI-compatible. If you’re already using the OpenAI SDK, you just swap the base URL and key.
from openai import OpenAI
client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"}
                },
                {
                    "type": "text",
                    "text": "What does this screenshot show? Extract all visible text."
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
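For readers who are not using the SDK, it can help to see what "OpenAI-compatible" means on the wire. The sketch below builds, but does not send, the equivalent HTTP request using only the standard library. It assumes the chat endpoint sits at `/chat/completions` under the base URL, following the OpenAI convention; `build_vision_request` is a hypothetical helper name, not part of any SDK.

```python
import json
import urllib.request

# Assumption: the endpoint path follows the OpenAI convention. Check the
# DeepSeek API documentation before relying on it.
BASE_URL = "https://api.deepseek.com"


def build_vision_request(
    api_key: str, model: str, image_url: str, prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a chat request carrying one image and one prompt."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build step separate makes the payload easy to inspect or log before spending any tokens.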
You can also pass base64-encoded images:
import base64
# Read the file and base64-encode it so it can be embedded as a data URL.
with open("document.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
# Reuses the client created in the previous example.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{img_b64}"}
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document, preserving formatting."
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
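The base64 example hardcodes `image/png` in the data URL, which is wrong for JPEG or WebP files. A small helper can guess the MIME type from the file extension instead. This is a sketch using only the standard library; `image_to_data_url` and `image_part` are hypothetical names, and whether the API accepts every image type that `mimetypes` recognises is an assumption to verify.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Wrap a local image as an OpenAI-style image_url content part."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The dictionary returned by `image_part("scan.jpg")` can be dropped straight into the `content` list of a user message in place of the hand-written `image_url` entry.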
Best use cases
1. Batch document processing: At $0.14/M tokens, you can process thousands of invoices, receipts, or forms for pennies. A typical document page costs $0.001-0.003 to extract text from.
2. Screenshot-based QA testing: Feed app screenshots to DeepSeek Vision and ask “does this match the design spec?” Cheap enough to run on every CI build.
3. Content moderation at scale: Check user-uploaded images for policy violations. At these prices, you can scan millions of images without budget concerns.
4. Chart/graph data extraction: Turn visual charts into structured data (JSON/CSV). Works well for financial reports, dashboards, and presentation slides.
5. Accessibility alt-text generation: Generate image descriptions for websites. At $0.000013 per image, there’s no excuse not to add alt text to everything.
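For the chart extraction use case, the fiddly part is usually turning the model's reply into data you can trust. The sketch below pairs a hypothetical prompt with a parser that tolerates a Markdown code fence around the JSON, which chat models often add. The schema, the prompt wording, and the function name are all illustrative assumptions, not anything DeepSeek prescribes.

```python
import json
import re

# Hypothetical prompt; adapt the schema to your own charts.
CHART_PROMPT = (
    "Extract the data series from this chart. Reply with JSON only, shaped as "
    '{"title": str, "series": [{"label": str, "points": [[x, y], ...]}]}.'
)

_TICKS = "`" * 3  # Markdown code-fence marker
_FENCE = re.compile(rf"^{_TICKS}(?:json)?\s*\n(.*?)\n{_TICKS}\s*$", re.DOTALL)


def parse_chart_reply(reply: str) -> dict:
    """Parse a model reply into a dict, tolerating a surrounding code fence."""
    text = reply.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict) or "series" not in data:
        raise ValueError("reply is valid JSON but lacks a 'series' key")
    return data
```

Send `CHART_PROMPT` as the text part alongside the chart image, then pass `response.choices[0].message.content` to `parse_chart_reply`. Failing loudly on malformed output is a deliberate choice here: in a batch job it is safer to retry or flag a page than to store half-parsed data.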
When NOT to use DeepSeek Vision
- If you need the absolute best accuracy on complex visual reasoning, GPT-4o still wins
- If data sensitivity requires US/EU processing only (DeepSeek runs through Chinese infrastructure, subject to China’s National Intelligence Law)
- If you need video understanding (not supported yet)
- If you need image generation (DeepSeek Vision is understanding only, not generation)
For sensitive data: self-host using the open weights. DeepSeek-VL2 is available on HuggingFace under MIT license if you want to avoid the API entirely.
DeepSeek Vision vs alternatives
| Feature | DeepSeek V4-Pro | GPT-4o | Gemini 3.5 Pro |
|---|---|---|---|
| Price (input) | $1.74/M | $2.50/M | $1.25/M |
| Price (output) | $3.48/M | $10.00/M | $5.00/M |
| Image efficiency | 90 KV entries | ~870 KV entries | ~1,100 KV entries |
| Context window | 1M | 128K | 2M |
| OCR quality | Very good | Excellent | Excellent |
| Complex reasoning | Good | Excellent | Very good |
| Open weights | Yes (MIT) | No | No |
| Data sovereignty | China | US | US |
| Video input | No | Yes | Yes |
Bottom line: If cost matters more than marginal accuracy differences, DeepSeek Vision is the obvious choice. If you’re processing documents, screenshots, or doing batch image analysis, the 10-14x cost savings add up to thousands of dollars on real workloads.
Getting started
- Get an API key at platform.deepseek.com
- New accounts get 5M free tokens (enough for ~55,000 image analyses at roughly 90 tokens per image, before counting prompt and output tokens)
- Use any OpenAI-compatible SDK with base_url="https://api.deepseek.com"
- Start with V4-Flash for speed/cost, upgrade to V4-Pro for complex reasoning
FAQ
Is DeepSeek Vision as good as GPT-4o?
For most practical tasks (OCR, document extraction, screenshot understanding), it’s 90-95% as good at 10-14x lower cost. For complex multi-step visual reasoning, GPT-4o still has an edge.
Can I use DeepSeek Vision with sensitive documents?
The API routes through Chinese infrastructure. If data sovereignty matters, self-host using the open weights on HuggingFace (MIT license, no restrictions).
Does DeepSeek Vision support video?
No, only static images currently. For video understanding, you’ll need GPT-4o or Gemini.
How many images can I send in one request?
Multiple images are supported in a single request. The total context window is 1M tokens, and an 800x800 image uses roughly 90 of them, so in theory around 11,000 images fit in one request before you account for the prompt and the response.
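A rough sketch of that budget, together with a helper that packs several images into one message, might look like the following. It assumes every image costs about 90 tokens regardless of size and that the reserved output shares the same 1M window; both are simplifications, and the function names are hypothetical.

```python
CONTEXT_TOKENS = 1_000_000  # context window quoted in this article
TOKENS_PER_IMAGE = 90       # per 800x800 image, as quoted in this article


def max_images(prompt_tokens: int = 0, reserved_output_tokens: int = 0) -> int:
    """Upper bound on images that fit once text and output budget are set aside."""
    remaining = CONTEXT_TOKENS - prompt_tokens - reserved_output_tokens
    return max(remaining, 0) // TOKENS_PER_IMAGE


def multi_image_message(image_urls: list[str], question: str) -> dict:
    """Build one user message holding several images followed by a question."""
    parts = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    parts.append({"type": "text", "text": question})
    return {"role": "user", "content": parts}
```

In practice, request size limits and latency are likely to bite long before the token ceiling does, so treat `max_images` as an upper bound rather than a target.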
Is there a free tier?
New accounts get 5 million free tokens. After that, V4-Flash starts at $0.14 per million tokens, which means $1 gets you roughly 7 million input tokens or ~77,000 image analyses.
How does it compare to the banned Claude Fable 5?
Fable 5 had superior visual reasoning but is currently unavailable due to US export controls. For developers affected by the ban, DeepSeek Vision is the most cost-effective alternative with open weights you can self-host without any access restrictions.