DeepSeek released their vision models under an MIT license. That means you can download the weights, run them on your own hardware, and process images without any data leaving your infrastructure. No per-token costs. No rate limits. No sending sensitive documents to servers in China.
This guide covers everything from hardware requirements to a running local inference server. We’ll look at both vLLM (for maximum performance) and Ollama (for simplicity), plus quantization options that let you run on consumer GPUs.
If you’re deciding between the API and self-hosting, this article will help you figure out which makes sense for your situation. For API usage, see our complete DeepSeek Vision guide.
Why Self-Host?
Three reasons come up repeatedly:
Data privacy. The DeepSeek API routes through servers in China. For organizations subject to GDPR, HIPAA, SOC 2, or internal data governance policies, sending documents to foreign servers may violate compliance requirements. Self-hosting eliminates this concern entirely.
China data law concerns. China’s data security laws (DSL and PIPL) give the government broad authority to access data processed within its jurisdiction. Even if DeepSeek’s privacy policy says they don’t share data, the legal framework allows compelled access. Some organizations, particularly government contractors and defense-adjacent companies, have blanket policies against data processing in China.
Cost at scale. At high volume (500K+ images per month), owning or renting GPUs becomes cheaper than API calls. The break-even depends on your hardware costs, but the math favors self-hosting once you cross that threshold.
Available Models
DeepSeek released several vision-capable models you can self-host:
| Model | Parameters | VRAM Required | Quality Level |
|---|---|---|---|
| DeepSeek-VL2-Tiny | 3B | 8GB | Basic OCR, simple descriptions |
| DeepSeek-VL2-Small | 16B | 36GB | Good general quality |
| DeepSeek-VL2 | 27B (MoE) | 48GB | Near API quality |
| DeepSeek-VL2-Large | 72B | 140GB+ | Matches V4-Flash API |
The MoE (Mixture of Experts) architecture in the 27B model means it activates only a subset of parameters per forward pass, making it faster than a dense 27B model while using less memory than you’d expect.
For most self-hosting scenarios, DeepSeek-VL2 (27B) offers the best quality-to-cost ratio. It runs on a single A100 80GB or two A6000 48GB GPUs.
Hardware Requirements
Minimum (DeepSeek-VL2-Tiny, 3B)
- GPU: RTX 3060 12GB or better
- RAM: 16GB system
- Storage: 10GB for model weights
- Good for: Testing, simple OCR, hobbyist projects
Recommended (DeepSeek-VL2, 27B)
- GPU: A100 80GB or 2x A6000 48GB
- RAM: 64GB system
- Storage: 60GB for model weights (full precision) or 15GB (quantized)
- Good for: Production workloads, high-quality results
High Performance (DeepSeek-VL2-Large, 72B)
- GPU: 2x A100 80GB or 4x A6000 48GB
- RAM: 128GB system
- Storage: 150GB for model weights
- Good for: Maximum quality, matching V4-Pro API output
Consumer GPU Options (Quantized)
| GPU | VRAM | Best Model Fit |
|---|---|---|
| RTX 4060 Ti 16GB | 16GB | VL2-Tiny (full), VL2-Small (4-bit) |
| RTX 4080 16GB | 16GB | VL2-Tiny (full), VL2-Small (4-bit) |
| RTX 4090 24GB | 24GB | VL2-Small (full), VL2-27B (4-bit) |
| RTX 5090 32GB | 32GB | VL2-27B (5-bit) |
| 2x RTX 4090 | 48GB | VL2-27B (full) |
Downloading the Weights
Models are hosted on HuggingFace. You’ll need git-lfs:
# Install git-lfs if you don't have it
sudo apt install git-lfs # Ubuntu/Debian
brew install git-lfs # macOS
git lfs install
# Clone the model (27B, recommended)
git clone https://huggingface.co/deepseek-ai/DeepSeek-VL2
# This downloads ~55GB
# Or just the small model for testing
git clone https://huggingface.co/deepseek-ai/DeepSeek-VL2-Tiny
# This downloads ~7GB
Alternatively, use the huggingface-hub Python package for more control:
pip install huggingface-hub
huggingface-cli download deepseek-ai/DeepSeek-VL2 --local-dir ./models/deepseek-vl2
Option 1: vLLM Setup (Recommended for Production)
vLLM gives you the best throughput and supports continuous batching, which means multiple requests process efficiently in parallel.
Installation
pip install vllm>=0.6.0
Starting the Server
python -m vllm.entrypoints.openai.api_server \
--model ./models/deepseek-vl2 \
--trust-remote-code \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--port 8000 \
--served-model-name deepseek-vl2
For multi-GPU setups:
python -m vllm.entrypoints.openai.api_server \
--model ./models/deepseek-vl2 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--port 8000 \
--served-model-name deepseek-vl2
Using the Server
The vLLM server exposes an OpenAI-compatible API. Your existing code works with just a URL change:
from openai import OpenAI
client = OpenAI(
api_key="not-needed", # local server doesn't check keys
base_url="http://localhost:8000/v1"
)
response = client.chat.completions.create(
model="deepseek-vl2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {"url": "data:image/jpeg;base64,<your-base64>"}
}
]
}
],
max_tokens=500
)
print(response.choices[0].message.content)
That’s the same code from our Python tutorial, just pointing at localhost. Zero code changes needed if you abstract the base URL into an environment variable.
vLLM Performance Tuning
# Enable prefix caching (helps with repeated prompts)
--enable-prefix-caching
# Increase batch size for throughput (uses more VRAM)
--max-num-batched-tokens 8192
# Enable chunked prefill for better latency on long inputs
--enable-chunked-prefill
Option 2: Ollama Setup (Simplest)
Ollama is the easiest way to get started. One command to download, one to run. Less configuration, less throughput than vLLM, but dead simple.
Installation
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Running DeepSeek-VL2
# Pull the model (quantized versions available)
ollama pull deepseek-vl2
# Or a specific quantization
ollama pull deepseek-vl2:q4_K_M
# Start serving
ollama serve
Using with Ollama
Ollama exposes its own API, but also supports the OpenAI-compatible endpoint:
from openai import OpenAI
client = OpenAI(
api_key="ollama",
base_url="http://localhost:11434/v1"
)
# Same usage as above
response = client.chat.completions.create(
model="deepseek-vl2",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Extract text from this image."},
{
"type": "image_url",
"image_url": {"url": "data:image/jpeg;base64,<base64-data>"}
}
]
}
],
max_tokens=1000
)
Ollama Limitations
Ollama is great for simplicity but has tradeoffs:
- No continuous batching (sequential processing only)
- Lower throughput than vLLM under load
- Memory management is less sophisticated
- Limited control over quantization parameters
For production workloads serving multiple concurrent users, use vLLM. For local development and single-user scenarios, Ollama is perfect.
Quantization Options
If you don’t have enterprise GPUs, quantization lets you run larger models on less VRAM. Here’s what’s available:
GGUF Format (Ollama / llama.cpp)
| Quantization | VRAM (27B) | Quality Loss | Speed Impact |
|---|---|---|---|
| Q8_0 | 30GB | Negligible | Minimal |
| Q6_K | 24GB | Very slight | ~5% slower |
| Q5_K_M | 20GB | Minor | ~10% slower |
| Q4_K_M | 16GB | Noticeable on complex tasks | ~15% slower |
| Q3_K_M | 13GB | Significant | ~20% slower |
My recommendation: Q5_K_M is the sweet spot. It fits on an RTX 4090 and maintains over 95% of full-precision quality for OCR and image description tasks. Drop to Q4_K_M if you need to squeeze into 16GB.
AWQ Format (vLLM)
AWQ (Activation-aware Weight Quantization) is optimized for GPU inference:
# Download AWQ-quantized version
huggingface-cli download deepseek-ai/DeepSeek-VL2-AWQ --local-dir ./models/deepseek-vl2-awq
# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
--model ./models/deepseek-vl2-awq \
--quantization awq \
--trust-remote-code \
--max-model-len 32768 \
--port 8000
AWQ generally gives better throughput than GGUF on NVIDIA GPUs because it’s designed for CUDA tensor cores. Quality is comparable to Q4 GGUF.
GPTQ Format
GPTQ is another option, particularly well-supported by older tooling:
huggingface-cli download deepseek-ai/DeepSeek-VL2-GPTQ-Int4 --local-dir ./models/deepseek-vl2-gptq
In 2026, AWQ has largely superseded GPTQ for new deployments. AWQ is faster and marginally more accurate at the same bit width.
Performance Benchmarks: Local vs API
I benchmarked both setups processing 100 invoice images (average 800x600px, typical business documents):
Throughput (images per minute)
| Setup | Images/min | Notes |
|---|---|---|
| DeepSeek API (V4-Flash) | 45-60 | Depends on rate limit tier |
| vLLM (A100 80GB, FP16) | 35-40 | Single GPU, batch size 4 |
| vLLM (2x A6000, FP16) | 50-55 | Tensor parallel |
| vLLM (RTX 4090, Q4_K_M) | 12-15 | Consumer GPU |
| Ollama (RTX 4090, Q4_K_M) | 8-10 | Sequential only |
Quality Comparison (OCR accuracy on invoice test set)
| Setup | Accuracy | Notes |
|---|---|---|
| DeepSeek API (V4-Flash) | 96.2% | Production model |
| DeepSeek API (V4-Pro) | 97.8% | Reasoning model |
| Local VL2-27B (FP16) | 94.1% | Full precision |
| Local VL2-27B (Q5_K_M) | 93.5% | Slight quality loss |
| Local VL2-27B (Q4_K_M) | 91.8% | Noticeable on handwriting |
| Local VL2-Tiny (3B) | 82.3% | Basic OCR only |
The API models (V4-Flash and V4-Pro) are newer and more capable than the open-weight VL2 release. Expect roughly 2-4% lower accuracy with the local model at full precision. Quantization adds another 1-3% loss depending on the level.
Cost Comparison (Monthly, 100K images)
| Option | Monthly Cost | Setup Effort |
|---|---|---|
| DeepSeek API (V4-Flash) | ~$40 | Minimal |
| Cloud A100 (on-demand) | ~$2,400 | Medium |
| Cloud A100 (reserved 1yr) | ~$1,200 | Medium |
| Own RTX 4090 (electricity) | ~$50 | High initial |
| Own A100 (colocation) | ~$400 | High |
The break-even for cloud GPU vs API is around 500K-1M images per month. For owned hardware, the payback period is 6-12 months depending on your existing infrastructure.
Docker Deployment
For production self-hosting, containerize the setup:
FROM nvidia/cuda:12.4-runtime-ubuntu22.04
RUN pip install vllm>=0.6.0 huggingface-hub
# Download model at build time (or mount as volume)
RUN huggingface-cli download deepseek-ai/DeepSeek-VL2-AWQ --local-dir /models/deepseek-vl2
EXPOSE 8000
CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
"--model", "/models/deepseek-vl2", \
"--quantization", "awq", \
"--trust-remote-code", \
"--max-model-len", "32768", \
"--port", "8000"]
docker build -t deepseek-vl2-server .
docker run --gpus all -p 8000:8000 deepseek-vl2-server
For production, use a docker-compose setup with health checks and restart policies:
version: "3.8"
services:
deepseek-vl2:
build: .
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "8000:8000"
volumes:
- ./models:/models
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
When Self-Hosting Makes Sense
After running both API and self-hosted setups in production, here’s my honest assessment:
Self-host when:
- You can’t send data to external servers (compliance, legal, contractual)
- You’re processing 500K+ images monthly and own GPU infrastructure
- You need zero-latency (local network) for real-time applications
- You’re in a jurisdiction where sending data to China is a legal risk
- You need guaranteed availability without depending on DeepSeek’s uptime
Use the API when:
- Volume is under 500K images/month
- You don’t have GPU infrastructure or ops expertise
- You need the highest quality (V4-Flash/Pro are better than VL2 open weights)
- Quick iteration is more important than cost optimization
- You’re prototyping and don’t want infrastructure overhead
The middle ground: Hybrid approach
Many teams use the API for development and testing, then self-host for production data processing. This gives you the API’s convenience during development and the privacy/cost benefits of self-hosting for actual workloads.
For a comparison of how self-hosted quality compares to API alternatives, see our benchmark comparison article.
Monitoring Your Local Deployment
Once you’re running in production, monitor these metrics:
import requests
def check_health(base_url: str = "http://localhost:8000"):
# vLLM exposes metrics at /metrics (Prometheus format)
metrics = requests.get(f"{base_url}/metrics").text
# Key metrics to watch:
# - vllm:num_requests_running (current load)
# - vllm:num_requests_waiting (queue depth)
# - vllm:gpu_cache_usage_perc (memory pressure)
# - vllm:avg_generation_throughput_toks_per_s (performance)
return metrics
Set up alerts when GPU cache usage exceeds 90% or request queue depth grows consistently. These indicate you need to scale up (more GPUs) or optimize (better batching, shorter max_tokens).
FAQ
Can I run DeepSeek Vision on a Mac with Apple Silicon?
Yes, using Ollama or llama.cpp with the GGUF format. An M2 Ultra with 192GB unified memory can run the full 27B model in Q5_K_M quantization. An M4 Pro with 48GB handles the 16B model comfortably. Performance is roughly 3-5x slower than an equivalent NVIDIA GPU due to Metal’s lower throughput on matrix operations. Fine for development, not great for production throughput.
How much does GPU rental cost for self-hosting?
A100 80GB instances run roughly $1.50-2.50/hour on major cloud providers (Lambda, RunPod, Vast.ai). Reserved instances drop to $0.80-1.20/hour with annual commitments. For comparison, processing 100K images on the API costs about $40 with V4-Flash. You’d need to run the GPU for 20-25 hours to process the same volume, costing $30-60. The API wins on small-to-medium volume.
Is the MIT license really “do anything”?
Yes. MIT is one of the most permissive open-source licenses. You can use DeepSeek-VL2 commercially, modify it, redistribute it, embed it in proprietary products, and never share your modifications. The only requirement is including the copyright notice. There are no usage restrictions, no “responsible AI” clauses, and no revenue-sharing. This is a genuine advantage over some alternatives that use more restrictive licenses.
What’s the quality gap between local VL2 and the V4-Flash API?
Roughly 2-4% lower accuracy on complex tasks, with minimal difference on simple OCR. The V4 API models have been trained on more data and fine-tuned more extensively than the open-weight VL2 release. For straightforward document extraction, you won’t notice the difference. For tasks requiring nuanced reasoning (comparing documents, interpreting ambiguous handwriting), the API models pull ahead meaningfully.
Can I fine-tune the model for my specific documents?
Yes. Since you have the full weights, you can fine-tune using LoRA or full fine-tuning on your document types. This can close the gap with the API models and even surpass them on your specific domain. You’ll need a few hundred labeled examples (image + expected output pairs). Tools like axolotl and unsloth make fine-tuning vision models reasonably straightforward with a single A100.
Should I be worried about running Chinese AI models?
The MIT-licensed weights are just numbers. There’s no phone-home functionality, no telemetry, no data collection. Once downloaded, the model runs entirely offline. The concern about Chinese data laws applies only to the API (where your data goes to their servers), not to self-hosting. Running the model locally is exactly as safe as running any other open-source software on your machine. Verify this yourself by monitoring network traffic during inference.