
Self-Hosting DeepSeek Vision: Complete Local Setup Guide (2026)


DeepSeek released their vision models under an MIT license. That means you can download the weights, run them on your own hardware, and process images without any data leaving your infrastructure. No per-token costs. No rate limits. No sending sensitive documents to servers in China.

This guide covers everything from hardware requirements to a running local inference server. We’ll look at both vLLM (for maximum performance) and Ollama (for simplicity), plus quantization options that let you run on consumer GPUs.

If you’re deciding between the API and self-hosting, this article will help you figure out which makes sense for your situation. For API usage, see our complete DeepSeek Vision guide.

Why Self-Host?

Three reasons come up repeatedly:

Data privacy. The DeepSeek API routes through servers in China. For organizations subject to GDPR, HIPAA, SOC 2, or internal data governance policies, sending documents to foreign servers may violate compliance requirements. Self-hosting eliminates this concern entirely.

China data law concerns. China’s data security laws (DSL and PIPL) give the government broad authority to access data processed within its jurisdiction. Even if DeepSeek’s privacy policy says they don’t share data, the legal framework allows compelled access. Some organizations, particularly government contractors and defense-adjacent companies, have blanket policies against data processing in China.

Cost at scale. At high volume (500K+ images per month), owning or renting GPUs becomes cheaper than API calls. The break-even depends on your hardware costs, but the math favors self-hosting once you cross that threshold.

Available Models

DeepSeek released several vision-capable models you can self-host:

Model | Parameters | VRAM Required | Quality Level
DeepSeek-VL2-Tiny | 3B | 8GB | Basic OCR, simple descriptions
DeepSeek-VL2-Small | 16B | 36GB | Good general quality
DeepSeek-VL2 | 27B (MoE) | 48GB | Near API quality
DeepSeek-VL2-Large | 72B | 140GB+ | Matches V4-Flash API

The MoE (Mixture of Experts) architecture in the 27B model means it activates only a subset of parameters for each token, making it faster than a dense 27B model. All experts still have to be loaded into memory, so the savings are in compute per token rather than in the VRAM needed for the weights.

For most self-hosting scenarios, DeepSeek-VL2 (27B) offers the best quality-to-cost ratio. It runs on a single A100 80GB or two A6000 48GB GPUs.

Hardware Requirements

Minimum (DeepSeek-VL2-Tiny, 3B)

  • GPU: RTX 3060 12GB or better
  • RAM: 16GB system
  • Storage: 10GB for model weights
  • Good for: Testing, simple OCR, hobbyist projects

Recommended (DeepSeek-VL2, 27B)

  • GPU: A100 80GB or 2x A6000 48GB
  • RAM: 64GB system
  • Storage: 60GB for model weights (full precision) or 15GB (quantized)
  • Good for: Production workloads, high-quality results

High Performance (DeepSeek-VL2-Large, 72B)

  • GPU: 2x A100 80GB or 4x A6000 48GB
  • RAM: 128GB system
  • Storage: 150GB for model weights
  • Good for: Maximum quality, matching V4-Flash API output

Consumer GPU Options (Quantized)

GPU | VRAM | Best Model Fit
RTX 4060 Ti 16GB | 16GB | VL2-Tiny (full), VL2-Small (4-bit)
RTX 4080 16GB | 16GB | VL2-Tiny (full), VL2-Small (4-bit)
RTX 4090 24GB | 24GB | VL2-Small (full), VL2-27B (4-bit)
RTX 5090 32GB | 32GB | VL2-27B (5-bit)
2x RTX 4090 | 48GB | VL2-27B (full)
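
A rough way to sanity-check pairings like these is to estimate the size of the weights alone: parameter count times bits per weight. The sketch below does only that arithmetic. The function name is illustrative, and the result deliberately excludes the KV cache, activations, and the vision encoder's working memory, so real VRAM needs will be higher than what it prints.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the Available Models table
    for name, params in [("VL2-Tiny", 3), ("VL2-Small", 16), ("VL2-27B", 27)]:
        fp16 = weight_size_gb(params, 16)
        q4 = weight_size_gb(params, 4)
        print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Treat the output as a lower bound: if the weights alone don't fit, nothing else will, but fitting the weights doesn't guarantee room for long contexts or large batches.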

Downloading the Weights

Models are hosted on HuggingFace. You’ll need git-lfs:

# Install git-lfs if you don't have it
sudo apt install git-lfs  # Ubuntu/Debian
brew install git-lfs       # macOS

git lfs install

# Clone the model (27B, recommended)
git clone https://huggingface.co/deepseek-ai/DeepSeek-VL2
# This downloads ~55GB

# Or just the small model for testing
git clone https://huggingface.co/deepseek-ai/DeepSeek-VL2-Tiny
# This downloads ~7GB

Alternatively, use the huggingface-hub Python package for more control:

pip install huggingface-hub

huggingface-cli download deepseek-ai/DeepSeek-VL2 --local-dir ./models/deepseek-vl2

Option 1: vLLM Setup (Maximum Performance)

vLLM gives you the best throughput and supports continuous batching, which means multiple requests are processed efficiently in parallel.

Installation

# Quote the requirement so the shell doesn't treat ">" as a redirect
pip install "vllm>=0.6.0"

Starting the Server

python -m vllm.entrypoints.openai.api_server \
    --model ./models/deepseek-vl2 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --port 8000 \
    --served-model-name deepseek-vl2

For multi-GPU setups:

python -m vllm.entrypoints.openai.api_server \
    --model ./models/deepseek-vl2 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.9 \
    --port 8000 \
    --served-model-name deepseek-vl2
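
Loading the weights can take several minutes, and requests sent before the server is up will simply fail. One way to handle that is to poll the server's /health endpoint before sending work. This is a minimal sketch using only the standard library. The function names and the default timeout and interval are my own assumptions, not vLLM settings.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,   # assumed upper bound on model load time
    interval_s: float = 5.0,    # assumed polling interval
    probe: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Poll the /health endpoint until it responds or the timeout expires."""
    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not come up in time")
```

The `probe` parameter exists so the loop can be exercised without a running server. In normal use you can leave it unset.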

Using the Server

The vLLM server exposes an OpenAI-compatible API. Your existing code works with just a URL change:

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # local server doesn't check keys
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="deepseek-vl2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64,<your-base64>"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)

That’s the same code from our Python tutorial, just pointing at localhost. Zero code changes needed if you abstract the base URL into an environment variable.
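
Here is one way that could look in practice: read the base URL from the environment and build the base64 data URL from a local file. Treat it as a sketch. The environment variable names (`VISION_BASE_URL`, `VISION_MODEL`) and the helper names are hypothetical, so rename them to fit your project.

```python
import base64
import mimetypes
import os
from pathlib import Path

# Hypothetical variable names; the defaults point at the local vLLM server
BASE_URL = os.environ.get("VISION_BASE_URL", "http://localhost:8000/v1")
MODEL = os.environ.get("VISION_MODEL", "deepseek-vl2")


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(prompt: str, image_path: str) -> list:
    """Build a chat `messages` list in the same shape as the example above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }
    ]
```

With these in place, the client becomes `OpenAI(api_key="not-needed", base_url=BASE_URL)` and the request takes `model=MODEL, messages=build_messages("What's in this image?", "invoice.jpg")`. Switching between the hosted API and your local server is then a matter of changing two environment variables.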

vLLM Performance Tuning

# Enable prefix caching (helps with repeated prompts)
--enable-prefix-caching

# Increase batch size for throughput (uses more VRAM)
--max-num-batched-tokens 8192

# Enable chunked prefill for better latency on long inputs
--enable-chunked-prefill

Option 2: Ollama Setup (Simplest)

Ollama is the easiest way to get started. One command to download, one to run. Less configuration, less throughput than vLLM, but dead simple.

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Running DeepSeek-VL2

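# Start the server first if it isn't already running
# (the Linux install script usually sets it up as a background service)
ollama serve

# In another terminal, pull the model (quantized versions available)
ollama pull deepseek-vl2

# Or a specific quantization
ollama pull deepseek-vl2:q4_K_M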

Using with Ollama

Ollama exposes its own native API, and it also supports an OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1"
)

# Same usage as above
response = client.chat.completions.create(
    model="deepseek-vl2",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract text from this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64,<base64-data>"}
                }
            ]
        }
    ],
    max_tokens=1000
)
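
If you'd rather skip the OpenAI client, Ollama's native /api/chat endpoint accepts images as a list of base64 strings on the message. The sketch below uses only the standard library. The helper names are illustrative, and the model tag is the one assumed throughout this guide.

```python
import base64
import json
import urllib.request


def build_ollama_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a request body for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "stream": False,  # ask for a single JSON response instead of a stream
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # The native API takes plain base64 strings, not data URLs
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
    }


def ollama_chat(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.loads(resp.read())["message"]["content"]
```

A call would look like `ollama_chat(build_ollama_chat_payload("deepseek-vl2", "Extract text from this image.", open("invoice.jpg", "rb").read()))`.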

Ollama Limitations

Ollama is great for simplicity but has tradeoffs:

  • No continuous batching (sequential processing only)
  • Lower throughput than vLLM under load
  • Memory management is less sophisticated
  • Limited control over quantization parameters

For production workloads serving multiple concurrent users, use vLLM. For local development and single-user scenarios, Ollama is perfect.
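
Continuous batching only pays off if requests actually reach the server in parallel. A single loop that sends one image at a time leaves vLLM with nothing to batch. A small thread pool on the client side is one simple way to keep several requests in flight. In this sketch, `extract` stands in for whatever function sends one image to your server and returns the text, and `max_workers=8` is an arbitrary starting point rather than a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_concurrently(
    image_paths: List[str],
    extract: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Run `extract` over all images with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, image_paths))
```

Against Ollama the same code still works, but the requests will mostly queue up behind each other, so expect little or no speedup there.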

Quantization Options

If you don’t have enterprise GPUs, quantization lets you run larger models on less VRAM. Here’s what’s available:

GGUF Format (Ollama / llama.cpp)

Quantization | VRAM (27B) | Quality Loss | Speed Impact
Q8_0 | 30GB | Negligible | Minimal
Q6_K | 24GB | Very slight | ~5% slower
Q5_K_M | 20GB | Minor | ~10% slower
Q4_K_M | 16GB | Noticeable on complex tasks | ~15% slower
Q3_K_M | 13GB | Significant | ~20% slower

My recommendation: Q5_K_M is the sweet spot. It fits on an RTX 4090 and maintains over 95% of full-precision quality for OCR and image description tasks. Drop to Q4_K_M if you need to squeeze into 16GB.
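
The same selection logic can be written as a small lookup over the table. This is only a sketch: the VRAM figures are copied from the table above, while the `headroom_gb` parameter is my own addition for reserving memory for context and other processes.

```python
# VRAM figures for the 27B model from the table above (GB),
# ordered from highest quality to lowest
GGUF_VRAM_GB = [
    ("Q8_0", 30),
    ("Q6_K", 24),
    ("Q5_K_M", 20),
    ("Q4_K_M", 16),
    ("Q3_K_M", 13),
]


def pick_quant(vram_gb: float, headroom_gb: float = 0.0):
    """Return the highest-quality quantization that fits, or None if none do."""
    budget = vram_gb - headroom_gb
    for name, needed in GGUF_VRAM_GB:
        if needed <= budget:
            return name
    return None
```

For example, `pick_quant(16)` returns `"Q4_K_M"`, and `pick_quant(24, headroom_gb=2)` returns `"Q5_K_M"`, which matches the recommendation for an RTX 4090 once you leave a little memory free.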

AWQ Format (vLLM)

AWQ (Activation-aware Weight Quantization) is optimized for GPU inference:

# Download AWQ-quantized version
huggingface-cli download deepseek-ai/DeepSeek-VL2-AWQ --local-dir ./models/deepseek-vl2-awq

# Run with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./models/deepseek-vl2-awq \
    --quantization awq \
    --trust-remote-code \
    --max-model-len 32768 \
    --port 8000

AWQ generally gives better throughput than GGUF on NVIDIA GPUs because it’s designed for CUDA tensor cores. Quality is comparable to Q4 GGUF.

GPTQ Format

GPTQ is another option, particularly well-supported by older tooling:

huggingface-cli download deepseek-ai/DeepSeek-VL2-GPTQ-Int4 --local-dir ./models/deepseek-vl2-gptq

In 2026, AWQ has largely superseded GPTQ for new deployments. AWQ is faster and marginally more accurate at the same bit width.

Performance Benchmarks: Local vs API

I benchmarked both setups processing 100 invoice images (average 800x600px, typical business documents):

Throughput (images per minute)

Setup | Images/min | Notes
DeepSeek API (V4-Flash) | 45-60 | Depends on rate limit tier
vLLM (A100 80GB, FP16) | 35-40 | Single GPU, batch size 4
vLLM (2x A6000, FP16) | 50-55 | Tensor parallel
vLLM (RTX 4090, Q4_K_M) | 12-15 | Consumer GPU
Ollama (RTX 4090, Q4_K_M) | 8-10 | Sequential only

Quality Comparison (OCR accuracy on invoice test set)

Setup | Accuracy | Notes
DeepSeek API (V4-Flash) | 96.2% | Production model
DeepSeek API (V4-Pro) | 97.8% | Reasoning model
Local VL2-27B (FP16) | 94.1% | Full precision
Local VL2-27B (Q5_K_M) | 93.5% | Slight quality loss
Local VL2-27B (Q4_K_M) | 91.8% | Noticeable on handwriting
Local VL2-Tiny (3B) | 82.3% | Basic OCR only

The API models (V4-Flash and V4-Pro) are newer and more capable than the open-weight VL2 release. Expect roughly 2-4 percentage points lower accuracy with the local model at full precision. Quantization adds up to another 1-3 points of loss depending on the level.

Cost Comparison (Monthly, 100K images)

Option | Monthly Cost | Setup Effort
DeepSeek API (V4-Flash) | ~$40 | Minimal
Cloud A100 (on-demand) | ~$2,400 | Medium
Cloud A100 (reserved 1yr) | ~$1,200 | Medium
Own RTX 4090 (electricity) | ~$50 | High initial
Own A100 (colocation) | ~$400 | High

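Dividing each fixed monthly cost by the API's per-image price (about $0.0004, from $40 per 100K images) gives the break-even volume: roughly 125K images per month for an owned RTX 4090, 1M for a colocated A100, and 3M or more for an always-on cloud A100. Rented GPUs therefore only beat the API at lower hourly rates or when you pay just for the hours you use. For owned hardware, the payback period is 6-12 months depending on your existing infrastructure.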
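
To check a break-even figure against your own quotes, divide the fixed monthly cost by the API's per-image price. The sketch below uses the numbers from the table as defaults. It is deliberately crude: it ignores hardware purchase price, engineering time, the throughput ceiling of a single GPU, and the accuracy gap between VL2 and the API models.

```python
def api_cost_per_image(cost_per_100k: float = 40.0) -> float:
    """API price per image, from the ~$40 per 100K images figure above."""
    return cost_per_100k / 100_000


def break_even_images(fixed_monthly_cost: float, per_image_api_cost: float) -> float:
    """Monthly volume at which a fixed-cost setup matches the API bill."""
    return fixed_monthly_cost / per_image_api_cost


if __name__ == "__main__":
    price = api_cost_per_image()
    # Monthly costs copied from the cost comparison table
    options = {
        "Cloud A100 (on-demand)": 2400,
        "Cloud A100 (reserved 1yr)": 1200,
        "Own RTX 4090 (electricity)": 50,
        "Own A100 (colocation)": 400,
    }
    for name, monthly in options.items():
        print(f"{name}: ~{break_even_images(monthly, price):,.0f} images/month")
```

Swap in your actual GPU quote and API pricing before drawing conclusions. Small changes in either move the break-even point a lot.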

Docker Deployment

For production self-hosting, containerize the setup:

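FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

# The CUDA runtime image ships without Python; curl is needed for container health checks
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

# Quote the requirement so the shell doesn't treat ">" as a redirect
RUN pip3 install "vllm>=0.6.0" huggingface-hub

# Download model at build time (or mount as volume)
RUN huggingface-cli download deepseek-ai/DeepSeek-VL2-AWQ --local-dir /models/deepseek-vl2

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "/models/deepseek-vl2", \
     "--quantization", "awq", \
     "--trust-remote-code", \
     "--max-model-len", "32768", \
     "--port", "8000"]

Build and run the image:

docker build -t deepseek-vl2-server .
docker run --gpus all -p 8000:8000 deepseek-vl2-server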

For production, use a docker-compose setup with health checks and restart policies:

version: "3.8"
services:
  deepseek-vl2:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"
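    volumes:
      # This mount hides any weights baked into the image under /models,
      # so ./models on the host must contain the deepseek-vl2 directory
      - ./models:/models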
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

When Self-Hosting Makes Sense

After running both API and self-hosted setups in production, here’s my honest assessment:

Self-host when:

  • You can’t send data to external servers (compliance, legal, contractual)
  • You’re processing 500K+ images monthly and own GPU infrastructure
  • You need minimal network latency (local network, no internet round trip) for real-time applications
  • You’re in a jurisdiction where sending data to China is a legal risk
  • You need guaranteed availability without depending on DeepSeek’s uptime

Use the API when:

  • Volume is under 500K images/month
  • You don’t have GPU infrastructure or ops expertise
  • You need the highest quality (V4-Flash/Pro are better than VL2 open weights)
  • Quick iteration is more important than cost optimization
  • You’re prototyping and don’t want infrastructure overhead

The middle ground: Hybrid approach

Many teams use the API for development and testing, then self-host for production data processing. This gives you the API’s convenience during development and the privacy/cost benefits of self-hosting for actual workloads.

To see how self-hosted quality stacks up against API alternatives, see our benchmark comparison article.

Monitoring Your Local Deployment

Once you’re running in production, monitor these metrics:

import requests

def check_health(base_url: str = "http://localhost:8000") -> str:
    """Fetch vLLM's Prometheus metrics as raw text."""
    # vLLM exposes metrics at /metrics (Prometheus format)
    response = requests.get(f"{base_url}/metrics", timeout=10)
    response.raise_for_status()  # fail loudly if the server returns an error

    # Key metrics to watch:
    # - vllm:num_requests_running (current load)
    # - vllm:num_requests_waiting (queue depth)
    # - vllm:gpu_cache_usage_perc (memory pressure)
    # - vllm:avg_generation_throughput_toks_per_s (performance)

    return response.text

Set up alerts when GPU cache usage exceeds 90% or request queue depth grows consistently. These indicate you need to scale up (more GPUs) or optimize (better batching, shorter max_tokens).
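
If you don't have a Prometheus stack yet, the metrics text is easy enough to parse by hand. The sketch below pulls out samples by metric name and applies the two checks described above. Several things here are assumptions: the function names, the queue threshold of 10 (a fixed cutoff rather than a true "growing consistently" trend check), and the cache metric being reported as a fraction between 0 and 1.

```python
from typing import Dict, List


def parse_metrics(text: str) -> Dict[str, List[float]]:
    """Collect sample values by metric name from Prometheus text-format output."""
    samples: Dict[str, List[float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            number = float(value)
        except ValueError:
            continue  # ignore anything that isn't a simple "name value" sample
        samples.setdefault(name, []).append(number)
    return samples


def find_alerts(
    samples: Dict[str, List[float]],
    max_cache_usage: float = 0.9,  # assumes the metric is a 0-1 fraction
    max_waiting: int = 10,         # hypothetical queue-depth threshold
) -> List[str]:
    """Return human-readable alerts for cache pressure and queue depth."""
    alerts = []
    cache = samples.get("vllm:gpu_cache_usage_perc", [])
    waiting = samples.get("vllm:num_requests_waiting", [])
    if cache and max(cache) > max_cache_usage:
        alerts.append(f"GPU cache usage at {max(cache):.0%}")
    if waiting and sum(waiting) > max_waiting:
        alerts.append(f"{sum(waiting):.0f} requests waiting")
    return alerts
```

Feed it the text returned by the /metrics endpoint on a schedule and forward any non-empty result to whatever alerting channel you already use.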

FAQ

Can I run DeepSeek Vision on a Mac with Apple Silicon?

Yes, using Ollama or llama.cpp with the GGUF format. An M2 Ultra with 192GB unified memory can run the full 27B model in Q5_K_M quantization. An M4 Pro with 48GB handles the 16B model comfortably. Performance is roughly 3-5x slower than an equivalent NVIDIA GPU due to Metal’s lower throughput on matrix operations. Fine for development, not great for production throughput.

How much does GPU rental cost for self-hosting?

A100 80GB instances run roughly $1.50-2.50/hour on major cloud providers (Lambda, RunPod, Vast.ai). Reserved instances drop to $0.80-1.20/hour with annual commitments. For comparison, processing 100K images on the API costs about $40 with V4-Flash. At the 35-40 images per minute I measured on a single A100, the same volume takes roughly 42-48 GPU-hours, or about $60-120 at on-demand rates. The API wins on small-to-medium volume.

Is the MIT license really “do anything”?

Yes. MIT is one of the most permissive open-source licenses. You can use DeepSeek-VL2 commercially, modify it, redistribute it, embed it in proprietary products, and never share your modifications. The only requirement is that copies include the copyright and permission notice. There are no usage restrictions, no “responsible AI” clauses, and no revenue-sharing. This is a genuine advantage over some alternatives that use more restrictive licenses.

What’s the quality gap between local VL2 and the V4-Flash API?

Roughly 2-4% lower accuracy on complex tasks, with minimal difference on simple OCR. The V4 API models have been trained on more data and fine-tuned more extensively than the open-weight VL2 release. For straightforward document extraction, you won’t notice the difference. For tasks requiring nuanced reasoning (comparing documents, interpreting ambiguous handwriting), the API models pull ahead meaningfully.

Can I fine-tune the model for my specific documents?

Yes. Since you have the full weights, you can fine-tune using LoRA or full fine-tuning on your document types. This can close the gap with the API models and even surpass them on your specific domain. You’ll need a few hundred labeled examples (image + expected output pairs). Tools like axolotl and unsloth make fine-tuning vision models reasonably straightforward with a single A100.

Should I be worried about running Chinese AI models?

The MIT-licensed weights are just numbers. There’s no phone-home functionality, no telemetry, no data collection. Once downloaded, the model runs entirely offline. The concern about Chinese data laws applies only to the API (where your data goes to their servers), not to self-hosting. Running the model locally is exactly as safe as running any other open-source software on your machine. Verify this yourself by monitoring network traffic during inference.