Jun 19, 2026 · 5 min read

Run DeepSeek V4 on a Vultr GPU Server (Complete Setup)

Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.

DeepSeek V4 is one of the strongest open-weight models available right now. The Flash variant uses a mixture-of-experts (MoE) architecture with only 13B active parameters per token, which keeps per-token compute low. Its total weights (about 50-60GB in FP16) fit on a single A100 80GB GPU, and it delivers performance that rivals much larger models.

Running it yourself means no rate limits, full control over system prompts, privacy for sensitive queries, and potentially lower cost than API pricing at very high volume.

In this tutorial, I’ll walk you through deploying DeepSeek V4 Flash on a Vultr GPU instance with vLLM, giving you an OpenAI-compatible API endpoint you can use from anywhere.

What You’ll Get

By the end of this tutorial:

DeepSeek V4 Flash running on an A100 80GB GPU
An OpenAI-compatible API endpoint (drop-in replacement)
~40-60 tokens/second generation speed
Full control over the model and inference settings

For the complete model breakdown, see our DeepSeek V4 Flash guide.

Getting Started

You’ll need a Vultr account with access to GPU instances:

Get $250 Vultr credits

The $250 credit covers ~135 hours of A100 time — enough to thoroughly test and benchmark your setup.
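As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The two constants are the prices quoted in this article and may not match current Vultr pricing; `credit_hours` is just an illustrative helper name.

```python
# Figures quoted in this article; actual pricing may differ.
CREDIT_USD = 250.00
A100_HOURLY_USD = 1.85


def credit_hours(credit_usd: float, hourly_usd: float) -> float:
    """Return how many instance-hours a credit balance covers."""
    return credit_usd / hourly_usd


hours = credit_hours(CREDIT_USD, A100_HOURLY_USD)
print(f"{hours:.0f} hours, or about {hours / 24:.1f} days of continuous runtime")
```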

Step 1: Create an A100 Instance

Log into Vultr → Deploy New Server
Select Cloud GPU
Choose NVIDIA A100 80GB ($1.85/hr)
OS: Ubuntu 22.04
Select a data center (pick the closest to your users)
Add your SSH key
Deploy

Wait 60 seconds for provisioning.

Why A100 80GB? VRAM has to hold every expert in the MoE architecture, not just the 13B parameters that are active for a given token. The total model weights are about 50-60GB in FP16, so you need the 80GB variant. The model won’t fit on an A40 48GB.
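The sizing argument can be sketched as a rough headroom estimate. This assumes vLLM's default of reserving 90% of VRAM and uses the 50-60GB weight range quoted above. It ignores activation memory and other runtime overhead, so treat the output as an upper bound on KV cache space, not a measurement.

```python
# Fraction of VRAM vLLM claims by default (see --gpu-memory-utilization).
GPU_MEMORY_UTILIZATION = 0.9


def kv_headroom_gb(vram_gb: float, weights_gb: float,
                   utilization: float = GPU_MEMORY_UTILIZATION) -> float:
    """VRAM left for KV cache after loading weights.

    A negative result means the weights alone do not fit.
    """
    return vram_gb * utilization - weights_gb


for gpu_name, vram_gb in (("A100 80GB", 80), ("A40 48GB", 48)):
    for weights_gb in (50, 60):  # the article's FP16 estimate range
        headroom = kv_headroom_gb(vram_gb, weights_gb)
        verdict = "fits" if headroom > 0 else "does not fit"
        print(f"{gpu_name}, {weights_gb}GB weights: {headroom:+.1f}GB ({verdict})")
```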

Step 2: SSH In and Verify GPU

ssh root@YOUR_SERVER_IP
nvidia-smi

You should see:

NVIDIA A100-SXM4-80GB | 80GB VRAM | Driver 535.x | CUDA 12.x
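If you want to script this check instead of eyeballing it, a sketch along these lines should work. It relies on `nvidia-smi`'s CSV query mode. The 79,000 MiB threshold and the helper names are my own choices for illustration, not anything Vultr or NVIDIA prescribes.

```python
import subprocess

# nvidia-smi's CSV query mode prints one "name, memory" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpus(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem_mib)))
    return gpus


def has_gpu_with_at_least(gpus: list[tuple[str, int]], min_mib: int) -> bool:
    """True if any detected GPU reports at least min_mib of total memory."""
    return any(mem >= min_mib for _, mem in gpus)


def query_gpus() -> list[tuple[str, int]]:
    """Run nvidia-smi on this machine and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)

# On the server:
# gpus = query_gpus()
# print(gpus, has_gpu_with_at_least(gpus, 79_000))
```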

Step 3: Install vLLM

vLLM is one of the fastest open-source inference engines for LLMs. It supports DeepSeek V4’s MoE architecture natively.

pip install vllm --upgrade

This takes 2-3 minutes. vLLM pulls in PyTorch, CUDA libraries, and all dependencies.
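To confirm the install landed in the Python environment you expect, a small check like the following can help. It only reads package metadata, so it does not load CUDA or the model. The helper name is illustrative.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for pkg in ("vllm", "torch"):
    version = installed_version(pkg)
    print(f"{pkg}: {version or 'not installed'}")
```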

If you want to compare inference engines, check our vLLM vs Ollama vs llama.cpp vs TGI comparison.

Step 4: Download and Serve DeepSeek V4 Flash

Start vLLM with the DeepSeek V4 Flash model:

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --trust-remote-code \
  --port 8000

The first run downloads the model weights (~50GB). On Vultr’s network, this takes about 3-5 minutes.

Once loaded, you’ll see:

INFO: Started server process
INFO: Uvicorn running on http://0.0.0.0:8000

Key flags explained:

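--tensor-parallel-size 1: single GPU (increase for multi-GPU setups)
--max-model-len 16384: maximum context window in tokens (increase if you need more, at the cost of more VRAM for KV cache)
--trust-remote-code: lets vLLM run custom model code shipped in the model’s Hugging Face repository, which this setup relies on for DeepSeek’s custom architecture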
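Model loading can take a while after the download finishes, so scripts that depend on the server may want to wait for it. The sketch below polls the `/health` route that vLLM's OpenAI-compatible server exposes. The attempt count, delay, and function names are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(base_url: str, attempts: int = 60, delay: float = 5.0,
                     probe=is_ready, sleep=time.sleep) -> bool:
    """Poll the server until it is ready or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False

# On the server:
# wait_until_ready("http://localhost:8000")
```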

Step 5: Test the OpenAI-Compatible API

vLLM serves an OpenAI-compatible endpoint by default. Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
    "max_tokens": 200
  }'

You should get a response in 2-4 seconds with high-quality output.
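To put a number on generation speed, you can time a request and divide by the `usage.completion_tokens` field that OpenAI-style responses include. This is a rough sketch. The elapsed time includes prompt processing and HTTP overhead, so it will somewhat understate pure decode speed. Function names are illustrative.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed_completion(base_url: str, model: str, prompt: str,
                     max_tokens: int = 200) -> float:
    """Send one chat completion and return the observed tokens/second."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)

# On the server:
# print(timed_completion("http://localhost:8000",
#                        "deepseek-ai/DeepSeek-V4-Flash",
#                        "Explain quantum computing in 3 sentences."))
```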

Step 6: Use From Your Local Machine

Open port 8000 in Vultr’s firewall (restrict to your IP), then from your laptop:

curl http://YOUR_SERVER_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or use any OpenAI client library — just change the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER_IP:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Write a haiku about servers"}],
)
print(response.choices[0].message.content)
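For interactive use you will probably want tokens as they are generated, not one final blob. Here is a sketch of streaming with the same client, using `stream=True`. `stream_reply` is a hypothetical helper that accepts any client object shaped like the OpenAI one.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Print a chat completion as it streams in and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks have no content
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

# Usage (requires `pip install openai` and a running server):
# from openai import OpenAI
# client = OpenAI(base_url="http://YOUR_SERVER_IP:8000/v1", api_key="not-needed")
# stream_reply(client, "deepseek-ai/DeepSeek-V4-Flash", "Write a haiku about servers")
```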

Step 7: Run as a Background Service

Don’t let your model stop when you close SSH:

# Using screen
screen -S vllm
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --trust-remote-code \
  --port 8000
# Press Ctrl+A then D to detach

Or create a proper systemd service:

cat > /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM DeepSeek V4 Flash
After=network.target

[Service]
ExecStart=/usr/local/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 1 --max-model-len 16384 --trust-remote-code --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

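# Reload unit files so systemd sees the new service, then enable and start it
systemctl daemon-reload
systemctl enable --now vllm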

Cost Analysis: Self-Hosted vs API

Let’s compare running DeepSeek V4 Flash yourself vs using the DeepSeek API:

Self-hosted on Vultr (A100):

Cost: $1.85/hr = $44.40/day = ~$1,332/month
Throughput: ~50 tokens/sec sustained
Daily capacity: ~4.3M tokens/day
Effective cost: ~$0.01 per 1K tokens

DeepSeek API pricing:

Input: $0.07 per 1M tokens
Output: $0.28 per 1M tokens
Blended average: ~$0.14 per 1M tokens (equivalent to a 2:1 input-to-output mix) = $0.00014 per 1K tokens
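The arithmetic behind both lists can be reproduced in a few lines. All inputs are the figures quoted in this article. The 2:1 input-to-output mix is my assumption, chosen because it reproduces the ~$0.14 blended price, and the break-even line is derived from those same inputs.

```python
HOURLY_USD = 1.85            # A100 price quoted in this article
TOKENS_PER_SECOND = 50       # sustained single-stream throughput quoted above
API_INPUT_USD_PER_M = 0.07   # API input price per 1M tokens
API_OUTPUT_USD_PER_M = 0.28  # API output price per 1M tokens


def blended_api_price(input_share: float) -> float:
    """API price per 1M tokens for a given fraction of input tokens."""
    return (input_share * API_INPUT_USD_PER_M
            + (1 - input_share) * API_OUTPUT_USD_PER_M)


daily_cost = HOURLY_USD * 24
daily_tokens = TOKENS_PER_SECOND * 86_400
self_hosted_per_m = daily_cost / (daily_tokens / 1e6)
api_per_m = blended_api_price(2 / 3)  # assumed 2:1 input:output mix
break_even_tokens_per_day = daily_cost / api_per_m * 1e6

print(f"Self-hosted: ${daily_cost:.2f}/day, {daily_tokens / 1e6:.2f}M tokens/day, "
      f"${self_hosted_per_m:.2f} per 1M tokens")
print(f"API (blended): ${api_per_m:.2f} per 1M tokens")
print(f"Price parity needs ~{break_even_tokens_per_day / 1e6:.0f}M tokens/day")
```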

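Break-even point: On these numbers the API is roughly 70x cheaper per token ($0.14 vs ~$10 per 1M tokens at a single 50 tokens/sec stream). Matching the API on price alone would take about 317M tokens/day, which is far beyond single-stream capacity and only reachable with heavily batched traffic. For most workloads the API is cheaper, and self-hosting pays off when you need full privacy, freedom from rate limits, or strict latency guarantees.

When self-hosting wins:

Sustained, heavily batched volume that keeps the GPU saturated around the clock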
Privacy requirements (healthcare, legal, finance)
No rate limits for batch processing
Custom model modifications or fine-tuning
Guaranteed low latency (no queuing)

For more on running DeepSeek locally, see how to run DeepSeek V4 locally.

Performance Tuning

Squeeze more out of your deployment:

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --port 8000

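--gpu-memory-utilization 0.95: let vLLM claim 95% of VRAM instead of the default 90%, leaving more room for KV cache
--max-num-batched-tokens 32768: upper bound on tokens processed per scheduler step (higher values let vLLM batch more concurrent requests and longer prompts together)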
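These settings only pay off when the server actually receives concurrent requests, since vLLM batches whatever is in flight. Here is a minimal client-side sketch that fans prompts out over threads. `run_concurrently` and the worker count are illustrative choices, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(send, prompts, max_workers: int = 8) -> list:
    """Call send(prompt) for every prompt in parallel threads.

    Results come back in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))

# Usage with the OpenAI client (requires `pip install openai`):
# from openai import OpenAI
# client = OpenAI(base_url="http://YOUR_SERVER_IP:8000/v1", api_key="not-needed")
# def ask(prompt):
#     reply = client.chat.completions.create(
#         model="deepseek-ai/DeepSeek-V4-Flash",
#         messages=[{"role": "user", "content": prompt}],
#     )
#     return reply.choices[0].message.content
# answers = run_concurrently(ask, ["Question 1", "Question 2", "Question 3"])
```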

For serving details and optimization, check our serve LLMs with vLLM guide.

FAQ

Does DeepSeek V4 Flash really fit on one A100?

Yes. Despite being a large MoE model, only 13B parameters are active per token. The total weights fit in about 50-60GB VRAM (FP16), leaving room for KV cache on an 80GB A100. You won’t fit it on a 40GB GPU though.

How does this compare to running through Ollama?

Ollama is simpler to set up but vLLM is significantly faster for serving (2-3x throughput), supports batched requests, and provides a native OpenAI-compatible API. Use Ollama for experimentation, vLLM for production serving.

Can I serve multiple models simultaneously?

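Not easily on a single A100 with DeepSeek V4 Flash, since it uses most of the VRAM. You’d need multi-GPU (2x A100) or a separate instance for each model. Each vLLM server process serves one model, so running several means one process per model (each on its own GPU and port) behind a router or reverse proxy of your choice.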

What if I only need the model for a few hours a day?

That’s the beauty of Vultr’s hourly billing. Run for 4 hours = $7.40. Create a script that starts the server, does your batch processing, and then destroys the instance via Vultr’s API. No waste.
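A teardown script could look roughly like the sketch below. It assumes Vultr's v2 REST API, where an instance is destroyed with `DELETE /v2/instances/{instance-id}` and a bearer token. Check the current API reference before relying on it. The instance ID, API key, and helper names are placeholders.

```python
import urllib.request

HOURLY_USD = 1.85                         # A100 price quoted in this article
VULTR_API = "https://api.vultr.com/v2"    # assumed API base URL


def session_cost(hours: float, hourly_usd: float = HOURLY_USD) -> float:
    """Cost in USD of running the instance for the given number of hours."""
    return round(hours * hourly_usd, 2)


def build_destroy_request(instance_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that destroys an instance."""
    return urllib.request.Request(
        f"{VULTR_API}/instances/{instance_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


print(f"4 hours costs ${session_cost(4):.2f}")

# After your batch job finishes (this is irreversible):
# urllib.request.urlopen(build_destroy_request("YOUR_INSTANCE_ID", "YOUR_API_KEY"))
```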
