Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
DeepSeek V4 is one of the strongest open-weight models available right now. The Flash variant uses a mixture-of-experts (MoE) architecture with only 13B active parameters, so it fits on a single A100 80GB GPU while delivering performance that rivals much larger models.
Running it yourself means no rate limits, full control over system prompts, privacy for sensitive queries, and potentially lower cost than the API at very high volume.
In this tutorial, I’ll walk you through deploying DeepSeek V4 Flash on a Vultr GPU instance with vLLM, giving you an OpenAI-compatible API endpoint you can use from anywhere.
What You’ll Get
By the end of this tutorial:
- DeepSeek V4 Flash running on an A100 80GB GPU
- An OpenAI-compatible API endpoint (drop-in replacement)
- ~40-60 tokens/second generation speed
- Full control over the model and inference settings
For the complete model breakdown, see our DeepSeek V4 Flash guide.
Getting Started
You’ll need a Vultr account with access to GPU instances.
The $250 credit covers ~135 hours of A100 time at $1.85/hr ($250 / $1.85), enough to thoroughly test and benchmark your setup.
Step 1: Create an A100 Instance
- Log into Vultr → Deploy New Server
- Select Cloud GPU
- Choose NVIDIA A100 80GB ($1.85/hr)
- OS: Ubuntu 22.04
- Select a data center (pick the closest to your users)
- Add your SSH key
- Deploy
Wait 60 seconds for provisioning.
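Why A100 80GB? DeepSeek V4 Flash has 13B active parameters in a larger MoE architecture. Only the active parameters are used per token, but all of the weights still have to sit in VRAM. The total model weights are about 50-60GB in FP16, so you need the 80GB variant. The model won’t fit on a 48GB A40.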
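Here’s a rough sketch of that sizing arithmetic. The article’s numbers only give the FP16 footprint, so the total parameter count below (27.5B) is my own assumption, back-derived from the 50-60GB figure. The function names are hypothetical too.

```python
def weight_size_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * 1e9 * bytes_per_param / 1e9


def fits(weights_gb: float, vram_gb: float, usable_fraction: float = 0.9) -> bool:
    """True if the weights fit inside the share of VRAM the server may use."""
    return weights_gb <= vram_gb * usable_fraction


# Assumption: ~27.5B total parameters, back-derived from the 50-60GB FP16 figure.
ASSUMED_TOTAL_PARAMS_B = 27.5
fp16_gb = weight_size_gb(ASSUMED_TOTAL_PARAMS_B, 2)  # FP16 = 2 bytes per parameter

for name, vram in [("A40 48GB", 48), ("A100 80GB", 80)]:
    print(f"{name}: weights ~{fp16_gb:.0f} GB, fits={fits(fp16_gb, vram)}")
```

The 0.9 default mirrors vLLM’s default GPU memory utilization. Whatever is left after the weights goes to the KV cache, which is why a card that only barely fits the weights is not enough.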
Step 2: SSH In and Verify GPU
ssh root@YOUR_SERVER_IP
nvidia-smi
You should see:
NVIDIA A100-SXM4-80GB | 80GB VRAM | Driver 535.x | CUDA 12.x
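If you’re scripting the setup, you can do the same check in code. This is a minimal sketch that uses nvidia-smi’s CSV query mode. The 80,000 MiB threshold and the helper names are my own choices, not anything Vultr or NVIDIA prescribes.

```python
import subprocess

# nvidia-smi reports memory.total in MiB when "nounits" is set.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Split one CSV line such as 'NVIDIA A100-SXM4-80GB, 81920'."""
    name, mem_mib = [part.strip() for part in line.split(",")]
    return name, int(mem_mib)


def has_enough_vram(line: str, min_mib: int = 80_000) -> bool:
    """True if the GPU described by the line has at least min_mib of VRAM."""
    _, mem_mib = parse_gpu_line(line)
    return mem_mib >= min_mib


if __name__ == "__main__":
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available; is the NVIDIA driver installed?")
    else:
        for gpu_line in out.strip().splitlines():
            print(parse_gpu_line(gpu_line), "ok" if has_enough_vram(gpu_line) else "too small")
```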
Step 3: Install vLLM
vLLM is one of the fastest open-source inference engines for LLMs, and it supports MoE architectures like DeepSeek V4’s.
pip install vllm --upgrade
This takes 2-3 minutes. vLLM pulls in PyTorch, CUDA libraries, and all dependencies.
If you want to compare inference engines, check our vLLM vs Ollama vs llama.cpp vs TGI comparison.
Step 4: Download and Serve DeepSeek V4 Flash
Start vLLM with the DeepSeek V4 Flash model:
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--trust-remote-code \
--port 8000
The first run downloads the model weights (~50GB). On Vultr’s network, this takes about 3-5 minutes.
Once loaded, you’ll see:
INFO: Started server process
INFO: Uvicorn running on http://0.0.0.0:8000
Key flags explained:
- --tensor-parallel-size 1: single GPU (increase for multi-GPU setups)
- --max-model-len 16384: max context window (increase if you need more, but it uses more VRAM)
- --trust-remote-code: required for DeepSeek’s custom architecture
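Because the download and load take a few minutes, a deploy script needs to know when the server is actually ready. Below is a small polling sketch. It assumes the server answers GET /v1/models once it is up (part of the OpenAI-compatible API). The function names, timeout, and interval are hypothetical defaults.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url: str) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, timeout_s=900.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call check() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage on the server (not executed here):
# ready = wait_until_ready(lambda: server_is_up("http://localhost:8000"))
```

Passing the check as a callable keeps the loop easy to test without a running server.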
Step 5: Test the OpenAI-Compatible API
vLLM serves an OpenAI-compatible endpoint by default. Test it:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [
{"role": "user", "content": "Explain quantum computing in 3 sentences."}
],
"max_tokens": 200
}'
You should get a response in 2-4 seconds with high-quality output.
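To check the tokens/second figure yourself, you can time the same request from Python. This is a sketch using only the standard library. It relies on the usage.completion_tokens field that OpenAI-style responses include, and the helper names are made up for this example. The result is end-to-end, so it includes prompt processing time, not just generation.

```python
import json
import time
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def timed_completion(req: urllib.request.Request) -> float:
    """Send the request and return end-to-end tokens/second."""
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=120) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


# Usage on the server (not executed here):
# req = build_chat_request("http://localhost:8000", "deepseek-ai/DeepSeek-V4-Flash",
#                          "Explain quantum computing in 3 sentences.")
# print(f"{timed_completion(req):.1f} tokens/sec")
```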
Step 6: Use From Your Local Machine
Open port 8000 in Vultr’s firewall (restrict to your IP), then from your laptop:
curl http://YOUR_SERVER_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Or use any OpenAI client library — just change the base URL:
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER_IP:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Write a haiku about servers"}],
)
print(response.choices[0].message.content)
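If you’d rather see tokens as they arrive, OpenAI-style endpoints can stream the reply when the request sets "stream": true. The client libraries handle this for you, but it helps to know what is on the wire. Here is a sketch that parses the server-sent event lines by hand. I’m assuming the usual OpenAI chunk shape (data: lines with choices[0].delta.content, ending in data: [DONE]), and the function names are hypothetical.

```python
import json
from typing import Iterable, Optional


def delta_from_sse_line(line: str) -> Optional[str]:
    """Return the text fragment carried by one 'data: {...}' line, if any."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, etc.
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream marker
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate all text fragments from a sequence of SSE lines."""
    return "".join(piece for piece in map(delta_from_sse_line, lines) if piece)


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Racks hum "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "at midnight"}}]}',
    "data: [DONE]",
]
print(join_stream(sample))
```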
Step 7: Run as a Background Service
Don’t let your model stop when you close SSH:
# Using screen
screen -S vllm
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--trust-remote-code \
--port 8000
# Press Ctrl+A then D to detach
Or create a proper systemd service:
cat > /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM DeepSeek V4 Flash
After=network.target
[Service]
ExecStart=/usr/local/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 1 --max-model-len 16384 --trust-remote-code --port 8000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now vllm
Cost Analysis: Self-Hosted vs API
Let’s compare running DeepSeek V4 Flash yourself vs using the DeepSeek API:
Self-hosted on Vultr (A100):
- Cost: $1.85/hr = $44.40/day = ~$1,332/month
- Throughput: ~50 tokens/sec sustained
- Daily capacity: ~4.3M tokens/day
- Effective cost: ~$0.01 per 1K tokens
DeepSeek API pricing:
- Input: $0.07 per 1M tokens
- Output: $0.28 per 1M tokens
- Blended (assuming a 2:1 input-to-output token mix): ~$0.14 per 1M tokens = $0.00014 per 1K tokens
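Break-even point: On per-token cost alone, the API is roughly 70x cheaper at these figures (~$0.01 vs. $0.00014 per 1K tokens). Self-hosting makes sense when batching pushes your throughput far above the single-stream number, or when full privacy, strict latency requirements, or rate limits matter more than price.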
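The arithmetic behind that comparison, as a small sketch you can rerun with your own numbers. The inputs are the figures quoted above. The function names are mine, and the model deliberately ignores storage, bandwidth, and idle time.

```python
HOURLY_USD = 1.85          # A100 80GB price quoted above
API_USD_PER_MILLION = 0.14  # blended API price quoted above


def self_hosted_usd_per_million(hourly_usd: float, tokens_per_sec: float) -> float:
    """Cost of 1M generated tokens on a GPU that is busy the whole hour."""
    return hourly_usd / (tokens_per_sec * 3600) * 1_000_000


def break_even_tokens_per_sec(hourly_usd: float, api_usd_per_million: float) -> float:
    """Sustained throughput at which self-hosting matches the API price."""
    return hourly_usd / 3600 / api_usd_per_million * 1_000_000


cost_at_50 = self_hosted_usd_per_million(HOURLY_USD, 50)
needed_tps = break_even_tokens_per_sec(HOURLY_USD, API_USD_PER_MILLION)

print(f"Self-hosted at 50 tok/s: ${cost_at_50:.2f} per 1M tokens")
print(f"Break-even throughput:   {needed_tps:,.0f} tok/s sustained")
print(f"Break-even volume:       {needed_tps * 86400 / 1e6:,.0f}M tokens/day")
```

The break-even throughput is an aggregate across all concurrent requests, so it only becomes reachable (if at all) with heavy batching.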
When self-hosting wins:
- You sustain very high volume with heavy batching (roughly 317M tokens/day to match $44.40/day at $0.14 per 1M tokens)
- Privacy requirements (healthcare, legal, finance)
- No rate limits for batch processing
- Custom model modifications or fine-tuning
- Guaranteed low latency (no queuing)
For more on running DeepSeek locally, see how to run DeepSeek V4 locally.
Performance Tuning
Squeeze more out of your deployment:
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--port 8000
- --gpu-memory-utilization 0.95: use more VRAM (default is 0.9)
- --max-num-batched-tokens 32768: raise the cap on tokens processed per scheduler step, so vLLM can batch more work from concurrent requests
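To see what the utilization bump buys you, here is a back-of-the-envelope sketch. It treats everything inside the budget that isn’t weights as KV cache space, which overstates the real number a bit since activations and CUDA overhead also live there. The 55GB weight figure is the midpoint of the 50-60GB range above, and the function name is hypothetical.

```python
def kv_cache_budget_gb(vram_gb: float, gpu_memory_utilization: float,
                       weights_gb: float) -> float:
    """Rough upper bound on memory left for the KV cache, in GB."""
    return vram_gb * gpu_memory_utilization - weights_gb


WEIGHTS_GB = 55.0  # assumption: midpoint of the 50-60GB FP16 estimate

for util in (0.90, 0.95):
    budget = kv_cache_budget_gb(80, util, WEIGHTS_GB)
    print(f"utilization {util:.2f}: ~{budget:.0f} GB for KV cache")
```

Going from 0.9 to 0.95 adds about 4GB of headroom on an 80GB card, which is what lets the longer 32768 context and larger batches fit.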
For serving details and optimization, check our serve LLMs with vLLM guide.
FAQ
Does DeepSeek V4 Flash really fit on one A100?
Yes. Despite being a large MoE model, only 13B parameters are active per token. The total weights fit in about 50-60GB VRAM (FP16), leaving room for KV cache on an 80GB A100. You won’t fit it on a 40GB GPU though.
How does this compare to running through Ollama?
Ollama is simpler to set up but vLLM is significantly faster for serving (2-3x throughput), supports batched requests, and provides a native OpenAI-compatible API. Use Ollama for experimentation, vLLM for production serving.
Can I serve multiple models simultaneously?
Not easily on a single A100 with DeepSeek V4 Flash, since it uses most of the VRAM. You’d need multi-GPU (2x A100) or a separate instance for each model. Each vLLM server process serves one model, so plan on one process per model with a router or reverse proxy in front.
What if I only need the model for a few hours a day?
That’s the beauty of Vultr’s hourly billing. Run for 4 hours = $7.40. Create a script that starts the server, does your batch processing, and then destroys the instance via Vultr’s API. No waste.
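A minimal sketch of the teardown half of that script. It builds the delete call for Vultr’s v2 API (DELETE /v2/instances/{instance-id} with a Bearer token) without sending it. Check the endpoint against Vultr’s current API reference before relying on it. The instance ID, the API key, and the function names are placeholders of mine.

```python
import urllib.request

VULTR_API = "https://api.vultr.com/v2"


def session_cost_usd(hours: float, hourly_usd: float = 1.85) -> float:
    """Cost of one on-demand session at the hourly rate quoted above."""
    return round(hours * hourly_usd, 2)


def build_destroy_request(instance_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that deletes an instance."""
    return urllib.request.Request(
        f"{VULTR_API}/instances/{instance_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="DELETE",
    )


print(f"4-hour session: ${session_cost_usd(4):.2f}")

# Usage once the batch job has finished (not executed here):
# req = build_destroy_request("YOUR_INSTANCE_ID", "YOUR_VULTR_API_KEY")
# urllib.request.urlopen(req, timeout=30)
```

Destroying the instance also deletes the downloaded weights, so each new session pays the model download time again.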