
How to Run DeepSeek V4 Locally: Hardware, Setup, and Deployment Guide (2026)


DeepSeek V4 comes in two variants, and only one of them is realistic for local deployment. V4-Flash uses a Mixture-of-Experts architecture with 284B total parameters but only 13B active per forward pass, making it feasible on a single multi-GPU server. V4-Pro, at 1.6 trillion parameters, requires a full cluster and is better suited for cloud deployment.

This guide covers hardware requirements, inference engine setup, quantization strategies, and deployment options for running DeepSeek V4 on your own machines. If you have worked with earlier DeepSeek models, check our general how to run DeepSeek locally guide for background.

V4-Flash vs V4-Pro: What Can Actually Run Locally?

V4-Flash is the model you want for local inference. Its MoE design means that despite having 284B total parameters across all experts, each token only activates roughly 13B parameters. This keeps compute requirements manageable while still delivering strong performance on coding, reasoning, and general tasks.

V4-Pro is a different story. At 1.6T parameters, even aggressive quantization leaves you needing hundreds of gigabytes of VRAM. Unless you have access to a multi-node GPU cluster, V4-Pro belongs on a cloud provider. We cover that scenario later in this guide.

Hardware Requirements for V4-Flash

The full model weights for V4-Flash are large because all expert parameters must be loaded into memory, even though only a subset activates per token. Here is what you need depending on your quantization strategy:

| Configuration | VRAM Required | Example Setup | Notes |
|---|---|---|---|
| FP16 (full precision) | ~568 GB | 8x H100 80GB | Highest quality, lower throughput |
| FP8 | ~300 GB | 4x H100 80GB or 8x A100 80GB | Good balance of quality and speed |
| FP4+FP8 mixed precision | ~180 GB | 4x A100 80GB or 2x H100 80GB | Slight quality trade-off, fast |
| FP4 + CPU offloading | ~80 GB VRAM + 256 GB RAM | 2x RTX 4090 + high-RAM host | Slower, but accessible hardware |

Key considerations:

  • NVLink or NVSwitch between GPUs matters significantly for multi-GPU setups. PCIe connections will bottleneck throughput.
  • System RAM should be at least 128 GB for any multi-GPU configuration, and 256 GB+ if you plan to offload experts to CPU.
  • Fast NVMe storage (2 TB+) helps with model loading times. The full FP16 checkpoint is over 500 GB on disk.

Setting Up vLLM for V4-Flash

vLLM is one of the most mature inference engines for large MoE models. It supports tensor parallelism out of the box, which is essential for spreading V4-Flash across multiple GPUs.

Install vLLM with MoE support:

pip install vllm --upgrade

Launch the V4-Flash server with tensor parallelism across 4 GPUs:

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 65536 \
  --trust-remote-code \
  --port 8000

For FP4 mixed-precision quantization (reduces VRAM usage significantly):

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --quantization fp4_mixed \
  --max-model-len 32768 \
  --trust-remote-code \
  --port 8000

Test the endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    "max_tokens": 512
  }'

vLLM handles KV cache management, continuous batching, and expert routing automatically. For a deeper comparison of inference engines, see our vLLM vs Ollama vs llama.cpp breakdown.
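
Since the server speaks the OpenAI API, any OpenAI-compatible client works. A minimal stdlib-only sketch, assuming the vLLM server from the launch command above is listening on localhost:8000:

```python
# Minimal stdlib client for the local vLLM endpoint (same request as the
# curl test above). Assumes the server is running on localhost:8000.
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-style chat payload for the local endpoint."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Write a Python function to merge two sorted lists.")` to reproduce the curl test.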

Setting Up SGLang for V4-Flash

SGLang is another strong option, particularly if you need structured generation or advanced prompt control. It also supports MoE models with tensor parallelism.

Install SGLang:

pip install "sglang[all]" --upgrade

Launch the server:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 4 \
  --quantization fp8 \
  --context-length 65536 \
  --port 8100

SGLang exposes an OpenAI-compatible API by default:

curl http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Explain the difference between async and sync in Python."}],
    "max_tokens": 256
  }'

SGLang can outperform vLLM on certain workloads thanks to its RadixAttention caching and optimized scheduling. See our SGLang vs vLLM comparison for benchmarks.
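
For interactive use you will usually want streaming rather than the blocking request shown above. A sketch assuming the SGLang server on localhost:8100; the parsing helper follows the standard OpenAI SSE format (one `data: {...}` line per chunk, `data: [DONE]` at the end):

```python
# Streamed variant of the request above, consuming server-sent events.
import json
import urllib.request

def parse_sse_line(line: str) -> str:
    """Return the delta text carried by one SSE line, or '' if there is none."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    choices = chunk.get("choices") or []
    if not choices:
        return ""  # e.g. a trailing usage-only chunk
    return choices[0]["delta"].get("content", "")

def stream_chat(prompt: str,
                url: str = "http://localhost:8100/v1/chat/completions") -> None:
    payload = {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # one SSE line per generated chunk
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                print(text, end="", flush=True)
```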

KTransformers: CPU-GPU Heterogeneous Inference

If you do not have a rack of GPUs, KTransformers offers a way to run V4-Flash using a mix of GPU and CPU resources. It offloads MoE expert layers to system RAM while keeping attention layers and active experts on the GPU.

Install KTransformers:

pip install ktransformers --upgrade

Launch with expert offloading:

python -m ktransformers.server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --device-map auto \
  --offload-strategy moe-cpu \
  --gpu-memory-limit 48G \
  --port 8200

This configuration keeps the attention mechanism and routing network on GPU while placing the bulk of expert parameters in system RAM. You will need:

  • At least one GPU with 24 GB+ VRAM (RTX 3090/4090 or better)
  • 256 GB+ system RAM (DDR5 preferred for bandwidth)
  • Fast CPU with high memory bandwidth (AMD EPYC or Intel Xeon recommended)

The trade-off is speed. Expect 2 to 5x slower token generation compared to a full GPU setup, but it makes V4-Flash accessible on hardware that costs a fraction of a multi-H100 server.
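
To put a number on that slowdown for your own hardware, time a fixed request and divide by the completion tokens. This sketch points at the KTransformers server from the command above (port 8200) and assumes it exposes an OpenAI-compatible route with a `usage` block; adjust the URL to compare against vLLM (8000) or SGLang (8100):

```python
# Rough tokens/sec probe for comparing offloaded vs pure-GPU setups.
import json
import time
import urllib.request

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode throughput; 0.0 if the interval is degenerate."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def benchmark(url: str = "http://localhost:8200/v1/chat/completions") -> float:
    payload = {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": "Count from 1 to 50."}],
        "max_tokens": 256,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Run it once warm (the first request pays model-load and cache costs), then run the identical request against a full-GPU deployment to measure the actual ratio.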

Quantization: FP4+FP8 Mixed Precision

DeepSeek V4-Flash responds well to mixed-precision quantization. The recommended approach keeps attention layers and the routing network in FP8 while quantizing MoE expert weights to FP4. This works because:

  • Expert weights are sparse by nature (most are inactive per token), so lower precision has minimal impact on output quality.
  • Attention layers and the router are always active and benefit from higher precision.
  • The memory savings are substantial: roughly 40% reduction compared to uniform FP8.

Most inference engines now support this natively. In vLLM, use the --quantization fp4_mixed flag shown earlier. In SGLang:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 2 \
  --quantization fp4_mixed \
  --context-length 32768 \
  --port 8100

Pre-quantized checkpoints are available on Hugging Face under the deepseek-ai organization. Using these avoids the time-consuming on-the-fly quantization step during model loading.

Configuring Thinking Mode Locally

V4-Flash supports a β€œthinking” mode where the model generates internal chain-of-thought reasoning before producing its final answer. This is useful for complex coding and math problems.

To enable thinking mode, include a system prompt or use the dedicated API parameter:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [
      {"role": "system", "content": "Think step by step before answering."},
      {"role": "user", "content": "Find all prime numbers between 1000 and 1050."}
    ],
    "max_tokens": 2048,
    "extra_body": {"thinking": true}
  }'

When thinking mode is active, the model uses more tokens per response (the reasoning tokens are generated but can be hidden from the final output depending on your configuration). Budget for 2 to 4x the normal token count on reasoning-heavy prompts.

You can also set a thinking budget to cap the number of reasoning tokens:

{
  "extra_body": {
    "thinking": true,
    "thinking_budget": 4096
  }
}

This prevents runaway reasoning on simpler queries while still allowing deep thought when needed.
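
The same request can be built in Python. The `extra_body` keys mirror this guide's curl examples; treat them as engine-specific extensions rather than part of the standard chat-completions schema:

```python
# Thinking-mode request builder and a blocking call against the local server.
import json
import urllib.request

def build_thinking_request(prompt: str, budget: int = 4096) -> dict:
    """Chat payload with thinking mode on and a cap on reasoning tokens."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2048,
        "extra_body": {"thinking": True, "thinking_budget": budget},
    }

def ask(prompt: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_thinking_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```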

V4-Pro Deployment: Cluster-Scale Only

V4-Pro at 1.6T parameters is not a local model for most teams. Even in FP4 (half a byte per parameter), the weights alone come to roughly 800 GB, and the model’s architecture benefits from pipeline parallelism across many GPUs.

Realistic V4-Pro setups:

| Configuration | Hardware | Estimated Cost |
|---|---|---|
| FP8 | 16x H100 80GB (2 nodes) | $300K+ hardware |
| FP4 mixed | 8x H100 80GB (1 node) | $150K+ hardware |
| Cloud rental | Multi-GPU instances | $15-40/hr depending on provider |

For most use cases, renting GPU capacity from a cloud GPU provider is more practical than buying hardware for V4-Pro. Providers like Lambda, CoreWeave, and RunPod offer multi-H100 instances that can handle V4-Pro inference.

If you only need V4-Pro occasionally, the API is the most cost-effective path. Reserve local deployment for V4-Flash, which delivers excellent results for the vast majority of coding and reasoning tasks.
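
The buy-vs-rent breakeven is easy to estimate from the guide's own figures ($150K for an 8x H100 node, $15-40/hr rental). Power, cooling, and hosting are ignored here, so the real breakeven comes later than this:

```python
# Breakeven between buying hardware and renting cloud GPUs, ignoring
# operating costs (which push the breakeven further out).
def breakeven_hours(hardware_cost: float, hourly_rate: float) -> float:
    return hardware_cost / hourly_rate

print(breakeven_hours(150_000, 40))  # 3750.0 hours -- about 5 months of 24/7 use
print(breakeven_hours(150_000, 15))  # 10000.0 hours -- about 14 months of 24/7 use
```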

FAQ

What is the minimum hardware to run V4-Flash locally?

The absolute minimum is a single high-VRAM GPU (48 GB+) combined with 256 GB of system RAM, using KTransformers with CPU offloading and FP4 quantization. Performance will be limited, but it works for testing and low-throughput use cases. For production workloads, plan for at least 2x A100 80GB or equivalent with FP4 mixed precision.

Can I run V4-Flash with Ollama or llama.cpp?

As of April 2026, V4-Flash’s MoE architecture is best supported by vLLM and SGLang. llama.cpp has experimental MoE support but may not handle V4-Flash’s expert routing optimally. Ollama relies on llama.cpp under the hood, so the same limitations apply. Check our vLLM vs Ollama vs llama.cpp guide for the latest compatibility updates.

Is the quality loss from FP4 quantization noticeable?

For most coding and general tasks, FP4 mixed precision (experts in FP4, attention in FP8) produces output that is nearly indistinguishable from full precision. On highly nuanced reasoning benchmarks, there is a small measurable drop (1 to 3%), but in practice most users will not notice a difference. The memory savings make it the recommended default for local deployment.