DeepSeek V3 is a 671B parameter MoE model that matches GPT-4o on most benchmarks. DeepSeek R1 is a reasoning model comparable to OpenAIβs o1. Both are open-source and free to self-host. Hereβs how to run them.
Update (April 24, 2026): DeepSeek V4-Flash (284B/13B active) is now the latest DeepSeek model. Open weights available on HuggingFace, MIT licensed. See how to run V4 locally for the updated setup guide.
Which DeepSeek model to run
| Model | Total params | Active params | VRAM needed (Q4) | Best for |
|---|---|---|---|---|
| DeepSeek Coder V2 Lite | 236B | 14B | ~10-12GB | Coding on consumer hardware |
| DeepSeek V3 (Q4) | 671B | 37B | ~80-100GB | Full-quality general purpose |
| DeepSeek R1 (distilled 7B) | 7B | 7B | ~6GB | Reasoning on any hardware |
| DeepSeek R1 (distilled 32B) | 32B | 32B | ~20-24GB | Strong reasoning |
| DeepSeek R1 (full) | 671B | 37B | ~80-100GB | Best reasoning, needs serious hardware |
For most developers: start with DeepSeek Coder V2 Lite (fits on a consumer GPU) or the R1 distilled 7B (runs on a laptop).
Run with Ollama
# DeepSeek Coder β runs on 12GB GPU
ollama run deepseek-coder-v2:16b
# DeepSeek R1 distilled β runs on any laptop
ollama run deepseek-r1:7b
# DeepSeek R1 32B β needs 24GB GPU
ollama run deepseek-r1:32b
# Full DeepSeek V3 β needs 80GB+ VRAM
ollama run deepseek-v3
Run with llama.cpp
# Download quantized DeepSeek Coder
huggingface-cli download deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct-GGUF \
--include "*Q4_K_M*" \
--local-dir ./models
# Start server
llama-server \
--model ./models/DeepSeek-Coder-V2-Lite-Q4_K_M.gguf \
--ctx-size 16384 \
--threads 8 \
--port 8080
Hardware reality check
The full DeepSeek V3 (671B) is not practical on consumer hardware. Even quantized to Q4, it needs ~80-100GB of VRAM. That means:
- Mac Studio M3/M4 Ultra with 192GB unified memory β works
- Multi-GPU setup (2-4x A100 80GB) β works
- Single consumer GPU β doesnβt work
If you want to run the full V3 or R1 without buying multi-GPU hardware, cloud GPU providers offer multi-A100 instances on demand at a fraction of the purchase cost.
The distilled models are the practical choice:
- R1 7B: any laptop with 8GB RAM
- Coder V2 Lite: any GPU with 12GB VRAM
- R1 32B: RTX 4090 or 32GB Mac
DeepSeek R1: reasoning on a laptop
The R1 distilled models are the hidden gem. The 7B version runs on basically anything and gives you o1-style chain-of-thought reasoning for free:
ollama run deepseek-r1:7b
Itβs not as strong as the full R1, but for math problems, logic puzzles, and step-by-step reasoning, itβs remarkably capable for a model that fits in 6GB of RAM.
Connect to your IDE
# Start DeepSeek Coder locally
ollama run deepseek-coder-v2:16b
# In VS Code with Continue:
# Provider: Ollama
# Model: deepseek-coder-v2:16b
DeepSeek Coder V2 Lite is a strong free alternative to GitHub Copilot for code completion. It supports 338 programming languages and runs entirely on your machine.
DeepSeek vs Qwen for self-hosting
| DeepSeek | Qwen 3.5 | |
|---|---|---|
| Best small model | R1 7B (reasoning) | 9B (general, beats 120B models) |
| Best coding model | Coder V2 Lite 14B | Qwen 2.5 Coder 32B |
| Full model VRAM | ~80-100GB | ~214GB (397B) |
| License | MIT | Apache 2.0 |
| Multimodal | No | Yes |
For coding: DeepSeek Coder V2 Lite if you have 12GB VRAM, Qwen 2.5 Coder 32B if you have 24GB. For reasoning: DeepSeek R1 distilled models are unique β no other open-source model offers dedicated chain-of-thought reasoning at this size. For general use: Qwen 3.5-9B is the better all-rounder.
Related
- Best Self-Hosted AI Models in 2026
- Qwen 2.5 Coder vs DeepSeek Coder β Open-Source Coding Models Compared
- How to Run Qwen 3.5 Locally
- Ollama vs llama.cpp vs vLLM β Which Should You Use?
Related: Self-Hosted AI for Enterprise Β· Run DeepSeek Locally