How to Run DeepSeek Locally — V3 and R1 Setup Guide

DeepSeek V3 is a 671B parameter MoE model that matches GPT-4o on most benchmarks. DeepSeek R1 is a reasoning model comparable to OpenAI’s o1. Both are open-source and free to self-host. Here’s how to run them.

Update (April 24, 2026): DeepSeek V4-Flash (284B/13B active) is now the latest DeepSeek model. Open weights available on HuggingFace, MIT licensed. See how to run V4 locally for the updated setup guide.

Which DeepSeek model to run

Model	Total params	Active params	VRAM needed (Q4)	Best for
DeepSeek Coder V2 Lite	236B	14B	~10-12GB	Coding on consumer hardware
DeepSeek V3 (Q4)	671B	37B	~80-100GB	Full-quality general purpose
DeepSeek R1 (distilled 7B)	7B	7B	~6GB	Reasoning on any hardware
DeepSeek R1 (distilled 32B)	32B	32B	~20-24GB	Strong reasoning
DeepSeek R1 (full)	671B	37B	~80-100GB	Best reasoning, needs serious hardware

For most developers: start with DeepSeek Coder V2 Lite (fits on a consumer GPU) or the R1 distilled 7B (runs on a laptop).

Run with Ollama

# DeepSeek Coder — runs on 12GB GPU
ollama run deepseek-coder-v2:16b

# DeepSeek R1 distilled — runs on any laptop
ollama run deepseek-r1:7b

# DeepSeek R1 32B — needs 24GB GPU
ollama run deepseek-r1:32b

# Full DeepSeek V3 — needs 80GB+ VRAM
ollama run deepseek-v3

Run with llama.cpp

# Download quantized DeepSeek Coder
huggingface-cli download deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start server
llama-server \
  --model ./models/DeepSeek-Coder-V2-Lite-Q4_K_M.gguf \
  --ctx-size 16384 \
  --threads 8 \
  --port 8080

Hardware reality check

The full DeepSeek V3 (671B) is not practical on consumer hardware. Even quantized to Q4, it needs ~80-100GB of VRAM. That means:

Mac Studio M3/M4 Ultra with 192GB unified memory — works
Multi-GPU setup (2-4x A100 80GB) — works
Single consumer GPU — doesn’t work

If you want to run the full V3 or R1 without buying multi-GPU hardware, cloud GPU providers offer multi-A100 instances on demand at a fraction of the purchase cost.

The distilled models are the practical choice:

R1 7B: any laptop with 8GB RAM
Coder V2 Lite: any GPU with 12GB VRAM
R1 32B: RTX 4090 or 32GB Mac

DeepSeek R1: reasoning on a laptop

The R1 distilled models are the hidden gem. The 7B version runs on basically anything and gives you o1-style chain-of-thought reasoning for free:

ollama run deepseek-r1:7b

It’s not as strong as the full R1, but for math problems, logic puzzles, and step-by-step reasoning, it’s remarkably capable for a model that fits in 6GB of RAM.

Connect to your IDE

# Start DeepSeek Coder locally
ollama run deepseek-coder-v2:16b

# In VS Code with Continue:
# Provider: Ollama
# Model: deepseek-coder-v2:16b

DeepSeek Coder V2 Lite is a strong free alternative to GitHub Copilot for code completion. It supports 338 programming languages and runs entirely on your machine.

DeepSeek vs Qwen for self-hosting

	DeepSeek	Qwen 3.5
Best small model	R1 7B (reasoning)	9B (general, beats 120B models)
Best coding model	Coder V2 Lite 14B	Qwen 2.5 Coder 32B
Full model VRAM	~80-100GB	~214GB (397B)
License	MIT	Apache 2.0
Multimodal	No	Yes

For coding: DeepSeek Coder V2 Lite if you have 12GB VRAM, Qwen 2.5 Coder 32B if you have 24GB. For reasoning: DeepSeek R1 distilled models are unique — no other open-source model offers dedicated chain-of-thought reasoning at this size. For general use: Qwen 3.5-9B is the better all-rounder.

How to Run DeepSeek Locally — V3 and R1 Setup Guide

Which DeepSeek model to run

Run with Ollama

Run with llama.cpp

Hardware reality check

DeepSeek R1: reasoning on a laptop

Connect to your IDE

DeepSeek vs Qwen for self-hosting

Related

📬 AI Dev Weekly

You might also like

How to Run GLM-5.1 Locally — Hardware, Setup, and Quantization Guide (2026)

How to Replace GitHub Copilot for Free — Step-by-Step Guide (2026)

How to Run AI Without a GPU — CPU-Only Inference Guide (2026)

How to Run Llama 4 Locally — Scout and Maverick Setup Guide