How to Run AI Locally on Windows - Complete Setup Guide (2026)
Running AI locally on Windows works well in 2026, but the setup has more friction than macOS or Linux. Driver issues, antivirus interference, PATH problems, and the WSL-vs-native decision all trip people up. This guide covers the three main paths and helps you pick the right one.
Which path should you pick?
- Ollama native - Easiest. One installer, runs as a Windows service, no WSL needed. Best for most people. See our Ollama complete guide for deeper coverage.
- LM Studio - Best GUI experience. Download, install, click. Auto-detects your GPU. Full details in our LM Studio guide.
- WSL2 + Linux tools - Most flexible. Required if you need vLLM, text-generation-inference, or other Linux-only tooling. More setup, but gives you a full Linux environment.
If you just want to chat with a model or use it with a coding tool, start with Ollama. If you want a visual interface for comparing models, go with LM Studio. If you're building production inference pipelines, go with WSL2.
Check your hardware
Before installing anything, figure out what GPU you have and how much VRAM is available. Open PowerShell:
# Check GPU name and VRAM (note: AdapterRAM is a 32-bit field that caps at
# 4 GB, so it under-reports modern cards - trust nvidia-smi for the real number)
Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM
For NVIDIA GPUs, install nvidia-smi (comes with the driver) and run:
nvidia-smi
This shows your GPU model, driver version, CUDA version, and current VRAM usage. You need this info to pick the right model size - check our VRAM requirements guide and best GPU for local AI guide.
Quick reference: 8 GB VRAM runs 7B-8B models comfortably. 16 GB handles 13B-14B. 24 GB opens up 30B+ models with quantization.
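Those numbers follow from a rough rule of thumb: a Q4-quantized model takes around 0.6 GB of VRAM per billion parameters, plus overhead for context and runtime. A hedged sketch in Python (the constants are ballpark assumptions, not official figures):

```python
def fits_in_vram(param_billions: float, vram_gb: float,
                 gb_per_b_params: float = 0.6, overhead_gb: float = 1.5) -> bool:
    """Rough check: does a Q4-quantized model fit in a given VRAM budget?

    gb_per_b_params and overhead_gb are ballpark assumptions; real usage
    varies with context length, quantization format, and runtime.
    """
    return param_billions * gb_per_b_params + overhead_gb <= vram_gb

# Approximately matches the quick reference above:
print(fits_in_vram(8, 8))    # 8B model on 8 GB
print(fits_in_vram(14, 16))  # 14B model on 16 GB
print(fits_in_vram(70, 24))  # a 70B model does not fit in 24 GB
```

Treat the result as a starting point, not a guarantee - long context windows in particular can push a borderline model over the limit.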
Path 1: Ollama native (recommended for most)
Ollama has shipped a native Windows installer since late 2024. No WSL required.
Install
- Download the installer from ollama.com
- Run the .exe - it installs Ollama and registers it as a Windows service
- Open a new PowerShell window (important - PATH won't update in existing terminals)
Verify
ollama --version
If this returns "not recognized," close all terminals and open a fresh one. The installer adds Ollama to your PATH, but existing sessions don't pick it up.
Pull and run a model
ollama pull llama3.2
ollama run llama3.2
That's it. Ollama auto-detects your GPU and offloads layers to VRAM. To confirm the GPU is being used:
ollama ps
The PROCESSOR column shows gpu if acceleration is active. If it shows cpu, see our GPU not detected fix.
Ollama runs as a service
On Windows, Ollama runs in the background as a service. The API is available at http://localhost:11434 by default. You can manage it from the system tray icon.
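Since the API is plain HTTP, any language can talk to it. A minimal Python sketch using only the standard library (assumes the default port 11434 and that llama3.2 has already been pulled; /api/generate is Ollama's single-prompt endpoint):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    # stream=False returns one JSON object instead of a stream of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> str:
    """Send a prompt to the local Ollama service and return the reply text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the service running, `generate("Why is the sky blue?")` returns the model's answer as a single string.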
Path 2: LM Studio (best GUI experience)
LM Studio gives you a desktop app for downloading, running, and chatting with models.
Install
- Download from lmstudio.ai
- Run the installer
- Launch LM Studio - it auto-detects your GPU on first run
Usage
- Browse and download models from the built-in model catalog
- Select a model and click Load - LM Studio picks the best quantization for your VRAM
- Chat directly in the app, or enable the local API server for external tools
LM Studio handles VRAM management automatically. It shows you exactly how much VRAM each model needs before loading. No terminal required.
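The local API server speaks the OpenAI chat-completions format, so existing OpenAI client code can point at it. A Python sketch using only the standard library - port 1234 is LM Studio's usual default but is configurable (check the server tab), and the model name below is a placeholder, since LM Studio typically answers with whichever model is currently loaded:

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str) -> dict:
    """OpenAI-style chat-completions request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": user_message}]}

def chat(user_message: str, model: str = "local-model",
         host: str = "http://localhost:1234") -> str:
    """Send one chat message to LM Studio's local server, return the reply."""
    body = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(f"{host}/v1/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, official OpenAI SDKs also work if you set their base URL to the local server.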
Path 3: WSL2 (for Linux tools like vLLM)
If you need Linux-only tools - vLLM, text-generation-inference, SGLang - WSL2 is the way.
Enable WSL2
wsl --install
This installs WSL2 with Ubuntu by default. Restart when prompted.
GPU passthrough
GPU passthrough works automatically on WSL2 with recent NVIDIA drivers. Install the latest Game Ready or Studio driver on the Windows side - do NOT install CUDA drivers inside WSL. The Windows driver handles GPU access for both Windows and WSL.
Verify inside WSL:
nvidia-smi
If this works, your GPU is accessible from WSL.
Install Ollama in WSL
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama run llama3.2
Or install vLLM, llama.cpp, or any other Linux tool as you normally would.
NVIDIA CUDA setup
NVIDIA GPUs give the best local AI experience on Windows. Here's the setup:
1. Install the latest driver
Download from nvidia.com/drivers. Pick Game Ready Driver or Studio Driver - both work. The driver includes CUDA runtime support.
2. Install CUDA Toolkit (optional)
Only needed if you're compiling from source or using tools that require the full toolkit (like building llama.cpp yourself). Download from developer.nvidia.com/cuda-downloads.
After installing, verify:
nvcc --version
If nvcc isn't found, add the CUDA bin directory to your PATH:
# Typical path - adjust version number
$env:PATH += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"
To make it permanent, add it through System Properties → Environment Variables.
3. Verify everything
nvidia-smi
# Should show driver version and CUDA version
Most tools (Ollama, LM Studio) only need the driver - they bundle their own CUDA runtime.
AMD and Intel GPU notes
AMD GPUs
ROCm support on Windows is limited. Most local AI tools on Windows fall back to Vulkan for AMD GPUs, which works but is slower than CUDA. Ollama and LM Studio both support Vulkan acceleration for AMD cards.
For best AMD performance, consider the WSL2 path - ROCm has better Linux support. But check compatibility first: not all AMD GPUs are supported by ROCm.
Intel Arc GPUs
Intel Arc is supported via Ollama with oneAPI. Install the latest Intel GPU drivers and the oneAPI toolkit. Performance is decent for smaller models but lags behind NVIDIA.
# Check Intel GPU
Get-CimInstance Win32_VideoController | Where-Object { $_.Name -like "*Intel*" }
CPU-only fallback
No GPU? You can still run models β just slower. Ollama and LM Studio both fall back to CPU automatically. For best CPU performance:
- Use quantized models (Q4_K_M or lower)
- Stick to small models (3B-7B parameters)
- Make sure your CPU supports AVX2 (most CPUs from 2015+ do)
Check AVX2 support:
# PowerShell 7+: .NET exposes the CPU's AVX2 flag directly
[System.Runtime.Intrinsics.X86.Avx2]::IsSupported
# Windows PowerShell 5.1 can't query this directly - check your CPU model
# and look up its spec sheet instead:
Get-CimInstance Win32_Processor | Select-Object Name
For a deeper dive on running without a GPU, see our no-GPU guide.
CPU inference for a 7B model typically gives 2-5 tokens/second depending on your CPU. Usable for testing, not great for real work.
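To translate those rates into wait times, a quick back-of-the-envelope calculation (300 tokens is roughly a few paragraphs; the 40 tok/s GPU figure is only an illustrative comparison point, not a benchmark):

```python
def seconds_for_response(tokens: int, tokens_per_second: float) -> float:
    """How long a response of a given length takes at a given generation rate."""
    return tokens / tokens_per_second

for tps in (2, 5, 40):
    print(f"{tps:>2} tok/s -> {seconds_for_response(300, tps):.0f}s for a 300-token reply")
```

At 2 tok/s a 300-token answer takes two and a half minutes; at 5 tok/s, one minute - hence "usable for testing, not great for real work."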
Windows-specific troubleshooting
| Problem | Fix |
|---|---|
| ollama not recognized after install | Close all terminals, open a new PowerShell window. The installer updates PATH but existing sessions don't see it. |
| Ollama extremely slow first run | Windows Defender is scanning the model files (multi-GB). Add the model directory to exclusions (see performance tips below). |
| Antivirus blocks Ollama | Some antivirus software flags Ollama's network listener. Add ollama.exe and the Ollama install directory to your antivirus exclusions. |
| GPU not detected | Update your GPU driver to the latest version. For NVIDIA, run nvidia-smi to verify the driver is working. See GPU not detected fix. |
| CUDA out of memory | Youβre loading a model too large for your VRAM. Use a smaller model or a more aggressive quantization (Q4 instead of Q8). |
| WSL2 canβt see GPU | Install the latest Windows GPU driver (not a Linux driver inside WSL). Run nvidia-smi inside WSL to verify. |
| Model download stalls | Check your firewall/proxy settings. Ollama downloads from CDN endpoints that corporate firewalls sometimes block. |
| Port 11434 already in use | Another Ollama instance is running. Check the system tray or run tasklist /FI "IMAGENAME eq ollama.exe" to find and kill it. |
| LM Studio won't load model | Not enough VRAM. LM Studio shows required VRAM before loading - pick a smaller quantization or a smaller model. |
Performance tips
Exclude model directories from Windows Defender
This is the single biggest performance win on Windows. Real-time scanning on multi-gigabyte model files causes massive slowdowns during loading and inference.
# Run as Administrator
Add-MpPreference -ExclusionPath "$env:USERPROFILE\.ollama"
Add-MpPreference -ExclusionPath "$env:USERPROFILE\.cache\lm-studio"
Close GPU-hungry apps
Games, browsers with hardware acceleration, and video editors all compete for VRAM. Close them before loading large models. Check current VRAM usage:
nvidia-smi
Use the right quantization
For limited VRAM, use Q4_K_M quantization. It's the best balance of quality and size. Q8 is higher quality but uses roughly twice the VRAM.
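The size difference scales with bits per weight: file size in GB is roughly parameters (in billions) times bits per weight divided by 8. A hedged sketch - the bits-per-weight values are approximate averages for GGUF formats, not exact figures:

```python
# Approximate average bits per weight (assumed values; real GGUF formats
# mix block scales, so actual file sizes differ somewhat)
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def approx_size_gb(param_billions: float, quant: str) -> float:
    """Rough model file size: params (billions) * bits per weight / 8."""
    return param_billions * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B at {quant}: ~{approx_size_gb(7, quant):.1f} GB")
```

The same arithmetic explains the "roughly twice the VRAM" claim: going from ~4.8 to ~8.5 bits per weight nearly doubles the footprint before overhead.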
Keep drivers updated
NVIDIA regularly improves AI inference performance in driver updates. Check for updates monthly.
Set Ollama environment variables
You can configure Ollama's behavior through environment variables in PowerShell:
# Change model storage location (useful if C: drive is small)
$env:OLLAMA_MODELS = "D:\ollama-models"
# Set the default context window (supported in recent Ollama versions)
$env:OLLAMA_CONTEXT_LENGTH = "8192"
To make these permanent, set them in System Properties → Environment Variables.
FAQ
Can I run AI on Windows without a GPU?
Yes, both Ollama and LM Studio fall back to CPU automatically. You'll want to use small quantized models (7B or under, Q4_K_M) and expect 2-5 tokens/second - usable for testing but slow for real work.
Does Ollama work on Windows?
Yes - Ollama has shipped a native Windows installer since late 2024. It runs as a Windows service, auto-detects your GPU, and exposes the same API on port 11434 - no WSL required.
Do I need WSL?
Not for Ollama or LM Studio - both run natively on Windows. You only need WSL2 if you want Linux-only tools like vLLM, text-generation-inference, or SGLang, or if you prefer a full Linux development environment.
Which GPU is best for AI on Windows?
NVIDIA GPUs give the best experience due to mature CUDA support. An RTX 3060 12GB is the budget entry point; an RTX 3090 or 4090 with 24GB VRAM is ideal for running larger models up to 32B parameters.
Related guides
- Ollama Complete Guide - Full Ollama reference
- LM Studio Complete Guide - Everything about LM Studio
- How to Run AI Without a GPU - CPU-only setups
- Best GPU for Local AI in 2026 - GPU buying guide
- How Much VRAM Do You Need? - VRAM requirements by model
- Ollama GPU Not Detected Fix - Troubleshooting GPU issues