πŸ“ Tutorials
Β· 7 min read
Last updated on

How to Run AI Locally on Windows β€” Complete Setup Guide (2026)


Running AI locally on Windows works well in 2026, but the setup has more friction than macOS or Linux. Driver issues, antivirus interference, PATH problems, and the WSL-vs-native decision all trip people up. This guide covers the three main paths and helps you pick the right one.

Which path should you pick?

  • Ollama native: the easiest path. One installer, runs as a Windows service, no WSL needed. Best for most people. See our Ollama complete guide for deeper coverage.
  • LM Studio: the best GUI experience. Download, install, click. Auto-detects your GPU. Full details in our LM Studio guide.
  • WSL2 + Linux tools: the most flexible. Required if you need vLLM, text-generation-inference, or other Linux-only tooling. More setup, but gives you a full Linux environment.

If you just want to chat with a model or use it with a coding tool, start with Ollama. If you want a visual interface for comparing models, go with LM Studio. If you’re building production inference pipelines, go with WSL2.

Check your hardware

Before installing anything, figure out what GPU you have and how much VRAM is available. Open PowerShell:

# Check GPU name and VRAM
# Note: AdapterRAM is a 32-bit field and caps out around 4 GB, so it under-reports
# modern cards; use nvidia-smi for accurate numbers on NVIDIA GPUs
Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM

For NVIDIA GPUs, nvidia-smi ships with the driver, so just run:

nvidia-smi

This shows your GPU model, driver version, CUDA version, and current VRAM usage. You need this info to pick the right model size; check our VRAM requirements guide and best GPU for local AI guide.

Quick reference: 8 GB VRAM runs 7B–8B models comfortably. 16 GB handles 13B–14B. 24 GB opens up 30B+ models with quantization.
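If you want to sanity-check those numbers yourself, a rough rule of thumb (an approximation, not an exact formula) is about 0.6 GB per billion parameters at Q4_K_M quantization, plus 1–2 GB for context and runtime overhead:

# Rough VRAM estimate for a Q4_K_M model: ~0.6 GB per billion parameters + ~1.5 GB overhead
# Example for an 8B model (PowerShell evaluates the expression directly):
8 * 0.6 + 1.5    # ~6.3 GB, which fits in 8 GB of VRAM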

Path 1: Ollama native (easiest, best for most people)

Ollama has had a native Windows installer since late 2024. No WSL required.

Install

  1. Download the installer from ollama.com
  2. Run the .exe; it installs Ollama and registers it as a Windows service
  3. Open a new PowerShell window (important: PATH won’t update in existing terminals)

Verify

ollama --version

If this returns “not recognized,” close all terminals and open a fresh one. The installer adds Ollama to your PATH, but existing sessions don’t pick it up.
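If you would rather not close your terminal, you can also reload PATH in the current session by re-reading the machine and user values (this is standard PowerShell, nothing Ollama-specific):

# Reload PATH in the current session
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")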

Pull and run a model

ollama pull llama3.2
ollama run llama3.2

That’s it. Ollama auto-detects your GPU and offloads layers to VRAM. To confirm GPU is being used:

ollama ps

The PROCESSOR column shows something like 100% GPU when acceleration is active. If it shows CPU, see our GPU not detected fix.

Ollama runs as a service

On Windows, Ollama runs in the background as a service. The API is available at http://localhost:11434 by default. You can manage it from the system tray icon.
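As a quick sanity check that the service is up, you can hit the API from PowerShell. This assumes you have already pulled llama3.2; /api/generate is part of Ollama’s standard REST API:

# Ask the local API for a one-off completion
$body = @{ model = "llama3.2"; prompt = "Why is the sky blue?"; stream = $false } | ConvertTo-Json
(Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -Body $body -ContentType "application/json").response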

Path 2: LM Studio (best GUI experience)

LM Studio gives you a desktop app for downloading, running, and chatting with models.

Install

  1. Download from lmstudio.ai
  2. Run the installer
  3. Launch LM Studio β€” it auto-detects your GPU on first run

Usage

  • Browse and download models from the built-in model catalog
  • Select a model and click Load; LM Studio picks the best quantization for your VRAM
  • Chat directly in the app, or enable the local API server for external tools (see the example below)

LM Studio handles VRAM management automatically. It shows you exactly how much VRAM each model needs before loading. No terminal required.
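If you enable the local API server, LM Studio exposes an OpenAI-compatible endpoint, typically on port 1234. The port, endpoint path, and model identifier below are assumptions; check the server tab in your version for the exact values. A minimal PowerShell call might look like this:

# Assumes the LM Studio server is running on its default port with a model loaded
$body = @{
    model    = "your-model-id"   # placeholder; use the identifier LM Studio shows for the loaded model
    messages = @(@{ role = "user"; content = "Hello!" })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:1234/v1/chat/completions" -Method Post -Body $body -ContentType "application/json"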

Path 3: WSL2 (for Linux tools like vLLM)

If you need Linux-only tools (vLLM, text-generation-inference, SGLang), WSL2 is the way.

Enable WSL2

wsl --install

This installs WSL2 with Ubuntu by default. Restart when prompted.
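To confirm your distro is actually running under WSL2 (WSL1 has no GPU passthrough), check the VERSION column; the distro name may differ on your machine:

wsl --list --verbose
# If your distro shows VERSION 1, convert it:
wsl --set-version Ubuntu 2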

GPU passthrough

GPU passthrough works automatically on WSL2 with recent NVIDIA drivers. Install the latest Game Ready or Studio driver on the Windows side, and do NOT install a separate Linux GPU driver inside WSL. The Windows driver handles GPU access for both Windows and WSL.

Verify inside WSL:

nvidia-smi

If this works, your GPU is accessible from WSL.

Install Ollama in WSL

curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama run llama3.2

Or install vLLM, llama.cpp, or any other Linux tool as you normally would.
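As one example of the Linux-only tooling this path unlocks, a minimal vLLM setup inside WSL could look like the following. The model name is only an illustration and vLLM’s CLI flags change between releases, so treat this as a sketch and check the vLLM docs:

# Inside your WSL Ubuntu shell
pip install vllm
# Serve an OpenAI-compatible API on port 8000 (model name is an example)
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192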

NVIDIA CUDA setup

NVIDIA GPUs give the best local AI experience on Windows. Here’s the setup:

1. Install the latest driver

Download from nvidia.com/drivers. Pick the Game Ready or Studio driver; both work. The driver includes CUDA runtime support.

2. Install CUDA Toolkit (optional)

Only needed if you’re compiling from source or using tools that require the full toolkit (like building llama.cpp yourself). Download from developer.nvidia.com/cuda-downloads.

After installing, verify:

nvcc --version

If nvcc isn’t found, add the CUDA bin directory to your PATH:

# Typical path; adjust the version number to match your install
$env:PATH += ";C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"

To make it permanent, add it through System Properties → Environment Variables.
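If you prefer to stay in the terminal instead of the System Properties dialog, a sketch like this persists the change for your user account using .NET’s environment APIs (adjust the version number; new terminals pick it up):

# Append the CUDA bin directory to your user PATH permanently
$cudaBin = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin"
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;$cudaBin", "User")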

3. Verify everything

nvidia-smi
# Should show driver version and CUDA version

Most tools (Ollama, LM Studio) only need the driver; they bundle their own CUDA runtime.

AMD and Intel GPU notes

AMD GPUs

ROCm support on Windows is limited. Most local AI tools on Windows fall back to Vulkan for AMD GPUs, which works but is slower than CUDA. Ollama and LM Studio both support Vulkan acceleration for AMD cards.

For best AMD performance, consider the WSL2 path, since ROCm has better Linux support. But check compatibility first: not all AMD GPUs are supported by ROCm.

Intel Arc GPUs

Intel Arc is supported through oneAPI-based builds of Ollama and llama.cpp (for example, Intel’s IPEX-LLM project). Install the latest Intel GPU drivers and the oneAPI toolkit. Performance is decent for smaller models but lags behind NVIDIA.

# Check Intel GPU
Get-CimInstance Win32_VideoController | Where-Object { $_.Name -like "*Intel*" }

CPU-only fallback

No GPU? You can still run models, just slower. Ollama and LM Studio both fall back to CPU automatically. For best CPU performance:

  • Use quantized models (Q4_K_M or lower)
  • Stick to small models (3B–7B parameters)
  • Make sure your CPU supports AVX2 (most CPUs from 2015+ do)

Check AVX2 support:

# PowerShell 7+ only: returns True if the CPU supports AVX2
[System.Runtime.Intrinsics.X86.Avx2]::IsSupported

On Windows PowerShell 5.1 this type is not available; in that case, check your CPU model’s spec sheet instead.

For a deeper dive on running without a GPU, see our no-GPU guide.

CPU inference for a 7B model typically gives 2–5 tokens/second depending on your CPU. Usable for testing, not great for real work.
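As a concrete starting point on CPU, a small quantized model makes the difference between usable and painful. Tags on the Ollama registry change over time, so treat these names as examples rather than guarantees:

# Small model that runs tolerably on CPU
ollama pull llama3.2:3b
ollama run llama3.2:3b "Summarize local AI setup on Windows in two sentences."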

Windows-specific troubleshooting

  • ollama not recognized after install: Close all terminals and open a new PowerShell window. The installer updates PATH, but existing sessions don’t see it.
  • Ollama extremely slow on first run: Windows Defender is scanning the multi-GB model files. Add the model directory to exclusions (see performance tips below).
  • Antivirus blocks Ollama: Some antivirus software flags Ollama’s network listener. Add ollama.exe and the Ollama install directory to your antivirus exclusions.
  • GPU not detected: Update your GPU driver to the latest version. For NVIDIA, run nvidia-smi to verify the driver is working. See our GPU not detected fix.
  • CUDA out of memory: You’re loading a model too large for your VRAM. Use a smaller model or a more aggressive quantization (Q4 instead of Q8).
  • WSL2 can’t see the GPU: Install the latest Windows GPU driver (not a Linux driver inside WSL). Run nvidia-smi inside WSL to verify.
  • Model download stalls: Check your firewall/proxy settings. Ollama downloads from CDN endpoints that corporate firewalls sometimes block.
  • Port 11434 already in use: Another Ollama instance is running. Check the system tray or run tasklist /FI "IMAGENAME eq ollama.exe" to find and kill it.
  • LM Studio won’t load a model: Not enough VRAM. LM Studio shows the required VRAM before loading; pick a smaller quantization or a smaller model.

Performance tips

Exclude model directories from Windows Defender

This is the single biggest performance win on Windows. Real-time scanning on multi-gigabyte model files causes massive slowdowns during loading and inference.

# Run as Administrator
Add-MpPreference -ExclusionPath "$env:USERPROFILE\.ollama"
Add-MpPreference -ExclusionPath "$env:USERPROFILE\.cache\lm-studio"
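To confirm the exclusions were registered, you can list Defender’s current exclusion paths:

# List current Defender exclusions
Get-MpPreference | Select-Object -ExpandProperty ExclusionPath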

Close GPU-hungry apps

Games, browsers with hardware acceleration, and video editors all compete for VRAM. Close them before loading large models. Check current VRAM usage:

nvidia-smi

Use the right quantization

For limited VRAM, use Q4_K_M quantization. It’s the best balance of quality and size. Q8 is higher quality but uses roughly twice the VRAM.
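On the Ollama registry, specific quantizations are exposed as tags. The exact tag names vary by model, so check the model’s Tags page; something like this is typical:

# Pull an explicit Q4_K_M build instead of the default tag (tag name is an example)
ollama pull llama3.1:8b-instruct-q4_K_M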

Keep drivers updated

NVIDIA regularly improves AI inference performance in driver updates. Check for updates monthly.

Set Ollama environment variables

You can configure Ollama’s behavior through environment variables. Note that $env: assignments only apply to the current PowerShell session, so they only affect an ollama serve you start from that same window:

# Change model storage location (useful if the C: drive is small)
$env:OLLAMA_MODELS = "D:\ollama-models"

# Set the default context window (OLLAMA_CONTEXT_LENGTH in recent Ollama versions)
$env:OLLAMA_CONTEXT_LENGTH = "8192"

To make these permanent and visible to the background service, set them in System Properties → Environment Variables and restart Ollama.
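Because Ollama runs as a background service, the most reliable approach is to persist the variable at the user level from PowerShell and then restart Ollama from the tray icon (the path here is just an example):

# Persist the model directory for your user account, then quit and restart Ollama
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")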

FAQ

Can I run AI on Windows without a GPU?

Yes, both Ollama and LM Studio fall back to CPU automatically. You’ll want to use small quantized models (7B or under, Q4_K_M) and expect 2–5 tokens/second: usable for testing but slow for real work.

Does Ollama work on Windows?

Yes, Ollama has had a native Windows installer since late 2024. It runs as a Windows service, auto-detects your GPU, and exposes the same API on port 11434; no WSL is required.

Do I need WSL?

Not for Ollama or LM Studio; both run natively on Windows. You only need WSL2 if you want Linux-only tools like vLLM, text-generation-inference, or SGLang, or if you prefer a full Linux development environment.

Which GPU is best for AI on Windows?

NVIDIA GPUs give the best experience due to mature CUDA support. An RTX 3060 12GB is the budget entry point; an RTX 3090 or 4090 with 24GB of VRAM is ideal for running larger models up to around 32B parameters.