LM Studio Complete Guide — Run Local LLMs With a GUI (2026)
LM Studio is a free desktop application that lets you download, run, and chat with open-source large language models entirely on your own hardware. No cloud. No subscription. No data leaving your machine.
Under the hood, it’s a polished GUI sitting on top of llama.cpp, the C++ inference engine that made local LLMs practical. What LM Studio adds is convenience: a model browser connected to HuggingFace, a built-in chat interface, an OpenAI-compatible API server, and automatic GPU detection — all without touching a terminal.
It’s become one of the most popular ways to run local models because it removes nearly all the friction. You install it, search for a model, click download, and start chatting. If you’ve been curious about running AI locally but didn’t want to wrestle with Python environments or Docker containers, this is where to start.
For a comparison with the CLI-first alternative, see our Ollama vs LM Studio vs vLLM breakdown.
System Requirements
| Component | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8 GB | 16 GB+ |
| GPU VRAM | Not required (CPU-only works) | 8 GB+ |
| Disk | 10 GB free | 50 GB+ (models are large) |
| OS | Windows 10+, macOS 14+ (Apple Silicon only), Linux (x86_64) | |
| GPU Support | NVIDIA (CUDA), Apple Silicon (Metal/MLX), AMD (Vulkan/ROCm) | |
Not sure if your GPU is enough? Check our guide on how much VRAM you actually need for AI.
Installation
macOS (Apple Silicon): Download the .dmg from lmstudio.ai, drag to Applications, done. Requires macOS 14 Sonoma or later. Metal acceleration is enabled automatically.
Windows: Download the installer from the same site. Run it. CUDA support is detected automatically if you have an NVIDIA GPU with up-to-date drivers.
Linux: Download the .AppImage, make it executable (chmod +x), and run. For Vulkan or ROCm GPU support, make sure your drivers are installed first.
That’s it — no Python, no dependencies, no Docker.
Downloading Your First Model
Open LM Studio and go to the Discover tab (the magnifying glass icon). This is a built-in browser for HuggingFace models, filtered to show compatible GGUF files.
What is GGUF?
GGUF is the model file format used by llama.cpp. Quantization compresses a model’s weights from their native 16-bit floating-point precision down to smaller representations (8-bit, 4-bit, etc.), dramatically reducing file size and memory usage at a modest cost in quality.
Quantization Levels — Quick Reference
| Quant | Size vs FP16 | Quality | Use Case |
| --- | --- | --- | --- |
| Q2_K | ~25% | Low | Experimentation only |
| Q4_K_M | ~35% | Good | Best balance — start here |
| Q5_K_M | ~45% | Very good | When you have VRAM to spare |
| Q8_0 | ~55% | Near-original | High-end GPUs (24 GB+) |
For most users, Q4_K_M is the sweet spot. It keeps quality high while fitting comfortably in 8–16 GB of VRAM for 7B–14B parameter models.
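If you want to sanity-check whether a model will fit your hardware before downloading, the arithmetic is simple: parameter count times bits per weight. Here’s a minimal sketch; the bits-per-weight figures are rough averages for llama.cpp’s quant formats, and real GGUF files carry some extra overhead for metadata and mixed-precision layers.

```python
# Back-of-the-envelope GGUF size estimate: params x bits-per-weight / 8.
# Bits-per-weight values are approximate averages for each quant format.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk (and loaded) size of a quantized model in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B @ {quant}: ~{estimate_size_gb(7, quant):.1f} GB")
# 7B @ Q4_K_M: ~4.2 GB -- leaves headroom for context on an 8 GB GPU
```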
How to Pick a Model
- In the Discover tab, search for a model name (e.g., “Qwen 3.6” or “Llama 4”).
- Look at the available quantizations and file sizes.
- Pick Q4_K_M unless you have a reason not to.
- Click Download.
The model lands in ~/.cache/lm-studio/models/ and is ready to load.
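If you later want a quick inventory of what you’ve downloaded, the directory is easy to scan. A small sketch, assuming the default path above (newer LM Studio versions may store models elsewhere; the My Models tab shows the actual location):

```python
# List downloaded GGUF files and their sizes under the default models dir.
# The path is an assumption based on the default cited above.
from pathlib import Path

models_dir = Path.home() / ".cache" / "lm-studio" / "models"
for gguf in sorted(models_dir.rglob("*.gguf")):
    print(f"{gguf.stat().st_size / 1e9:6.1f} GB  {gguf.relative_to(models_dir)}")
```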
Chat Interface Basics
Switch to the Chat tab (the message bubble icon). Select your downloaded model from the dropdown at the top. LM Studio loads it into memory — you’ll see a progress bar and memory usage stats.
Once loaded, you can:
- Chat with the model in a familiar message interface
- Adjust the system prompt to shape behavior
- Tweak temperature, top-p, max tokens, and other generation parameters in the right sidebar
- Toggle Developer Mode (in settings) for advanced options like context length, RoPE scaling, and GPU layer offloading controls
Responses are generated locally. Speed depends on your hardware — expect 10–40 tokens/second on a decent GPU, 2–8 tokens/second on CPU only.
Running a Local API Server
This is one of LM Studio’s killer features. Go to the Developer tab (or the <-> icon), load a model, and click Start Server. You now have an OpenAI-compatible API running at:
http://localhost:1234/v1
Any tool that supports the OpenAI API can be pointed at this endpoint instead of OpenAI’s servers. Here’s a quick Python example:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local LM Studio server.
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any string works; it's local
)

response = client.chat.completions.create(
    model="loaded-model-name",  # LM Studio serves whichever model is loaded
    messages=[
        {"role": "user", "content": "Explain quicksort in plain English."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
This makes LM Studio a drop-in local backend for tools like Continue.dev (VS Code AI assistant), Open WebUI, custom scripts, and anything else that speaks the OpenAI protocol.
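The same endpoint also lets you measure the tokens-per-second figures mentioned earlier on your own hardware. A rough sketch: stream a response and count chunks, since each streamed chunk typically carries one token (an approximation, not an exact token count).

```python
# Rough throughput check against the local server: stream and time a reply.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start, tokens = time.time(), 0
stream = client.chat.completions.create(
    model="loaded-model-name",  # LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Write a paragraph about sorting."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # ~1 token per streamed chunk (approximate)
print(f"~{tokens / (time.time() - start):.1f} tokens/sec")
```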
For coding-specific model recommendations, see best AI models for coding locally.
GPU Acceleration Setup
LM Studio detects your GPU automatically in most cases, but here’s what to know:
Apple Silicon (Metal/MLX): Works out of the box. LM Studio uses Metal for GPU inference and also supports MLX, Apple’s optimized ML framework. Unified memory means most of your system RAM can serve as “VRAM.” A MacBook with 16 GB can comfortably run 7B–14B models.
NVIDIA (CUDA): Make sure you have recent NVIDIA drivers installed (535+ recommended). LM Studio bundles its own CUDA runtime, so you don’t need to install the CUDA toolkit separately. Use the GPU offloading slider in the model settings to control how many layers run on GPU vs CPU.
AMD (Vulkan/ROCm): Vulkan support works on most modern AMD GPUs. ROCm support is available on Linux for supported AMD cards (RX 7000 series and some 6000 series). Performance is improving but still behind CUDA.
CPU-only: Totally fine for smaller models (7B Q4). Slower, but it works. LM Studio uses AVX2/AVX-512 instructions when available.
Best Models to Try in 2026
| Model | Parameters | Q4_K_M Size | Good For |
| --- | --- | --- | --- |
| Llama 4 Scout | 17B active (109B MoE) | ~60 GB | General purpose, multilingual |
| Qwen 3.6 35B-A3B | 3B active (35B MoE) | ~20 GB | Coding, reasoning, efficient MoE |
| Mistral Small 3.2 | 24B | ~14 GB | Instruction following, chat |
| Gemma 4 12B | 12B | ~7 GB | Compact all-rounder |
| DeepSeek-R1 8B | 8B | ~5 GB | Reasoning, math, chain-of-thought |
| Phi-4 Mini | 3.8B | ~2.5 GB | Lightweight, fast, good for testing |
For more options that fit in limited hardware, see best AI models under 16 GB VRAM.
LM Studio vs Ollama
Both run local LLMs using llama.cpp. The choice comes down to workflow preference:
| Feature | LM Studio | Ollama |
| --- | --- | --- |
| Interface | GUI (desktop app) | CLI / background service |
| Model source | HuggingFace (GGUF) | Ollama library + custom Modelfiles |
| API server | OpenAI-compatible (localhost:1234) | Ollama API + OpenAI-compatible |
| Best for | Exploring models, visual tweaking | Automation, scripting, server use |
| Setup effort | Minimal (point and click) | Minimal (one-line install) |
Use LM Studio if you want a visual interface to browse, download, and experiment with models. Use Ollama if you want a CLI-first tool that runs as a background service and integrates into scripts and pipelines.
They’re not mutually exclusive — many people use both. Read the full Ollama complete guide for the other side of the coin.
Tips and Common Issues
Model won’t load — out of memory. The model is too large for your available RAM/VRAM. Try a smaller quantization (Q4_K_M instead of Q8) or a smaller model. Reduce context length in settings — the default 4096 is fine for most tasks and uses less memory than 8192+.
Slow generation on GPU. Check that GPU offloading is actually enabled. In the model settings, set the number of GPU layers to the maximum your VRAM allows. Partially offloaded models (some layers on GPU, some on CPU) are much faster than pure CPU.
API server returns errors. Make sure a model is loaded before starting the server. The model name in your API call doesn’t need to match exactly — LM Studio serves whatever model is currently loaded.
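To confirm what the server is actually serving, you can query the standard models endpoint. A quick check, assuming the server is running on the default port:

```python
# Ask the local server which model(s) it currently exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
for model in client.models.list():
    print(model.id)
```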
macOS: “App is damaged” warning. Run xattr -cr /Applications/LM\ Studio.app in Terminal, then open again.
Want more control? Enable Developer Mode in settings to access context length overrides, RoPE frequency settings, batch size tuning, and per-layer GPU offloading.
Keep models organized. Over time you’ll accumulate many GBs of models. Periodically review and delete ones you don’t use from the My Models tab.
FAQ
Is LM Studio free?
Yes, LM Studio is free for personal use. It provides a full-featured GUI for downloading, running, and experimenting with local language models at no cost.
Does LM Studio need a GPU?
No, LM Studio works on CPU-only machines, though a GPU significantly improves generation speed. With Apple Silicon Macs, the unified memory architecture provides good performance without a discrete GPU.
How does LM Studio compare to Ollama?
LM Studio offers a visual interface for browsing and managing models with easy configuration, while Ollama is a CLI-first tool that runs as a background service. LM Studio is better for experimentation and beginners; Ollama is better for scripting and production deployments.
Can I use LM Studio as an API server?
Yes, LM Studio includes a built-in local API server that exposes an OpenAI-compatible endpoint. You can use it with any tool that supports the OpenAI API format, including Continue.dev, Open WebUI, and custom applications.