LM Studio Complete Guide — Run Local LLMs With a GUI (2026)
LM Studio is a free desktop application that lets you download, run, and chat with open-source large language models entirely on your own hardware. No cloud. No subscription. No data leaving your machine.
Under the hood, it’s a polished GUI sitting on top of llama.cpp, the C++ inference engine that made local LLMs practical. What LM Studio adds is convenience: a model browser connected to HuggingFace, a built-in chat interface, an OpenAI-compatible API server, and automatic GPU detection — all without touching a terminal.
It’s become one of the most popular ways to run local models because it removes nearly all the friction. You install it, search for a model, click download, and start chatting. If you’ve been curious about running AI locally but didn’t want to wrestle with Python environments or Docker containers, this is where to start.
For a comparison with the CLI-first alternative, see our Ollama vs LM Studio vs vLLM breakdown.
System Requirements
| Component | Minimum | Recommended |
| --- | --- | --- |
| RAM | 8 GB | 16 GB+ |
| GPU VRAM | Not required (CPU-only works) | 8 GB+ |
| Disk | 10 GB free | 50 GB+ (models are large) |
| OS | Windows 10+, macOS 14+ (Apple Silicon only), Linux (x86_64) | |
| GPU Support | NVIDIA (CUDA), Apple Silicon (Metal/MLX), AMD (Vulkan/ROCm) | |
Not sure if your GPU is enough? Check our guide on how much VRAM you actually need for AI.
Installation
macOS (Apple Silicon): Download the .dmg from lmstudio.ai, drag to Applications, done. Requires macOS 14 Sonoma or later. Metal acceleration is enabled automatically.
Windows: Download the installer from the same site. Run it. CUDA support is detected automatically if you have an NVIDIA GPU with up-to-date drivers.
Linux: Download the .AppImage, make it executable (chmod +x), and run. For Vulkan or ROCm GPU support, make sure your drivers are installed first.
That’s it — no Python, no dependencies, no Docker.
Downloading Your First Model
Open LM Studio and go to the Discover tab (the magnifying glass icon). This is a built-in browser for HuggingFace models, filtered to show compatible GGUF files.
What is GGUF?
GGUF is the model file format used by llama.cpp. Quantization compresses a model’s weights from their native 16-bit floating-point precision down to smaller representations (8-bit, 4-bit, etc.), dramatically reducing file size and memory usage at a modest cost in quality.
Quantization Levels — Quick Reference
| Quant | Size vs FP16 | Quality | Use Case |
| --- | --- | --- | --- |
| Q2_K | ~25% | Low | Experimentation only |
| Q4_K_M | ~35% | Good | Best balance — start here |
| Q5_K_M | ~45% | Very good | When you have VRAM to spare |
| Q8_0 | ~55% | Near-original | High-end GPUs (24 GB+) |
For most users, Q4_K_M is the sweet spot. It keeps quality high while fitting comfortably in 8–16 GB of VRAM for 7B–14B parameter models.
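If you want to sanity-check whether a model will fit your hardware before downloading, the arithmetic is simple: parameter count times bits per weight. Here’s a minimal sketch; the bits-per-weight figures are rough averages for llama.cpp’s quant formats, and real GGUF files carry some extra overhead for metadata and mixed-precision layers.

```python
# Back-of-the-envelope GGUF size estimate: params x bits-per-weight / 8.
# Bits-per-weight values are approximate averages for each quant format.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk (and loaded) size of a quantized model in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B @ {quant}: ~{estimate_size_gb(7, quant):.1f} GB")
# 7B @ Q4_K_M: ~4.2 GB -- leaves headroom for context on an 8 GB GPU
```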
How to Pick a Model
- In the Discover tab, search for a model name (e.g., “Qwen 3.6” or “Llama 4”).
- Look at the available quantizations and file sizes.
- Pick Q4_K_M unless you have a reason not to.
- Click Download.
The model lands in ~/.cache/lm-studio/models/ and is ready to load.
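If you later want a quick inventory of what you’ve downloaded, the directory is easy to scan. A small sketch, assuming the default path above (newer LM Studio versions may store models elsewhere; the My Models tab shows the actual location):

```python
# List downloaded GGUF files and their sizes under the default models dir.
# The path is an assumption based on the default cited above.
from pathlib import Path

models_dir = Path.home() / ".cache" / "lm-studio" / "models"
for gguf in sorted(models_dir.rglob("*.gguf")):
    print(f"{gguf.stat().st_size / 1e9:6.1f} GB  {gguf.relative_to(models_dir)}")
```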
Chat Interface Basics
Switch to the Chat tab (the message bubble icon). Select your downloaded model from the dropdown at the top. LM Studio loads it into memory — you’ll see a progress bar and memory usage stats.
Once loaded, you can:
- Chat with the model in a familiar message interface
- Adjust the system prompt to shape behavior
- Tweak temperature, top-p, max tokens, and other generation parameters in the right sidebar
- Toggle Developer Mode (in settings) for advanced options like context length, RoPE scaling, and GPU layer offloading controls
Responses are generated locally. Speed depends on your hardware — expect 10–40 tokens/second on a decent GPU, 2–8 tokens/second on CPU only.
Running a Local API Server
This is one of LM Studio’s killer features. Go to the Developer tab (or the <-> icon), load a model, and click Start Server. You now have an OpenAI-compatible API running at:
http://localhost:1234/v1
Any tool that supports the OpenAI API can be pointed at this endpoint instead of OpenAI’s servers. Here’s a quick Python example:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local LM Studio server.
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any string works; it's local
)

response = client.chat.completions.create(
    model="loaded-model-name",  # LM Studio serves whichever model is loaded
    messages=[
        {"role": "user", "content": "Explain quicksort in plain English."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
This makes LM Studio a drop-in local backend for tools like Continue.dev (VS Code AI assistant), Open WebUI, custom scripts, and anything else that speaks the OpenAI protocol.
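The same endpoint also lets you measure the tokens-per-second figures mentioned earlier on your own hardware. A rough sketch: stream a response and count chunks, since each streamed chunk typically carries one token (an approximation, not an exact token count).

```python
# Rough throughput check against the local server: stream and time a reply.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start, tokens = time.time(), 0
stream = client.chat.completions.create(
    model="loaded-model-name",  # LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Write a paragraph about sorting."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # ~1 token per streamed chunk (approximate)
print(f"~{tokens / (time.time() - start):.1f} tokens/sec")
```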
For coding-specific model recommendations, see best AI models for coding locally.
GPU Acceleration Setup
LM Studio detects your GPU automatically in most cases, but here’s what to know:
Apple Silicon (Metal/MLX): Works out of the box. LM Studio uses Metal for GPU inference and also supports MLX, Apple’s optimized ML framework. Unified memory means most of your system RAM can serve as “VRAM.” A MacBook with 16 GB can comfortably run 7B–14B models.
NVIDIA (CUDA): Make sure you have recent NVIDIA drivers installed (535+ recommended). LM Studio bundles its own CUDA runtime, so you don’t need to install the CUDA toolkit separately. Use the GPU offloading slider in the model settings to control how many layers run on GPU vs CPU.
AMD (Vulkan/ROCm): Vulkan support works on most modern AMD GPUs. ROCm support is available on Linux for supported AMD cards (RX 7000 series and some 6000 series). Performance is improving but still behind CUDA.
CPU-only: Totally fine for smaller models (7B Q4). Slower, but it works. LM Studio uses AVX2/AVX-512 instructions when available.
Best Models to Try in 2026
| Model | Parameters | Q4_K_M Size | Good For |
| --- | --- | --- | --- |
| Llama 4 Scout | 17B active (109B MoE) | ~60 GB | General purpose, multilingual |
| Qwen 3.6 35B-A3B | 3B active (35B MoE) | ~20 GB | Coding, reasoning, efficient MoE |
| Mistral Small 3.2 | 24B | ~14 GB | Instruction following, chat |
| Gemma 4 12B | 12B | ~7 GB | Compact all-rounder |
| DeepSeek-R1 8B | 8B | ~5 GB | Reasoning, math, chain-of-thought |
| Phi-4 Mini | 3.8B | ~2.5 GB | Lightweight, fast, good for testing |
For more options that fit in limited hardware, see best AI models under 16 GB VRAM.
LM Studio vs Ollama
Both run local LLMs using llama.cpp. The choice comes down to workflow preference:
| Feature | LM Studio | Ollama |
| --- | --- | --- |
| Interface | GUI (desktop app) | CLI / background service |
| Model source | HuggingFace (GGUF) | Ollama library + custom Modelfiles |
| API server | OpenAI-compatible (localhost:1234) | Ollama API + OpenAI-compatible |
| Best for | Exploring models, visual tweaking | Automation, scripting, server use |
| Setup effort | Minimal (point and click) | Minimal (one-line install) |
Use LM Studio if you want a visual interface to browse, download, and experiment with models. Use Ollama if you want a CLI-first tool that runs as a background service and integrates into scripts and pipelines.
They’re not mutually exclusive — many people use both. Read the full Ollama complete guide for the other side of the coin.
Tips and Common Issues
Model won’t load — out of memory. The model is too large for your available RAM/VRAM. Try a smaller quantization (Q4_K_M instead of Q8) or a smaller model. Reduce context length in settings — the default 4096 is fine for most tasks and uses less memory than 8192+.
Slow generation on GPU. Check that GPU offloading is actually enabled. In the model settings, set the number of GPU layers to the maximum your VRAM allows. Partially offloaded models (some layers on GPU, some on CPU) are much faster than pure CPU.
API server returns errors. Make sure a model is loaded before starting the server. The model name in your API call doesn’t need to match exactly — LM Studio serves whatever model is currently loaded.
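To confirm what the server is actually serving, you can query the standard models endpoint. A quick check, assuming the server is running on the default port:

```python
# Ask the local server which model(s) it currently exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
for model in client.models.list():
    print(model.id)
```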
macOS: “App is damaged” warning. Run xattr -cr /Applications/LM\ Studio.app in Terminal, then open again.
Want more control? Enable Developer Mode in settings to access context length overrides, RoPE frequency settings, batch size tuning, and per-layer GPU offloading.
Keep models organized. Over time you’ll accumulate many GBs of models. Periodically review and delete ones you don’t use from the My Models tab.
FAQ
Is LM Studio free?
Yes, LM Studio is free for personal use. It provides a full-featured GUI for downloading, running, and experimenting with local language models at no cost.
Does LM Studio need a GPU?
No, LM Studio works on CPU-only machines, though a GPU significantly improves generation speed. With Apple Silicon Macs, the unified memory architecture provides good performance without a discrete GPU.
How does LM Studio compare to Ollama?
LM Studio offers a visual interface for browsing and managing models with easy configuration, while Ollama is a CLI-first tool that runs as a background service. LM Studio is better for experimentation and beginners; Ollama is better for scripting and production deployments.
Can I use LM Studio as an API server?
Yes, LM Studio includes a built-in local API server that exposes an OpenAI-compatible endpoint. You can use it with any tool that supports the OpenAI API format, including Continue.dev, Open WebUI, and custom applications.