Fine-Tune a Local LLM: Beginner's Guide with LoRA and Unsloth (2026)
You've been writing elaborate system prompts, stuffing examples into context windows, and wrestling with inconsistent outputs. At some point, prompting hits a wall. The model doesn't know your domain; it's guessing based on instructions you keep repeating.
Fine-tuning fixes that. Instead of telling the model what to do every time, you teach it once. The knowledge becomes part of the model's weights. Responses get faster, more consistent, and cheaper (no more burning tokens on giant prompts).
Fine-tuning makes sense when:
- You need a specific output format every time (structured JSON, code in your framework's style, medical notes)
- You have domain knowledge the base model lacks or gets wrong
- You're repeating the same long system prompt across thousands of requests
- You want a smaller, faster model that performs like a larger one on your specific task
If you just need to reference documents, RAG is probably the better path. But when the model itself needs to behave differently, fine-tuning is the answer.
What fine-tuning actually does
A base model like Llama 3 or Qwen 3 has billions of parameters: numerical weights learned during pre-training on massive datasets. Fine-tuning adjusts those weights using your own, much smaller dataset so the model learns your patterns.
Think of it like this: the base model went to university and learned general knowledge. Fine-tuning is the on-the-job training where it learns how your company works.
You don't need terabytes of data. A few hundred high-quality examples can dramatically shift model behavior for a focused task.
LoRA: adapters, not full weights
Training all billions of parameters requires serious hardware: multiple A100 GPUs and days of compute. That's where LoRA (Low-Rank Adaptation) changes the game.
Instead of modifying every weight in the model, LoRA freezes the original weights and attaches small trainable "adapter" layers alongside them. These adapters are typically less than 1% of the original model size.
The result:
- Memory: a 7B model fine-tune fits in ~16 GB VRAM instead of 80+ GB
- Speed: training takes minutes to hours, not days
- Storage: adapters are ~50–200 MB, not 14 GB
- Quality: nearly identical to full fine-tuning for most tasks
You can even stack multiple LoRA adapters on the same base model for different tasks.
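To get a feel for those size claims, here is a rough back-of-the-envelope estimate in plain Python. The hidden size, layer count, and module count are assumptions (every adapted projection is treated as a square 4096x4096 matrix for simplicity), so read the output as an order-of-magnitude sketch rather than an exact figure for any particular model:
# Rough LoRA size estimate -- assumed shapes, not exact for any specific model
hidden = 4096      # assumed hidden size of a 7-8B model
r = 16             # LoRA rank
n_layers = 32      # assumed transformer blocks
n_modules = 7      # q/k/v/o + gate/up/down projections per block
# Each adapter is two low-rank factors: A (r x d) and B (d x r)
lora_params = 2 * hidden * r * n_layers * n_modules
print(f"~{lora_params / 1e6:.0f}M trainable params "
      f"(~{lora_params / 7e9 * 100:.2f}% of a 7B base model)")
print(f"~{lora_params * 2 / 1e6:.0f} MB on disk at fp16")
That works out to roughly 29M trainable parameters, under half a percent of a 7B model and a few dozen MB on disk, which lines up with the figures above; ranks of 32 or 64 roughly double or quadruple it.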
Why Unsloth
Unsloth is an open-source library that makes LoRA fine-tuning roughly 2x faster while using 60% less memory compared to standard Hugging Face training. It achieves this through hand-written Triton GPU kernels and optimized backpropagation; you don't need to understand the internals, just enjoy the speed.
It supports Llama, Qwen, Mistral, Gemma, and most popular architectures out of the box.
Hardware requirements
| Tier | VRAM | What you can fine-tune | Example GPU |
|---|---|---|---|
| Minimum | 16 GB | 7–8B models with 4-bit quantization | RTX 4060 Ti 16GB, T4 |
| Recommended | 24 GB | 7–14B models comfortably | RTX 3090, RTX 4090, A5000 |
| Comfortable | 48 GB | Up to 30B+ models | A6000, dual 3090 |
Not sure about your setup? Check our VRAM guide for detailed breakdowns. For production quantization decisions, see quantization tradeoffs.
A free Google Colab T4 instance (16 GB) works for following this tutorial.
Step 1: Install Unsloth
pip install unsloth
Unsloth pulls in its dependencies (transformers, trl, etc.) automatically. Make sure you have a CUDA-compatible GPU and PyTorch installed.
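A quick way to confirm PyTorch can actually see your GPU before going further:
import torch

print(torch.__version__)
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on a free Colab instance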
Step 2: Load the base model
We'll use Qwen 3 8B as our base. Unsloth handles 4-bit quantization automatically to fit in limited VRAM.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
)
You can swap in "unsloth/Llama-3.1-8B-bnb-4bit" or any supported model. Check Unsloth's Hugging Face page for the full list of pre-quantized models.
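If you want to double-check that the 4-bit load fits your card, the standard transformers get_memory_footprint method works on the returned model. The exact number varies by model and library version; this is just a sanity check:
# Rough memory sanity check -- expect a single-digit GB figure for an 8B model in 4-bit
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
print(next(model.parameters()).device)  # should be a cuda device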
Step 3: Prepare your dataset
Fine-tuning needs examples of the behavior you want. The standard format is conversational JSONL, with one JSON object per line:
{"conversations": [{"role": "system", "content": "You are a customer support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to acme.com/reset, enter your email, and click the link we send you. The link expires in 15 minutes."}]}
{"conversations": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "You can return any item within 30 days of purchase. Items must be unused and in original packaging. Refunds process in 5-7 business days."}]}
Save this as dataset.jsonl. A few guidelines:
- Quality over quantity: 200–500 excellent examples beat 10,000 sloppy ones
- Cover edge cases: include the tricky scenarios, not just the easy ones
- Be consistent: if you want a specific tone or format, every example should demonstrate it
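If your examples live in code (or a spreadsheet export) rather than a hand-written file, a small script like this writes them out in the format above. The Q&A pairs here are placeholders; substitute your own domain data:
import json

# Placeholder Q&A pairs -- replace with your real examples
examples = [
    ("How do I reset my password?",
     "Go to acme.com/reset, enter your email, and click the link we send you."),
    ("What's your return policy?",
     "You can return any item within 30 days of purchase."),
]

with open("dataset.jsonl", "w") as f:
    for question, answer in examples:
        record = {"conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")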
Load it with Hugging Face datasets:
from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
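One version-dependent detail: depending on your Unsloth and TRL releases, SFTTrainer may expect a flat text column rather than the raw conversations field. If that applies to your setup, a minimal sketch of the conversion uses the tokenizer's chat template (check the official Unsloth notebooks for the exact recipe your version expects):
def to_text(example):
    # Render each conversation into a single training string using the model's chat template
    return {"text": tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(to_text)
Older TRL releases let you point SFTTrainer at that column via a dataset_text_field argument; newer ones generally pick up a "text" column on their own.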
Step 4: Configure LoRA and train
First, apply LoRA adapters to the model:
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
The key parameters: r=16 controls adapter size (higher = more capacity but more memory; 8–32 is the typical range), and target_modules specifies which layers get adapters.
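To see how small the trainable portion actually is, the PEFT-wrapped model exposes a helper that prints adapter vs total parameter counts (the exact numbers depend on r and the target modules you chose):
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...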
Now set up the trainer:
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
output_dir="outputs",
),
)
trainer.train()
On a single RTX 4090 with a 500-example dataset, this typically finishes in 10–20 minutes. Watch the training loss: it should decrease steadily. If it plateaus immediately, your learning rate may be too low. If it spikes, too high.
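The per-step losses are also kept on the trainer's state (standard Trainer behavior), so you can inspect or plot them after the run instead of scrolling through console output:
# Collect the logged training losses from the trainer state
losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(f"first: {losses[0]:.3f}  last: {losses[-1]:.3f}  steps logged: {len(losses)}")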
Step 5: Test the fine-tuned model
Before exporting, verify the model behaves as expected:
FastLanguageModel.for_inference(model)
messages = [
{"role": "user", "content": "How do I reset my password?"}
]
inputs = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Run several test prompts, especially edge cases. If the outputs aren't right, revisit your dataset; the fix is almost always better data, not more hyperparameter tuning.
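A quick way to spot-check several prompts in one go; the test questions here are placeholders, so swap in the edge cases from your own dataset:
test_prompts = [
    "How do I reset my password?",
    "What's your return policy?",
    "Can I get a refund after 45 days?",  # an edge case worth covering
]

for prompt in test_prompts:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True, add_generation_prompt=True, return_tensors="pt",
    ).to("cuda")
    output = model.generate(input_ids=inputs, max_new_tokens=256)
    print("---", prompt)
    print(tokenizer.decode(output[0], skip_special_tokens=True))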
Step 6: Export to GGUF for Ollama
To run your fine-tuned model locally with Ollama, export it to GGUF format:
model.save_pretrained_gguf(
"my-finetuned-model",
tokenizer,
quantization_method="q4_k_m",
)
This creates a quantized GGUF file you can load directly into Ollama:
# Create a Modelfile
echo 'FROM ./my-finetuned-model/unsloth.Q4_K_M.gguf' > Modelfile
# Import into Ollama
ollama create my-model -f Modelfile
# Test it
ollama run my-model "How do I reset my password?"
The q4_k_m quantization is a solid default: a good balance of quality and size. For more on quantization choices, see our quantization tradeoffs guide.
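The GGUF export bakes the adapters into a merged, quantized file. It's usually worth saving the raw LoRA adapters separately as well (they're small), so you can resume training or re-export with a different quantization later:
model.save_pretrained("my-finetuned-lora")      # adapter weights only, not the full model
tokenizer.save_pretrained("my-finetuned-lora")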
When to fine-tune vs RAG vs better prompting
| Approach | Best for | Not great for |
|---|---|---|
| Better prompting | Quick iteration, general tasks, when you have good examples to put in-context | Consistent formatting at scale, domain-specific knowledge |
| RAG | Referencing documents, knowledge that changes frequently, citation needed | Changing model behavior/style, structured output formats |
| Fine-tuning | Consistent behavior, domain expertise, specific output formats, smaller/faster models | Rapidly changing information, when you have < 50 examples |
These aren't mutually exclusive. A fine-tuned model with RAG is a powerful combination: the model knows how to respond, while RAG provides what to reference.
For a deeper dive into RAG, see our local RAG pipeline tutorial. If you're fine-tuning specifically for code, check out best models for coding locally. And for a Gemma-specific walkthrough, we have a dedicated Gemma 4 LoRA guide.
FAQ
Do I need a GPU to fine-tune?
A GPU is strongly recommended; fine-tuning on CPU is impractically slow for any model above 1B parameters. With Unsloth and 4-bit quantization, a 16 GB VRAM GPU (like a free Colab T4) is enough to fine-tune 7–8B models.
How much data do I need?
As few as 200–500 high-quality examples can produce significant improvements for a focused task. Quality matters far more than quantity; 300 carefully crafted examples will outperform 10,000 sloppy ones.
Is fine-tuning better than RAG?
They solve different problems. Fine-tuning changes how the model behaves (tone, format, domain expertise), while RAG provides external knowledge the model can reference. For many production systems, combining both gives the best results.
Can I fine-tune Llama or Qwen?
Yes, both Llama and Qwen are fully supported by Unsloth and the broader LoRA ecosystem. You can load pre-quantized versions like unsloth/Llama-3.1-8B-bnb-4bit or unsloth/Qwen3-8B-bnb-4bit and fine-tune them on a single consumer GPU.
Where to go from here
- Experiment with different r values (8, 16, 32) and compare outputs
- Try DPO (Direct Preference Optimization) to align the model with human preferences after SFT
- Build an evaluation set to measure quality objectively; don't just vibe-check (see the sketch below)
- Combine your fine-tuned model with RAG for the best of both worlds
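For the evaluation point above, even a crude automated check beats vibe-checking. A minimal sketch, assuming a hypothetical eval.jsonl where each line has a prompt and a keyword the answer must contain:
import json

# Hypothetical eval file: {"prompt": ..., "must_contain": ...} per line
with open("eval.jsonl") as f:
    cases = [json.loads(line) for line in f]

passed = 0
for case in cases:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": case["prompt"]}],
        tokenize=True, add_generation_prompt=True, return_tensors="pt",
    ).to("cuda")
    output = model.generate(input_ids=inputs, max_new_tokens=256)
    reply = tokenizer.decode(output[0], skip_special_tokens=True)
    passed += case["must_contain"].lower() in reply.lower()  # crude but objective

print(f"{passed}/{len(cases)} checks passed")
A keyword check is a blunt instrument, but it is repeatable, so you can tell whether a new dataset or rank actually helped.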
Fine-tuning used to require a machine learning team and a cluster of GPUs. With LoRA and Unsloth, it's an afternoon project on a single consumer GPU. The hardest part isn't the code; it's curating good training data.