Fine-Tune a Local LLM: Beginner's Guide with LoRA and Unsloth (2026)
You've been writing elaborate system prompts, stuffing examples into context windows, and wrestling with inconsistent outputs. At some point, prompting hits a wall. The model doesn't know your domain; it's guessing based on instructions you keep repeating.
Fine-tuning fixes that. Instead of telling the model what to do every time, you teach it once. The knowledge becomes part of the model's weights. Responses get faster, more consistent, and cheaper (no more burning tokens on giant prompts).
Fine-tuning makes sense when:
- You need a specific output format every time (structured JSON, code in your framework's style, medical notes)
- You have domain knowledge the base model lacks or gets wrong
- You're repeating the same long system prompt across thousands of requests
- You want a smaller, faster model that performs like a larger one on your specific task
If you just need to reference documents, RAG is probably the better path. But when the model itself needs to behave differently, fine-tuning is the answer.
What fine-tuning actually does
A base model like Llama 3 or Qwen 3 has billions of parameters: numerical weights learned during pre-training on massive datasets. Fine-tuning adjusts those weights using your own, much smaller dataset so the model learns your patterns.
Think of it like this: the base model went to university and learned general knowledge. Fine-tuning is the on-the-job training where it learns how your company works.
You don't need terabytes of data. A few hundred high-quality examples can dramatically shift model behavior for a focused task.
LoRA: adapters, not full weights
Training all billions of parameters requires serious hardware: multiple A100 GPUs and days of compute. That's where LoRA (Low-Rank Adaptation) changes the game.
Instead of modifying every weight in the model, LoRA freezes the original weights and attaches small trainable "adapter" layers alongside them. These adapters are typically less than 1% of the original model size.
The result:
- Memory: a 7B model fine-tune fits in ~16 GB VRAM instead of 80+ GB
- Speed: training takes minutes to hours, not days
- Storage: adapters are ~50–200 MB, not 14 GB
- Quality: nearly identical to full fine-tuning for most tasks
You can even stack multiple LoRA adapters on the same base model for different tasks.
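To get a feel for those size claims, here is a rough back-of-the-envelope estimate in plain Python. The hidden size, layer count, and module count are assumptions (every adapted projection is treated as a square 4096x4096 matrix for simplicity), so read the output as an order-of-magnitude sketch rather than an exact figure for any particular model:
# Rough LoRA size estimate -- assumed shapes, not exact for any specific model
hidden = 4096      # assumed hidden size of a 7-8B model
r = 16             # LoRA rank
n_layers = 32      # assumed transformer blocks
n_modules = 7      # q/k/v/o + gate/up/down projections per block
# Each adapter is two low-rank factors: A (r x d) and B (d x r)
lora_params = 2 * hidden * r * n_layers * n_modules
print(f"~{lora_params / 1e6:.0f}M trainable params "
      f"(~{lora_params / 7e9 * 100:.2f}% of a 7B base model)")
print(f"~{lora_params * 2 / 1e6:.0f} MB on disk at fp16")
That works out to roughly 29M trainable parameters, under half a percent of a 7B model and a few dozen MB on disk, which lines up with the figures above; ranks of 32 or 64 roughly double or quadruple it.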
Why Unsloth
Unsloth is an open-source library that makes LoRA fine-tuning roughly 2x faster while using 60% less memory compared to standard Hugging Face training. It achieves this through hand-written Triton GPU kernels and optimized backpropagation; you don't need to understand the internals, just enjoy the speed.
It supports Llama, Qwen, Mistral, Gemma, and most popular architectures out of the box.
Hardware requirements
| Tier | VRAM | What you can fine-tune | Example GPU |
|---|---|---|---|
| Minimum | 16 GB | 7–8B models with 4-bit quantization | RTX 4060 Ti 16GB, T4 |
| Recommended | 24 GB | 7–14B models comfortably | RTX 3090, RTX 4090, A5000 |
| Comfortable | 48 GB | Up to 30B+ models | A6000, dual 3090 |
Not sure about your setup? Check our VRAM guide for detailed breakdowns. For production quantization decisions, see quantization tradeoffs.
A free Google Colab T4 instance (16 GB) works for following this tutorial.
Step 1: Install Unsloth
pip install unsloth
Unsloth pulls in its dependencies (transformers, trl, etc.) automatically. Make sure you have a CUDA-compatible GPU and PyTorch installed.
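A quick way to confirm PyTorch can actually see your GPU before going further:
import torch

print(torch.__version__)
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on a free Colab instance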
Step 2: Load the base model
We'll use Qwen 3 8B as our base. Unsloth handles 4-bit quantization automatically to fit in limited VRAM.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
)
You can swap in "unsloth/Llama-3.1-8B-bnb-4bit" or any supported model. Check Unsloth's Hugging Face page for the full list of pre-quantized models.
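If you want to double-check that the 4-bit load fits your card, the standard transformers get_memory_footprint method works on the returned model. The exact number varies by model and library version; this is just a sanity check:
# Rough memory sanity check -- expect a single-digit GB figure for an 8B model in 4-bit
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
print(next(model.parameters()).device)  # should be a cuda device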
Step 3: Prepare your dataset
Fine-tuning needs examples of the behavior you want. The standard format is conversational JSONL, with one JSON object per line:
{"conversations": [{"role": "system", "content": "You are a customer support agent for Acme Corp."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to acme.com/reset, enter your email, and click the link we send you. The link expires in 15 minutes."}]}
{"conversations": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "You can return any item within 30 days of purchase. Items must be unused and in original packaging. Refunds process in 5-7 business days."}]}
Save this as dataset.jsonl. A few guidelines:
- Quality over quantity: 200–500 excellent examples beat 10,000 sloppy ones
- Cover edge cases: include the tricky scenarios, not just the easy ones
- Be consistent: if you want a specific tone or format, every example should demonstrate it
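If your examples live in code (or a spreadsheet export) rather than a hand-written file, a small script like this writes them out in the format above. The Q&A pairs here are placeholders; substitute your own domain data:
import json

# Placeholder Q&A pairs -- replace with your real examples
examples = [
    ("How do I reset my password?",
     "Go to acme.com/reset, enter your email, and click the link we send you."),
    ("What's your return policy?",
     "You can return any item within 30 days of purchase."),
]

with open("dataset.jsonl", "w") as f:
    for question, answer in examples:
        record = {"conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")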
Load it with Hugging Face datasets:
from datasets import load_dataset
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")
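One version-dependent detail: depending on your Unsloth and TRL releases, SFTTrainer may expect a flat text column rather than the raw conversations field. If that applies to your setup, a minimal sketch of the conversion uses the tokenizer's chat template (check the official Unsloth notebooks for the exact recipe your version expects):
def to_text(example):
    # Render each conversation into a single training string using the model's chat template
    return {"text": tokenizer.apply_chat_template(
        example["conversations"], tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(to_text)
Older TRL releases let you point SFTTrainer at that column via a dataset_text_field argument; newer ones generally pick up a "text" column on their own.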
Step 4: Configure LoRA and train
First, apply LoRA adapters to the model:
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
The key parameters: r=16 controls adapter size (higher = more capacity but more memory; 8–32 is the typical range), and target_modules specifies which layers get adapters.
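To see how small the trainable portion actually is, the PEFT-wrapped model exposes a helper that prints adapter vs total parameter counts (the exact numbers depend on r and the target modules you chose):
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...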
Now set up the trainer:
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
output_dir="outputs",
),
)
trainer.train()
On a single RTX 4090 with a 500-example dataset, this typically finishes in 10–20 minutes. Watch the training loss: it should decrease steadily. If it plateaus immediately, your learning rate may be too low. If it spikes, too high.
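The per-step losses are also kept on the trainer's state (standard Trainer behavior), so you can inspect or plot them after the run instead of scrolling through console output:
# Collect the logged training losses from the trainer state
losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(f"first: {losses[0]:.3f}  last: {losses[-1]:.3f}  steps logged: {len(losses)}")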
Step 5: Test the fine-tuned model
Before exporting, verify the model behaves as expected:
FastLanguageModel.for_inference(model)
messages = [
{"role": "user", "content": "How do I reset my password?"}
]
inputs = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Run several test prompts, especially edge cases. If the outputs aren't right, revisit your dataset; the fix is almost always better data, not more hyperparameter tuning.
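A quick way to spot-check several prompts in one go; the test questions here are placeholders, so swap in the edge cases from your own dataset:
test_prompts = [
    "How do I reset my password?",
    "What's your return policy?",
    "Can I get a refund after 45 days?",  # an edge case worth covering
]

for prompt in test_prompts:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True, add_generation_prompt=True, return_tensors="pt",
    ).to("cuda")
    output = model.generate(input_ids=inputs, max_new_tokens=256)
    print("---", prompt)
    print(tokenizer.decode(output[0], skip_special_tokens=True))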
Step 6: Export to GGUF for Ollama
To run your fine-tuned model locally with Ollama, export it to GGUF format:
model.save_pretrained_gguf(
"my-finetuned-model",
tokenizer,
quantization_method="q4_k_m",
)
This creates a quantized GGUF file you can load directly into Ollama:
# Create a Modelfile
echo 'FROM ./my-finetuned-model/unsloth.Q4_K_M.gguf' > Modelfile
# Import into Ollama
ollama create my-model -f Modelfile
# Test it
ollama run my-model "How do I reset my password?"
The q4_k_m quantization is a solid default: a good balance of quality and size. For more on quantization choices, see our quantization tradeoffs guide.
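The GGUF export bakes the adapters into a merged, quantized file. It's usually worth saving the raw LoRA adapters separately as well (they're small), so you can resume training or re-export with a different quantization later:
model.save_pretrained("my-finetuned-lora")      # adapter weights only, not the full model
tokenizer.save_pretrained("my-finetuned-lora")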
When to fine-tune vs RAG vs better prompting
| Approach | Best for | Not great for |
|---|---|---|
| Better prompting | Quick iteration, general tasks, when you have good examples to put in-context | Consistent formatting at scale, domain-specific knowledge |
| RAG | Referencing documents, knowledge that changes frequently, citation needed | Changing model behavior/style, structured output formats |
| Fine-tuning | Consistent behavior, domain expertise, specific output formats, smaller/faster models | Rapidly changing information, when you have < 50 examples |
These aren't mutually exclusive. A fine-tuned model with RAG is a powerful combination: the model knows how to respond, while RAG provides what to reference.
For a deeper dive into RAG, see our local RAG pipeline tutorial. If you're fine-tuning specifically for code, check out best models for coding locally. And for a Gemma-specific walkthrough, we have a dedicated Gemma 4 LoRA guide.
FAQ
Do I need a GPU to fine-tune?
A GPU is strongly recommended; fine-tuning on CPU is impractically slow for any model above 1B parameters. With Unsloth and 4-bit quantization, a 16 GB VRAM GPU (like a free Colab T4) is enough to fine-tune 7–8B models.
How much data do I need?
As few as 200–500 high-quality examples can produce significant improvements for a focused task. Quality matters far more than quantity; 300 carefully crafted examples will outperform 10,000 sloppy ones.
Is fine-tuning better than RAG?
They solve different problems. Fine-tuning changes how the model behaves (tone, format, domain expertise), while RAG provides external knowledge the model can reference. For many production systems, combining both gives the best results.
Can I fine-tune Llama or Qwen?
Yes, both Llama and Qwen are fully supported by Unsloth and the broader LoRA ecosystem. You can load pre-quantized versions like unsloth/Llama-3.1-8B-bnb-4bit or unsloth/Qwen3-8B-bnb-4bit and fine-tune them on a single consumer GPU.
Where to go from here
- Experiment with different r values (8, 16, 32) and compare outputs
- Try DPO (Direct Preference Optimization) to align the model with human preferences after SFT
- Build an evaluation set to measure quality objectively; don't just vibe-check (see the sketch below)
- Combine your fine-tuned model with RAG for the best of both worlds
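For the evaluation point above, even a crude automated check beats vibe-checking. A minimal sketch, assuming a hypothetical eval.jsonl where each line has a prompt and a keyword the answer must contain:
import json

# Hypothetical eval file: {"prompt": ..., "must_contain": ...} per line
with open("eval.jsonl") as f:
    cases = [json.loads(line) for line in f]

passed = 0
for case in cases:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": case["prompt"]}],
        tokenize=True, add_generation_prompt=True, return_tensors="pt",
    ).to("cuda")
    output = model.generate(input_ids=inputs, max_new_tokens=256)
    reply = tokenizer.decode(output[0], skip_special_tokens=True)
    passed += case["must_contain"].lower() in reply.lower()  # crude but objective

print(f"{passed}/{len(cases)} checks passed")
A keyword check is a blunt instrument, but it is repeatable, so you can tell whether a new dataset or rank actually helped.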
Fine-tuning used to require a machine learning team and a cluster of GPUs. With LoRA and Unsloth, it's an afternoon project on a single consumer GPU. The hardest part isn't the code; it's curating good training data.