Gemma 4 is one of the best open models to fine-tune: the Apache 2.0 license means no usage restrictions, the architecture is well supported by every major fine-tuning tool, and the base model is strong enough that even small datasets produce good results.
This guide walks through fine-tuning Gemma 4 26B with LoRA using Unsloth (fastest) and Hugging Face TRL (most flexible).
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) trains a small set of adapter weights instead of modifying the entire model. Benefits:
- 10-100x less memory than full fine-tuning
- Trains in minutes to hours instead of days
- Preserves base model knowledge: you add capabilities, not replace them
- Small output: adapter files are typically 10-100 MB
The base Gemma 4 model stays unchanged. Your LoRA adapter sits on top and modifies its behavior for your specific use case.
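To see where the savings come from: a rank-r adapter adds two small matrices per adapted weight, so the trainable parameter count scales with r rather than with the hidden size squared. Here is a rough back-of-the-envelope calculation; the hidden size and layer count are illustrative placeholders, not Gemma 4's actual configuration.
# Illustrative numbers only (not Gemma 4's real architecture)
hidden_size = 4096    # model width (assumed)
num_layers = 40       # transformer blocks (assumed)
rank = 16             # LoRA rank
projections = 4       # q_proj, k_proj, v_proj, o_proj

# Each adapted projection adds two low-rank matrices: A (rank x hidden) and B (hidden x rank)
lora_params = num_layers * projections * 2 * rank * hidden_size
full_attention_params = num_layers * projections * hidden_size * hidden_size

print(f"LoRA adapter params:   {lora_params / 1e6:.1f}M")
print(f"Full attention params: {full_attention_params / 1e6:.1f}M")
print(f"Fraction trained:      {lora_params / full_attention_params:.2%}")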
Hardware requirements
| Model | Method | VRAM needed | Training time (1K examples) |
|---|---|---|---|
| Gemma 4 E4B | LoRA | 8 GB | ~10 min |
| Gemma 4 26B | QLoRA (4-bit) | 12 GB | ~30 min |
| Gemma 4 26B | LoRA (16-bit) | 48 GB | ~20 min |
| Gemma 4 31B | QLoRA (4-bit) | 16 GB | ~45 min |
QLoRA (Quantized LoRA) loads the base model in 4-bit precision, making it possible to fine-tune large models on consumer GPUs. Quality is nearly identical to full-precision LoRA.
An RTX 3090 (24 GB) or RTX 4090 (24 GB) handles Gemma 4 26B QLoRA comfortably. For the best GPU options, see our buying guide.
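If you're not sure how much VRAM your card actually reports, a quick check with PyTorch (assuming a CUDA GPU) settles it:
import torch

# Total VRAM on the first CUDA device, in GB
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")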
Method 1: Unsloth (fastest)
Unsloth is 2-5x faster than standard fine-tuning and uses 60% less memory. It's the recommended approach.
Install
pip install unsloth
Prepare your data
Format your training data as conversations:
dataset = [
{
"conversations": [
{"role": "user", "content": "What's the refund policy for digital products?"},
{"role": "assistant", "content": "Digital products can be refunded within 14 days if unused. Once a license key is activated, refunds are handled case-by-case. Contact support@example.com with your order number."}
]
},
# ... more examples
]
Save as JSONL:
import json
with open("training_data.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")
How much data do you need?
- 50-100 examples: Teaches a specific style or format
- 500-1000 examples: Teaches domain knowledge
- 5000+ examples: Significant behavior changes
Fine-tune
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="google/gemma-4-26b",
max_seq_length=2048,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (higher = more capacity, more memory)
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
)
# Load dataset and render each conversation with the model's chat template
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)
})
# Train
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
args=TrainingArguments(
output_dir="./gemma4-lora",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=10,
logging_steps=10,
save_steps=100,
fp16=True,
),
tokenizer=tokenizer,
max_seq_length=2048,
)
trainer.train()
# Save the adapter
model.save_pretrained("./gemma4-lora")
tokenizer.save_pretrained("./gemma4-lora")
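The saved adapter can be reloaded later without retraining. A minimal sketch using plain Transformers and PEFT (the paths match the save calls above):
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model, then attach the saved LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained("google/gemma-4-26b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./gemma4-lora")
model = PeftModel.from_pretrained(base, "./gemma4-lora")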
Export to Ollama
# Convert to GGUF for use with Ollama (run in Python after training)
model.save_pretrained_gguf("gemma4-custom", tokenizer, quantization_method="q4_k_m")
# Create an Ollama model (the GGUF filename inside gemma4-custom/ may vary by Unsloth version)
echo 'FROM ./gemma4-custom/unsloth.Q4_K_M.gguf' > Modelfile
ollama create my-custom-gemma -f Modelfile
ollama run my-custom-gemma
Method 2: Hugging Face TRL (most flexible)
For more control over the training process:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-26b",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26b")
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Train (same as above)
# ...
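Before launching training, it's worth confirming that only the adapter weights are trainable; PEFT models expose a helper for exactly this:
# Sanity check: only the LoRA matrices should be trainable (well under 1% of all parameters)
model.print_trainable_parameters()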
Fine-tuning tips
Pick the right base model
- Gemma 4 E4B – fastest to train, good for simple tasks (classification, formatting)
- Gemma 4 26B – best balance, recommended for most use cases
- Gemma 4 31B – highest quality base, use if you have the VRAM
Data quality > quantity
100 high-quality, diverse examples beat 10,000 noisy ones. Each example should:
- Be a realistic input/output pair
- Demonstrate the exact behavior you want
- Cover edge cases and variations
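A quick script along these lines can catch the most obvious problems, duplicated prompts and one-line answers, before you spend GPU time (it assumes the training_data.jsonl format from the Unsloth section):
import json
from collections import Counter

# Load the training set and flag duplicate prompts and very short answers
rows = [json.loads(line) for line in open("training_data.jsonl")]
prompts = [r["conversations"][0]["content"] for r in rows]
dupes = [p for p, n in Counter(prompts).items() if n > 1]
short = [r for r in rows if len(r["conversations"][-1]["content"]) < 40]

print(f"{len(rows)} examples, {len(dupes)} duplicate prompts, {len(short)} very short answers")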
Hyperparameter starting points
| Parameter | Recommended | Notes |
|---|---|---|
| LoRA rank (r) | 16 | Increase to 32-64 for complex tasks |
| Learning rate | 2e-4 | Lower (1e-4) for larger datasets |
| Epochs | 3 | More for small datasets, fewer for large |
| Batch size | 2-4 | Limited by VRAM |
Evaluate before deploying
# Quick evaluation
model.eval()
messages = [{"role": "user", "content": "Your test prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Compare outputs against your expected results. If quality is poor, try:
- More training data
- Higher LoRA rank
- More epochs
- Better data quality
Use cases for fine-tuning
Customer support bot – Train on your support tickets to match your tone and product knowledge.
Code assistant for your stack – Train on your codebase conventions, API patterns, and internal libraries.
Domain expert – Train on medical, legal, or financial documents to create a specialized assistant.
Content writer – Train on your published content to match your writing style.
Running your fine-tuned model
After exporting to GGUF and creating an Ollama model, use it like any other:
ollama run my-custom-gemma
It works with all the same tools: Continue.dev, Open WebUI, any OpenAI-compatible client. See our Ollama guide for integration options.
For serving to multiple users, see our Ollama vs llama.cpp vs vLLM comparison; vLLM handles LoRA adapters natively for production serving.
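For example, any OpenAI client can talk to the fine-tuned model through Ollama's built-in OpenAI-compatible endpoint (the default local URL is shown below; the API key is required by the client but ignored by Ollama):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="my-custom-gemma",
    messages=[{"role": "user", "content": "What's the refund policy for digital products?"}],
)
print(response.choices[0].message.content)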
Further reading
- Gemma 4 Family Guide – model specs and benchmarks
- How to Run Gemma 4 Locally – setup without fine-tuning
- Best GPU for AI Locally – hardware recommendations
- Self-Hosted AI vs API – when to run locally vs use cloud