Gemma 4 is one of the best open models to fine-tune: the Apache 2.0 license means no usage restrictions, the architecture is well supported by every major fine-tuning tool, and the base model is strong enough that even small datasets produce good results.
This guide walks through fine-tuning Gemma 4 26B with LoRA using Unsloth (fastest) and Hugging Face TRL (most flexible).
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) trains a small set of adapter weights instead of modifying the entire model. Benefits:
- 10-100x less memory than full fine-tuning
- Trains in minutes to hours instead of days
- Preserves base model knowledge: you add capabilities, not replace them
- Small output: adapter files are typically 10-100 MB
The base Gemma 4 model stays unchanged. Your LoRA adapter sits on top and modifies its behavior for your specific use case.
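To see where the savings come from: a rank-r adapter adds two small matrices per adapted weight, so the trainable parameter count scales with r rather than with the hidden size squared. Here is a rough back-of-the-envelope calculation; the hidden size and layer count are illustrative placeholders, not Gemma 4's actual configuration.
# Illustrative numbers only (not Gemma 4's real architecture)
hidden_size = 4096    # model width (assumed)
num_layers = 40       # transformer blocks (assumed)
rank = 16             # LoRA rank
projections = 4       # q_proj, k_proj, v_proj, o_proj

# Each adapted projection adds two low-rank matrices: A (rank x hidden) and B (hidden x rank)
lora_params = num_layers * projections * 2 * rank * hidden_size
full_attention_params = num_layers * projections * hidden_size * hidden_size

print(f"LoRA adapter params:   {lora_params / 1e6:.1f}M")
print(f"Full attention params: {full_attention_params / 1e6:.1f}M")
print(f"Fraction trained:      {lora_params / full_attention_params:.2%}")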
Hardware requirements
| Model | Method | VRAM needed | Training time (1K examples) |
|---|---|---|---|
| Gemma 4 E4B | LoRA | 8 GB | ~10 min |
| Gemma 4 26B | QLoRA (4-bit) | 12 GB | ~30 min |
| Gemma 4 26B | LoRA (16-bit) | 48 GB | ~20 min |
| Gemma 4 31B | QLoRA (4-bit) | 16 GB | ~45 min |
QLoRA (Quantized LoRA) loads the base model in 4-bit precision, making it possible to fine-tune large models on consumer GPUs. Quality is nearly identical to full-precision LoRA.
An RTX 3090 (24 GB) or RTX 4090 (24 GB) handles Gemma 4 26B QLoRA comfortably. For the best GPU options, see our buying guide.
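If you're not sure how much VRAM your card actually reports, a quick check with PyTorch (assuming a CUDA GPU) settles it:
import torch

# Total VRAM on the first CUDA device, in GB
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB")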
Method 1: Unsloth (fastest)
Unsloth is 2-5x faster than standard fine-tuning and uses 60% less memory. It's the recommended approach.
Install
pip install unsloth
Prepare your data
Format your training data as conversations:
dataset = [
{
"conversations": [
{"role": "user", "content": "What's the refund policy for digital products?"},
{"role": "assistant", "content": "Digital products can be refunded within 14 days if unused. Once a license key is activated, refunds are handled case-by-case. Contact support@example.com with your order number."}
]
},
# ... more examples
]
Save as JSONL:
import json
with open("training_data.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")
How much data do you need?
- 50-100 examples: Teaches a specific style or format
- 500-1000 examples: Teaches domain knowledge
- 5000+ examples: Significant behavior changes
Fine-tune
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="google/gemma-4-26b",
max_seq_length=2048,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (higher = more capacity, more memory)
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
)
# Load dataset and render each conversation with the model's chat template
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)
})
# Train
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
args=TrainingArguments(
output_dir="./gemma4-lora",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=10,
logging_steps=10,
save_steps=100,
fp16=True,
),
tokenizer=tokenizer,
max_seq_length=2048,
)
trainer.train()
# Save the adapter
model.save_pretrained("./gemma4-lora")
tokenizer.save_pretrained("./gemma4-lora")
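The saved adapter can be reloaded later without retraining. A minimal sketch using plain Transformers and PEFT (the paths match the save calls above):
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model, then attach the saved LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained("google/gemma-4-26b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./gemma4-lora")
model = PeftModel.from_pretrained(base, "./gemma4-lora")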
Export to Ollama
# Convert to GGUF for use with Ollama (run in Python after training)
model.save_pretrained_gguf("gemma4-custom", tokenizer, quantization_method="q4_k_m")
# Create an Ollama model (the GGUF filename inside gemma4-custom/ may vary by Unsloth version)
echo 'FROM ./gemma4-custom/unsloth.Q4_K_M.gguf' > Modelfile
ollama create my-custom-gemma -f Modelfile
ollama run my-custom-gemma
Method 2: Hugging Face TRL (most flexible)
For more control over the training process:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-26b",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26b")
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Train (same as above)
# ...
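Before launching training, it's worth confirming that only the adapter weights are trainable; PEFT models expose a helper for exactly this:
# Sanity check: only the LoRA matrices should be trainable (well under 1% of all parameters)
model.print_trainable_parameters()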
Fine-tuning tips
Pick the right base model
- Gemma 4 E4B – fastest to train, good for simple tasks (classification, formatting)
- Gemma 4 26B – best balance, recommended for most use cases
- Gemma 4 31B – highest quality base, use if you have the VRAM
Data quality > quantity
100 high-quality, diverse examples beat 10,000 noisy ones. Each example should:
- Be a realistic input/output pair
- Demonstrate the exact behavior you want
- Cover edge cases and variations
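A quick script along these lines can catch the most obvious problems, duplicated prompts and one-line answers, before you spend GPU time (it assumes the training_data.jsonl format from the Unsloth section):
import json
from collections import Counter

# Load the training set and flag duplicate prompts and very short answers
rows = [json.loads(line) for line in open("training_data.jsonl")]
prompts = [r["conversations"][0]["content"] for r in rows]
dupes = [p for p, n in Counter(prompts).items() if n > 1]
short = [r for r in rows if len(r["conversations"][-1]["content"]) < 40]

print(f"{len(rows)} examples, {len(dupes)} duplicate prompts, {len(short)} very short answers")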
Hyperparameter starting points
| Parameter | Recommended | Notes |
|---|---|---|
| LoRA rank (r) | 16 | Increase to 32-64 for complex tasks |
| Learning rate | 2e-4 | Lower (1e-4) for larger datasets |
| Epochs | 3 | More for small datasets, fewer for large |
| Batch size | 2-4 | Limited by VRAM |
Evaluate before deploying
# Quick evaluation
model.eval()
messages = [{"role": "user", "content": "Your test prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Compare outputs against your expected results. If quality is poor, try:
- More training data
- Higher LoRA rank
- More epochs
- Better data quality
Use cases for fine-tuning
Customer support bot – Train on your support tickets to match your tone and product knowledge.
Code assistant for your stack – Train on your codebase conventions, API patterns, and internal libraries.
Domain expert – Train on medical, legal, or financial documents to create a specialized assistant.
Content writer – Train on your published content to match your writing style.
Running your fine-tuned model
After exporting to GGUF and creating an Ollama model, use it like any other:
ollama run my-custom-gemma
It works with all the same tools: Continue.dev, Open WebUI, any OpenAI-compatible client. See our Ollama guide for integration options.
For serving to multiple users, see our Ollama vs llama.cpp vs vLLM comparison; vLLM handles LoRA adapters natively for production serving.
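For example, any OpenAI client can talk to the fine-tuned model through Ollama's built-in OpenAI-compatible endpoint (the default local URL is shown below; the API key is required by the client but ignored by Ollama):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="my-custom-gemma",
    messages=[{"role": "user", "content": "What's the refund policy for digital products?"}],
)
print(response.choices[0].message.content)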
Further reading
- Gemma 4 Family Guide – model specs and benchmarks
- How to Run Gemma 4 Locally – setup without fine-tuning
- Best GPU for AI Locally – hardware recommendations
- Self-Hosted AI vs API – when to run locally vs use cloud