
How to Fine-Tune Gemma 4 with LoRA: Step-by-Step Guide (2026)


Gemma 4 is one of the best open models to fine-tune. The Apache 2.0 license is permissive, the architecture is well supported by every major fine-tuning tool, and the base model is strong enough that even small datasets produce good results.

This guide walks through fine-tuning Gemma 4 26B with LoRA using Unsloth (fastest) and Hugging Face TRL (most flexible).

What is LoRA fine-tuning?

LoRA (Low-Rank Adaptation) trains a small set of adapter weights instead of modifying the entire model. Benefits:

  • 10-100x less memory than full fine-tuning
  • Trains in minutes to hours instead of days
  • Preserves base model knowledge: you add capabilities rather than replacing them
  • Small output: adapter files are typically 10-100 MB

The base Gemma 4 model stays unchanged. Your LoRA adapter sits on top and modifies its behavior for your specific use case.
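To make the mechanics concrete, here is a minimal, illustrative PyTorch sketch of a LoRA-style linear layer. It shows the concept only; it is not how peft or Unsloth implement it internally.

import torch

d, r, alpha = 4096, 16, 32        # hidden size, LoRA rank, scaling factor

W = torch.randn(d, d)             # frozen base weight: never updated
A = torch.randn(r, d) * 0.01      # trainable rank-r "down" projection
B = torch.zeros(d, r)             # trainable "up" projection, zero-initialized
                                  # so training starts from the unmodified base

def lora_linear(x):
    # base path plus a scaled low-rank update; only A and B get gradients
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

Only A and B (2 x d x r values) are trained instead of the full d x d matrix, which is where the memory and disk savings come from.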

Hardware requirements

Model         Method          VRAM needed   Training time (1K examples)
Gemma 4 E4B   LoRA            8 GB          ~10 min
Gemma 4 26B   QLoRA (4-bit)   12 GB         ~30 min
Gemma 4 26B   LoRA (16-bit)   48 GB         ~20 min
Gemma 4 31B   QLoRA (4-bit)   16 GB         ~45 min

QLoRA (Quantized LoRA) loads the base model in 4-bit precision, making it possible to fine-tune large models on consumer GPUs. Quality is nearly identical to full-precision LoRA.

An RTX 3090 (24 GB) or RTX 4090 (24 GB) handles Gemma 4 26B QLoRA comfortably. For the best GPU options, see our buying guide.
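Not sure what your GPU reports? A quick check with PyTorch:

import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")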

Method 1: Unsloth (fastest)

Unsloth trains 2-5x faster than standard fine-tuning and uses about 60% less memory. It's the recommended approach.

Install

pip install unsloth

Prepare your data

Format your training data as conversations:

dataset = [
    {
        "conversations": [
            {"role": "user", "content": "What's the refund policy for digital products?"},
            {"role": "assistant", "content": "Digital products can be refunded within 14 days if unused. Once a license key is activated, refunds are handled case-by-case. Contact support@example.com with your order number."}
        ]
    },
    # ... more examples
]

Save as JSONL:

import json
with open("training_data.jsonl", "w") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")

How much data do you need?

  • 50-100 examples: Teaches a specific style or format
  • 500-1000 examples: Teaches domain knowledge
  • 5000+ examples: Significant behavior changes

Fine-tune

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-26b",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank (higher = more capacity, more memory)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)

# Load dataset and render each conversation with the model's chat template
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)
})

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        output_dir="./gemma4-lora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_steps=100,
        fp16=True,
    ),
    tokenizer=tokenizer,
    max_seq_length=2048,
)

trainer.train()

# Save the adapter
model.save_pretrained("./gemma4-lora")
tokenizer.save_pretrained("./gemma4-lora")
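To reuse the adapter in a later session, Unsloth can load it straight from the saved directory (it resolves the base model from the adapter's config; exact behavior can vary across Unsloth versions):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma4-lora",   # the adapter directory saved above
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to inference-optimized mode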

Export to Ollama

# In Python, after training: merge the adapter and export to GGUF
model.save_pretrained_gguf("gemma4-custom", tokenizer, quantization_method="q4_k_m")

Then register the generated file with Ollama (the exact .gguf filename inside gemma4-custom/ depends on your Unsloth version):

echo 'FROM ./gemma4-custom/unsloth.Q4_K_M.gguf' > Modelfile
ollama create my-custom-gemma -f Modelfile
ollama run my-custom-gemma

Method 2: Hugging Face TRL (most flexible)

For more control over the training process:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26b",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26b")

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)  # stabilize 4-bit training (fp32 norms, gradient checkpointing)
model = get_peft_model(model, lora_config)

# Train (same as above)
# ...
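As a sanity check that only the adapter is trainable, peft's print_trainable_parameters() reports the counts:

model.print_trainable_parameters()
# prints trainable vs. total parameters; with r=16 on the attention
# projections, the trainable share should be a fraction of a percent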

Fine-tuning tips

Pick the right base model

  • Gemma 4 E4B: fastest to train; good for simple tasks (classification, formatting)
  • Gemma 4 26B: best balance; recommended for most use cases
  • Gemma 4 31B: highest-quality base; use it if you have the VRAM

Data quality > quantity

100 high-quality, diverse examples beat 10,000 noisy ones; a quick validation pass (sketched after this list) can catch structural problems early. Each example should:

  • Be a realistic input/output pair
  • Demonstrate the exact behavior you want
  • Cover edge cases and variations
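A minimal validation pass over the JSONL file from earlier catches structural problems before you spend GPU time. This is a sketch; adapt the checks to your own schema:

import json

seen = set()
with open("training_data.jsonl") as f:
    for i, line in enumerate(f, 1):
        convo = json.loads(line)["conversations"]
        roles = [turn["role"] for turn in convo]
        # every example should end with an assistant reply
        assert roles[-1] == "assistant", f"line {i}: no assistant reply"
        # no empty messages
        assert all(turn["content"].strip() for turn in convo), f"line {i}: empty message"
        # exact duplicates skew training
        key = json.dumps(convo, sort_keys=True)
        assert key not in seen, f"line {i}: duplicate example"
        seen.add(key)

print("all examples passed")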

Hyperparameter starting points

Parameter       Recommended   Notes
LoRA rank (r)   16            Increase to 32-64 for complex tasks
Learning rate   2e-4          Lower (1e-4) for larger datasets
Epochs          3             More for small datasets, fewer for large
Batch size      2-4           Limited by VRAM

Evaluate before deploying

# Quick evaluation: prompt through the same chat template used in training
model.eval()
messages = [{"role": "user", "content": "Your test prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Compare outputs against your expected results. If quality is poor, try:

  • More training data
  • Higher LoRA rank
  • More epochs
  • Better data quality
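One telling comparison is the same prompt with and without the adapter. peft's disable_adapter() context manager temporarily bypasses the LoRA weights (this assumes model is the PeftModel from the training code above):

tuned = model.generate(inputs, max_new_tokens=200)
with model.disable_adapter():
    base = model.generate(inputs, max_new_tokens=200)

print("base: ", tokenizer.decode(base[0], skip_special_tokens=True))
print("tuned:", tokenizer.decode(tuned[0], skip_special_tokens=True))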

Use cases for fine-tuning

Customer support bot: train on your support tickets to match your tone and product knowledge.

Code assistant for your stack: train on your codebase conventions, API patterns, and internal libraries.

Domain expert: train on medical, legal, or financial documents to create a specialized assistant.

Content writer: train on your published content to match your writing style.

Running your fine-tuned model

After exporting to GGUF and creating an Ollama model, use it like any other:

ollama run my-custom-gemma

It works with all the same tools: Continue.dev, Open WebUI, and any OpenAI-compatible client. See our Ollama guide for integration options.
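For example, through Ollama's OpenAI-compatible endpoint (the api_key value is ignored by Ollama but required by the client library):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="my-custom-gemma",
    messages=[{"role": "user", "content": "What's the refund policy for digital products?"}],
)
print(resp.choices[0].message.content)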

For serving multiple users, see our Ollama vs llama.cpp vs vLLM comparison; vLLM handles LoRA adapters natively for production serving.
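As a sketch, vLLM can attach your adapter at startup and expose it under its own model name (flag names current as of recent vLLM releases; check vllm serve --help):

vllm serve google/gemma-4-26b \
  --enable-lora \
  --lora-modules my-custom-gemma=./gemma4-lora

Clients then request the model "my-custom-gemma" through the usual OpenAI-compatible API.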

Further reading