(Well, almost.)
The Problem: Fine-Tuning is Absurdly Expensive¶
Modern language models are huge. And when you want to fine-tune them the “traditional” way, you have to update every single parameter.
Let’s look at the numbers:
| Model | Parameters | GPU Memory (FP32) |
|---|---|---|
| GPT-2 | 124M | ~500 MB |
| GPT-2 Large | 774M | ~3 GB |
| LLaMA 7B | 7B | ~28 GB |
| LLaMA 70B | 70B | ~280 GB |
That 70B model? You’d need a server rack just to load it into memory. And that’s before you start training, which requires storing gradients and optimizer states (typically 3-4x more memory).
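As a rough sanity check, here's a back-of-the-envelope sketch assuming FP32 weights and a standard Adam optimizer (one gradient plus two optimizer states per parameter), ignoring activations entirely:
# Rough training-memory estimate: FP32 weights + gradients + Adam's two moment
# buffers is about 16 bytes per parameter. Activations are ignored, so real
# numbers are higher still.
BYTES_PER_PARAM = 4 + 4 + 8
for name, n_params in [("GPT-2", 124e6), ("LLaMA 7B", 7e9), ("LLaMA 70B", 70e9)]:
    print(f"{name}: ~{n_params * BYTES_PER_PARAM / 1e9:,.0f} GB for weights + grads + optimizer")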
So what do we do? Give up on fine-tuning large models?
Not quite.
LoRA: The Key Insight¶
LoRA stands for Low-Rank Adaptation. The paper came out in 2021 and basically revolutionized how we fine-tune large models.
Here’s the core idea: when you fine-tune a model, you don’t actually need to change all the weights by large amounts. The model already knows language — you’re just steering it slightly. So the matrix of weight updates should be “low-rank.”
Wait, what does “low-rank” mean?¶
Think about a spreadsheet with 1,000 rows and 1,000 columns. That’s a million cells of data. But what if all the values in that spreadsheet could be generated from just a few simple rules? Like “multiply row number by column number” or something like that?
In linear algebra terms, that spreadsheet would be “low-rank” — it looks like a million numbers, but there’s actually much less information there. You could recreate it from a much smaller description.
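Here's a tiny sketch of that spreadsheet idea: a 1,000 × 1,000 matrix built by multiplying a column of row values with a row of column values. It looks like a million numbers, but its rank is 1, so 2,000 numbers describe it completely.
import torch
# Cell (i, j) = row_values[i] * col_values[j]: a million cells, but only
# 2,000 numbers of "real" information.
row_values = torch.arange(1, 1001, dtype=torch.float32)
col_values = torch.arange(1, 1001, dtype=torch.float32)
spreadsheet = torch.outer(row_values, col_values)
print(spreadsheet.shape)                      # torch.Size([1000, 1000])
print(torch.linalg.matrix_rank(spreadsheet))  # tensor(1)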
The LoRA Trick¶
Normally, when you update a weight matrix, you’d do this:

$$W_{\text{new}} = W + \Delta W$$

Where $\Delta W$ is the same size as $W$ (so, possibly millions of numbers).

LoRA says: “Let’s approximate $\Delta W$ as a product of two much smaller matrices”:

$$W_{\text{new}} = W + BA$$

where:
$W$ is the frozen (unchanging) pre-trained weights — shape: $d \times k$
$B$ is a tall skinny matrix — shape: $d \times r$
$A$ is a short wide matrix — shape: $r \times k$
$r$ is the rank — typically something tiny like 8 or 16
Instead of storing $d \times k$ new numbers (millions), we only store $r \times (d + k)$ numbers (thousands).
An Analogy¶
Imagine you’re editing a photograph. You could change every single pixel individually (full fine-tuning). Or you could apply a simple filter — like “make it 10% warmer” — which is defined by just a few parameters but affects the whole image (LoRA).
Both change the image, but one requires way less information to describe.
import torch
import torch.nn as nn
import math
class LoRALayer(nn.Module):
"""
A Low-Rank Adaptation layer.
This doesn't replace a linear layer — it sits *alongside* one,
adding a small trainable update.
"""
def __init__(
self,
in_features: int,
out_features: int,
rank: int = 8,
alpha: float = 16.0,
dropout: float = 0.1
):
super().__init__()
self.rank = rank
self.alpha = alpha
# This scaling factor controls how much the LoRA update affects the output
# alpha/rank is a common choice that keeps the update magnitude reasonable
self.scaling = alpha / rank
# The two low-rank matrices
# A is initialized with random values (Kaiming initialization)
# B is initialized to zeros (so initially, LoRA contributes nothing)
self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
self.dropout = nn.Dropout(dropout)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Compute the LoRA contribution: (B @ A) @ x * scaling
x has shape: (batch_size, sequence_length, in_features)
We're computing: x @ A^T @ B^T * scaling
"""
lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
return lora_out * self.scaling
# Let's see the parameter savings in action
in_features, out_features = 4096, 4096
rank = 8
# Full fine-tuning would update this many parameters:
full_params = in_features * out_features
# LoRA only needs this many:
lora_params = (in_features * rank) + (out_features * rank)
print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (rank={rank}): {lora_params:,} parameters")
print(f"Reduction: {full_params / lora_params:.1f}x fewer parameters")
print()
print(f"That's {100 * lora_params / full_params:.2f}% of the original size.")
print(f"(Or to put it another way: we're using {100 * (1 - lora_params / full_params):.1f}% less parameters!)")Full fine-tuning: 16,777,216 parameters
LoRA (rank=8): 65,536 parameters
Reduction: 256.0x fewer parameters
That's 0.39% of the original size.
(Or to put it another way: we're using 99.6% fewer parameters!)
Wrapping an Existing Layer¶
In practice, we don’t create standalone LoRA layers. We wrap existing layers from a pre-trained model.
The pattern is:
Take a frozen linear layer from the pre-trained model
Add a LoRA layer alongside it
During forward pass: add both outputs together
Let’s build that.
class LoRALinear(nn.Module):
"""
A linear layer with LoRA adaptation.
Think of this as a wrapper around a pre-trained linear layer.
The original layer is frozen, and we add a small LoRA "correction" on top.
"""
def __init__(
self,
original_layer: nn.Linear,
rank: int = 8,
alpha: float = 16.0,
dropout: float = 0.1
):
super().__init__()
# Store the original layer and freeze it
# (We're not training these weights — they stay as pre-trained)
self.original = original_layer
for param in self.original.parameters():
param.requires_grad = False
# Add the LoRA adapter
self.lora = LoRALayer(
in_features=original_layer.in_features,
out_features=original_layer.out_features,
rank=rank,
alpha=alpha,
dropout=dropout
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass: original output + LoRA output.
The original layer produces the "base" output.
The LoRA layer produces a small "correction."
We add them together.
"""
return self.original(x) + self.lora(x)
# Let's test it with a realistic size (GPT-2's hidden dimension)
original = nn.Linear(768, 768)
lora_layer = LoRALinear(original, rank=8)
# Count trainable vs. frozen parameters
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())
print(f"Trainable parameters: {trainable:,}")
print(f"Total parameters: {total:,}")
print(f"Trainable: {100 * trainable / total:.2f}%")
print()
print("So we're only updating ~2% of the layer's parameters!")
print("(The other 98% stay frozen at their pre-trained values.)")Trainable parameters: 12,288
Total parameters: 602,880
Trainable: 2.04%
So we're only updating ~2% of the layer's parameters!
(The other 98% stay frozen at their pre-trained values.)
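The same wrapping idea scales up to a whole network: walk its modules and swap each linear layer for the wrapped version. Here's a minimal sketch on a toy nn.Sequential (not a real pre-trained model):
# Wrap every nn.Linear in a toy model with our LoRALinear from above.
toy = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
for i, module in enumerate(toy):
    if isinstance(module, nn.Linear):
        toy[i] = LoRALinear(module, rank=8)
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters")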
Using PEFT Library (The Real-World Approach)¶
Okay, so we’ve built LoRA from scratch to understand how it works. But in practice, you’d never actually do that.
Instead, you’d use HuggingFace’s PEFT library (Parameter-Efficient Fine-Tuning). It handles all the low-level details and integrates seamlessly with transformers.
Let’s see how easy it is to add LoRA to a real model.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load GPT-2 (the small version, 124M parameters)
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
total_params = sum(p.numel() for p in model.parameters())
print(f"Base model: {model_name}")
print(f"Total parameters: {total_params:,}")
print()
print("Now let's add LoRA to this...")Base model: gpt2
Total parameters: 124,439,808
Now let's add LoRA to this...
from peft import LoraConfig, get_peft_model, TaskType
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM, # We're doing causal language modeling
r=8, # Rank: the size of the low-rank matrices
lora_alpha=16, # Scaling factor (typically 2*r)
lora_dropout=0.1, # Dropout applied to LoRA layers
target_modules=["c_attn", "c_proj"], # Which layers to adapt
fan_in_fan_out=True, # Required for GPT-2's Conv1D layers
bias="none" # Don't add LoRA to bias terms
)
# Apply LoRA to the model
# This wraps the specified layers with LoRA adapters
peft_model = get_peft_model(model, lora_config)
# Print the breakdown
peft_model.print_trainable_parameters()
print()
print("Wait, what are 'c_attn' and 'c_proj'?")
print()
print("Those are GPT-2's attention layers:")
print(" - c_attn: The combined Query/Key/Value projection")
print(" - c_proj: The output projection after attention")
print()
print("We're adding LoRA *only* to these layers, not the entire model.")
print("(Empirically, adapting attention layers gives the best results.)")trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475
Wait, what are 'c_attn' and 'c_proj'?
Those are GPT-2's attention layers:
- c_attn: The combined Query/Key/Value projection
- c_proj: The output projection after attention
We're adding LoRA *only* to these layers, not the entire model.
(Empirically, adapting attention layers gives the best results.)
Choosing LoRA Hyperparameters¶
LoRA has a few knobs you can tune. Here’s what they mean and how to choose them:
| Parameter | What it means | Typical Values | How to choose |
|---|---|---|---|
| r (rank) | Size of the low-rank matrices | 4, 8, 16, 32 | Start with 8. Increase if underfitting. |
| alpha | Scaling factor for LoRA output | 16, 32 | Usually 2×r. Higher = stronger adaptation. |
| dropout | Dropout on LoRA layers | 0.05-0.1 | Standard regularization. |
| target_modules | Which layers get LoRA | Attention layers | Q, K, V projections work best. |
The rank/alpha relationship¶
The rank controls capacity: higher rank = more expressive updates, but also more parameters.
The alpha controls magnitude: it scales how much the LoRA output affects the final result. The ratio alpha/r is what actually matters — it’s like a learning rate specifically for the LoRA part.
Think of it this way:
Low rank (r=4): “I only need a simple adjustment”
High rank (r=64): “I need to make complex changes to the model”
Most of the time, r=8 or r=16 is plenty. Remember: the pre-trained model already knows a lot. We’re just steering it, not retraining from scratch.
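To make the ratio concrete, here's a quick sketch of the effective scaling factor for a few (r, alpha) combinations. Keeping alpha = 2×r holds the scale constant as the rank changes:
# scaling = alpha / r is what actually multiplies the LoRA update.
for r, alpha in [(4, 8), (8, 16), (16, 32), (8, 32)]:
    print(f"r={r:<3} alpha={alpha:<3} -> scaling = {alpha / r}")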
Which layers to target?¶
For transformers, the attention layers matter most. Specifically:
Query, Key, Value projections (often combined as c_attn in GPT-2)
Output projection (c_proj)
You could add LoRA to the feedforward layers too, but empirically it doesn’t help much and just adds more parameters.
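If you did want to experiment with the feedforward layers, the only change is a longer target_modules list. A hedged sketch for GPT-2, assuming its standard module names (c_fc is the MLP up-projection; c_proj matches both the attention and MLP output projections, because PEFT matches names by suffix):
# Hypothetical config that also adapts GPT-2's MLP layers, not just attention.
ffn_lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj", "c_fc"],  # c_fc pulls in the feedforward layers
    fan_in_fan_out=True,
    bias="none"
)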
Merging LoRA Weights (For Inference)¶
Here’s a neat trick: after training, you can merge the LoRA weights back into the original model.
Remember, during training we compute:

$$h = Wx + \frac{\alpha}{r} BAx$$

But mathematically, this is the same as:

$$h = \left(W + \frac{\alpha}{r} BA\right)x$$

So we can just add $\frac{\alpha}{r} BA$ to the original weights once, and then we’re back to a normal model — no extra computation at inference time!
This is huge. You get the training benefits of LoRA (low memory, fast updates) AND the inference benefits of a regular model (no overhead).
Let’s see how that works in code:
def merge_lora_weights(original_weight, lora_A, lora_B, scaling):
"""
Merge LoRA weights into the original weight matrix.
W_merged = W_original + (B @ A) * scaling
After this, you can throw away the LoRA matrices and just use W_merged.
"""
# Compute the low-rank update
delta_W = (lora_B @ lora_A) * scaling
# Add it to the original weights
return original_weight + delta_W
# Example with realistic dimensions
d, k, r = 768, 768, 8
alpha = 16
scaling = alpha / r
# Simulate the weights
W_original = torch.randn(d, k) # Original pre-trained weights
lora_A = torch.randn(r, k) # LoRA matrix A (trained)
lora_B = torch.randn(d, r) # LoRA matrix B (trained)
# Merge them
W_merged = merge_lora_weights(W_original, lora_A, lora_B, scaling)
print(f"Original W shape: {W_original.shape}")
print(f"LoRA A shape: {lora_A.shape}")
print(f"LoRA B shape: {lora_B.shape}")
print(f"Merged W shape: {W_merged.shape}")
print()
print("Before merging:")
print(f" - Two separate matrix multiplies during inference")
print(f" - Slightly slower, but LoRA weights can be swapped out")
print()
print("After merging:")
print(f" - Single matrix multiply (same as original model)")
print(f" - Zero inference overhead!")
print(f" - Trade-off: can't easily swap LoRA adapters anymore")
print()
print("(In practice, you'd merge for production deployment.)")Original W shape: torch.Size([768, 768])
LoRA A shape: torch.Size([8, 768])
LoRA B shape: torch.Size([768, 8])
Merged W shape: torch.Size([768, 768])
Before merging:
- Two separate matrix multiplies during inference
- Slightly slower, but LoRA weights can be swapped out
After merging:
- Single matrix multiply (same as original model)
- Zero inference overhead!
- Trade-off: can't easily swap LoRA adapters anymore
(In practice, you'd merge for production deployment.)
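In practice, the PEFT library does this merge for you: merge_and_unload() folds the trained adapters into the base weights and returns a plain transformers model. A quick sketch (the save path is a placeholder):
# Fold the LoRA adapters into the base weights and drop the adapter wrappers.
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("gpt2-lora-merged")  # hypothetical output directory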
Why LoRA is Brilliant¶
Let’s recap why this technique took over the world:
1. Memory Efficient¶
You only need to store and update the small LoRA matrices. For a 7B parameter model with rank-16 LoRA, you might only train 0.1% of the parameters. That’s the difference between needing 8 GPUs and needing 1.
2. Fast Training¶
Fewer parameters = fewer gradients to compute = faster backward pass. Training can be 2-3x faster than full fine-tuning.
3. Modular¶
This is my favorite part. You can train multiple LoRA adapters for different tasks, all sharing the same base model. Want a model that can do customer support AND code generation? Train two LoRA adapters (a few MB each) instead of fine-tuning two full models (several GB each).
Then swap them in and out as needed. It’s like having a Swiss Army knife where you only store the tools, not duplicate copies of the handle.
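With PEFT, each adapter is just a small directory you can save and re-attach to the shared base model. A sketch of that workflow (the adapter paths are placeholders):
from peft import PeftModel
# Save only the adapter weights (a few MB), not the full base model.
peft_model.save_pretrained("adapters/customer-support")  # hypothetical path
# Later: load the shared base model once, then attach whichever adapter you need.
base = AutoModelForCausalLM.from_pretrained("gpt2")
support_model = PeftModel.from_pretrained(base, "adapters/customer-support")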
4. Preserves Base Model Knowledge¶
Because the pre-trained weights are frozen, you don’t suffer from “catastrophic forgetting” — where fine-tuning on task A makes the model worse at tasks B, C, and D.
5. Easy Deployment¶
Merge the weights for production, and you’re back to a standard model. No special serving infrastructure needed.
The only downside? LoRA is an approximation. There are some updates that can’t be represented as a low-rank matrix. But empirically, for most fine-tuning tasks, this constraint doesn’t hurt — and the benefits massively outweigh it.
What’s Next?¶
We’ve now covered supervised fine-tuning (SFT) with LoRA — teaching a model to follow a specific style or format by training on input-output examples.
But what if you want to teach a model to be helpful or harmless or creative? That’s harder to capture in input-output pairs.
Enter: Reward Modeling.
Instead of training on explicit examples, we train a separate model to predict human preferences. Then we use that reward model to guide the fine-tuning process (via reinforcement learning).
It’s a bit more complex, but it’s how models like ChatGPT and Claude get their “personality.”