LoRA: The Clever Trick That Makes Fine-Tuning Feasible

(Well, almost.)

The Problem: Fine-Tuning is Absurdly Expensive

Modern language models are huge. And when you want to fine-tune them the “traditional” way, you need to update every single parameter.

Let’s look at the numbers:

| Model       | Parameters | GPU Memory (FP32) |
|-------------|------------|-------------------|
| GPT-2       | 124M       | ~500 MB           |
| GPT-2 Large | 774M       | ~3 GB             |
| LLaMA 7B    | 7B         | ~28 GB            |
| LLaMA 70B   | 70B        | ~280 GB           |

That 70B model? You’d need a server rack just to load it into memory. And that’s before you start training, which requires storing gradients and optimizer states (typically 3-4x more memory).
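To make that concrete, here’s a rough back-of-envelope sketch (assuming FP32 weights and a standard Adam optimizer, and ignoring activation memory; the exact multiplier varies by setup):

def training_memory_gb(num_params: float) -> float:
    # FP32 weights (4 bytes) + gradients (4 bytes) + Adam's two moment
    # estimates (8 bytes) comes to roughly 16 bytes per parameter
    return num_params * 16 / 1e9

for name, n in [("GPT-2", 124e6), ("LLaMA 7B", 7e9), ("LLaMA 70B", 70e9)]:
    print(f"{name}: ~{training_memory_gb(n):,.0f} GB to train")

# Roughly: GPT-2 ~2 GB, LLaMA 7B ~112 GB, LLaMA 70B ~1,120 GB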

So what do we do? Give up on fine-tuning large models?

Not quite.

LoRA: The Key Insight

LoRA stands for Low-Rank Adaptation. The paper came out in 2021 and revolutionized how we fine-tune large models.

Here’s the core idea: when you fine-tune a model, you don’t actually need to change all the weights by large amounts. The model already knows language; you’re just steering it slightly. So the matrix of weight updates should be “low-rank.”

Wait, what does “low-rank” mean?

Think about a spreadsheet with 1,000 rows and 1,000 columns. That’s a million cells of data. But what if all the values in that spreadsheet could be generated from just a few simple rules? Like “multiply row number by column number” or something like that?

In linear algebra terms, that spreadsheet would be “low-rank”. It looks like a million numbers, but there’s actually much less information there. You could recreate it from a much smaller description.
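Here’s a quick sketch of that spreadsheet intuition (the specific numbers are just for illustration): a 1,000 × 1,000 matrix built from the rule “multiply row number by column number” has rank 1, so two vectors of 1,000 numbers each are enough to recreate all one million cells.

import torch

rows = torch.arange(1, 1001, dtype=torch.float32).unsqueeze(1)  # shape (1000, 1)
cols = torch.arange(1, 1001, dtype=torch.float32).unsqueeze(0)  # shape (1, 1000)

# One million cells, all generated by a single rule: row number * column number
spreadsheet = rows @ cols  # shape (1000, 1000)

print(f"Cells: {spreadsheet.numel():,}")                                  # 1,000,000
print(f"Rank: {torch.linalg.matrix_rank(spreadsheet).item()}")            # 1
print(f"Numbers needed to recreate it: {rows.numel() + cols.numel():,}")  # 2,000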

The LoRA Trick

Normally, when you update a weight matrix, you’d do this:

$$W_{\text{new}} = W_{\text{original}} + \Delta W$$

Where $\Delta W$ is the same size as $W$ (so, possibly millions of numbers).

LoRA says: “Let’s approximate $\Delta W$ as a product of two much smaller matrices”:

$$W_{\text{new}} = W_{\text{original}} + B \times A$$

where:

  • $W_{\text{original}}$ is the frozen (unchanging) pre-trained weights; shape: $d \times k$

  • $B$ is a tall, skinny matrix; shape: $d \times r$

  • $A$ is a short, wide matrix; shape: $r \times k$

  • $r$ is the rank: typically something tiny like 8 or 16

Instead of storing $d \times k$ new numbers (millions), we only store $(d \times r) + (r \times k)$ numbers (thousands).

An Analogy

Imagine you’re editing a photograph. You could change every single pixel individually (full fine-tuning). Or you could apply a simple filter, like “make it 10% warmer,” which is defined by just a few parameters but affects the whole image (LoRA).

Both change the image, but one requires way less information to describe.

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    """
    A Low-Rank Adaptation layer.
    
    This doesn't replace a linear layer. It sits *alongside* one,
    adding a small trainable update.
    """
    
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        
        # This scaling factor controls how much the LoRA update affects the output
        # alpha/rank is a common choice that keeps the update magnitude reasonable
        self.scaling = alpha / rank
        
        # The two low-rank matrices
        # A is initialized with random values (Kaiming initialization)
        # B is initialized to zeros (so initially, LoRA contributes nothing)
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Compute the LoRA contribution: (B @ A) @ x * scaling
        
        x has shape: (batch_size, sequence_length, in_features)
        We're computing: x @ A^T @ B^T * scaling
        """
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return lora_out * self.scaling


# Let's see the parameter savings in action
in_features, out_features = 4096, 4096
rank = 8

# Full fine-tuning would update this many parameters:
full_params = in_features * out_features

# LoRA only needs this many:
lora_params = (in_features * rank) + (out_features * rank)

print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (rank={rank}): {lora_params:,} parameters")
print(f"Reduction: {full_params / lora_params:.1f}x fewer parameters")
print()
print(f"That's {100 * lora_params / full_params:.2f}% of the original size.")
print(f"(Or to put it another way: we're using {100 * (1 - lora_params / full_params):.1f}% less parameters!)")
Full fine-tuning: 16,777,216 parameters
LoRA (rank=8): 65,536 parameters
Reduction: 256.0x fewer parameters

That's 0.39% of the original size.
(Or to put it another way: we're using 99.6% fewer parameters!)

Wrapping an Existing Layer

In practice, we don’t create standalone LoRA layers. We wrap existing layers from a pre-trained model.

The pattern is:

  1. Take a frozen linear layer from the pre-trained model

  2. Add a LoRA layer alongside it

  3. During forward pass: add both outputs together

Let’s build that.

class LoRALinear(nn.Module):
    """
    A linear layer with LoRA adaptation.
    
    Think of this as a wrapper around a pre-trained linear layer.
    The original layer is frozen, and we add a small LoRA "correction" on top.
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        
        # Store the original layer and freeze it
        # (We're not training these weights: they stay as pre-trained)
        self.original = original_layer
        for param in self.original.parameters():
            param.requires_grad = False
        
        # Add the LoRA adapter
        self.lora = LoRALayer(
            in_features=original_layer.in_features,
            out_features=original_layer.out_features,
            rank=rank,
            alpha=alpha,
            dropout=dropout
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: original output + LoRA output.
        
        The original layer produces the "base" output.
        The LoRA layer produces a small "correction."
        We add them together.
        """
        return self.original(x) + self.lora(x)


# Let's test it with a realistic size (GPT-2's hidden dimension)
original = nn.Linear(768, 768)
lora_layer = LoRALinear(original, rank=8)

# Count trainable vs. frozen parameters
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())

print(f"Trainable parameters: {trainable:,}")
print(f"Total parameters: {total:,}")
print(f"Trainable: {100 * trainable / total:.2f}%")
print()
print("So we're only updating ~2% of the layer's parameters!")
print("(The other 98% stay frozen at their pre-trained values.)")
Trainable parameters: 12,288
Total parameters: 602,880
Trainable: 2.04%

So we're only updating ~2% of the layer's parameters!
(The other 98% stay frozen at their pre-trained values.)

Using PEFT Library (The Real-World Approach)

Okay, so we’ve built LoRA from scratch to understand how it works. But in practice, you’d never actually do that.

Instead, you’d use HuggingFace’s PEFT library (Parameter-Efficient Fine-Tuning). It handles all the low-level details and integrates seamlessly with transformers.

Let’s see how easy it is to add LoRA to a real model.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load GPT-2 (the small version, 124M parameters)
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

total_params = sum(p.numel() for p in model.parameters())
print(f"Base model: {model_name}")
print(f"Total parameters: {total_params:,}")
print()
print("Now let's add LoRA to this...")
Base model: gpt2
Total parameters: 124,439,808

Now let's add LoRA to this...
from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # We're doing causal language modeling
    r=8,                            # Rank: the size of the low-rank matrices
    lora_alpha=16,                  # Scaling factor (typically 2*r)
    lora_dropout=0.1,               # Dropout applied to LoRA layers
    target_modules=["c_attn", "c_proj"],  # Which layers to adapt
    fan_in_fan_out=True,            # Required for GPT-2's Conv1D layers
    bias="none"                     # Don't add LoRA to bias terms
)

# Apply LoRA to the model
# This wraps the specified layers with LoRA adapters
peft_model = get_peft_model(model, lora_config)

# Print the breakdown
peft_model.print_trainable_parameters()

print()
print("Wait, what are 'c_attn' and 'c_proj'?")
print()
print("Those are GPT-2's attention layers:")
print("  - c_attn: The combined Query/Key/Value projection")
print("  - c_proj: The output projection after attention")
print()
print("We're adding LoRA *only* to these layers, not the entire model.")
print("(Empirically, adapting attention layers gives the best results.)")
trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475

Wait, what are 'c_attn' and 'c_proj'?

Those are GPT-2's attention layers:
  - c_attn: The combined Query/Key/Value projection
  - c_proj: The output projection after attention

We're adding LoRA *only* to these layers, not the entire model.
(Empirically, adapting attention layers gives the best results.)

Choosing LoRA Hyperparameters

LoRA has a few knobs you can tune. Here’s what they mean and how to choose them:

| Parameter      | What it means                  | Typical Values   | How to choose                              |
|----------------|--------------------------------|------------------|--------------------------------------------|
| r (rank)       | Size of the low-rank matrices  | 4, 8, 16, 32     | Start with 8. Increase if underfitting.    |
| alpha          | Scaling factor for LoRA output | 16, 32           | Usually 2×r. Higher = stronger adaptation. |
| dropout        | Dropout on LoRA layers         | 0.05-0.1         | Standard regularization.                   |
| target_modules | Which layers get LoRA          | Attention layers | Q, K, V projections work best.             |

The rank/alpha relationship

The rank controls capacity: higher rank = more expressive updates, but also more parameters.

The alpha controls magnitude: it scales how much the LoRA output affects the final result. The ratio alpha/r is what actually matters; it’s like a learning rate specifically for the LoRA part.

Think of it this way:

  • Low rank (r=4): “I only need a simple adjustment”

  • High rank (r=64): “I need to make complex changes to the model”

Most of the time, r=8 or r=16 is plenty. Remember: the pre-trained model already knows a lot. We’re just steering it, not retraining from scratch.
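To make the alpha/r relationship concrete, here’s a tiny sketch using the same scaling rule as the LoRALayer above (the specific pairs are just examples):

# The effective strength of the LoRA update is alpha / r:
# doubling the rank without doubling alpha halves the scaling.
for r, alpha in [(8, 16), (16, 16), (16, 32), (64, 32)]:
    print(f"r={r:>2}, alpha={alpha:>2} -> scaling = {alpha / r:.2f}")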

Which layers to target?

For transformers, the attention layers matter most. Specifically:

  • Query, Key, Value projections (often combined as c_attn in GPT-2)

  • Output projection (c_proj)

You could add LoRA to the feedforward layers too, but empirically it doesn’t help much and just adds more parameters.
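If you want to test that trade-off yourself, it’s a one-line change to the PEFT config. A sketch for GPT-2, adding "c_fc" (the MLP’s up-projection in the transformers implementation) to the target modules:

from peft import LoraConfig, TaskType

# Same configuration as before, but also adapting GPT-2's feedforward layers
wider_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj", "c_fc"],  # attention + MLP
    fan_in_fan_out=True,
    bias="none",
)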

Merging LoRA Weights (For Inference)

Here’s a neat trick: after training, you can merge the LoRA weights back into the original model.

Remember, during training we compute:

$$\text{output} = W_{\text{original}} \cdot x + (B \times A) \cdot x$$

But mathematically, this is the same as:

$$\text{output} = (W_{\text{original}} + B \times A) \cdot x$$

So we can just add $B \times A$ to the original weights once, and then we’re back to a normal model: no extra computation at inference time!

This is huge. You get the training benefits of LoRA (low memory, fast updates) AND the inference benefits of a regular model (no overhead).

Let’s see how that works in code:

def merge_lora_weights(original_weight, lora_A, lora_B, scaling):
    """
    Merge LoRA weights into the original weight matrix.
    
    W_merged = W_original + (B @ A) * scaling
    
    After this, you can throw away the LoRA matrices and just use W_merged.
    """
    # Compute the low-rank update
    delta_W = (lora_B @ lora_A) * scaling
    
    # Add it to the original weights
    return original_weight + delta_W


# Example with realistic dimensions
d, k, r = 768, 768, 8
alpha = 16
scaling = alpha / r

# Simulate the weights
W_original = torch.randn(d, k)  # Original pre-trained weights
lora_A = torch.randn(r, k)      # LoRA matrix A (trained)
lora_B = torch.randn(d, r)      # LoRA matrix B (trained)

# Merge them
W_merged = merge_lora_weights(W_original, lora_A, lora_B, scaling)

print(f"Original W shape: {W_original.shape}")
print(f"LoRA A shape: {lora_A.shape}")
print(f"LoRA B shape: {lora_B.shape}")
print(f"Merged W shape: {W_merged.shape}")
print()
print("Before merging:")
print(f"  - Two separate matrix multiplies during inference")
print(f"  - Slightly slower, but LoRA weights can be swapped out")
print()
print("After merging:")
print(f"  - Single matrix multiply (same as original model)")
print(f"  - Zero inference overhead!")
print(f"  - Trade-off: can't easily swap LoRA adapters anymore")
print()
print("(In practice, you'd merge for production deployment.)")
Original W shape: torch.Size([768, 768])
LoRA A shape: torch.Size([8, 768])
LoRA B shape: torch.Size([768, 8])
Merged W shape: torch.Size([768, 768])

Before merging:
  - Two separate matrix multiplies during inference
  - Slightly slower, but LoRA weights can be swapped out

After merging:
  - Single matrix multiply (same as original model)
  - Zero inference overhead!
  - Trade-off: can't easily swap LoRA adapters anymore

(In practice, you'd merge for production deployment.)

Why LoRA Works So Well

Let’s recap why this technique has become the standard:

1. Memory Efficient

You only need to store and update the small LoRA matrices. For a 7B parameter model with rank-16 LoRA, you might only train 0.1% of the parameters. That’s the difference between needing 8 GPUs and needing 1.

2. Fast Training

Fewer parameters = fewer gradients to compute = faster backward pass. Training can be 2-3x faster than full fine-tuning.

3. Modular

You can train multiple LoRA adapters for different tasks, all sharing the same base model. Want a model that can do customer support AND code generation? Train two LoRA adapters (a few MB each) instead of fine-tuning two full models (several GB each).

Then swap them in and out as needed. It’s like having a Swiss Army knife where you only store the tools, not duplicate copies of the handle.
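With PEFT, an adapter is just the small LoRA matrices plus a config file, so swapping looks roughly like this (the directory name here is made up for illustration):

# Save only the adapter weights (a few MB), not the full base model
peft_model.save_pretrained("adapters/customer-support")

# Later, e.g. in a separate process: attach the trained adapter to the base model
from peft import PeftModel
support_model = PeftModel.from_pretrained(model, "adapters/customer-support")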

4. Preserves Base Model Knowledge

Because the pre-trained weights are frozen, you don’t suffer from “catastrophic forgetting,” where fine-tuning on task A makes the model worse at tasks B, C, and D.

5. Easy Deployment

Merge the weights for production, and you’re back to a standard model. No special serving infrastructure needed.
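With PEFT, merging is a one-liner; a sketch (the output path is made up):

# Fold the LoRA matrices into the base weights and drop the adapter wrappers
merged_model = peft_model.merge_and_unload()

# merged_model is now a plain transformers model: save or serve it as usual
merged_model.save_pretrained("gpt2-lora-merged")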


The only downside? LoRA is an approximation. There are some updates that can’t be represented as a low-rank matrix. But empirically, for most fine-tuning tasks, this constraint doesn’t hurt. And the benefits massively outweigh it.

What’s Next?

We’ve now covered supervised fine-tuning (SFT) with LoRA: teaching a model to follow a specific style or format by training on input-output examples.

But what if you want to teach a model to be helpful or harmless or creative? That’s harder to capture in input-output pairs.

Enter: Reward Modeling.

Instead of training on explicit examples, we train a separate model to predict human preferences. Then we use that reward model to guide the fine-tuning process (via reinforcement learning).

It’s a bit more complex, but it’s how models like ChatGPT and Claude get their “personality.”