LoRA: The Clever Trick That Makes Fine-Tuning Feasible

(Well, almost.)

The Problem: Fine-Tuning is Absurdly Expensive

Modern language models are huge. And when you want to fine-tune them the “traditional” way, you need to update every single parameter.

Let’s look at the numbers:

| Model       | Parameters | GPU Memory (FP32) |
|-------------|------------|-------------------|
| GPT-2       | 124M       | ~500 MB           |
| GPT-2 Large | 774M       | ~3 GB             |
| LLaMA 7B    | 7B         | ~28 GB            |
| LLaMA 70B   | 70B        | ~280 GB           |

That 70B model? You’d need a server rack just to load it into memory. And that’s before you start training, which requires storing gradients and optimizer states (typically 3-4x more memory).
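To make that concrete, here’s a rough back-of-envelope sketch (assuming FP32 weights and a standard Adam optimizer, and ignoring activation memory; the exact multiplier varies by setup):

def training_memory_gb(num_params: float) -> float:
    # FP32 weights (4 bytes) + gradients (4 bytes) + Adam's two moment
    # estimates (8 bytes) comes to roughly 16 bytes per parameter
    return num_params * 16 / 1e9

for name, n in [("GPT-2", 124e6), ("LLaMA 7B", 7e9), ("LLaMA 70B", 70e9)]:
    print(f"{name}: ~{training_memory_gb(n):,.0f} GB to train")

# Roughly: GPT-2 ~2 GB, LLaMA 7B ~112 GB, LLaMA 70B ~1,120 GB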

So what do we do? Give up on fine-tuning large models?

Not quite.

LoRA: The Key Insight

LoRA stands for Low-Rank Adaptation. The paper came out in 2021 and revolutionized how we fine-tune large models.

Here’s the core idea: when you fine-tune a model, you don’t actually need to change all the weights by large amounts. The model already knows language; you’re just steering it slightly. So the matrix of weight updates should be “low-rank.”

Wait, what does “low-rank” mean?

Think about a spreadsheet with 1,000 rows and 1,000 columns. That’s a million cells of data. But what if all the values in that spreadsheet could be generated from just a few simple rules? Like “multiply row number by column number” or something like that?

In linear algebra terms, that spreadsheet would be “low-rank”. It looks like a million numbers, but there’s actually much less information there. You could recreate it from a much smaller description.
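Here’s a quick sketch of that spreadsheet intuition (the specific numbers are just for illustration): a 1,000 × 1,000 matrix built from the rule “multiply row number by column number” has rank 1, so two vectors of 1,000 numbers each are enough to recreate all one million cells.

import torch

rows = torch.arange(1, 1001, dtype=torch.float32).unsqueeze(1)  # shape (1000, 1)
cols = torch.arange(1, 1001, dtype=torch.float32).unsqueeze(0)  # shape (1, 1000)

# One million cells, all generated by a single rule: row number * column number
spreadsheet = rows @ cols  # shape (1000, 1000)

print(f"Cells: {spreadsheet.numel():,}")                                  # 1,000,000
print(f"Rank: {torch.linalg.matrix_rank(spreadsheet).item()}")            # 1
print(f"Numbers needed to recreate it: {rows.numel() + cols.numel():,}")  # 2,000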

The LoRA Trick

Normally, when you update a weight matrix, you’d do this:

$$W_{\text{new}} = W_{\text{original}} + \Delta W$$

Where $\Delta W$ is the same size as $W$ (so, possibly millions of numbers).

LoRA says: “Let’s approximate $\Delta W$ as a product of two much smaller matrices”:

$$W_{\text{new}} = W_{\text{original}} + B \times A$$

where:

  • $W_{\text{original}}$ is the frozen (unchanging) pre-trained weights; shape: $d \times k$

  • $B$ is a tall, skinny matrix; shape: $d \times r$

  • $A$ is a short, wide matrix; shape: $r \times k$

  • $r$ is the rank: typically something tiny like 8 or 16

Instead of storing $d \times k$ new numbers (millions), we only store $(d \times r) + (r \times k)$ numbers (thousands).

An Analogy

Imagine you’re editing a photograph. You could change every single pixel individually (full fine-tuning). Or you could apply a simple filter, like “make it 10% warmer,” which is defined by just a few parameters but affects the whole image (LoRA).

Both change the image, but one requires way less information to describe.

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    """
    A Low-Rank Adaptation layer.
    
    This doesn't replace a linear layer. It sits *alongside* one,
    adding a small trainable update.
    """
    
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        
        # This scaling factor controls how much the LoRA update affects the output
        # alpha/rank is a common choice that keeps the update magnitude reasonable
        self.scaling = alpha / rank
        
        # The two low-rank matrices
        # A is initialized with random values (Kaiming initialization)
        # B is initialized to zeros (so initially, LoRA contributes nothing)
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Compute the LoRA contribution: (B @ A) @ x * scaling
        
        x has shape: (batch_size, sequence_length, in_features)
        We're computing: x @ A^T @ B^T * scaling
        """
        lora_out = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return lora_out * self.scaling


# Let's see the parameter savings in action
in_features, out_features = 4096, 4096
rank = 8

# Full fine-tuning would update this many parameters:
full_params = in_features * out_features

# LoRA only needs this many:
lora_params = (in_features * rank) + (out_features * rank)

print(f"Full fine-tuning: {full_params:,} parameters")
print(f"LoRA (rank={rank}): {lora_params:,} parameters")
print(f"Reduction: {full_params / lora_params:.1f}x fewer parameters")
print()
print(f"That's {100 * lora_params / full_params:.2f}% of the original size.")
print(f"(Or to put it another way: we're using {100 * (1 - lora_params / full_params):.1f}% less parameters!)")
Full fine-tuning: 16,777,216 parameters
LoRA (rank=8): 65,536 parameters
Reduction: 256.0x fewer parameters

That's 0.39% of the original size.
(Or to put it another way: we're using 99.6% fewer parameters!)

Wrapping an Existing Layer

In practice, we don’t create standalone LoRA layers. We wrap existing layers from a pre-trained model.

The pattern is:

  1. Take a frozen linear layer from the pre-trained model

  2. Add a LoRA layer alongside it

  3. During forward pass: add both outputs together

Let’s build that.

class LoRALinear(nn.Module):
    """
    A linear layer with LoRA adaptation.
    
    Think of this as a wrapper around a pre-trained linear layer.
    The original layer is frozen, and we add a small LoRA "correction" on top.
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.1
    ):
        super().__init__()
        
        # Store the original layer and freeze it
        # (We're not training these weights: they stay as pre-trained)
        self.original = original_layer
        for param in self.original.parameters():
            param.requires_grad = False
        
        # Add the LoRA adapter
        self.lora = LoRALayer(
            in_features=original_layer.in_features,
            out_features=original_layer.out_features,
            rank=rank,
            alpha=alpha,
            dropout=dropout
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: original output + LoRA output.
        
        The original layer produces the "base" output.
        The LoRA layer produces a small "correction."
        We add them together.
        """
        return self.original(x) + self.lora(x)


# Let's test it with a realistic size (GPT-2's hidden dimension)
original = nn.Linear(768, 768)
lora_layer = LoRALinear(original, rank=8)

# Count trainable vs. frozen parameters
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_layer.parameters())

print(f"Trainable parameters: {trainable:,}")
print(f"Total parameters: {total:,}")
print(f"Trainable: {100 * trainable / total:.2f}%")
print()
print("So we're only updating ~2% of the layer's parameters!")
print("(The other 98% stay frozen at their pre-trained values.)")
Trainable parameters: 12,288
Total parameters: 602,880
Trainable: 2.04%

So we're only updating ~2% of the layer's parameters!
(The other 98% stay frozen at their pre-trained values.)

Using PEFT Library (The Real-World Approach)

Okay, so we’ve built LoRA from scratch to understand how it works. But in practice, you’d never actually do that.

Instead, you’d use HuggingFace’s PEFT library (Parameter-Efficient Fine-Tuning). It handles all the low-level details and integrates seamlessly with transformers.

Let’s see how easy it is to add LoRA to a real model.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load GPT-2 (the small version, 124M parameters)
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

total_params = sum(p.numel() for p in model.parameters())
print(f"Base model: {model_name}")
print(f"Total parameters: {total_params:,}")
print()
print("Now let's add LoRA to this...")
Base model: gpt2
Total parameters: 124,439,808

Now let's add LoRA to this...
from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # We're doing causal language modeling
    r=8,                            # Rank: the size of the low-rank matrices
    lora_alpha=16,                  # Scaling factor (typically 2*r)
    lora_dropout=0.1,               # Dropout applied to LoRA layers
    target_modules=["c_attn", "c_proj"],  # Which layers to adapt
    fan_in_fan_out=True,            # Required for GPT-2's Conv1D layers
    bias="none"                     # Don't add LoRA to bias terms
)

# Apply LoRA to the model
# This wraps the specified layers with LoRA adapters
peft_model = get_peft_model(model, lora_config)

# Print the breakdown
peft_model.print_trainable_parameters()

print()
print("Wait, what are 'c_attn' and 'c_proj'?")
print()
print("Those are GPT-2's attention layers:")
print("  - c_attn: The combined Query/Key/Value projection")
print("  - c_proj: The output projection after attention")
print()
print("We're adding LoRA *only* to these layers, not the entire model.")
print("(Empirically, adapting attention layers gives the best results.)")
trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475

Wait, what are 'c_attn' and 'c_proj'?

Those are GPT-2's attention layers:
  - c_attn: The combined Query/Key/Value projection
  - c_proj: The output projection after attention

We're adding LoRA *only* to these layers, not the entire model.
(Empirically, adapting attention layers gives the best results.)

Choosing LoRA Hyperparameters

LoRA has a few knobs you can tune. Here’s what they mean and how to choose them:

| Parameter      | What it means                  | Typical Values   | How to choose                              |
|----------------|--------------------------------|------------------|--------------------------------------------|
| r (rank)       | Size of the low-rank matrices  | 4, 8, 16, 32     | Start with 8. Increase if underfitting.    |
| alpha          | Scaling factor for LoRA output | 16, 32           | Usually 2×r. Higher = stronger adaptation. |
| dropout        | Dropout on LoRA layers         | 0.05-0.1         | Standard regularization.                   |
| target_modules | Which layers get LoRA          | Attention layers | Q, K, V projections work best.             |

The rank/alpha relationship

The rank controls capacity: higher rank = more expressive updates, but also more parameters.

The alpha controls magnitude: it scales how much the LoRA output affects the final result. The ratio alpha/r is what actually matters; it’s like a learning rate specifically for the LoRA part.

Think of it this way:

  • Low rank (r=4): “I only need a simple adjustment”

  • High rank (r=64): “I need to make complex changes to the model”

Most of the time, r=8 or r=16 is plenty. Remember: the pre-trained model already knows a lot. We’re just steering it, not retraining from scratch.
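To make the alpha/r relationship concrete, here’s a tiny sketch using the same scaling rule as the LoRALayer above (the specific pairs are just examples):

# The effective strength of the LoRA update is alpha / r:
# doubling the rank without doubling alpha halves the scaling.
for r, alpha in [(8, 16), (16, 16), (16, 32), (64, 32)]:
    print(f"r={r:>2}, alpha={alpha:>2} -> scaling = {alpha / r:.2f}")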

Which layers to target?

For transformers, the attention layers matter most. Specifically:

  • Query, Key, Value projections (often combined as c_attn in GPT-2)

  • Output projection (c_proj)

You could add LoRA to the feedforward layers too, but empirically it doesn’t help much and just adds more parameters.
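If you want to test that trade-off yourself, it’s a one-line change to the PEFT config. A sketch for GPT-2, adding "c_fc" (the MLP’s up-projection in the transformers implementation) to the target modules:

from peft import LoraConfig, TaskType

# Same configuration as before, but also adapting GPT-2's feedforward layers
wider_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj", "c_fc"],  # attention + MLP
    fan_in_fan_out=True,
    bias="none",
)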

Merging LoRA Weights (For Inference)

Here’s a neat trick: after training, you can merge the LoRA weights back into the original model.

Remember, during training we compute:

$$\text{output} = W_{\text{original}} \cdot x + (B \times A) \cdot x$$

But mathematically, this is the same as:

$$\text{output} = (W_{\text{original}} + B \times A) \cdot x$$

So we can just add $B \times A$ to the original weights once, and then we’re back to a normal model: no extra computation at inference time!

This is huge. You get the training benefits of LoRA (low memory, fast updates) AND the inference benefits of a regular model (no overhead).

Let’s see how that works in code:

def merge_lora_weights(original_weight, lora_A, lora_B, scaling):
    """
    Merge LoRA weights into the original weight matrix.
    
    W_merged = W_original + (B @ A) * scaling
    
    After this, you can throw away the LoRA matrices and just use W_merged.
    """
    # Compute the low-rank update
    delta_W = (lora_B @ lora_A) * scaling
    
    # Add it to the original weights
    return original_weight + delta_W


# Example with realistic dimensions
d, k, r = 768, 768, 8
alpha = 16
scaling = alpha / r

# Simulate the weights
W_original = torch.randn(d, k)  # Original pre-trained weights
lora_A = torch.randn(r, k)      # LoRA matrix A (trained)
lora_B = torch.randn(d, r)      # LoRA matrix B (trained)

# Merge them
W_merged = merge_lora_weights(W_original, lora_A, lora_B, scaling)

print(f"Original W shape: {W_original.shape}")
print(f"LoRA A shape: {lora_A.shape}")
print(f"LoRA B shape: {lora_B.shape}")
print(f"Merged W shape: {W_merged.shape}")
print()
print("Before merging:")
print(f"  - Two separate matrix multiplies during inference")
print(f"  - Slightly slower, but LoRA weights can be swapped out")
print()
print("After merging:")
print(f"  - Single matrix multiply (same as original model)")
print(f"  - Zero inference overhead!")
print(f"  - Trade-off: can't easily swap LoRA adapters anymore")
print()
print("(In practice, you'd merge for production deployment.)")
Original W shape: torch.Size([768, 768])
LoRA A shape: torch.Size([8, 768])
LoRA B shape: torch.Size([768, 8])
Merged W shape: torch.Size([768, 768])

Before merging:
  - Two separate matrix multiplies during inference
  - Slightly slower, but LoRA weights can be swapped out

After merging:
  - Single matrix multiply (same as original model)
  - Zero inference overhead!
  - Trade-off: can't easily swap LoRA adapters anymore

(In practice, you'd merge for production deployment.)

Why LoRA Works So Well

Let’s recap why this technique has become the standard:

1. Memory Efficient

You only need to store and update the small LoRA matrices. For a 7B parameter model with rank-16 LoRA, you might only train 0.1% of the parameters. That’s the difference between needing 8 GPUs and needing 1.

2. Fast Training

Fewer parameters = fewer gradients to compute = faster backward pass. Training can be 2-3x faster than full fine-tuning.

3. Modular

You can train multiple LoRA adapters for different tasks, all sharing the same base model. Want a model that can do customer support AND code generation? Train two LoRA adapters (a few MB each) instead of fine-tuning two full models (several GB each).

Then swap them in and out as needed. It’s like having a Swiss Army knife where you only store the tools, not duplicate copies of the handle.
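With PEFT, an adapter is just the small LoRA matrices plus a config file, so swapping looks roughly like this (the directory name here is made up for illustration):

# Save only the adapter weights (a few MB), not the full base model
peft_model.save_pretrained("adapters/customer-support")

# Later, e.g. in a separate process: attach the trained adapter to the base model
from peft import PeftModel
support_model = PeftModel.from_pretrained(model, "adapters/customer-support")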

4. Preserves Base Model Knowledge

Because the pre-trained weights are frozen, you don’t suffer from “catastrophic forgetting,” where fine-tuning on task A makes the model worse at tasks B, C, and D.

5. Easy Deployment

Merge the weights for production, and you’re back to a standard model. No special serving infrastructure needed.
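With PEFT, merging is a one-liner; a sketch (the output path is made up):

# Fold the LoRA matrices into the base weights and drop the adapter wrappers
merged_model = peft_model.merge_and_unload()

# merged_model is now a plain transformers model: save or serve it as usual
merged_model.save_pretrained("gpt2-lora-merged")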


The only downside? LoRA is an approximation. There are some updates that can’t be represented as a low-rank matrix. But empirically, for most fine-tuning tasks, this constraint doesn’t hurt. And the benefits massively outweigh it.

What’s Next?

We’ve now covered supervised fine-tuning (SFT) with LoRA: teaching a model to follow a specific style or format by training on input-output examples.

But what if you want to teach a model to be helpful or harmless or creative? That’s harder to capture in input-output pairs.

Enter: Reward Modeling.

Instead of training on explicit examples, we train a separate model to predict human preferences. Then we use that reward model to guide the fine-tuning process (via reinforcement learning).

It’s a bit more complex, but it’s how models like ChatGPT and Claude get their “personality.”