Large language models are memory hogs to train.
A 7B parameter model in full precision? That’s 28 GB just for the weights. Add in the optimizer state (another 56 GB!), gradients (28 GB), and activations (20+ GB), and you’re looking at 130+ GB of memory.
Your RTX 4090 has 24 GB.
See the problem?
The good news: with the right tricks, you can train that 7B model on consumer hardware. This notebook is your guide to the memory optimization techniques that make it possible.
Understanding the Memory Problem¶
First, let’s break down where all that memory goes during training.
Think of GPU memory like a packed suitcase. You’ve got limited space, and you need to fit everything in. Here’s what’s taking up room:
The Memory Breakdown¶
Total GPU Memory During Training:
├── Model Weights (~20%) ← The model parameters themselves
├── Optimizer State (~40%) ← Momentum & variance (biggest offender!)
├── Gradients (~20%) ← One gradient per parameter
├── Activations (~15%) ← Saved outputs for backprop
└── Framework Overhead (~5%) ← PyTorch bookkeeping
Notice something? The optimizer state is typically the largest consumer of memory. Not the model itself!
Why? Because modern optimizers like AdamW keep track of two extra values per parameter: a momentum term and a variance term. That’s 2x the model size right there.
This is why a 7B parameter model needs way more than 28 GB of memory for training. It’s not just storing the weights—it’s storing all the machinery needed to update those weights effectively.
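You can actually see those two extra tensors. Here's a minimal sketch (a toy linear layer, purely for illustration) that peeks at AdamW's per-parameter state after a single update:
import torch
# A toy model - we only care about what the optimizer stores per parameter
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters())
# One forward/backward/step so the optimizer state gets populated
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
# Each parameter now carries two extra fp32 tensors of the same shape as itself:
# 'exp_avg' (momentum) and 'exp_avg_sq' (variance)
for p in model.parameters():
    state = optimizer.state[p]
    print(sorted(state.keys()))  # ['exp_avg', 'exp_avg_sq', 'step']
    print(p.shape, state["exp_avg"].shape, state["exp_avg_sq"].shape)
    break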
Let’s Do the Math¶
Numbers make this concrete. Let’s calculate the exact memory requirements for two models.
GPT-2 (124M parameters) - Full Fine-Tuning¶
Model weights (fp32): 124M params × 4 bytes = 496 MB
Optimizer (AdamW): 124M params × 8 bytes = 992 MB ← 2x model size!
Gradients (fp32): 124M params × 4 bytes = 496 MB
Activations (batch=8): ~500 MB
Framework overhead: ~100 MB
─────────────────────────────────────────────────────────
Total:                                               ~2.6 GB
Not too bad! This fits comfortably on most GPUs.
Llama 7B - Full Fine-Tuning¶
Model weights (fp32): 7B params × 4 bytes = 28 GB
Optimizer (AdamW): 7B params × 8 bytes = 56 GB ← Oof.
Gradients (fp32): 7B params × 4 bytes = 28 GB
Activations (batch=8): ~20 GB
Framework overhead: ~2 GB
─────────────────────────────────────────────────────
Total:                                              ~134 GB
Whoops! That’s not fitting on consumer hardware.
Breaking Down the Calculations¶
Why 4 bytes for fp32? Each 32-bit floating point number takes 32 bits = 4 bytes of memory.
Why 8 bytes for AdamW? AdamW stores two additional values per parameter (first moment and second moment), each in fp32. So that’s 4 bytes + 4 bytes = 8 bytes per parameter just for optimizer state.
What are activations? During the forward pass, we save the output of each layer. We need these saved values during backpropagation to compute gradients. More layers, longer sequences, and bigger batch sizes = more activations to store.
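If you want to redo this arithmetic for other model sizes, here's a small back-of-the-envelope helper. The activation and overhead figures are rough inputs you supply yourself (they depend on architecture, batch size, and sequence length), so treat them as assumptions:
def training_memory_gb(num_params, weight_bytes=4, optimizer_bytes=8,
                       gradient_bytes=4, activations_gb=0.5, overhead_gb=0.1):
    """Rough full fine-tuning estimate in GB (fp32 weights + AdamW)."""
    weights = num_params * weight_bytes / 1e9
    optimizer_state = num_params * optimizer_bytes / 1e9
    gradients = num_params * gradient_bytes / 1e9
    return weights + optimizer_state + gradients + activations_gb + overhead_gb
# Reproduce the two totals above
print(f"GPT-2 124M: ~{training_memory_gb(124e6, activations_gb=0.5, overhead_gb=0.1):.1f} GB")
print(f"Llama 7B:   ~{training_memory_gb(7e9, activations_gb=20, overhead_gb=2):.0f} GB")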
The good news? We can dramatically reduce these numbers with the right techniques.
Technique 1: Mixed Precision Training¶
This is the most impactful technique you can apply. It cuts memory usage roughly in half with just a few lines of code.
What is Mixed Precision?¶
Instead of using 32-bit floating point numbers (fp32) for everything, we use 16-bit numbers (fp16 or bf16) for most operations.
Think of it like this: you’re doing carpentry. Sometimes you need a micrometer for precise measurements. But most of the time? A ruler is fine. Mixed precision training uses the “ruler” (16-bit) for most work and pulls out the “micrometer” (32-bit) only when needed.
The Three Formats¶
| Format | Bits | Range | Precision | Memory per Value |
|---|---|---|---|---|
| fp32 | 32 | ±3.4 × 10³⁸ | ~7 decimal digits | 4 bytes |
| fp16 | 16 | ±65,504 | ~3 decimal digits | 2 bytes |
| bf16 | 16 | ±3.4 × 10³⁸ | ~2 decimal digits | 2 bytes |
fp16 (Float16): Traditional half precision. Small range, can overflow/underflow easily.
bf16 (BrainFloat16): Google’s format. Same range as fp32 but less precision. This is the sweet spot for modern GPUs (Ampere, Ada, Hopper). No overflow issues, no loss scaling needed.
Which should you use? If your GPU supports it (RTX 30-series or newer, A100, H100), use bf16. It’s simpler and more robust. Otherwise, fp16 works but requires loss scaling to prevent underflow.
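If you're not sure what your GPU supports, here's a quick check (it also verifies the range and size numbers from the table above):
import torch
# Numeric properties of the three formats
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.1e}  bytes={info.bits // 8}")
# Pick a training dtype based on hardware support
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    train_dtype = torch.bfloat16  # Ampere (RTX 30-series, A100) or newer
else:
    train_dtype = torch.float16   # older GPUs: works, but pair it with loss scaling
print("Training dtype:", train_dtype)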
Memory Savings¶
7B model in fp32: 7B × 4 bytes = 28 GB
7B model in bf16: 7B × 2 bytes = 14 GB
↓
50% reduction!
And training quality is essentially identical. You’re getting half the memory usage for free.
import torch
from torch.amp import autocast, GradScaler
# This is what mixed precision training looks like in code.
# It's surprisingly simple!
# For fp16, we need a GradScaler to prevent gradient underflow
# (not needed for bf16 - you can pass enabled=False to make it a no-op)
scaler = GradScaler('cuda')
def train_step_mixed_precision(model, batch, optimizer):
"""A training step using mixed precision.
The key is the `autocast` context manager - it automatically
casts operations to the specified dtype when beneficial.
"""
optimizer.zero_grad()
# Forward pass in lower precision (bf16 or fp16)
# PyTorch automatically figures out which ops should be bf16
# and which should stay fp32 (like loss calculation)
with autocast('cuda', dtype=torch.bfloat16): # or torch.float16
outputs = model(batch["input_ids"])
loss = outputs.loss
# Backward pass
# For fp16: scale the loss to prevent gradient underflow
# For bf16: this is unnecessary but doesn't hurt
scaler.scale(loss).backward()
# Optimizer step with unscaling
scaler.step(optimizer)
scaler.update()
return loss.item()
print("Mixed Precision Memory Savings:")
print(" FP32 → BF16: ~50% reduction in memory")
print(" FP32 → FP16: ~50% reduction in memory")
print()
print("Speed bonus: Training is often 20-30% faster too!")
print("(Modern GPUs have dedicated hardware for fp16/bf16 operations)")
print()
print("Quality impact: Negligible for most tasks")
print("(We've trained hundreds of models this way - it just works)")Mixed Precision Memory Savings:
FP32 → BF16: ~50% reduction in memory
FP32 → FP16: ~50% reduction in memory
Speed bonus: Training is often 20-30% faster too!
(Modern GPUs have dedicated hardware for fp16/bf16 operations)
Quality impact: Negligible for most tasks
(We've trained hundreds of models this way - it just works)
Technique 2: LoRA (Low-Rank Adaptation)¶
If mixed precision is a memory reducer, LoRA is a memory destroyer. In the best way.
Remember how I said the optimizer state is the biggest memory hog? LoRA solves this by freezing the base model and training only tiny adapter layers.
The Core Idea¶
Instead of updating all 7 billion parameters, we freeze them and add small “adapter” matrices that we train instead.
Think of it like this: you’ve got a massive reference book (the base model). Instead of rewriting the whole book, you add sticky notes (LoRA adapters) with corrections and additions. The book stays the same; only the notes change.
The Math¶
For each weight matrix W, LoRA adds two small matrices A and B:
Original: W (full rank, millions of parameters)
LoRA adds: W + (A × B)
↑ ↑ ↑
frozen rank r matrices
(tiny!)
If W is 4096×4096, that’s 16.7M parameters. But A and B with rank r=16? That’s only 4096×16 + 16×4096 = 131K parameters!
That’s 128x fewer parameters to train.
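A quick sanity check of that arithmetic (the 4096×4096 shape is just the illustrative example above):
d_in, d_out, r = 4096, 4096, 16
full_params = d_in * d_out              # the frozen weight matrix W
lora_params = d_in * r + r * d_out      # A (d_in x r) and B (r x d_out)
print(f"Full matrix: {full_params:,} params")   # 16,777,216
print(f"LoRA (r={r}): {lora_params:,} params")  # 131,072
print(f"Reduction: {full_params / lora_params:.0f}x fewer trainable params")  # 128x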
Memory Impact¶
Full Fine-Tuning (Llama 7B):
Trainable params: 7,000,000,000
Optimizer state: 56 GB (8 bytes per param)
Gradients: 28 GB (4 bytes per param)
LoRA with r=16 (Llama 7B):
Trainable params: 16,777,216 (0.24% of model!)
Optimizer state: 134 MB (418x reduction!)
Gradients: 67 MB (418x reduction!)
The base model stays frozen, so we only need optimizer state and gradients for those tiny adapter matrices.
It’s kind of absurd how well this works.
# Let's see the actual numbers for LoRA memory savings.
# This is why LoRA has become the standard approach for fine-tuning large models.
print("LoRA Memory Comparison (Llama 7B in BF16)")
print("=" * 60)
print()
print("Full Fine-Tuning:")
print(" Base model: 14 GB (all parameters trainable)")
print(" Optimizer state: 56 GB (momentum + variance for all params)")
print(" Gradients: 14 GB (one gradient per parameter)")
print(" ─────────────────────────")
print(" Total: 84 GB (before activations!)")
print()
print("LoRA (r=16):")
print(" Base model: 14 GB (frozen - can even be quantized!)")
print(" LoRA adapters: 67 MB (only these are trainable)")
print(" Optimizer state: 268 MB (only for LoRA adapters)")
print(" Gradients: 67 MB (only for LoRA adapters)")
print(" ─────────────────────────")
print(" Total: 14.4 GB (before activations)")
print()
print(" Memory reduction: 5.8x smaller!")
print(" And we can combine this with quantization...")
print()
print("Choosing the Rank (r)")
print("=" * 60)
print()
print("The rank controls capacity vs. memory trade-off:")
print()
print(" r=4: ~33 MB trainable")
print(" Minimal memory, but may underfit complex tasks")
print()
print(" r=8: ~67 MB trainable")
print(" Good for simple fine-tuning tasks")
print()
print(" r=16: ~134 MB trainable")
print(" The sweet spot - recommended default")
print()
print(" r=32: ~268 MB trainable")
print(" High capacity for complex tasks")
print()
print(" r=64: ~536 MB trainable")
print(" When you need more expressiveness")
print()
print("Rule of thumb: Start with r=16. Increase if underfitting.")
print("(Most tasks work great with r=16, honestly)")LoRA Memory Comparison (Llama 7B in BF16)
============================================================
Full Fine-Tuning:
Base model: 14 GB (all parameters trainable)
Optimizer state: 56 GB (momentum + variance for all params)
Gradients: 14 GB (one gradient per parameter)
─────────────────────────
Total: 84 GB (before activations!)
LoRA (r=16):
Base model: 14 GB (frozen - can even be quantized!)
LoRA adapters: 67 MB (only these are trainable)
Optimizer state: 268 MB (only for LoRA adapters)
Gradients: 67 MB (only for LoRA adapters)
─────────────────────────
Total: 14.4 GB (before activations)
Memory reduction: 5.8x smaller!
And we can combine this with quantization...
Choosing the Rank (r)
============================================================
The rank controls capacity vs. memory trade-off:
r=4: ~33 MB trainable
Minimal memory, but may underfit complex tasks
r=8: ~67 MB trainable
Good for simple fine-tuning tasks
r=16: ~134 MB trainable
The sweet spot - recommended default
r=32: ~268 MB trainable
High capacity for complex tasks
r=64: ~536 MB trainable
When you need more expressiveness
Rule of thumb: Start with r=16. Increase if underfitting.
(Most tasks work great with r=16, honestly)
Technique 3: Gradient Accumulation¶
Okay, this one’s clever.
You know how larger batch sizes generally lead to more stable training? But bigger batches need more memory (you’re processing more samples at once).
Gradient accumulation lets you have your cake and eat it too.
The Trick¶
Instead of computing gradients and updating weights after every batch, we accumulate gradients across multiple small batches, then update once.
Effective batch size = per_device_batch_size × gradient_accumulation_steps
Memory usage = per_device_batch_size (not effective batch size!)
So if you can only fit batch_size=2 in memory, but you want the training stability of batch_size=32, you can do:
per_device_batch_size = 2
gradient_accumulation_steps = 16
# Effective batch size = 2 × 16 = 32
You get the benefits of a large batch size with the memory footprint of a small one.
The Trade-off¶
Pro: Train with larger effective batch sizes without OOM errors.
Con: Slightly slower (you’re doing more forward passes before each update).
But slower beats “doesn’t fit in memory” every time.
# Here's what gradient accumulation looks like in practice.
# The key insight: gradients ADD together, so we can accumulate them
# over multiple batches before applying an update.
def train_with_gradient_accumulation(model, dataloader, optimizer, accumulation_steps=4):
"""Train with gradient accumulation.
We process `accumulation_steps` batches, accumulating gradients,
then update the model once.
"""
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
# Forward pass
outputs = model(batch["input_ids"])
loss = outputs.loss
# Important: Scale the loss by accumulation steps
# (so the effective learning rate stays consistent)
loss = loss / accumulation_steps
# Backward pass - this ADDS to existing gradients
loss.backward()
# Only update weights every `accumulation_steps` batches
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad() # Clear for next accumulation
# Let's see the impact
print("Gradient Accumulation Example:")
print("=" * 60)
print()
print("Scenario: You can fit batch_size=4 in memory")
print(" But you want effective batch_size=32")
print()
print("Solution:")
print(" per_device_batch_size = 4")
print(" gradient_accumulation_steps = 8")
print()
print("Result:")
print(" Effective batch size: 4 × 8 = 32 ✓")
print(" Memory usage: Only 4 samples at a time ✓")
print(" Training stability: Same as batch_size=32 ✓")
print()
print("This is why you'll see accumulation_steps in almost every")
print("fine-tuning config - it's free effective batch size!")Gradient Accumulation Example:
============================================================
Scenario: You can fit batch_size=4 in memory
But you want effective batch_size=32
Solution:
per_device_batch_size = 4
gradient_accumulation_steps = 8
Result:
Effective batch size: 4 × 8 = 32 ✓
Memory usage: Only 4 samples at a time ✓
Training stability: Same as batch_size=32 ✓
This is why you'll see accumulation_steps in almost every
fine-tuning config - it's free effective batch size!
Technique 4: Gradient Checkpointing¶
This technique trades computation for memory. It’s brilliant in a slightly painful way.
The Problem¶
During the forward pass, we need to save the output of every layer. Why? Because during backpropagation, we need those saved values to compute gradients.
For a 32-layer transformer processing a batch of 8 sequences, that’s a LOT of saved activations. They can easily use 20+ GB of memory.
The Solution¶
What if we... didn’t save them?
Gradient checkpointing only saves activations at certain “checkpoint” layers (say, every 4th layer). During backpropagation, when we need an activation we didn’t save, we recompute it on the fly.
Think of it like taking notes during a lecture. You could transcribe everything (high memory), or you could write down key points and fill in the details later (low memory, more work).
Without Gradient Checkpointing:
Forward pass: Compute activations → Save all of them
Backward pass: Use saved activations → Fast
Memory: HIGH
With Gradient Checkpointing:
Forward pass: Compute activations → Save only checkpoints
Backward pass: Recompute missing activations → Slower
Memory: LOW
The Trade-off¶
You’re recomputing activations during backprop, so training is slower (typically 20-30% slower).
But the memory savings are huge: 50-80% reduction in activation memory.
When you’re memory-constrained, this is a lifesaver.
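Before the one-line Hugging Face version below, here's a minimal sketch of the underlying mechanism using torch.utils.checkpoint on a toy stack of layers (the shapes and layer count are arbitrary):
import torch
from torch.utils.checkpoint import checkpoint
# A toy "transformer": just a stack of blocks, enough to show the idea
blocks = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()) for _ in range(8)
)
def forward_with_checkpointing(x):
    for block in blocks:
        # The block's intermediate activations are NOT stored during the forward pass;
        # they get recomputed when backward() needs them.
        x = checkpoint(block, x, use_reentrant=False)
    return x
x = torch.randn(4, 512, requires_grad=True)
out = forward_with_checkpointing(x)
out.sum().backward()  # recomputation happens here - trading compute for memory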
from transformers import AutoModelForCausalLM
# Gradient checkpointing is ridiculously easy to enable.
# One method call. That's it.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()
print("Gradient Checkpointing Enabled!")
print("=" * 60)
print()
print("Memory Impact (Llama 7B, batch=8, seq_len=2048):")
print()
print(" Without checkpointing:")
print(" Activations: ~20 GB")
print(" (Every layer output saved)")
print()
print(" With checkpointing:")
print(" Activations: ~5 GB")
print(" (Only checkpoint layers saved, rest recomputed)")
print()
print(" Memory reduction: 75%!")
print()
print("Speed Trade-off:")
print(" Training is ~20-30% slower")
print(" (We're recomputing activations during backward pass)")
print()
print("When to use it:")
print(" - You're hitting OOM errors")
print(" - You're training with long sequences")
print(" - You want to increase batch size")
print()
print("When to skip it:")
print(" - You have plenty of memory")
print(" - Speed is critical")
print(" - You're using very short sequences")
print()
print("If you're fine-tuning on consumer hardware, you're probably")
print("using this. The memory savings are just too good to pass up.")Gradient Checkpointing Enabled!
============================================================
Memory Impact (Llama 7B, batch=8, seq_len=2048):
Without checkpointing:
Activations: ~20 GB
(Every layer output saved)
With checkpointing:
Activations: ~5 GB
(Only checkpoint layers saved, rest recomputed)
Memory reduction: 75%!
Speed Trade-off:
Training is ~20-30% slower
(We're recomputing activations during backward pass)
When to use it:
- You're hitting OOM errors
- You're training with long sequences
- You want to increase batch size
When to skip it:
- You have plenty of memory
- Speed is critical
- You're using very short sequences
If you're fine-tuning on consumer hardware, you're probably
using this. The memory savings are just too good to pass up.
Technique 5: Model Quantization¶
Now we’re getting into the heavy artillery.
Quantization is like... aggressive compression for neural networks. Instead of storing weights in 16-bit or 32-bit precision, we use 8-bit or even 4-bit integers.
Sounds crazy, right? How can you possibly represent a neural network weight that might be 0.0123456789 using just 4 bits (16 possible values)?
The Magic¶
Modern quantization techniques (like NormalFloat4, or “nf4”) are calibrated to the distribution of neural network weights. Most weights cluster around zero, with a long tail. NF4 packs more precision where it matters and less where it doesn’t.
It’s like how JPEG compression works: throw away information humans won’t notice. Here, we throw away precision the model doesn’t need.
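To make that concrete, here's a toy "absmax" quantizer. This is not the real NF4 algorithm (which uses a non-uniform grid and block-wise scales), just the basic round-to-a-small-grid idea:
import torch
def absmax_quantize(w, bits=4):
    """Map each weight onto a symmetric integer grid; return codes + scale."""
    levels = 2 ** (bits - 1) - 1      # e.g. 7 levels each side for 4-bit
    scale = w.abs().max() / levels    # one scale per tensor (real schemes use per-block scales)
    codes = torch.clamp((w / scale).round(), -levels, levels).to(torch.int8)
    return codes, scale
def dequantize(codes, scale):
    return codes.float() * scale
w = torch.randn(4096, 64) * 0.02      # weights cluster near zero, like real layers
codes, scale = absmax_quantize(w, bits=4)
w_hat = dequantize(codes, scale)
print(f"Mean abs error: {(w - w_hat).abs().mean().item():.5f} (weight std ≈ 0.02)")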
The Numbers¶
| Precision | Memory (7B model) | Quality | Notes |
|---|---|---|---|
| FP32 | 28 GB | 100% | Original precision |
| BF16 | 14 GB | ~99.9% | Standard for training |
| INT8 | 7 GB | ~99% | Good for inference |
| NF4 | 3.5 GB | 95-98% | The QLoRA breakthrough |
Yes, you read that right. A 7B model in 4-bit takes 3.5 GB. That fits on a laptop GPU.
The Catch¶
Quantized models can only be used for inference or as frozen base models for fine-tuning (like with LoRA).
You can’t train a quantized model directly. The low precision causes training to diverge.
But combined with LoRA (which adds trainable adapters on top), you get QLoRA: 4-bit base model + 16-bit LoRA adapters = magic.
# Let's load a quantized model.
# This is what makes fine-tuning large models accessible to everyone.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Configure 4-bit quantization (the QLoRA approach)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # Use 4-bit quantization
# Use NormalFloat4 (nf4) - specifically designed for neural network weights
# This gives better quality than standard 4-bit quantization
bnb_4bit_quant_type="nf4",
# Double quantization: quantize the quantization constants too
# (yes, really - saves even more memory with minimal quality loss)
bnb_4bit_use_double_quant=True,
# When we actually compute with these weights (forward pass),
# convert them to bf16 for the calculation
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load the model in 4-bit
# (Using a 3B model here keeps the demo light; the printed numbers below describe the 7B case)
# Note: This requires the `bitsandbytes` library
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-3B",
quantization_config=quantization_config,
device_map="auto", # Automatically distribute across available devices
)
print("QLoRA Setup: 4-bit Model + LoRA Adapters")
print("=" * 60)
print()
print("This is the state-of-the-art approach for fine-tuning")
print("large models on consumer hardware.")
print()
print("Memory breakdown (7B model):")
print(" Base model (4-bit): 3.5 GB ← Quantized!")
print(" LoRA adapters (bf16): 67 MB ← Trainable")
print(" Optimizer state: 268 MB ← Only for LoRA")
print(" Gradients: 67 MB ← Only for LoRA")
print(" Activations (bs=8): ~5 GB ← With gradient checkpointing")
print(" ───────────────────────────────")
print(" Total: ~9 GB")
print()
print("That fits on an RTX 3080 (10 GB)!")
print("Or even an RTX 3060 (12 GB) with room to spare.")
print()
print("The quality hit? Surprisingly small. QLoRA models often")
print("perform within 1-2% of full precision fine-tuning.")
print()
print("This technique democratized LLM fine-tuning. Before QLoRA,")
print("you needed expensive multi-GPU setups. Now? Your gaming PC")
print("can fine-tune a 7B model. Pretty wild.")QLoRA Setup: 4-bit Model + LoRA Adapters
============================================================
This is the state-of-the-art approach for fine-tuning
large models on consumer hardware.
Memory breakdown (7B model):
Base model (4-bit): 3.5 GB ← Quantized!
LoRA adapters (bf16): 67 MB ← Trainable
Optimizer state: 268 MB ← Only for LoRA
Gradients: 67 MB ← Only for LoRA
Activations (bs=8): ~5 GB ← With gradient checkpointing
───────────────────────────────
Total: ~9 GB
That fits on an RTX 3080 (10 GB)!
Or even an RTX 3060 (12 GB) with room to spare.
The quality hit? Surprisingly small. QLoRA models often
perform within 1-2% of full precision fine-tuning.
This technique democratized LLM fine-tuning. Before QLoRA,
you needed expensive multi-GPU setups. Now? Your gaming PC
can fine-tune a 7B model. Pretty wild.
Profiling Memory Usage¶
Before you can optimize memory, you need to measure it.
Here’s a simple profiling function that shows exactly where your memory is going. This is invaluable when debugging OOM errors or trying to squeeze more performance out of your GPU.
import torch
import gc
def profile_memory(fn, label="Operation"):
"""Profile GPU memory usage of a function.
This shows you:
- Starting memory (what was already allocated)
- Ending memory (after the operation)
- Delta (how much the operation added)
- Peak (the highest memory point during execution)
The delta vs peak difference tells you about temporary allocations.
"""
if not torch.cuda.is_available():
print("No CUDA available - can't profile GPU memory")
print("(This is fine if you're running on CPU)")
return
# Clean up before measuring
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()
gc.collect()
start_mem = torch.cuda.memory_allocated()
# Run the function
result = fn()
end_mem = torch.cuda.memory_allocated()
peak_mem = torch.cuda.max_memory_allocated()
print(f"\n{label}")
print(f" Start: {start_mem / 1e9:.2f} GB")
print(f" End: {end_mem / 1e9:.2f} GB")
print(f" Delta: {(end_mem - start_mem) / 1e9:.2f} GB")
print(f" Peak: {peak_mem / 1e9:.2f} GB")
if peak_mem > end_mem:
print(f" (Peak-End: {(peak_mem - end_mem) / 1e9:.2f} GB temporary allocation)")
return result
# Check current memory usage
print("Current Memory Status:")
print("=" * 60)
if torch.cuda.is_available():
current = torch.cuda.memory_allocated()
peak = torch.cuda.max_memory_allocated()
total = torch.cuda.get_device_properties(0).total_memory
print(f"Current allocation: {current / 1e9:.2f} GB")
print(f"Peak allocation: {peak / 1e9:.2f} GB")
print(f"Total GPU memory: {total / 1e9:.2f} GB")
print(f"Available: {(total - current) / 1e9:.2f} GB")
print()
print("Use this profiler to understand where your memory goes!")
print()
print("Example:")
print(" profile_memory(")
print(" lambda: model.forward(batch),")
print(" label='Forward pass'")
print(" )")
else:
print("No CUDA available")
print("(Running on CPU - memory profiling won't work)")Current Memory Status:
============================================================
Current allocation: 2.06 GB
Peak allocation: 3.08 GB
Total GPU memory: 25.75 GB
Available: 23.70 GB
Use this profiler to understand where your memory goes!
Example:
profile_memory(
lambda: model.forward(batch),
label='Forward pass'
)
When Things Go Wrong: Debugging OOM Errors¶
The dreaded “CUDA out of memory” error. We’ve all been there.
Here’s your systematic debugging checklist.
The Quick Fixes (Try These First)¶
1. Cut your batch size in half
If batch_size=8 OOMs, try batch_size=4
Use gradient accumulation to maintain effective batch size
This fixes 80% of OOM issues
2. Enable gradient checkpointing
One line:
model.gradient_checkpointing_enable()
Saves 50-80% on activation memory
Especially important for long sequences
3. Use gradient accumulation
Smaller batches, same training dynamics
Costs nothing but a bit of training time
4. Clear the CUDA cache
torch.cuda.empty_cache()
Defragments memory, sometimes helps
Won’t free memory that’s actually in use
Common OOM Causes¶
| Problem | Symptom | Solution |
|---|---|---|
| Batch size too large | OOM during forward pass | Reduce batch_size by 50% |
| Sequences too long | OOM with long inputs | Truncate to 512 or 1024 tokens |
| Memory leak | OOM after many steps | Use .item() or .detach() on tensors (see the sketch after this table) |
| Fragmented memory | OOM despite “available” memory | torch.cuda.empty_cache() |
| Multiple models | OOM at model load | Delete old models: del model; gc.collect() |
| Full precision | Just generally tight | Switch to BF16/FP16 |
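That memory-leak row is worth spelling out. The classic culprit is accumulating loss tensors (which drag their whole computation graphs along) instead of plain Python numbers. A minimal sketch with a hypothetical tiny model:
import torch
# Hypothetical tiny setup, just to show the pattern
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters())
losses = []
for step in range(100):
    loss = model(torch.randn(4, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # BAD:  losses.append(loss)     # keeps every step's graph/tensors alive
    losses.append(loss.item())      # GOOD: store a plain float instead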
The Nuclear Option¶
If nothing else works:
# QLoRA: 4-bit model + LoRA
quantization_config = BitsAndBytesConfig(load_in_4bit=True, ...)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quantization_config
)
model.gradient_checkpointing_enable()
# LoRA config
lora_config = LoraConfig(r=8, ...) # Smaller rank if needed
model = get_peft_model(model, lora_config)
# Training config
training_args = TrainingArguments(
per_device_train_batch_size=1, # Minimum
gradient_accumulation_steps=32, # Large accumulation
...
)
This’ll fit almost anything, anywhere. It’s slow, but it works.
Putting It All Together: A Memory Optimization Strategy¶
Okay, we’ve covered a lot of techniques. How do you actually use them?
Here’s my recommended approach, in order. Think of it as a ladder—start at step 1, and only climb higher if you need to.
Step 1: The Essentials (Always Do This)¶
These are free wins. Apply them to every training run.
1. Mixed precision (BF16 or FP16)
50% memory reduction
Often faster training
Literally no downside on modern GPUs
2. LoRA (if training large models >1B params)
80-95% reduction in optimizer memory
No speed penalty
Quality is excellent for most tasks
3. Find your maximum batch size
Start with batch_size=8
If OOM, halve it
If comfortable, double it
Use gradient accumulation to hit your target effective batch size
Step 2: If You’re Still Running Out of Memory¶
4. Gradient checkpointing
50-80% reduction in activation memory
20-30% slower training
Essential for long sequences
5. Gradient accumulation
Lets you use smaller batches
Maintains training stability
Slight speed cost
Step 3: The Extreme Measures¶
6. 4-bit quantization (QLoRA)
75% reduction in model memory
Small quality hit (1-2%)
Can only be used with LoRA
7. CPU offloading (DeepSpeed ZeRO-3)
Offloads optimizer state to CPU RAM
50-70% GPU memory reduction
60-80% slower training
Use as last resort
The Rule¶
Start simple. Add complexity only when you hit limits.
Don’t start with QLoRA + checkpointing + accumulation + offloading just because you can. Start with BF16 + LoRA, see how far that gets you, and add techniques as needed.
Complete Memory Optimization Example¶
Here’s what a fully optimized training setup looks like. This is the configuration that lets you train 7B models on consumer GPUs.
Goal: Fine-tune Llama 7B on a 12 GB GPU (RTX 3060/3080)
The recipe:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
import torch
# Step 1: Load model in 4-bit (QLoRA)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_use_double_quant=True, # Double quantization
bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
# Step 2: Add LoRA adapters
lora_config = LoraConfig(
r=16, # Rank (balance capacity vs memory)
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Step 3: Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Step 4: Configure training with small batches + accumulation
training_args = TrainingArguments(
output_dir="./llama-7b-finetuned",
# Memory settings
per_device_train_batch_size=4, # Small batch per GPU
gradient_accumulation_steps=8, # Effective batch = 4 × 8 = 32
# Precision
bf16=True, # Use BF16 (if supported)
# Training
num_train_epochs=3,
learning_rate=2e-4,
# Checkpointing
save_strategy="steps",
save_steps=100,
)
# Step 5: Train!
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
max_seq_length=512, # Reasonable sequence length
)
trainer.train()
Memory Breakdown:
Base model (4-bit): ~3.5 GB
LoRA params (bf16): ~67 MB
Optimizer state: ~268 MB
Gradients: ~67 MB
Activations (bs=4): ~5 GB (with checkpointing)
Framework overhead: ~500 MB
──────────────────────────────
Total:                ~9.4 GB
Fits comfortably on a 12 GB GPU with room to spare!
This same approach scales:
13B model on 24 GB GPU (RTX 4090)
70B model on 48 GB GPU (A6000)
The QLoRA revolution made this all possible.
Quick Reference: Memory Optimization Techniques¶
Here’s your cheat sheet. Bookmark this cell.
| Technique | Memory Savings | Speed Impact | Quality Impact | When to Use |
|---|---|---|---|---|
| Mixed Precision (BF16) | 50% | +20% faster | Negligible | Always |
| LoRA | 80-95% optimizer | None | Excellent | Large models (>1B) |
| Gradient Accumulation | Enables larger effective batch | -10-20% | None | Memory-limited |
| Gradient Checkpointing | 50-80% activations | -20-30% | None | Long sequences or tight memory |
| Quantization (4-bit) | 75% model | -10-20% | Small (~1-2%) | Extreme constraints |
| CPU Offloading | 50-70% optimizer | -60-80% | None | Last resort |
The Impact Hierarchy¶
Most impact for least effort:
1. Mixed precision (BF16)
2. LoRA
3. Gradient checkpointing
When you’re desperate:
4. Quantization (4-bit)
5. CPU offloading
Model Size → GPU Requirements¶
With all optimizations (QLoRA + gradient checkpointing + BF16):
| Model Size | Min GPU Memory | Example Cards |
|---|---|---|
| 3B | 6 GB | RTX 3060, RTX 2060 |
| 7B | 10 GB | RTX 3080, RTX 2080 Ti |
| 13B | 16 GB | RTX 4080, RTX 3090 |
| 30B | 24 GB | RTX 4090, A5000 |
| 70B | 48 GB | A6000, 2×RTX 4090 |
These are realistic fine-tuning numbers, not theoretical minimums.
Remember¶
The goal isn’t to use every technique. The goal is to use the minimum number of techniques needed to fit your model in memory while maintaining reasonable training speed.
Start simple. Add complexity only when needed.
What’s Next?¶
You now have the tools to fit large models on limited hardware. But memory is just one piece of the puzzle.
Next up: hyperparameter tuning. Learning rate schedules, warmup steps, weight decay—all the knobs you can turn to make your model actually learn well (not just fit in memory).
Because a model that fits in memory but doesn’t train well is... not very useful.
(But hey, at least it runs!)