
Layer Normalization

Two Problems to Solve

We’ve computed the FFN output. But we have two problems:

Problem 1: Information loss

We replaced the attention output with the FFN output. All that careful work computing Q, K, V, attention scores... gone. The FFN output doesn’t preserve the original information.

Problem 2: Unstable activations

In deep neural networks, activations tend to drift. Some dimensions grow huge, others shrink to near-zero. As you stack more layers, small imbalances compound. Gradients explode or vanish. Training becomes a nightmare.

Two techniques solve these problems: residual connections and layer normalization.

Residual Connections

The fix for information loss is beautifully simple: add instead of replace.

$$\text{output} = \text{input} + \text{FFN}(\text{input})$$

This is a residual connection (or skip connection). Instead of learning the full transformation, the FFN only needs to learn the change (the residual).

Benefits:

  • No information loss: The original input is preserved

  • Easier learning: Learning deltas is easier than learning full transformations

  • Better gradients: During backprop, gradients can flow directly through the addition

Residual connections were the key innovation that made very deep networks trainable (ResNet, 2015).
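
To make "add instead of replace" concrete, here is a minimal sketch in the same plain-Python style as this notebook. The toy_ffn below is a hypothetical stand-in for any sub-layer, not the real FFN from the previous notebook; the point is only that the residual output stays anchored to the original input.

# Sketch: a residual connection preserves the input (toy_ffn is a hypothetical sub-layer)
def toy_ffn(x):
    return [0.01 * v for v in x]   # pretend the sub-layer produces a small correction

x = [0.5, -1.2, 0.3, 0.8]
replaced = toy_ffn(x)                                   # replace: the input is gone
residual = [xi + di for xi, di in zip(x, toy_ffn(x))]   # add: the input is preserved

print("input:   ", x)
print("replaced:", replaced)   # original information lost
print("residual:", residual)   # original values still dominate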

Layer Normalization

Even with residual connections, activations can drift. Layer normalization fixes this by normalizing each position’s vector to have:

  • Mean = 0

  • Variance = 1

The formula:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:

  • $\mu$ = mean of $x$ across dimensions

  • $\sigma^2$ = variance of $x$ across dimensions

  • $\epsilon$ = small constant for numerical stability (prevents division by zero)

  • $\gamma$ = learnable scale parameter (initialized to 1)

  • $\beta$ = learnable shift parameter (initialized to 0)

  • $\odot$ = element-wise multiplication

Why learnable $\gamma$ and $\beta$?

If we only normalized, we’d force every position to have exactly mean 0 and variance 1. That’s too restrictive—sometimes non-zero means are useful!

The learnable parameters let the model choose to denormalize if it helps. At initialization ($\gamma=1$, $\beta=0$), it’s standard normalization. During training, the model learns the optimal scale and shift for each dimension.
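
As a quick, hedged illustration (separate from the transformer computation below), applying a non-trivial scale and shift to an already-normalized vector moves its statistics to roughly mean $\beta$ and standard deviation $\gamma$:

import math

# Sketch: gamma and beta let the model "denormalize" a normalized vector
x_hat = [-1.3416, -0.4472, 0.4472, 1.3416]   # already mean 0, variance 1

gamma_demo, beta_demo = 2.0, 0.5             # hypothetical learned values
y = [gamma_demo * v + beta_demo for v in x_hat]

mean_y = sum(y) / len(y)
std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / len(y))
print(f"mean ≈ {mean_y:.3f} (≈ beta), std ≈ {std_y:.3f} (≈ gamma)")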

import random
import math

random.seed(42)

VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
EPSILON = 1e-5  # Small constant to prevent division by zero

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n, p = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"
# Recreate everything from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# Attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
concat_output = [attention_output_all[0][i] + attention_output_all[1][i] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

# FFN
W1 = random_matrix(D_FF, D_MODEL)
b1 = random_vector(D_FF)
W2 = random_matrix(D_MODEL, D_FF)
b2 = random_vector(D_MODEL)
hidden = matmul(multi_head_output, transpose(W1))
hidden = [[hidden[i][j] + b1[j] for j in range(D_FF)] for i in range(seq_len)]
activated = [[gelu(h) for h in row] for row in hidden]
ffn_output = matmul(activated, transpose(W2))
ffn_output = [[ffn_output[i][j] + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

print("Recreated multi-head attention and FFN outputs")
Recreated multi-head attention and FFN outputs

Step 1: Residual Connection

Add the attention output (input to FFN) to the FFN output:

# Compute residual: attention_output + FFN(attention_output)
residual = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]

print("Residual Connection: input + FFN(input)")
print("=" * 70)
print()
print("Example for position 0 (<BOS>):")
print(f"  Attention output: {format_vector(multi_head_output[0])}")
print(f"  + FFN output:     {format_vector(ffn_output[0])}")
print(f"  = Residual:       {format_vector(residual[0])}")
Residual Connection: input + FFN(input)
======================================================================

Example for position 0 (<BOS>):
  Attention output: [ 0.0334,  0.0033, -0.0041, -0.0073,  0.0185,  0.0074,  0.0169,  0.0107,  0.0277,  0.0060,  0.0222,  0.0241,  0.0074,  0.0067, -0.0067,  0.0063]
  + FFN output:     [ 0.0043, -0.0896,  0.0020,  0.2294,  0.1020,  0.0966, -0.2073,  0.0574,  0.1951,  0.0692, -0.0388, -0.0762,  0.1390, -0.0384,  0.1633,  0.0529]
  = Residual:       [ 0.0377, -0.0863, -0.0020,  0.2221,  0.1205,  0.1041, -0.1904,  0.0681,  0.2228,  0.0752, -0.0165, -0.0520,  0.1464, -0.0317,  0.1566,  0.0591]

Step 2: Layer Normalization

Now we normalize the residual. For each position:

Compute mean:

$$\mu = \frac{1}{d_{model}} \sum_{i=0}^{15} x_i$$

Compute variance:

$$\sigma^2 = \frac{1}{d_{model}} \sum_{i=0}^{15} (x_i - \mu)^2$$

Normalize:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Scale and shift:

$$y_i = \gamma_i \cdot \hat{x}_i + \beta_i$$
def layer_norm(x, gamma, beta, epsilon=1e-5):
    """
    Apply layer normalization to a single vector.
    
    Args:
        x: Input vector [d_model]
        gamma: Scale parameters [d_model]
        beta: Shift parameters [d_model]
        epsilon: Small constant for numerical stability
    
    Returns:
        Normalized vector, mean, variance
    """
    # Compute mean across dimensions
    mean = sum(x) / len(x)
    
    # Compute variance across dimensions
    variance = sum((xi - mean)**2 for xi in x) / len(x)
    
    # Normalize (with epsilon for numerical stability)
    std = math.sqrt(variance + epsilon)
    x_norm = [(xi - mean) / std for xi in x]
    
    # Scale and shift
    output = [gamma[i] * x_norm[i] + beta[i] for i in range(len(x))]
    
    return output, mean, variance

# Initialize gamma and beta (learnable parameters)
gamma = [1.0] * D_MODEL  # Scale, initialized to 1
beta = [0.0] * D_MODEL   # Shift, initialized to 0

print(f"Layer norm parameters:")
print(f"  gamma (scale): initialized to 1.0 for all {D_MODEL} dimensions")
print(f"  beta (shift):  initialized to 0.0 for all {D_MODEL} dimensions")
print()
print(f"With these initial values, LayerNorm just normalizes (no scaling or shifting).")
Layer norm parameters:
  gamma (scale): initialized to 1.0 for all 16 dimensions
  beta (shift):  initialized to 0.0 for all 16 dimensions

With these initial values, LayerNorm just normalizes (no scaling or shifting).
# Detailed calculation for position 0
x = residual[0]
mean = sum(x) / D_MODEL
variance = sum((xi - mean)**2 for xi in x) / D_MODEL
std = math.sqrt(variance + EPSILON)

print("Detailed: Layer Norm for position 0 (<BOS>)")
print("=" * 60)
print()
print(f"Input: {format_vector(x)}")
print()
print(f"Step 1: Compute mean")
print(f"  μ = sum(x) / {D_MODEL} = {mean:.6f}")
print()
print(f"Step 2: Compute variance")
print(f"  σ² = sum((x - μ)²) / {D_MODEL} = {variance:.6f}")
print()
print(f"Step 3: Standard deviation (with epsilon for stability)")
print(f"  σ = sqrt({variance:.6f} + {EPSILON}) = {std:.6f}")
print()
print(f"Step 4: Normalize each element")
for i in range(3):
    norm_val = (x[i] - mean) / std
    print(f"  x̂[{i}] = ({x[i]:.4f} - {mean:.4f}) / {std:.4f} = {norm_val:.4f}")
print("  ...")
Detailed: Layer Norm for position 0 (<BOS>)
============================================================

Input: [ 0.0377, -0.0863, -0.0020,  0.2221,  0.1205,  0.1041, -0.1904,  0.0681,  0.2228,  0.0752, -0.0165, -0.0520,  0.1464, -0.0317,  0.1566,  0.0591]

Step 1: Compute mean
  μ = sum(x) / 16 = 0.052101

Step 2: Compute variance
  σ² = sum((x - μ)²) / 16 = 0.011860

Step 3: Standard deviation (with epsilon for stability)
  σ = sqrt(0.011860 + 1e-05) = 0.108949

Step 4: Normalize each element
  x̂[0] = (0.0377 - 0.0521) / 0.1089 = -0.1319
  x̂[1] = (-0.0863 - 0.0521) / 0.1089 = -1.2702
  x̂[2] = (-0.0020 - 0.0521) / 0.1089 = -0.4969
  ...
# Apply layer norm to all positions
layer_norm_output = []

for i in range(seq_len):
    output, _, _ = layer_norm(residual[i], gamma, beta, EPSILON)
    layer_norm_output.append(output)

print("Layer Norm Output")
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(layer_norm_output):
    print(f"  {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
Layer Norm Output
Shape: [5, 16]

  [-0.1319, -1.2702, -0.4969,  1.5600,  0.6278,  0.4769, -2.2261,  0.1465,  1.5670,  0.2124, -0.6301, -0.9558,  0.8655, -0.7691,  0.9595,  0.0645]  # <BOS>
  [-0.1476, -1.1384, -0.3134,  1.5437,  0.5944,  0.5080, -2.3531,  0.0272,  1.5526,  0.1223, -0.6328, -1.0414,  0.7749, -0.6856,  1.0133,  0.1760]  # I
  [-0.2966, -1.0966, -0.2274,  1.5115,  0.5190,  0.6730, -2.2896,  0.0987,  1.4495,  0.2025, -0.7481, -1.1420,  0.8840, -0.6908,  1.0545,  0.0984]  # like
  [-0.2772, -1.1953, -0.2618,  1.5366,  0.5803,  0.6281, -2.2785,  0.1627,  1.4394,  0.1992, -0.6806, -1.1222,  0.8545, -0.6914,  1.0266,  0.0796]  # transformers
  [-0.3033, -1.1778, -0.2499,  1.5637,  0.5494,  0.5963, -2.2625,  0.1577,  1.4126,  0.2331, -0.7348, -1.1068,  0.8813, -0.6716,  1.0751,  0.0376]  # <EOS>

Verification: Did It Work?

Layer norm should give us mean ≈ 0 and variance ≈ 1 for each position. Let’s check:

print("Verification: Statistics after LayerNorm")
print("=" * 60)
print()
for i in range(seq_len):
    x = layer_norm_output[i]
    mean = sum(x) / len(x)
    var = sum((xi - mean)**2 for xi in x) / len(x)
    print(f"Position {i} ({TOKEN_NAMES[tokens[i]]:12s}): mean = {mean:10.6f}, variance = {var:.6f}")

print()
print("Mean ≈ 0, variance ≈ 1 for all positions. Layer norm worked!")
print()
print("(Variance is 0.999 instead of 1.0 because we divide by n, not n-1)")
Verification: Statistics after LayerNorm
============================================================

Position 0 (<BOS>       ): mean =  -0.000000, variance = 0.999158
Position 1 (I           ): mean =  -0.000000, variance = 0.999175
Position 2 (like        ): mean =  -0.000000, variance = 0.999211
Position 3 (transformers): mean =  -0.000000, variance = 0.999223
Position 4 (<EOS>       ): mean =   0.000000, variance = 0.999231

Mean ≈ 0, variance ≈ 1 for all positions. Layer norm worked!

(Variance is 0.999 instead of 1.0 because epsilon is added inside the square root: the output variance is σ²/(σ² + ε) ≈ 0.999.)

Why Epsilon Matters

That tiny $\epsilon = 10^{-5}$ prevents disaster.

If variance happened to be exactly zero (all values identical), we’d divide by zero. Even if variance is just very small, division by a tiny number creates huge values and numerical instability.

Adding epsilon ensures we never divide by anything smaller than $\sqrt{10^{-5}} \approx 0.003$.
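
To see what epsilon protects against, here is a small sketch with a hypothetical constant vector (reusing math and EPSILON from the cells above). With zero variance, plain normalization divides by zero; adding epsilon keeps the result well-defined:

# Sketch: epsilon prevents division by zero for a (hypothetical) constant vector
x_flat = [0.25, 0.25, 0.25, 0.25]            # all values identical -> variance is exactly 0
mean_f = sum(x_flat) / len(x_flat)
var_f = sum((v - mean_f) ** 2 for v in x_flat) / len(x_flat)

try:
    _ = [(v - mean_f) / math.sqrt(var_f) for v in x_flat]             # no epsilon
except ZeroDivisionError:
    print("Without epsilon: ZeroDivisionError")

with_eps = [(v - mean_f) / math.sqrt(var_f + EPSILON) for v in x_flat]
print("With epsilon:   ", with_eps)          # well-defined: all zeros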

Before and After

Let’s see how the representation changed:

print("Position 1 ('I') - Before and After LayerNorm")
print("=" * 70)
print()
print(f"Before (residual):")
print(f"  {format_vector(residual[1])}")
print()
print(f"After (layer norm):")
print(f"  {format_vector(layer_norm_output[1])}")
print()
print("The values are scaled and shifted, but relative relationships preserved.")
print("Large values stay large, small values stay small.")
Position 1 ('I') - Before and After LayerNorm
======================================================================

Before (residual):
  [ 0.0281, -0.0810,  0.0098,  0.2144,  0.1098,  0.1003, -0.2148,  0.0474,  0.2153,  0.0578, -0.0253, -0.0703,  0.1297, -0.0311,  0.1560,  0.0637]

After (layer norm):
  [-0.1476, -1.1384, -0.3134,  1.5437,  0.5944,  0.5080, -2.3531,  0.0272,  1.5526,  0.1223, -0.6328, -1.0414,  0.7749, -0.6856,  1.0133,  0.1760]

The values are scaled and shifted, but relative relationships preserved.
Large values stay large, small values stay small.
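
Because layer norm (with $\gamma = 1$, $\beta = 0$) is a per-position affine map with a positive scale, it cannot reorder the values. A quick check on position 1, reusing residual and layer_norm_output from the cells above:

# Check: normalization preserves the relative ordering of the dimensions
before = residual[1]
after = layer_norm_output[1]

order_before = sorted(range(D_MODEL), key=lambda i: before[i])
order_after = sorted(range(D_MODEL), key=lambda i: after[i])
print("Same ordering before and after layer norm:", order_before == order_after)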

The Complete Transformer Block

We just finished one complete transformer block! We computed:

Input X [5, 16]
    ↓
Multi-Head Attention
    ↓
+ Residual Connection (add X back)
    ↓
Layer Norm
    ↓
Feed-Forward Network
    ↓
+ Residual Connection (add previous output back)
    ↓
Layer Norm
    ↓
Output [5, 16]

(Note: We simplified here by applying just one residual connection and layer norm, after the FFN, and skipped the residual + layer norm that normally wraps the attention sub-layer; the diagram above shows the full ordering. Modern GPT-style models also usually place the layer norm before each sub-layer (“pre-LN”) rather than after, but the concepts are the same.)

In GPT-3, this block is repeated 96 times. Each block refines the representation further.
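
To tie the pieces together, here is a hedged sketch of one full post-LN block in the diagram’s ordering, reusing the helpers defined earlier in this notebook (random_matrix, matmul, add_vectors, compute_attention, gelu, layer_norm). The weights inside are fresh random matrices for illustration only, and biases are omitted for brevity; this is not the exact computation performed above.

def transformer_block(X_in):
    """One post-LN transformer block matching the diagram above (illustrative sketch)."""
    n = len(X_in)

    # Multi-head attention sub-layer (fresh random weights, biases omitted)
    heads = []
    for _ in range(NUM_HEADS):
        Q = matmul(X_in, random_matrix(D_MODEL, D_K))
        K = matmul(X_in, random_matrix(D_MODEL, D_K))
        V = matmul(X_in, random_matrix(D_MODEL, D_K))
        heads.append(compute_attention(Q, K, V))
    concat = [[v for head in heads for v in head[i]] for i in range(n)]
    attn = matmul(concat, random_matrix(D_MODEL, D_MODEL))

    # Residual + layer norm around attention
    res1 = [add_vectors(X_in[i], attn[i]) for i in range(n)]
    norm1 = [layer_norm(v, gamma, beta, EPSILON)[0] for v in res1]

    # Feed-forward sub-layer
    hidden_ffn = [[gelu(h) for h in row] for row in matmul(norm1, random_matrix(D_MODEL, D_FF))]
    ffn = matmul(hidden_ffn, random_matrix(D_FF, D_MODEL))

    # Residual + layer norm around the FFN
    res2 = [add_vectors(norm1[i], ffn[i]) for i in range(n)]
    return [layer_norm(v, gamma, beta, EPSILON)[0] for v in res2]

# Each stacked block refines the representation (GPT-3 stacks 96 of them)
out = X
for _ in range(2):
    out = transformer_block(out)
print(f"Output shape after 2 stacked blocks: [{len(out)}, {len(out[0])}]")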

What’s Next

The transformer block is done! We have 16-dimensional vectors for each position.

But these aren’t predictions yet. We need to:

  1. Project to vocabulary space: Convert 16D → 6D (one score per token in our vocabulary)

  2. Apply softmax: Convert scores to probabilities

  3. Compute loss: Measure how wrong our predictions are

That’s the next notebook.

# Store for next notebook
layer_norm_data = {
    'X': X,
    'tokens': tokens,
    'layer_norm_output': layer_norm_output,
    'gamma': gamma,
    'beta': beta
}
print("Transformer block complete. Ready for output projection and loss.")
Transformer block complete. Ready for output projection and loss.