Layer Normalization - An Introduction to Transformers

Two Problems to Solve¶

We’ve computed the FFN output. But we have two problems:

Problem 1: Information loss

We replaced the attention output with the FFN output. All that careful work computing Q, K, V, attention scores... gone. The FFN output doesn’t preserve the original information.

Problem 2: Unstable activations

Deep neural networks have a tendency for activations to drift. Some dimensions grow huge, others shrink to near-zero. As you stack more layers, small imbalances compound. Gradients explode or vanish. Training becomes a nightmare.

Two techniques solve these problems: residual connections and layer normalization.

Residual Connections¶

The fix for information loss is beautifully simple: add instead of replace.

\text{output} = \text{input} + \text{FFN}(\text{input})

(1)

This is a residual connection (or skip connection). Instead of learning the full transformation, the FFN only needs to learn the change (the residual).

Benefits:

No information loss: The original input is preserved
Easier learning: Learning deltas is easier than learning full transformations
Better gradients: During backprop, gradients can flow directly through the addition

Residual connections were the key innovation that made very deep networks trainable (ResNet, 2015).

Layer Normalization¶

Even with residual connections, activations can drift. Layer normalization fixes this by normalizing each position’s vector to have:

Mean = 0
Variance = 1

The formula:

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

(2)

Where:

$\mu$ = mean of $x$ across dimensions
$\sigma^2$ = variance of $x$ across dimensions
$\epsilon$ = small constant for numerical stability (prevents division by zero)
$\gamma$ = learnable scale parameter (initialized to 1)
$\beta$ = learnable shift parameter (initialized to 0)
$\odot$ = element-wise multiplication

Why learnable $\gamma$ and $\beta$ ?

If we only normalized, we’d force every position to have exactly mean 0 and variance 1. That’s too restrictive—sometimes non-zero means are useful!

The learnable parameters let the model choose to denormalize if it helps. At initialization ( $\gamma=1$ , $\beta=0$ ), it’s standard normalization. During training, the model learns the optimal scale and shift for each dimension.

import random
import math

random.seed(42)

VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
EPSILON = 1e-5  # Small constant to prevent division by zero

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n, p = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

# Recreate everything from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# Attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
concat_output = [attention_output_all[0][i] + attention_output_all[1][i] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

# FFN
W1 = random_matrix(D_FF, D_MODEL)
b1 = random_vector(D_FF)
W2 = random_matrix(D_MODEL, D_FF)
b2 = random_vector(D_MODEL)
hidden = matmul(multi_head_output, transpose(W1))
hidden = [[hidden[i][j] + b1[j] for j in range(D_FF)] for i in range(seq_len)]
activated = [[gelu(h) for h in row] for row in hidden]
ffn_output = matmul(activated, transpose(W2))
ffn_output = [[ffn_output[i][j] + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

print("Recreated multi-head attention and FFN outputs")

Recreated multi-head attention and FFN outputs

Step 1: Residual Connection¶

Add the attention output (input to FFN) to the FFN output:

# Compute residual: attention_output + FFN(attention_output)
residual = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]

print("Residual Connection: input + FFN(input)")
print("=" * 70)
print()
print("Example for position 0 (<BOS>):")
print(f"  Attention output: {format_vector(multi_head_output[0])}")
print(f"  + FFN output:     {format_vector(ffn_output[0])}")
print(f"  = Residual:       {format_vector(residual[0])}")

Residual Connection: input + FFN(input)
======================================================================

Example for position 0 (<BOS>):
  Attention output: [ 0.0334,  0.0033, -0.0041, -0.0073,  0.0185,  0.0074,  0.0169,  0.0107,  0.0277,  0.0060,  0.0222,  0.0241,  0.0074,  0.0067, -0.0067,  0.0063]
  + FFN output:     [ 0.0043, -0.0896,  0.0020,  0.2294,  0.1020,  0.0966, -0.2073,  0.0574,  0.1951,  0.0692, -0.0388, -0.0762,  0.1390, -0.0384,  0.1633,  0.0529]
  = Residual:       [ 0.0377, -0.0863, -0.0020,  0.2221,  0.1205,  0.1041, -0.1904,  0.0681,  0.2228,  0.0752, -0.0165, -0.0520,  0.1464, -0.0317,  0.1566,  0.0591]

Step 2: Layer Normalization¶

Now we normalize the residual. For each position:

Compute mean:

\mu = \frac{1}{d_{model}} \sum_{i=0}^{15} x_i

(3)

Compute variance:

\sigma^2 = \frac{1}{d_{model}} \sum_{i=0}^{15} (x_i - \mu)^2

(4)

Normalize:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

(5)

Scale and shift:

y_i = \gamma_i \cdot \hat{x}_i + \beta_i

(6)

def layer_norm(x, gamma, beta, epsilon=1e-5):
    """
    Apply layer normalization to a single vector.
    
    Args:
        x: Input vector [d_model]
        gamma: Scale parameters [d_model]
        beta: Shift parameters [d_model]
        epsilon: Small constant for numerical stability
    
    Returns:
        Normalized vector, mean, variance
    """
    # Compute mean across dimensions
    mean = sum(x) / len(x)
    
    # Compute variance across dimensions
    variance = sum((xi - mean)**2 for xi in x) / len(x)
    
    # Normalize (with epsilon for numerical stability)
    std = math.sqrt(variance + epsilon)
    x_norm = [(xi - mean) / std for xi in x]
    
    # Scale and shift
    output = [gamma[i] * x_norm[i] + beta[i] for i in range(len(x))]
    
    return output, mean, variance

# Initialize gamma and beta (learnable parameters)
gamma = [1.0] * D_MODEL  # Scale, initialized to 1
beta = [0.0] * D_MODEL   # Shift, initialized to 0

print(f"Layer norm parameters:")
print(f"  gamma (scale): initialized to 1.0 for all {D_MODEL} dimensions")
print(f"  beta (shift):  initialized to 0.0 for all {D_MODEL} dimensions")
print()
print(f"With these initial values, LayerNorm just normalizes (no scaling or shifting).")

Layer norm parameters:
  gamma (scale): initialized to 1.0 for all 16 dimensions
  beta (shift):  initialized to 0.0 for all 16 dimensions

With these initial values, LayerNorm just normalizes (no scaling or shifting).

# Detailed calculation for position 0
x = residual[0]
mean = sum(x) / D_MODEL
variance = sum((xi - mean)**2 for xi in x) / D_MODEL
std = math.sqrt(variance + EPSILON)

print("Detailed: Layer Norm for position 0 (<BOS>)")
print("=" * 60)
print()
print(f"Input: {format_vector(x)}")
print()
print(f"Step 1: Compute mean")
print(f"  μ = sum(x) / {D_MODEL} = {mean:.6f}")
print()
print(f"Step 2: Compute variance")
print(f"  σ² = sum((x - μ)²) / {D_MODEL} = {variance:.6f}")
print()
print(f"Step 3: Standard deviation (with epsilon for stability)")
print(f"  σ = sqrt({variance:.6f} + {EPSILON}) = {std:.6f}")
print()
print(f"Step 4: Normalize each element")
for i in range(3):
    norm_val = (x[i] - mean) / std
    print(f"  x̂[{i}] = ({x[i]:.4f} - {mean:.4f}) / {std:.4f} = {norm_val:.4f}")
print("  ...")

Detailed: Layer Norm for position 0 (<BOS>)
============================================================

Input: [ 0.0377, -0.0863, -0.0020,  0.2221,  0.1205,  0.1041, -0.1904,  0.0681,  0.2228,  0.0752, -0.0165, -0.0520,  0.1464, -0.0317,  0.1566,  0.0591]

Step 1: Compute mean
  μ = sum(x) / 16 = 0.052101

Step 2: Compute variance
  σ² = sum((x - μ)²) / 16 = 0.011860

Step 3: Standard deviation (with epsilon for stability)
  σ = sqrt(0.011860 + 1e-05) = 0.108949

Step 4: Normalize each element
  x̂[0] = (0.0377 - 0.0521) / 0.1089 = -0.1319
  x̂[1] = (-0.0863 - 0.0521) / 0.1089 = -1.2702
  x̂[2] = (-0.0020 - 0.0521) / 0.1089 = -0.4969
  ...

# Apply layer norm to all positions
layer_norm_output = []

for i in range(seq_len):
    output, _, _ = layer_norm(residual[i], gamma, beta, EPSILON)
    layer_norm_output.append(output)

print("Layer Norm Output")
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(layer_norm_output):
    print(f"  {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")

Layer Norm Output
Shape: [5, 16]

  [-0.1319, -1.2702, -0.4969,  1.5600,  0.6278,  0.4769, -2.2261,  0.1465,  1.5670,  0.2124, -0.6301, -0.9558,  0.8655, -0.7691,  0.9595,  0.0645]  # <BOS>
  [-0.1476, -1.1384, -0.3134,  1.5437,  0.5944,  0.5080, -2.3531,  0.0272,  1.5526,  0.1223, -0.6328, -1.0414,  0.7749, -0.6856,  1.0133,  0.1760]  # I
  [-0.2966, -1.0966, -0.2274,  1.5115,  0.5190,  0.6730, -2.2896,  0.0987,  1.4495,  0.2025, -0.7481, -1.1420,  0.8840, -0.6908,  1.0545,  0.0984]  # like
  [-0.2772, -1.1953, -0.2618,  1.5366,  0.5803,  0.6281, -2.2785,  0.1627,  1.4394,  0.1992, -0.6806, -1.1222,  0.8545, -0.6914,  1.0266,  0.0796]  # transformers
  [-0.3033, -1.1778, -0.2499,  1.5637,  0.5494,  0.5963, -2.2625,  0.1577,  1.4126,  0.2331, -0.7348, -1.1068,  0.8813, -0.6716,  1.0751,  0.0376]  # <EOS>

Verification: Did It Work?¶

Layer norm should give us mean ≈ 0 and variance ≈ 1 for each position. Let’s check:

print("Verification: Statistics after LayerNorm")
print("=" * 60)
print()
for i in range(seq_len):
    x = layer_norm_output[i]
    mean = sum(x) / len(x)
    var = sum((xi - mean)**2 for xi in x) / len(x)
    print(f"Position {i} ({TOKEN_NAMES[tokens[i]]:12s}): mean = {mean:10.6f}, variance = {var:.6f}")

print()
print("Mean ≈ 0, variance ≈ 1 for all positions. Layer norm worked!")
print()
print("(Variance is 0.999 instead of 1.0 because we divide by n, not n-1)")

Verification: Statistics after LayerNorm
============================================================

Position 0 (<BOS>       ): mean =  -0.000000, variance = 0.999158
Position 1 (I           ): mean =  -0.000000, variance = 0.999175
Position 2 (like        ): mean =  -0.000000, variance = 0.999211
Position 3 (transformers): mean =  -0.000000, variance = 0.999223
Position 4 (<EOS>       ): mean =   0.000000, variance = 0.999231

Mean ≈ 0, variance ≈ 1 for all positions. Layer norm worked!

(Variance is 0.999 instead of 1.0 because we divide by n, not n-1)

Why Epsilon Matters¶

That tiny $\epsilon = 10^{-5}$ prevents disaster.

If variance happened to be exactly zero (all values identical), we’d divide by zero. Even if variance is just very small, division by a tiny number creates huge values and numerical instability.

Adding epsilon ensures we never divide by anything smaller than $\sqrt{10^{-5}} \approx 0.003$ .

Before and After¶

Let’s see how the representation changed:

print("Position 1 ('I') - Before and After LayerNorm")
print("=" * 70)
print()
print(f"Before (residual):")
print(f"  {format_vector(residual[1])}")
print()
print(f"After (layer norm):")
print(f"  {format_vector(layer_norm_output[1])}")
print()
print("The values are scaled and shifted, but relative relationships preserved.")
print("Large values stay large, small values stay small.")

Position 1 ('I') - Before and After LayerNorm
======================================================================

Before (residual):
  [ 0.0281, -0.0810,  0.0098,  0.2144,  0.1098,  0.1003, -0.2148,  0.0474,  0.2153,  0.0578, -0.0253, -0.0703,  0.1297, -0.0311,  0.1560,  0.0637]

After (layer norm):
  [-0.1476, -1.1384, -0.3134,  1.5437,  0.5944,  0.5080, -2.3531,  0.0272,  1.5526,  0.1223, -0.6328, -1.0414,  0.7749, -0.6856,  1.0133,  0.1760]

The values are scaled and shifted, but relative relationships preserved.
Large values stay large, small values stay small.

The Complete Transformer Block¶

We just finished one complete transformer block! We computed:

Input X [5, 16]
    ↓
Multi-Head Attention
    ↓
+ Residual Connection (add X back)
    ↓
Layer Norm
    ↓
Feed-Forward Network
    ↓
+ Residual Connection (add previous output back)
    ↓
Layer Norm
    ↓
Output [5, 16]

(Note: We simplified by doing attention → FFN → residual → layer norm. Real transformers often interleave differently, but the concepts are the same.)

In GPT-3, this block is repeated 96 times. Each block refines the representation further.

What’s Next¶

The transformer block is done! We have 16-dimensional vectors for each position.

But these aren’t predictions yet. We need to:

Project to vocabulary space: Convert 16D → 6D (one score per token in our vocabulary)
Apply softmax: Convert scores to probabilities
Compute loss: Measure how wrong our predictions are

That’s the next notebook.

# Store for next notebook
layer_norm_data = {
    'X': X,
    'tokens': tokens,
    'layer_norm_output': layer_norm_output,
    'gamma': gamma,
    'beta': beta
}
print("Transformer block complete. Ready for output projection and loss.")

Transformer block complete. Ready for output projection and loss.