Why We Need Non-Linearity¶
Multi-head attention is powerful, but almost everything it does is linear.
Matrix multiplications, weighted sums, projections—these are all linear maps (the softmax only chooses the mixing weights). And stacking linear operations on top of linear operations just gives you... more linear operations. A thousand linear layers can be collapsed into a single linear layer.
To learn truly complex functions, we need non-linearity. That’s where the feed-forward network comes in.
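To see concretely why stacking linear layers buys nothing, here is a small standalone sketch. The 2×2 matrices and the apply/compose helpers are made up purely for illustration: applying two linear layers in sequence gives exactly the same result as applying their matrix product once.
# Two linear layers collapse into one: (x @ A) @ B == x @ (A @ B)
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.0]]
x = [1.0, -2.0]
def apply(matrix, vec):
    # vec @ matrix, where matrix has shape [len(vec), output_dim]
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(len(matrix[0]))]
def compose(M1, M2):
    # matrix product M1 @ M2
    return [[sum(M1[i][k] * M2[k][j] for k in range(len(M2))) for j in range(len(M2[0]))] for i in range(len(M1))]
print(apply(B, apply(A, x)))    # two linear layers in sequence
print(apply(compose(A, B), x))  # one collapsed layer -- identical result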
What the FFN Does¶
The feed-forward network (FFN) is surprisingly simple: a two-layer neural network applied independently to each position.
The architecture:
Expand: Project from 16 dimensions to 64 dimensions (4× expansion)
Activate: Apply GELU non-linearity
Project: Bring back down to 16 dimensions
Why the expansion? More dimensions = more room for complex transformations. The 4× ratio ($d_{ff} = 4 \times d_{model}$, here $64 = 4 \times 16$) is standard in transformers.
Key insight: The FFN is applied independently to each position. No cross-position interaction here. Attention lets tokens communicate; the FFN lets each token process what it’s learned.
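Before building it step by step with the real matrices below, here is a compact single-position sketch of those three steps. The ffn_position helper is written just for this explanation (it is not part of the notebook's shared code), and its activation argument defaults to math.tanh as a stand-in until GELU is defined further down.
import math
def ffn_position(x, W1, b1, W2, b2, activation=math.tanh):
    """FFN for a single position vector x of length d_model.
    W1: [d_ff, d_model], b1: [d_ff], W2: [d_model, d_ff], b2: [d_model]."""
    # 1. Expand: d_model -> d_ff
    hidden = [sum(W1[j][k] * x[k] for k in range(len(x))) + b1[j] for j in range(len(b1))]
    # 2. Activate: element-wise non-linearity (GELU in the real FFN)
    hidden = [activation(h) for h in hidden]
    # 3. Project: d_ff -> d_model
    return [sum(W2[i][j] * hidden[j] for j in range(len(hidden))) + b2[i] for i in range(len(b2))]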
import random
import math
random.seed(42)
VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64 # 4 * D_MODEL, standard expansion ratio
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
# Helper functions
def random_vector(size, scale=0.1):
return [random.gauss(0, scale) for _ in range(size)]
def random_matrix(rows, cols, scale=0.1):
return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]
def add_vectors(v1, v2):
return [a + b for a, b in zip(v1, v2)]
def matmul(A, B):
m, n, p = len(A), len(A[0]), len(B[0])
return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
def transpose(A):
return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]
def softmax(vec):
max_val = max(v for v in vec if v != float('-inf'))
exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
sum_exp = sum(exp_vec)
return [e / sum_exp for e in exp_vec]
def format_vector(vec, decimals=4):
return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"# Recreate multi-head attention output from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]
def compute_attention(Q, K, V):
seq_len, d_k = len(Q), len(Q[0])
scale = math.sqrt(d_k)
scores = matmul(Q, transpose(K))
scaled = [[s / scale for s in row] for row in scores]
for i in range(seq_len):
for j in range(seq_len):
if j > i:
scaled[i][j] = float('-inf')
weights = [softmax(row) for row in scaled]
return matmul(weights, V)
attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
concat_output = [attention_output_all[0][i] + attention_output_all[1][i] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))
print("Recreated multi-head attention output")
print(f"Shape: [{seq_len}, {D_MODEL}]")Recreated multi-head attention output
Shape: [5, 16]
The GELU Activation Function¶
We’re using GELU (Gaussian Error Linear Unit) as our non-linearity. The exact formula is:
$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. In practice, we use a fast approximation:
$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\left(x + 0.044715\,x^3\right)\right)\right)$$
Why GELU instead of ReLU?
ReLU just zeros out negatives: $\text{ReLU}(x) = \max(0, x)$. Simple, but it creates a hard cutoff—dead neurons that never recover.
GELU is smoother. It still emphasizes positive values, but negative values get gently suppressed rather than killed entirely. This smoothness helps gradients flow better during training.
def gelu(x):
"""GELU activation using tanh approximation"""
return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
# Compare GELU vs ReLU
print("GELU vs ReLU")
print("=" * 45)
print(f"{'x':>8} | {'ReLU':>10} | {'GELU':>10}")
print("-" * 45)
for x in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
relu = max(0, x)
gelu_val = gelu(x)
print(f"{x:>8.1f} | {relu:>10.4f} | {gelu_val:>10.4f}")
print()
print("Notice: GELU doesn't completely kill negative values.")
print("At x=-1, ReLU gives 0, but GELU gives -0.159.")GELU vs ReLU
=============================================
x | ReLU | GELU
---------------------------------------------
-2.0 | 0.0000 | -0.0454
-1.0 | 0.0000 | -0.1588
-0.5 | 0.0000 | -0.1543
0.0 | 0.0000 | 0.0000
0.5 | 0.5000 | 0.3457
1.0 | 1.0000 | 0.8412
2.0 | 2.0000 | 1.9546
Notice: GELU doesn't completely kill negative values.
At x=-1, ReLU gives 0, but GELU gives -0.159.
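As a quick sanity check on that approximation, it can be compared against the exact GELU computed from the Gaussian CDF using math.erf from the standard library; over this range the two agree to roughly three decimal places.
# Exact GELU via the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
def gelu_exact(x):
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
print(f"{'x':>6} | {'exact':>9} | {'approx':>9}")
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"{x:>6.1f} | {gelu_exact(x):>9.5f} | {gelu(x):>9.5f}")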
FFN Weights¶
The FFN has four sets of learnable parameters:
| Parameter | Shape | Purpose |
|---|---|---|
| $W_1$ | [64, 16] | Expansion weights |
| $b_1$ | [64] | Expansion bias |
| $W_2$ | [16, 64] | Projection weights |
| $b_2$ | [16] | Projection bias |
Total: 64×16 + 64 + 16×64 + 16 = 2,128 parameters
That’s more than attention! The FFN is actually where most of the parameters live in a transformer.
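To back up that claim for this toy model, here is a quick tally of the attention projection weights (as recreated above, the attention projections have no bias terms) against the FFN parameters:
# Parameter tally for this toy model: attention projections vs. FFN
attn_params = NUM_HEADS * 3 * (D_MODEL * D_K) + D_MODEL * D_MODEL  # W_Q, W_K, W_V per head, plus W_O
ffn_params = D_FF * D_MODEL + D_FF + D_MODEL * D_FF + D_MODEL      # W1 + b1 + W2 + b2
print(f"Attention parameters: {attn_params}")  # 1024
print(f"FFN parameters:       {ffn_params}")   # 2128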
# Initialize FFN weights
W1 = random_matrix(D_FF, D_MODEL) # [64, 16] - expansion
b1 = random_vector(D_FF) # [64]
W2 = random_matrix(D_MODEL, D_FF) # [16, 64] - projection
b2 = random_vector(D_MODEL) # [16]
print(f"FFN Parameters:")
print(f" W1: [{D_FF}, {D_MODEL}] = {D_FF * D_MODEL} values")
print(f" b1: [{D_FF}] = {D_FF} values")
print(f" W2: [{D_MODEL}, {D_FF}] = {D_MODEL * D_FF} values")
print(f" b2: [{D_MODEL}] = {D_MODEL} values")
print(f" Total: {D_FF * D_MODEL + D_FF + D_MODEL * D_FF + D_MODEL} parameters")FFN Parameters:
W1: [64, 16] = 1024 values
b1: [64] = 64 values
W2: [16, 64] = 1024 values
b2: [16] = 16 values
Total: 2128 parameters
Step 1: Expansion Layer¶
First, we expand from 16 dimensions to 64 dimensions:
$$\text{hidden} = x\,W_1^\top + b_1$$
For each position, we take the 16-dimensional vector, multiply by $W_1^\top$ (shape [16, 64]), and add bias $b_1$. Result: 64 dimensions.
# Compute first linear layer: hidden = input @ W1^T + b1
W1_T = transpose(W1)
hidden = matmul(multi_head_output, W1_T)
hidden = [[hidden[i][j] + b1[j] for j in range(D_FF)] for i in range(seq_len)]
print(f"Hidden layer (after expansion)")
print(f"Shape: [{seq_len}, {D_FF}]")
print()
print(f"Position 0 (<BOS>), first 8 of 64 dimensions:")
print(f" {format_vector(hidden[0][:8])}...")Hidden layer (after expansion)
Shape: [5, 64]
Position 0 (<BOS>), first 8 of 64 dimensions:
[ 0.0000, 0.0439, 0.0457, 0.1031, -0.0962, -0.0283, 0.0890, 0.0303]...
Step 2: GELU Activation¶
Apply GELU element-wise to all 64 dimensions at each position.
# Apply GELU activation element-wise
activated = [[gelu(h) for h in row] for row in hidden]
print(f"After GELU activation")
print()
print(f"Position 0, first 4 dimensions:")
print(f" Before GELU: {format_vector(hidden[0][:4])}")
print(f" After GELU: {format_vector(activated[0][:4])}")
print()
print("Values shrink (especially negatives) but maintain sign.")After GELU activation
Position 0, first 4 dimensions:
Before GELU: [ 0.0000, 0.0439, 0.0457, 0.1031]
After GELU: [ 0.0000, 0.0227, 0.0237, 0.0558]
Values shrink (especially negatives) but maintain sign.
Step 3: Projection Layer¶
Finally, we project back down from 64 dimensions to 16: multiply by $W_2^\top$ (shape [64, 16]) and add bias $b_2$.
# Compute second linear layer: output = activated @ W2^T + b2
W2_T = transpose(W2)
ffn_output = matmul(activated, W2_T)
ffn_output = [[ffn_output[i][j] + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]
print(f"FFN Output")
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(ffn_output):
print(f" {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")FFN Output
Shape: [5, 16]
[ 0.0043, -0.0896, 0.0020, 0.2294, 0.1020, 0.0966, -0.2073, 0.0574, 0.1951, 0.0692, -0.0388, -0.0762, 0.1390, -0.0384, 0.1633, 0.0529] # <BOS>
[ 0.0012, -0.0877, -0.0015, 0.2298, 0.0984, 0.0971, -0.2083, 0.0581, 0.1963, 0.0669, -0.0434, -0.0800, 0.1372, -0.0373, 0.1639, 0.0528] # I
[-0.0003, -0.0905, 0.0001, 0.2295, 0.0975, 0.0969, -0.2105, 0.0582, 0.1989, 0.0687, -0.0433, -0.0817, 0.1337, -0.0350, 0.1647, 0.0542] # like
[ 0.0001, -0.0893, -0.0010, 0.2295, 0.0969, 0.0972, -0.2107, 0.0590, 0.1985, 0.0678, -0.0429, -0.0819, 0.1327, -0.0335, 0.1639, 0.0539] # transformers
[-0.0004, -0.0894, -0.0002, 0.2300, 0.0976, 0.0970, -0.2113, 0.0588, 0.1994, 0.0691, -0.0428, -0.0819, 0.1326, -0.0337, 0.1642, 0.0539] # <EOS>
Before and After¶
Let’s compare what went into the FFN (multi-head attention output) with what came out.
print("Position 1 ('I') - Before and After FFN")
print("=" * 70)
print()
print(f"Before FFN (attention output):")
print(f" {format_vector(multi_head_output[1])}")
print()
print(f"After FFN:")
print(f" {format_vector(ffn_output[1])}")
print()
print("The FFN has transformed the representation through")
print("expansion → non-linearity → projection.")Position 1 ('I') - Before and After FFN
======================================================================
Before FFN (attention output):
[ 0.0269, 0.0066, 0.0113, -0.0154, 0.0114, 0.0032, -0.0065, -0.0108, 0.0190, -0.0091, 0.0180, 0.0097, -0.0075, 0.0061, -0.0079, 0.0110]
After FFN:
[ 0.0012, -0.0877, -0.0015, 0.2298, 0.0984, 0.0971, -0.2083, 0.0581, 0.1963, 0.0669, -0.0434, -0.0800, 0.1372, -0.0373, 0.1639, 0.0528]
The FFN has transformed the representation through
expansion → non-linearity → projection.
What the FFN Accomplishes¶
The FFN serves several purposes:
Non-linearity: Attention is linear; GELU adds the non-linear transformations needed to learn complex functions.
Position-wise processing: Each token gets independent processing time. Attention mixed information between tokens; the FFN lets each token digest what it learned (a quick check of this property follows the list).
Feature transformation: The expansion to 64 dimensions gives the model room to create new feature combinations, emphasize important patterns, and suppress noise.
Memory storage: Research suggests FFNs act as key-value memories, storing factual knowledge learned during training.
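Here is that quick check: wrapping the three FFN steps in a small helper (written just for this check, using the weights and helpers defined above) and feeding the positions in reverse order. Because nothing crosses positions, the outputs come back exactly reversed.
# Position-wise check: reversing the input rows reverses the output rows, nothing else changes
def ffn(rows):
    h = matmul(rows, transpose(W1))
    h = [[h[i][j] + b1[j] for j in range(D_FF)] for i in range(len(rows))]
    a = [[gelu(v) for v in row] for row in h]
    out = matmul(a, transpose(W2))
    return [[out[i][j] + b2[j] for j in range(D_MODEL)] for i in range(len(rows))]
reversed_out = ffn(multi_head_output[::-1])
print("Reversed input -> reversed output:", reversed_out == ffn_output[::-1])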
What’s Next¶
There’s a problem: we just replaced the attention output with the FFN output. All that information from attention is gone!
That’s where residual connections come in. Instead of replacing, we’ll add the FFN output to the original input. This way:
The original information is preserved
The FFN learns to compute changes rather than complete replacements
Gradients can flow directly through the residual path
We’ll also apply layer normalization to keep activations stable. That’s the next notebook.
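As a tiny preview of the residual idea (a sketch only, using the vectors computed above; the real version in the next notebook adds layer normalization on top), the residual connection is nothing more than element-wise addition:
# Residual preview: add the FFN output back onto its input so the attention information is kept
residual_preview = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]
print("Position 1 ('I') with residual, first 4 dims:", format_vector(residual_preview[1][:4]))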
# Store for next notebook
ffn_data = {
'X': X,
'tokens': tokens,
'multi_head_output': multi_head_output,
'ffn_output': ffn_output,
'W1': W1, 'b1': b1,
'W2': W2, 'b2': b2
}
print("FFN complete. Ready for layer normalization.")FFN complete. Ready for layer normalization.