The Core Idea of Attention¶
Attention lets each token gather information from other tokens.
When processing the word “like” in “I like transformers,” the model might want to know: what’s the subject? What’s the object? What came before? Attention lets the model look at other positions and pull in relevant information.
But how does a token decide which other tokens are “relevant”? That’s where Query, Key, and Value come in.
The Database Analogy¶
Think of attention like a fuzzy database lookup:
Query (Q): “What am I looking for?”
Key (K): “What do I contain?” (the label or tag)
Value (V): “What information should I return?” (the actual content)
In a normal database, you query with an exact key and get back the matching value. In attention, you query with a vector, compare it to all keys, and get back a weighted combination of all values—weighted by how well each key matches your query.
The “fuzzy” part is crucial. There’s no exact match. Every key contributes something; good matches contribute more.
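To make the analogy concrete, here is a minimal, self-contained sketch of a fuzzy lookup with made-up toy vectors (these are not the model's real Q, K, V, which we build later in this notebook): compare one query against every key with a dot product, turn the scores into weights with a softmax, and return a weighted mix of all values.
import math
# Toy fuzzy lookup: illustrative values only, not the model's real Q/K/V
query = [1.0, 0.0]                                  # "what am I looking for?"
keys = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]         # "what do I contain?"
values = [[10.0], [20.0], [30.0]]                   # "what do I return?"
# 1. Similarity between the query and every key (dot products)
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
# 2. Softmax: turn scores into positive weights that sum to 1
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
# 3. Weighted combination of all values -- every value contributes a little
result = sum(w * v[0] for w, v in zip(weights, values))
print(f"scores  = {scores}")
print(f"weights = {[round(w, 3) for w in weights]}")
print(f"result  = {result:.2f}")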
Why Three Separate Projections?¶
You might wonder: why not just use the embeddings directly? Why create separate Q, K, V representations?
Here’s the insight: what you’re looking for might be different from what you contain.
Consider the word “it” in “The cat sat on the mat. It was tired.”
As a query, “it” is looking for its antecedent (what does “it” refer to?)
As a key, “it” is saying “I’m a pronoun that could be referenced”
As a value, “it” contains information about being a subject, being tired, etc.
These are different roles. The same token needs to express different things depending on whether it’s doing the looking (query) or being looked at (key/value).
Separate projections let the model learn these different roles independently.
Multi-Head Attention: Multiple Perspectives¶
We’re using multi-head attention with 2 heads. What does that mean?
Each head is an independent attention mechanism with its own Q, K, V projections. Different heads can learn to focus on different things:
Head 0 might learn syntactic patterns (subject-verb relationships)
Head 1 might learn semantic patterns (related concepts)
It’s like having multiple experts examine the same data from different angles.
Our architecture:
d_model = 16 (embedding dimension)
num_heads = 2
d_k = d_model / num_heads = 8 (dimension per head)
Each head projects from 16 dimensions down to 8 dimensions. Later, we’ll concatenate the 2 heads back to 16 dimensions.
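As a quick preview of the dimension bookkeeping (a sketch with placeholder vectors, not the real head outputs we compute below), concatenating the two 8-dimensional per-head outputs for one token recovers a 16-dimensional vector:
# Dimension bookkeeping sketch: placeholder per-head outputs for a single token
d_model, num_heads = 16, 2
d_k = d_model // num_heads                 # 8 dimensions per head
head_0_out = [0.0] * d_k                   # stand-in for head 0's output
head_1_out = [1.0] * d_k                   # stand-in for head 1's output
combined = head_0_out + head_1_out         # concatenate: 8 + 8 = 16
print(f"Combined dimension: {len(combined)}")   # back to d_model = 16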
import random
# Set seed for reproducibility (same as previous notebook)
random.seed(42)
# Model dimensions
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS # 8 dimensions per head
TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
print(f"Embedding dimension (d_model): {D_MODEL}")
print(f"Number of attention heads: {NUM_HEADS}")
print(f"Dimension per head (d_k): {D_K}")Embedding dimension (d_model): 16
Number of attention heads: 2
Dimension per head (d_k): 8
# Helper functions
def random_vector(size, scale=0.1):
"""Generate a random vector with values drawn from N(0, scale^2)"""
return [random.gauss(0, scale) for _ in range(size)]
def random_matrix(rows, cols, scale=0.1):
"""Generate a random matrix with values drawn from N(0, scale^2)"""
return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]
def add_vectors(v1, v2):
"""Element-wise addition of two vectors"""
return [a + b for a, b in zip(v1, v2)]
def format_vector(vec, decimals=4):
"""Format a vector as a readable string"""
return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"# Recreate embeddings from previous notebook (same random seed ensures same values)
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2] # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)
# Compute input embeddings X
token_embeddings = [E_token[token_id] for token_id in tokens]
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]
print(f"Input matrix X recreated from previous notebook")
print(f"Shape: [{seq_len}, {D_MODEL}]")Input matrix X recreated from previous notebook
Shape: [5, 16]
The Projection Weights¶
For each attention head, we have three weight matrices:
$W_Q$: Query projection, shape [d_model, d_k] = [16, 8]
$W_K$: Key projection, shape [d_model, d_k] = [16, 8]
$W_V$: Value projection, shape [d_model, d_k] = [16, 8]
Each matrix projects from 16 dimensions to 8 dimensions. These weights are learned during training—the model figures out what projections are useful for the prediction task.
With 2 heads, we have 6 weight matrices total (3 per head).
# Initialize weight matrices for each head
W_Q = [] # Query weights
W_K = [] # Key weights
W_V = [] # Value weights
for head in range(NUM_HEADS):
W_Q.append(random_matrix(D_MODEL, D_K)) # [16, 8]
W_K.append(random_matrix(D_MODEL, D_K)) # [16, 8]
W_V.append(random_matrix(D_MODEL, D_K)) # [16, 8]
print(f"Initialized {NUM_HEADS} heads, each with:")
print(f" W_Q: [{D_MODEL}, {D_K}]")
print(f" W_K: [{D_MODEL}, {D_K}]")
print(f" W_V: [{D_MODEL}, {D_K}]")
print(f"\nTotal weight matrices: {NUM_HEADS * 3}")Initialized 2 heads, each with:
W_Q: [16, 8]
W_K: [16, 8]
W_V: [16, 8]
Total weight matrices: 6
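Those six matrices hold only a handful of scalar weights; here is a quick back-of-the-envelope count using the dimensions defined above (a sketch, just for intuition about scale):
# Parameter count for the projection weights: 3 matrices per head, each [d_model, d_k]
params_per_matrix = D_MODEL * D_K                  # 16 * 8 = 128
params_per_head = 3 * params_per_matrix            # Q, K, V -> 384
total_proj_params = NUM_HEADS * params_per_head    # 2 heads -> 768
print(f"Total projection parameters: {total_proj_params}")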
Matrix Multiplication: A Quick Review¶
The projection operation is just matrix multiplication. Let’s make sure we understand exactly what that means.
When we multiply matrices $A$ and $B$:
$A$ has shape [m, n]
$B$ has shape [n, p]
The result $C = AB$ has shape [m, p]
The key rule: the number of columns in $A$ must equal the number of rows in $B$.
Each element of the result is a dot product:
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
In words: take row $i$ from $A$, take column $j$ from $B$, multiply element-wise, and sum.
def matmul(A, B):
"""
Multiply matrices A @ B.
A has shape [m, n], B has shape [n, p], result has shape [m, p].
"""
m = len(A) # number of rows in A
n = len(A[0]) # number of columns in A (= rows in B)
p = len(B[0]) # number of columns in B
# Initialize result matrix with zeros
result = [[0.0] * p for _ in range(m)]
# Compute each element
for i in range(m):
for j in range(p):
# Dot product of row i from A and column j from B
result[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
return result
# Quick example to verify
A = [[1, 2, 3], [4, 5, 6]] # [2, 3]
B = [[1, 4], [2, 5], [3, 6]] # [3, 2]
result = matmul(A, B) # Should be [2, 2]
print("Example: [2, 3] @ [3, 2] = [2, 2]")
print(f"A = {A}")
print(f"B = {B}")
print(f"A @ B = {result}")
print()
print("Verification:")
print(f" result[0][0] = 1*1 + 2*2 + 3*3 = {1*1 + 2*2 + 3*3}")
print(f" result[0][1] = 1*4 + 2*5 + 3*6 = {1*4 + 2*5 + 3*6}")Example: [2, 3] @ [3, 2] = [2, 2]
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 4], [2, 5], [3, 6]]
A @ B = [[14, 32], [32, 77]]
Verification:
result[0][0] = 1*1 + 2*2 + 3*3 = 14
result[0][1] = 1*4 + 2*5 + 3*6 = 32
Computing Q, K, V¶
Now we can compute the projections. For each head:
$$Q = X W_Q \qquad K = X W_K \qquad V = X W_V$$
Each row of $Q$ is the query vector for one token. Same for $K$ and $V$.
Let’s compute them for both heads.
# Compute Q, K, V for each head
Q_all = [] # Will hold Q matrices for each head
K_all = [] # Will hold K matrices for each head
V_all = [] # Will hold V matrices for each head
for head in range(NUM_HEADS):
Q = matmul(X, W_Q[head]) # [5, 16] @ [16, 8] = [5, 8]
K = matmul(X, W_K[head]) # [5, 16] @ [16, 8] = [5, 8]
V = matmul(X, W_V[head]) # [5, 16] @ [16, 8] = [5, 8]
Q_all.append(Q)
K_all.append(K)
V_all.append(V)
print(f"Computed Q, K, V for {NUM_HEADS} heads")
print(f"Each Q, K, V has shape [{seq_len}, {D_K}]")Computed Q, K, V for 2 heads
Each Q, K, V has shape [5, 8]
Detailed Example: Computing One Query Vector¶
Let’s trace through exactly how we compute the query vector for position 0 (<BOS>) in head 0.
We’re computing:
$$Q[0] = X[0] \, W_Q$$
Where:
$X[0]$ is a 16-dimensional vector (the embedding for <BOS>)
$W_Q$ is a [16, 8] matrix (head 0's query weights)
$Q[0]$ is an 8-dimensional vector (the query for <BOS>)
Each element of $Q[0]$ is a dot product between $X[0]$ and one column of $W_Q$.
print("Computing Q[0] for Head 0 (query for <BOS>)")
print("=" * 70)
print()
print(f"Input: X[0] (embedding for <BOS>), shape [16]")
print(f" {format_vector(X[0])}")
print()
print(f"Weight: W_Q[0], shape [16, 8]")
print(f" (16 rows, 8 columns - too big to print fully)")
print()
print(f"Output: Q[0] = X[0] @ W_Q[0], shape [8]")
print()
# Show detailed calculation for first two output dimensions
for j in range(2):
print(f"Q[0][{j}] = X[0] · W_Q[0][:, {j}] (dot product with column {j})")
# Get column j of W_Q[0]
col_j = [W_Q[0][i][j] for i in range(D_MODEL)]
# Show first few terms
terms = [f"({X[0][i]:.4f} × {col_j[i]:.4f})" for i in range(3)]
print(f" = {' + '.join(terms)} + ...")
# Compute actual value
value = sum(X[0][i] * col_j[i] for i in range(D_MODEL))
print(f" = {value:.4f}")
print()
print(f"Full result: Q[0] = {format_vector(Q_all[0][0])}")Computing Q[0] for Head 0 (query for <BOS>)
======================================================================
Input: X[0] (embedding for <BOS>), shape [16]
[ 0.1473, 0.1281, 0.1995, -0.0465, 0.2125, -0.1338, -0.0829, -0.0638, 0.0722, 0.1183, 0.1193, 0.0937, -0.1594, -0.0402, 0.1124, -0.2064]
Weight: W_Q[0], shape [16, 8]
(16 rows, 8 columns - too big to print fully)
Output: Q[0] = X[0] @ W_Q[0], shape [8]
Q[0][0] = X[0] · W_Q[0][:, 0] (dot product with column 0)
= (0.1473 × 0.0871) + (0.1281 × -0.0745) + (0.1995 × 0.0003) + ...
= -0.0179
Q[0][1] = X[0] · W_Q[0][:, 1] (dot product with column 1)
= (0.1473 × 0.0608) + (0.1281 × 0.0523) + (0.1995 × 0.2266) + ...
= 0.1390
Full result: Q[0] = [-0.0179, 0.1390, -0.1115, 0.0441, -0.0565, -0.0221, 0.1540, -0.0131]
Head 0: All Q, K, V Matrices¶
head = 0
print(f"HEAD {head}: Query Matrix Q")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[head]):
print(f" Q[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 0: Query Matrix Q
Shape: [5, 8]
Q[0] = [-0.0179, 0.1390, -0.1115, 0.0441, -0.0565, -0.0221, 0.1540, -0.0131] # <BOS>
Q[1] = [-0.0997, -0.0394, 0.0301, 0.0469, 0.0628, -0.0026, -0.0506, 0.0320] # I
Q[2] = [-0.0154, 0.0507, -0.0404, 0.0923, 0.0319, -0.0150, 0.0833, -0.0375] # like
Q[3] = [ 0.0012, -0.0905, 0.0421, 0.0099, 0.1038, 0.0244, -0.0546, -0.0397] # transformers
Q[4] = [ 0.0812, 0.0104, 0.0022, 0.0003, -0.0376, 0.0182, 0.0318, -0.0184] # <EOS>
print(f"HEAD {head}: Key Matrix K")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[head]):
print(f" K[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 0: Key Matrix K
Shape: [5, 8]
K[0] = [ 0.0817, 0.0209, 0.0114, 0.0069, 0.0258, 0.0144, -0.0401, 0.0410] # <BOS>
K[1] = [-0.0228, 0.0577, -0.0045, -0.0131, 0.0082, -0.0335, 0.0272, 0.0137] # I
K[2] = [ 0.0675, -0.0504, -0.1121, 0.0738, 0.0479, -0.1313, 0.0103, 0.0228] # like
K[3] = [-0.1202, 0.1335, 0.0520, 0.0626, -0.0597, 0.0077, 0.0658, -0.0298] # transformers
K[4] = [ 0.0189, -0.0549, 0.0358, -0.0400, -0.0008, 0.0210, 0.0411, -0.0375] # <EOS>
print(f"HEAD {head}: Value Matrix V")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[head]):
print(f" V[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 0: Value Matrix V
Shape: [5, 8]
V[0] = [-0.0090, -0.0398, 0.0085, -0.0527, -0.0375, -0.0001, -0.0328, 0.0792] # <BOS>
V[1] = [ 0.0048, 0.0079, 0.0088, 0.0217, -0.1038, -0.0268, 0.0811, -0.1041] # I
V[2] = [ 0.0390, 0.1344, 0.0726, 0.0888, 0.0703, 0.1238, -0.1341, 0.1226] # like
V[3] = [-0.0103, 0.0407, -0.0746, 0.0207, 0.0585, -0.0899, 0.0405, -0.0838] # transformers
V[4] = [-0.0202, -0.0619, -0.0048, -0.0391, 0.0689, -0.0415, -0.0032, 0.0630] # <EOS>
Head 1: Different Projections, Different Representation¶
Head 1 has its own weight matrices, so it produces completely different Q, K, V representations from the same input. This is the “multi” in multi-head attention—multiple parallel views of the data.
head = 1
print(f"HEAD {head}: Query Matrix Q")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[head]):
print(f" Q[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 1: Query Matrix Q
Shape: [5, 8]
Q[0] = [-0.0801, 0.0205, -0.0577, 0.0358, 0.0203, -0.0472, 0.1419, 0.0332] # <BOS>
Q[1] = [ 0.0791, 0.0428, -0.0408, 0.0261, -0.0520, -0.0152, -0.0639, -0.0355] # I
Q[2] = [-0.0232, 0.0231, -0.0204, 0.0449, 0.0019, 0.0651, 0.0958, -0.0080] # like
Q[3] = [ 0.0913, 0.0219, 0.0457, -0.0627, 0.0176, -0.1209, -0.1008, -0.0297] # transformers
Q[4] = [ 0.0314, -0.0331, -0.0224, -0.0109, -0.0103, -0.0073, 0.0198, -0.0383] # <EOS>
print(f"HEAD {head}: Key Matrix K")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[head]):
print(f" K[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 1: Key Matrix K
Shape: [5, 8]
K[0] = [ 0.0800, 0.0257, -0.0117, -0.1056, 0.0339, -0.0891, -0.0083, -0.0737] # <BOS>
K[1] = [ 0.0565, 0.0479, -0.0409, -0.0089, -0.0037, 0.0547, -0.0085, -0.0782] # I
K[2] = [-0.0624, 0.1632, 0.0750, -0.0765, 0.0238, 0.0042, -0.0385, 0.1000] # like
K[3] = [ 0.0276, -0.0325, -0.0956, 0.0622, -0.0129, -0.0202, -0.0572, 0.0587] # transformers
K[4] = [ 0.0606, -0.0210, -0.0280, -0.0021, 0.0533, 0.0302, -0.0483, 0.0772] # <EOS>
print(f"HEAD {head}: Value Matrix V")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[head]):
print(f" V[{i}] = {format_vector(row)} # {TOKEN_NAMES[tokens[i]]}")HEAD 1: Value Matrix V
Shape: [5, 8]
V[0] = [ 0.0107, -0.0291, -0.0100, -0.0312, 0.0214, 0.0372, 0.0105, 0.0279] # <BOS>
V[1] = [-0.0506, -0.0011, 0.0151, 0.0528, -0.0033, -0.0783, -0.0746, -0.0666] # I
V[2] = [-0.0562, -0.0003, 0.0484, -0.0677, 0.1120, 0.0491, 0.0651, -0.0207] # like
V[3] = [ 0.0521, -0.0033, -0.0165, 0.0878, 0.0455, 0.0866, 0.0211, -0.0656] # transformers
V[4] = [-0.0155, 0.0273, -0.0714, -0.0334, 0.0643, 0.0217, 0.0260, 0.0643] # <EOS>
What We’ve Computed¶
Starting from the input matrix X of shape [5, 16], we now have for each head:
| Matrix | Shape | Meaning |
|---|---|---|
| Q | [5, 8] | What each token is looking for |
| K | [5, 8] | What each token offers as a match |
| V | [5, 8] | What information each token carries |
These are the building blocks for attention. In the next notebook, we’ll use Q and K to compute attention scores (how much should each token attend to each other token?), then use those scores to take weighted combinations of V.
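As a quick sanity check (using the matrices computed above) that these shapes line up:
# Verify the shapes in the table above for both heads
for head in range(NUM_HEADS):
    for name, M in [("Q", Q_all[head]), ("K", K_all[head]), ("V", V_all[head])]:
        assert len(M) == seq_len and len(M[0]) == D_K
        print(f"Head {head} {name}: [{len(M)}, {len(M[0])}]")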
What’s Next¶
We have Q, K, V. Now comes the actual attention computation:
Attention scores: $Q K^T$ (how well does each query match each key?)
Scaling: divide by $\sqrt{d_k}$ (we’ll explain why)
Masking: prevent tokens from attending to future positions
Softmax: convert scores to probabilities
Weighted sum: multiply the attention weights by $V$
This is where tokens actually start “talking” to each other.
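As a small preview (a sketch, not the full computation of the next notebook), we can already form the raw, unscaled scores for head 0 by transposing K and multiplying with the matmul helper defined above:
# Preview: raw (unscaled, unmasked) attention scores for head 0
K_T = [[K_all[0][i][j] for i in range(seq_len)] for j in range(D_K)]   # transpose K: [8, 5]
raw_scores = matmul(Q_all[0], K_T)                                     # [5, 8] @ [8, 5] = [5, 5]
print(f"Raw score matrix shape: [{len(raw_scores)}, {len(raw_scores[0])}]")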
# Store for next notebook
qkv_data = {
'X': X,
'tokens': tokens,
'W_Q': W_Q,
'W_K': W_K,
'W_V': W_V,
'Q': Q_all,
'K': K_all,
'V': V_all,
'D_MODEL': D_MODEL,
'D_K': D_K,
'NUM_HEADS': NUM_HEADS
}
print("QKV projections complete. Ready for attention computation.")QKV projections complete. Ready for attention computation.