QKV Projections

The Core Idea of Attention

Attention lets each token gather information from other tokens.

When processing the word “like” in “I like transformers,” the model might want to know: what’s the subject? What’s the object? What came before? Attention lets the model look at other positions and pull in relevant information.

But how does a token decide which other tokens are “relevant”? That’s where Query, Key, and Value come in.

The Database Analogy

Think of attention like a fuzzy database lookup:

  • Query (Q): “What am I looking for?”

  • Key (K): “What do I contain?” (the label or tag)

  • Value (V): “What information should I return?” (the actual content)

In a normal database, you query with an exact key and get back the matching value. In attention, you query with a vector, compare it to all keys, and get back a weighted combination of all values—weighted by how well each key matches your query.

The “fuzzy” part is crucial. There’s no exact match. Every key contributes something; good matches contribute more.
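To make the analogy concrete, here is a tiny self-contained sketch (the three keys, values, and the query below are made up purely for illustration, not taken from the model we're building): a query is compared to every key with a dot product, the similarities become weights via softmax, and the lookup result is a weighted blend of all the values.

import math

# Toy "database": three entries, each with a 2-d key and a 2-d value (illustrative only)
keys   = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

query = [0.9, 0.1]  # mostly matches the first key

# Dot-product similarity between the query and every key
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

# Softmax turns the scores into positive weights that sum to 1
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# The lookup result is a weighted combination of ALL the values
result = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

print("weights:", [round(w, 3) for w in weights])
print("result: ", [round(r, 3) for r in result])

Every entry contributes a nonzero share of the result; the best-matching key simply contributes the most.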

Why Three Separate Projections?

You might wonder: why not just use the embeddings directly? Why create separate Q, K, V representations?

Here’s the insight: what you’re looking for might be different from what you contain.

Consider the word “it” in “The cat sat on the mat. It was tired.”

  • As a query, “it” is looking for its antecedent (what does “it” refer to?)

  • As a key, “it” is saying “I’m a pronoun that could be referenced”

  • As a value, “it” contains information about being a subject, being tired, etc.

These are different roles. The same token needs to express different things depending on whether it’s doing the looking (query) or being looked at (key/value).

Separate projections let the model learn these different roles independently.

Multi-Head Attention: Multiple Perspectives

We’re using multi-head attention with 2 heads. What does that mean?

Each head is an independent attention mechanism with its own Q, K, V projections. Different heads can learn to focus on different things:

  • Head 0 might learn syntactic patterns (subject-verb relationships)

  • Head 1 might learn semantic patterns (related concepts)

It’s like having multiple experts examine the same data from different angles.

Our architecture:

  • d_model = 16 (embedding dimension)

  • num_heads = 2

  • d_k = d_model / num_heads = 8 (dimension per head)

Each head projects from 16 dimensions down to 8 dimensions. Later, we’ll concatenate the 2 heads back to 16 dimensions.
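As a quick, self-contained sanity check on that dimension bookkeeping (the numbers simply mirror the architecture listed above, and the zero vectors stand in for head outputs we haven't computed yet): each head works in an 8-dimensional subspace, and concatenating the two head outputs restores the full 16-dimensional width.

d_model, num_heads = 16, 2
d_k = d_model // num_heads  # 8 dimensions per head

# Pretend each head produced an 8-dimensional output for one token
head_0_out = [0.0] * d_k
head_1_out = [0.0] * d_k

# Concatenating along the feature axis brings us back to d_model
combined = head_0_out + head_1_out
print(f"d_k per head: {d_k}, concatenated width: {len(combined)}")  # 8 and 16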

import random

# Set seed for reproducibility (same as previous notebook)
random.seed(42)

# Model dimensions
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS  # 8 dimensions per head

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

print(f"Embedding dimension (d_model): {D_MODEL}")
print(f"Number of attention heads: {NUM_HEADS}")
print(f"Dimension per head (d_k): {D_K}")
Embedding dimension (d_model): 16
Number of attention heads: 2
Dimension per head (d_k): 8
# Helper functions
def random_vector(size, scale=0.1):
    """Generate a random vector with values drawn from N(0, scale^2)"""
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    """Generate a random matrix with values drawn from N(0, scale^2)"""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    """Element-wise addition of two vectors"""
    return [a + b for a, b in zip(v1, v2)]

def format_vector(vec, decimals=4):
    """Format a vector as a readable string"""
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"
# Recreate embeddings from previous notebook (same random seed ensures same values)
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]

tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)

# Compute input embeddings X
token_embeddings = [E_token[token_id] for token_id in tokens]
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]

print(f"Input matrix X recreated from previous notebook")
print(f"Shape: [{seq_len}, {D_MODEL}]")
Input matrix X recreated from previous notebook
Shape: [5, 16]

The Projection Weights

For each attention head, we have three weight matrices:

  • $W_Q$: Query projection, shape [d_model, d_k] = [16, 8]

  • $W_K$: Key projection, shape [d_model, d_k] = [16, 8]

  • $W_V$: Value projection, shape [d_model, d_k] = [16, 8]

Each matrix projects from 16 dimensions to 8 dimensions. These weights are learned during training—the model figures out what projections are useful for the prediction task.

With 2 heads, we have 6 weight matrices total (3 per head).
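As a quick bit of arithmetic on those stated shapes (reusing the constants defined earlier in this notebook): each [16, 8] matrix holds 128 values, so the 6 matrices contribute 768 projection parameters in total.

# Parameter count for the Q/K/V projections (derived from the shapes above)
params_per_matrix = D_MODEL * D_K   # 16 * 8 = 128
num_matrices = NUM_HEADS * 3        # 2 heads * (W_Q, W_K, W_V) = 6
total_params = num_matrices * params_per_matrix

print(f"Parameters per projection matrix: {params_per_matrix}")
print(f"Total projection parameters: {total_params}")  # 768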

# Initialize weight matrices for each head
W_Q = []  # Query weights
W_K = []  # Key weights
W_V = []  # Value weights

for head in range(NUM_HEADS):
    W_Q.append(random_matrix(D_MODEL, D_K))  # [16, 8]
    W_K.append(random_matrix(D_MODEL, D_K))  # [16, 8]
    W_V.append(random_matrix(D_MODEL, D_K))  # [16, 8]

print(f"Initialized {NUM_HEADS} heads, each with:")
print(f"  W_Q: [{D_MODEL}, {D_K}]")
print(f"  W_K: [{D_MODEL}, {D_K}]")
print(f"  W_V: [{D_MODEL}, {D_K}]")
print(f"\nTotal weight matrices: {NUM_HEADS * 3}")
Initialized 2 heads, each with:
  W_Q: [16, 8]
  W_K: [16, 8]
  W_V: [16, 8]

Total weight matrices: 6

Matrix Multiplication: A Quick Review

The projection operation is just matrix multiplication. Let’s make sure we understand exactly what that means.

When we multiply matrices $A$ and $B$:

  • $A$ has shape [m, n]

  • $B$ has shape [n, p]

  • Result has shape [m, p]

The key rule: the number of columns in $A$ must equal the number of rows in $B$.

Each element of the result is a dot product:

$$\text{result}_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}$$

In words: take row $i$ from $A$, take column $j$ from $B$, multiply element-wise, sum.

def matmul(A, B):
    """
    Multiply matrices A @ B.
    A has shape [m, n], B has shape [n, p], result has shape [m, p].
    """
    m = len(A)       # number of rows in A
    n = len(A[0])    # number of columns in A (= rows in B)
    p = len(B[0])    # number of columns in B
    
    # Initialize result matrix with zeros
    result = [[0.0] * p for _ in range(m)]
    
    # Compute each element
    for i in range(m):
        for j in range(p):
            # Dot product of row i from A and column j from B
            result[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    
    return result

# Quick example to verify
A = [[1, 2, 3], [4, 5, 6]]  # [2, 3]
B = [[1, 4], [2, 5], [3, 6]]  # [3, 2]
result = matmul(A, B)  # Should be [2, 2]

print("Example: [2, 3] @ [3, 2] = [2, 2]")
print(f"A = {A}")
print(f"B = {B}")
print(f"A @ B = {result}")
print()
print("Verification:")
print(f"  result[0][0] = 1*1 + 2*2 + 3*3 = {1*1 + 2*2 + 3*3}")
print(f"  result[0][1] = 1*4 + 2*5 + 3*6 = {1*4 + 2*5 + 3*6}")
Example: [2, 3] @ [3, 2] = [2, 2]
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 4], [2, 5], [3, 6]]
A @ B = [[14, 32], [32, 77]]

Verification:
  result[0][0] = 1*1 + 2*2 + 3*3 = 14
  result[0][1] = 1*4 + 2*5 + 3*6 = 32

Computing Q, K, V

Now we can compute the projections. For each head:

$$Q = X \cdot W_Q \quad \text{shape: } [5, 16] \times [16, 8] = [5, 8]$$
$$K = X \cdot W_K \quad \text{shape: } [5, 16] \times [16, 8] = [5, 8]$$
$$V = X \cdot W_V \quad \text{shape: } [5, 16] \times [16, 8] = [5, 8]$$

Each row of $Q$ is the query vector for one token. Same for $K$ and $V$.

Let’s compute them for both heads.

# Compute Q, K, V for each head
Q_all = []  # Will hold Q matrices for each head
K_all = []  # Will hold K matrices for each head
V_all = []  # Will hold V matrices for each head

for head in range(NUM_HEADS):
    Q = matmul(X, W_Q[head])  # [5, 16] @ [16, 8] = [5, 8]
    K = matmul(X, W_K[head])  # [5, 16] @ [16, 8] = [5, 8]
    V = matmul(X, W_V[head])  # [5, 16] @ [16, 8] = [5, 8]
    
    Q_all.append(Q)
    K_all.append(K)
    V_all.append(V)

print(f"Computed Q, K, V for {NUM_HEADS} heads")
print(f"Each Q, K, V has shape [{seq_len}, {D_K}]")
Computed Q, K, V for 2 heads
Each Q, K, V has shape [5, 8]

Detailed Example: Computing One Query Vector

Let’s trace through exactly how we compute the query vector for position 0 (<BOS>) in head 0.

We’re computing:

$$Q[0] = X[0] \cdot W_Q[0]$$

Where:

  • $X[0]$ is a 16-dimensional vector (the embedding for <BOS>)

  • $W_Q[0]$ is a [16, 8] matrix

  • $Q[0]$ is an 8-dimensional vector (the query for <BOS>)

Each element of $Q[0]$ is a dot product between $X[0]$ and one column of $W_Q[0]$.

print("Computing Q[0] for Head 0 (query for <BOS>)")
print("=" * 70)
print()
print(f"Input: X[0] (embedding for <BOS>), shape [16]")
print(f"  {format_vector(X[0])}")
print()
print(f"Weight: W_Q[0], shape [16, 8]")
print(f"  (16 rows, 8 columns - too big to print fully)")
print()
print(f"Output: Q[0] = X[0] @ W_Q[0], shape [8]")
print()

# Show detailed calculation for first two output dimensions
for j in range(2):
    print(f"Q[0][{j}] = X[0] · W_Q[0][:, {j}]  (dot product with column {j})")
    
    # Get column j of W_Q[0]
    col_j = [W_Q[0][i][j] for i in range(D_MODEL)]
    
    # Show first few terms
    terms = [f"({X[0][i]:.4f} × {col_j[i]:.4f})" for i in range(3)]
    print(f"       = {' + '.join(terms)} + ...")
    
    # Compute actual value
    value = sum(X[0][i] * col_j[i] for i in range(D_MODEL))
    print(f"       = {value:.4f}")
    print()

print(f"Full result: Q[0] = {format_vector(Q_all[0][0])}")
Computing Q[0] for Head 0 (query for <BOS>)
======================================================================

Input: X[0] (embedding for <BOS>), shape [16]
  [ 0.1473,  0.1281,  0.1995, -0.0465,  0.2125, -0.1338, -0.0829, -0.0638,  0.0722,  0.1183,  0.1193,  0.0937, -0.1594, -0.0402,  0.1124, -0.2064]

Weight: W_Q[0], shape [16, 8]
  (16 rows, 8 columns - too big to print fully)

Output: Q[0] = X[0] @ W_Q[0], shape [8]

Q[0][0] = X[0] · W_Q[0][:, 0]  (dot product with column 0)
       = (0.1473 × 0.0871) + (0.1281 × -0.0745) + (0.1995 × 0.0003) + ...
       = -0.0179

Q[0][1] = X[0] · W_Q[0][:, 1]  (dot product with column 1)
       = (0.1473 × 0.0608) + (0.1281 × 0.0523) + (0.1995 × 0.2266) + ...
       = 0.1390

Full result: Q[0] = [-0.0179,  0.1390, -0.1115,  0.0441, -0.0565, -0.0221,  0.1540, -0.0131]

Head 0: All Q, K, V Matrices

head = 0
print(f"HEAD {head}: Query Matrix Q")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[head]):
    print(f"  Q[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 0: Query Matrix Q
Shape: [5, 8]

  Q[0] = [-0.0179,  0.1390, -0.1115,  0.0441, -0.0565, -0.0221,  0.1540, -0.0131]  # <BOS>
  Q[1] = [-0.0997, -0.0394,  0.0301,  0.0469,  0.0628, -0.0026, -0.0506,  0.0320]  # I
  Q[2] = [-0.0154,  0.0507, -0.0404,  0.0923,  0.0319, -0.0150,  0.0833, -0.0375]  # like
  Q[3] = [ 0.0012, -0.0905,  0.0421,  0.0099,  0.1038,  0.0244, -0.0546, -0.0397]  # transformers
  Q[4] = [ 0.0812,  0.0104,  0.0022,  0.0003, -0.0376,  0.0182,  0.0318, -0.0184]  # <EOS>
print(f"HEAD {head}: Key Matrix K")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[head]):
    print(f"  K[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 0: Key Matrix K
Shape: [5, 8]

  K[0] = [ 0.0817,  0.0209,  0.0114,  0.0069,  0.0258,  0.0144, -0.0401,  0.0410]  # <BOS>
  K[1] = [-0.0228,  0.0577, -0.0045, -0.0131,  0.0082, -0.0335,  0.0272,  0.0137]  # I
  K[2] = [ 0.0675, -0.0504, -0.1121,  0.0738,  0.0479, -0.1313,  0.0103,  0.0228]  # like
  K[3] = [-0.1202,  0.1335,  0.0520,  0.0626, -0.0597,  0.0077,  0.0658, -0.0298]  # transformers
  K[4] = [ 0.0189, -0.0549,  0.0358, -0.0400, -0.0008,  0.0210,  0.0411, -0.0375]  # <EOS>
print(f"HEAD {head}: Value Matrix V")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[head]):
    print(f"  V[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 0: Value Matrix V
Shape: [5, 8]

  V[0] = [-0.0090, -0.0398,  0.0085, -0.0527, -0.0375, -0.0001, -0.0328,  0.0792]  # <BOS>
  V[1] = [ 0.0048,  0.0079,  0.0088,  0.0217, -0.1038, -0.0268,  0.0811, -0.1041]  # I
  V[2] = [ 0.0390,  0.1344,  0.0726,  0.0888,  0.0703,  0.1238, -0.1341,  0.1226]  # like
  V[3] = [-0.0103,  0.0407, -0.0746,  0.0207,  0.0585, -0.0899,  0.0405, -0.0838]  # transformers
  V[4] = [-0.0202, -0.0619, -0.0048, -0.0391,  0.0689, -0.0415, -0.0032,  0.0630]  # <EOS>

Head 1: Different Projections, Different Representation

Head 1 has its own weight matrices, so it produces completely different Q, K, V representations from the same input. This is the “multi” in multi-head attention—multiple parallel views of the data.

head = 1
print(f"HEAD {head}: Query Matrix Q")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(Q_all[head]):
    print(f"  Q[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 1: Query Matrix Q
Shape: [5, 8]

  Q[0] = [-0.0801,  0.0205, -0.0577,  0.0358,  0.0203, -0.0472,  0.1419,  0.0332]  # <BOS>
  Q[1] = [ 0.0791,  0.0428, -0.0408,  0.0261, -0.0520, -0.0152, -0.0639, -0.0355]  # I
  Q[2] = [-0.0232,  0.0231, -0.0204,  0.0449,  0.0019,  0.0651,  0.0958, -0.0080]  # like
  Q[3] = [ 0.0913,  0.0219,  0.0457, -0.0627,  0.0176, -0.1209, -0.1008, -0.0297]  # transformers
  Q[4] = [ 0.0314, -0.0331, -0.0224, -0.0109, -0.0103, -0.0073,  0.0198, -0.0383]  # <EOS>
print(f"HEAD {head}: Key Matrix K")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(K_all[head]):
    print(f"  K[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 1: Key Matrix K
Shape: [5, 8]

  K[0] = [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]  # <BOS>
  K[1] = [ 0.0565,  0.0479, -0.0409, -0.0089, -0.0037,  0.0547, -0.0085, -0.0782]  # I
  K[2] = [-0.0624,  0.1632,  0.0750, -0.0765,  0.0238,  0.0042, -0.0385,  0.1000]  # like
  K[3] = [ 0.0276, -0.0325, -0.0956,  0.0622, -0.0129, -0.0202, -0.0572,  0.0587]  # transformers
  K[4] = [ 0.0606, -0.0210, -0.0280, -0.0021,  0.0533,  0.0302, -0.0483,  0.0772]  # <EOS>
print(f"HEAD {head}: Value Matrix V")
print(f"Shape: [{seq_len}, {D_K}]")
print()
for i, row in enumerate(V_all[head]):
    print(f"  V[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
HEAD 1: Value Matrix V
Shape: [5, 8]

  V[0] = [ 0.0107, -0.0291, -0.0100, -0.0312,  0.0214,  0.0372,  0.0105,  0.0279]  # <BOS>
  V[1] = [-0.0506, -0.0011,  0.0151,  0.0528, -0.0033, -0.0783, -0.0746, -0.0666]  # I
  V[2] = [-0.0562, -0.0003,  0.0484, -0.0677,  0.1120,  0.0491,  0.0651, -0.0207]  # like
  V[3] = [ 0.0521, -0.0033, -0.0165,  0.0878,  0.0455,  0.0866,  0.0211, -0.0656]  # transformers
  V[4] = [-0.0155,  0.0273, -0.0714, -0.0334,  0.0643,  0.0217,  0.0260,  0.0643]  # <EOS>

What We’ve Computed

Starting from input $X$ [5, 16], we now have for each head:

| Matrix | Shape  | Meaning                             |
|--------|--------|-------------------------------------|
| Q      | [5, 8] | What each token is looking for      |
| K      | [5, 8] | What each token offers as a match   |
| V      | [5, 8] | What information each token carries |

These are the building blocks for attention. In the next notebook, we’ll use Q and K to compute attention scores (how much should each token attend to each other token?), then use those scores to take weighted combinations of V.

What’s Next

We have Q, K, V. Now comes the actual attention computation:

  1. Attention scores: $\text{scores} = Q \cdot K^T$ (how well does each query match each key?)

  2. Scaling: Divide by $\sqrt{d_k}$ (we’ll explain why)

  3. Masking: Prevent tokens from attending to future positions

  4. Softmax: Convert scores to probabilities

  5. Weighted sum: $\text{output} = \text{weights} \cdot V$

This is where tokens actually start “talking” to each other.
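To make those five steps concrete, here is a rough preview sketch of scaled dot-product attention for head 0 only, reusing the matmul helper and the Q, K, V matrices computed above. It is only a compressed preview under the steps listed here; the next notebook builds the real computation up one step at a time.

import math

# Rough preview: scaled dot-product attention for head 0
Q, K, V = Q_all[0], K_all[0], V_all[0]

# 1. Scores: compare every query with every key -> [5, 5]
K_T = [[K[i][j] for i in range(seq_len)] for j in range(D_K)]  # transpose K to [8, 5]
scores = matmul(Q, K_T)

# 2. Scale by sqrt(d_k)
scores = [[s / math.sqrt(D_K) for s in row] for row in scores]

# 3. Causal mask: position i may only attend to positions <= i
for i in range(seq_len):
    for j in range(i + 1, seq_len):
        scores[i][j] = float("-inf")

# 4. Softmax each row into attention weights (exp(-inf) becomes 0.0)
attn_weights = []
for row in scores:
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    attn_weights.append([e / total for e in exps])

# 5. Weighted sum of values -> [5, 8]
attn_output = matmul(attn_weights, V)
print(f"Attention output shape: [{len(attn_output)}, {len(attn_output[0])}]")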

# Store for next notebook
qkv_data = {
    'X': X,
    'tokens': tokens,
    'W_Q': W_Q,
    'W_K': W_K,
    'W_V': W_V,
    'Q': Q_all,
    'K': K_all,
    'V': V_all,
    'D_MODEL': D_MODEL,
    'D_K': D_K,
    'NUM_HEADS': NUM_HEADS
}
print("QKV projections complete. Ready for attention computation.")
QKV projections complete. Ready for attention computation.