Attention

This Is the Important Part

If there’s one idea that makes transformers work, this is it.

Attention is what lets the model understand context. It’s how the word “bank” can mean different things in “river bank” vs “bank account.” It’s how pronouns find their antecedents. It’s how relationships between distant words get captured.

We have our Q, K, V projections from the previous notebook. Now we use them to actually compute attention: how much should each token pay attention to each other token?

The Attention Formula

The full attention computation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

That’s dense. Let’s break it into five steps:

  1. Compute raw scores: $\text{scores} = Q \cdot K^T$

  2. Scale: $\text{scaled} = \frac{\text{scores}}{\sqrt{d_k}}$

  3. Apply causal mask: Set future positions to $-\infty$

  4. Softmax: Convert to probabilities

  5. Weighted sum: $\text{output} = \text{weights} \cdot V$

Each step has a purpose. Let’s go through them.

import random
import math

# Set seed for reproducibility
random.seed(42)

# Model dimensions
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS  # 8

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n, p = len(A), len(A[0]), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    """Transpose a matrix: swap rows and columns"""
    rows, cols = len(A), len(A[0])
    return [[A[i][j] for i in range(rows)] for j in range(cols)]

def dot_product(v1, v2):
    """Compute dot product of two vectors"""
    return sum(a * b for a, b in zip(v1, v2))

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"
# Recreate Q, K, V from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# QKV weights and projections
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]

Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

print(f"Recreated Q, K, V from previous notebooks")
print(f"Shape of each: [{seq_len}, {D_K}]")
Recreated Q, K, V from previous notebooks
Shape of each: [5, 8]

Step 1: Compute Attention Scores

The first step is to compute how well each query matches each key. We do this with a matrix multiplication:

$$\text{scores} = Q \cdot K^T$$

Shapes:

  • $Q$: [5, 8] — 5 queries, each 8-dimensional

  • $K^T$: [8, 5] — transpose of $K$ (5 keys, each 8-dimensional)

  • $\text{scores}$: [5, 5] — score for each (query, key) pair

The element $\text{scores}_{ij}$ is the dot product of query $i$ with key $j$. It measures: “how well does token $i$’s query match token $j$’s key?”

Higher score = better match = more attention.

# Work with Head 0 for this walkthrough
head = 0
Q = Q_all[head]
K = K_all[head]
V = V_all[head]

# Compute scores = Q @ K^T
K_T = transpose(K)
scores = matmul(Q, K_T)

print(f"HEAD {head}: Attention Scores (Q @ K^T)")
print(f"Shape: [{seq_len}, {D_K}] @ [{D_K}, {seq_len}] = [{seq_len}, {seq_len}]")
print()
print("Each row is one query; each column is one key.")
print("scores[i][j] = how much should token i attend to token j?")
print()

# Print with token labels
max_label = max(len(TOKEN_NAMES[t]) for t in tokens)
header = " " * (max_label + 1) + "  ".join([f"{TOKEN_NAMES[tokens[j]]:>{max_label}}" for j in range(seq_len)])
print(header)
for i, row in enumerate(scores):
    values = "  ".join([f"{v:{max_label}.4f}" for v in row])
    print(f"{TOKEN_NAMES[tokens[i]]:>{max_label}} [{values}]")
HEAD 0: Attention Scores (Q @ K^T)
Shape: [5, 8] @ [8, 5] = [5, 5]

Each row is one query; each column is one key.
scores[i][j] = how much should token i attend to token j?

                    <BOS>             I          like  transformers         <EOS>
       <BOS> [     -0.0126        0.0213       -0.0152        0.0211       -0.0137]
           I [      0.0021       -0.0134        0.0119       -0.0027        0.0091]
        like [     -0.0140        0.0097       -0.0039        0.0169       -0.0061]
transformers [     -0.0018       -0.0119        0.0046       -0.0016        0.0088]
       <EOS> [     -0.0022        0.0084       -0.0022       -0.0016       -0.0069]

What Does a Score Mean?

Let’s look at one specific score: how much should “I” (position 1) attend to “<BOS>” (position 0)?

This is the dot product of “I”'s query with “<BOS>”'s key.

print("Computing scores[1][0]: how much should 'I' attend to '<BOS>'?")
print("=" * 60)
print()
print(f"Query for 'I' (Q[1]):")
print(f"  {format_vector(Q[1])}")
print()
print(f"Key for '<BOS>' (K[0]):")
print(f"  {format_vector(K[0])}")
print()
print("Score = dot product of these vectors")

# Show the dot product calculation
score = dot_product(Q[1], K[0])
terms = [f"({Q[1][i]:.4f} × {K[0][i]:.4f})" for i in range(3)]
print(f"      = {' + '.join(terms)} + ...")
print(f"      = {score:.4f}")
Computing scores[1][0]: how much should 'I' attend to '<BOS>'?
============================================================

Query for 'I' (Q[1]):
  [-0.0997, -0.0394,  0.0301,  0.0469,  0.0628, -0.0026, -0.0506,  0.0320]

Key for '<BOS>' (K[0]):
  [-0.0090, -0.0398,  0.0085, -0.0527, -0.0375, -0.0001, -0.0328,  0.0792]

Score = dot product of these vectors
      = (-0.0997 × -0.0090) + (-0.0394 × -0.0398) + (0.0301 × 0.0085) + ...
      = 0.0021

Step 2: Scale the Scores

We divide all scores by $\sqrt{d_k} = \sqrt{8} \approx 2.83$:

$$\text{scaled\_scores} = \frac{\text{scores}}{\sqrt{d_k}}$$

Why scale?

Dot products of high-dimensional vectors can be large. If $d_k = 8$ and each element is around 0.1, the dot product could be around $8 \times 0.1 \times 0.1 = 0.08$. That’s fine.

But as $d_k$ grows (say, to 64 or 128), the typical magnitude of the dot products grows with it. Large inputs to softmax push it toward extreme values—one position gets weight ~1.0, everything else gets ~0.0. The gradients become tiny, and training stalls.

Dividing by $\sqrt{d_k}$ keeps the variance of the scores roughly constant regardless of $d_k$. The softmax stays in a “healthy” range where gradients flow well.

(This is one of those details that seems minor but matters a lot in practice.)
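To make the scaling argument concrete, here is a rough empirical check. It is only a sketch, assuming unit-variance random vectors, and the helper name typical_dot_size is made up for this demo: the root-mean-square dot product grows like $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to roughly the same size for every $d_k$.

import math
import random

# Separate RNG instance so we don't disturb the notebook's seeded global state
rng = random.Random(0)

def typical_dot_size(d_k, trials=1000):
    """Root-mean-square dot product over random unit-variance query/key pairs (hypothetical helper)."""
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        total += dot * dot
    return math.sqrt(total / trials)

for d in [8, 64, 512]:
    rms = typical_dot_size(d)
    print(f"d_k={d:3d}  rms dot product: {rms:6.2f}  scaled by 1/sqrt(d_k): {rms / math.sqrt(d):.2f}")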

scale = math.sqrt(D_K)
print(f"Scaling factor: sqrt({D_K}) = {scale:.4f}")
print()

scaled_scores = [[s / scale for s in row] for row in scores]

print(f"Scaled Scores (scores / {scale:.4f})")
print()
max_label = max(len(TOKEN_NAMES[t]) for t in tokens)
header = " " * (max_label + 1) + "  ".join([f"{TOKEN_NAMES[tokens[j]]:>{max_label}}" for j in range(seq_len)])
print(header)
for i, row in enumerate(scaled_scores):
    values = "  ".join([f"{v:{max_label}.4f}" for v in row])
    print(f"{TOKEN_NAMES[tokens[i]]:>{max_label}} [{values}]")
Scaling factor: sqrt(8) = 2.8284

Scaled Scores (scores / 2.8284)

                    <BOS>             I          like  transformers         <EOS>
       <BOS> [     -0.0045        0.0075       -0.0054        0.0075       -0.0048]
           I [      0.0007       -0.0047        0.0042       -0.0009        0.0032]
        like [     -0.0049        0.0034       -0.0014        0.0060       -0.0021]
transformers [     -0.0006       -0.0042        0.0016       -0.0006        0.0031]
       <EOS> [     -0.0008        0.0030       -0.0008       -0.0006       -0.0024]

Step 3: Apply the Causal Mask

This is a decoder-only model (like GPT). It generates text left-to-right, one token at a time. When predicting the next token, it can only see previous tokens—not future ones.

We enforce this with a causal mask: set scores for future positions to $-\infty$.

$$\text{masked}_{ij} = \begin{cases} \text{scaled}_{ij} & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

Why $-\infty$? Because $e^{-\infty} = 0$. When we apply softmax, positions with $-\infty$ will get weight 0. They contribute nothing.
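A quick sanity check of that claim (nothing specific to this model): Python's math.exp underflows cleanly to zero for very negative inputs and for -inf, which is exactly what lets masked positions drop out of the softmax.

import math

# e^(-inf) is exactly 0, so a masked position can never receive attention weight
print(math.exp(float('-inf')))   # 0.0
print(math.exp(-1000.0))         # 0.0 (underflows to zero, no error)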

The mask pattern:

Position:      0    1    2    3    4
0 (<BOS>)    [ ok  -∞   -∞   -∞   -∞ ]  can only see itself
1 (I)        [ ok   ok  -∞   -∞   -∞ ]  can see 0, 1
2 (like)     [ ok   ok   ok  -∞   -∞ ]  can see 0, 1, 2
3 (trans.)   [ ok   ok   ok   ok  -∞ ]  can see 0, 1, 2, 3
4 (<EOS>)    [ ok   ok   ok   ok   ok ]  can see all
# Apply causal mask
masked_scores = []
for i in range(seq_len):
    row = []
    for j in range(seq_len):
        if j <= i:  # Can attend to this position
            row.append(scaled_scores[i][j])
        else:  # Future position - mask it out
            row.append(float('-inf'))
    masked_scores.append(row)

print("Masked Scores (future positions set to -inf)")
print()
max_label = max(len(TOKEN_NAMES[t]) for t in tokens)
header = " " * (max_label + 1) + "  ".join([f"{TOKEN_NAMES[tokens[j]]:>{max_label}}" for j in range(seq_len)])
print(header)
for i, row in enumerate(masked_scores):
    values = []
    for v in row:
        if v == float('-inf'):
            values.append(f"{'-inf':>{max_label}}")
        else:
            values.append(f"{v:{max_label}.4f}")
    print(f"{TOKEN_NAMES[tokens[i]]:>{max_label}} [{', '.join(values)}]")
Masked Scores (future positions set to -inf)

                    <BOS>             I          like  transformers         <EOS>
       <BOS> [     -0.0045,         -inf,         -inf,         -inf,         -inf]
           I [      0.0007,      -0.0047,         -inf,         -inf,         -inf]
        like [     -0.0049,       0.0034,      -0.0014,         -inf,         -inf]
transformers [     -0.0006,      -0.0042,       0.0016,      -0.0006,         -inf]
       <EOS> [     -0.0008,       0.0030,      -0.0008,      -0.0006,      -0.0024]

Step 4: Softmax

Now we convert scores to probabilities using softmax:

$$\text{weight}_i = \frac{e^{\text{score}_i}}{\sum_j e^{\text{score}_j}}$$

Softmax does three things:

  1. Exponentiation — makes all values positive

  2. Normalization — makes them sum to 1

  3. Amplification — larger scores get disproportionately larger weights

The result is a probability distribution over positions. Higher scores → higher weights → more attention.
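Here is a small illustration of the amplification point, using a bare-bones softmax for finite inputs (the softmax_simple name is just for this demo; the full version defined next also handles $-\infty$). Multiplying the scores by 10 doesn't change their order, but it makes the distribution much sharper, which is also part of why the $\sqrt{d_k}$ scaling in Step 2 matters.

import math

def softmax_simple(vec):
    # Plain softmax for finite inputs; subtract the max for numerical stability
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

# Small score differences -> nearly uniform weights
print([round(w, 3) for w in softmax_simple([0.1, 0.2, 0.3])])   # ~[0.301, 0.332, 0.367]

# Same differences scaled 10x -> one position dominates
print([round(w, 3) for w in softmax_simple([1.0, 2.0, 3.0])])   # ~[0.090, 0.245, 0.665]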

def softmax(vec):
    """
    Compute softmax of a vector, handling -inf values.
    
    We subtract the max for numerical stability (doesn't change the result,
    but prevents overflow when exponentiating large numbers).
    """
    # Find max of non-inf values
    finite_vals = [v for v in vec if v != float('-inf')]
    max_val = max(finite_vals) if finite_vals else 0
    
    # Exponentiate (shifted by max)
    exp_vec = []
    for v in vec:
        if v == float('-inf'):
            exp_vec.append(0.0)  # e^(-inf) = 0
        else:
            exp_vec.append(math.exp(v - max_val))
    
    # Normalize
    total = sum(exp_vec)
    return [e / total for e in exp_vec]
# Let's trace through softmax for position 1 ("I")
print("Example: Softmax for position 1 ('I')")
print("=" * 60)
print()
print(f"Masked scores: {masked_scores[1][:2]} (only positions 0,1 are visible)")
print()

s0, s1 = masked_scores[1][0], masked_scores[1][1]
print(f"Step 1: Exponentiate (subtracting max={max(s0,s1):.4f} for stability)")
exp_0 = math.exp(s0 - max(s0, s1))
exp_1 = math.exp(s1 - max(s0, s1))
print(f"  exp({s0:.4f} - {max(s0,s1):.4f}) = exp({s0 - max(s0,s1):.4f}) = {exp_0:.4f}")
print(f"  exp({s1:.4f} - {max(s0,s1):.4f}) = exp({s1 - max(s0,s1):.4f}) = {exp_1:.4f}")
print()

total = exp_0 + exp_1
print(f"Step 2: Sum = {exp_0:.4f} + {exp_1:.4f} = {total:.4f}")
print()

print(f"Step 3: Normalize")
print(f"  weight[0] = {exp_0:.4f} / {total:.4f} = {exp_0/total:.4f}")
print(f"  weight[1] = {exp_1:.4f} / {total:.4f} = {exp_1/total:.4f}")
print()
print(f"Sum of weights: {exp_0/total + exp_1/total:.4f} (should be 1.0)")
Example: Softmax for position 1 ('I')
============================================================

Masked scores: [0.000737600323286676, -0.004733186959169455] (only positions 0,1 are visible)

Step 1: Exponentiate (subtracting max=0.0007 for stability)
  exp(0.0007 - 0.0007) = exp(0.0000) = 1.0000
  exp(-0.0047 - 0.0007) = exp(-0.0055) = 0.9945

Step 2: Sum = 1.0000 + 0.9945 = 1.9945

Step 3: Normalize
  weight[0] = 1.0000 / 1.9945 = 0.5014
  weight[1] = 0.9945 / 1.9945 = 0.4986

Sum of weights: 1.0000 (should be 1.0)
# Apply softmax to all rows
attention_weights = [softmax(row) for row in masked_scores]

print("Attention Weights (after softmax)")
print()
print("Each row sums to 1.0. These are the 'attention probabilities'.")
print()
max_label = max(len(TOKEN_NAMES[t]) for t in tokens)
header = " " * (max_label + 1) + "  ".join([f"{TOKEN_NAMES[tokens[j]]:>{max_label}}" for j in range(seq_len)])
print(header)
for i, row in enumerate(attention_weights):
    values = "  ".join([f"{v:{max_label}.4f}" for v in row])
    row_sum = sum(row)
    print(f"{TOKEN_NAMES[tokens[i]]:>{max_label}} [{values}]  sum={row_sum:.4f}")
Attention Weights (after softmax)

Each row sums to 1.0. These are the 'attention probabilities'.

                    <BOS>             I          like  transformers         <EOS>
       <BOS> [      1.0000        0.0000        0.0000        0.0000        0.0000]  sum=1.0000
           I [      0.5014        0.4986        0.0000        0.0000        0.0000]  sum=1.0000
        like [      0.3320        0.3348        0.3332        0.0000        0.0000]  sum=1.0000
transformers [      0.2501        0.2492        0.2506        0.2501        0.0000]  sum=1.0000
       <EOS> [      0.1999        0.2007        0.1999        0.2000        0.1996]  sum=1.0000

Interpreting the Weights

Look at what we computed:

  • Position 0 (<BOS>): 100% attention to itself. It has no choice—it can only see itself.

  • Position 1 (I): About 50-50 between <BOS> and itself.

  • Later positions: Spread attention more evenly across all visible positions.

The weights are nearly uniform because our model is untrained—the Q and K projections are random noise. In a trained model, you’d see much more interesting patterns:

  • Verbs attending strongly to their subjects

  • Pronouns attending to their antecedents

  • Related concepts clustering together

Step 5: Weighted Sum of Values

Finally, we use the attention weights to compute a weighted combination of values:

$$\text{output} = \text{weights} \cdot V$$

Shapes:

  • $\text{weights}$: [5, 5] — attention from each position to each position

  • $V$: [5, 8] — value vector for each position

  • $\text{output}$: [5, 8] — new representation for each position

Each output vector is a weighted average of value vectors, where the weights come from the attention.

This is how information flows: token $i$ gathers information from other tokens by taking a weighted sum of their values.

# Compute attention output
attention_output = matmul(attention_weights, V)

print(f"Attention Output for Head {head}")
print(f"Shape: [{seq_len}, {seq_len}] @ [{seq_len}, {D_K}] = [{seq_len}, {D_K}]")
print()
for i, row in enumerate(attention_output):
    print(f"  output[{i}] = {format_vector(row)}  # {TOKEN_NAMES[tokens[i]]}")
Attention Output for Head 0
Shape: [5, 5] @ [5, 8] = [5, 8]

  output[0] = [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]  # <BOS>
  output[1] = [ 0.0683,  0.0368, -0.0263, -0.0574,  0.0152, -0.0174, -0.0084, -0.0760]  # I
  output[2] = [ 0.0247,  0.0789,  0.0074, -0.0635,  0.0180, -0.0098, -0.0184, -0.0173]  # like
  output[3] = [ 0.0254,  0.0511, -0.0182, -0.0322,  0.0103, -0.0126, -0.0282,  0.0018]  # transformers
  output[4] = [ 0.0325,  0.0367, -0.0202, -0.0262,  0.0188, -0.0040, -0.0321,  0.0167]  # <EOS>
# Detailed calculation for position 1
print("Detailed: Computing output for position 1 ('I')")
print("=" * 60)
print()
print("output[1] = sum of (attention_weight × value) for each visible position")
print()

w0, w1 = attention_weights[1][0], attention_weights[1][1]
print(f"Attention weights: {w0:.4f} to '<BOS>', {w1:.4f} to 'I'")
print()
print(f"V[0] (value for '<BOS>'): {format_vector(V[0])}")
print(f"V[1] (value for 'I'):     {format_vector(V[1])}")
print()
print(f"output[1] = {w0:.4f} × V[0] + {w1:.4f} × V[1]")
print()

# Compute manually
manual_output = [w0 * V[0][d] + w1 * V[1][d] for d in range(D_K)]
print(f"Result: {format_vector(manual_output)}")
Detailed: Computing output for position 1 ('I')
============================================================

output[1] = sum of (attention_weight × value) for each visible position

Attention weights: 0.5014 to '<BOS>', 0.4986 to 'I'

V[0] (value for '<BOS>'): [ 0.0800,  0.0257, -0.0117, -0.1056,  0.0339, -0.0891, -0.0083, -0.0737]
V[1] (value for 'I'):     [ 0.0565,  0.0479, -0.0409, -0.0089, -0.0037,  0.0547, -0.0085, -0.0782]

output[1] = 0.5014 × V[0] + 0.4986 × V[1]

Result: [ 0.0683,  0.0368, -0.0263, -0.0574,  0.0152, -0.0174, -0.0084, -0.0760]

The Complete Attention Function

Let’s wrap all five steps into a single function and run it for both heads.

def attention(Q, K, V, causal=True):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix [seq_len, d_k]
        K: Key matrix [seq_len, d_k]
        V: Value matrix [seq_len, d_k]
        causal: If True, apply causal mask (can't attend to future)
    
    Returns:
        output: [seq_len, d_k] - weighted sum of values
        weights: [seq_len, seq_len] - attention weights
    """
    seq_len = len(Q)
    d_k = len(Q[0])
    scale = math.sqrt(d_k)
    
    # Step 1: scores = Q @ K^T
    K_T = transpose(K)
    scores = matmul(Q, K_T)
    
    # Step 2: Scale
    scaled = [[s / scale for s in row] for row in scores]
    
    # Step 3: Causal mask
    if causal:
        for i in range(seq_len):
            for j in range(seq_len):
                if j > i:
                    scaled[i][j] = float('-inf')
    
    # Step 4: Softmax
    weights = [softmax(row) for row in scaled]
    
    # Step 5: Weighted sum
    output = matmul(weights, V)
    
    return output, weights
# Compute attention for both heads
attention_output_all = []
attention_weights_all = []

for h in range(NUM_HEADS):
    output, weights = attention(Q_all[h], K_all[h], V_all[h])
    attention_output_all.append(output)
    attention_weights_all.append(weights)
    
    print(f"HEAD {h}: Attention Weights")
    for i, row in enumerate(weights):
        values = ", ".join([f"{v:.4f}" for v in row])
        print(f"  [{values}]  # {TOKEN_NAMES[tokens[i]]}")
    print()
HEAD 0: Attention Weights
  [1.0000, 0.0000, 0.0000, 0.0000, 0.0000]  # <BOS>
  [0.5014, 0.4986, 0.0000, 0.0000, 0.0000]  # I
  [0.3320, 0.3348, 0.3332, 0.0000, 0.0000]  # like
  [0.2501, 0.2492, 0.2506, 0.2501, 0.0000]  # transformers
  [0.1999, 0.2007, 0.1999, 0.2000, 0.1996]  # <EOS>

HEAD 1: Attention Weights
  [1.0000, 0.0000, 0.0000, 0.0000, 0.0000]  # <BOS>
  [0.5009, 0.4991, 0.0000, 0.0000, 0.0000]  # I
  [0.3342, 0.3337, 0.3322, 0.0000, 0.0000]  # like
  [0.2514, 0.2494, 0.2510, 0.2482, 0.0000]  # transformers
  [0.1999, 0.1997, 0.2001, 0.2000, 0.2003]  # <EOS>

What We’ve Computed

For each head, we now have:

| What | Shape | Meaning |
| --- | --- | --- |
| Attention weights | [5, 5] | How much each position attends to each other |
| Attention output | [5, 8] | New representation incorporating context |

The output for each position is now a mixture of information from other positions. Token representations are no longer independent—they’ve started to incorporate context.

This is the power of attention: each token’s representation can now depend on the entire sequence (up to its position).
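As a quick sanity check on the shapes in the table above (this just reads off the dimensions of the lists computed earlier in this notebook):

# Verify the shapes listed above for both heads
for h in range(NUM_HEADS):
    w, o = attention_weights_all[h], attention_output_all[h]
    print(f"Head {h}: weights [{len(w)}, {len(w[0])}], output [{len(o)}, {len(o[0])}]")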

What’s Next

We have attention outputs from two heads, each with shape [5, 8]. But our model expects d_model = 16 dimensions.

Next, we’ll:

  1. Concatenate the head outputs: [5, 8] + [5, 8] → [5, 16] (see the sketch after this list)

  2. Project through an output matrix to mix information across heads

This is the “multi” in multi-head attention.
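As a small preview (only a sketch; the full walkthrough comes in the next notebook), concatenation is just joining each position's two 8-dimensional head outputs into one 16-dimensional vector:

# Concatenate the two head outputs row by row: [5, 8] + [5, 8] -> [5, 16]
concat_preview = [attention_output_all[0][i] + attention_output_all[1][i]
                  for i in range(seq_len)]
print(f"Concatenated shape: [{len(concat_preview)}, {len(concat_preview[0])}]")  # [5, 16]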

# Store for next notebook
attention_data = {
    'attention_weights': attention_weights_all,
    'attention_output': attention_output_all,
    'X': X,
    'tokens': tokens,
    'Q': Q_all,
    'K': K_all,
    'V': V_all
}
print("Attention computation complete. Ready for multi-head combination.")
Attention computation complete. Ready for multi-head combination.