Cross-Entropy Loss - An Introduction to Transformers

The Final Step of the Forward Pass¶

We’ve come a long way. Our input text has been tokenized, embedded, processed through attention, transformed by the feed-forward network, and stabilized with layer normalization. Each token now has a 16-dimensional representation that encodes both its identity and its context.

But the model hasn’t actually made any predictions yet.

We have hidden states—rich, context-aware vectors. But we need probabilities. We need the model to tell us: “Given everything I’ve seen so far, what token should come next?”

This notebook covers the final two operations of the forward pass:

Output projection: Convert 16-dimensional hidden states to 6-dimensional logits (one score per vocabulary token)
Loss computation: Measure how wrong our predictions are using cross-entropy

The loss is the single number that tells us how badly the model performed. It’s what we’ll be minimizing during training.

The Language Modeling Task¶

Before we dive into the math, let’s be crystal clear about what our model is trying to do.

A language model predicts the next token given all previous tokens. This is called autoregressive generation—each prediction depends only on what came before, not what comes after.

Our input sequence is:

Text:       <BOS>  I   like  transformers  <EOS>
Token IDs:  [1,    3,  4,    5,            2]
Positions:  [0,    1,  2,    3,            4]

At each position, the model must predict what comes next:

Position	Input Token	Target (Next Token)
0	`<BOS>`	`I` (token 3)
1	`I`	`like` (token 4)
2	`like`	`transformers` (token 5)
3	`transformers`	`<EOS>` (token 2)
4	`<EOS>`	— (sequence ended)

Position 4 is the end-of-sequence marker. There’s nothing to predict after that. So we have 4 predictions to make and 4 losses to compute.

(This is why the causal mask in attention was so important—when predicting at position 2, the model can only see positions 0, 1, and 2. It can’t peek ahead at “transformers” or “”.)

import random
import math

random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
D_FF = 64
MAX_SEQ_LEN = 5
NUM_HEADS = 2
D_K = D_MODEL // NUM_HEADS
EPSILON = 1e-5

TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

# Helper functions
def random_vector(size, scale=0.1):
    return [random.gauss(0, scale) for _ in range(size)]

def random_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def add_vectors(v1, v2):
    return [a + b for a, b in zip(v1, v2)]

def matmul(A, B):
    m, n = len(A), len(A[0])
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]

def transpose(A):
    return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

def softmax(vec):
    max_val = max(vec)
    exp_vec = [math.exp(v - max_val) for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def softmax_causal(vec):
    max_val = max(v for v in vec if v != float('-inf'))
    exp_vec = [math.exp(v - max_val) if v != float('-inf') else 0 for v in vec]
    sum_exp = sum(exp_vec)
    return [e / sum_exp for e in exp_vec]

def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, epsilon=1e-5):
    mean = sum(x) / len(x)
    variance = sum((xi - mean)**2 for xi in x) / len(x)
    std = math.sqrt(variance + epsilon)
    x_norm = [(xi - mean) / std for xi in x]
    return [gamma[i] * x_norm[i] + beta[i] for i in range(len(x))]

def format_vector(vec, decimals=4):
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

# Recreate the full forward pass from previous notebooks
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
tokens = [1, 3, 4, 5, 2]
seq_len = len(tokens)
X = [add_vectors(E_token[tokens[i]], E_pos[i]) for i in range(seq_len)]

# Attention
W_Q = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_K = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
W_V = [random_matrix(D_MODEL, D_K) for _ in range(NUM_HEADS)]
Q_all = [matmul(X, W_Q[h]) for h in range(NUM_HEADS)]
K_all = [matmul(X, W_K[h]) for h in range(NUM_HEADS)]
V_all = [matmul(X, W_V[h]) for h in range(NUM_HEADS)]

def compute_attention(Q, K, V):
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    scores = matmul(Q, transpose(K))
    scaled = [[s / scale for s in row] for row in scores]
    for i in range(seq_len):
        for j in range(seq_len):
            if j > i:
                scaled[i][j] = float('-inf')
    weights = [softmax_causal(row) for row in scaled]
    return matmul(weights, V)

attention_output_all = [compute_attention(Q_all[h], K_all[h], V_all[h]) for h in range(NUM_HEADS)]
concat_output = [attention_output_all[0][i] + attention_output_all[1][i] for i in range(seq_len)]
W_O = random_matrix(D_MODEL, D_MODEL)
multi_head_output = matmul(concat_output, transpose(W_O))

# FFN
W1 = random_matrix(D_FF, D_MODEL)
b1 = random_vector(D_FF)
W2 = random_matrix(D_MODEL, D_FF)
b2 = random_vector(D_MODEL)
hidden = [[sum(multi_head_output[i][k] * W1[j][k] for k in range(D_MODEL)) + b1[j] for j in range(D_FF)] for i in range(seq_len)]
activated = [[gelu(h) for h in row] for row in hidden]
ffn_output = [[sum(activated[i][k] * W2[j][k] for k in range(D_FF)) + b2[j] for j in range(D_MODEL)] for i in range(seq_len)]

# Residual + LayerNorm
residual = [add_vectors(multi_head_output[i], ffn_output[i]) for i in range(seq_len)]
gamma = [1.0] * D_MODEL
beta = [0.0] * D_MODEL
layer_norm_output = [layer_norm(residual[i], gamma, beta, EPSILON) for i in range(seq_len)]

print("Recreated full forward pass through transformer block")
print(f"Hidden states shape: [{seq_len}, {D_MODEL}]")

Recreated full forward pass through transformer block
Hidden states shape: [5, 16]

Step 1: Output Projection (Hidden States → Logits)¶

Our hidden states are 16-dimensional vectors. But our vocabulary has 6 tokens. We need to convert from d_model = 16 dimensions to vocab_size = 6 dimensions.

This is done by the language modeling head (often called the “LM head” or “output projection”)—a simple linear layer with no bias:

\text{logits} = \text{hidden} \cdot W_{lm}^T

(1)

Where:

$\text{hidden}$ has shape [seq_len, d_model] = [5, 16]
$W_{lm}$ has shape [vocab_size, d_model] = [6, 16]
$\text{logits}$ has shape [seq_len, vocab_size] = [5, 6]

What are logits?

Logits are raw, unnormalized scores. Each logit represents how strongly the model believes in that token being the correct prediction. Higher logits = more confidence. But logits can be negative, and they don’t sum to 1—they’re not probabilities yet.

# Initialize LM head weight matrix
W_lm = random_matrix(VOCAB_SIZE, D_MODEL)  # [6, 16]

print(f"LM Head Weight Matrix W_lm")
print(f"Shape: [{VOCAB_SIZE}, {D_MODEL}]")
print(f"Parameters: {VOCAB_SIZE * D_MODEL} = 96")

LM Head Weight Matrix W_lm
Shape: [6, 16]
Parameters: 96 = 96

# Compute logits: hidden_state @ W_lm^T
W_lm_T = transpose(W_lm)
logits = matmul(layer_norm_output, W_lm_T)

print("Logits (unnormalized scores)")
print(f"Shape: [{seq_len}, {VOCAB_SIZE}]")
print()
print(f"{'Position':<12} {'<PAD>':>8} {'<BOS>':>8} {'<EOS>':>8} {'I':>8} {'like':>8} {'trans':>8}")
print("-"*70)
for i, row in enumerate(logits):
    print(f"{TOKEN_NAMES[tokens[i]]:<12} {row[0]:>8.4f} {row[1]:>8.4f} {row[2]:>8.4f} {row[3]:>8.4f} {row[4]:>8.4f} {row[5]:>8.4f}")

Logits (unnormalized scores)
Shape: [5, 6]

Position        <PAD>    <BOS>    <EOS>        I     like    trans
----------------------------------------------------------------------
<BOS>          0.3362  -0.1025  -0.0514  -0.0053   0.2907   0.0848
I              0.3332  -0.1114  -0.0743  -0.0182   0.2730   0.0714
like           0.3179  -0.1641  -0.0885  -0.0041   0.3290   0.0326
transformers   0.3252  -0.1279  -0.0646  -0.0031   0.3202   0.0518
<EOS>          0.3209  -0.1388  -0.0673   0.0066   0.3317   0.0356

Understanding the Logits¶

Each row gives us scores for all 6 vocabulary tokens at that position. Look at position 0 (<BOS>):

The model needs to predict I (token 3) here. But with random weights, the scores are essentially random. The model has no idea yet that <BOS> should predict I.

Let’s see a detailed calculation for one logit:

# Detailed calculation for logits[0][3] (position 0 predicting token 3 "I")
print("Detailed: Computing logits[0][3] (probability score for 'I' at position 0)")
print("="*70)
print()
print("logits[0][3] = hidden[0] · W_lm[3]")
print()
print(f"hidden[0] (16 dims): {format_vector(layer_norm_output[0])}")
print()
print(f"W_lm[3] (16 dims):   {format_vector(W_lm[3])}")
print()
dot_product = sum(layer_norm_output[0][j] * W_lm[3][j] for j in range(D_MODEL))
print(f"Dot product = {dot_product:.6f}")
print(f"Actual logits[0][3] = {logits[0][3]:.6f}")

Detailed: Computing logits[0][3] (probability score for 'I' at position 0)
======================================================================

logits[0][3] = hidden[0] · W_lm[3]

hidden[0] (16 dims): [-0.1319, -1.2702, -0.4969,  1.5600,  0.6278,  0.4769, -2.2261,  0.1465,  1.5670,  0.2124, -0.6301, -0.9558,  0.8655, -0.7691,  0.9595,  0.0645]

W_lm[3] (16 dims):   [-0.0919,  0.0860, -0.0569,  0.0092, -0.0476, -0.1328, -0.0465,  0.1397, -0.0415, -0.0676, -0.0428, -0.0335,  0.0229, -0.0412, -0.0098, -0.0491]

Dot product = -0.005250
Actual logits[0][3] = -0.005250

Step 2: Softmax (Logits → Probabilities)¶

Logits are useful for computation, but humans think in probabilities. And the loss function needs probabilities.

Softmax converts arbitrary real numbers into a valid probability distribution:

P(\text{token}_i) = \frac{\exp(\text{logit}_i)}{\sum_{j=1}^{V} \exp(\text{logit}_j)}

(2)

Where $V$ is the vocabulary size (6 in our case).

Why softmax works:

Exponentiation ( $\exp$ ) makes all values positive
Dividing by the sum ensures everything adds to 1
Larger logits dominate because exp grows exponentially

If one logit is much larger than the others, it will have probability close to 1. If all logits are similar, probabilities will be uniform.

Numerical stability:

In practice, we subtract the maximum logit before exponentiating:

P(\text{token}_i) = \frac{\exp(\text{logit}_i - \max(\text{logits}))}{\sum_{j} \exp(\text{logit}_j - \max(\text{logits}))}

(3)

This prevents overflow when logits are large. It doesn’t change the result because:

\frac{\exp(a - c)}{\exp(b - c)} = \frac{\exp(a)}{\exp(b)}

(4)

# Detailed softmax calculation for position 0
print("Detailed: Softmax for position 0")
print("="*60)
print()
print(f"Logits: {format_vector(logits[0])}")
print()

max_logit = max(logits[0])
print(f"Step 1: Subtract max for stability")
print(f"  max(logits) = {max_logit:.4f}")
shifted = [l - max_logit for l in logits[0]]
print(f"  shifted = {format_vector(shifted)}")
print()

print(f"Step 2: Exponentiate")
exp_vals = [math.exp(s) for s in shifted]
print(f"  exp(shifted) = {format_vector(exp_vals)}")
print()

print(f"Step 3: Normalize")
sum_exp = sum(exp_vals)
print(f"  sum = {sum_exp:.4f}")
probs_manual = [e / sum_exp for e in exp_vals]
print(f"  probabilities = {format_vector(probs_manual)}")
print(f"  sum of probs = {sum(probs_manual):.4f}")

Detailed: Softmax for position 0
============================================================

Logits: [ 0.3362, -0.1025, -0.0514, -0.0053,  0.2907,  0.0848]

Step 1: Subtract max for stability
  max(logits) = 0.3362
  shifted = [ 0.0000, -0.4387, -0.3876, -0.3415, -0.0455, -0.2514]

Step 2: Exponentiate
  exp(shifted) = [ 1.0000,  0.6449,  0.6787,  0.7107,  0.9555,  0.7777]

Step 3: Normalize
  sum = 4.7674
  probabilities = [ 0.2098,  0.1353,  0.1424,  0.1491,  0.2004,  0.1631]
  sum of probs = 1.0000

# Apply softmax to all positions
probs = [softmax(row) for row in logits]

print("Probabilities (after softmax)")
print(f"Shape: [{seq_len}, {VOCAB_SIZE}]")
print()
print(f"{'Position':<12} {'<PAD>':>8} {'<BOS>':>8} {'<EOS>':>8} {'I':>8} {'like':>8} {'trans':>8} {'Sum':>8}")
print("-"*80)
for i, row in enumerate(probs):
    row_sum = sum(row)
    print(f"{TOKEN_NAMES[tokens[i]]:<12} {row[0]:>8.4f} {row[1]:>8.4f} {row[2]:>8.4f} {row[3]:>8.4f} {row[4]:>8.4f} {row[5]:>8.4f} {row_sum:>8.4f}")

Probabilities (after softmax)
Shape: [5, 6]

Position        <PAD>    <BOS>    <EOS>        I     like    trans      Sum
--------------------------------------------------------------------------------
<BOS>          0.2098   0.1353   0.1424   0.1491   0.2004   0.1631   1.0000
I              0.2118   0.1358   0.1409   0.1490   0.1994   0.1630   1.0000
like           0.2096   0.1294   0.1396   0.1519   0.2119   0.1576   1.0000
transformers   0.2088   0.1327   0.1414   0.1504   0.2078   0.1589   1.0000
<EOS>          0.2082   0.1315   0.1412   0.1521   0.2105   0.1565   1.0000

Interpreting the Probabilities¶

Each row now sums to 1.0000—a valid probability distribution. Look at position 0 (<BOS>):

The model assigns about equal probability to all tokens. It gives <PAD> the highest probability (~21%) and <BOS> the lowest (~14%). But it should be predicting I with high confidence.

This is what an untrained model looks like: random guessing. The probabilities are roughly uniform because the random weights don’t encode any useful patterns yet.

Step 3: Cross-Entropy Loss¶

Now we need a single number that measures how wrong our predictions are. This is the loss function.

For language modeling, we use cross-entropy loss:

L = -\log P(\text{correct token})

(5)

That’s it. Take the probability the model assigned to the correct answer, take its logarithm, and negate it.

Why does this make sense?

If the model is confident and correct ( $P = 0.99$ ): $L = -\log(0.99) \approx 0.01$ (low loss, good!)
If the model is uncertain ( $P = 0.5$ ): $L = -\log(0.5) \approx 0.69$ (medium loss)
If the model is wrong ( $P = 0.01$ ): $L = -\log(0.01) \approx 4.6$ (high loss, bad!)
If the model is completely wrong ( $P \to 0$ ): $L \to \infty$ (catastrophic)

The negative log has nice properties:

It’s always positive (since probabilities are between 0 and 1)
It’s 0 when we’re perfectly confident and correct
It goes to infinity as we become more wrong
It penalizes confident wrong answers more than uncertain ones

Connection to information theory:

Cross-entropy has deep roots in information theory. $-\log_2 P$ is the number of bits needed to encode an event with probability $P$ . Minimizing cross-entropy is equivalent to maximizing the likelihood of the data under the model.

# Define target tokens (what we should predict)
# At position i, we predict token i+1
targets = [3, 4, 5, 2]  # I, like, transformers, <EOS>

print("Targets (what the model should predict)")
print("="*60)
print()
for i in range(len(targets)):
    print(f"Position {i}: {TOKEN_NAMES[tokens[i]]:12s} → should predict: {TOKEN_NAMES[targets[i]]} (token {targets[i]})")

Targets (what the model should predict)
============================================================

Position 0: <BOS>        → should predict: I (token 3)
Position 1: I            → should predict: like (token 4)
Position 2: like         → should predict: transformers (token 5)
Position 3: transformers → should predict: <EOS> (token 2)

# Detailed cross-entropy calculation for position 0
print("Detailed: Cross-entropy loss for position 0")
print("="*60)
print()
print(f"Current token: {TOKEN_NAMES[tokens[0]]} (position 0)")
print(f"Target token: {TOKEN_NAMES[targets[0]]} (token ID {targets[0]})")
print()
print(f"Probability distribution: {format_vector(probs[0])}")
print(f"                          <PAD>  <BOS>  <EOS>    I     like  trans")
print()
prob_target = probs[0][targets[0]]
print(f"P(target) = P('I') = probs[0][3] = {prob_target:.6f}")
print()
loss_0 = -math.log(prob_target)
print(f"Loss = -log({prob_target:.6f}) = {loss_0:.6f}")

Detailed: Cross-entropy loss for position 0
============================================================

Current token: <BOS> (position 0)
Target token: I (token ID 3)

Probability distribution: [ 0.2098,  0.1353,  0.1424,  0.1491,  0.2004,  0.1631]
                          <PAD>  <BOS>  <EOS>    I     like  trans

P(target) = P('I') = probs[0][3] = 0.149082

Loss = -log(0.149082) = 1.903260

# Compute cross-entropy loss for all positions
losses = []

print("Cross-Entropy Loss Calculation")
print("="*70)
print()
print(f"{'Position':<12} {'Current':<12} {'Target':<12} {'P(target)':>10} {'Loss':>10}")
print("-"*70)

for i in range(len(targets)):
    target = targets[i]
    prob_target = probs[i][target]
    loss = -math.log(prob_target)
    losses.append(loss)
    print(f"{i:<12} {TOKEN_NAMES[tokens[i]]:<12} {TOKEN_NAMES[target]:<12} {prob_target:>10.4f} {loss:>10.4f}")

total_loss = sum(losses)
avg_loss = total_loss / len(losses)
print("-"*70)
print(f"{'Total loss':<36} {' ':>10} {total_loss:>10.4f}")
print(f"{'Average loss':<36} {' ':>10} {avg_loss:>10.4f}")

Cross-Entropy Loss Calculation
======================================================================

Position     Current      Target        P(target)       Loss
----------------------------------------------------------------------
0            <BOS>        I                0.1491     1.9033
1            I            like             0.1994     1.6123
2            like         transformers     0.1576     1.8479
3            transformers <EOS>            0.1414     1.9560
----------------------------------------------------------------------
Total loss                                          7.3195
Average loss                                        1.8299

Interpreting the Loss¶

Our average loss is about 1.83. Is that good or bad?

Baseline: random guessing

If the model assigned equal probability to all 6 tokens (uniform distribution), the probability of any token would be $1/6 \approx 0.167$ .

The loss for random guessing:

L_{\text{random}} = -\log(1/6) = \log(6) \approx 1.79

(6)

What our loss tells us:

Our model’s loss (~1.83) is slightly worse than random guessing. That’s exactly what we’d expect from an untrained model with random weights. The model hasn’t learned anything yet—it’s essentially flipping a 6-sided die.

What we want:

A well-trained model should have loss close to 0, meaning it predicts the correct next token with high probability.

Loss Value	Interpretation
~1.79	Random guessing (uniform over 6 tokens)
~1.0	Model has learned some patterns
~0.5	Model is fairly confident and usually correct
~0.1	Model is very good at this task
~0.0	Perfect predictions (never happens in practice)

# Compare our loss to random guessing
random_loss = -math.log(1/VOCAB_SIZE)

print("Loss Comparison")
print("="*40)
print(f"Our model's average loss: {avg_loss:.4f}")
print(f"Random guessing loss:     {random_loss:.4f}")
print()

if avg_loss > random_loss:
    diff = avg_loss - random_loss
    print(f"We're {diff:.4f} worse than random guessing.")
    print("This is expected for an untrained model!")
else:
    diff = random_loss - avg_loss
    print(f"We're {diff:.4f} better than random guessing.")
    print("Got lucky with weight initialization!")

Loss Comparison
========================================
Our model's average loss: 1.8299
Random guessing loss:     1.7918

We're 0.0381 worse than random guessing.
This is expected for an untrained model!

Perplexity: A More Intuitive Metric¶

Loss values like 1.83 are hard to interpret. Perplexity is a more intuitive alternative.

\text{Perplexity} = \exp(\text{loss})

(7)

Perplexity can be thought of as “the effective number of choices the model is considering.” A perplexity of 6 means the model is as uncertain as if it were choosing uniformly among 6 options.

Loss	Perplexity	Interpretation
1.79	6.0	Random among 6 tokens
1.10	3.0	Narrowed to ~3 likely tokens
0.69	2.0	Coin flip between 2 tokens
0.10	1.1	Very confident, ~1 choice
0.00	1.0	Perfect certainty

perplexity = math.exp(avg_loss)
random_perplexity = math.exp(random_loss)

print("Perplexity")
print("="*40)
print(f"Our perplexity: {perplexity:.2f}")
print(f"Random perplexity: {random_perplexity:.2f}")
print()
print(f"The model is as uncertain as if it were choosing")
print(f"uniformly among {perplexity:.1f} tokens.")

Perplexity
========================================
Our perplexity: 6.23
Random perplexity: 6.00

The model is as uncertain as if it were choosing
uniformly among 6.2 tokens.

The Forward Pass is Complete!¶

We’ve traced the entire forward pass from text to loss:

"I like transformers"
        ↓
    [Tokenization]        → [1, 3, 4, 5, 2]
        ↓
    [Embeddings]          → [5, 16] matrix (token + position)
        ↓
    [Q/K/V Projections]   → Query, Key, Value matrices
        ↓
    [Attention]           → Context-aware representations
        ↓
    [Multi-head]          → Combined from 2 heads
        ↓
    [Feed-forward]        → Non-linear transformations
        ↓
    [Layer norm]          → Stabilized activations
        ↓
    [Output projection]   → [5, 6] logits
        ↓
    [Softmax]             → [5, 6] probabilities
        ↓
    [Cross-entropy]       → Loss = 1.83

The loss (1.83) tells us the model is performing at random-guessing level. That’s our starting point.

What’s Next: Backpropagation¶

We know the model is wrong. The question now is: how do we make it less wrong?

We have about 2,600 parameters (embedding matrices, attention weights, FFN weights, layer norm parameters, LM head). Each one contributes somehow to the final loss.

Backpropagation answers the question: “For each parameter, if I nudged it slightly, how much would the loss change?”

This “how much would the loss change” is the gradient. Once we have gradients for all parameters, we can update them to reduce the loss.

The next notebook starts the backward pass: computing gradients by walking backward through the computation graph, applying the chain rule at each step.

# Store everything for backpropagation
forward_pass_data = {
    'tokens': tokens,
    'targets': targets,
    'X': X,
    'layer_norm_output': layer_norm_output,
    'logits': logits,
    'probs': probs,
    'losses': losses,
    'avg_loss': avg_loss,
    # All weights
    'E_token': E_token,
    'E_pos': E_pos,
    'W_Q': W_Q, 'W_K': W_K, 'W_V': W_V,
    'W_O': W_O,
    'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2,
    'W_lm': W_lm,
    'gamma': gamma, 'beta': beta
}

print(f"Forward pass complete.")
print(f"Average loss: {avg_loss:.4f}")
print(f"Perplexity: {perplexity:.2f}")
print()
print("Data stored for backpropagation.")

Forward pass complete.
Average loss: 1.8299
Perplexity: 6.23

Data stored for backpropagation.