The Fundamental Problem¶
Neural networks only understand numbers.
Not words. Not characters. Not meaning. Just floating-point numbers arranged in vectors and matrices.
So before a transformer can do anything with text, we need to convert it into numbers. But not just any numbers—we need representations that capture something useful about what words mean and how they relate to each other.
This notebook covers that conversion: from the string “I like transformers” to a matrix of numbers the model can actually process.
Step 1: Tokenization (Text → Integer IDs)¶
The first step is simple: break text into pieces and assign each piece a number.
These pieces are called tokens. In real models, tokens might be words, parts of words (“un” + “believe” + “able”), or even individual characters. GPT-3 uses about 50,000 different tokens. We’ll use 6.
Our vocabulary:
| Token ID | Token | Purpose |
|---|---|---|
| 0 | <PAD> | Padding (for batching sequences of different lengths) |
| 1 | <BOS> | Beginning of sequence marker |
| 2 | <EOS> | End of sequence marker |
| 3 | I | The word “I” |
| 4 | like | The word “like” |
| 5 | transformers | The word “transformers” |
The special tokens (<PAD>, <BOS>, <EOS>) are conventions that help the model understand sequence boundaries. <BOS> says “a new sequence starts here.” <EOS> says “the sequence ends here.” <PAD> fills in gaps when sequences have different lengths.
import random
# Set seed for reproducibility
random.seed(42)
# Model dimensions
VOCAB_SIZE = 6 # Number of unique tokens
D_MODEL = 16 # Embedding dimension (size of each token's vector)
MAX_SEQ_LEN = 5 # Maximum sequence length
# Human-readable token names
TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
Tokenizing Our Input¶
Let’s convert “I like transformers” into token IDs:
Original text: "I like transformers"
With markers: <BOS> I like transformers <EOS>
Token IDs: [1, 3, 4, 5, 2]
We add <BOS> at the start and <EOS> at the end. The result is a list of 5 integers.
# Our tokenized input
tokens = [1, 3, 4, 5, 2] # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)
print(f"Token IDs: {tokens}")
print(f"As text: {' '.join(TOKEN_NAMES[t] for t in tokens)}")
print(f"Sequence length: {seq_len}")Token IDs: [1, 3, 4, 5, 2]
As text: <BOS> I like transformers <EOS>
Sequence length: 5
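In a real pipeline this lookup would live in a small helper rather than being written out by hand. Here is a minimal sketch, assuming a plain whitespace split over our toy vocabulary (real tokenizers use subword algorithms like BPE; the tokenize and detokenize helpers below are illustrative, not part of the model):
# Illustrative only: a whitespace tokenizer over our 6-token vocabulary.
token_to_id = {name: i for i, name in enumerate(TOKEN_NAMES)}

def tokenize(text):
    """Split on whitespace, map each word to its ID, and wrap with <BOS>/<EOS>."""
    return [token_to_id["<BOS>"]] + [token_to_id[w] for w in text.split()] + [token_to_id["<EOS>"]]

def detokenize(ids):
    """Map token IDs back to their human-readable names."""
    return " ".join(TOKEN_NAMES[i] for i in ids)

print(tokenize("I like transformers"))   # [1, 3, 4, 5, 2]
print(detokenize([1, 3, 4, 5, 2]))       # <BOS> I like transformers <EOS>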
The Language Modeling Task¶
Before we go further, let’s be clear about what the model is trying to do.
A language model predicts the next token. At each position, it looks at all the previous tokens and guesses what comes next. This is called autoregressive generation—each prediction depends only on what came before.
For our sequence:
| Position | Sees | Must Predict |
|---|---|---|
| 0 | <BOS> | I |
| 1 | <BOS> I | like |
| 2 | <BOS> I like | transformers |
| 3 | <BOS> I like transformers | <EOS> |
Position 4 is <EOS>, which marks the end—nothing to predict.
This is why we need a causal mask later (we’ll see it in the attention notebook). The model at position 2 shouldn’t be able to peek at positions 3 and 4; that would be cheating.
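The table above falls straight out of the token list with a bit of slicing; no model is involved yet:
# Next-token prediction targets: at position i the model sees tokens[0..i]
# and must predict tokens[i+1].
for i in range(seq_len - 1):
    context = " ".join(TOKEN_NAMES[t] for t in tokens[: i + 1])
    target = TOKEN_NAMES[tokens[i + 1]]
    print(f"Position {i}: sees '{context}'  ->  predicts '{target}'")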
Step 2: Token Embeddings (IDs → Vectors)¶
Okay, we have integers. But integers are terrible representations for learning.
Why? Because integers imply ordering and distance that doesn’t exist. Token 5 isn’t “bigger” than token 3. Token 4 isn’t “between” tokens 3 and 5 in any meaningful way. The numbering is arbitrary.
What we need is a continuous vector for each token—something we can do math with, something where distance and direction have meaning.
Enter: embeddings.
The Embedding Matrix¶
We create a matrix E_token of shape [vocab_size, d_model] = [6, 16]. Each row is a 16-dimensional vector representing one token.
To get the embedding for token ID i, we just look up row i of E_token. That’s it. Embeddings are just table lookups.
def random_vector(size, scale=0.1):
"""Generate a random vector with values drawn from N(0, scale^2)"""
return [random.gauss(0, scale) for _ in range(size)]
def format_vector(vec, decimals=4):
"""Format a vector as a readable string"""
return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"
# Initialize token embedding matrix with small random values
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]
print(f"Token Embedding Matrix E_token")
print(f"Shape: [{VOCAB_SIZE}, {D_MODEL}]")
print()
for i, row in enumerate(E_token):
print(f" Token {i} ({TOKEN_NAMES[i]:12s}): {format_vector(row)}")Token Embedding Matrix E_token
Shape: [6, 16]
Token 0 (<PAD> ): [-0.0144, -0.0173, -0.0111, 0.0702, -0.0128, -0.1497, 0.0332, -0.0267, -0.0217, 0.0116, 0.0232, 0.1164, 0.0657, 0.0111, -0.0738, -0.1015]
Token 1 (<BOS> ): [ 0.0246, 0.1311, 0.0042, -0.0106, 0.0532, -0.1454, -0.0312, 0.0490, 0.0873, -0.0241, 0.0377, 0.0248, 0.0782, -0.1113, 0.0568, -0.1515]
Token 2 (<EOS> ): [-0.2620, -0.0607, -0.0916, 0.0876, 0.0664, -0.1219, 0.0847, -0.1002, -0.0086, -0.0294, 0.0114, 0.0819, 0.0638, 0.0350, 0.0650, 0.0478]
Token 3 (I ): [-0.0627, -0.0717, -0.0470, 0.0499, -0.0250, 0.2336, -0.0819, -0.1099, 0.0768, 0.1422, 0.0506, 0.0836, 0.1426, -0.0094, -0.1423, -0.0532]
Token 4 (like ): [ 0.0953, -0.1444, 0.0034, 0.0253, -0.0316, 0.0724, 0.0581, 0.2321, 0.0620, -0.0609, -0.0562, -0.0832, 0.0952, -0.0567, -0.0070, 0.0749]
Token 5 (transformers): [-0.0723, -0.0294, -0.1841, -0.1082, -0.0568, 0.0416, 0.1193, -0.0018, 0.0261, 0.0168, 0.1085, 0.0893, 0.0274, -0.1011, 0.0903, 0.0381]
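Another way to see “embeddings are just table lookups”: selecting row i is equivalent to multiplying a one-hot vector by the embedding matrix. A quick sanity check (illustrative only, not part of the model):
# Lookup vs. one-hot matrix multiplication: both select row `token_id` of E_token.
token_id = 3  # "I"
one_hot = [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]

via_matmul = [sum(one_hot[i] * E_token[i][j] for i in range(VOCAB_SIZE)) for j in range(D_MODEL)]
via_lookup = E_token[token_id]

print("Same vector:", all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup)))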
Why Random Initialization?¶
These embeddings start as random noise. They don’t encode any meaning yet.
That’s fine. During training, the model will adjust these vectors so that:
Similar words end up with similar embeddings
The vectors capture useful features for the prediction task
Relationships between words are encoded in the geometry (e.g., king - man + woman ≈ queen)
We initialize with small random values (scale=0.1) to break symmetry and avoid numerical issues at the start of training.
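To make “they don’t encode any meaning yet” concrete, we can measure how similar two embeddings are. A small sketch using cosine similarity (the cosine_similarity helper below is ours, not part of the model); with random initialization the numbers are just noise, and training is what would push related words closer together:
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: ~1 = similar direction, ~0 = unrelated."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

# With random init, these similarities are just noise; nothing meaningful yet.
print(f"cos(I, like)         = {cosine_similarity(E_token[3], E_token[4]): .3f}")
print(f"cos(I, transformers) = {cosine_similarity(E_token[3], E_token[5]): .3f}")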
Step 3: Position Embeddings (Where in the Sequence?)¶
Here’s something weird about transformers: they have no built-in sense of order.
Think about it. The attention mechanism (which we’ll see later) compares every token to every other token simultaneously. It’s not processing left-to-right like a human reading. It’s looking at all tokens at once.
This means: without something extra, the model has no idea that “I” comes before “like” which comes before “transformers.” It would process “transformers like I” exactly the same way.
That’s... bad. Word order matters.
The Solution: Add Position Information¶
We give each position its own embedding vector, then add it to the token embedding:
X[i] = E_token[token_id] + E_pos[i]
The position embedding matrix has shape [max_seq_len, d_model] = [5, 16]. Each row is a unique vector for that position.
After adding position embeddings:
The word “I” at position 1 has a different representation than “I” at position 3
The model can learn that certain patterns occur at certain positions
# Initialize position embedding matrix
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]
print(f"Position Embedding Matrix E_pos")
print(f"Shape: [{MAX_SEQ_LEN}, {D_MODEL}]")
print()
for i, row in enumerate(E_pos):
print(f" Position {i}: {format_vector(row)}")Position Embedding Matrix E_pos
Shape: [5, 16]
Position 0: [ 0.1227, -0.0030, 0.1953, -0.0359, 0.1593, 0.0115, -0.0516, -0.1128, -0.0151, 0.1423, 0.0816, 0.0689, -0.2376, 0.0711, 0.0556, -0.0550]
Position 1: [-0.0627, -0.0002, 0.1725, -0.1055, -0.0428, 0.1362, -0.0446, -0.0364, 0.0098, -0.1241, 0.0220, -0.1210, 0.0885, 0.0003, 0.2283, 0.0281]
Position 2: [ 0.1366, -0.1303, -0.0122, 0.0323, 0.1746, -0.1681, 0.0991, 0.0591, 0.1534, 0.0712, 0.0052, -0.0522, -0.1248, 0.0195, -0.0192, 0.2020]
Position 3: [-0.0611, 0.0320, -0.1569, -0.0395, 0.0261, 0.0824, 0.1448, -0.0044, -0.1117, 0.0458, 0.0517, 0.0492, -0.0700, 0.1133, 0.0088, 0.0700]
Position 4: [ 0.1274, 0.0609, 0.0287, 0.2153, 0.0244, -0.0296, 0.0112, 0.1483, 0.0119, 0.0519, 0.1196, -0.0513, -0.1727, 0.0299, 0.0230, -0.0608]
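A quick check of the earlier claim that the same token gets a different representation at different positions (inline sums here, since the add_vectors helper is defined a little later):
# "I" (token ID 3) at position 1 vs. position 3: same token embedding,
# different position embedding, so the combined vectors differ.
i_at_pos_1 = [t + p for t, p in zip(E_token[3], E_pos[1])]
i_at_pos_3 = [t + p for t, p in zip(E_token[3], E_pos[3])]
print("Identical representations?", i_at_pos_1 == i_at_pos_3)  # False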
Why Add Instead of Concatenate?¶
You might wonder: why add the position embedding to the token embedding? Why not concatenate them (stick them side by side)?
Concatenation would work, but it has a cost: it would increase the dimension. If token embeddings are 16-dimensional and position embeddings are 16-dimensional, concatenation gives you 32-dimensional vectors. Every subsequent layer would need to be larger.
Addition is cheaper. We keep the same dimension, and the model learns to “share” the 16 dimensions between token identity and position information.
(There’s also a deeper reason: addition lets the model learn to ignore position when it doesn’t matter, by having position embeddings that are nearly orthogonal to the directions the model cares about. But that’s getting into the weeds.)
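The trade-off is easy to see with the vectors we already have (pure list operations, not model code):
# Concatenation doubles the width; addition keeps d_model fixed.
tok_vec = E_token[3]  # 16-dim embedding for "I"
pos_vec = E_pos[1]    # 16-dim embedding for position 1

concatenated = tok_vec + pos_vec                   # Python list concat -> 32 values
added = [t + p for t, p in zip(tok_vec, pos_vec)]  # element-wise sum   -> 16 values

print(f"Concatenated length: {len(concatenated)}")  # 32: every later layer would need to grow
print(f"Added length:        {len(added)}")         # 16: stays at D_MODEL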
Putting It Together: The Input Matrix X¶
Now we can build the actual input to our transformer.
For each position i in our sequence:
Look up the token embedding: E_token[token_id]
Look up the position embedding: E_pos[i]
Add them together: X[i] = E_token[token_id] + E_pos[i]
Let’s compute this for our sequence [1, 3, 4, 5, 2]:
def add_vectors(v1, v2):
"""Element-wise addition of two vectors"""
return [a + b for a, b in zip(v1, v2)]
# Look up token embeddings for our sequence
token_embeddings = [E_token[token_id] for token_id in tokens]
# Add position embeddings
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]
print("Computing input embeddings X = E_token[token_id] + E_pos[position]")
print("=" * 80)
print()
for i in range(seq_len):
token_id = tokens[i]
token_name = TOKEN_NAMES[token_id]
print(f"Position {i}: token '{token_name}' (ID {token_id})")
print(f" Token embedding E_token[{token_id}]:")
print(f" {format_vector(token_embeddings[i])}")
print(f" Position embedding E_pos[{i}]:")
print(f" {format_vector(E_pos[i])}")
print(f" Sum (X[{i}]):")
print(f" {format_vector(X[i])}")
    print()
Computing input embeddings X = E_token[token_id] + E_pos[position]
================================================================================
Position 0: token '<BOS>' (ID 1)
Token embedding E_token[1]:
[ 0.0246, 0.1311, 0.0042, -0.0106, 0.0532, -0.1454, -0.0312, 0.0490, 0.0873, -0.0241, 0.0377, 0.0248, 0.0782, -0.1113, 0.0568, -0.1515]
Position embedding E_pos[0]:
[ 0.1227, -0.0030, 0.1953, -0.0359, 0.1593, 0.0115, -0.0516, -0.1128, -0.0151, 0.1423, 0.0816, 0.0689, -0.2376, 0.0711, 0.0556, -0.0550]
Sum (X[0]):
[ 0.1473, 0.1281, 0.1995, -0.0465, 0.2125, -0.1338, -0.0829, -0.0638, 0.0722, 0.1183, 0.1193, 0.0937, -0.1594, -0.0402, 0.1124, -0.2064]
Position 1: token 'I' (ID 3)
Token embedding E_token[3]:
[-0.0627, -0.0717, -0.0470, 0.0499, -0.0250, 0.2336, -0.0819, -0.1099, 0.0768, 0.1422, 0.0506, 0.0836, 0.1426, -0.0094, -0.1423, -0.0532]
Position embedding E_pos[1]:
[-0.0627, -0.0002, 0.1725, -0.1055, -0.0428, 0.1362, -0.0446, -0.0364, 0.0098, -0.1241, 0.0220, -0.1210, 0.0885, 0.0003, 0.2283, 0.0281]
Sum (X[1]):
[-0.1254, -0.0720, 0.1255, -0.0556, -0.0678, 0.3698, -0.1265, -0.1463, 0.0866, 0.0181, 0.0726, -0.0374, 0.2312, -0.0091, 0.0860, -0.0251]
Position 2: token 'like' (ID 4)
Token embedding E_token[4]:
[ 0.0953, -0.1444, 0.0034, 0.0253, -0.0316, 0.0724, 0.0581, 0.2321, 0.0620, -0.0609, -0.0562, -0.0832, 0.0952, -0.0567, -0.0070, 0.0749]
Position embedding E_pos[2]:
[ 0.1366, -0.1303, -0.0122, 0.0323, 0.1746, -0.1681, 0.0991, 0.0591, 0.1534, 0.0712, 0.0052, -0.0522, -0.1248, 0.0195, -0.0192, 0.2020]
Sum (X[2]):
[ 0.2319, -0.2747, -0.0089, 0.0576, 0.1430, -0.0957, 0.1571, 0.2913, 0.2154, 0.0103, -0.0510, -0.1353, -0.0296, -0.0371, -0.0262, 0.2770]
Position 3: token 'transformers' (ID 5)
Token embedding E_token[5]:
[-0.0723, -0.0294, -0.1841, -0.1082, -0.0568, 0.0416, 0.1193, -0.0018, 0.0261, 0.0168, 0.1085, 0.0893, 0.0274, -0.1011, 0.0903, 0.0381]
Position embedding E_pos[3]:
[-0.0611, 0.0320, -0.1569, -0.0395, 0.0261, 0.0824, 0.1448, -0.0044, -0.1117, 0.0458, 0.0517, 0.0492, -0.0700, 0.1133, 0.0088, 0.0700]
Sum (X[3]):
[-0.1334, 0.0027, -0.3410, -0.1478, -0.0307, 0.1240, 0.2642, -0.0063, -0.0856, 0.0626, 0.1602, 0.1385, -0.0427, 0.0122, 0.0991, 0.1081]
Position 4: token '<EOS>' (ID 2)
Token embedding E_token[2]:
[-0.2620, -0.0607, -0.0916, 0.0876, 0.0664, -0.1219, 0.0847, -0.1002, -0.0086, -0.0294, 0.0114, 0.0819, 0.0638, 0.0350, 0.0650, 0.0478]
Position embedding E_pos[4]:
[ 0.1274, 0.0609, 0.0287, 0.2153, 0.0244, -0.0296, 0.0112, 0.1483, 0.0119, 0.0519, 0.1196, -0.0513, -0.1727, 0.0299, 0.0230, -0.0608]
Sum (X[4]):
[-0.1346, 0.0002, -0.0629, 0.3029, 0.0908, -0.1515, 0.0959, 0.0481, 0.0032, 0.0225, 0.1310, 0.0306, -0.1088, 0.0649, 0.0880, -0.0130]
The Resulting Matrix¶
We now have our input matrix X with shape [seq_len, d_model] = [5, 16].
Each row is a 16-dimensional vector representing one token at one position. This matrix is the input to the transformer block.
print("=" * 80)
print("FINAL INPUT MATRIX X")
print("=" * 80)
print(f"Shape: [{seq_len}, {D_MODEL}]")
print()
for i, row in enumerate(X):
token_name = TOKEN_NAMES[tokens[i]]
print(f" X[{i}] = {format_vector(row)} # {token_name}")================================================================================
FINAL INPUT MATRIX X
================================================================================
Shape: [5, 16]
X[0] = [ 0.1473, 0.1281, 0.1995, -0.0465, 0.2125, -0.1338, -0.0829, -0.0638, 0.0722, 0.1183, 0.1193, 0.0937, -0.1594, -0.0402, 0.1124, -0.2064] # <BOS>
X[1] = [-0.1254, -0.0720, 0.1255, -0.0556, -0.0678, 0.3698, -0.1265, -0.1463, 0.0866, 0.0181, 0.0726, -0.0374, 0.2312, -0.0091, 0.0860, -0.0251] # I
X[2] = [ 0.2319, -0.2747, -0.0089, 0.0576, 0.1430, -0.0957, 0.1571, 0.2913, 0.2154, 0.0103, -0.0510, -0.1353, -0.0296, -0.0371, -0.0262, 0.2770] # like
X[3] = [-0.1334, 0.0027, -0.3410, -0.1478, -0.0307, 0.1240, 0.2642, -0.0063, -0.0856, 0.0626, 0.1602, 0.1385, -0.0427, 0.0122, 0.0991, 0.1081] # transformers
X[4] = [-0.1346, 0.0002, -0.0629, 0.3029, 0.0908, -0.1515, 0.0959, 0.0481, 0.0032, 0.0225, 0.1310, 0.0306, -0.1088, 0.0649, 0.0880, -0.0130] # <EOS>
What We’ve Built¶
Let’s recap the transformation:
"I like transformers"
↓ tokenization
[1, 3, 4, 5, 2] # 5 integers
↓ token embeddings
[5, 16] matrix # 5 rows × 16 columns
↓ add position embeddings
[5, 16] matrix (X)    # final input
We’ve gone from a string to a matrix of continuous values. The model can now:
Do math with these vectors (add, multiply, take dot products)
Learn to adjust the embeddings during training
Use the position information to understand word order
This matrix will flow through the rest of the transformer: attention, feed-forward layers, layer normalization, and finally produce predictions.
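For example, the dot product (the comparison attention will rely on) is now well defined between any two rows of X. A tiny sketch:
# A taste of the math that is now possible: dot products between rows of X.
def dot(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

print(f"X[1] . X[2]  ('I' vs 'like')         = {dot(X[1], X[2]): .4f}")
print(f"X[1] . X[3]  ('I' vs 'transformers') = {dot(X[1], X[3]): .4f}")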
What’s Next¶
The input matrix is ready. Now comes the core of the transformer: self-attention.
Before we can compute attention, we need to project into three different representations:
Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What information should I pass along?”
These projections are called QKV projections, and they’re the subject of the next notebook.
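As a shape-only preview (the weight matrix below is a hypothetical random stand-in; the real, learned projections are built in the next notebook), each projection is just X, shape [5, 16], multiplied by a [16, 16] weight matrix:
# Illustrative preview: projecting X with a random 16x16 matrix gives a [5, 16] result,
# which is the shape Q, K, and V will each have.
W_demo = [random_vector(D_MODEL) for _ in range(D_MODEL)]  # hypothetical stand-in for W_Q

def project(X_mat, W):
    """Row-by-row matrix multiply: [seq_len, d_model] x [d_model, d_model]."""
    return [[sum(x[k] * W[k][j] for k in range(D_MODEL)) for j in range(D_MODEL)] for x in X_mat]

Q_demo = project(X, W_demo)
print(f"Projected shape: [{len(Q_demo)}, {len(Q_demo[0])}]")  # [5, 16]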
# Store data for use in subsequent notebooks
embedding_data = {
'X': X,
'E_token': E_token,
'E_pos': E_pos,
'tokens': tokens,
'TOKEN_NAMES': TOKEN_NAMES,
'VOCAB_SIZE': VOCAB_SIZE,
'D_MODEL': D_MODEL,
'MAX_SEQ_LEN': MAX_SEQ_LEN
}
print("Embedding data ready for the next notebook.")Embedding data ready for the next notebook.