The Complete Transformer

The complete picture: We now assemble all of our components into a working decoder-only transformer (GPT-style): a full language model that can be trained to predict the next token in a sequence.

What is “decoder-only”? The original transformer paper had both an encoder (for reading input) and decoder (for generating output), used for translation. Modern language models like GPT use only the decoder part, which is simpler and works great for text generation. The key difference is that decoder-only models use causal masking—they can only look at previous tokens, not future ones.
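To make causal masking concrete, here is a minimal sketch (using torch.tril; the project's own mask helper may differ) of the mask for a sequence of four tokens. Row i marks which positions token i is allowed to attend to:

import torch

seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
# Row i is True only for columns 0..i: each token can attend to itself and
# earlier tokens, never to future ones.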

  1. Token Embedding: Convert input token IDs (integers) to dense vectors

  2. Positional Encoding: Add position information to tell the model where each token is

  3. Transformer Blocks (×N): Stack multiple identical blocks (we use 6; GPT-3 uses 96). Each block refines the representations through attention and feed-forward processing

  4. Final LayerNorm: One last normalization to stabilize the final outputs

  5. Output Projection: Project from d_model dimensions to vocabulary size, giving us scores (logits) for every possible next token

What are logits? The model outputs “logits”—raw, unnormalized scores for each token in the vocabulary. Higher scores mean the model thinks that token is more likely to come next. We can convert these to probabilities using softmax, then either pick the highest (greedy decoding) or sample from the distribution (for more creative generation).
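As a minimal, self-contained sketch (the logits tensor here is a random stand-in for what the model defined below produces), turning logits into a next-token choice looks like this:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 16, 1000)        # stand-in for model output: (batch, seq, vocab_size)
last_logits = logits[:, -1, :]           # scores for the token that comes next
probs = F.softmax(last_logits, dim=-1)   # normalize scores into probabilities

greedy_token = torch.argmax(probs, dim=-1)               # greedy decoding: most likely token
sampled_token = torch.multinomial(probs, num_samples=1)  # sampling: more varied generation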

import torch
import torch.nn as nn

# TokenEmbedding, PositionalEncoding, and TransformerBlock come from the
# earlier sections. The class name below is illustrative; see
# src/transformer/model.py for the full implementation.
class Transformer(nn.Module):
    def __init__(
        self, vocab_size, d_model=512, num_heads=8,
        num_layers=6, d_ff=2048, max_seq_len=5000, dropout=0.1
    ):
        super().__init__()
        # Token and positional embeddings
        self.token_embedding = TokenEmbedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(d_model)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def create_causal_mask(self, seq_len):
        # Lower-triangular boolean mask: position i may attend to positions 0..i.
        # (Assumed convention; check src/transformer/model.py for the exact form
        # the attention layers expect, e.g. boolean vs. additive -inf masks.)
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    def forward(self, x, mask=None):
        # Create causal mask if not provided
        if mask is None:
            mask = self.create_causal_mask(x.size(1)).to(x.device)
        # 1. Embed tokens and add positions
        x = self.token_embedding(x)   # (batch, seq) → (batch, seq, d_model)
        x = self.pos_encoding(x)
        # 2. Pass through all transformer blocks
        for block in self.blocks:
            x = block(x, mask=mask)   # (batch, seq, d_model) → (batch, seq, d_model)
        # 3. Final normalization and projection to vocabulary
        x = self.ln_f(x)
        logits = self.output_proj(x)  # (batch, seq, d_model) → (batch, seq, vocab_size)
        return logits
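
As a quick smoke test (class name and parameter values here are illustrative, assuming the component classes from the earlier sections are in scope):

model = Transformer(vocab_size=1000, d_model=128, num_heads=4,
                    num_layers=2, d_ff=512)
tokens = torch.randint(0, 1000, (2, 16))  # batch of 2 sequences, 16 token IDs each
logits = model(tokens)                    # causal mask is created automatically
print(logits.shape)                       # torch.Size([2, 16, 1000])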

See the full implementation: src/transformer/model.py