The Complete Transformer
The complete picture: We now assemble all of our components into a working decoder-only transformer (GPT-style). This is a complete language model that can be trained to predict the next token in a sequence.
What is “decoder-only”? The original transformer paper had both an encoder (for reading input) and decoder (for generating output), used for translation. Modern language models like GPT use only the decoder part, which is simpler and works great for text generation. The key difference is that decoder-only models use causal masking—they can only look at previous tokens, not future ones.
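To make causal masking concrete, here is a minimal sketch of how such a mask is typically built in PyTorch as a lower-triangular boolean matrix; the repository's own create_causal_mask helper (used in the forward pass below) may differ in detail:

```python
import torch

def create_causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular matrix: position i may attend only to positions 0..i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(create_causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```

Row i of the mask marks which positions token i is allowed to attend to, so a False entry tells attention to ignore that (future) position.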
How Data Flows Through the Model
- Token Embedding: Convert input token IDs (integers) to dense vectors
- Positional Encoding: Add position information to tell the model where each token is
- Transformer Blocks (×N): Stack multiple identical blocks (we use 6; GPT-3 uses 96). Each block refines the representations through attention and feed-forward processing
- Final LayerNorm: One last normalization to stabilize the final outputs
- Output Projection: Project from d_model dimensions to vocabulary size, giving us scores (logits) for every possible next token
What are logits? The model outputs “logits”—raw, unnormalized scores for each token in the vocabulary. Higher scores mean the model thinks that token is more likely to come next. We can convert these to probabilities using softmax, then either pick the highest (greedy decoding) or sample from the distribution (for more creative generation).
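To make that concrete, here is a minimal sketch (plain PyTorch, not from the repository) of turning the logits for one position into a next-token choice:

```python
import torch
import torch.nn.functional as F

# Raw scores for a toy 4-token vocabulary at a single position
logits = torch.tensor([2.0, 0.5, -1.0, 0.0])

probs = F.softmax(logits, dim=-1)                    # normalize scores into probabilities

greedy_token = torch.argmax(probs).item()            # greedy decoding: pick the highest-probability token
sampled_token = torch.multinomial(probs, 1).item()   # sampling: draw from the distribution for variety
```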
Implementation
```python
def __init__(
    self,
    vocab_size,
    d_model=512,
    num_heads=8,
    num_layers=6,
    d_ff=2048,
    max_seq_len=5000,
    dropout=0.1,
):
    super().__init__()

    # Token and positional embeddings
    self.token_embedding = TokenEmbedding(vocab_size, d_model)
    self.pos_encoding = PositionalEncoding(d_model, max_seq_len)

    # Stack of transformer blocks
    self.blocks = nn.ModuleList([
        TransformerBlock(d_model, num_heads, d_ff, dropout)
        for _ in range(num_layers)
    ])

    # Final layer norm and output projection
    self.ln_f = nn.LayerNorm(d_model)
    self.output_proj = nn.Linear(d_model, vocab_size)

def forward(self, x, mask=None):
    # Create causal mask if not provided
    if mask is None:
        mask = self.create_causal_mask(x.size(1)).to(x.device)

    # 1. Embed tokens and add positions
    x = self.token_embedding(x)      # (batch, seq) → (batch, seq, d_model)
    x = self.pos_encoding(x)

    # 2. Pass through all transformer blocks
    for block in self.blocks:
        x = block(x, mask=mask)      # (batch, seq, d_model) → (batch, seq, d_model)

    # 3. Final normalization and projection to vocabulary
    x = self.ln_f(x)
    logits = self.output_proj(x)     # (batch, seq, d_model) → (batch, seq, vocab_size)

    return logits
```
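As a quick sanity check on shapes, here is a usage sketch; the class name DecoderOnlyTransformer and the vocabulary size of 10,000 are illustrative assumptions, not taken from the repository (see the full code below for the actual class):

```python
import torch

# Hypothetical class name for illustration; the real class lives in
# src/transformer/model.py.
model = DecoderOnlyTransformer(vocab_size=10_000)   # defaults: d_model=512, 6 layers, 8 heads

tokens = torch.randint(0, 10_000, (2, 16))          # (batch=2, seq=16) of token IDs
logits = model(tokens)                              # causal mask is created internally
print(logits.shape)                                 # expected: torch.Size([2, 16, 10000])
```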
Full Code
See the full implementation: src/transformer/model.py