
Transformer Block

Bringing it all together: A transformer block combines all our components into one repeatable unit. The full transformer model is just many of these blocks stacked on top of each other (GPT-3 has 96 blocks!).

Each block contains four key components, wired together in the sketch after this list:

  • Multi-head attention: Communication layer—tokens gather information from other tokens
  • Feed-forward network: Computation layer—each token processes its gathered information
  • Layer normalization: Stabilizes training by normalizing activations (prevents them from growing too large or small)
  • Residual connections: “Skip connections” that create gradient highways for training deep networks
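
Before looking at the forward pass, here is a minimal sketch of how these four pieces might be declared in the block's constructor. The attribute names (attention, ffn, norm1, norm2, dropout1, dropout2) mirror the forward method shown further down, but the SelfAttention wrapper, the GELU feed-forward, and the constructor signature are illustrative assumptions rather than the repo's exact code (see src/transformer/block.py for the real implementation).

import torch.nn as nn

class SelfAttention(nn.Module):
    # Thin self-attention wrapper around PyTorch's nn.MultiheadAttention so it
    # can be called as attention(x, mask=mask), matching the forward pass below.
    # This stands in for the multi-head attention module built earlier.
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)

    def forward(self, x, mask=None):
        out, _ = self.mha(x, x, x, attn_mask=mask)
        return out

class TransformerBlock(nn.Module):
    # One repeatable unit: attention + FFN, each wrapped in Pre-LN and a residual.
    # forward() is shown below in the text.
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Communication: tokens exchange information
        self.attention = SelfAttention(d_model, num_heads, dropout)
        # Computation: each token processes what it gathered
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # One LayerNorm per sub-layer, applied before it (Pre-LN)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout on each sub-layer's output for regularization
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)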

We follow the Pre-LN (Pre-Layer Normalization) approach used in modern models like GPT-2 and GPT-3: layer normalization is applied before each sub-layer (attention or FFN) rather than after. This ordering makes training more stable, especially for very deep networks.
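
To see the difference concretely, here is a toy contrast of the two orderings. It is a minimal sketch: sublayer is our stand-in for either attention or the FFN, and the shapes are arbitrary.

import torch
import torch.nn as nn

d_model = 8
x = torch.randn(2, 4, d_model)            # (batch, sequence, d_model)
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)    # stand-in for attention or the FFN

pre_ln = x + sublayer(norm(x))            # Pre-LN: normalize inside the residual branch (GPT-2/3 style)
post_ln = norm(x + sublayer(x))           # Post-LN: normalize after the residual sum (original Transformer)
print(pre_ln.shape, post_ln.shape)        # both keep torch.Size([2, 4, 8])

With Pre-LN, the residual path itself is never normalized, so gradients flow straight through the additions; that is a big part of why deep stacks train more stably.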

Transformer block architecture

def forward(self, x, mask=None):
    # First sub-layer: multi-head attention with residual
    residual = x
    x = self.norm1(x)              # Pre-LN
    x = self.attention(x, mask=mask)
    x = self.dropout1(x)
    x = x + residual               # residual connection

    # Second sub-layer: feed-forward network with residual
    residual = x
    x = self.norm2(x)              # Pre-LN
    x = self.ffn(x)
    x = self.dropout2(x)
    x = x + residual               # residual connection
    return x

See the full implementation: src/transformer/block.py
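
A quick smoke test shows the property that makes stacking possible: the block maps a (batch, sequence, d_model) tensor to another tensor of exactly the same shape. This assumes a TransformerBlock assembled from the constructor sketch above plus the forward method shown; the class name and sizes are illustrative, not the repo's exact API.

import torch
import torch.nn as nn

block = TransformerBlock(d_model=64, num_heads=4, d_ff=256)   # class sketched above (assumed name)
x = torch.randn(2, 10, 64)                                    # (batch, sequence, d_model)
print(block(x).shape)                                         # torch.Size([2, 10, 64])

# Same shape in, same shape out, so blocks compose directly.
# A full model is just N of these in a row (GPT-3 stacks 96).
trunk = nn.Sequential(*[TransformerBlock(64, 4, 256) for _ in range(6)])
print(trunk(x).shape)                                         # still torch.Size([2, 10, 64])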