So far, everything we’ve built has been about language—predicting tokens, following instructions, reasoning through problems. But transformers aren’t limited to text.
The same attention mechanism that revolutionized NLP (Natural Language Processing) has transformed computer vision. And nowhere is this more visible than in image generation.
## The Trick
Every image you’ve ever seen from Stable Diffusion, Midjourney, or DALL-E started as pure random noise. A generative model learned to transform that noise into coherent images.
In this section, we’ll build one from scratch.
## The Problem We’re Solving
Here’s the setup:
- We have training data (real images)
- We want to sample new images from the same distribution
- But we don’t know the distribution explicitly—we only have examples
The strategy: learn a transformation from a simple distribution (Gaussian noise) to our complex data distribution. With that transformation in hand, generating a new sample takes three steps (sketched in code below):

1. Sample noise
2. Apply the learned transformation
3. Out comes a realistic image
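To make the shape of the solution concrete before we earn it, here is a minimal sketch of that pipeline. The `generator` below is a placeholder (an untrained CNN); the rest of this section is about what to put in its place and how to train it.

```python
import torch
import torch.nn as nn

# Placeholder for the learned transformation; by the end of this section it
# will be a Diffusion Transformer trained with flow matching. Untrained, it
# turns noise into more noise, but the interface is already this simple.
generator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)

noise = torch.randn(1, 3, 32, 32)   # 1. sample noise
with torch.no_grad():
    image = generator(noise)        # 2. apply the learned transformation
print(image.shape)                  # 3. a (would-be) 3x32x32 image
```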
## Why Flow Matching?
Several approaches exist for generative modeling:
| Approach | Core Idea | Challenge |
|---|---|---|
| GANs (Generative Adversarial Networks) | Generator fools discriminator | Training instability |
| VAEs (Variational Autoencoders) | Encode/decode through latent space | Blurry outputs |
| DDPM (Denoising Diffusion Probabilistic Models) | Gradually denoise over many steps | Slow sampling |
| Flow Matching | Learn straight paths from noise to data | Simple and fast |
Flow matching has become the preferred choice because:
- Simpler mathematics — no stochastic differential equations
- Faster sampling — straight paths require fewer steps
- State-of-the-art results — used in Stable Diffusion 3, Flux, and more
## The Core Idea
Flow matching constructs a continuous path between noise and data:

$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1],$$

where $x_0$ is a real image and $x_1 \sim \mathcal{N}(0, I)$ is pure noise. We train a neural network to predict the velocity $\frac{dx_t}{dt} = x_1 - x_0$ along this path. Then to generate:

1. Start with pure noise at $t = 1$
2. Follow the velocity field backward to $t = 0$
3. Arrive at a realistic image

The velocity is constant (straight lines!), which makes everything clean and efficient.
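Here is a minimal sketch of both halves in PyTorch. It assumes a hypothetical `model(x_t, t)` that takes a noisy batch and per-example times and returns a predicted velocity; the real architecture comes in the DiT notebook.

```python
import torch

def flow_matching_loss(model, x0):
    """Training objective: regress the constant velocity x1 - x0.

    x0: a batch of real images, shape (B, C, H, W).
    """
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                  # pure-noise endpoint of the path
    t = torch.rand(b, device=x0.device)        # random time in [0, 1]
    t_ = t.view(b, 1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1               # point on the straight path
    v_target = x1 - x0                         # constant velocity of that path
    return ((model(xt, t) - v_target) ** 2).mean()

@torch.no_grad()
def euler_sample(model, shape, steps=50, device="cpu"):
    """Integrate the velocity field backward from t = 1 (noise) to t = 0 (data)."""
    x = torch.randn(shape, device=device)      # start at pure noise
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps, device=device)
        x = x - dt * model(x, t)               # one Euler step toward t = 0
    return x
```

Note the payoff of straight paths: the regression target does not depend on where along the path we sample, and the sampler is nothing more than a loop of subtractions.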
## What We’ll Build
| Notebook | Topic | What You’ll Learn |
|---|---|---|
| Flow Matching | The basics | Linear interpolation, velocity fields, Euler sampling |
| Diffusion Transformer | DiT (Diffusion Transformer) architecture | Patchify, adaLN, transformers for images |
| Class Conditioning | Controlled generation | Classifier-free guidance |
| Text Conditioning | Text-to-image | CLIP (Contrastive Language-Image Pre-training) encoder, cross-attention |
| Latent Diffusion | Scaling up | VAE (Variational Autoencoder) compression, the Stable Diffusion approach |
By the end, you’ll understand how modern image generation works—from the mathematical foundations to the architectural choices that make it practical.
## Prerequisites
This section assumes familiarity with:
- PyTorch — tensors, modules, training loops
- Transformers — attention, the basics from earlier sections
- Basic probability — distributions, sampling
The math gets a bit more involved than in the language modeling sections (ODEs, or Ordinary Differential Equations, and flow equations), but we’ll build intuition step by step.
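Concretely, the only ODE involved is the one defined by the learned velocity field, and sampling means solving it numerically:

$$\frac{dx_t}{dt} = v_\theta(x_t, t)$$

A single Euler step, $x_{t - \Delta t} = x_t - \Delta t \, v_\theta(x_t, t)$, is the entire numerical method we’ll need.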
This is the final section of the book. By the end, you’ll have built transformers for both language and vision—understanding not just how they work, but why the same architecture succeeds across such different domains.