
From Noise to Images

So far, everything we’ve built has been about language—predicting tokens, following instructions, reasoning through problems. But transformers aren’t limited to text.

The same attention mechanism that revolutionized NLP (Natural Language Processing) has transformed computer vision. And nowhere is this more visible than in image generation.

The Trick

Every image you’ve ever seen from Stable Diffusion, Midjourney, or DALL-E started as pure random noise. A generative model learned to transform that noise into coherent images.

In this section, we’ll build one from scratch.

The Problem We’re Solving

Here’s the setup:

  • We have training data (real images)

  • We want to sample new images from the same distribution

  • But we don’t know the distribution explicitly—we only have examples

The strategy: learn a transformation from a simple distribution (Gaussian noise) to our complex data distribution. If we can learn this transformation, we can generate new samples by:

  1. Sample noise

  2. Apply our learned transformation

  3. Out comes a realistic image

Why Flow Matching?

Several approaches exist for generative modeling:

| Approach | Core Idea | Challenge |
| --- | --- | --- |
| GANs (Generative Adversarial Networks) | Generator fools discriminator | Training instability |
| VAEs (Variational Autoencoders) | Encode/decode through latent space | Blurry outputs |
| DDPM (Denoising Diffusion Probabilistic Models) | Gradually denoise over many steps | Slow sampling |
| Flow Matching | Learn straight paths from noise to data | Simple and fast |

Flow matching has become the preferred choice because:

  1. Simpler mathematics — no stochastic differential equations

  2. Faster sampling — straight paths require fewer steps

  3. State-of-the-art results — used in Stable Diffusion 3, Flux, and more

The Core Idea

Flow matching constructs a continuous path between noise and data:

$$x_t = (1-t) \cdot x_{\text{data}} + t \cdot x_{\text{noise}}$$
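Concretely, here's a minimal sketch of that interpolation in PyTorch (the function and variable names are illustrative, not taken from the notebooks). Differentiating the path with respect to $t$ gives a constant velocity, $x_{\text{noise}} - x_{\text{data}}$:

```python
import torch

def interpolate(x_data: torch.Tensor, x_noise: torch.Tensor, t: torch.Tensor):
    """Point on the straight path between data and noise at time t in [0, 1].

    Illustrative sketch only; names are not from the notebooks.
    """
    t = t.view(-1, 1, 1, 1)                # broadcast a per-sample t over (C, H, W)
    x_t = (1 - t) * x_data + t * x_noise   # the straight-line interpolant from the equation above
    velocity = x_noise - x_data            # dx_t/dt, constant along the whole path
    return x_t, velocity

# Toy usage on a batch of 8 fake 3x32x32 "images"
x_data = torch.rand(8, 3, 32, 32)
x_noise = torch.randn_like(x_data)
t = torch.rand(8)                          # a random time per sample
x_t, v_target = interpolate(x_data, x_noise, t)
```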

We train a neural network to predict the velocity along this path. Then to generate:

  1. Start with pure noise at $t=1$

  2. Follow the velocity field backward to $t=0$

  3. Arrive at a realistic image

The velocity is constant (straight lines!), which makes everything clean and efficient.
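As a preview of the sampling loop we'll build in the first notebook, here is a minimal Euler integrator. It assumes a trained `model(x_t, t)` that returns the predicted velocity; that interface is an assumption for this sketch, not the notebooks' exact API:

```python
import torch

@torch.no_grad()
def euler_sample(model, shape, num_steps=50, device="cpu"):
    """Integrate the learned velocity field from t=1 (noise) back to t=0 (data)."""
    x = torch.randn(shape, device=device)                 # start from pure noise at t = 1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        v = model(x, t)                                    # predicted dx/dt at time t (placeholder interface)
        x = x - dt * v                                     # one Euler step backward toward t = 0
    return x                                               # approximate samples from the data distribution

# images = euler_sample(trained_model, shape=(16, 3, 32, 32))
```

Because the paths are (close to) straight, a modest `num_steps` is often enough in practice.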

What We’ll Build

| Notebook | Topic | What You'll Learn |
| --- | --- | --- |
| Flow Matching | The basics | Linear interpolation, velocity fields, Euler sampling |
| Diffusion Transformer | DiT (Diffusion Transformer) architecture | Patchify, adaLN, transformers for images |
| Class Conditioning | Controlled generation | Classifier-free guidance |
| Text Conditioning | Text-to-image | CLIP (Contrastive Language-Image Pre-training) encoder, cross-attention |
| Latent Diffusion | Scaling up | VAE (Variational Autoencoder) compression, the Stable Diffusion approach |

By the end, you’ll understand how modern image generation works—from the mathematical foundations to the architectural choices that make it practical.

Prerequisites

This section assumes familiarity with:

  • PyTorch — tensors, modules, training loops

  • Transformers — attention, the basics from earlier sections

  • Basic probability — distributions, sampling

The math gets a bit more involved than in the language modeling sections (ODEs, or Ordinary Differential Equations, and flow equations), but we'll build intuition step by step.


This is the final section of the book. By the end, you’ll have built transformers for both language and vision—understanding not just how they work, but why the same architecture succeeds across such different domains.