What a load of noise: A beginner’s mathematical guide to AI diffusion models

Madeleine Hall is overwhelmed


We asked ChatGPT to create a header image for the text in this article and it gave us this.

These two little vowels have infiltrated daily conversation, and we’re not talking about a decent draw from the Scrabble bag. From geopolitical debates to everyday chitchat—AI is everywhere, and people won’t shut up about it.

Who can blame them? It’s in your pocket when you ask your phone for the weather, on your commute when GPS (shout out to Gladys from issue 15’s significant figures) finds the best route, and in your inbox when your email filters out spam. AI helps pick your playlist, suggests what to watch after a long day, and powers the chatbot answering questions when you shop online.

Most of the time, you don’t even notice it. It’s just there, making life a little smoother. But as AI gets smarter, people are asking bigger questions. Will it change the way we work? Is it sustainable? Can we trust it? Whether you’re excited, worried or disillusioned, one thing’s for sure: the conversation is just getting started.

I cannot keep up. The state of the art, the current highest level of performance achieved in a particular domain, is continuously shifting. New models are constantly being released, outperforming their predecessors as new techniques are discovered. Trying to chase and understand the ever-changing best feels like fighting a losing battle. But AI isn’t going anywhere. You could spend a lifetime learning, researching, using and arguing about it (increasingly many people are). So where do you start? I myself, as a trendy individual, am choosing to start by tackling one of the hottest AI tools of the season: diffusion models.

AI, NN, DL… anyone for Scrabble?

Under the umbrella term `artificial intelligence', which broadly means any intelligence exhibited by machines, a technique that’s shot to fame is neural networks. Neural networks (NNs) are a category of algorithms made up of layers of interconnected nodes. They are designed to recognise patterns and, once trained, can recognise those patterns in fresh data. A sub-category of these is deep learning algorithms, which are neural networks with many layers—typically more than three (I guess for computer scientists, $3$ counts as `many').

Deep learning (DL) underpins much of today’s trendiest AI. One of the most successful techniques is the use of large language models, specifically a generative (capable of producing new content) pre-trained (trained on a huge dataset before being fine-tuned for specific tasks and conditioned on user inputs) transformer (an architecture processing text as tokens and applying self-attention mechanisms to understand context). Slap a chatbot interface on top of your generative pre-trained transformer, along with some safety training and reinforcement learning, and bam—you’ve got yourself ChatGPT, which can generate text bountifully.

Other generative AI tools you may have played with are Dall-E 2 and Stable Diffusion, which generate images. These tools are powered by diffusion models, which work very differently from transformers like ChatGPT. Instead of predicting the next word in a sequence, diffusion models start with pure randomness and gradually refine it into a meaningful image. The great news, for the mathematically curious, is that at the heart of these tools are key ideas from probability theory and stochastic processes.

The process trio: forward, reverse, and sampling

Diffusion models have three main components: a forward process, which systematically adds noise to an image over multiple steps; a reverse process, where a neural network learns to gradually remove this noise; and a sampling process, where the trained model generates new images by starting from random noise and iteratively refining it.

Forward: adding noise

Let’s start with an initial image, $x_0$. This could be an image of a frog, or a cat, or a frog on a cat. Mathematically, we consider $x_0$ as a single sample from a broader distribution, $x_0 \sim q(x)$. This could be the distribution of cat images, frog images, or a blend of the two. Our ultimate goal is to model $q(x)$ so that we can generate new images from the same underlying distribution as the one we started with.

“You haven’t seen a cat around here, have you?”

To begin, we take our image and start corrupting it by adding a little noise. After the first step, the image with a bit of noise added is given by:
\[
x_1 = \sqrt{1-\beta_1}x_0 + \sqrt{\beta_1}\varepsilon_1,
\]
where $ \varepsilon_1 $ is randomly drawn from a standard Gaussian distribution, $ \varepsilon_1 \sim \mathcal{N}(0, 1) $.

We repeat this multiple times, adding more noise at each step, and define $\alpha_t := 1-\beta_t$ to simplify notation downstream:
\[
x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t}\varepsilon_t.
\]
By the time we’ve done this many times, say $ T $ steps in total, the image $x_T $ is indistinguishable from random noise.
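A handy consequence of using Gaussian noise is that the steps compose neatly. Chaining the first two steps, for instance, gives
\[
x_2 = \sqrt{\alpha_2}x_1 + \sqrt{\beta_2}\varepsilon_2 = \sqrt{\alpha_1\alpha_2}\,x_0 + \sqrt{\alpha_2\beta_1}\,\varepsilon_1 + \sqrt{\beta_2}\,\varepsilon_2,
\]
and because the two independent noise terms have variances adding up to $\alpha_2\beta_1 + \beta_2 = 1 - \alpha_1\alpha_2$, they combine into a single Gaussian:
\[
x_2 = \sqrt{\alpha_1\alpha_2}\,x_0 + \sqrt{1-\alpha_1\alpha_2}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0,1).
\]
Repeating this argument lets us jump from $x_0$ straight to any $x_t$ in one go, a fact that comes in handy during training.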

After $T$ steps, our cute travelling frogs have become indistinguishable from noise.
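To make this concrete, here’s a minimal numpy sketch of the forward process. The image, step count and schedule below are all made up for illustration (real models work with much larger images and carefully tuned schedules):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                            # total number of noising steps (illustrative)
beta = np.linspace(1e-4, 0.02, T)   # a made-up linear noise schedule
alpha = 1.0 - beta                  # how much of the previous step's image to keep

x = rng.uniform(0.0, 1.0, size=(64, 64))   # stand-in for our starting image x_0

# one forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * epsilon_t
for t in range(T):
    eps = rng.standard_normal(x.shape)      # fresh Gaussian noise, eps_t ~ N(0, 1)
    x = np.sqrt(alpha[t]) * x + np.sqrt(beta[t]) * eps

# after T steps, x is statistically indistinguishable from pure noise
print(x.mean(), x.std())   # roughly 0 and 1
```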

The constants $\beta_1, \ldots, \beta_T \in (0,1)$ are fixed before training but vary across the time steps, which allows for fine control over the process. Smaller values of $\beta_t$ give a gentler nudge towards chaos, since more of the image from the previous step is kept.

In terms of $q(x)$, because the noise terms added at each step are IID samples drawn from $\mathcal{N}(0,1)$, we can write:
\[q(x_t|x_{t-1}) = \mathcal{N}(x_t | \sqrt{\alpha_t}x_{t-1}, \beta_t{I}).\]
Since each step only depends on the previous one, this forms what’s known as a Markov process. The function $q(x_t | x_{t-1})$ is called the transition kernel, as it defines the probability of moving from one step to the next.

The hyperparameters $\beta_t$ are often called the noise schedule. These are the only parameters we provide to the forward process, as we compute each $\alpha_t$ directly from $\beta_t$. In a sense, whilst $\beta_t$ represents how much noise we’re adding at each step, $\alpha_t$ represents how much of the image from the previous step to keep.

If $\beta_t$ is too large in the early steps, the image quickly becomes dominated by noise, making it difficult for the neural network in the reverse process to learn meaningful structure. On the other hand, if $\beta_t$ is too small, the corruption process is too slow, requiring more steps and more computational resources.

A well-chosen noise schedule ensures that each step is neither drastic nor trivial. A typical strategy is to use a linear or quadratic increase in $\beta_t$, so that early steps retain more information while later steps gradually transition to full noise.
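As a quick sanity check on a candidate schedule, we can track how much of the original signal survives to step $t$, which is governed by the running product $\alpha_1\alpha_2\cdots\alpha_t$ (christened $\overline{\alpha}_t$ in the next section). The numbers below are purely illustrative:

```python
import numpy as np

T = 1000
beta_linear = np.linspace(1e-4, 0.02, T)                            # even ramp
beta_quadratic = np.linspace(np.sqrt(1e-4), np.sqrt(0.02), T) ** 2  # slower start

for name, beta in [("linear", beta_linear), ("quadratic", beta_quadratic)]:
    alpha_bar = np.cumprod(1.0 - beta)         # fraction of signal surviving to step t
    t_star = int(np.argmax(alpha_bar < 0.01))  # first step with under 1% signal left
    print(f"{name}: signal drops below 1% at step {t_star}")
```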

Reverse: removing noise

If the forward process was about drowning our image in noise, the reverse process is about fishing it back out. However, because the noise is randomly sampled at each step, there isn’t a single deterministic path to simply rewind the process. Fortunately, we have neural networks up our sleeve. We can train a neural network to approximate the most likely reverse step, predicting how much noise was added at each stage.

In the forward process, we defined the transition kernel $q(x_t | x_{t-1})$. Ideally, we would simply invert this to get $q(x_{t-1} | x_{t})$, the probability of undoing a step of noise. However, calculating this directly is tricky because it depends on the entire data distribution.

Instead, we approximate it using a learnable reverse transition kernel, $p_\theta(x_{t-1}| x_t)$, parametrised by a neural network: \[p_\theta(x_{t-1}| x_t) = \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \sigma_t^2).\]

The noise added at each step of the forward process is Gaussian and small, so it’s reasonable to assume each reverse step is Gaussian too. This means that, rather than inverting the forward process exactly, we train a neural network to estimate the most likely denoised image $\mu_\theta(x_t, t)$ at each step.

It’s common to assume the variance $\sigma_t^2$ is fixed, allowing us to focus on learning the mean $\mu_\theta(x_t, t)$. Therefore, the reverse step is given by:
\[
x_{t-1} = \mu_\theta(x_t, t) + \sigma_t\varepsilon,
\]
where $\sigma_t$ is the (fixed or learned) standard deviation at step $t$; a common simple choice is $\sigma_t := \sqrt{\beta_t}$, matching the noise added in the corresponding forward step. It’s also useful to define $\overline{\alpha}_t := \alpha_1\alpha_2\cdots\alpha_t$ (the product of all the $\alpha$’s up to and including $t$), which measures how much of the original image survives to step $t$.

To train our model, we optimise a loss function that rewards the network for correctly predicting the noise $\varepsilon_t$ that was added at each step. The training objective is:
\[
L = \mathbb{E}_q \left[ \| \varepsilon_t - \varepsilon_\theta(x_t, t) \|^2 \right],
\]
where $\varepsilon_\theta(x_t, t) $ is the neural network’s prediction of the noise. By minimising this loss function, we ensure our model can accurately peel back the noise layer by layer during the reverse process.

In short, the reverse process trains a neural network that takes inputs $x_t$ and $t$ to learn the best $\theta$ such that its output $\varepsilon_\theta(x_t, t)$ best reflects the amount of noise added to our image at step $t$.
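Putting the pieces together, one training step might look like the sketch below. Here `predict_noise` is a stand-in for whatever neural network you like (a real implementation would use a deep network and an optimiser); the rest follows from the formulas above, using the composition trick to jump straight from $x_0$ to $x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)   # alpha_bar[t]: product of the first t+1 alphas

def predict_noise(x_t, t):
    """Stand-in for the neural network's noise prediction, eps_theta(x_t, t)."""
    return np.zeros_like(x_t)        # an untrained guess; a real model learns this

def training_loss(x0):
    t = rng.integers(0, T)                     # pick a random step to train on
    eps = rng.standard_normal(x0.shape)        # the noise we are about to add
    # jump straight from x_0 to x_t using the composition trick
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # squared error between the true noise and the network's prediction
    return np.mean((eps - predict_noise(x_t, t)) ** 2)

x0 = rng.uniform(0.0, 1.0, size=(64, 64))      # stand-in training image
print(training_loss(x0))
```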

Sampling: creating new images

Through the reverse process, we’ve learned the size and structure of the noise added to our image at each step, and once trained, we can use this to generate brand-new images. We begin with a random noise sample, $x_T \sim \mathcal{N}(0, 1)$, and iteratively apply the learned denoising function. Step by step, the noise is reduced, revealing an image that resembles samples from $q(x)$—be that a cat, or a frog, or a blend of the two!

After training, we can start with pure noise and apply a sequence of denoising functions to add back cat- and frogness. Perfect. Image: Adobe Firefly

This iterative denoising process is how models like Stable Diffusion and Dall-E 2 generate high-quality images, starting from pure noise.
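In code, sampling is just the reverse-step formula applied $T$ times. The sketch below uses one standard (DDPM-style) way of computing the mean from the predicted noise, $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\varepsilon_\theta(x_t, t)\big)$, with `predict_noise` again standing in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def predict_noise(x_t, t):
    """Stand-in for the *trained* network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

x = rng.standard_normal((64, 64))        # start from pure noise, x_T ~ N(0, 1)
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # learned mean mu_theta, computed from the predicted noise (DDPM parametrisation)
    mu = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0   # no noise on the final step
    x = mu + np.sqrt(beta[t]) * z        # sigma_t = sqrt(beta_t), a common fixed choice
# with a trained network, x would now be a brand-new sample resembling q(x)
```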

It’s just a little noise

While the nuts and bolts used in AI will continue to evolve, these underlying principles will remain fundamental. By diving into some of the maths of diffusion models, I hope this gives you a solid foundation to follow (and maybe even contribute to) the next breakthroughs in AI. And if you still feel overwhelmed, well, at least now you know it’s just a little noise.

Madeleine Hall is a mathematical consultant at the Smith Institute, based in Oxford. She likes writing, open water swimming, the Oxford comma, and tHiS mEmE. Her PhD research was on optimal swimming for microorganisms. She has found none of her results of any use in the lido.
