Exploring Diffusion Networks

Diffusion networks generate images by learning to reverse a noising process. The forward process corrupts an image with Gaussian noise over many timesteps until it becomes pure noise; the reverse process, implemented by a neural network, is trained to predict and remove that noise step by step. This is the architectural core of DALL-E 2 and Stable Diffusion, and it is what makes diffusion training fundamentally more stable than the adversarial loop used by GANs, at the cost of slower inference.

In what follows we walk through the forward process, the parametrised backward process, the U-Net that does the heavy lifting, the timestep encoding trick, and the loss term that makes the whole thing trainable. We close with how this architecture compares with GANs in practice — the trade-off most teams actually need to make.

What is a diffusion model?

Diffusion models are probabilistic generative models that produce new, high-quality samples resembling the training distribution. They are useful for denoising and data generation because they can be trained to preserve underlying structure while removing or adding controlled noise.

A diffusion model is best understood as a latent-variable model, where the latent space is reached not through an encoder network (as in a VAE) but through a Markov chain of T timesteps that incrementally add Gaussian noise to the image. The Markov property — each step depends only on the previous one — is what makes the math tractable.

Two processes sit at the heart of the model:

Forward process. Starts with the original image x₀. At every step, a controlled amount of Gaussian noise is added, until at step T the image is indistinguishable from pure noise.
Parametrised backward process. A neural network is trained to predict the noise added between successive steps. If the network learns this well, you can start from pure noise and iteratively subtract the predicted noise to walk backward to a clean x₀ — that is, generate a new image.

Unlike VAEs, the input image and the latent variable share the same dimensions. That property is why a U-Net architecture — same input and output spatial shape — is the natural choice for the noise predictor.

The forward diffusion process

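The forward process adds noise to the image step by step. Writing q for the forward distribution, as in Ho et al., the whole chain factorises as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

Each subsequent sample depends only on its predecessor. The Gaussian noise added between steps is parameterised by a mean and a fixed variance:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$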

βₜ is the variance schedule and controls how much noise enters at each timestep. In the usual schedules βₜ grows with t, so the 1 − βₜ term under the square root in the mean shrinks and the running mean approaches zero. I is the identity matrix, so the noise is isotropic: every pixel and channel receives independent noise with the same variance βₜ. Some variants of diffusion networks learn the variance of the reverse step as well; for clarity we stick with the fixed-variance formulation where only the mean is learned.

Choosing βₜ is called noise scheduling and matters more than it looks. Set it wrong and the image at step T either still carries usable signal (the model never learns to invert pure noise) or collapses to noise too early (training signal disappears). Common schedule families include linear, cosine, cosine (logSNR), sigmoid, sigmoid (logSNR), and quadratic.
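To make the failure modes above concrete, here is a minimal sketch of two schedule families in plain Python. The linear endpoints (1e-4 and 0.02) are the defaults reported by Ho et al. for T = 1000; the cosine form follows the one proposed by Nichol and Dhariwal, which is not part of this article's reference list. The function names and the `remaining_signal` helper are illustrative assumptions.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Betas rising linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def cosine_beta_schedule(num_steps, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for the cumulative signal level."""
    def signal_level(t):
        return math.cos(((t / num_steps) + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for t in range(num_steps):
        beta = 1 - signal_level(t + 1) / signal_level(t)
        betas.append(min(beta, max_beta))  # clip so the last step stays below 1
    return betas


def remaining_signal(betas):
    """Product of (1 - beta): the share of signal variance left at step T."""
    level = 1.0
    for beta in betas:
        level *= 1 - beta
    return level
```

Calling `remaining_signal` on a candidate schedule is a quick sanity check: a value that is not close to zero means the image at step T still carries usable signal.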

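A useful adjustment is the input scaling factor b (Chen, 2023), a constant that multiplies x₀ before noise is added. Writing ᾱₜ for the cumulative product of (1 − βₛ) up to step t, the noised sample becomes:

$$x_t = \sqrt{\bar\alpha_t}\; b\, x_0 + \sqrt{1-\bar\alpha_t}\;\epsilon$$

Reducing b lowers the signal-to-noise ratio at every timestep without changing the schedule itself.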

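Although the forward process is described iteratively, a sum of independent Gaussians is itself Gaussian, so you can sample xₜ at any timestep directly from x₀ without walking through every prior step:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$

Equivalently, via the reparameterisation trick:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This uses the cumulative schedule ᾱₜ, precomputed from αₜ = 1 − βₜ as:

$$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$$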

This shortcut is what makes training tractable: instead of unrolling T steps, you pick a random t per minibatch, jump directly to xₜ, and train the noise predictor on that single noisy sample.
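The closed-form jump can be sketched in a few lines of plain Python. This is a sketch under assumptions: the image is treated as a flat list of floats, timesteps are 0-based, and the names `cumulative_alphas` and `q_sample` are illustrative. A real implementation would use an array library.

```python
import math
import random


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s = 0..t (0-based steps)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, noise=None, rng=random):
    """Jump from the clean image x0 straight to the noisy x_t."""
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * pixel + noise_scale * eps
           for pixel, eps in zip(x0, noise)]
    return x_t, noise  # the noise is returned because it is the training target
```

Returning the sampled noise alongside xₜ is a deliberate choice here, since the noise predictor is trained against exactly that value.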

The backward denoising process

The backward step is used for two distinct things — training and sampling.

During training, we don’t iterate over every timestep. We pick a random t, sample noise ϵ, use the forward shortcut to produce xₜ, feed xₜ into the network, and ask it to predict ϵ. The loss is the L2 distance between the predicted and the true noise.
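The training procedure just described can be sketched as a single loss evaluation. This is a sketch under assumptions: `predict_noise` stands in for the U-Net and is a hypothetical callable taking `(x_t, t)`, images are flat lists of floats, timesteps are 0-based, and no gradient step is shown.

```python
import math
import random


def training_loss(x0, alpha_bars, predict_noise, rng):
    """One Monte Carlo estimate of the noise-prediction loss for one image."""
    t = rng.randrange(len(alpha_bars))           # pick a random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # sample the true noise
    a = alpha_bars[t]
    # Forward shortcut: build x_t directly from x0 and the sampled noise.
    x_t = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e for p, e in zip(x0, eps)]
    eps_hat = predict_noise(x_t, t)              # network forward pass
    # Mean squared error between predicted and true noise.
    return sum((eh - e) ** 2 for eh, e in zip(eps_hat, eps)) / len(eps)
```

In a real training loop this value would be averaged over a minibatch and backpropagated through the network; the structure of the computation stays the same.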

During sampling, that is, when generating a new image, we do iterate. We start from x_T (pure Gaussian noise). At each step we feed the current xₜ to the trained network, predict the noise, subtract a calibrated portion to obtain xₜ₋₁, and repeat all the way down to x₀. This is also the reason diffusion inference is slow compared with GANs: a single forward pass through a GAN generator emits a finished image, while diffusion requires hundreds or thousands of network evaluations per image.

Architecture — why U-Net

A U-Net is well suited as the noise predictor because input and output share spatial dimensions. U-Nets have been workhorses in image segmentation, and conditional GAN variants have long used them for synthesis too.

The defining feature of a U-Net is its hierarchical structure: the input passes through a stack of down-sampling layers, losing spatial resolution while gaining feature channels, reaches a bottleneck, then climbs back up through symmetric up-sampling layers. Skip connections at each matching level carry the encoder's feature maps across to the decoder, where they are concatenated with the up-sampled features. More recent diffusion U-Nets also include attention layers at lower spatial resolutions, where the model can afford the quadratic cost and benefits most from global context.
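The shape bookkeeping of such a network can be sketched without any deep-learning library. The sketch below assumes a common but not universal layout: channels double and resolution halves at each level, up-sampling keeps the channel count, and skips are concatenated along the channel axis. The function name and return format are illustrative.

```python
def unet_shapes(image_size, base_channels, num_levels):
    """Trace (channels, resolution) through a symmetric U-Net.

    Returns the encoder levels, the bottleneck, and for each decoder level a
    tuple (channels after skip concatenation, output channels, resolution).
    """
    encoder = []
    channels, size = base_channels, image_size
    for _ in range(num_levels):
        encoder.append((channels, size))
        channels, size = channels * 2, size // 2  # down-sample, widen
    bottleneck = (channels, size)

    decoder = []
    incoming = channels
    for skip_channels, skip_size in reversed(encoder):
        # Up-sample to the skip's resolution, then concatenate along channels.
        concat_channels = incoming + skip_channels
        decoder.append((concat_channels, skip_channels, skip_size))
        incoming = skip_channels
    return encoder, bottleneck, decoder
```

For a 32×32 input with 64 base channels and three levels, the last decoder level ends at resolution 32 again, which is the property that lets the output be read as per-pixel noise.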

The math of the reverse pass

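The reverse process is described by:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$$

We start from pure noise with zero mean and unit variance. The model learns the conditional probability density of the previous timestep given the current one:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

The mean of this density p_θ is defined through the predicted Gaussian noise. To get the mean of xₜ₋₁ we remove the predicted noise from xₜ, with αₜ = 1 − βₜ and ᾱₜ the cumulative product of the αₛ up to step t:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$

The full update for xₜ₋₁ is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$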

ϵθ is the U-Net’s output, an estimate of the ϵ that was used in the forward process to construct xₜ. The σₜz term, where z is standard Gaussian noise, injects a small amount of fresh randomness into each update. The resulting sampler resembles Langevin dynamics, a sampling technique that originated in statistical physics, and the injected noise keeps the chain exploring the distribution instead of collapsing onto a single mode.
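The update rule and the sampling loop around it can be sketched as follows. This is a sketch under assumptions: σₜ² = βₜ (one of the two fixed choices discussed by Ho et al.), 0-based timesteps, flat lists of floats as images, and a hypothetical `predict_noise(x_t, t)` callable standing in for the trained U-Net. No fresh noise is added on the final step, as in the paper's sampling algorithm.

```python
import math
import random


def reverse_step(x_t, t, betas, alpha_bars, predict_noise, rng):
    """One reverse update x_t -> x_{t-1} with sigma_t^2 = beta_t."""
    beta = betas[t]
    alpha = 1.0 - beta
    eps_hat = predict_noise(x_t, t)
    noise_coef = beta / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - noise_coef * e) / math.sqrt(alpha)
            for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # final step: return the mean without fresh noise
    sigma = math.sqrt(beta)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def generate(num_pixels, betas, predict_noise, rng):
    """Start from pure Gaussian noise and walk the chain back to x_0."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]
    for t in reversed(range(len(betas))):
        x = reverse_step(x, t, betas, alpha_bars, predict_noise, rng)
    return x
```

The loop calls `predict_noise` once per timestep, which is where the inference cost discussed earlier comes from.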

Timestep encoding

Because the same U-Net weights are shared across all T timesteps, the network needs to know which step it is currently denoising. The standard solution is positional embedding — assigning a unique vector to each timestep index.

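For the k-th position in a sequence of length L, the encoding is:

$$P(k, 2i) = \sin\!\left(\frac{k}{n^{2i/d}}\right), \qquad P(k, 2i+1) = \cos\!\left(\frac{k}{n^{2i/d}}\right)$$

With the variables read as:

k: position in the input sequence, 0 ≤ k < L (for diffusion, the timestep index)
d: dimension of the embedding space
P(k, j): the position function, giving entry j of the encoding for position k
n: an arbitrary scalar, usually 10000
i: index of the sine-cosine column pair in the encoding matrix, 0 ≤ i < d/2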

Even columns use sine, odd columns use cosine. The properties that make this useful are simple:

Both sine and cosine output values in [−1, 1], so the embedding is naturally normalised.
Each position gets a unique vector — different k means a different sinusoidal pattern.
The encoding of position k + Δ is a fixed linear transformation (a rotation of each sine-cosine pair) of the encoding of position k, so relative offsets are easy for a network to recover. This is the property NLP transformers rely on.

Plotted as a matrix with k on the y-axis and i on the x-axis:

Each row of this matrix is the embedding vector for one timestep.
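The encoding matrix can be built directly from the definition. The sketch below uses plain Python, assumes an even embedding dimension d, and uses illustrative function names.

```python
import math


def timestep_embedding(k, d, n=10000.0):
    """Sinusoidal encoding of position k as a list of d floats (d must be even)."""
    if d % 2 != 0:
        raise ValueError("embedding dimension d must be even")
    row = []
    for i in range(d // 2):
        angle = k / (n ** (2 * i / d))
        row.append(math.sin(angle))  # even column 2i
        row.append(math.cos(angle))  # odd column 2i + 1
    return row


def embedding_matrix(num_steps, d, n=10000.0):
    """One row per timestep, one column per embedding dimension."""
    return [timestep_embedding(k, d, n) for k in range(num_steps)]
```

For example, `embedding_matrix(1000, 128)` gives a 1000 × 128 table that can be computed once and indexed by timestep during training and sampling.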

Inside the U-Net, for any given timestep we look up the embedding vector, push it through a dense layer to match the feature depth of the current image tensor, broadcast it across the spatial dimensions, and add it to the feature maps at every up- and down-sampling block.
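The project-broadcast-add step can be sketched in plain Python. The layout is an assumption for illustration: feature maps are nested lists shaped [C][H][W], the dense layer is a weight matrix shaped [C][d] plus a bias of length C, and no activation function is applied. Real implementations typically add a non-linearity and use tensor broadcasting.

```python
def add_timestep_embedding(feature_maps, embedding, weight, bias):
    """Project the embedding to C channels and add it at every spatial location."""
    # Dense layer: one scalar shift per feature channel.
    projected = [
        sum(w * e for w, e in zip(weight_row, embedding)) + b
        for weight_row, b in zip(weight, bias)
    ]
    # Broadcast each channel's shift over that channel's H x W grid.
    return [
        [[value + shift for value in line] for line in channel]
        for channel, shift in zip(feature_maps, projected)
    ]
```

Because the shift is constant across height and width, the timestep information changes what each channel responds to without disturbing the spatial layout of the features.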

Loss metric

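The original derivation in Ho et al. uses the variational lower bound, mirroring the VAE objective. The first loss term they derive is in terms of the true and predicted means of xₜ₋₁:

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$

Here μ̃ₜ is the mean of the true posterior q(xₜ₋₁ | xₜ, x₀) and C collects the terms that do not depend on θ. Reparametrising the Gaussian term so that the model predicts the noise ϵ directly from xₜ is cleaner, since xₜ is what we already have at training time. With αₜ = 1 − βₜ and ᾱₜ the cumulative product of the αₛ up to step t:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$

Substituting back:

$$L_{t-1} - C = \mathbb{E}_{x_0,\epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right]$$

Empirically, the paper found that dropping the per-timestep weighting term and using a simplified objective worked better:

$$L_{\text{simple}} = \mathbb{E}_{t,x_0,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right]$$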

In other words: train the U-Net by minimising the squared error between the predicted noise ϵθ and the actual noise ϵ that was sampled when generating xₜ. The result looks deceptively simple — the derivation behind it is not. For a step-by-step treatment, Lilian Weng’s diffusion-model write-up remains one of the cleanest references.
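To see what dropping the per-timestep weight changes, the weight from the variational bound can be computed for a given schedule. The sketch below assumes σₜ² = βₜ, in which case the weight βₜ² / (2σₜ² αₜ (1 − ᾱₜ)) reduces to βₜ / (2 αₜ (1 − ᾱₜ)). The function name is illustrative.

```python
def vlb_noise_weights(betas):
    """Per-timestep weight on the squared noise error, assuming sigma_t^2 = beta_t."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        weights.append(beta / (2.0 * alpha * (1.0 - alpha_bar)))
    return weights
```

With the linear schedule from 1e-4 to 0.02 over 1000 steps, the weight is roughly 0.5 at the first step and roughly 0.01 at the last. Setting every weight to 1 therefore reduces the relative emphasis on the low-noise steps, which matches the interpretation Ho et al. give for why the simplified objective helps.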

Diffusion vs GAN — when each architecture earns its place

Property	Diffusion model	GAN
Training stability	Stable; single L2 loss, no min–max game	Notoriously unstable; mode collapse and adversarial oscillation are common failure modes
Inference cost	Hundreds to thousands of network passes per image	Single forward pass — orders of magnitude faster
Sample diversity	High; the likelihood-based objective penalises dropping modes of the data	Prone to mode collapse, a failure pattern reported widely in published GAN training work
Controllability	Strong via classifier-free guidance and conditioning at every denoising step	Conditioning works but is coupled to the generator architecture
Data requirements	Generally larger; longer training schedules	Can succeed on smaller datasets when training converges

This is the trade-off most teams actually face. If you need real-time generation (game rendering, video synthesis at frame rate, latency-bound interactive tools), the diffusion inference cost is prohibitive without distillation tricks, and a well-trained GAN — or a distilled diffusion model that emits in 1–4 steps — is the right answer. If you need high sample diversity, stable training, and fine-grained conditioning (text-to-image, inpainting, controllable synthesis), diffusion wins on every axis except wall-clock.

For a deeper look at the architectural choices behind text-to-image diffusion specifically, see understanding generative AI and stable diffusion models. The schedule decisions that determine whether training converges at all are covered in the diffusion forward process and noise schedule.

Closing

Diffusion networks are the right tool when training stability and sample diversity are non-negotiable, and they are the wrong tool when inference latency is. At TechnoLynx we treat the GAN-versus-diffusion choice as a deployment-shape decision, not an architectural preference — the training cost, the serving cost, and the conditioning surface together determine which architecture earns its place on a given project. The next architectural question is usually distillation: how few sampling steps can we tolerate before quality degrades past the use case’s threshold.

References

[1] Chen, T. (2023). On the Importance of Noise Scheduling for Diffusion Models
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models
[3] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., et al. (2021). Zero-Shot Text-to-Image Generation
[4] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models
[5] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation
[6] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., et al. (2022). Diffusion Models: A Comprehensive Survey of Methods and Applications