WeeBytes
How Diffusion Models Generate Images: From Noise to Coherent Pictures
Intermediate · AI & ML · Creative AI · Knowledge

Generative image models like Stable Diffusion, DALL-E, and Midjourney don't paint — they denoise. The model learns to reverse a gradual noise-adding process, starting from pure random noise and iteratively refining it into a coherent image guided by a text prompt. The mechanism is surprisingly elegant.

Most people assume image generation models work by somehow 'drawing' based on descriptions. The actual mechanism is counterintuitive and elegant. Diffusion models are trained on a simple principle: take a real image, add a tiny bit of noise, then ask a neural network to predict and remove that noise. Repeat this across increasing noise levels, and eventually you have a model that knows how to denoise images at any noise level, from slightly noisy to pure random noise.

To generate new images, run this process in reverse. Start with pure random noise: pixels that look like TV static. Apply the trained denoising network, and the output is slightly less noisy. Apply it again. And again. After 20–50 iterations, the noise has been transformed into a coherent image.

Text guidance works by conditioning the denoising network on the text embedding: at each step, the model denoises toward images that match the text description. Classifier-free guidance amplifies this conditioning by comparing conditional and unconditional predictions and pushing the result toward the conditional one.

Modern refinements include latent diffusion (operate in a compressed latent space for efficiency), flow matching (faster training and inference), and control networks like ControlNet (use additional conditioning such as sketches, depth maps, or poses). The same fundamental mechanism now powers video models (add a temporal dimension), 3D models (operate in 3D space), and even some text models. Diffusion is not just a technique: it is a fundamentally different approach to generation than the autoregressive token-by-token generation used in language models.
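The training principle above fits in a few lines. This is a toy NumPy sketch in the DDPM style, not any particular library's API: the linear noise schedule, the closed-form forward step, and the stand-in `model` function are assumptions chosen for illustration.

```python
import numpy as np

# Toy DDPM-style forward process and training objective (illustrative sketch;
# a real system would use a neural network and a tuned schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal fraction at step t

def add_noise(x0, t, eps):
    """Forward process q(x_t | x_0): blend clean image with Gaussian noise."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def training_loss(model, x0, rng):
    """One training step: inject noise at a random level, predict it back."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, t, eps)
    eps_pred = model(x_t, t)            # network's guess at the injected noise
    return np.mean((eps - eps_pred) ** 2)  # simple MSE on the noise
```

Training repeats this loss over many images and random timesteps; because `t` is sampled uniformly, one network learns to denoise at every noise level.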
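The reverse loop with classifier-free guidance can be sketched the same way. This continues the toy setup: the `model(x, t, text_emb)` signature, the guidance scale default, and the 50-step schedule are assumptions for the example, not a real model's interface.

```python
import numpy as np

# Toy DDPM-style reverse sampler with classifier-free guidance (sketch).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(model, shape, text_emb, guidance_scale=7.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)          # start from pure random noise
    for t in reversed(range(T)):
        eps_cond = model(x, t, text_emb)    # prompt-conditioned prediction
        eps_uncond = model(x, t, None)      # unconditional prediction
        # Classifier-free guidance: extrapolate away from the unconditional
        # prediction, toward the prompt-conditioned one.
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        # Standard DDPM mean update using the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # re-inject a little noise per step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Each pass makes the image slightly less noisy; a guidance scale above 1 trades diversity for closer adherence to the prompt.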

generative-ai · diffusion-models · image-generation
