Part 0: Sampling from the Model

The seed used for Part A was 222.

In this part, you will sample images from the diffusion model. The code generates images by sampling from the model with varying numbers of inference steps. Examining three sets of images generated with 20, 40, and 100 inference steps reveals a clear increase in image quality as the number of steps grows.

20 Steps

Three sampled images

40 Steps

Three sampled images

100 Steps

Three sampled images

Part 1: Sampling Loops

Part 1.1: Forward Process

The forward process is implemented using the equation:

x_t = √(ᾱ_t) x_0 + √(1 - ᾱ_t) ε, where ε ∼ N(0, 1)
    

This process adds Gaussian noise to an image while reducing the influence of the original image.
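As a reference, here is a minimal sketch of the forward process in PyTorch. It assumes a precomputed 1-D tensor alphas_cumprod holding ᾱ_t for every timestep; the function and variable names are illustrative, not the project's exact API.

    import torch

    def forward(x0, t, alphas_cumprod):
        """Noise a clean image x0 (B, C, H, W) to timestep t."""
        alpha_bar = alphas_cumprod[t]
        eps = torch.randn_like(x0)                        # ε ~ N(0, 1) per pixel
        x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
        return x_t, eps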

  • t = 0: Original image.
  • t = 250: The tower remains visible, but random noise begins to affect clarity, especially in the sky and edges.
  • t = 500: Noise obscures much of the tower's shape, with colorful static making details hard to discern.
  • t = 750: The image is almost entirely noise, with only a faint outline of the tower visible.

Part 1.2: Gaussian Denoising

We applied Gaussian blur filtering to try to denoise the noisy images at timesteps t = [250, 500, 750]. The results clearly show that this classical method is not effective at denoising.
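For reference, the classical baseline is a single torchvision call on the noisy image; the kernel size and sigma below are illustrative choices rather than tuned values.

    from torchvision.transforms.functional import gaussian_blur

    # Blur the noisy image x_t (B, C, H, W). This suppresses high-frequency
    # noise but also destroys image detail, which is why it performs poorly.
    blurred = gaussian_blur(x_t, kernel_size=5, sigma=2.0)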

Image 1

Part 1.3: One-Step Denoising

We use our pretrained UNet model to estimate the noise and remove it in a single step. This does much better than Gaussian blur denoising.
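A minimal sketch of the one-step estimate, assuming unet(x, t) returns the predicted noise (in the actual pipeline a prompt embedding is also passed; it is omitted here for brevity):

    def one_step_denoise(x_t, t, unet, alphas_cumprod):
        """Estimate the clean image x0 from x_t in a single step."""
        alpha_bar = alphas_cumprod[t]
        eps_hat = unet(x_t, t)                            # predicted noise
        # Invert the forward equation x_t = sqrt(ᾱ_t)·x0 + sqrt(1 - ᾱ_t)·ε
        x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
        return x0_hat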

Image 1

Part 1.4: Iterative Denoising

The one-step denoising above does well, but there is still room for improvement. To improve on it, we use iterative denoising to recover clean images from noisy inputs using strided timesteps. Using a stride of 30 timesteps from 990 down to 0, we observe a clear improvement in image quality: multiple smaller denoising steps produce significantly better results than attempting to denoise in a single large step. Below we compare iterative denoising with one-step denoising and classical (Gaussian blur) denoising.
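A sketch of the loop, using a DDPM-style update that interpolates between the current noisy image and the running clean-image estimate. The helper names and the omission of the added-variance term are simplifications, not the project's exact implementation.

    def iterative_denoise(x, unet, alphas_cumprod, i_start=0, stride=30, T=990):
        """Denoise x from timestep T down to 0 using strided timesteps."""
        timesteps = list(range(T, -1, -stride))           # 990, 960, ..., 0
        for i in range(i_start, len(timesteps) - 1):
            t, t_prev = timesteps[i], timesteps[i + 1]
            a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
            alpha = a_bar / a_bar_prev                    # effective alpha for this stride
            beta = 1 - alpha

            eps_hat = unet(x, t)                                          # noise estimate
            x0_hat = (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()    # clean estimate

            # Posterior mean: blend the clean estimate with the current image
            x = (a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0_hat \
                + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x
        return x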

Noise 40

Noise 240

Noise 390

Noise 540

Noise 690

Now all three together:

Gaussian

One-Step

Iterative

Part 1.5: Diffusion Model Sampling

The model, prompted with "a high-quality photo," transforms Gaussian noise into an image by iteratively removing noise based on patterns from its training data. Each step brings the output closer to the natural image manifold.
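Sampling is then just the iterative denoiser applied to pure Gaussian noise (reusing the hypothetical iterative_denoise sketch from Part 1.4; the image size is illustrative):

    x = torch.randn(5, 3, 64, 64)                         # five pure-noise starting points
    samples = iterative_denoise(x, unet, alphas_cumprod, i_start=0)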

Image 1

Image 2

Image 3

Image 4

Image 5

Part 1.6: Classifier-Free Guidance

CFG improves image generation by blending two noise estimates:

  • Conditional: Based on the prompt: "a high quality photo".
  • Unconditional: Based on an empty prompt "".

The estimates are combined using the formula:

noise_est = noise_est_uncond + cfg_scale * (noise_est_cond - noise_est_uncond)

With cfg_scale > 1, we get higher quality images aligned with the prompt.

The iterative_denoise_cfg function achieves this by:

  • Running the UNet model twice per step: once with prompt embeddings, once with null embeddings.
  • Combining predictions using the CFG formula before applying each denoising step.

For the images below, we used cfg_scale = 7.
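Inside the denoising loop, the combination step can be sketched as below, assuming unet(x, t, emb) returns the noise estimate for a given prompt embedding (names are illustrative):

    def cfg_noise_estimate(x_t, t, unet, cond_emb, uncond_emb, cfg_scale=7.0):
        """Classifier-free guidance: extrapolate past the unconditional estimate."""
        eps_cond = unet(x_t, t, cond_emb)                 # "a high quality photo"
        eps_uncond = unet(x_t, t, uncond_emb)             # empty prompt ""
        return eps_uncond + cfg_scale * (eps_cond - eps_uncond)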

Image 1

Image 2

Image 3

Image 4

Image 5

Part 1.7: Image Translation and Conditioning

Part 1.7.1: Editing Hand-Drawn and Web Images

We take a real image, add noise to it, and then denoise it. This effectively allows us to make edits to existing images: the more noise we add, the larger the edit will be. This works because, in order to denoise an image, the diffusion model must to some extent "hallucinate" new content; the model has to be "creative."

Here, we take the original test image, noise it a little, and force it back onto the natural image manifold without any conditioning. Effectively, we get an image that is similar to the test image.

We test a range of starting indices [1, 3, 5, 7, 10, 20], which correspond to different levels of editing; a code sketch of this procedure follows below.
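In code, the edit amounts to noising the image to an intermediate timestep and resuming the denoising loop from there. A sketch, reusing the hypothetical forward and iterative_denoise helpers from earlier sections:

    def sdedit(x_orig, i_start, unet, alphas_cumprod, stride=30, T=990):
        """Project a noised version of x_orig back onto the image manifold.

        A small i_start means starting at a high timestep (lots of noise,
        large edit); a large i_start stays closer to the original image.
        """
        timesteps = list(range(T, -1, -stride))
        x_t, _ = forward(x_orig, timesteps[i_start], alphas_cumprod)   # add noise
        return iterative_denoise(x_t, unet, alphas_cumprod, i_start=i_start)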

start index 1 noise

start index 3 noise

start index 5 noise

start index 7 noise

start index 10 noise

start index 20 noise

Original

Now for two hand-drawn images:

start index 1 noise

start index 3 noise

start index 5 noise

start index 7 noise

start index 10 noise

start index 20 noise

Original

start index 1 noise

start index 3 noise

start index 5 noise

start index 7 noise

start index 10 noise

start index 20 noise

Original

Part 1.7.2: Inpainting

This method lets you edit an image using a binary mask. Given:

  • Original Image: x_orig
  • Binary Mask: m

The result keeps the original content where m = 0 and generates new content where m = 1.

How It Works

  1. Run the diffusion denoising loop.
  2. At each step, update the image to match the original where m = 0.

The update rule is:

x_t ← m * x_t + (1 - m) * forward(x_orig, t)

This ensures that the unmasked areas retain the original image while the masked areas are updated with new content.
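A sketch of the loop body, again with hypothetical helpers: denoise_step stands in for one CFG-guided reverse step, and forward is the noising function from Part 1.1.

    def inpaint(x_orig, mask, unet, alphas_cumprod, stride=30, T=990):
        """Generate new content where mask == 1, keep x_orig where mask == 0."""
        timesteps = list(range(T, -1, -stride))
        x = torch.randn_like(x_orig)                      # start from pure noise
        for i in range(len(timesteps) - 1):
            t, t_prev = timesteps[i], timesteps[i + 1]
            x = denoise_step(x, t, t_prev, unet, alphas_cumprod)       # one reverse step
            # Force the unmasked region back to the appropriately noised original
            x_known, _ = forward(x_orig, t_prev, alphas_cumprod)
            x = mask * x + (1 - mask) * x_known
        return x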

Example

Use this method to inpaint specific parts of an image, such as restoring the top of the Campanile in a photo.

The test image

Campanile

Mask

To replace

Inpainted

Capybara

Mask

To replace

Inpainted

Water Bottle

Mask

To replace

Inpainted

Part 1.7.3: Text-Conditioned Image-to-Image Translation

In this section, we extend the previous technique by guiding the image projection with a text prompt. This allows more specific edits beyond simply projecting back onto the natural image manifold.

How It Works

Instead of using the default prompt "a high quality photo", we specify a custom text prompt to guide the projection.

Below are edited versions of each test image, using the given prompt at start indices [1, 3, 5, 7, 10, 20].

Capybara image to dog, prompt: "a photo of a dog"

Capybara

390 noise

690 noise

780 noise

840 noise

900 noise

960 noise

Water bottle image to barista, prompt: 'a photo of a hipster barista'

Water bottle

390 noise

690 noise

780 noise

840 noise

900 noise

960 noise

Campanile image to rocket, prompt: "a rocket ship"

Campanile

390 noise

690 noise

780 noise

840 noise

900 noise

960 noise

Part 1.8: Visual Anagrams

We will now generate images that look like one thing but, when flipped upside down, reveal something else. In this example, we will generate an image that looks like "an oil painting of an old man" but, when flipped upside down, reveals "an oil painting of people around a campfire."

How It Works

The process involves denoising an image in two ways, combining the results to create the illusion:

  1. Denoise the image x_t at step t using the prompt "an oil painting of an old man", obtaining the noise estimate ε₁.
  2. Flip the image upside down and denoise it with the prompt "an oil painting of people around a campfire", yielding the noise estimate ε₂.
  3. Flip ε₂ back to align it with ε₁.
  4. Average the two noise estimates: ε = (ε₁ + ε₂) / 2.
  5. Perform a reverse diffusion step using the averaged noise estimate ε.

The algorithm can be summarized as:

    ε₁ = UNet(x_t, t, p₁)
    ε₂ = flip(UNet(flip(x_t), t, p₂))
    ε  = (ε₁ + ε₂) / 2
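A minimal sketch of this noise estimate in code, assuming unet(x, t, emb) returns the noise prediction for a prompt embedding and images are (B, C, H, W) tensors:

    def anagram_noise_estimate(x_t, t, unet, emb1, emb2):
        """Average an upright noise estimate with a flipped-image estimate."""
        eps1 = unet(x_t, t, emb1)                            # upright prompt p₁
        x_flip = torch.flip(x_t, dims=[2])                   # flip along the height axis
        eps2 = torch.flip(unet(x_flip, t, emb2), dims=[2])   # flip estimate back
        return (eps1 + eps2) / 2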

Here are the results:

prompt1 = "an oil painting of an old man" prompt2 = "an oil painting of people around a campfire"

Noise 40

Old man

Noise 40

People around a camp fire

prompt1 = 'a wild baboon'
prompt2 = 'a volcano eruption'

Baboon

Volcano erupting

prompt1 = 'a ski lift with a mountain background'
prompt2 = 'a row boat in venice'

Boat

Ski lift

Part 1.9: Hybrid Images

To create hybrid images using a diffusion model, we blend low frequencies from one noise estimate with high frequencies from another. Here’s the step-by-step process:

  1. Estimate noise ε₁ for the first prompt using the UNet model.
  2. Estimate noise ε₂ for the second prompt using the same model.
  3. Combine the noise estimates:
    • Apply a low-pass filter (e.g., Gaussian blur) to ε₁.
    • Apply a high-pass filter to ε₂.
    • Add the results: ε = f_lowpass(ε₁) + f_highpass(ε₂).

The algorithm can be summarized as:

    ε₁ = UNet(x_t, t, p₁)
    ε₂ = UNet(x_t, t, p₂)
    ε  = f_lowpass(ε₁) + f_highpass(ε₂)
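A sketch, using a Gaussian blur as the low-pass filter and its residual as the high-pass (kernel size and sigma are illustrative):

    from torchvision.transforms.functional import gaussian_blur

    def hybrid_noise_estimate(x_t, t, unet, emb1, emb2):
        """Low frequencies follow prompt 1, high frequencies follow prompt 2."""
        eps1 = unet(x_t, t, emb1)
        eps2 = unet(x_t, t, emb2)
        low = gaussian_blur(eps1, kernel_size=33, sigma=2.0)
        high = eps2 - gaussian_blur(eps2, kernel_size=33, sigma=2.0)
        return low + high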

prompt1 = "a lithograph of a skull"
prompt2 = "a lithograph of waterfalls"

prompt1 = "a rocket ship"
prompt2 = 'a pencil'

prompt1 = 'a man wearing a hat'
prompt2 = 'a volcano eruption'

Part 2: Training Diffusion Networks

In this part of the project, we build and train our own diffusion models in PyTorch on the MNIST digits dataset. We structure this into the following steps:
  • Setting up the UNet structure
  • Using the UNet to train a denoiser
  • Adding time conditioning to the UNet
  • Training the UNet
  • Adding class conditioning to the UNet

Part 2.2: Noise Levels on MNIST digits

We start by adding noise to the MNIST digits at several different noise levels and visualizing the results.

Noised MNIST digits

Part 2.2.1: Training the model to perform denoising

We now train the model to perform denoising on the MNIST digits dataset: a denoiser is trained to recover a clean image x from a noisy image z, produced by applying Gaussian noise with σ = 0.5 to x. The training setup is as follows; a minimal code sketch is given after the list.

Dataset and training settings:

  • Dataset: MNIST via torchvision.datasets.MNIST, with flags to select the training and test sets
  • Batch size: 256
  • Epochs: 5
  • Hidden dimension: D = 128
  • Optimizer: Adam
  • Learning rate: 1e-4
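A minimal sketch of this setup; UnconditionalUNet stands in for the UNet built in the previous step, and the noising z = x + σ·ε is an assumption consistent with the σ = 0.5 setting above.

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=256, shuffle=True)

    model = UnconditionalUNet(hidden_dim=128).cuda()      # hypothetical UNet class
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sigma = 0.5

    for epoch in range(5):
        for x, _ in loader:                               # labels unused here
            x = x.cuda()
            z = x + sigma * torch.randn_like(x)           # noisy input
            loss = F.mse_loss(model(z), x)                # L2 loss against the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()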

Here is the loss curve for the model:

Loss curve

Here are the results on some of the MNIST images in the test set:

After Epoch 1

After Epoch 5

Note that our denoiser was trained with σ = 0.5. Let's see how it performs on other values of σ that it wasn't trained for.

Denoising results at varying σ

Part 2.3: Adding Time Conditioning to UNet

Now we move on to training the diffusion model itself. We begin by adding time conditioning to the UNet structure and training the model; a sketch of the training loop is given below.
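A sketch of the training loop under the standard objective of predicting the injected noise. TimeConditionalUNet, the number of timesteps, and the hyperparameters are illustrative assumptions; loader and alphas_cumprod are as in the sketches above (alphas_cumprod is assumed to live on the same device as the images).

    import torch
    import torch.nn.functional as F

    T = 300                                               # number of diffusion timesteps (illustrative)
    model = TimeConditionalUNet(hidden_dim=64).cuda()     # hypothetical class
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(20):
        for x, _ in loader:
            x = x.cuda()
            t = torch.randint(0, T, (x.shape[0],), device=x.device)   # random timesteps
            a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)                # ᾱ_t per sample
            eps = torch.randn_like(x)
            x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps          # forward process
            loss = F.mse_loss(model(x_t, t / T), eps)                  # predict the noise
            opt.zero_grad()
            loss.backward()
            opt.step()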

Here is the training loss curve for the UNet with time conditioning:

Training loss curve

Here are the results on some of the MNIST images in the test set:

After Epoch 5

After Epoch 20

Part 2.4: Adding Class Conditioning to UNet

Now we move on to adding class conditioning to the UNet structure (on top of time conditioning) and training the model. Here is the training loss curve for the UNet with class conditioning:

Training loss curve

Here are the results on some of the MNIST images in the test set:

After Epoch 5

After Epoch 20