CS180 Project 5: Fun with Diffusion Models

Matthew Sulistijowadi

Overview:

Played with diffusion models, including sampling loops used for inpainting and the creation of optical illusions. Part B involved training our own diffusion model on MNIST.

Part A: The Power of Diffusion Models!

Part 0: Setup

The random seed used throughout is 47.

Using the three given prompts, I generated images for each of them with 5 and 20 inference steps.

Num Inference Steps = 5
Num Inference Steps = 20

1.1 Forward Process

Implemented the forward process of diffusion: at different timesteps, scaled noise is added to the clean image. Results are shown for t = 250, 500, and 750, with a minimal sketch of the noising step after the images.

Noisy 250
Noisy 500
Noisy 750
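The noising step itself is essentially one line. A minimal sketch, assuming `alphas_cumprod` holds the scheduler's cumulative alpha products (ᾱ) and `x0` is the clean image tensor:

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    with eps ~ N(0, I). Returns both the noisy image and the noise used."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return x_t, eps

# e.g. noisy_250, _ = forward_noise(campanile, 250, alphas_cumprod)
# (assuming `campanile` is a [C, H, W] image tensor)
```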

1.2 Classical Denoising

Used Gaussian blur filtering to try to remove the noise from the images above (a small sketch follows the results).

De-Noisy 250

De-Noisy 500

De-Noisy 750
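The classical baseline is just a low-pass filter. A minimal sketch using torchvision's `gaussian_blur`; the kernel size and sigma below are illustrative, not the exact values used for the images above:

```python
import torchvision.transforms.functional as TF

def blur_denoise(x_noisy, kernel_size=5, sigma=2.0):
    """Classical 'denoising': low-pass the noisy image with a Gaussian blur.
    This removes high-frequency noise but also blurs away image detail."""
    return TF.gaussian_blur(x_noisy, kernel_size=kernel_size, sigma=sigma)
```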

1.3 One-Step Denoising

Used the pretrained diffusion model's UNet to denoise the images. The UNet estimates the Gaussian noise in the image, which is then removed to recover something closer to the original; a sketch of this estimate follows the images below.

Original (blurry because the source image is small)
Noisy 250
Noisy 500
Noisy 750
Est. Clean Image 250
Est. Clean Image 500
Est. Clean Image 750
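The one-step estimate simply inverts the forward-process equation using the UNet's noise prediction. A sketch, assuming a hypothetical `unet(x_t, t)` callable that returns the predicted noise (the real model also takes the text embedding):

```python
def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Estimate the clean image from x_t in a single step by inverting
    the forward process: x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t)  # hypothetical noise-prediction call
    x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return x0_hat
```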

1.4 Iterative Denoising

Iterative denoising cleans up the image step by step: starting from a heavily distorted image, the noise level is decreased over successive steps, producing progressively higher-quality results (a sketch of the update follows the comparison images).

5th Loop Denoise
10th Loop Denoise
15th Loop Denoise
20th Loop Denoise
25th Loop Denoise

Clean Iterative Denoise
Clean One-Step Denoise
Clean Gaussian Blur
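A sketch of the iterative loop, reusing the one-step clean-image estimate from above and moving x toward it with the standard DDPM posterior mean at each strided timestep. The added-variance term is omitted for brevity, and `unet` is again a hypothetical noise-prediction callable:

```python
def iterative_denoise(unet, x_T, strided_timesteps, alphas_cumprod):
    """Denoise from high noise to clean over a strided, decreasing list of
    timesteps (e.g. 990, 960, ..., 0). At each step, form the clean-image
    estimate x0_hat and interpolate toward it using the DDPM posterior mean."""
    x = x_T
    for i in range(len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = abar_t / abar_prev
        beta = 1 - alpha
        eps_hat = unet(x, t)  # hypothetical noise-prediction call
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        x = (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x
    return x
```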

1.5 Diffusion Model Sampling

Generated images from scratch by running the iterative denoiser from pure random noise, using the prompt “a high quality photo”.

1.6 Classifier-free Guidance (CFG)

Images generated with CFG are of noticeably higher quality. CFG combines the noise estimate conditioned on the prompt with an unconditional estimate, pushing the result past the conditional one. Compared to the previous section, these images look much better.
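The combination itself is one line. A sketch, assuming a hypothetical `unet(x_t, t, embedding)` noise-prediction callable; the guidance scale of 7 is illustrative:

```python
def cfg_noise_estimate(unet, x_t, t, cond, uncond, scale=7.0):
    """Classifier-free guidance: run the UNet once with the text conditioning
    and once with the null/unconditional embedding, then push the combined
    estimate past the conditional one by `scale` (gamma > 1)."""
    eps_cond = unet(x_t, t, cond)      # hypothetical conditioned call
    eps_uncond = unet(x_t, t, uncond)  # hypothetical unconditioned call
    return eps_uncond + scale * (eps_cond - eps_uncond)
```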

1.7 Image-to-image Translation (with CFG from now on)

Starts the denoising loop at different noise levels; higher levels add less noise, so the result stays closer to the original image (a sketch follows the results below).

Campanile: Noise Levels 1, 3, 5, 7, 10, 20
Japanese Temple: Noise Levels 1, 3, 5, 7, 10, 20
Parliament: Noise Levels 1, 3, 5, 7, 10, 20
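A sketch of the image-to-image (SDEdit) procedure, reusing `forward_noise` and `iterative_denoise` from the sketches above; `strided_timesteps` is the same decreasing timestep list used for iterative denoising:

```python
def sdedit(unet, x_orig, i_start, strided_timesteps, alphas_cumprod):
    """Image-to-image translation: noise the original image to the timestep
    at index i_start, then run the usual iterative (CFG) denoising from there.
    A larger i_start adds less noise, so the result stays closer to x_orig."""
    t_start = strided_timesteps[i_start]
    x_t, _ = forward_noise(x_orig, t_start, alphas_cumprod)
    return iterative_denoise(unet, x_t, strided_timesteps[i_start:], alphas_cumprod)
```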

1.7.1 Editing Hand-Drawn and Web Images

Starting from non-realistic images (web images and hand-drawn sketches), the same procedure projects them onto the natural-image manifold so they look more realistic.

Avocado Grid: noise levels 1, 3, 5, 7, 10, 20
Drawing of Duck: noise levels 1, 3, 5, 7, 10, 20 (funny result at 20: a punk look with a sign saying LOL)
Drawing of Flower: noise levels 1, 3, 5, 7, 10, 20

1.7.2 Inpainting

Inpainting alters desired parts of an image without changing the rest: new content is generated wherever the mask is 1, while the rest of the image is held fixed. Applied to the Campanile and the other previous images; a sketch of the masking step is below.
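A sketch of the extra step that turns denoising into inpainting, reusing `forward_noise` from above; it is applied after every denoising update:

```python
def inpaint_step(x, x_orig, mask, t, alphas_cumprod):
    """Force the pixels outside the mask back to the (appropriately re-noised)
    original image, so only the masked region (mask == 1) is ever changed."""
    x_orig_t, _ = forward_noise(x_orig, t, alphas_cumprod)
    return mask * x + (1 - mask) * x_orig_t
```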

1.7.3 Text-Conditioned Image-to-image Translation

Guide the projected output with a text prompt, giving some control over the result (e.g., the Campanile with the prompt “a rocket ship”).

“a rocket ship”: noise levels 1, 3, 5, 7, 10, 20

“a cheese block”

“a wall of fire”

1.8 Visual Anagrams

Created optical illusions with diffusion models: the image looks like one thing, but when flipped upside down it looks like something else (a sketch of the combined noise estimate follows the examples).

‘an oil painting of people around a campfire’
‘an oil painting of an old man’
“a woman holding beans”
“a photo of a water buffalo” (the woman’s arms become the horns and her dress forms the buffalo’s head)
“an oil painting of noodles”

“an oil painting of an orange”
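A sketch of the combined noise estimate used for the anagrams, again assuming a hypothetical `unet(x_t, t, prompt_embedding)` noise-prediction callable:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, prompt_a, prompt_b):
    """Visual anagrams: estimate noise for prompt A on the image as-is and for
    prompt B on the image flipped upside down, un-flip the second estimate,
    and average. Denoising with this estimate gives an image that reads as A
    upright and as B when flipped."""
    eps_a = unet(x_t, t, prompt_a)  # hypothetical call
    eps_b = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return 0.5 * (eps_a + eps_b)
```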

1.9 Hybrid Images

Depending on viewing distance, the perceived image changes: the low-frequency content dominates from far away, and the high-frequency content dominates up close (a sketch of the combined estimate follows the examples).

Waterfall close up, Skull further away
Goose close up, Statue of Liberty further away
Lake close up, Car further away
Bear close up, Egg further away
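A sketch of the hybrid noise estimate; the Gaussian blur acts as the low-pass filter, its kernel size and sigma are illustrative, and `unet` is again a hypothetical noise-prediction callable:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, prompt_far, prompt_near,
                          kernel_size=33, sigma=2.0):
    """Hybrid images: combine the low frequencies of one prompt's noise
    estimate (seen from far away) with the high frequencies of another's
    (seen up close)."""
    eps_far = unet(x_t, t, prompt_far)    # hypothetical call
    eps_near = unet(x_t, t, prompt_near)
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return low + high
```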

Part B: Diffusion Models from Scratch!

Trained my own diffusion model on MNIST, with the first part building a UNet architecture for image denoising.

The UNet consists of downsampling and upsampling blocks with skip connections; I constructed the Unconditional UNet and its standard operations (a simplified sketch appears below).

Tensor operations and UNet structure implemented
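A stripped-down sketch of the idea (one down/up level and a single skip connection); the actual project UNet has more blocks and a larger hidden dimension:

```python
import torch
import torch.nn as nn

class SimpleUnconditionalUNet(nn.Module):
    """Minimal UNet sketch: encode, downsample, upsample, then concatenate
    the encoder features (skip connection) before projecting back to the
    input channels."""
    def __init__(self, in_ch=1, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.down = nn.Sequential(nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.GELU())
        self.up = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.GELU())
        self.dec = nn.Conv2d(2 * hidden, in_ch, 3, padding=1)  # skip-concat, then project back

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.down(e))
        return self.dec(torch.cat([e, d], dim=1))  # skip connection
```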

Part 1: Training a Single-Step Denoising UNet

Given a noisy image z, we want to train the denoiser so its output corresponds to the clean image. Below, the noising process is visualized as the noise level gradually increases, followed by the training results (a sketch of the training loop comes after them).

Visualization of the noising process with σ ∈ {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0}
Training loss curve at the start of training, before it settles
Full training loss curve (quite erratic, but the loss settles at a very small value, under 1%)
Test set results from 1 Epoch
Test set results from 5 Epochs
Testing on noise levels not seen during training: denoising digits at out-of-distribution sigmas. Matching the project website, σ = 0.8 and 1.0 give worse-looking results while the other levels look fine.
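A sketch of the training loop for the single-step denoiser; the noise level sigma = 0.5, the Adam hyperparameters, and the device are assumptions, not values stated above:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Single-step denoiser training: noise each clean MNIST batch with
    z = x + sigma * eps and regress the UNet output back to x with an L2 loss."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                      # labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # noisy input
            loss = F.mse_loss(unet(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```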

Part 2: Training a Diffusion Model

Started by training a model with a time-conditioned UNet to get reasonable results, then added class conditioning to dramatically improve quality and consistency, producing digits that look fairly normal. The new UNet tweaks the previously built UNet to inject the timestep using FCBlocks (a sketch follows the loss plot below).

Added Time conditioning using FCBlocks (red boxed section)

Training Loss after 20 epochs
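A sketch of the FCBlock and of one time-conditioned training step. The number of timesteps (300), the normalized-timestep input format, and the exact injection points (shown only as comments) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCBlock(nn.Module):
    """Small MLP that embeds the normalized timestep; its output is added
    (broadcast over H and W) to intermediate UNet feature maps, e.g.:
        x = self.up1(x) + self.t_fc(t)[..., None, None]"""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

def time_conditioned_step(unet, x0, alphas_cumprod, num_timesteps=300):
    """One training step: pick a random t per sample, noise the clean batch
    with the forward process, and regress the predicted noise against the
    true noise."""
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    eps_hat = unet(x_t, t.float().view(-1, 1) / num_timesteps)  # normalized t
    return F.mse_loss(eps_hat, eps)
```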

2.3 Sampling from UNet

The time-conditioned model denoises noise into digit-like outputs, although a noticeable number are deformed or unclear as to which digit they are (if any). Class conditioning with classifier-free guidance is therefore used to improve accuracy, producing digits that actually look like numbers someone would write (a sampling sketch follows the results).

Sampling UNet Time conditioning with 5 epochs
Sampling UNet Time conditioning with 20 epochs
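A sketch of the time-conditioned sampling loop (standard DDPM sampling); the batch shape, device, and normalized-timestep input format are assumptions:

```python
import torch

@torch.no_grad()
def sample_time_conditioned(unet, betas, shape=(40, 1, 28, 28), device="cuda"):
    """Start from pure noise and walk t from T-1 down to 0, removing the
    predicted noise at each step and adding fresh noise at every step
    except the last."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    T = len(betas)
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0], 1), t / T, device=device)  # normalized t
        eps_hat = unet(x, t_batch)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - alphas[t]) / (1 - abar[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```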

2.4 + 2.5 Adding and Sampling the Class-Conditioned UNet

Training loss for the class-conditioned UNet (a sketch of the class-conditioned training step and CFG sampling follows the results).

5 Epochs Class Conditioning (most digits already look good, though with thicker strokes than at 20 epochs, indicating training is not yet complete)
20 Epochs Class Conditioning
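A sketch of the class-conditioned training step, where the conditioning vector is randomly dropped so the model also learns an unconditional estimate for CFG at sampling time. The drop probability 0.1 and the CFG scale gamma = 5 mentioned in the comment are assumptions, not values stated above:

```python
import torch
import torch.nn.functional as F

def class_conditioned_step(unet, x0, labels, alphas_cumprod,
                           num_classes=10, num_timesteps=300, p_uncond=0.1):
    """Same as the time-conditioned step, but the UNet also receives a
    one-hot class vector, zeroed out with probability p_uncond so the model
    learns both conditional and unconditional noise estimates."""
    c = F.one_hot(labels, num_classes).float()
    c = c * (torch.rand(c.shape[0], 1, device=c.device) > p_uncond)  # drop conditioning
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    eps_hat = unet(x_t, t.float().view(-1, 1) / num_timesteps, c)
    return F.mse_loss(eps_hat, eps)

# At sampling time, CFG mixes the class-conditional and unconditional estimates:
#   eps = eps_uncond + gamma * (eps_cond - eps_uncond), with e.g. gamma = 5
```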