1 Title: Image Generation Using Simple Stable Diffusion

The focus of our project is to reimplement stable diffusion. We will be reimplementing the Variational Autoencoder, the UNet, and a pretrained CLIP within our stable diffusion model in TensorFlow.

2 Who

• Wanjia Fu: wfu16

• Guo Ma: gma10

• Flavia Maria Galeazzi: fgaleazz

• Shu Xu: sxu99

• Raisa Axenie: raxenie

3 Introduction

• What problem are you trying to solve and why?

We are trying to solve the problem of generating high-resolution images conditioned on a text prompt, using language-model embeddings. This has many applications in high-quality content creation, ranging from ads to posters to illustrations. Our challenge is to optimize this architecture for the limited computational resources we have at our disposal.

• If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

Our work is inspired by the paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, in which the team presents Imagen, a text-to-image diffusion model that generates photos with an unprecedented degree of photorealism. Though inspired by their work, our project seeks to implement a stable diffusion model, translating the training pipeline implemented in PyTorch (link) by Fareed Khan into the TensorFlow framework, and then training the model on the more intricate Fashion-MNIST dataset.

• If you are doing something new, detail how you arrived at this topic and what motivated you.

We will be reimplementing stable diffusion instead of a simple diffusion model because we find it more interesting to apply the diffusion process over a lower-dimensional latent space, which is what stable diffusion does.
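
The diffusion process we apply in that latent space can be sketched in closed form: a latent can be noised to any timestep in one step. The linear beta schedule, timestep count, and latent shape below are illustrative assumptions, not our final hyperparameters.

```python
import numpy as np

# Closed-form forward diffusion: q(z_t | z_0) lets us noise a latent to
# timestep t directly, without iterating t individual steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal retention per timestep

def q_sample(z0, t, rng=np.random.default_rng(0)):
    """Sample z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

z0 = np.zeros((4, 4))                # stand-in for a 4x4 latent
zt, eps = q_sample(z0, t=T - 1)      # near t = T, z_t is almost pure noise
```

The UNet is then trained to predict `eps` from `zt`, `t`, and the text embedding; because the latent is much smaller than the image, each denoising step is far cheaper than in pixel space.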

• What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? etc.

This is a text-to-image generation problem: a generative modeling task rather than classification or regression.

4 Related Work

Are you aware of any, or is there any prior work that you drew on to do your project?

• Link to the paper: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. The key idea presented in the paper is that, instead of building a larger image diffusion model in the text-to-image generator, putting the emphasis on training a larger and more accurate language model using only text data more efficiently yields a more stable text-to-image generator. Their work provides a new possibility for us: we can achieve better results with a less complex image diffusion model. Another insight the paper provides is a simpler and more memory-efficient diffusion architecture, Efficient U-Net, and a sampling technique that helps achieve more photorealistic results with more memory-efficient training.

• Link to the paper: InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization. This paper addresses the difficulty of producing images that align semantically with the text prompt, which it attributes to ineffective noise initialization. The paper presents a solution, Initial Noise Optimization (InitNO), which guides the initial noise toward a valid region through initial latent space partitioning and a noise optimization pipeline.

• Link to the blog post: Everything you need to know about GLIDE and DALL-E 2. This blog explains the mechanics of text-guided diffusion. It delves into the two main categories of guided diffusion: classifier-based guidance (CLIP) and classifier-free guidance.

5 Data

• What data are you using (if any)? We are going to fine-tune Stable Diffusion on a specialized dataset.

• If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).

• How big is it? Will you need to do significant preprocessing?

We will be using the Fashion-MNIST dataset (60,000 training and 10,000 test 28×28 grayscale images). Because the data are very clean, significant preprocessing will not be required.
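
The preprocessing that is needed is light: scale pixels into the range diffusion models expect and add a channel axis. A minimal sketch, using a random stand-in batch (in TensorFlow the raw arrays would come from tf.keras.datasets.fashion_mnist.load_data()):

```python
import numpy as np

# Stand-in for a batch of Fashion-MNIST images: 28x28 grayscale, uint8 in 0-255.
images = np.random.randint(0, 256, size=(8, 28, 28), dtype=np.uint8)

def preprocess(batch):
    """Scale pixels to [-1, 1] and add a trailing channel axis."""
    x = batch.astype(np.float32) / 127.5 - 1.0
    return x[..., None]          # shape becomes (N, 28, 28, 1)

x = preprocess(images)
```

The [-1, 1] range matches the Gaussian noise added during diffusion; the explicit channel axis is needed because TensorFlow convolutions expect NHWC inputs.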

6 Methodology

• What is the architecture of your model?

Our architecture will be made up of a Variational Autoencoder, a UNet, and a pretrained CLIP.
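
How the three components hand tensors to each other can be shown with shape-only stubs. This is a schematic sketch, not our implementation: the latent and embedding shapes are assumptions, and the real VAE, UNet, and CLIP are neural networks rather than these stand-in functions.

```python
import numpy as np

def vae_encode(image):                   # image -> lower-dimensional latent
    return np.zeros((image.shape[0], 4, 4, 4))

def clip_text_embed(prompts):            # text prompts -> conditioning embeddings
    return np.zeros((len(prompts), 512))

def unet_denoise(latent, t, text_emb):   # predict the noise in the latent
    return np.zeros_like(latent)

def vae_decode(latent):                  # latent -> full-resolution image
    return np.zeros((latent.shape[0], 28, 28, 1))

imgs = np.zeros((2, 28, 28, 1))
z = vae_encode(imgs)
eps_hat = unet_denoise(z, t=10, text_emb=clip_text_embed(["a shoe", "a bag"]))
out = vae_decode(z - eps_hat)            # one (schematic) denoising step
```

The key point is that only the small latent `z` flows through the UNet, while the VAE handles the expensive mapping to and from pixel space.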

• How are you training the model?

We will be training the model with the TensorFlow framework.

• If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

The hardest part about implementing the model will be generating high-quality images with the computational resources we have at our disposal. The paper generates 1024×1024 images; we will aim for smaller images. We can use a smaller text encoder and simplify the architecture as much as we can while preserving quality outputs.

• If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Our reach idea is to implement text-to-3D model generation whose outputs can then be 3D printed. However, we decided to start with text-to-2D diffusion. Backup ideas include training the model on a smaller dataset after pre-training, so that it generates outputs of a specific type, specifically for content creators. We would achieve this by creating a custom labeled dataset, using a neural network (https://github.com/salesforce/BLIP) to generate captions for the images, and training our model on that dataset. Alternatively, we would try other model architectures for the diffusion model and perhaps opt for simpler ones.

7 Metrics

• What constitutes “success?”

For our project, we would label as a “success” our model’s ability to generate high-quality images from text inputs.

• What experiments do you plan to run?

We plan to test our model on a range of text inputs and tasks, such as generating different objects, landscapes, or people.

• For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

Since image generation is highly subjective, we plan to have a mix of qualitative and quantitative analysis. For the qualitative analysis we will examine “accuracy” by hand, looking at image composition and fit to the text. For the quantitative analysis we will use CLIP scores, with a higher score indicating a better match between text and image.
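
At its core, the CLIP score compares the CLIP embedding of a generated image with the CLIP embedding of its prompt via cosine similarity. A minimal sketch, using random stand-in embeddings instead of real CLIP outputs (in practice the score is often scaled, e.g. by 100, and clipped at 0):

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """Cosine similarity between an image and a text embedding, in [-1, 1]."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

rng = np.random.default_rng(0)
e = rng.standard_normal(512)   # stand-in 512-d CLIP embedding
```

An image whose embedding matches the prompt embedding exactly would score 1.0; unrelated pairs score near 0.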

• If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

In the papers above, the authors used both qualitative and quantitative analysis of the generated images to assess how well the model performed. In the Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding paper they used human raters to assess how well the images were generated, as FID and CLIP scores have their limitations.

• What are your base, target, and stretch goals?

Our base goal is to generate 2D images from text input, our target is to generate 3D images from text input, and the stretch goal is to generate printable 3D models from text input.

8 Ethics

• What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Given that we are generating images from text and are training on extensive text-image datasets, the appearance of our generated images is heavily influenced by our datasets. Any misrepresentation of stakeholders within the dataset will inevitably manifest in our output. Thus, when selecting datasets, it is crucial to minimize biases to prevent the creation of stereotypes in our generated images. Because of these concerns, we are using LAION-5B, a commonly used dataset for text-image generation, hoping that mature and carefully curated data will help us avoid the biases we might encounter with smaller, less standardized datasets.

• How are you planning to quantify or measure error or success? What implications does your quantifica- tion have?

Because we are generating images based on text, quantifying success is difficult: interpreting the semantic meaning of sentences and the actual relations between the objects in the image still involves a level of human judgment. For this task, we are considering implementing some preexisting benchmarking standards, combined with our own judgment, to determine the success of our model.

9 Division of Labor

• Wanjia Fu: Reimplement UNet in Tensorflow for stable diffusion.

• Guo Ma: Reimplement the pretrained CLIP model in TensorFlow.

• Flavia: Choose the appropriate dataset that we will be using for text-to-image generation with stable diffusion. Finish the data loader. Figure out how to integrate it with the architecture of the stable diffusion model.

• Shu Xu: Reimplement the decoder part of the Variational Autoencoder in Tensorflow for the stable diffusion.

• Raisa: Reimplement the encoder part of the Variational Autoencoder in TensorFlow for the stable diffusion.

10 Challenges

The biggest challenge we are facing is tuning the hyperparameters. Because TensorFlow and PyTorch implement Conv2D and ConvTranspose2D differently, using the same stride size as in the PyTorch implementation does not work in a direct translation to TensorFlow: shape errors occur because the padding differs. What we are currently doing is figuring out the hyperparameters that let the model train effectively. A further limitation is that the diffusion model is computationally expensive and will not run on our own machines, so we need to run and parallelize it on a GPU, which is inefficient and hard to debug. Additionally, because of memory limitations, we cannot train with as large a batch size as the source implementation did, which is also a potential cause of the low prediction quality.
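
The shape mismatches come down to the two frameworks' output-size formulas. PyTorch's Conv2d uses explicit padding (default 0), while TensorFlow's padding='same' pads so that the output length is ceil(input / stride); translating stride-2 layers one-to-one therefore shifts shapes unless the padding is matched by hand. A small illustration of the formulas (a 28-pixel input with a 3×3 kernel here is just an example):

```python
import math

def conv_out_pytorch(n, k, s, p):
    """PyTorch Conv2d output size: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def conv_out_tf_same(n, k, s):
    """TensorFlow Conv2D with padding='same': ceil(n / s), regardless of k."""
    return math.ceil(n / s)

n, k, s = 28, 3, 2
print(conv_out_pytorch(n, k, s, p=0))   # 13 with PyTorch's default padding
print(conv_out_tf_same(n, k, s))        # 14: off by one, which cascades downstream
print(conv_out_pytorch(n, k, s, p=1))   # 14: explicit p=1 matches 'same' here
```

So a faithful translation either sets TensorFlow layers to padding='valid' with explicit padding ops, or adjusts the PyTorch-derived hyperparameters so every layer's shapes line up.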

11 Reflection

• How do you feel your project ultimately turned out? How did you do relative to your base/target/stretch goals?

We reached our base and target goals successfully. In fact, our model outperforms the PyTorch implementation we started from. Therefore, we feel that this project was successful. We did not end up implementing text-to-3D diffusion, which was the stretch goal, but we still achieved good results on our other goals.

• Did your model work out the way you expected it to?

Yes, it did. We also tested it on a larger dataset than the original implementation and it outperformed that implementation in terms of loss. The diffusion-generated images are clearly distinguishable from each other. Given that we trained our model on a single GPU, we are more than satisfied with the result.

• How did your approach change over time? What kind of pivots did you make, if any? Would you have done differently if you could do your project over again?

In the beginning we considered training the model on a larger dataset than we ultimately did, or starting directly with text-to-3D diffusion. However, due to computational and time constraints, we decided to implement text-to-2D diffusion first. We thought this would also be challenging for the same reasons, but we ended up succeeding. If we had to do the project from scratch, we would start with text-to-2D diffusion directly instead of attempting the hardest task first.

• What do you think you can further improve on if you had more time?

We could adapt our architecture to achieve good results on different datasets, trying different data augmentations and pre-processing techniques. We could also try to implement text-to-3D stable diffusion, experiment with other transformer architectures, and perhaps combine results from other papers to reach new findings.

• What are your biggest takeaways from this project/what did you learn?

We learned how to implement diffusion and how to adapt a model architecture to work for different datasets. Given that stable diffusion is a very promising model architecture, we think these skills will serve us well in future projects. We also learned about the challenges and limitations of different datasets, since a large part of our project was choosing datasets and testing our model on them.

Link to the writeup for second checkin: https://docs.google.com/document/d/1x7co8qMMe2V16_0BHS53y0cclXNudneoIRR7QwFA6j0/edit

Link to the final writeup: https://drive.google.com/file/d/1Bm1-GNuf4oBOYBzt0eemP5ohbnUE5A0W/view?usp=sharing
