Title: Pic2Plot
Who:
Gianna Finear (gfinear1), Catherine Lasersohn (claserso), Alan Gu (agu10), Edward Xing (exing1)
Introduction:
We are implementing something new: a model that generates a plot synopsis from the most important words extracted from a picture. We arrived at this idea because we wanted to combine what we have learned in deep learning with what we learned in computer vision and artificial intelligence. An image is passed in, and we identify the most important words describing that image. We then use those words to generate a body of text. We chose a plot synopsis because we have not seen this done before. We are motivated by existing papers on content generation. This problem falls into multiple categories, including classification, structured prediction, and content creation.
Related Work:
There is a lot of prior work on generating natural language from images that we used as inspiration for this project.
This article (https://pub.towardsai.net/this-model-can-create-poetry-from-images-ad0216503d20) describes a model Microsoft made to generate poetry from images. This model uses multi-adversarial neural networks along with an embedding model. First, the images are processed with an embedding model to match image features to poetry. Then, the results are passed through an RNN generator before being refined with discriminators. The paper evaluates the model using poetry-specific metrics, such as style and subjectivity, that are not as relevant to plot summaries because summaries are less abstract than poetry.
This work (https://github.com/ryankiros/neural-storyteller) uses style shifting to generate stories based on images.
This article (https://medium.com/@sanyamagarwal/my-thoughts-on-skip-thoughts-a3e773605efa) describes skip-thought vectors.
Data:
To extract significant words from an image, we will use the Flickr 8k dataset from the Image Captioning homework. We will apply similar preprocessing to extract the captions, and then remove stop words to narrow each image down to a few words. To generate a synopsis based on these words, we will combine a book summary dataset (https://www.cs.cmu.edu/~dbamman/booksummaries.html) with a movie synopsis dataset (https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags). We will take the summary/synopsis text from both of these datasets and preprocess it into a common format.
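The stop-word step described above could be sketched roughly as follows. This is only an illustration, not our final preprocessing: the `STOP_WORDS` set and the `extract_keywords` helper are hypothetical names, and the stop-word list here is deliberately tiny (a real pipeline would likely use a fuller list from an NLP library).

```python
import re
from typing import List

# Hypothetical, deliberately small stop-word list used only for illustration.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "in", "on", "of",
    "and", "with", "at", "to", "through",
}


def extract_keywords(caption: str) -> List[str]:
    """Lowercase a caption, tokenize it, and drop stop words and repeats."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


# The sample caption is made up and is not taken from Flickr 8k.
print(extract_keywords("A dog is running through the grass with a ball"))
# prints ['dog', 'running', 'grass', 'ball']
```

Keeping the remaining words in their original order preserves a little of the caption's structure, which may help the later generation step.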
Methodology:
Our model architecture will be largely based on existing work developed for image-to-text applications, namely image-to-poetry and image-to-story. The architecture may be thought of in two parts: image caption generation followed by plot synopsis generation. Each component is trained separately, and the components are deployed together to produce our image-to-plot-summary result.
Image caption generation is used to convert images to text form. This component will consist of a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN), e.g. an LSTM, for caption generation. We may try a few different models for caption generation: a predictive model such as the one in the image captioning homework, or a visual-semantic embedding, which maps captions and images to a common embedding space. We plan to train this model on the Flickr 8k dataset.
The skip-thought vector encoder-decoder model is used to learn skip-thought embedding representations for novel summaries. Skip-thought vectors are task-agnostic semantic embeddings of sentences. The encoder-decoder model is trained in a self-supervised manner: the decoders reconstruct neighboring sentences from the representation generated by the encoder, and optimizing this reconstruction in turn optimizes the encoder itself. The encoder-decoder model will be trained on the CMU book summary dataset introduced above.
A style transfer function is necessary to convert a short descriptive caption into a full plot summary. This function takes the form F(x) = x - c + b, where x is the skip-thought vector of our image caption, c is the average of our image caption skip-thought vectors, and b is the average of our plot summary skip-thought vectors. Intuitively, we maintain the semantics of the image caption while replacing its style with that of a plot summary.
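As a minimal sketch of the style transfer function F(x) = x - c + b, the snippet below works on plain Python lists. The helper names `mean_vector` and `style_shift` are hypothetical, and the 2-dimensional vectors are toy values chosen for easy checking; real skip-thought vectors are much higher-dimensional and would come from the trained encoder.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise average of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def style_shift(
    x: Sequence[float],
    caption_vectors: Sequence[Sequence[float]],
    summary_vectors: Sequence[Sequence[float]],
) -> Vector:
    """Apply F(x) = x - c + b, with c and b the caption and summary means."""
    c = mean_vector(caption_vectors)  # average caption "style"
    b = mean_vector(summary_vectors)  # average plot summary "style"
    return [xi - ci + bi for xi, ci, bi in zip(x, c, b)]


# Toy stand-ins for skip-thought vectors (assumed values, not real embeddings).
captions = [[1.0, 2.0], [3.0, 4.0]]       # c = [2.0, 3.0]
summaries = [[10.0, 10.0], [20.0, 30.0]]  # b = [15.0, 20.0]

print(style_shift([1.0, 2.0], captions, summaries))
# prints [14.0, 19.0]
```

Because c and b depend only on the training corpora, they can be computed once and reused for every new caption.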
Our full pipeline generates an image caption based on an image and applies a learned style transfer in skip-thought vector space to convert the image caption to a full plot summary. We anticipate that we may face the most difficulty in generating meaningful and consistent embeddings for our text data.
Metrics:
This project seeks to create relevant and creative plot summaries from an image. The complexity and subjectivity of this task will require human supervision to determine the quality of the results, since there could be multiple relevant plot summaries for a given image. Thus, there is no single correct result. Instead, results should be evaluated in terms of relevance and understandability.
For the image captioning model, we can use perplexity, or word-overlap metrics such as BLEU or ROUGE that compare generated captions against reference captions. Our goals for perplexity are:
Base: less than 20
Target: less than 18
Stretch: less than 15
Since the skip-thought vector encoder-decoder model is trained to reduce the reconstruction error of neighboring sentences, perplexity can once again be used to measure how well the decoder predicts the next word from the previous words and the context of neighboring sentences. Our goals for perplexity are:
Base: less than 120
Target: less than 110
Stretch: less than 100
The individual components will be evaluated using the metrics described above, but the final result will be evaluated qualitatively for relevance, understandability, and creativity.
Ethics:
Our problem space involves the creation of new art using deep learning, which removes the human element from this process. The summaries we generate will be based on a dataset of previously created works, meaning that they are influenced by the biases within these datasets. Movies and books do not fully represent human experiences because only a subset of the population is responsible for creating this media. As a result, our model will not encompass the full range of themes a human could create a story about. In addition, our model does not take genre into account, which could trivialize serious topics when they are combined with more comedic ones.
The major “stakeholders” in this problem are individuals responsible for creating novel stories, since our model is supposed to do so automatically. An effective model would help generate ideas that could be expanded to larger storytelling works. If the algorithm makes mistakes, then these summaries could be unreliable in accurately portraying the themes of the image. This would negate the effectiveness of this tool, because the summaries generated would not be useful.
Division of labor:
The two main tasks of this project are extracting significant words from images using image captioning and generating plot summaries based on these significant words. Since the image captioning task should be similar to a homework assignment, Gia and Catherine will work on this and then assist with the generation task. Alan and Edward will be primarily responsible for summary generation. Alan and Edward will work on preprocessing the summary data. Everyone will work on testing the model because the model’s success will be difficult to quantitatively measure.