Inspiration
Have you ever scrolled through Instagram and seen a scene from a movie or show that was not credited? Have you ever wanted to know where the clip is from, but had to dig through the comments hoping some good Samaritan had posted it? With Shablam, this is no longer a problem! Similar to Shazam, it extracts key information from the input video file (the "scene") and compares it against a database of features from other scenes.
Pipeline
Our program works through a series of steps:
- Keyframe Generation
- Feature Extraction
- Comparison
Keyframe Generation: We used a Python library called scenedetect (PySceneDetect) to generate timestamps where the video appears to change shots. We then sample the middle frame of each of these subscenes and treat it as a representative frame for the entire subscene. This significantly cuts down on the number of images we need to run feature extraction on, while remaining fairly robust and mostly representative of the entire clip.
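The midpoint-sampling step above can be sketched as follows. In the real pipeline the (start, end) frame ranges come from PySceneDetect's shot detection (shown in the comment); here we illustrate only the sampling over plain frame indices, with made-up shot boundaries:

```python
# The shot boundaries would come from PySceneDetect, e.g.:
#   from scenedetect import detect, ContentDetector
#   scenes = detect("clip.mp4", ContentDetector())
# Below we sketch only the midpoint-sampling step over frame indices.

def representative_frames(scenes):
    """Pick the middle frame index of each (start, end) shot as its keyframe."""
    return [start + (end - start) // 2 for start, end in scenes]

# Three hypothetical detected shots covering frames 0-119
shots = [(0, 30), (30, 90), (90, 120)]
print(representative_frames(shots))  # [15, 60, 105]
```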
Feature Extraction: We used ResNet-50, a pre-trained convolutional neural network (CNN), to generate deep fingerprints for each keyframe. While ResNet is usually used for image classification, it is also a powerful feature extractor. By cutting off the final layer that performs the classification, we can use the output of the global average pooling layer, a 2048-dimensional vector that summarizes the features in the image. Two images can then be compared easily using simple linear algebra.
Comparison: As mentioned above, we only need to compare two feature vectors with simple linear algebra. We opted for cosine similarity because its output is standardized (between -1 and 1), which makes it easy to set a threshold below which we report no match. Our comparison algorithm takes each keyframe of the input video and compares it with every keyframe of every movie in our database. Each query keyframe's best match and its similarity score are recorded, and the highest score is chosen as our output, so long as it exceeds the threshold. While this brute-force search scales linearly with the size of the database, it runs reasonably fast while the database is small.
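The matching step can be sketched in pure Python (titles, vectors, and the 0.8 threshold below are illustrative, not the project's actual values):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors: in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_match(query_keyframes, database, threshold=0.8):
    """Brute-force search: compare every query keyframe against every
    stored keyframe. `database` maps movie title -> list of feature
    vectors. Returns the best-scoring title, or None below threshold."""
    best_title, best_score = None, threshold
    for title, frames in database.items():
        score = max(cosine(q, f) for q in query_keyframes for f in frames)
        if score > best_score:
            best_title, best_score = title, score
    return best_title

# Toy database with 3-dimensional "fingerprints"
db = {
    "Movie A": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "Movie B": [[0.0, 0.0, 1.0]],
}
print(best_match([[0.9, 0.1, 0.0]], db))  # Movie A
```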
Challenges
We ran into several issues during development that were mainly mitigated with further research. Our initial plan was to simply take a perceptual hash of each frame and compare those. However, we quickly found that this method did not work well on images that were scaled or cropped differently, since hashing discards too much information. We also ran into the challenge of managing our data. We didn't have enough time to implement a proper SQL database, so we were loading a dictionary "database" every single time we ran the code. This was far too inefficient, as initializing the large database took quite some time. We fixed this by generating keyframes only as needed and writing a script that builds the dictionary once and dumps its contents into a pickle file for quick retrieval.
Extension
In the future, we would like to use SQL for efficient database management, along with a hashing lookup table. In the lookup table, we would first compare hashes of the feature vectors and then search only the entries stored under that hash. We would also rely on the IMDb API to obtain more information about matched movies, such as actors, genres, etc. Currently, only very basic information and movie posters are taken from the API.
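One common way to realize such a lookup table for feature vectors is locality-sensitive hashing. A minimal random-hyperplane sketch (our own illustration, not the project's implementation; all names and sizes are hypothetical): vectors with the same sign pattern against a few random hyperplanes land in the same bucket, so only that bucket needs a full cosine scan.

```python
import random

random.seed(0)
DIM, BITS = 8, 4
# Random hyperplanes; each contributes one bit of the hash key.
planes = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]

def lsh_key(vec):
    """Sign of the dot product with each hyperplane gives one hash bit."""
    return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                 for plane in planes)

table = {}

def insert(title, vec):
    table.setdefault(lsh_key(vec), []).append((title, vec))

def candidates(vec):
    """Return only the entries in the query's bucket for a full scan."""
    return table.get(lsh_key(vec), [])
```

Nearby vectors tend to share a key, turning the full database scan into a scan of one small bucket.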
We could also add an audio component to our recognition pipeline. Comparing audio as well would strengthen matches between the same movie scenes and weaken matches between different ones.
We would also like to highlight the value this could bring to fighting plagiarism, such as reposts on Instagram that give no credit to the original creator.