Inspiration
We wanted to help DJs choose the right music for their audience by using a vision model that analyzes how the crowd moves and reacts.
What it does
Our system takes in a live video feed of the crowd, analyzes their movements and poses, and selects a song from Spotify that fits the energy and vibe of the crowd.
How we built it
This project was created by Luke Zhu, Varun Talluri, Anna Simms, and Daniel Tian. We started with a pre-trained video-to-text model from OpenVINO. By replacing the model's classification layer with a regression layer, we trained it to output metrics such as energy, danceability, and tempo from the video (the same metrics Spotify's recommendation system uses). We then match these metrics against a song database stored in MongoDB to find the closest track, and finally use Spotify's API to play it.
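The matching step can be sketched as a nearest-neighbor search over the predicted metrics. This is a minimal, illustrative version: the catalog rows, field names, and tempo normalization are assumptions, not our exact schema, and in the real system the catalog lives in MongoDB rather than a Python list.

```python
import math

# Hypothetical catalog rows, shaped like documents we might store in MongoDB.
# Each song carries Spotify-style audio features.
CATALOG = [
    {"track_id": "spotify:track:aaa", "energy": 0.92, "danceability": 0.85, "tempo": 128.0},
    {"track_id": "spotify:track:bbb", "energy": 0.40, "danceability": 0.55, "tempo": 95.0},
    {"track_id": "spotify:track:ccc", "energy": 0.75, "danceability": 0.90, "tempo": 122.0},
]

def distance(pred, song, tempo_scale=200.0):
    """Euclidean distance between predicted crowd metrics and a song.

    Tempo is divided by an assumed scale so it lands roughly in [0, 1]
    and doesn't dominate the unit-range energy/danceability terms.
    """
    return math.sqrt(
        (pred["energy"] - song["energy"]) ** 2
        + (pred["danceability"] - song["danceability"]) ** 2
        + ((pred["tempo"] - song["tempo"]) / tempo_scale) ** 2
    )

def best_match(pred, catalog):
    """Return the track whose audio features sit closest to the model output."""
    return min(catalog, key=lambda song: distance(pred, song))

# Example: a high-energy crowd prediction maps to the high-energy track.
crowd = {"energy": 0.88, "danceability": 0.87, "tempo": 126.0}
print(best_match(crowd, CATALOG)["track_id"])  # spotify:track:aaa
```

The selected `track_id` would then be handed to Spotify's API for playback.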
Challenges we ran into
At first, we used MediaPipe to detect body landmarks from the video, but landmarks alone couldn't capture the overall vibe of the scene, so we switched to OpenVINO's pre-trained models. Another challenge was finding the right dataset for fine-tuning; we eventually found one from 1001tracklists.com, where people pair YouTube videos of DJ sets with the Spotify tracks being played.
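To show what the landmark approach gave us, and why it fell short, here is an illustrative sketch (not our actual code): frame-to-frame landmark displacement yields a crude motion-energy score, but says nothing about lighting, atmosphere, or overall vibe.

```python
def motion_energy(frames):
    """Crude crowd-energy proxy: mean frame-to-frame displacement of
    pose landmarks, where each frame is a list of (x, y) points.

    This captures *how much* people move, but nothing about the
    lighting, mood, or overall vibe of the scene, which is why we
    moved to a full video model.
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            count += 1
    return total / count

# A still crowd scores zero; a bouncing crowd scores higher.
still = [[(0.5, 0.5), (0.6, 0.5)]] * 10
bouncing = [[(0.5, 0.5 + 0.1 * (i % 2)), (0.6, 0.5 + 0.1 * (i % 2))] for i in range(10)]
print(motion_energy(still) < motion_energy(bouncing))  # True
```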
Accomplishments that we're proud of
We overcame all the challenges we faced and managed to get the full system working from start to finish. We’re particularly proud of integrating different technologies—OpenVINO, MongoDB, and Spotify’s API—into a seamless process that dynamically selects music based on crowd analysis. Additionally, solving the problem of dataset scarcity by finding and adapting the 1001tracklists.com resource was a major win.
What we learned
Throughout this project, we learned the importance of flexibility and adapting to challenges quickly. Switching from a landmark detection model to a pre-trained video model taught us how to pivot when things don’t go as planned. We also gained a deeper understanding of how to fine-tune AI models for specific tasks, how to efficiently work with large audio databases, and how to integrate different APIs to create a smooth user experience. Finally, we learned more about real-time processing and the limitations of machine learning models in complex environments like live events.
What's next for AIDJ
We plan to add multiple encoders to improve the model's accuracy. Right now, the model can get confused by dark environments or flashing lights. To address this, we plan to use MediaPipe to generate body landmarks first, then feed them through a separate encoder to improve AIDJ's robustness.
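One simple way to combine encoders is late fusion: each encoder predicts the same metrics, and we take a weighted average. This is a hypothetical sketch of that idea; the encoder names, metric keys, and weights are all illustrative, not a committed design.

```python
def fuse_predictions(preds, weights):
    """Late fusion: weighted average of per-encoder metric predictions.

    `preds` is a list of dicts (one per encoder) sharing the same keys;
    `weights` is a parallel list of floats that should sum to 1.
    """
    return {
        key: sum(w * p[key] for p, w in zip(preds, weights))
        for key in preds[0]
    }

# Hypothetical outputs: the video encoder is thrown off by flashing
# lights, while a landmark-based encoder only sees body motion.
video_encoder = {"energy": 0.9, "danceability": 0.8, "tempo": 128.0}
landmark_encoder = {"energy": 0.7, "danceability": 0.8, "tempo": 120.0}

fused = fuse_predictions([video_encoder, landmark_encoder], [0.6, 0.4])
print(round(fused["energy"], 2))  # 0.82
```

The weights could later be learned rather than fixed, letting the landmark encoder dominate when the raw video is unreliable.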
Built With
- flask
- mongodb
- openvino