Inspiration
Scrolling through TikTok or Instagram, you have definitely seen a viral dance and thought, "I wish I could do that." But the reality is that professional dance lessons are expensive, and learning from a TikTok "instruction" video is frustratingly difficult. There is a massive accessibility gap in digital creator culture: whether it's a trending TikTok challenge or a viral Instagram reel, there is a huge cultural drive to participate in dance trends, but not everyone knows how.
We wanted to bridge that gap by building a personal AI dance coach that lives in your pocket, turning anyone into a confident dancer using motion analysis and reinforcement learning.
What it does
Dance On It takes a reference video as input and:
- Converts the 2D video frames into 3D skeletal coordinates with a fine-tuned Google MediaPipe model (shoutout Vultr for the cloud GPUs).
- Stores the coordinates in MongoDB Atlas and the raw videos in MongoDB.
- Uses a Retrieval-Augmented Generation framework to help query the Gemini API, translating raw data into natural-language instructions.
- Delivers the instructions via a custom ElevenLabs voice agent (shoutout to "Roger").
When you record yourself, the system calculates the similarity between your movement and the reference. We determine the accuracy of a pose by comparing joint vectors: for two sets of coordinates, we compute the cosine similarity of the corresponding joint angles to gauge alignment. If your similarity score falls below a certain threshold, Gemini generates specific tips to correct your form.
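A minimal sketch of this scoring step, assuming each pose has already been reduced to a flat vector of joint angles. The helper names and the threshold value are illustrative, not lifted from our codebase:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (radians), given three 3D landmark positions."""
    v1, v2 = a - b, c - b
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def pose_similarity(angles_user, angles_ref):
    """Cosine similarity between two joint-angle vectors (1.0 = identical)."""
    u, r = np.asarray(angles_user), np.asarray(angles_ref)
    return float(np.dot(u, r) / (np.linalg.norm(u) * np.linalg.norm(r)))

# Example: the elbow angle from shoulder, elbow, and wrist coordinates
shoulder, elbow, wrist = np.array([0, 1, 0]), np.array([0, 0, 0]), np.array([1, 0, 0])
angle = joint_angle(shoulder, elbow, wrist)  # a right angle, ~pi/2

THRESHOLD = 0.95  # illustrative; below this, corrective tips are generated
score = pose_similarity([1.2, 0.8, 1.5], [1.1, 0.9, 1.4])
needs_feedback = score < THRESHOLD
```

Comparing angles rather than raw coordinates makes the score invariant to the dancer's height and distance from the camera.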
How we built it
- The main backend server is built with FastAPI.
- We use WebSockets, Google's MediaPipe, and OpenCV for real-time, vision-based interaction.
- The backend modules use MongoDB, AI inference, and structured results to enable Retrieval-Augmented Generation (RAG) for video understanding.
- PyMongo and BSON (ObjectId) are used to interface with MongoDB, allowing the system to locate, retrieve, and stream stored video data directly from the database.
- Videos are passed to Google GenAI via the Gemini API, which performs dance move segmentation and classification.
- Pydantic sits in between the AI and the application logic to enforce schemas on Gemini’s responses so classifications, timestamps, and descriptions are machine-readable.
- MongoDB Atlas Search enables vector-based retrieval in get_query_results(), allowing the system to augment AI analysis with relevant stored context.
- The frontend is a React application built with TypeScript and Vite.
- Three.js is used to render 3D pose skeletons for intuitive movement visualization.
- The ElevenLabs client adds text-to-speech narration for AI feedback.
- Lucide React supplies a clean icon set.
- Key components include a landing page for selecting and uploading reference videos.
- The ML service uses OpenCV to ingest videos and extract frames, then applies pose estimation (by default MediaPipe's BlazePose) for efficient, real-time keypoint detection.
- The service is served via Uvicorn and exposes a clean REST API that takes uploaded videos and returns structured JSON pose data.
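To illustrate the Pydantic layer mentioned above, a schema like the following can validate Gemini's segmentation output before it reaches application logic. The field names here are hypothetical, not our exact schema, and the snippet assumes Pydantic v2:

```python
from typing import List
from pydantic import BaseModel

class DanceMove(BaseModel):
    label: str            # classified move name, e.g. "arm wave"
    start_seconds: float  # segment start timestamp in the video
    end_seconds: float    # segment end timestamp
    description: str      # natural-language description of the move

class SegmentationResult(BaseModel):
    video_id: str
    moves: List[DanceMove]

# Validate a raw JSON response; malformed output raises a ValidationError
# instead of silently corrupting downstream logic.
raw = '''{"video_id": "abc123", "moves": [
  {"label": "arm wave", "start_seconds": 0.0,
   "end_seconds": 2.5, "description": "Roll the wave from left to right."}]}'''
result = SegmentationResult.model_validate_json(raw)
```

Enforcing a schema at this boundary is what keeps the LLM's free-form output machine-readable for the rest of the pipeline.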
Challenges we ran into AND PREVAILED OVER!!!!
Pose Detection:
- 2D pose detection models are NOT sufficient for learning dances.
- 3D pose detection models are heavyweight, slow, and painful to fine-tune (specifically, data collection and downloading).
Comparisons and 3D Rendering
- Tough to get the human model's limbs working with the anchor points.
- Difficult to detect which frames are most similar (given differences in start times, speed of movement, etc.).
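That frame-alignment problem is exactly what dynamic time warping solves (fastdtw appears in our stack for this reason). A plain-NumPy sketch of the idea, assuming each frame has already been reduced to a comparable feature vector:

```python
import numpy as np

def dtw_path(ref, user):
    """Classic O(n*m) DTW: returns the total alignment cost and the
    frame-to-frame path between two sequences of feature vectors."""
    n, m = len(ref), len(user)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(user[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover which user frame matches each reference frame
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# A user sequence that starts late still aligns cleanly to the reference
ref = [[0.0], [1.0], [2.0], [3.0]]
user = [[0.0], [0.0], [1.0], [2.0], [3.0]]  # extra idle frame at the start
total_cost, path = dtw_path(ref, user)
```

fastdtw approximates the same alignment in O(n) time, which matters once sequences are hundreds of frames long.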
Accomplishments that we're proud of
3D Pose Estimation: The YOLO model is DEAD (just kidding haha)! Dancing is more than just 2D movement, and we successfully moved beyond basic 2D overlays. By fine-tuning MediaPipe on diverse datasets, we achieved stable 3D coordinate generation (even in variable lighting).
The "Kinect" Gesture UI: We built a hands-free experience. When the system detects a specific gesture (raising the left hand), recording starts automatically, so you don't have to run back and forth to your laptop.
Training Infrastructure: We moved our training to Vultr Cloud GPUs, allowing us to process massive amounts of MP4 data much faster than a local setup would allow.
The "Roger" Integration: Seamlessly connecting the logic from Gemini to ElevenLabs to create a coach that actually sounds encouraging.
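The gesture trigger boils down to one geometric check per frame. A sketch, assuming MediaPipe-style normalized landmarks where y grows downward; the landmark names follow MediaPipe's conventions, but the debounce logic is purely illustrative:

```python
def left_hand_raised(landmarks, margin=0.05):
    """True when the left wrist sits clearly above the nose.
    Normalized image coordinates put y=0 at the top of the frame,
    so 'above' means a smaller y value."""
    return landmarks["left_wrist"][1] < landmarks["nose"][1] - margin

class RecordingTrigger:
    """Debounce: require the gesture to hold for several consecutive
    frames before starting the recording, to avoid false positives."""
    def __init__(self, hold_frames=10):
        self.hold_frames = hold_frames
        self.count = 0
        self.recording = False

    def update(self, landmarks):
        if left_hand_raised(landmarks):
            self.count += 1
            if self.count >= self.hold_frames:
                self.recording = True
        else:
            self.count = 0  # gesture broken; start over
        return self.recording

trigger = RecordingTrigger(hold_frames=3)
hand_up = {"left_wrist": (0.5, 0.2), "nose": (0.5, 0.4)}
hand_down = {"left_wrist": (0.5, 0.8), "nose": (0.5, 0.4)}
states = [trigger.update(hand_down)] + [trigger.update(hand_up) for _ in range(3)]
```

Requiring the gesture to hold across frames is what makes the trigger robust to a single jittery pose estimate.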
What we learned
We learned that mapping human movement is as much about data science as it is about art. We dove deep into vector databases and the nuances of RAG, specifically how to retrieve non-textual spatial data to inform a Large Language Model. We also learned the importance of dataset diversity; a model that works in a bright studio might fail in a dark bedroom, which is why fine-tuning for "environmental noise" was a game-changer for us. Finally, we learned the limits of using AI to code the front-end. For a tool used for education, there were a lot of design choices where simply asking AI Studio was not sufficient.
What's next for Dance On It
- Deploying to iOS and Android (check out our website at danceonit.tech)!
- Work in progress: using Blender for real human rendering (put yourself on the human render with Viggle).
- A continual-learning multi-modal model that takes dance instruction videos as ground-truth labels to further fine-tune our instruction generation.
- Option to track multiple people in the reference video.
Built With
- 3js
- blender
- css
- elevenlabs
- express.js
- fastapi
- fastdtw
- geminiapi
- html
- javascript
- mediapipe
- mongodb
- mongodb-atlas
- pydantic
- python
- pytorch
- tailwind
- typescript
- uvicorn