Wise-ualizer

Inspiration

Online communication and education rely heavily on listening, which leaves deaf and neurodivergent users at a disadvantage. We wanted to create something that turns spoken content into visual understanding, helping people follow conversations and lectures more easily. That’s how Wise-ualizer was born: a tool that bridges accessibility and learning through AI-generated visuals.

What it does

Wise-ualizer records audio from lectures or conversations, converts it into text, and uses the Gemini API to create short AI-generated videos that visualize what’s being said. It helps users grasp meaning, context, and emotion through visuals instead of sound, making online experiences more inclusive and engaging.
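The flow described above (audio, then text, then video) can be sketched as a small Python function. This is only an illustration under stated assumptions: the names `visualize`, `VisualizationResult`, `fake_transcribe`, and `fake_generate_video` are hypothetical, and the two stand-in functions take the place of the real speech-to-text step and the real Gemini call, neither of which is shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VisualizationResult:
    transcript: str  # text recovered from the audio
    video_ref: str   # whatever the video step returns (URL, path, job id)


def visualize(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_video: Callable[[str], str],
) -> VisualizationResult:
    """Run the audio -> text -> video steps in order."""
    if not audio:
        raise ValueError("no audio was recorded")

    transcript = transcribe(audio).strip()
    if not transcript:
        # Nothing intelligible was said, so there is nothing to visualize.
        raise ValueError("transcription produced no text")

    return VisualizationResult(transcript, generate_video(transcript))


# Hypothetical stand-ins for the real speech-to-text and Gemini calls.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_video(prompt: str) -> str:
    return f"video-for:{prompt}"


result = visualize(b"  the water cycle  ", fake_transcribe, fake_generate_video)
print(result.transcript)
print(result.video_ref)
```

Passing the two steps in as plain callables keeps the pipeline testable without network access, and either step could later be swapped out without touching the surrounding code.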

How we built it

Frontend: React, HTML, CSS, JavaScript, and Vite, for recording, user interaction, and video display.

Backend: Python (Flask), for managing recordings, converting audio to text, and communicating with the Gemini API.

AI Integration: Gemini API, which transforms transcribed text into meaningful video representations.
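One way the backend might prepare transcribed text before handing it to the Gemini API is to group whole sentences into short chunks and wrap each chunk in a prompt. The sketch below is an assumption, not the project's actual code: `split_transcript`, `build_prompt`, the 400-character limit, and the prompt wording are all hypothetical, and the API call itself is left out.

```python
import re

MAX_PROMPT_CHARS = 400  # assumed limit, chosen only for illustration


def split_transcript(transcript: str, max_chars: int = MAX_PROMPT_CHARS) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", transcript.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str) -> str:
    """Wrap one transcript chunk in a (hypothetical) video-generation prompt."""
    return (
        "Create a short, silent video that visualizes the following "
        f"spoken passage for a viewer who cannot hear it: {chunk}"
    )


prompts = [build_prompt(c) for c in split_transcript("Water evaporates. Clouds form. Rain falls.", max_chars=30)]
for p in prompts:
    print(p)
```

Splitting on sentence boundaries keeps each prompt self-contained, which may matter when every chunk is turned into its own short video.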

Challenges we ran into

Generating accurate visuals for complex or abstract topics was difficult. We also faced technical challenges with handling real-time recordings, managing file uploads, and ensuring smooth communication between the frontend, backend, and Gemini API.
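For the communication problems mentioned above, a common mitigation is to retry a failed request with exponentially growing delays. The following is a generic sketch rather than what the project actually did: `call_with_retry` is a hypothetical helper, and which exceptions a real Gemini client raises on transient failures is an assumption (the built-in `ConnectionError` and `TimeoutError` are used as placeholders).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying transient failures with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except (ConnectionError, TimeoutError):
            # Placeholder exception types; a real client may raise others.
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use it defaults to `time.sleep`.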

Accomplishments that we're proud of

We successfully built a working prototype that converts real conversation snippets into AI-generated videos, creating a powerful demonstration of how accessibility and technology can merge to enhance understanding. We're proud of the creativity and collaboration that made it possible.

What we learned

We learned how to integrate AI models like Gemini into full-stack applications, process live recordings efficiently, and design inclusive technology that supports diverse learning and communication needs.

What's next for Wise-ualizer

Next, we plan to improve the accuracy of our visual outputs and explore direct video and prompt-based generation instead of relying on the audio-to-text pipeline. This would allow for more dynamic, context-aware visualizations and make the experience faster and more intuitive for users.