Inspiration
Sociological research has documented persistent participation disparities in academic and professional interaction, particularly along gender lines. For example, one large observational study found that women ask significantly fewer questions than men in academic seminar settings, even when they are equally represented in the audience, suggesting structural and social barriers to equitable engagement that extend beyond numerical presence alone (Carter et al., 2017; https://arxiv.org/abs/1711.10985). Women and other marginalized groups are also more likely to be interrupted or to have their competence questioned in group discussions, which can diminish their visibility and influence and contribute to broader inequities in professional advancement. With the deep analysis that AI offers, we created LevelUs to help solve the problem of unequal participation in group discussions by reflecting on recordings to identify interruptions, inequity, and areas for improvement.
What it does
Our app takes audio recordings of group discussions and seminars and transcribes them with the ElevenLabs Speech-to-Text API, which diarizes the file, associating each voice with a unique speaker ID. Each transcribed word is stored as a key-value entry alongside its speaker ID and timestamp. The API then produces a structured JSON file, which is processed and interpreted by Gemini. Gemini reassembles the word-level JSON into coherent sentences and, using introductions in the transcript, labels each speaker ID with a name. It also computes a sentiment/tone score, a confidence score, each speaker's total speaking time and number of turns, and the number of interruptions made during the discussion. The client sees these statistics, along with suggestions for improving future discussions, on the frontend, promoting a more equitable environment.
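As a minimal sketch of how the per-speaker statistics might be aggregated from diarized output: the field names (`speaker_id`, `text`, `start`, `end`) are illustrative assumptions, not the exact ElevenLabs response schema, and here an "interruption" is approximated as a new speaker starting before the previous speaker's word has ended.

```python
from collections import defaultdict

def summarize_speakers(words):
    """Aggregate diarized word entries into per-speaker stats.

    `words` is a list of dicts like
    {"speaker_id": "spk_0", "text": "hello", "start": 0.0, "end": 0.4},
    loosely mirroring word-level diarized STT output (hypothetical schema).
    """
    talk_time = defaultdict(float)
    interruptions = defaultdict(int)
    prev = None
    for w in words:
        # Accumulate each speaker's total speaking time from word durations.
        talk_time[w["speaker_id"]] += w["end"] - w["start"]
        # Count an interruption when a different speaker starts talking
        # before the previous speaker's word has finished (overlap).
        if prev and w["speaker_id"] != prev["speaker_id"] and w["start"] < prev["end"]:
            interruptions[w["speaker_id"]] += 1
        prev = w
    return {"talk_time": dict(talk_time), "interruptions": dict(interruptions)}
```

In the real pipeline these numbers come from Gemini's interpretation of the transcript; a deterministic pass like this is one way to cross-check them.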
How we built it
The first thing we did was identify the problem space and brainstorm innovative technologies we could use to address gender inequality in tech. From there, we discussed and evolved our project idea from a real-time voice agent into a reflective analysis tool for recorded discussions. We worked with the ElevenLabs and Gemini APIs, dividing the work according to our goals for the pipeline. We also split into frontend and backend teams, with half of us working on the frontend and the other half on the backend (ElevenLabs and Gemini setup). Once each section of the product was complete, we connected the parts and debugged until everything worked together.
Challenges we ran into
Our original idea was to implement a real-time speech-to-text transcript that would provide live speaking metrics and performance reviews. However, the ElevenLabs API cannot perform diarization (associating voices with unique speaker IDs) on a single mono audio stream. This setback led us to pivot to converting uploaded audio files into transcripts and handling the resulting JSON data. Additionally, the ElevenLabs API can detect at most 5 speakers, which limits usability for larger groups.
Another challenge we encountered was routing uploaded audio files from the frontend to the backend's ElevenLabs speech-to-text call. Our try/except block raised exceptions frequently, indicating that our parsing of the response data was incorrect, which we fixed after hours of debugging.
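One way to surface parsing bugs like this earlier is to validate the response shape defensively before downstream processing. The sketch below assumes a hypothetical word-level JSON layout (`words`, `speaker_id`, `start`, `end`); the real ElevenLabs response schema should be checked against its documentation.

```python
def extract_words(payload):
    """Defensively pull word-level entries out of an STT JSON response.

    Raises ValueError with a descriptive message instead of letting a
    KeyError/TypeError surface deep inside later pipeline stages.
    Field names here are illustrative, not the exact provider schema.
    """
    words = payload.get("words")
    if not isinstance(words, list):
        raise ValueError("response missing 'words' list")
    cleaned = []
    for w in words:
        try:
            cleaned.append({
                "speaker_id": w["speaker_id"],
                "text": w["text"],
                # Coerce timestamps to float so stringified numbers
                # in the JSON don't break arithmetic later.
                "start": float(w["start"]),
                "end": float(w["end"]),
            })
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"malformed word entry: {w!r}") from exc
    return cleaned
```

Failing fast at the boundary like this turns a vague mid-pipeline exception into an immediate, readable error at upload time.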
Accomplishments that we're proud of
One of our proudest achievements was getting Gemini to identify speakers by name from their introductions in the transcript, instead of labeling them only by speaker ID. Furthermore, our final summaries for each discussion consistently returned accurate statistics for the contents of the audio, such as the number of interruptions and each person's speaking duration. Lastly, we engineered and routed our multi-stage pipeline so that the spoken data and timestamps stayed intact, and were enriched, as they moved from ElevenLabs' diarization to Gemini's analysis.
What we learned
We learned about and experimented with voice recognition technologies (ElevenLabs, Deepgram, and AssemblyAI) by dissecting the pros and cons of each API (multi-channel audio support, diarization, and transcription formats). Once we had data stored through ElevenLabs, we learned how to expose it to the frontend through API services and routes. Also, compared to previous hackathons, we learned how to implement an increasingly complex tech stack with AI assistance, pushing more code into production.
What's next for LevelUs
Although we’re pleased with the progress we made over the past 24 hours, there are more features and functionalities we’d like to implement. First, integrating a database through Snowflake or MongoDB would let proctors compare past meetings with new ones. In addition, a video upload and viewing feature would provide a visual supplement to the transcript and output.
Built With
- elevenlabs
- fastapi
- gemini
- next.js
- python
- react
- tailwind
- typescript