Inspiration

We’ve all sat through boring lectures where it’s sometimes impossible to pay attention to what the professor is saying. Looking for a way to improve comprehension and save time, we created MadLectures.

What it does

MadLectures first uses Fish Audio’s speech-to-text tool to transcribe the lecture, using the device’s microphone for input. To keep the process real-time, we record in 4-second chunks and process the input continuously. The transcription is fed to Gemini 2.5 Flash Lite, which rephrases the lecture material. Finally, Gemini’s output is played back using Fish Audio’s text-to-speech, where users can select from a variety of voices.
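The per-chunk flow above can be sketched as three stages wired together. This is an illustrative sketch, not our exact code: the stage names (transcribe, rephrase, synthesize) are hypothetical placeholders that the real project wires to the Fish Audio and Gemini APIs.

```typescript
// Each pipeline stage is an async function, injected so the
// orchestration logic stays independent of any particular API.
type Stage<In, Out> = (input: In) => Promise<Out>;

interface PipelineStages {
  transcribe: Stage<ArrayBuffer, string>; // speech-to-text (Fish Audio)
  rephrase: Stage<string, string>;        // rephrasing (Gemini 2.5 Flash Lite)
  synthesize: Stage<string, ArrayBuffer>; // text-to-speech (Fish Audio)
}

// Process one recorded 4-second audio chunk end to end:
// audio in, rephrased audio out.
async function processChunk(
  chunk: ArrayBuffer,
  stages: PipelineStages,
): Promise<ArrayBuffer> {
  const transcript = await stages.transcribe(chunk);
  const rephrased = await stages.rephrase(transcript);
  return stages.synthesize(rephrased);
}
```

Injecting the stages this way also makes the pipeline easy to test with stub implementations before plugging in the live APIs.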

How we built it

MadLectures is built in TypeScript. We incorporated the Fish Audio API for both the text-to-speech and speech-to-text components, and we used Vercel’s TypeScript AI SDK to integrate with Gemini. Our frontend is built with React and Next.js and styled with shadcn/ui components and Tailwind CSS.

[Pipeline diagram]

Challenges we ran into

Latency - When prompting Gemini to create a concise summary of the voice input, the model’s variable output latency caused delays that made the user experience less fluid than we wanted. We solved this by switching to the Gemini 2.5 Flash Lite model, which significantly cut both our token usage and our latency.

[Latency graph]

Audio Processing - When processing audio with Fish, we sometimes ran into problems in rooms with background noise or unclear audio. These would cause Fish to output broken text, which then caused problems further down the pipeline. We solved this by chunking the audio into smaller pieces so that the Fish API could process each segment cleanly.
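The chunking step can be sketched as a small helper. This is a minimal sketch assuming raw PCM samples in a Float32Array; the 4-second chunk length matches what we describe above, but the function itself is illustrative.

```typescript
// Split a stream of PCM samples into fixed-length chunks so each
// speech-to-text request stays short and clean. The final chunk may
// be shorter than the rest.
function splitIntoChunks(
  samples: Float32Array,
  sampleRate: number,
  chunkSeconds: number,
): Float32Array[] {
  const chunkSize = sampleRate * chunkSeconds;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray is a view, so no samples are copied.
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```

For example, 10 seconds of audio at 8 kHz split into 4-second chunks yields two full chunks and one 2-second remainder.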

Ballooning Prompt Size - Our prompts to the Gemini API grew steadily in size, which increased latency and resource usage. This happened because we needed to feed the model the previous context of the generated script, which kept accumulating. We solved this by experimenting with how much context to re-pass to the model, eventually finding a solid balance between continuity and performance.
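The fix above amounts to a sliding context window: only the tail of the previously generated script is re-sent with each request, so the prompt size stays bounded. A minimal sketch, assuming a character-based budget (the 1500-character limit and the prompt wording here are illustrative, not our exact values):

```typescript
// Illustrative context budget; the real value was found by
// experimenting with different sizes.
const MAX_CONTEXT_CHARS = 1500;

// Build the prompt for one chunk: keep only the most recent slice of
// the generated script for continuity, then append the new transcript.
function buildPrompt(previousScript: string, newTranscript: string): string {
  const context = previousScript.slice(-MAX_CONTEXT_CHARS);
  return (
    `Previous script (for continuity):\n${context}\n\n` +
    `New lecture transcript to rephrase:\n${newTranscript}`
  );
}
```

Because `slice(-MAX_CONTEXT_CHARS)` caps the carried-over context, prompt length grows with the new transcript only, no matter how long the lecture runs.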

Accomplishments that we're proud of

We feel that this is a genuinely useful tool, polished enough for the average UW-Madison student to use day to day. We are also proud of how we integrated both the Fish Audio and Gemini APIs into our project, streamlining the pipeline so it works in real time.

What we learned

Generative AI - This was the first time that most of our group members had used a generative AI API for a project, and it proved to be a great learning experience for all of us.

Text/Speech Conversions - Building a project based around speech proved to be quite challenging because it’s a relatively uncommon approach, but it taught us a lot about how audio is processed in modern applications.
