Inspiration

We wanted to try out a few new tools and build a service with broad applications across the tech world. Our idea was to make new AI technology accessible to anyone with a phone.

What it does

PhonIO adapts large language models to phone calls. It handles audio recording, transcription, prompting a large language model, and text-to-speech generation to enable full AI conversations over an ordinary phone call.

How we built it

PhonIO combines a variety of AI tools, tied together with Python. We start by capturing audio from a phone call through Google Voice and feeding it into software that recognizes when the speaker has finished talking. We then transcribe the audio with OpenAI's Whisper model and pass the transcript to our self-hosted large language model, an instance of LLaMA. When the model responds, we use Coqui-AI's text-to-speech library to synthesize audio and send it back to the caller over the Google Voice call. FFmpeg handles audio conversion and stream management at several points in the pipeline.
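The glue for one conversational turn can be sketched as below. The three callables stand in for the real components (Whisper transcription, the self-hosted LLaMA instance, and Coqui TTS); their names and signatures are assumptions for illustration, not our exact interfaces.

```python
from typing import Callable, Dict, List

def conversation_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],               # e.g. Whisper
    generate: Callable[[List[Dict[str, str]]], str],  # e.g. LLaMA
    synthesize: Callable[[str], bytes],               # e.g. Coqui TTS
    history: List[Dict[str, str]],
) -> bytes:
    """Transcribe caller audio, ask the LLM for a reply, return reply audio."""
    user_text = transcribe(audio_in)
    history.append({"role": "user", "content": user_text})
    reply_text = generate(history)
    history.append({"role": "assistant", "content": reply_text})
    return synthesize(reply_text)
```

Keeping the components behind plain function interfaces like this makes it easy to swap any one model without touching the rest of the pipeline.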

Challenges we ran into

Our biggest challenge was deciding when the user had finished speaking, so that we could cut off the recording and transcribe it with Whisper. We solved this by monitoring the volume of the audio stream and trimming the clip once there was more than a 1.5-second gap of silence.
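A minimal sketch of that endpointing logic, assuming 16-bit PCM audio read in 100 ms chunks (the thresholds and chunk size here are illustrative, not our exact values):

```python
import math
from array import array
from typing import Iterable

CHUNK_MS = 100            # assumed chunk duration
SILENCE_THRESHOLD = 500   # RMS below this counts as silence (assumed)
SILENCE_LIMIT_MS = 1500   # a 1.5 s gap ends the utterance

def rms(chunk: bytes) -> float:
    """Root-mean-square level of a signed 16-bit little-endian PCM chunk."""
    samples = array("h", chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def endpoint(chunks: Iterable[bytes]) -> bytes:
    """Accumulate audio until 1.5 s of continuous silence,
    then return the utterance with the trailing silence trimmed."""
    captured, silent_ms = [], 0
    for chunk in chunks:
        if rms(chunk) < SILENCE_THRESHOLD:
            silent_ms += CHUNK_MS
        else:
            silent_ms = 0
        captured.append(chunk)
        if silent_ms >= SILENCE_LIMIT_MS:
            break
    n_trailing = silent_ms // CHUNK_MS  # drop the buffered silent chunks
    if n_trailing:
        captured = captured[:-n_trailing]
    return b"".join(captured)
```

The trailing silent chunks are buffered until the 1.5 s limit is reached, then dropped, so Whisper never sees the dead air.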

Our next challenge was routing our software's output into the system audio so we could send it over Google Voice. To solve this, we created a virtual microphone device on Linux and piped our generated audio clips into it with FFmpeg.
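One way to set this up on a PulseAudio system is to load a null sink whose monitor source can then be selected as the microphone for the call, and stream the generated replies into that sink. The sketch below builds the commands from Python; the sink name and FFmpeg device option are assumptions based on how we wired it, not a verbatim copy of our scripts.

```python
import subprocess
from typing import List

SINK = "phonio_sink"  # hypothetical name for the virtual device

def setup_cmd() -> List[str]:
    # Create a null sink; its monitor source ("phonio_sink.monitor")
    # can be chosen as the microphone in the Google Voice call.
    return ["pactl", "load-module", "module-null-sink", f"sink_name={SINK}"]

def play_cmd(wav_path: str) -> List[str]:
    # Stream a generated reply into the sink via FFmpeg's PulseAudio
    # output; "-re" paces playback at the file's native rate.
    return ["ffmpeg", "-re", "-i", wav_path,
            "-f", "pulse", "-device", SINK, "PhonIO reply"]

def send_reply(wav_path: str) -> None:
    subprocess.run(play_cmd(wav_path), check=True)
```

Anything played into the null sink shows up on its monitor source, so the call hears the synthesized speech as if it came from a real microphone.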

Accomplishments that we're proud of

We are proud of making many different pieces of technology work well together. We all learned new libraries spanning speech-to-text, LLM inference, text-to-speech, and the audio plumbing in between. We are also proud of branching out from our typical project idea of a web app.

What we learned

We learned about audio processing and the difficulties it involves. We also picked up several techniques for getting better results from AI models, including the three models we used. Finally, we gained insight into the intricacies of human conversation and how hard they are to capture in software.

What's next for PhonIO

We'd like to make the response time much faster. We also plan to build a website to manage the pieces of our pipeline, record calls for later review, and edit the large language model's prompt so PhonIO can be adapted to different use cases.

Built With

python, openai-whisper, llama, coqui-tts, ffmpeg, google-voice
