"A voice 'pho' you"
Inspiration
We saw a world where a simple customer service call is an impossible wall for those with speech and hearing disabilities, for those with autism who struggle to understand speech in phone conversations, and for those with phone anxiety. We knew this hackathon was an opportunity to build a bridge from phone to text.
The Problem: The Auditory Barrier
Deaf & Hard of Hearing: Traditional calls offer no visual or text-based alternative for real-time dialogue.
Auditory Processing (ASD): Many individuals with autism, like mentor Logan, rely on lip-reading to understand speech. Without visual cues, accents or background noise can make audio unintelligible.
Speech & Anxiety Barriers: Non-verbal individuals or those with severe phone anxiety often cannot use voice-only phone calls to reach essential services.
What it does
Phogent (Phone Agent) provides a dual-channel bridge that removes the requirement for verbal speech and auditory hearing. Phogent lets you make and receive real phone calls via a chat interface. You type, and an AI voice speaks your text to the caller in real time while transcribing everything the caller says back to you in real time. For you, it's texting. For the caller, it's a call.
Speech-to-Text (Transcription): Converts the caller's voice into a live text stream with ElevenLabs. This allows users to read the conversation in real-time, bypassing the need for lip-reading or hearing.
Text-to-Speech (Voice Synthesis): Allows the user to type their responses. ElevenLabs then speaks those responses to the person on the other end of the line in a natural voice.
Universal Accessibility: By neutralizing the challenges that come with a phone call (e.g. accents and social pressure), Phogent ensures that communication is defined by the message, not the medium.
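The dual-channel idea above can be sketched as two concurrent relay loops: one turns the caller's audio into text for the user, the other turns the user's typed text into audio for the caller. This is a minimal Python sketch of that shape only; `fake_stt` and `fake_tts` are hypothetical stand-ins for the real ElevenLabs calls, not part of the actual codebase.

```python
import asyncio

async def caller_audio_to_text(audio_chunks, transcript_out):
    # Speech-to-text leg: each audio chunk from the call becomes a
    # line of text in the user's chat. fake_stt is a placeholder.
    async for chunk in audio_chunks:
        transcript_out.append(fake_stt(chunk))

async def user_text_to_audio(typed_messages, audio_out):
    # Text-to-speech leg: each typed message is synthesized into
    # audio played into the call. fake_tts is a placeholder.
    async for message in typed_messages:
        audio_out.append(fake_tts(message))

def fake_stt(chunk: bytes) -> str:
    return f"[transcribed {len(chunk)} bytes]"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

async def stream(items):
    for item in items:
        yield item
        await asyncio.sleep(0)  # yield control, as real network I/O would

async def main():
    transcript, outgoing_audio = [], []
    # Both legs run at the same time, so neither side waits on the other.
    await asyncio.gather(
        caller_audio_to_text(stream([b"\x00" * 160]), transcript),
        user_text_to_audio(stream(["Hi, I'd like to book a table."]), outgoing_audio),
    )
    return transcript, outgoing_audio

transcript, outgoing_audio = asyncio.run(main())
print(transcript)
```

Running the two legs with `asyncio.gather` is what makes the bridge feel live: the caller keeps hearing synthesized speech while transcription continues in parallel.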
How we built it
- Frontend: React with Tailwind CSS for a real-time chat UI
- Backend: Python (FastAPI + WebSockets) to orchestrate the data pipeline
- Telephony: Twilio Programmable Voice & Media Streams for inbound/outbound calls
- Voice AI: ElevenLabs for TTS and STT (μ-law 8000 Hz audio)
- AI: Google Gemini for conversational response generation
- Database: MongoDB to store call records and transcripts
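To show how the Twilio and ElevenLabs pieces of the stack meet, here is a sketch of unpacking one Twilio Media Streams WebSocket frame before the audio would be forwarded for transcription. Twilio sends JSON frames, and "media" events carry a base64-encoded payload of 8 kHz μ-law audio; the stream SID below is a made-up placeholder, and the surrounding FastAPI handler is omitted.

```python
import base64
import json

def handle_twilio_frame(raw: str):
    frame = json.loads(raw)
    if frame["event"] != "media":
        return None  # "start"/"stop"/"mark" events carry no audio
    # Decode the base64 payload into raw mu-law bytes: one byte per
    # sample at 8000 Hz, so 160 bytes is about 20 ms of audio.
    return base64.b64decode(frame["media"]["payload"])

# A fabricated example frame in the shape Twilio sends over the socket.
sample = json.dumps({
    "event": "media",
    "streamSid": "MZ0000000000000000000000000000000000",  # placeholder SID
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})
audio = handle_twilio_frame(sample)
print(len(audio))
```

Keeping the audio in μ-law 8000 Hz end to end avoids resampling, since ElevenLabs can consume and produce that same format.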
Challenges we ran into
- Coordinating API keys across the team was tricky
- We struggled with GitHub merge conflicts.
- We were running out of free credits in our personal Gemini accounts.
- Coordinating team tasks, since API usage and access to the AI coding agent were limited.
- Code generated by Antigravity was often buggy and needed manual fixes.
Accomplishments that we're proud of
We built a functional real-time voice pipeline from scratch in a hackathon timeframe. Beyond the code, we genuinely bonded as a team and left as closer friends.
What we learned
We learned how to integrate the ElevenLabs API for real-time TTS/STT, handle WebSocket audio streams with Twilio, and coordinate a multi-service AI pipeline end-to-end. We also learned to code effectively with Google Antigravity and to manage a busy GitHub repository.
What's next for Phogent
- Add a dedicated phone number per user for persistent inbound calls
- Improve interruption handling and reduce latency further
- Build mobile wrappers (Android/iOS) around the existing web app
- Expand visual accessibility features for users
Built With
- elevenlabs-api
- fastapi
- google-gemini-api
- javascript
- mongodb
- python
- react
- tailwind-css
- twilio-media-streams
- twilio-programmable-voice
- websockets