Inspiration
While Silicon Valley optimizes for 4K video calls, 3 billion people struggle to see their families clearly due to poor internet infrastructure. This became personal when video calls with my parents in India consistently degraded to pixelated messes. I realized current video calling treats every pixel equally, but humans don't - we care about faces and expressions. This sparked the idea: what if AI could intelligently prioritize what matters most in video calls?
What it does
Thirai uses AI to democratize video calling quality through two breakthrough approaches:
- Automated Personalized Compression: Uses optimized SDXL VAE models to learn individual facial features and compress them intelligently while preserving recognizable details
- Smart Bandwidth Allocation: Real-time face detection automatically sends high-quality face patches through WebRTC data channels while maintaining low-quality backgrounds, achieving 3x better face quality at the same bitrate
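The allocation idea above can be sketched with plain JPEG encoding (a minimal sketch using Pillow; the helper names are hypothetical, and the real system ships these payloads over WebRTC data channels rather than returning bytes):

```python
from io import BytesIO

from PIL import Image


def encode_with_face_priority(frame, face_box, face_q=90, bg_q=20):
    """Split one frame into two JPEG payloads: a high-quality face patch
    and a low-quality full frame for the background."""
    x, y, w, h = face_box
    face_patch = frame.crop((x, y, x + w, y + h))

    face_buf, bg_buf = BytesIO(), BytesIO()
    face_patch.save(face_buf, format="JPEG", quality=face_q)
    frame.save(bg_buf, format="JPEG", quality=bg_q)
    return face_buf.getvalue(), bg_buf.getvalue()


def composite(bg_bytes, face_bytes, face_box):
    """Receiver side: paste the sharp face patch over the blurry frame."""
    bg = Image.open(BytesIO(bg_bytes)).convert("RGB")
    face = Image.open(BytesIO(face_bytes)).convert("RGB")
    bg.paste(face, (face_box[0], face_box[1]))
    return bg


# Toy demo on a synthetic frame; real input comes from the camera.
frame = Image.new("RGB", (320, 240), (90, 110, 130))
face_bytes, bg_bytes = encode_with_face_priority(frame, (100, 60, 80, 80))
received = composite(bg_bytes, face_bytes, (100, 60, 80, 80))
```

Because JPEG quality scales bytes superlinearly, spending the bit budget on the small face crop instead of the whole frame is what buys the quality gain at a fixed bitrate.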
How we built it
- Built a complete real-time video conferencing platform from scratch
- Trained custom autoencoders and systematically evaluated 15 different ML models including SDXL, Real-ESRGAN, and proprietary approaches
- Optimized SDXL VAE inference pipeline to achieve 120ms latency using CoreML and Metal Performance Shaders
- Implemented MediaPipe face detection integrated with custom WebRTC data channel architecture
- Created real-time compositing system that blends high-quality face patches with low-resolution background video
- Cold-called researchers to access unreleased model checkpoints for comprehensive evaluation
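The compositing step above amounts to upscaling the low-resolution background and overwriting the detected face region. A minimal NumPy sketch (nearest-neighbor upscale; the function name is illustrative, and the real compositor blends patch edges to hide seams):

```python
import numpy as np


def composite_frame(bg_lowres, face_patch, box, scale=4):
    """Upscale the low-res background by integer-repeating pixels, then
    overwrite the face region with the high-quality patch.

    box is (x, y, w, h) in full-resolution coordinates."""
    bg = np.repeat(np.repeat(bg_lowres, scale, axis=0), scale, axis=1)
    x, y, w, h = box
    bg[y:y + h, x:x + w] = face_patch
    return bg


# Toy demo: a 60x80 low-res background and a 64x64 face patch.
bg = np.zeros((60, 80, 3), dtype=np.uint8)
face = np.full((64, 64, 3), 255, dtype=np.uint8)
out = composite_frame(bg, face, box=(40, 20, 64, 64), scale=4)
```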
Challenges we ran into
- Custom autoencoders produced poor quality at usable compression ratios - even at 16KB, faces looked unrecognizable
- SDXL VAE latent vectors turned out larger than expected (~50KB), barely competitive with standard JPEG compression (12KB)
- Achieving real-time inference on complex diffusion models required extensive optimization and architecture redesign
- Balancing latency vs. quality - adding AI processing while maintaining conversational flow
- WebRTC data channel synchronization with video streams for seamless compositing
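The latent-size problem above follows from first principles. A back-of-the-envelope calculation (illustrative assumptions: SDXL-style 8x spatial downsampling, 4 latent channels, float16 storage; the measured ~50KB depends on actual resolution and precision) shows why raw VAE latents struggle to beat JPEG:

```python
def vae_latent_bytes(height, width, channels=4, downsample=8, bytes_per_value=2):
    """Raw payload size of an SDXL-style VAE latent: the encoder shrinks
    each spatial dimension 8x but keeps 4 channels per latent pixel."""
    return (height // downsample) * (width // downsample) * channels * bytes_per_value


# A single 720p frame's latent alone is ~112 KiB at float16:
print(vae_latent_bytes(720, 1280))  # 115200 bytes
# Even a 512x512 crop costs 32 KiB before any entropy coding:
print(vae_latent_bytes(512, 512))   # 32768 bytes
```

This is what pushed the design toward sending only small face patches through the learned pipeline instead of full frames.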
Accomplishments that we're proud of
- Optimized a component of a massive diffusion model (the SDXL VAE) to run in 120ms - production-grade real-time performance
- Built custom WebRTC data channel pipeline enabling intelligent bandwidth allocation
- Achieved 3x better face quality at the same bitrate through a smart JPEG-patching approach
- Systematic evaluation methodology testing 15 state-of-the-art models with rigorous performance metrics
- Created functioning real-time ML inference pipeline that works on consumer hardware
- Demonstrated that AI can add meaningful intelligence to video calling infrastructure
What we learned
- Personalized compression is harder than expected - at comparable file sizes, generic JPEG often outperforms custom autoencoders
- The real value isn't in compression ratios but in semantic understanding of what humans prioritize
- Infrastructure and engineering optimization matter more than model architecture choice
- Real-time AI requires fundamental rethinking of model deployment, not just faster hardware
- Current video calling solutions have significant room for AI-powered improvements
- Building production-ready ML systems requires balancing multiple complex trade-offs
What's next for Thirai
- Partner with telecom providers in emerging markets to deploy intelligent video calling at scale
- Expand beyond faces to automatic detection of shared screens, documents, and other priority content
- Develop region-specific models optimized for different lighting conditions, skin tones, and cultural contexts
- Integration with existing video platforms through browser extensions and APIs
- Research into attention-based models that predict where viewers will look to guide quality allocation
- Build the intelligence layer that becomes standard for next-generation video communication infrastructure