Inspiration

While Silicon Valley optimizes for 4K video calls, 3 billion people struggle to see their families clearly due to poor internet infrastructure. This became personal when video calls with my parents in India consistently degraded to pixelated messes. I realized current video calling treats every pixel equally, but humans don't - we care about faces and expressions. This sparked the idea: what if AI could intelligently prioritize what matters most in video calls?

What it does

Thirai uses AI to democratize video calling quality through two breakthrough approaches:

  1. Automated Personalized Compression: Uses optimized SDXL VAE models to learn individual facial features and compress them intelligently while preserving recognizable details
  2. Smart Bandwidth Allocation: Real-time face detection automatically sends high-quality face patches through WebRTC data channels while maintaining low-quality backgrounds, achieving 3x better face quality at the same bitrate
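The bandwidth-allocation idea can be sketched in a few lines. All numbers below (frame budget, face-area fraction, 75/25 split) are illustrative assumptions, not Thirai's measured values:

```python
# Sketch of smart bandwidth allocation: spend most of a fixed per-frame
# byte budget on the face patch, the rest on a coarse background.
# Every constant here is an illustrative assumption.

FRAME_BUDGET_BYTES = 12_000   # total bytes for one encoded frame (assumed)
FACE_AREA_FRACTION = 0.10     # face patch covers ~10% of the frame (assumed)

def split_budget(budget: int, face_share: float = 0.75) -> tuple[int, int]:
    """Give `face_share` of the byte budget to the face patch."""
    face_bytes = int(budget * face_share)
    return face_bytes, budget - face_bytes

face_bytes, bg_bytes = split_budget(FRAME_BUDGET_BYTES)
print(face_bytes, bg_bytes)   # 9000 3000

# Bytes per unit of frame area: the face region receives far more
# detail than a uniform encode would give it, at the same total bitrate.
uniform_density = FRAME_BUDGET_BYTES            # whole budget over whole frame
face_density = face_bytes / FACE_AREA_FRACTION
print(face_density / uniform_density)           # 7.5x more bytes/area on the face
```

With these assumed numbers, the face region gets 7.5x the byte density of a uniform encode; the exact gain depends on how much of the frame the face occupies and the chosen split.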

How we built it

  • Built a complete real-time video conferencing platform from scratch
  • Trained custom autoencoders and systematically evaluated 15 different ML models including SDXL, Real-ESRGAN, and proprietary approaches
  • Optimized SDXL VAE inference pipeline to achieve 120ms latency using CoreML and Metal Performance Shaders
  • Implemented MediaPipe face detection integrated with custom WebRTC data channel architecture
  • Created real-time compositing system that blends high-quality face patches with low-resolution background video
  • Cold-called researchers to access unreleased model checkpoints for comprehensive evaluation
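The compositing step above can be sketched as a simple patch overlay. This is a minimal illustration with frames modeled as 2D lists of pixel values; the function name, shapes, and coordinates are all hypothetical:

```python
# Minimal sketch of the compositing step: paste a decoded high-quality
# face patch onto the upscaled low-resolution background frame.

def composite(background, patch, top, left):
    """Return a new frame with `patch` pasted at (top, left)."""
    frame = [row[:] for row in background]   # copy so the input is untouched
    for dy, patch_row in enumerate(patch):
        for dx, px in enumerate(patch_row):
            frame[top + dy][left + dx] = px
    return frame

bg = [[0] * 6 for _ in range(4)]     # coarse background frame
face = [[9, 9], [9, 9]]              # high-quality face patch
out = composite(bg, face, top=1, left=2)
print(out[1])   # [0, 0, 9, 9, 0, 0]
```

In the real pipeline the patch coordinates come from MediaPipe's detected face bounding box, and the blend would feather the patch edges rather than hard-copy pixels.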

Challenges we ran into

  • Custom autoencoders produced poor quality at usable compression ratios - even at 16KB, faces looked unrecognizable
  • SDXL VAE latent vectors were larger than expected (50KB), barely competitive with standard JPEG compression (12KB)
  • Achieving real-time inference on complex diffusion models required extensive optimization and architecture redesign
  • Balancing latency against quality: adding AI processing without breaking conversational flow
  • WebRTC data channel synchronization with video streams for seamless compositing
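The latency-vs-quality tension can be made concrete with a simple budget check. The per-stage timings below are illustrative assumptions (only the 120ms VAE figure comes from our measurements); the ~150ms one-way target is the commonly cited threshold for natural conversation:

```python
# Sketch of the latency budget we had to fit AI processing into.
# Stage timings other than vae_inference are illustrative assumptions.

STAGES_MS = {
    "capture": 15,
    "face_detection": 10,
    "vae_inference": 120,    # our optimized SDXL VAE figure
    "encode_and_send": 20,
}

BUDGET_MS = 150  # rough one-way target for conversational flow

total = sum(STAGES_MS.values())
print(total, total <= BUDGET_MS)   # 165 False
```

Running the stages sequentially blows the budget, which is why the pipeline had to overlap inference with capture and network I/O rather than simply chain the steps.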

Accomplishments that we're proud of

  • Optimized a massive diffusion model (SDXL) to run in 120ms - production-grade real-time performance
  • Built custom WebRTC data channel pipeline enabling intelligent bandwidth allocation
  • Achieved 3x better face quality at the same bitrate through a smart JPEG patching approach
  • Developed a systematic evaluation methodology, testing 15 state-of-the-art models against rigorous performance metrics
  • Created functioning real-time ML inference pipeline that works on consumer hardware
  • Demonstrated that AI can add meaningful intelligence to video calling infrastructure

What we learned

  • Personalized compression is harder than expected - generic JPEG often outperforms custom autoencoders on file size
  • The real value isn't in compression ratios but in semantic understanding of what humans prioritize
  • Infrastructure and engineering optimization matter more than model architecture choice
  • Real-time AI requires fundamental rethinking of model deployment, not just faster hardware
  • Current video calling solutions have significant room for AI-powered improvements
  • Building production-ready ML systems requires balancing multiple complex trade-offs

What's next for Thirai

  • Partner with telecom providers in emerging markets to deploy intelligent video calling at scale
  • Expand beyond faces to automatic detection of shared screens, documents, and other priority content
  • Develop region-specific models optimized for different lighting conditions, skin tones, and cultural contexts
  • Integrate with existing video platforms through browser extensions and APIs
  • Research into attention-based models that predict where viewers will look to guide quality allocation
  • Build the intelligence layer that becomes standard for next-generation video communication infrastructure
