Inspiration

We wanted to build a development environment where code disappears—and ideas become apps as fast as you can speak them. With voice models getting shockingly good, we imagined a world where someone could say “Build a cyberpunk Pomodoro timer” while walking down the street and see it deployed moments later. That vision became Speech To Spec.

What it does

Speech To Spec is a voice-native AI app builder that turns natural speech into fully deployed full-stack web applications. Users speak their idea, watch their project spin up in a live cloud sandbox, see code being generated and executed in real time, and preview the running app instantly. With Meta Ray-Ban glasses, the entire workflow becomes hands-free—the glasses act as both mic and speaker, enabling coding while commuting or walking.

How we built it

We combined:

  • Gemini Live for real-time conversational understanding
  • Claude Code for autonomous engineering and filesystem-level actions
  • Daytona sandboxes for spinning up real cloud execution environments
  • React + WebSockets for low-latency speech interactions
  • Node.js/Python orchestration for agent handoffs and project lifecycle

Gemini Live streams the interpreted request to the backend, which triggers a Claude Code agent to create files, install dependencies, write UI/backend code, and start dev servers. Logs, file events, and terminal output are delivered to the frontend via Server-Sent Events (SSE). Once the app is live, the sandbox port is tunneled to a public preview URL.
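The SSE delivery step can be sketched as follows. This is a minimal illustration of how build events might be serialized into SSE frames for the frontend; the event names and payload fields here are hypothetical, not the actual Speech To Spec schema, and a real backend would yield these frames from a FastAPI streaming endpoint.

```python
import json

def sse_frame(event: str, data: dict) -> str:
    """Serialize one build event as a Server-Sent Events frame.

    SSE frames are plain text: an optional `event:` line, a `data:` line,
    and a blank line terminating the frame.
    """
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# Example: a file-creation event followed by a terminal log line,
# as they might be streamed to the frontend during a build.
frames = [
    sse_frame("file", {"path": "src/App.tsx", "action": "created"}),
    sse_frame("log", {"line": "npm install react"}),
]
```

On the frontend, an `EventSource` listener per event type (`file`, `log`, and so on) can then update the code view and terminal panes independently as frames arrive.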

Meta Ray-Ban integration required an audio-only UX that communicates build progress without a screen.

Challenges we ran into

• Coordinating multi-agent handoffs between Gemini and Claude
• Reducing sandbox latency through pre-warming environments
• Maintaining long-form voice conversational context
• Designing a screenless audio UX for Meta Ray-Ban
• Handling real-time logs, file creation events, and dev-server output cleanly
• Ensuring the system stays stable while autonomously writing, executing, and updating code
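The sandbox pre-warming mentioned above can be sketched as a small pool that provisions environments ahead of demand and replenishes itself whenever one is handed out. The class and factory names here are hypothetical; the real system would call Daytona's API in place of the `create` callable.

```python
import queue
import threading

class SandboxPool:
    """Keep a few sandboxes warm so user requests skip cold-start latency."""

    def __init__(self, size: int, create) -> None:
        self._create = create            # factory that provisions one sandbox
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):            # warm sandboxes before any request
            self._pool.put(create())

    def acquire(self):
        """Hand out a warm sandbox; replenish the pool in the background."""
        sandbox = self._pool.get()
        threading.Thread(
            target=lambda: self._pool.put(self._create()),
            daemon=True,
        ).start()
        return sandbox

# Usage: requests draw from the pool instead of provisioning on demand.
pool = SandboxPool(size=2, create=lambda: "warm-sandbox")
```

The trade-off is idle cost: pre-warmed sandboxes consume resources while waiting, so pool size has to balance latency against spend.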

Accomplishments that we're proud of

• A fully functional speech-to-software workflow that feels magical
• Sub-10-second cold starts thanks to environment optimization
• Smooth, real-time streaming of code, logs, and deploy status
• A reliable Meta Ray-Ban hands-free coding experience
• Seamless project refinement (“add dark mode,” “make the buttons bigger”) that updates the live instance instantly

What we learned

• Large voice models open up new UX paradigms—coding doesn’t need screens
• Multi-agent systems require strict interface design to avoid miscoordination
• Pre-warmed sandboxes massively improve user experience
• Audio-only design forces clarity, redundancy, and careful pacing of feedback
• Users expect deployment-level reliability even in experimental agentic workflows

What's next for Speech To Spec

• Production-grade multi-agent coordination
• Offline and on-device voice interactions
• Support for multi-file refactors and larger projects
• AI-assisted debugging and testing flows
• Deeper Meta Ray-Ban integration with spatial UI cues
• “Blueprint mode” for generating architecture diagrams and planning docs by voice

Built With

  • daytona
  • e2b
  • fastapi
  • nextjs
  • sentry