Inspiration
I've volunteered with visually impaired communities since high school. One thing always stood out — these are deeply creative people with ideas and stories. But the digital world is designed around screens. Every no-code builder, every app creator, every design tool assumes you can see.
With the rise of AI, that assumption no longer has to hold. Voice is now a direct interface to machine intelligence. Hearcraft exists because blind creators deserve a builder that speaks their language.
What it does
Hearcraft is a voice-first iOS app that lets visually impaired users build, run, and share voice agents ("crafts") entirely by speaking. Describe your idea out loud, and Hearcraft generates a fully configured voice app — complete with its own system prompt, tools, voice personality, and capabilities. Each craft runs as a real-time conversational agent powered by Gemini. No screen interaction required. Voice in, voice out. By them, for them.
Key Features:
- Voice-first craft creation: Speak your idea, and Gemini generates a complete voice app automatically — no typing, no visual UI required.
- Real-time voice agents: Each craft runs as its own live conversational agent with a custom system prompt, personality, tools, and voice selection from 30+ Gemini voices.
- Camera capability: Camera-enabled crafts stream live video to Gemini, which describes surroundings in real time — critical for visually impaired users navigating physical spaces.
- Web research: Crafts can perform live Google Search via function calling, with an audible cue so the user knows information is being retrieved.
- Location awareness: Google Maps API integration gives crafts geographic context for travel, navigation, and local recommendations.
- Accessibility-first UI: Large touch targets, minimal navigation depth, VoiceOver-optimized flows, and verbal confirmations at every step — designed for blind users from the ground up.
- Built-in community crafts: Curated examples including an Accessible Travel Scout, a persuasion game, a local news briefing, and a Talk with Homer Simpson craft — demonstrating the creative range of the platform.
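The web-research flow above (function calling plus an audible cue) can be sketched as a small dispatcher: when the model requests a tool, play a short earcon before executing so the listener knows a lookup is happening. All names here are illustrative assumptions, not Hearcraft's actual API surface.

```swift
import Foundation

// Hypothetical tool-call dispatcher. A real implementation would receive
// function-call messages from the Gemini Live API session; here we model
// the cue-then-act pattern with plain types.
enum ToolCall {
    case googleSearch(query: String)
    case unknown(name: String)
}

struct CueLog {
    private(set) var events: [String] = []
    mutating func play(_ cue: String) { events.append("cue:\(cue)") }
    mutating func speak(_ text: String) { events.append("speak:\(text)") }
}

func handle(_ call: ToolCall, log: inout CueLog) {
    switch call {
    case .googleSearch(let query):
        log.play("searching")   // short audible cue before the lookup starts
        log.speak("Searching the web for \(query)")
    case .unknown(let name):
        log.speak("This craft cannot use \(name) yet")
    }
}

var log = CueLog()
handle(.googleSearch(query: "accessible museums near me"), log: &log)
print(log.events)
```

The point of the pattern is ordering: the cue event is always emitted before any speech about the result, so a blind user gets immediate nonvisual feedback that a retrieval is in progress.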
How I built it
I started by prototyping in Google AI Studio, where I discovered that combining a system prompt with the real-time API created a compelling product scaffold — a voice agent you could configure and talk to immediately. That validated the core idea: if I could let users generate these configurations by voice, I'd have a creation platform.
I chose iOS as the initial platform because phones are what blind people carry everywhere — always within reach, with a built-in camera for multimodal input, a mature accessibility framework in VoiceOver, and the most natural form factor for voice-first interaction. It was also the fastest path to a working prototype.
From there, I built the full app using SwiftUI, with development supported by Antigravity and Claude Code. The architecture has four layers:
- Gemini Live API handles all real-time voice interaction — both during craft creation and craft runtime. It provides bidirectional audio streaming, barge-in, affective dialog, and function calling for live Google Search.
- Gemini 3 Flash with structured output transforms conversation transcripts into JSON craft configurations — system prompts, tool declarations, voice settings, and behavioral rules.
- AVFoundation powers camera streaming and audio capture, feeding multimodal input into the Live API for camera-enabled crafts.
- Google Maps API provides location context, enabling crafts to deliver geographically relevant guidance.
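To make the "JSON craft configuration" layer concrete, here is a minimal sketch of the kind of Codable model such a configuration might decode into. The field names and sample values are assumptions for illustration, not Hearcraft's actual schema.

```swift
import Foundation

// Hypothetical craft configuration model. Gemini 3 Flash's structured
// output would be decoded into something of this shape.
struct CraftConfiguration: Codable {
    let name: String
    let systemPrompt: String
    let voice: String          // one of the Gemini prebuilt voice names
    let tools: [String]        // e.g. ["google_search"]
    let cameraEnabled: Bool
}

let json = """
{
  "name": "Accessible Travel Scout",
  "systemPrompt": "You help visually impaired travelers plan trips.",
  "voice": "Puck",
  "tools": ["google_search"],
  "cameraEnabled": false
}
""".data(using: .utf8)!

// Decoding the model's JSON output into a typed value the app can run.
let craft = try! JSONDecoder().decode(CraftConfiguration.self, from: json)
print(craft.name, craft.tools)
```

Keeping the configuration as a plain Codable value means a craft can be stored on-device, shared, and later handed back to the Live API as a system prompt plus tool declarations.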
The entire UI is built accessibility-first with VoiceOver compatibility, large touch targets, and minimal visual complexity.
Challenges I ran into
From real-time stream to structured data. The Gemini Live API excels at natural conversation but outputs unstructured audio and text. Converting a freeform voice conversation into a precise, valid JSON craft configuration was the core engineering challenge. My solution: use the Live API to generate transcripts, then pass those to Gemini 3 Flash with structured output to produce clean JSON. Two models, each doing what it does best.
Designing for blind users, not sighted users with a screen reader. Accessibility isn't a toggle you flip on. I had to learn how visually impaired people actually use their phones — the gestures, the flow, the expectations. Standard iOS design patterns don't translate. Large buttons, minimal navigation depth, verbal confirmations at every step, and zero reliance on visual feedback. Understanding their interaction behavior and designing from those constraints fundamentally changed the app.
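The second stage of that pipeline can be sketched as a request to Gemini's generateContent endpoint with a response schema attached, which constrains the model to emit JSON matching the craft shape. The `responseMimeType` and `responseSchema` options are part of the public Gemini API; the schema fields and prompt text below are illustrative assumptions.

```swift
import Foundation

// Sketch of stage two: transcript in, schema-constrained JSON out.
let transcript = "User: I want a craft that reads local news aloud every morning."

let requestBody: [String: Any] = [
    "contents": [[
        "role": "user",
        "parts": [["text": "Convert this conversation into a craft configuration:\n\(transcript)"]]
    ]],
    "generationConfig": [
        "responseMimeType": "application/json",
        "responseSchema": [
            "type": "OBJECT",
            "properties": [
                "name": ["type": "STRING"],
                "systemPrompt": ["type": "STRING"],
                "voice": ["type": "STRING"],
                "tools": ["type": "ARRAY", "items": ["type": "STRING"]]
            ],
            "required": ["name", "systemPrompt", "voice"]
        ]
    ]
]

// Serialize the request body that would be POSTed to the Gemini API.
let data = try! JSONSerialization.data(withJSONObject: requestBody)
print(String(data: data, encoding: .utf8)!)
```

Because the schema is enforced server-side, the app can decode the response directly into its craft model instead of parsing freeform text.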
Accomplishments that I'm proud of
- A blind person can go from a spoken idea to a working, shareable voice app in under 60 seconds — no code, no screen, no sighted assistance needed.
- The two-layer Gemini architecture (Live API for conversation + Gemini 3 Flash for structured generation) is clean, robust, and solves a real technical problem elegantly.
- I built a creation tool, not just another assistant. The distinction matters — Hearcraft gives blind creators agency to build for themselves and their community.
What I learned
- Voice-first isn't voice-added. You can't retrofit accessibility onto a visual app. The entire interaction model — navigation, feedback, confirmation, error handling — must be redesigned from the ground up around voice as the only interface.
- The gap between "AI can talk" and "AI can create" is structured output. Real-time voice AI is powerful but raw. The key technical insight was that a conversation becomes a product only when you can reliably transform it into structured, reusable data. That bridge — from stream to structure — is where the real engineering lives.
- Real-time APIs are built for conversation, not construction. Turning a bidirectional audio stream into a deterministic, validated JSON configuration required splitting responsibilities across two models. The Live API handles the messy, human, creative part. Gemini 3 Flash handles the precise, structured part. Respecting what each model does best was the most important architectural decision I made.
What's next for Hearcraft
- Community sharing. Right now Hearcraft ships with curated example crafts. Next, creators will be able to publish, browse, and remix each other's crafts — building a marketplace of voice apps made by blind creators for the blind community.
- More capabilities. Expanding the building blocks available to creators — calendar integration, document scanning, smart home control, contacts access — so crafts can do more.
- Cross-platform. Bringing Hearcraft to Android and web to reach the full visually impaired community, not just iOS users.
- Smart glasses integration. Pairing with devices like Ray-Ban Meta to enable fully hands-free craft experiences in the real world.
- Cloud infrastructure. The current demo stores crafts locally on-device. We'll migrate to a cloud database to enable cross-device sync, community sharing, and persistent craft libraries.