Inspiration

We’ve all been there: you’re 40 minutes into an audiobook at the gym, your mind wanders for ten seconds during a tough set, and suddenly you have no idea what’s happening. You can’t pull out your phone to rewind—your hands are occupied. So you either keep listening while completely lost, or you break your workout just to fumble with a tiny progress bar.

Audiobooks are consumed during activities where your hands and eyes are busy. People listen while working out, commuting or driving, cooking or doing chores, running or walking, and doing hands-on work. Yet the entire interface—scrubbing, pausing, searching—requires you to stop what you’re doing and stare at a screen.

The problem isn’t the content. It’s the interface. Audiobooks demand visual interaction in contexts where that’s impossible. I built Conductor to fix that.

What It Does

Conductor is an AI companion layer that gives you semantic control over audiobooks—entirely hands-free. It listens alongside you in real time and lets you navigate the story using natural voice commands.
No touching your phone. No breaking your workout. No losing your flow.

Hands-Free Semantic Navigation

Control your audiobook without ever touching a screen:

“Jump to the scene with the poison apple” → Instantly seeks to 9:40
“Go back to when Alice meets the Cheshire Cat” → Jumps to Chapter 3, 4:12
“Skip to the reveal” → Finds the climactic moment without spoiling anything

Perfect for the gym, the kitchen, or anywhere your hands are busy.

Context-Aware Q&A

Ask questions about the story while listening, without pulling out your phone:

“Wait, who is this character?”
“What happened in the last chapter?”
“Remind me why they’re going to the castle”

The AI maintains a sliding 60-second context window, ensuring it never spoils future events.
It’s RAG—with an anti-spoiler filter.
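A minimal sketch of how such an anti-spoiler window could work, assuming the transcript is available as `(start_seconds, word)` pairs. The function name and data layout here are illustrative, not Conductor's actual implementation:

```python
# Illustrative anti-spoiler context window (names are hypothetical).
# Only words the listener has already heard -- and only the most recent
# 60 seconds of them -- are exposed to the LLM, so future plot points
# can never leak into an answer.

WINDOW_SECONDS = 60.0

def context_window(transcript, playhead, window=WINDOW_SECONDS):
    """transcript: list of (start_seconds, word) tuples, sorted by time.
    playhead: current playback position in seconds.
    Returns the text heard in the last `window` seconds, never past
    the playhead."""
    lo = playhead - window
    words = [w for t, w in transcript if lo <= t <= playhead]
    return " ".join(words)

transcript = [(0.0, "Alice"), (1.2, "was"), (2.5, "beginning"),
              (65.0, "down"), (66.1, "the"), (67.3, "rabbit-hole"),
              (120.0, "spoiler")]  # the word at 120s hasn't been heard yet

print(context_window(transcript, playhead=70.0))
# prints: down the rabbit-hole
```

Filtering by the playhead rather than by chapter is what keeps the guarantee simple: if a word's timestamp is ahead of the listener, the model never sees it.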

Intelligent Audio Ducking

The moment you start speaking, Conductor automatically lowers the audiobook volume using local Voice Activity Detection (VAD). When you stop, it fades back up.
The listening experience never breaks—and you never miss a word.
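The core of the ducking behavior can be sketched as a per-frame gain ramp driven by a VAD flag. In the real system the flag comes from a local VAD model; here it is assumed as an input, and the gain constants are illustrative:

```python
# Hypothetical ducking sketch: `speech_detected` is assumed to come from
# a local VAD model. The gain ramps down toward a duck level while the
# user speaks, and fades back up when they stop.

DUCK_GAIN = 0.2     # audiobook volume while the user is speaking
FULL_GAIN = 1.0
FADE_STEP = 0.1     # gain change per audio frame (~10 ms each)

def next_gain(current, speech_detected):
    target = DUCK_GAIN if speech_detected else FULL_GAIN
    if current < target:
        return min(current + FADE_STEP, target)
    return max(current - FADE_STEP, target)

gain = FULL_GAIN
for _ in range(5):               # user starts speaking
    gain = next_gain(gain, True)
print(round(gain, 2))            # prints 0.5: halfway through the fade
```

Stepping the gain per frame instead of switching it instantly is what makes the duck sound like a fade rather than a dropout.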

How I Built It

I built Conductor using Next.js 15 with React 19 for the frontend and the LiveKit Agents SDK for the voice-AI backend. The UI uses a split-screen layout with real-time transcript synchronization, powered by VTT files with word-level timestamps.
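Hypothetical example removed.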

For the AI pipeline, I integrated AssemblyAI for speech-to-text, GPT-4.1-mini for context-aware responses, and Cartesia Sonic-3 for natural text-to-speech. The agent uses 11 specialized function tools for audiobook control, including semantic search for scene navigation.
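The exact tool set isn't listed here, but one of those function tools could be described to the model with an OpenAI-style JSON schema along these lines. The tool name and parameters below are illustrative, not Conductor's actual definitions:

```python
# Illustrative OpenAI-style function-tool schema for semantic scene
# navigation. Name and parameters are hypothetical.
seek_to_scene_tool = {
    "type": "function",
    "function": {
        "name": "seek_to_scene",
        "description": (
            "Jump playback to the scene best matching the listener's "
            "description, without spoiling anything past the current "
            "position."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "description": {
                    "type": "string",
                    "description": (
                        "Natural-language description of the scene, "
                        "e.g. 'the poison apple scene'."
                    ),
                },
            },
            "required": ["description"],
        },
    },
}
```

Keeping each tool's description short and unambiguous matters: it is the only documentation the LLM sees when deciding which of the 11 tools to call.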

Bidirectional communication is handled via LiveKit data channels, sending playback state updates every second and receiving control commands in return. The transcript manager parses VTT files to extract precise word timings and creates overlapping 150-word chunks for semantic search.
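The chunking step can be sketched as follows, assuming the VTT parser yields `(start_seconds, word)` pairs. The overlap size is an assumption (the original only specifies 150-word chunks):

```python
# Sketch of overlapping chunking for semantic search. Each chunk keeps
# the timestamp of its first word so a match can be converted back into
# a seek position. Parameter names (and the overlap size) are illustrative.

CHUNK_WORDS = 150
OVERLAP_WORDS = 50   # assumed; the original write-up doesn't state it

def make_chunks(words, size=CHUNK_WORDS, overlap=OVERLAP_WORDS):
    """words: list of (start_seconds, word) tuples from the VTT parser."""
    step = size - overlap
    chunks = []
    for i in range(0, max(len(words) - overlap, 1), step):
        window = words[i:i + size]
        if not window:
            break
        chunks.append({
            "start": window[0][0],
            "text": " ".join(w for _, w in window),
        })
    return chunks
```

The overlap means a scene that straddles a chunk boundary still appears whole in at least one chunk, which keeps the semantic search from missing matches at the seams.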

Challenges I Ran Into

The biggest challenge was maintaining accurate context windows while preventing spoilers. I had to parse word-level timestamps from VTT files and implement a progressive context retrieval system that only exposes content the user has already heard.

Synchronizing audio playback with the transcript required handling timing offsets and dealing with React Strict Mode causing double-mount issues. I solved a localStorage race condition by delaying cleanup operations with setTimeout.

Implementing semantic scene navigation was tricky. I moved from regex-based matching to an LLM-powered search that sends the full transcript to GPT-4.1-mini, receives a position expressed as a percentage, converts it to a timestamp, and backtracks to a clean sentence boundary.
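The percentage-to-timestamp step with sentence backtracking can be sketched like this, again assuming word-timestamped transcript data; the function name and boundary heuristic are illustrative:

```python
# Hypothetical sketch: convert the LLM's position estimate (a percentage
# of the transcript) into a seek time, then walk backwards to the start
# of the containing sentence so playback doesn't begin mid-sentence.

def percent_to_seek(words, percent):
    """words: list of (start_seconds, word) tuples, sorted by time.
    percent: 0-100 position estimate from the LLM.
    Returns the start time of the sentence containing that position."""
    idx = min(int(len(words) * percent / 100), len(words) - 1)
    # back up until the previous word ends a sentence
    while idx > 0 and not words[idx - 1][1].rstrip('"').endswith((".", "!", "?")):
        idx -= 1
    return words[idx][0]

words = [(0.0, "Down"), (0.5, "the"), (1.0, "rabbit-hole."),
         (2.0, "Alice"), (2.6, "was"), (3.1, "not"), (3.7, "hurt.")]
print(percent_to_seek(words, 70))   # prints 2.0: backtracked to "Alice"
```

Backtracking on punctuation is a cheap heuristic, but it is enough to make a seek land at "Alice was not hurt." rather than dropping the listener in at "was not".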

Accomplishments I’m Proud Of

I built a voice assistant that understands audiobook context in real time and answers questions without spoiling future plot points. The semantic search reliably finds scenes even with vague descriptions like “when the queen talks to the mirror.”

Transcript synchronization works smoothly with automatic scrolling and visual highlighting. The split-screen e-ink aesthetic gives the interface a distinctive, retro feel while staying practical.

I also built a multi-audiobook system where users can switch books and the agent automatically updates its context and knowledge base.

What I Learned

This project taught me how to build real-time voice applications using WebRTC and LiveKit. Designing LLM function tools forced me to think carefully about clear, reliable tool interfaces.

Working with transcript data highlighted how critical word-level timing accuracy is. I also learned that semantic search with LLMs is far more robust than keyword matching for natural language navigation.

Keeping audio playback, transcript state, and voice interaction in sync required careful event handling and state design.

What’s Next for Conductor

I plan to add support for importing any audiobook file, with automatic transcript generation. Bookmarking and note-taking would let listeners save moments and capture thoughts without breaking immersion.
