## Inspiration
We've all been there: staring at a cryptic assembly manual, trying to figure
out which screw goes where. IKEA furniture, BBQ grills, electronics, even Lego
sets. The instructions feel like they're working against you. Tiny 2D
diagrams. Ambiguous arrows. Part numbers that mean nothing. One mistake, and
you're disassembling everything to start over.
We asked ourselves: What if understanding was guaranteed? What if you
could see every step in an interactive 3D model, hear clear voice guidance,
and ask questions when you're stuck? That's the experience we set out to
build.
## What it does
ManualX transforms any PDF instruction manual into an interactive 3D
assembly guide with AI voice assistance.
- Upload any PDF manual
- AI extracts the steps, detects components, and generates 3D models
- View each step as an interactive 3D scene — rotate, zoom, see exactly
  how parts connect
- Listen to natural voice narration explaining each step in plain language
- Ask questions using the built-in voice assistant when you need help
No more squinting at tiny diagrams. No more guessing. Just clear, visual,
voice-guided instructions that anyone can follow.
## How we built it
Frontend:
- Next.js 15 with React 19
- React Three Fiber for 3D rendering
- Vapi for real-time voice AI conversations
- ElevenLabs for text-to-speech narration
- react-pdf for side-by-side PDF viewing
Backend:
- FastAPI for the REST API
- Google Gemini for vision AI — detecting steps, identifying components,
  generating descriptions, and estimating 3D positions
- Tripo AI for converting 2D component images into 3D GLB models
- rembg for background removal to clean component images before 3D generation
- SQLite for data persistence
Pipeline:
PDF → Extract Images/Text → Detect Steps (Gemini) → Detect Components (Gemini)
→ Crop & Clean Images (rembg) → Generate 3D Models (Tripo)
→ Analyze Positions (Gemini) → Generate Voice Audio (ElevenLabs) → JSON
Output
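The pipeline above can be sketched as a simple orchestrator. This is an illustrative stand-in, not our actual backend code: every stage function here (`detect_steps`, `detect_components`, and the stubbed URLs) is a hypothetical placeholder for the real Gemini, Tripo, and ElevenLabs calls.

```python
# Hypothetical sketch of the ManualX pipeline; each stage is a stub
# standing in for a real service call (Gemini, rembg, Tripo, ElevenLabs).

def detect_steps(pages):
    # Real version: send each page image to Gemini vision. Stubbed here.
    return [{"step": i + 1, "page": p} for i, p in enumerate(pages)]

def detect_components(step):
    # Real version: ask Gemini for component crops in the step image.
    return [{"name": f"part_{step['step']}_a"}]

def process_manual(pages):
    """Run every stage and return the JSON-ready guide structure."""
    guide = []
    for step in detect_steps(pages):
        components = detect_components(step)
        for c in components:
            # Real version: crop, clean with rembg, then generate a GLB
            # via Tripo; here we just record where the model would live.
            c["model_url"] = f"/models/{c['name']}.glb"
        step["components"] = components
        # Real version: ElevenLabs TTS for the step narration.
        step["audio_url"] = f"/audio/step_{step['step']}.mp3"
        guide.append(step)
    return guide

print(process_manual(["page1.png", "page2.png"]))
```

The final `guide` list is what gets serialized to the JSON output that the frontend's 3D viewer consumes.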
## Challenges we ran into
3D Position Estimation: Getting AI to understand spatial relationships
from 2D manual images and translate them into accurate 3D coordinates was
harder than expected. We iterated on our Gemini prompts extensively to improve
accuracy.
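One thing that helped was asking the model for structured JSON rather than free-form text, then validating it before use. The schema and the clamping guard below are illustrative assumptions, not our exact prompt contract:

```python
import json

def parse_positions(gemini_response: str) -> dict:
    """Parse a position-estimation reply into {component: (x, y, z)}.

    Assumed (illustrative) reply schema:
    [{"component": "leg_a", "position": {"x": 0.1, "y": 0.0, "z": -0.4}}, ...]
    """
    positions = {}
    for item in json.loads(gemini_response):
        p = item["position"]
        # Illustrative sanity guard: clamp to a unit workspace so one bad
        # estimate can't fling a part off-screen in the 3D scene.
        xyz = tuple(max(-1.0, min(1.0, float(p[axis]))) for axis in ("x", "y", "z"))
        positions[item["component"]] = xyz
    return positions

reply = '[{"component": "leg_a", "position": {"x": 0.1, "y": 0.0, "z": -2.5}}]'
print(parse_positions(reply))  # → {'leg_a': (0.1, 0.0, -1.0)}
```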
Background Removal: Our initial approach used Stable Diffusion for
"cleaning" component images, but it was actually regenerating them, causing
inconsistencies. Switching to rembg gave us clean, transparent PNGs that Tripo
could work with properly.
Browser Audio Autoplay: Chrome's autoplay policies blocked our TTS from
playing automatically. We had to implement proper user interaction handling
and add manual replay controls.
Real-time Voice AI: Coordinating between the step narration TTS and Vapi's
voice assistant required careful state management to prevent audio overlap
and ensure smooth transitions.
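The core of that state management is an ownership rule: only one audio source may hold the "speaker" at a time, and the assistant may preempt narration but not the reverse. A minimal sketch of that rule, written in Python for illustration (the real coordination lives in the React frontend, and all names here are hypothetical):

```python
class AudioCoordinator:
    """Single-owner audio slot shared by narration TTS and the assistant."""

    def __init__(self):
        self.owner = None  # None, "narration", or "assistant"

    def request(self, source: str) -> bool:
        """Grant playback only if nothing else holds the audio slot."""
        if self.owner is None:
            self.owner = source
            return True
        return False

    def interrupt_for_assistant(self):
        """The voice assistant preempts step narration, never vice versa."""
        self.owner = "assistant"

    def release(self, source: str):
        if self.owner == source:
            self.owner = None

coord = AudioCoordinator()
assert coord.request("narration")      # narration starts
assert not coord.request("assistant")  # blocked while narration plays
coord.interrupt_for_assistant()        # user asks a question: preempt
coord.release("assistant")
assert coord.request("narration")      # narration can resume afterwards
```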
## Accomplishments that we're proud of
- End-to-end automation: Drop a PDF, get a fully interactive 3D guide — no
manual work required
- Multi-modal AI pipeline: Successfully chained Gemini (vision), Tripo (3D
generation), and ElevenLabs (voice) into a cohesive experience
- Natural voice interaction: Users can have real conversations about the
  assembly process, not just listen to pre-recorded audio
- Clean, intuitive UI: Split-view with 3D scene and original PDF,
  glassmorphism design, smooth animations
## What we learned
- AI is surprisingly good at understanding instructions, but only if you're very specific: vague diagrams in, vague results out
- Building a complex, multi-service AI pipeline from scratch is really hard
- Three.js is also pretty sick
## What's next for ManualX
- AR mode: View the 3D assembly overlaid on your real workspace using your
phone camera
- Community library: Share processed manuals so others don't have to
  re-process the same IKEA instructions
- Hardware integration: Connect with smart tools that can verify you're
  using the right screws