Inspiration
We wanted a practical, hands-free desktop copilot that can listen, understand, and act on everyday tasks: opening/closing apps, drafting text, saving files correctly (without losing work), checking time/weather/news for any city, and even recording meetings with one voice command. The goal was reliability over hype—tight OS integration, graceful fallbacks, and guardrails that avoid “oops” moments like force-killing terminals or losing unsaved Word documents.
How we built it
1) Speech in / speech out
- ASR: real-time chunks are recorded via sounddevice, written to WAV, then transcribed by faster-whisper with VAD (vad_filter=True).
- TTS: a dedicated pyttsx3 thread with a queue keeps spoken responses robust; there's a speak_flush() barrier and a clean shutdown path.
- Math note: audio is captured at a fixed sample rate and processed in frames. Conceptually, the pipeline rests on short-time analysis: the signal is split into windows whose Fourier components form the spectrogram the recognizer consumes.

2) Intent planning (single & multi-step)
- A compact intent schema (open/close apps, get time/weather/news, save docs, write essays, generate code, Spotify, meeting recorder, etc.) feeds a multi-action planner that yields ordered steps, JSON-only.
- There's also a heuristic fallback that parses demo phrases (e.g., "open word… write 200-word essay… save… close") when the planner is unavailable.

3) Meeting recorder (background capture → DOCX)
- Continuous mic capture streams into a WAV file on a background thread; Stop triggers transcription and writes a .docx transcript with a timestamped heading.
- Start/stop are mapped to voice intents; stop runs Whisper (beam search), collates segments, and writes both the audio and the DOCX.

4) Real desktop automation (Word/Office + safe closing)
- Save reliably: save_document uses Office COM to Save As, defaulting to .docx and auto-creating a document if none is open.
- Plan augmentation: if the user said "save … as …", a save_document step is auto-inserted before closing Word.
- Unsaved-changes guard: when closing Microsoft Office apps, we trigger Save As dialogs (or F12) instead of losing work.
- Shells are special: closing Terminal/PowerShell/CMD/WSL never kills their child processes (no /T), avoiding collateral damage.

5) City time, weather, and news
- City → geocode → timezone; time is fetched via an API. Weather uses Open-Meteo with unit selection by country code; news comes via NewsAPI/GNews if keys exist. These helpers hang off the same intent surface.
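The heuristic fallback in step 2 can be sketched as a keyword scan that maps a demo phrase to ordered, JSON-ready steps. This is a minimal sketch: the function name, intent names, and dict schema are illustrative, not the project's actual API.

```python
import re

def heuristic_plan(command: str) -> list[dict]:
    """Map a spoken demo phrase to an ordered list of intent steps.

    Illustrative fallback for when the LLM planner is unavailable:
    scans for known keywords in order and emits step dicts that
    mirror a JSON-only planner schema.
    """
    text = command.lower()
    steps = []
    if "open word" in text:
        steps.append({"intent": "open_app", "app": "word"})
    m = re.search(r"write (?:a )?(\d+)-word essay(?: (?:on|about) ([\w ]+?))?(?=[,.]|$)", text)
    if m:
        steps.append({"intent": "write_essay",
                      "words": int(m.group(1)),
                      "topic": (m.group(2) or "").strip() or None})
    m = re.search(r"save(?: it)?(?: as ([\w-]+))?", text)
    if m:
        steps.append({"intent": "save_document", "filename": m.group(1)})
    if "close word" in text:
        # Plan augmentation: make sure a save step precedes closing Word.
        if not any(s["intent"] == "save_document" for s in steps):
            steps.append({"intent": "save_document", "filename": None})
        steps.append({"intent": "close_app", "app": "word"})
    return steps
```

Because the output is a plain ordered list, the executor can run the same loop over steps whether they came from the LLM planner or from this fallback.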
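Unit selection in step 5 can be as simple as a country-code lookup folded into the weather query. A sketch, assuming Open-Meteo's documented temperature_unit / precipitation_unit query options; the helper itself and the imperial-country set are our own assumptions.

```python
# Countries conventionally using Fahrenheit (assumption: US plus a few others).
IMPERIAL_COUNTRIES = {"US", "BS", "BZ", "KY", "LR"}

def weather_params(lat: float, lon: float, country_code: str) -> dict:
    """Build Open-Meteo query parameters, choosing units by country code.

    temperature_unit and precipitation_unit are documented Open-Meteo
    options; everything else here is an illustrative default.
    """
    imperial = country_code.upper() in IMPERIAL_COUNTRIES
    return {
        "latitude": lat,
        "longitude": lon,
        "current_weather": True,
        "temperature_unit": "fahrenheit" if imperial else "celsius",
        "precipitation_unit": "inch" if imperial else "mm",
    }
```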
What it does
- Open, close, and control apps like Word, Excel, PowerPoint, browsers, or terminals.
- Write and save documents safely, with automatic safeguards that prevent losing unsaved work.
- Check the time, weather, and news for any city around the world.
- Record meetings hands-free, generating both audio files and AI-transcribed DOCX summaries.
- Handle multi-step tasks (e.g., "open Word, write a 200-word essay, save it, and close Word") in one spoken command.
Challenges we ran into
- Process management on Windows is fiddly: balancing a polite CloseMainWindow() with fallbacks to taskkill (and without /T for shells).
- Office COM can fail sporadically; the code defensively re-acquires objects and falls back to hotkeys (F12) when dialogs aren't available.
- Streaming audio to disk while keeping latency low required a background drain loop and careful shutdown (flush queue, join thread, close file).
- Multi-step voice commands need robust fallbacks; the parser includes a heuristic pipeline for when the LLM planner is unavailable or returns junk.
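The shell-safe kill path described above boils down to building the taskkill command differently for shells. A sketch: /IM, /F, and /T are real taskkill flags, but the shell list and helper name are our own assumptions.

```python
# Shell image names whose child processes must survive (illustrative list).
SHELLS = {"cmd.exe", "powershell.exe", "pwsh.exe", "wt.exe", "wsl.exe"}

def build_taskkill(image_name: str) -> list[str]:
    """Return the taskkill argv for a process image.

    /IM selects by image name, /F forces termination, and /T kills the
    whole process tree -- deliberately omitted for shells so their
    children (builds, servers, WSL sessions) are not collateral damage.
    """
    cmd = ["taskkill", "/IM", image_name, "/F"]
    if image_name.lower() not in SHELLS:
        cmd.append("/T")
    return cmd
```

In practice this argv would only run after a polite CloseMainWindow() attempt has failed.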
Accomplishments that we're proud of
- Seamless speech-to-action loop: built a real-time pipeline that captures audio, transcribes it, plans intents, and executes actions on Windows, all with voice-only control.
- Data-loss prevention in Office apps: implemented robust safeguards that automatically trigger Save/Save As before closing Word, Excel, or PowerPoint, ensuring no unsaved work is lost.
- Multi-step task orchestration: enabled the assistant to chain complex voice requests (e.g., "open Word → write essay → save → close") into reliable, ordered actions.
What we learned
- Threaded audio and TTS queues dramatically reduce jank and deadlocks with Windows SAPI.
- A small, explicit intent contract is easier to test and extend than ad-hoc parsing, and JSON-only outputs keep the planner predictable.
- "Quality of life" glue, like auto-inserting save steps and Save As fail-safes, matters more than fancy prompts.
- Treat shells differently: never tree-kill a terminal; it's the difference between "assistant" and "oops."
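The queue-plus-worker pattern behind the TTS lesson above reduces to a few lines. A minimal sketch with a stub "engine" (a list) standing in for pyttsx3, so it stays self-contained; speak_flush maps to queue.join() and shutdown uses a sentinel.

```python
import queue
import threading

class Speaker:
    """Serialize speech requests through one worker thread.

    A real engine call (e.g., pyttsx3's say + runAndWait) would replace
    the list append; routing everything through one thread keeps SAPI
    off the UI thread and avoids cross-thread deadlocks.
    """
    _STOP = object()  # sentinel for clean shutdown

    def __init__(self):
        self.q = queue.Queue()
        self.spoken = []  # stub output in place of a TTS engine
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        while True:
            item = self.q.get()
            if item is self._STOP:
                self.q.task_done()
                break
            self.spoken.append(item)  # real code: engine.say(item); engine.runAndWait()
            self.q.task_done()

    def speak(self, text: str):
        self.q.put(text)

    def speak_flush(self):
        """Barrier: block until everything queued so far has been spoken."""
        self.q.join()

    def shutdown(self):
        self.q.put(self._STOP)
        self.worker.join()
```

The sentinel gives a deterministic shutdown path: flush, enqueue the stop marker, join the thread, and nothing queued is ever silently dropped.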
What's next for Handsfree AI Assistant
- Smarter intent detection: improve multi-step planning with richer context so the assistant can handle more natural, conversational instructions.
- Cross-platform support: extend beyond Windows COM automation to macOS/Linux, enabling universal hands-free control.
- Noise robustness & mobile use: add stronger background noise filtering and explore deployment on mobile devices for on-the-go use.
- Deeper app integrations: expand support for IDEs, browsers, and collaboration tools (VS Code, Slack, Teams, etc.) with voice-driven workflows.