Working with a computer is often tedious and manual. What if it weren't? Our project lets users seamlessly present and operate their computer to do tasks such as going over slides and documents, sending emails, and more!
- Operate your computer through your camera
- Operate your computer through your voice
- Presenter mode
- Record presenter
- Generate transcript and subtitles
- OS control actions
- Left click, right click, move mouse
- Drag-and-drop, scroll, click-and-hold, window snapping, app switching, volume/brightness, media keys
- Macro system
- User-defined sequences triggered by a gesture or voice intent
- Gesture personalization
- Map your own gestures to actions and save presets
- Accessibility modes
- Dwell-click, sticky keys, large-pointer mode, reduced motion
- Collaboration tools
- Remote slide control, laser pointer overlay, live captions
- Telemetry & debugging
- Action logs, latency charts, and a safe replay sandbox for testing
- Gemini intelligence
- Intent parsing: convert natural language into a structured action plan
- Context-aware shortcuts: detect on-screen context and auto-switch profiles
- Presenter assistant: generate speaker notes, summarize slides, prep Q&A
- Meeting mode: transcript summaries, action items, slide search by topic
- Gesture refinement: reduce false positives using vision prompts
- Error recovery: explain failures and suggest fixes
- Safety & privacy
- Pause/kill toggle in the UI and a quick hotkey
- Local-only mode for users who don’t want cloud calls
- Clear opt-in for any external API data sharing
- Languages
- Python
- JavaScript
- Libraries
- OpenCV
- APIs
- Gemini
- Supported Operating Systems
- macOS
We want to set up detection of the following gestures:
- Left hand index finger
- Right hand index finger
We want to set up OS-level functions that link these gestures to operations. Targeted operations:
- Left click
- Right click
- Moving mouse
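One way to link detected gestures to these operations is a dispatch table. The sketch below uses recording stubs in place of real OS calls, and the gesture labels are illustrative, not the detector's actual names:

```python
# Hypothetical sketch: map recognized gestures to targeted operations.
# In the real app the handlers would call into the OS-control layer;
# here they only record what happened so the mapping is easy to test.

actions_log = []

def left_click():
    actions_log.append("left_click")

def right_click():
    actions_log.append("right_click")

def move_mouse(dx, dy):
    actions_log.append(("move", dx, dy))

# Gesture labels below are assumptions for illustration only.
GESTURE_ACTIONS = {
    "left_index_pinch": left_click,
    "right_index_pinch": right_click,
    "index_move": move_mouse,
}

def dispatch(gesture, **kwargs):
    handler = GESTURE_ACTIONS.get(gesture)
    if handler is None:
        return False  # unknown gesture: ignore safely
    handler(**kwargs)
    return True
```

Keeping the mapping in one table makes it easy to add user-defined macros or gesture presets later.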
This is a desktop application built with Electron. We want a clean, slick UI that users find intuitive.
Run the app (window menu):

    poetry run conux

Test detection: use the window menu and select "Gesture tester". Controls: C clear, Q quit.
Tips:
- Left hand: pinch to click/drag. Right hand: swipe left/right.
- Try pinches and short swipes at different distances from the camera.
- If detection is too sensitive, tune thresholds in `gesture_model/src/cv/gesture_detector.py`.
    cd gesture_model
    poetry install

Run modes:

    poetry run conux --mode all
    poetry run conux --mode cv
    poetry run conux --mode input
    poetry run conux --mode audio
    poetry run conux --mode summarize
    poetry run conux --mode summarize_video
    poetry run conux --mode screen
    poetry run conux --mode mux
    poetry run conux --mode capture            # screen + audio together, then mux
    poetry run conux --mode audio --audio-translate
Central defaults live in `gesture_model/src/config.py` (see `AudioConfig`).
You can either:

- Set env vars for one-off runs, or
- Import and pass a config in code (e.g., `audio.streaming.run(config=...)`).
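A sketch of what the env-override pattern might look like; the field names below are assumptions, not the real `AudioConfig` schema:

```python
import os
from dataclasses import dataclass

# Illustrative only: the real AudioConfig in gesture_model/src/config.py
# may use different fields and parsing.

@dataclass
class AudioConfig:
    backend: str = "vosk"
    sentence_gap: float = 1.2

    @classmethod
    def from_env(cls):
        # Env vars win over class defaults for one-off runs.
        return cls(
            backend=os.environ.get("AUDIO_BACKEND", cls.backend),
            sentence_gap=float(os.environ.get("AUDIO_SENTENCE_GAP", cls.sentence_gap)),
        )

# Style 1: one-off override via env var.
os.environ["AUDIO_SENTENCE_GAP"] = "2.0"
cfg = AudioConfig.from_env()

# Style 2: construct directly in code and pass it along,
# e.g. audio.streaming.run(config=AudioConfig(sentence_gap=2.0)).
```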
- `cv`: camera + gesture detection + OSLink mouse control.
- `audio`: live speech-to-text with optional translation + subtitle buffering.
- `screen`: screen recording via ffmpeg.
- `mux`: mux the latest screen recording with saved audio clips.
- `capture`: runs audio + screen together, then muxes; sets `AUDIO_SAVE=1` and `SCREEN_SAVE=1`.
- `menu`: simple window menu to launch CV/tester tools.
The default backend is Vosk with partials enabled; switch to SpeechRecognition with `AUDIO_BACKEND=speech_recognition`.
Run with Poetry:

    AUDIO_BACKEND=vosk poetry run conux --mode audio

Translate audio on the fly:

    AUDIO_TRANSLATE=1 AUDIO_TRANSLATE_TARGET=fr AUDIO_TRANSLATE_SOURCE=auto \
    AUDIO_BACKEND=vosk poetry run conux --mode audio

Or use the CLI flag:

    poetry run conux --mode audio --audio-translate

The first run auto-downloads the Vosk model to `~/.cache/conux/vosk`. Override with:

- `VOSK_MODEL_CACHE_DIR=/path/to/cache`
- `VOSK_MODEL_URL=...` (model zip URL)
- `VOSK_MODEL_PATH=/path/to/extracted/model` (skip download)
Audio pipeline configuration:
- `AUDIO_BACKEND=vosk|speech_recognition`
- `AUDIO_STREAM_PARTIALS=1` (emit partial Vosk hypotheses; default on)
- `AUDIO_TRANSLATE=1`, `AUDIO_TRANSLATE_TARGET=en`, `AUDIO_TRANSLATE_SOURCE=auto`
- `AUDIO_TRANSLATE_PARTIALS=1`
- `AUDIO_SENTENCE_GAP=1.2`, `AUDIO_MAX_SENTENCE_SECONDS=10`, `AUDIO_MAX_SENTENCE_CHARS=120`
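The sentence-buffering knobs (`AUDIO_SENTENCE_GAP`, `AUDIO_MAX_SENTENCE_SECONDS`, `AUDIO_MAX_SENTENCE_CHARS`) can be read as flush conditions. A simplified model of that idea, not the project's actual buffering code:

```python
def should_flush(gap_seconds, sentence_seconds, sentence_chars,
                 max_gap=1.2, max_seconds=10, max_chars=120):
    """Flush the subtitle buffer when any limit is hit.

    Defaults mirror the documented env-var defaults; the real
    implementation may combine these conditions differently.
    """
    return (
        gap_seconds >= max_gap              # silence long enough to end a sentence
        or sentence_seconds >= max_seconds  # sentence has run too long
        or sentence_chars >= max_chars      # sentence too wide for one caption
    )
```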
Audio device selection:
- `AUDIO_DEVICE_INDEX=0` (force a specific mic index)
- `AUDIO_SKIP_DEVICE_SCAN=1` (skip device enumeration)
SpeechRecognition tuning:
- `AUDIO_LISTEN_TIMEOUT=5`, `AUDIO_PHRASE_TIME_LIMIT=5`, `AUDIO_QUIET=1`
- `AUDIO_DYNAMIC_ENERGY=1`, `AUDIO_ENERGY_THRESHOLD=...`
- `AUDIO_PAUSE_THRESHOLD=...`, `AUDIO_PHRASE_THRESHOLD=...`
- `AUDIO_NON_SPEAKING_DURATION=...`
Vosk streaming tuning:
- `VOSK_LOG_LEVEL=-1`
- `AUDIO_SAMPLE_RATE=16000`, `AUDIO_CHUNK_SIZE=8000`
- `AUDIO_PRE_ROLL_SECONDS=0.5`
- `AUDIO_PREPROCESS=1`, `AUDIO_TARGET_RMS=3000`
- `AUDIO_MAX_GAIN=3.0`, `AUDIO_GATE_THRESHOLD=400`
- `VOSK_MODEL_PATH=/path/to/model`
- `VOSK_MODEL_CACHE_DIR=...`, `VOSK_MODEL_VARIANT=large|small`
- `VOSK_MODEL_URL=...`, `VOSK_MODEL_ZIP_PATH=...`
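The `AUDIO_PREPROCESS`, `AUDIO_TARGET_RMS`, `AUDIO_MAX_GAIN`, and `AUDIO_GATE_THRESHOLD` knobs suggest gain normalization with a noise gate. A minimal sketch of that idea, assuming (not confirming) these semantics:

```python
def preprocess(samples, target_rms=3000, max_gain=3.0, gate_threshold=400):
    """Sketch of assumed AUDIO_PREPROCESS behavior: silence chunks below
    the gate, then scale toward TARGET_RMS with gain capped at MAX_GAIN.
    `samples` is a list of int PCM samples; not the project's actual DSP.
    """
    if not samples:
        return samples
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    if rms < gate_threshold:
        return [0] * len(samples)  # below the noise gate: treat as silence
    gain = min(max_gain, target_rms / rms)
    return [int(s * gain) for s in samples]
```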
CLI audio flags (override env defaults):
- `--audio-backend vosk|speech_recognition`
- `--audio-stream-partials` / `--no-audio-stream-partials`
- `--audio-translate`, `--audio-translate-target`, `--audio-translate-source`
- `--audio-translate-partials` / `--no-audio-translate-partials`
- `--audio-sentence-gap`, `--audio-max-sentence-seconds`, `--audio-max-sentence-chars`
- `--captions` (audio/capture modes), `--text-mode original|translated`
Captions overlay (desktop subtitle bar):
- `--captions` shows the overlay in `audio` or `capture` mode.
- `--text-mode original|translated` selects what the overlay shows.
- If `--text-mode translated` is used, translation is enabled automatically.
- The overlay is a topmost window anchored to the bottom of the desktop.
- When using `--mode capture`, the overlay is part of the desktop and will be recorded by screen capture if your OS allows capturing topmost windows.
Vosk required/optional env vars:
- Required to use Vosk: none (default backend)
- Optional model control:
  - `VOSK_MODEL_PATH=/path/to/model` (skip download; point at an extracted model)
  - `VOSK_MODEL_CACHE_DIR=...` (override the cache dir for auto-download)
  - `VOSK_MODEL_VARIANT=large|small` (select the default model URL)
  - `VOSK_MODEL_URL=...` (custom model zip URL)
  - `VOSK_MODEL_ZIP_PATH=...` (use a pre-downloaded zip)
- Optional streaming/audio knobs:
  - `AUDIO_STREAM_PARTIALS=1` (emit partial hypotheses)
  - `AUDIO_SAMPLE_RATE=16000`, `AUDIO_CHUNK_SIZE=8000`
  - `AUDIO_PRE_ROLL_SECONDS=0.5`
  - `AUDIO_PREPROCESS=1`, `AUDIO_TARGET_RMS=3000`
  - `AUDIO_MAX_GAIN=3.0`, `AUDIO_GATE_THRESHOLD=400`
  - `VOSK_LOG_LEVEL=-1`
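From the model-control variables, one plausible resolution order is: explicit model path, then a pre-downloaded zip, then the auto-download cache. A hypothetical sketch; the loader's real precedence may differ:

```python
import os

def resolve_vosk_model(env):
    """Assumed precedence for locating the Vosk model:
    extracted model dir > pre-downloaded zip > cache for auto-download
    (VOSK_MODEL_VARIANT / VOSK_MODEL_URL would pick what gets fetched).
    """
    if env.get("VOSK_MODEL_PATH"):
        return ("extracted", env["VOSK_MODEL_PATH"])
    if env.get("VOSK_MODEL_ZIP_PATH"):
        return ("zip", env["VOSK_MODEL_ZIP_PATH"])
    cache = env.get("VOSK_MODEL_CACHE_DIR",
                    os.path.expanduser("~/.cache/conux/vosk"))
    return ("download", cache)
```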
There are no required env vars for the default `poetry run conux --mode all` flow.
Set these only if your environment needs them:
- `FFMPEG_PATH=/path/to/ffmpeg` if `ffmpeg` is not on `PATH` (screen/mux).
- `FFPROBE_PATH=/path/to/ffprobe` if `ffprobe` is not on `PATH` (mux duration detection).
- `VOSK_MODEL_PATH=/path/to/model` if you run `AUDIO_BACKEND=vosk` without allowing the auto-download.
- `SCREEN_INPUT=...` if the default screen capture input does not work on your OS.
- `AUDIO_DEVICE_INDEX=...` if you need to force a non-default microphone.
Set one of:
- `AUDIO_SAVE=1` (saves to `audio_recordings/`)
- `AUDIO_SAVE_DIR=/path/to/dir`
Set one of:
- `TRANSCRIPT_SAVE=1` (saves to repo-root `transcripts/`)
- `TRANSCRIPT_SAVE_DIR=/path/to/dir`
- `TRANSCRIPT_SAVE_PATH=/path/to/transcript.txt`

Note: `transcripts/` is intentionally untracked; do not commit transcript files.
Set one of:
- `SCREEN_SAVE=1` (saves to `screen_recordings/`)
- `SCREEN_SAVE_DIR=/path/to/dir`
Optional:
- `FFMPEG_PATH=ffmpeg`
- `SCREEN_FPS=30`
- `SCREEN_SIZE=1920x1080`
- `SCREEN_INPUT=...` (Windows default `desktop`; macOS auto-detects `Capture screen` and falls back to `1`; Linux default `$DISPLAY`)
- `SCREEN_QUIET=1` (suppress ffmpeg output)
Uses the newest `screen_*.mp4` in `SCREEN_SAVE_DIR` (or `screen_recordings/`) and the timestamped `audio_*.wav` files in `AUDIO_SAVE_DIR` (or `audio_recordings/`).
Optional:
- `MUX_OUTPUT_DIR=/path/to/dir`
- `FFPROBE_PATH=ffprobe`
Gesture thresholds, smoothing, and mouse motion tuning are code-level constants:
- `gesture_model/src/cv/gesture_detector.py` (pinch/swipe/click/drag thresholds)
- `gesture_model/src/cv/camera.py` (hand-tracking confidence, motion gains, acceleration)
OSLink exposes a `SafetyConfig(paused=False)` and pause/kill toggles in code (`oslink/api.py`); there are no env vars yet.
Summarize the newest transcript:

    GEMINI_API_KEY=... poetry run conux --mode summarize

Summarize with video + transcript context:

    GEMINI_API_KEY=... poetry run conux --mode summarize_video

Outputs:

- Transcript: `transcripts/transcript_<timestamp>.txt`
- Summary: `summaries/summary_<timestamp>.pdf`
Generate a PDF slide deck from the newest muxed video + transcript:

    poetry run conux --mode deck

One-command capture + deck:

    TRANSCRIPT_SAVE=1 AUDIO_SAVE=1 SCREEN_SAVE=1 poetry run conux --mode capture_deck

Outputs:

- Slide deck: `decks/deck_<timestamp>.pdf`
- Markdown source: `decks/deck_<timestamp>.md`
- Image assets: `decks/deck_<timestamp>_assets/`
Dependencies:
- `poetry add google-genai`
- `npm i -g @marp-team/marp-cli` (or set `MARP_PATH="npx @marp-team/marp-cli"`)
Options:
- `DECK_OUTPUT_PATH=/path/to/deck.pdf`
- `DECK_OUTPUT_DIR=/path/to/dir` (default `decks/`)
- `DECK_TITLE="..."`, `DECK_SUBTITLE="..."`
- `DECK_MAX_IMAGES=6`
- `DECK_SCENE_THRESHOLD=0.25` (scene-change capture)
- `DECK_FRAME_INTERVAL=10` (fallback seconds)
- `DECK_IMAGE_WIDTH=1280`
- `DECK_PROMPT="..."` (override the deck summarization prompt)
- `DECK_AFTER_CAPTURE=1` (run deck generation after `--mode capture`)
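As an illustration of the `DECK_FRAME_INTERVAL` fallback, evenly spaced capture timestamps capped at `DECK_MAX_IMAGES` might be computed like this (assumed behavior, not the project's actual frame selection):

```python
def fallback_frame_times(duration_s, interval_s=10, max_images=6):
    """Sketch of the interval fallback used when scene-change capture
    finds nothing: timestamps every `interval_s` seconds, capped at
    `max_images`. Defaults mirror DECK_FRAME_INTERVAL / DECK_MAX_IMAGES.
    """
    times = list(range(0, int(duration_s), interval_s))
    return times[:max_images]
```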
- Capture screen + audio (muxed video) + transcript:

      TRANSCRIPT_SAVE=1 AUDIO_SAVE=1 SCREEN_SAVE=1 poetry run conux --mode capture

- Summarize using the full video + transcript:

      poetry run conux --mode summarize_video

Outputs:

- Muxed video: `screen_recordings/muxed_<timestamp>.mp4`
- Transcript: `transcripts/transcript_<timestamp>.txt`
- Summary PDF: `summaries/summary_<timestamp>.pdf`
Optional:
- `SUMMARY_OUTPUT_PATH=/path/to/summary.pdf` (or `.txt`)
- `SUMMARY_OUTPUT_DIR=/path/to/dir` (default `summaries/`)
- `TRANSCRIPT_PATH=/path/to/transcript.txt` (default is the newest in `transcripts/`)
- `VIDEO_PATH=/path/to/muxed.mp4` (used by `summarize_video`; default is the newest in `screen_recordings/`)
- `SUMMARY_PROMPT="..."` (override the summarization instruction)
- `GEMINI_MODEL=gemini-2.0-flash`
Dependency:
    poetry add google-genai reportlab
Examples:
    TRANSCRIPT_SAVE=1 poetry run conux --mode audio
    GEMINI_API_KEY=... poetry run conux --mode summarize
    TRANSCRIPT_PATH=/path/to/file.txt GEMINI_API_KEY=... poetry run conux --mode summarize
    VIDEO_PATH=/path/to/muxed.mp4 GEMINI_API_KEY=... poetry run conux --mode summarize_video