NeuroCue
Reading the room, so you don't have to.
Table of Contents
- The Problem
- How It Works
- System Architecture
- Accomplishments
- Challenges Faced
- What We Learned
- Future Roadmap
The Problem
15–20% of the global population is neurodivergent. For many, one of the hardest daily challenges is social interaction.
Almost 50% of autistic individuals experience social-emotional agnosia: an inability to read facial expressions, body language, and vocal inflection. People with ADHD face a related problem: inattention often causes them to miss real-time social cues.
The consequences are severe:
- 85% estimated unemployment rate among autistic adults.
- 52% of neurodivergent professionals don't feel comfortable disclosing their condition at work.
- #1 barrier to employment for neurodivergent people is face-to-face interviews.
Existing support is retrospective (e.g., therapy reviewing what went wrong). NeuroCue works in the moment, in a live conversation, invisibly.
How It Works
The user wears an earpiece and opens their laptop camera. NeuroCue reads the conversation partner across three simultaneous channels:
- Body Language: 17 skeleton keypoints from pose estimation are converted into 14 behavioural states (e.g., crossed arms, fidgeting, leaning away).
- Facial Expression: A custom-trained CNN classifies 7 emotions and compound states like "polite smile" (smiling but not genuinely engaged) or "masking discomfort".
- Speech: Live transcription captures what is actually being said.
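The body-language channel can be illustrated with one of its simpler rules. This is a hypothetical sketch, assuming COCO-style keypoints as `(x, y)` pixel tuples; the function name and the midline trick are illustrative, not NeuroCue's actual heuristics.

```python
# Hypothetical "crossed arms" heuristic over pose-estimation keypoints.
# Assumes COCO-style named keypoints as (x, y) pixel tuples; thresholds
# and logic are illustrative, not the project's actual implementation.

def arms_crossed(kp: dict) -> bool:
    """Flag 'crossed arms': each wrist has crossed the torso midline
    while its elbow stays on its own side."""
    midline = (kp["left_shoulder"][0] + kp["right_shoulder"][0]) / 2
    # Wrist and elbow on opposite sides of the midline => negative product.
    left_crossed = (kp["left_wrist"][0] - midline) * (kp["left_elbow"][0] - midline) < 0
    right_crossed = (kp["right_wrist"][0] - midline) * (kp["right_elbow"][0] - midline) < 0
    return left_crossed and right_crossed
```

Each of the 14 behavioural states would be a small geometric test like this, cheap enough to run on every frame.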
Every 15 seconds, all three signals are fused by Claude. The LLM reasons about cross-modal combinations and generates a single, short, actionable piece of coaching advice, which is whispered through the earpiece via text-to-speech.
"They've crossed their arms and look uncomfortable — try asking what concerns they have."
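The fusion step above can be sketched as a plain prompt-assembly function. Field names and wording here are invented for illustration; the real prompt is not published in this write-up.

```python
# Hypothetical sketch of the 15-second fusion step: the three perception
# channels are flattened into one structured prompt for the LLM.
# Field names and instructions are illustrative, not NeuroCue's actual prompt.

def build_fusion_prompt(body: list, face: dict, transcript: str) -> str:
    top_emotion = max(face, key=face.get)
    return (
        "You are a real-time social-cue coach. Reply with one short, "
        "actionable sentence. Flag cross-modal contradictions explicitly.\n"
        f"Body language (last 15 s): {', '.join(body) or 'none detected'}\n"
        f"Facial expression: {top_emotion} (distribution: {face})\n"
        f'They said: "{transcript}"'
    )

prompt = build_fusion_prompt(
    body=["crossed arms", "leaning away"],
    face={"happy": 0.55, "neutral": 0.30, "fear": 0.15},
    transcript="That sounds fine, I guess.",
)
```

The resulting string would be sent to Claude once per audio cycle, and the one-sentence reply piped straight to text-to-speech.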
System Architecture
- Perception Layer: YOLOv11 Nano Pose (via NVIDIA A10G GPU / OpenShift) + local trigonometric heuristics for body language.
- Facial Expression: FERNet (custom CNN trained on FER-2013) + OpenCV Haar Cascade for ~15ms per frame CPU inference.
- Speech-to-Text: sounddevice + ElevenLabs Scribe v1.
- Reasoning Layer: Anthropic's Claude evaluates contradictions (e.g., "smiling but arms crossed").
- Output: ElevenLabs TTS (Turbo v2.5) for audio delivery.
Data Flow
    Webcam ─► GPU Pose Estimation ─► Body Language ─────────────┐
                                                                │
    Webcam ─► Haar Face Detect ─► FERNet ─► Facial Expression ──├─► Claude ─► TTS ─► 🔊
                                                                │
    Microphone ─► ElevenLabs STT ─► Transcript ─────────────────┘
Accomplishments
- Three-modality fusion working end-to-end in real time — body language, facial expression, and speech all feeding into a single coherent piece of advice, with cross-modal contradiction detection.
- Training a facial expression CNN from scratch on a hackathon timescale and integrating it into the live pipeline: from dataset download to real-time inference in under 24 hours.
- Compound expression detection — going beyond top-1 emotion labels to detect "polite smile" (low-confidence happy + high neutral), "masking" (smiling with fear/sad undertones), and "confusion" (surprise + fear) from the probability distribution.
- The system runs on a normal laptop, with no wearable beyond an earpiece. The entire perception layer runs at real-time speeds on CPU; only YOLO pose estimation is offloaded to a remote GPU.
- Claude's cross-modal reasoning — it catches nuance that single-modality systems miss entirely, like "they're nodding and smiling but their voice sounds uncertain and they said 'I'm not sure' — they're being polite, not agreeing."
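The compound-expression idea can be made concrete with a small post-processing function over the CNN's 7-way softmax output. The thresholds below are invented for the sketch; the write-up only specifies the qualitative rules (e.g. "polite smile" = low-confidence happy + high neutral).

```python
# Illustrative reading of compound states from a 7-way emotion distribution.
# Thresholds are assumptions for this sketch, not NeuroCue's tuned values.
from typing import Optional

def compound_state(p: dict) -> Optional[str]:
    """Map an emotion probability distribution to a compound state, or None."""
    if p["happy"] > 0.4 and (p["fear"] + p["sad"]) > 0.25:
        return "masking discomfort"   # smile with fear/sad undertones
    if 0.2 < p["happy"] < 0.5 and p["neutral"] > 0.3:
        return "polite smile"         # smiling, but not genuinely engaged
    if p["surprise"] > 0.3 and p["fear"] > 0.2:
        return "confusion"            # surprise + fear blend
    return None
```

The point is that the signal lives in the shape of the distribution, not in the argmax: a top-1 label would report all three of these cases as simply "happy" or "surprise".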
Challenges Faced
- Training a CNN during a hackathon. Wrestling with dataset paths on Windows vs Linux, Kaggle authentication for downloading FER-2013, and free-tier Colab GPU allocation running out mid-training. FER-2013's severe class imbalance required weighted sampling, class-weighted loss, and mixup augmentation to get usable accuracy on minority emotions like disgust and fear.
- Dependencies. Pygame was required by the ElevenLabs streaming library but refused to build on our system. We rewrote audio playback to use sounddevice instead — a dependency we already had for microphone capture.
- Latency vs efficacy. A 15-second audio window is needed for meaningful speech transcription, but advice arriving 15+ seconds after a social cue has passed is too late. We balanced this by running body language and facial expression detection continuously while batching the LLM call to the audio cycle.
- ML inference over hackathon WiFi. YOLO inference via a Cloudflare tunnel over conference WiFi was unreliable. We compressed frames to JPEG quality 60 and resized to 640×480 before sending, trading image quality for reliability.
- Threading bugs. Shared mutable state between three threads led to races where the UI would read half-updated body language states. A single global lock solved it but required careful placement to avoid deadlocking the video feed.
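The class-weighting half of the imbalance fix can be sketched as inverse-frequency weights over the commonly cited FER-2013 training-split counts (28,709 images). The exact weighting scheme NeuroCue used is an assumption here; in PyTorch these weights would typically feed `torch.nn.CrossEntropyLoss(weight=...)` and a `WeightedRandomSampler`.

```python
# Sketch of inverse-frequency class weights for FER-2013's training split.
# Counts are the commonly cited distribution; the weighting scheme itself
# is an assumption, shown for illustration.

counts = {"angry": 3995, "disgust": 436, "fear": 4097, "happy": 7215,
          "sad": 4830, "surprise": 3171, "neutral": 4965}

n, k = sum(counts.values()), len(counts)
weights = {c: n / (k * v) for c, v in counts.items()}  # inverse frequency
# Minority classes get the largest weights: "disgust" ~9.4x vs "happy" ~0.57x.
```

Without some correction like this, a model can score high overall accuracy while almost never predicting disgust or fear at all.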
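The single-lock fix for the UI races can be sketched in a few lines. Names are illustrative; the pattern is simply that writers update shared state only under one lock, and the UI reads a consistent snapshot under the same lock.

```python
# Minimal sketch of the global-lock fix: three perception threads write
# shared state under one lock; the UI copies a snapshot under the same lock
# and renders from the copy, so it never sees a half-updated state.
import threading

state_lock = threading.Lock()
shared = {"body": [], "face": None, "transcript": ""}

def update_body(states: list) -> None:
    with state_lock:                       # writer holds the lock per update
        shared["body"] = list(states)

def read_snapshot() -> dict:
    with state_lock:                       # reader copies under the lock
        return dict(shared)
```

Holding the lock only for the copy (not for rendering) is what keeps the video feed from deadlocking on slow perception threads.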
What We Learned
- The gap between "model works in a notebook" and "model works in a live system" is enormous. Path handling, threading, dependency conflicts, and latency constraints are the real engineering challenge — not the ML.
- Class imbalance matters more than model architecture for real-world accuracy. Our FERNet with proper class balancing outperforms a deeper model trained naively on the same data.
- LLMs are surprisingly effective at multi-modal fusion when given structured signal descriptions: Claude caught cross-modal contradictions we hadn't explicitly programmed for.
- Social cues are genuinely ambiguous even for neurotypical people — FER-2013's human agreement ceiling is only ~65%. This reframed our goal from "get it right" to "provide useful guidance under uncertainty."
- Up to 93% of emotional communication is nonverbal (Mehrabian, 1971), yet almost all assistive technology for neurodivergent people focuses on the verbal channel. There's a massive unserved gap in the nonverbal one.
Future Roadmap
- On-device inference — replacing the remote GPU with on-device YOLO (Apple Neural Engine, Qualcomm NPU) so it works without WiFi.
- Temporal trend analysis — tracking engagement rising or falling over 30-second windows, not just point-in-time snapshots.
- Mobile deployment — running on a phone with AirPods as the earpiece for truly invisible, go-anywhere use.
- Self-refinement loop — the user rates advice quality and the system learns which signals and coaching styles work best for them, personalising over time.
- Cross-cultural adaptation — body language heuristics are currently Western-centric. Head tilting, eye contact norms, and personal space vary across cultures.
- Smart glasses integration — visual overlays as an alternative to audio coaching for users who prefer visual input.
- Conversation arc awareness — integrating the full conversation transcript so Claude can reference earlier context and detect evolving emotional trajectories across an entire interaction.
Built With
- claude
- computer-vision
- elevenlabs
- gemini
- machine-learning
- pytorch