NeuroCue
Reading the room, so you don't have to.
Table of Contents
- The Problem
- How It Works
- System Architecture
- Accomplishments
- Challenges Faced
- What We Learned
- Future Roadmap
The Problem
15–20% of the global population is neurodivergent. For many, one of the hardest daily challenges is social interaction.
Almost 50% of autistic individuals experience social-emotional agnosia: an inability to read facial expressions, body language, and vocal inflection. People with ADHD face a related problem: inattention often causes them to miss real-time social cues.
The consequences are severe:
- 85% estimated unemployment rate among autistic adults.
- 52% of neurodivergent professionals don't feel comfortable disclosing their condition at work.
- #1 barrier to employment for neurodivergent people is face-to-face interviews.
Existing support is retrospective (e.g., therapy reviewing what went wrong). NeuroCue works in the moment, in a live conversation, invisibly.
How It Works
The user wears an earpiece and opens their laptop camera. NeuroCue reads the conversation partner across three simultaneous channels:
- Body Language: 17 skeleton keypoints from pose estimation are converted into 14 behavioural states (e.g., crossed arms, fidgeting, leaning away).
- Facial Expression: A custom-trained CNN classifies 7 emotions and compound states like "polite smile" (smiling but not genuinely engaged) or "masking discomfort".
- Speech: Live transcription captures what is actually being said.
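The body-language channel can be illustrated with one of its simpler rules. This is a hypothetical sketch, assuming COCO-style keypoints as `(x, y)` pixel tuples; the function name and the midline trick are illustrative, not NeuroCue's actual heuristics.

```python
# Hypothetical "crossed arms" heuristic over pose-estimation keypoints.
# Assumes COCO-style named keypoints as (x, y) pixel tuples; thresholds
# and logic are illustrative, not the project's actual implementation.

def arms_crossed(kp: dict) -> bool:
    """Flag 'crossed arms': each wrist has crossed the torso midline
    while its elbow stays on its own side."""
    midline = (kp["left_shoulder"][0] + kp["right_shoulder"][0]) / 2
    # Wrist and elbow on opposite sides of the midline => negative product.
    left_crossed = (kp["left_wrist"][0] - midline) * (kp["left_elbow"][0] - midline) < 0
    right_crossed = (kp["right_wrist"][0] - midline) * (kp["right_elbow"][0] - midline) < 0
    return left_crossed and right_crossed
```

Each of the 14 behavioural states would be a small geometric test like this, cheap enough to run on every frame.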
Every 15 seconds, all three signals are fused by Claude. The LLM reasons about cross-modal combinations and generates a single, short, actionable piece of coaching advice, which is whispered through the earpiece via text-to-speech.
"They've crossed their arms and look uncomfortable — try asking what concerns they have."
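The fusion step above can be sketched as a plain prompt-assembly function. Field names and wording here are invented for illustration; the real prompt is not published in this write-up.

```python
# Hypothetical sketch of the 15-second fusion step: the three perception
# channels are flattened into one structured prompt for the LLM.
# Field names and instructions are illustrative, not NeuroCue's actual prompt.

def build_fusion_prompt(body: list, face: dict, transcript: str) -> str:
    top_emotion = max(face, key=face.get)
    return (
        "You are a real-time social-cue coach. Reply with one short, "
        "actionable sentence. Flag cross-modal contradictions explicitly.\n"
        f"Body language (last 15 s): {', '.join(body) or 'none detected'}\n"
        f"Facial expression: {top_emotion} (distribution: {face})\n"
        f'They said: "{transcript}"'
    )

prompt = build_fusion_prompt(
    body=["crossed arms", "leaning away"],
    face={"happy": 0.55, "neutral": 0.30, "fear": 0.15},
    transcript="That sounds fine, I guess.",
)
```

The resulting string would be sent to Claude once per audio cycle, and the one-sentence reply piped straight to text-to-speech.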
System Architecture
- Perception Layer: YOLOv11 Nano Pose (via NVIDIA A10G GPU / OpenShift) + local trigonometric heuristics for body language.
- Facial Expression: FERNet (custom CNN trained on FER-2013) + OpenCV Haar Cascade for ~15ms per frame CPU inference.
- Speech-to-Text: sounddevice + ElevenLabs Scribe v1.
- Reasoning Layer: Anthropic's Claude evaluates contradictions (e.g., "smiling but arms crossed").
- Output: ElevenLabs TTS (Turbo v2.5) for audio delivery.
Data Flow
    Webcam ─► GPU Pose Estimation ─► Body Language ─────────────┐
                                                                │
    Webcam ─► Haar Face Detect ─► FERNet ─► Facial Expression ──├─► Claude ─► TTS ─► 🔊
                                                                │
    Microphone ─► ElevenLabs STT ─► Transcript ─────────────────┘
Accomplishments
- Three-modality fusion working end-to-end in real time — body language, facial expression, and speech all feeding into a single coherent piece of advice, with cross-modal contradiction detection.
- Training a facial expression CNN from scratch on a hackathon timescale and integrating it into the live pipeline: from dataset download to real-time inference in under 24 hours.
- Compound expression detection — going beyond top-1 emotion labels to detect "polite smile" (low-confidence happy + high neutral), "masking" (smiling with fear/sad undertones), and "confusion" (surprise + fear) from the probability distribution.
- The system runs on a normal laptop, with no wearable beyond an earpiece. The entire perception layer runs at real-time speeds on CPU; only YOLO pose estimation is offloaded to a remote GPU.
- Claude's cross-modal reasoning — it catches nuance that single-modality systems miss entirely, like "they're nodding and smiling but their voice sounds uncertain and they said 'I'm not sure' — they're being polite, not agreeing."
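The compound-expression idea can be made concrete with a small post-processing function over the CNN's 7-way softmax output. The thresholds below are invented for the sketch; the write-up only specifies the qualitative rules (e.g. "polite smile" = low-confidence happy + high neutral).

```python
# Illustrative reading of compound states from a 7-way emotion distribution.
# Thresholds are assumptions for this sketch, not NeuroCue's tuned values.
from typing import Optional

def compound_state(p: dict) -> Optional[str]:
    """Map an emotion probability distribution to a compound state, or None."""
    if p["happy"] > 0.4 and (p["fear"] + p["sad"]) > 0.25:
        return "masking discomfort"   # smile with fear/sad undertones
    if 0.2 < p["happy"] < 0.5 and p["neutral"] > 0.3:
        return "polite smile"         # smiling, but not genuinely engaged
    if p["surprise"] > 0.3 and p["fear"] > 0.2:
        return "confusion"            # surprise + fear blend
    return None
```

The point is that the signal lives in the shape of the distribution, not in the argmax: a top-1 label would report all three of these cases as simply "happy" or "surprise".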
Challenges Faced
- Training a CNN during a hackathon. Wrestling with dataset paths on Windows vs Linux, Kaggle authentication for downloading FER-2013, and free-tier Colab GPU allocation running out mid-training. FER-2013's severe class imbalance required weighted sampling, class-weighted loss, and mixup augmentation to get usable accuracy on minority emotions like disgust and fear.
- Dependencies. Pygame was required by the ElevenLabs streaming library but refused to build on our system. We rewrote audio playback to use sounddevice instead — a dependency we already had for microphone capture.
- Latency vs efficacy. A 15-second audio window is needed for meaningful speech transcription, but advice arriving 15+ seconds after a social cue has passed is too late. We balanced this by running body language and facial expression detection continuously while batching the LLM call to the audio cycle.
- ML inference over hackathon WiFi. YOLO inference via a Cloudflare tunnel over conference WiFi was unreliable. We compressed frames to JPEG quality 60 and resized to 640×480 before sending, trading image quality for reliability.
- Threading bugs. Shared mutable state between three threads led to races where the UI would read half-updated body language states. A single global lock solved it but required careful placement to avoid deadlocking the video feed.
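The class-weighting half of the imbalance fix can be sketched as inverse-frequency weights over the commonly cited FER-2013 training-split counts (28,709 images). The exact weighting scheme NeuroCue used is an assumption here; in PyTorch these weights would typically feed `torch.nn.CrossEntropyLoss(weight=...)` and a `WeightedRandomSampler`.

```python
# Sketch of inverse-frequency class weights for FER-2013's training split.
# Counts are the commonly cited distribution; the weighting scheme itself
# is an assumption, shown for illustration.

counts = {"angry": 3995, "disgust": 436, "fear": 4097, "happy": 7215,
          "sad": 4830, "surprise": 3171, "neutral": 4965}

n, k = sum(counts.values()), len(counts)
weights = {c: n / (k * v) for c, v in counts.items()}  # inverse frequency
# Minority classes get the largest weights: "disgust" ~9.4x vs "happy" ~0.57x.
```

Without some correction like this, a model can score high overall accuracy while almost never predicting disgust or fear at all.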
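The single-lock fix for the UI races can be sketched in a few lines. Names are illustrative; the pattern is simply that writers update shared state only under one lock, and the UI reads a consistent snapshot under the same lock.

```python
# Minimal sketch of the global-lock fix: three perception threads write
# shared state under one lock; the UI copies a snapshot under the same lock
# and renders from the copy, so it never sees a half-updated state.
import threading

state_lock = threading.Lock()
shared = {"body": [], "face": None, "transcript": ""}

def update_body(states: list) -> None:
    with state_lock:                       # writer holds the lock per update
        shared["body"] = list(states)

def read_snapshot() -> dict:
    with state_lock:                       # reader copies under the lock
        return dict(shared)
```

Holding the lock only for the copy (not for rendering) is what keeps the video feed from deadlocking on slow perception threads.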
What We Learned
- The gap between "model works in a notebook" and "model works in a live system" is enormous. Path handling, threading, dependency conflicts, and latency constraints are the real engineering challenge — not the ML.
- Class imbalance matters more than model architecture for real-world accuracy. Our FERNet with proper class balancing outperforms a deeper model trained naively on the same data.
- LLMs are surprisingly effective at multi-modal fusion when given structured signal descriptions: Claude caught cross-modal contradictions we hadn't explicitly programmed for.
- Social cues are genuinely ambiguous even for neurotypical people — FER-2013's human agreement ceiling is only ~65%. This reframed our goal from "get it right" to "provide useful guidance under uncertainty."
- Up to 93% of emotional communication is nonverbal (Mehrabian, 1971), yet almost all assistive technology for neurodivergent people focuses on the verbal channel. There's a massive unserved gap in the nonverbal one.
Future Roadmap
- On-device inference — replacing the remote GPU with on-device YOLO (Apple Neural Engine, Qualcomm NPU) so it works without WiFi.
- Temporal trend analysis — tracking engagement rising or falling over 30-second windows, not just point-in-time snapshots.
- Mobile deployment — running on a phone with AirPods as the earpiece for truly invisible, go-anywhere use.
- Self-refinement loop — the user rates advice quality and the system learns which signals and coaching styles work best for them, personalising over time.
- Cross-cultural adaptation — body language heuristics are currently Western-centric. Head tilting, eye contact norms, and personal space vary across cultures.
- Smart glasses integration — visual overlays as an alternative to audio coaching for users who prefer visual input.
- Conversation arc awareness — integrating the full conversation transcript so Claude can reference earlier context and detect evolving emotional trajectories across an entire interaction.
Built With
- claude
- computer-vision
- elevenlabs
- gemini
- machine-learning
- pytorch