NeuroCue

Reading the room, so you don't have to.

The Problem

15–20% of the global population is neurodivergent. For many, one of the hardest daily challenges is social interaction.

Almost 50% of autistic individuals experience social-emotional agnosia—the clinical inability to perceive facial expressions, body language, and vocal inflection. People with ADHD face a related problem: inattention often causes them to miss real-time social cues.

The consequences are severe:

  • 85% estimated unemployment rate among autistic adults.
  • 52% of neurodivergent professionals don't feel comfortable disclosing their condition at work.
  • Face-to-face interviews are the #1 reported barrier to employment for neurodivergent people.

Existing support is retrospective (e.g., therapy reviewing what went wrong). NeuroCue works in the moment, in a live conversation, invisibly.


How It Works

The user wears an earpiece and opens their laptop camera. NeuroCue reads the conversation partner across three simultaneous channels:

  1. Body Language: 17 skeleton keypoints from pose estimation are converted into 14 behavioural states (e.g., crossed arms, fidgeting, leaning away).
  2. Facial Expression: A custom-trained CNN classifies 7 emotions and compound states like "polite smile" (smiling but not genuinely engaged) or "masking discomfort".
  3. Speech: Live transcription captures what is actually being said.

Every 15 seconds, all three signals are fused by Claude. The LLM reasons about cross-modal combinations and generates a single, short, actionable piece of coaching advice, which is whispered through the earpiece via text-to-speech.

"They've crossed their arms and look uncomfortable — try asking what concerns they have."
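The fusion step can be sketched as packing the three signals into one structured prompt. This is an illustrative sketch: the function name and prompt wording are assumptions, not NeuroCue's actual implementation.

```python
# Illustrative sketch of the 15-second fusion step: three perception signals
# are assembled into a single structured prompt for the LLM.
# Prompt wording and function name are assumptions, not the real code.

def build_fusion_prompt(body_states, expression, transcript):
    """Assemble the last 15 seconds of signals into one coaching prompt."""
    return (
        "You are a real-time social coach for a neurodivergent user.\n"
        f"Partner's body language (last 15s): {', '.join(body_states)}\n"
        f"Partner's facial expression: {expression}\n"
        f"Partner said: \"{transcript}\"\n"
        "Reply with ONE short, actionable piece of coaching advice."
    )

prompt = build_fusion_prompt(
    ["arms crossed", "leaning away"],
    "polite smile",
    "That all sounds fine, I suppose.",
)
```

The resulting prompt would then be sent to Claude via the Anthropic Messages API, and the reply piped to ElevenLabs TTS for the earpiece.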


System Architecture

  • Perception Layer: YOLOv11 Nano Pose (via NVIDIA A10G GPU / OpenShift) + local trigonometric heuristics for body language.
  • Facial Expression: FERNet (custom CNN trained on FER-2013) + OpenCV Haar Cascade for ~15ms per frame CPU inference.
  • Speech-to-Text: sounddevice + ElevenLabs Scribe v1.
  • Reasoning Layer: Anthropic's Claude evaluates contradictions (e.g., "smiling but arms crossed").
  • Output: ElevenLabs TTS (Turbo v2.5) audio delivery.
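One of the local trigonometric heuristics in the perception layer might look like the following sketch, which detects "crossed arms" from the 17 COCO-order pose keypoints. The index layout follows the COCO convention; the midline test is an illustrative simplification, not NeuroCue's exact rule.

```python
# Hypothetical sketch of one body-language heuristic: "crossed arms" from
# 17 COCO-order pose keypoints, each an (x, y) pair in image coordinates.
# Indices follow the COCO convention; the test itself is a simplification.

LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6
LEFT_WRIST, RIGHT_WRIST = 9, 10

def arms_crossed(kpts):
    """True when each wrist has crossed the torso midline to the other side."""
    mid_x = (kpts[LEFT_SHOULDER][0] + kpts[RIGHT_SHOULDER][0]) / 2
    # Facing the camera, the subject's left side appears at larger x, so a
    # crossed left wrist sits on the smaller-x side of the midline and the
    # crossed right wrist on the larger-x side.
    return kpts[LEFT_WRIST][0] < mid_x < kpts[RIGHT_WRIST][0]
```

In the real pipeline a rule like this would be evaluated on every pose frame, with the resulting behavioural state accumulated over the 15-second window.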

Data Flow

Webcam ─► GPU Pose Estimation ─► Body Language ─────────────┐
                                                            │
Webcam ─► Haar Face Detect ─► FERNet ─► Facial Expression ──├─► Claude ─► TTS ─► 🔊
                                                            │
Microphone ─► ElevenLabs STT ─► Transcript ─────────────────┘

Accomplishments that we are proud of

  • Three-modality fusion working end-to-end in real time — body language, facial expression, and speech all feeding into a single coherent piece of advice, with cross-modal contradiction detection.
  • Training a facial expression CNN from scratch under hackathon time pressure and integrating it into the live pipeline — from dataset download to real-time inference in under 24 hours.
  • Compound expression detection — going beyond top-1 emotion labels to detect "polite smile" (low-confidence happy + high neutral), "masking" (smiling with fear/sad undertones), and "confusion" (surprise + fear) from the probability distribution.
  • Accessible hardware — the system needs only a normal laptop and an earpiece, with no other wearable. The entire perception layer runs at real-time speeds on CPU, with only YOLO pose estimation offloaded to a remote GPU.
  • Claude's cross-modal reasoning — it catches nuance that single-modality systems miss entirely, like "they're nodding and smiling but their voice sounds uncertain and they said 'I'm not sure' — they're being polite, not agreeing."
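The compound-expression rules described above can be sketched as simple checks over the softmax distribution from the CNN. The thresholds below are illustrative assumptions, not the values tuned in the live system.

```python
# Minimal sketch of compound-expression detection over a softmax probability
# distribution for the 7 FER-2013 emotions. Thresholds are illustrative
# assumptions, not the tuned values from the live system.

def compound_expression(probs):
    """Map an emotion probability dict to a compound label, if any applies."""
    # "Polite smile": moderately happy, but neutral is also strongly present.
    if 0.25 < probs["happy"] < 0.5 and probs["neutral"] > 0.3:
        return "polite smile"
    # "Masking": clearly smiling, with fear/sad undertones underneath.
    if probs["happy"] > 0.4 and (probs["fear"] + probs["sad"]) > 0.2:
        return "masking discomfort"
    # "Confusion": surprise mixed with fear.
    if probs["surprise"] > 0.3 and probs["fear"] > 0.2:
        return "confusion"
    return max(probs, key=probs.get)  # fall back to the top-1 emotion
```

Reading the full distribution rather than the argmax is what lets the system distinguish a genuine smile from a polite one.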

Challenges Faced

  • Training a CNN during a hackathon. Wrestling with dataset paths on Windows vs Linux, Kaggle authentication for downloading FER-2013, and free-tier Colab GPU allocation running out mid-training. FER-2013's severe class imbalance required weighted sampling, class-weighted loss, and mixup augmentation to get usable accuracy on minority emotions like disgust and fear.
  • Dependencies. Pygame was required by the ElevenLabs streaming library but refused to build on our system. We rewrote audio playback to use sounddevice instead — a dependency we already had for microphone capture.
  • Latency vs efficacy. A 15-second audio window is needed for meaningful speech transcription, but advice arriving 15+ seconds after a social cue has passed is too late. We balanced this by running body language and facial expression detection continuously while batching the LLM call to the audio cycle.
  • ML inference over hackathon WiFi. YOLO inference via a Cloudflare tunnel over conference WiFi was unreliable. We compressed frames to JPEG quality 60 and resized to 640×480 before sending, trading image quality for reliability.
  • Threading bugs. Shared mutable state between three threads led to races where the UI would read half-updated body language states. A single global lock solved it but required careful placement to avoid deadlocking the video feed.
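The class-weighted loss used against FER-2013's imbalance can be sketched as inverse-frequency weights. This is a common recipe rather than the project's exact code; the function name and the mean-1.0 normalisation are assumptions.

```python
# Sketch of inverse-frequency class weighting for an imbalanced dataset like
# FER-2013, where "disgust" has far fewer examples than "happy".
# A common recipe, not the project's exact implementation.
from collections import Counter

def class_weights(labels):
    """Weight each class by inverse frequency, normalised to mean 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # Rare classes get weights above 1.0, common classes below 1.0.
    return {cls: n / (k * c) for cls, c in counts.items()}
```

Weights like these could then feed, for example, PyTorch's `CrossEntropyLoss(weight=...)` and a `WeightedRandomSampler` to upweight minority emotions such as disgust and fear.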

What we learned

  • The gap between "model works in a notebook" and "model works in a live system" is enormous. Path handling, threading, dependency conflicts, and latency constraints are the real engineering challenge — not the ML.
  • Class imbalance matters more than model architecture for real-world accuracy. Our FERNet with proper class balancing outperforms a deeper model trained naively on the same data.
  • LLMs are surprisingly effective at multi-modal fusion when given structured signal descriptions.
  • Claude caught cross-modal contradictions we hadn't explicitly programmed for.
  • Social cues are genuinely ambiguous even for neurotypical people — FER-2013's human agreement ceiling is only ~65%. This reframed our goal from "get it right" to "provide useful guidance under uncertainty."
  • Up to 93% of attitudinal communication is nonverbal (Mehrabian, 1971), yet almost all assistive technology for neurodivergent people focuses on verbal communication. There's a massive unserved gap in the nonverbal channel.

What's next for NeuroCue

  • On-device inference — replacing the remote GPU with on-device YOLO (Apple Neural Engine, Qualcomm NPU) so it works without WiFi.
  • Temporal trend analysis — tracking engagement rising or falling over 30-second windows, not just point-in-time snapshots.
  • Mobile deployment — running on a phone with AirPods as the earpiece for truly invisible, go-anywhere use.
  • Self-refinement loop — the user rates advice quality and the system learns which signals and coaching styles work best for them, personalising over time.
  • Cross-cultural adaptation — body language heuristics are currently Western-centric. Head tilting, eye contact norms, and personal space vary across cultures.
  • Smart glasses integration — visual overlays as an alternative to audio coaching for users who prefer visual input.
  • Conversation arc awareness — integrating the full conversation transcript so Claude can reference earlier context and detect evolving emotional trajectories across an entire interaction.
