Inspiration
The inspiration for Echo-Sign came from a simple observation: while the world is becoming more connected through AI, the Deaf and hard-of-hearing communities still face massive barriers in everyday "hearing" environments. We wanted to build a tool that wasn't just a static dictionary, but a real-time bridge—something that could turn the fluidity of American Sign Language (ASL) into spoken English instantly, using nothing more than a standard laptop webcam.
What it does
Echo-Sign is a comprehensive accessibility platform designed to facilitate seamless, two-way communication between the Deaf and hearing communities. At its core, the application utilizes a high-speed computer vision pipeline that tracks 21 distinct 3D hand landmarks via a standard webcam. This spatial data is mathematically normalized and processed by a local Large Language Model (Llama 3.2), which interprets complex hand shapes and movement velocities into natural English. The results are displayed in a real-time live translation log and simultaneously broadcast through a Text-to-Speech engine, allowing the signer's message to be heard as well as seen.
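The landmark normalization step described above can be sketched as follows. This is an illustrative assumption of how such a step might look, not Echo-Sign's actual code: it anchors the 21 landmarks to the wrist and divides by a hand-size proxy so that translation and distance from the camera no longer matter.

```python
import numpy as np

def normalize_landmarks(landmarks):
    """Normalize 21 (x, y, z) hand landmarks so translation and hand
    size no longer matter (illustrative sketch, not the project code).

    landmarks: array-like of shape (21, 3), as produced by a
    hand-tracking library such as MediaPipe.
    """
    pts = np.asarray(landmarks, dtype=float)
    # Translate so the wrist (landmark 0) sits at the origin.
    pts = pts - pts[0]
    # Scale by the wrist-to-middle-finger-base distance (landmark 9),
    # a common proxy for overall hand size.
    scale = np.linalg.norm(pts[9])
    if scale > 0:
        pts = pts / scale
    return pts
```

With this in place, the same sign made close to the camera or far away produces near-identical feature values, which is what lets a downstream model reason about shape rather than position.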
To close the communication loop, Echo-Sign includes a "Reverse Translation" feature that allows non-signers to type text and receive instant visual ASL feedback. The system intelligently pulls from a library of common sign animations and provides an automated fingerspelling fallback for names or specialized vocabulary. By running entirely on local hardware, Echo-Sign ensures total user privacy while providing a robust, nuanced interpretation tool that understands the difference between static hand positions and the fluid intent of active signing.
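The "Reverse Translation" lookup with its fingerspelling fallback can be sketched like this. The names and the tiny sign library are hypothetical placeholders, not the project's real data:

```python
# Known words map to sign-animation files; anything not in the
# library falls back to letter-by-letter fingerspelling.
# SIGN_LIBRARY and to_asl are illustrative names, not Echo-Sign's code.
SIGN_LIBRARY = {"hello": "hello.gif", "thanks": "thanks.gif"}

def to_asl(phrase):
    """Convert typed text into a sequence of display steps:
    ("sign", gif_path) for known words, ("letter", ch) otherwise."""
    steps = []
    for word in phrase.lower().split():
        if word in SIGN_LIBRARY:
            steps.append(("sign", SIGN_LIBRARY[word]))
        else:
            # Fingerspelling fallback: one handshape per letter,
            # covering names and specialized vocabulary.
            steps.extend(("letter", ch) for ch in word if ch.isalpha())
    return steps
```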
How we built it
We designed Echo-Sign using a "Feature Extraction" architecture, which we call the Eye-Math-Brain pipeline:

- The Eye: We used MediaPipe and OpenCV to track 21 hand landmarks in 3D space at 30+ FPS.
- The Math Layer: To reduce the "noise" of human movement, we calculated spatial vectors between landmarks.
- The Brain: Instead of hardcoding every sign, we passed these normalized data points to Llama 3.2 (via Ollama). This allowed the AI to reason about the intent of the movement, providing a much more natural translation than simple pattern matching.
- The Interface: A Streamlit dashboard provides a live video feed, a real-time translation log, and a "Reverse Translation" tool that converts text back into ASL GIFs.
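The "Math Layer" step can be sketched in Python. The specific feature choices below (wrist-to-fingertip vectors, average landmark speed across frames) are illustrative assumptions, not the project's exact feature set:

```python
import math

def hand_features(prev, curr, dt=1 / 30):
    """Toy version of the 'Math Layer': turn two consecutive frames of
    21 (x, y, z) landmarks into spatial vectors and a motion speed.
    Names and feature choices are illustrative assumptions."""
    # Spatial vectors from the wrist (landmark 0) to each fingertip
    # (MediaPipe hand-landmark indices 4, 8, 12, 16, 20).
    wrist = curr[0]
    vectors = [tuple(curr[i][k] - wrist[k] for k in range(3))
               for i in (4, 8, 12, 16, 20)]
    # Average landmark speed between frames: low speed suggests a held
    # (static) sign, high speed suggests active signing.
    speed = sum(
        math.dist(c, p) for c, p in zip(curr, prev)
    ) / (len(curr) * dt)
    return vectors, speed
```

Separating shape (the vectors) from motion (the speed) is what lets the downstream model distinguish a held hand position from fluid, intentional signing.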
Challenges we ran into
The biggest hurdle was "Dependency Hell." Mid-development, the NumPy 2.0 release broke the MediaPipe landmark engine. We had to deep-dive into environment management, eventually stabilizing the app by pinning specific versions:

- numpy < 2.0.0
- Python 3.11
- isolated virtual environments (venv)

We also struggled with latency: running a Large Language Model while processing 30 video frames per second can freeze a computer. We solved this by implementing a non-blocking buffer system. The "Eye" keeps watching while the "Brain" thinks in the background, ensuring the video feed never stutters.
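A non-blocking buffer like the one described can be sketched with a bounded queue and a background worker. All names here are illustrative, and the string append stands in for the actual LLM call:

```python
import queue
import threading

# Bounded queue between the capture loop ("Eye") and the model
# worker ("Brain"). When it fills up, old frames are dropped so the
# capture loop never waits on the model. Assumes a single producer.
frames = queue.Queue(maxsize=8)
results = []

def eye(frame):
    """Called once per video frame; must never block."""
    try:
        frames.put_nowait(frame)
    except queue.Full:
        try:
            frames.get_nowait()      # drop the oldest frame
        except queue.Empty:
            pass                     # the brain just drained the queue
        try:
            frames.put_nowait(frame)
        except queue.Full:
            pass                     # give up on this frame

def brain():
    """Background worker: consumes frames and runs the slow model."""
    while True:
        frame = frames.get()
        if frame is None:            # shutdown sentinel
            break
        results.append(f"interpreted:{frame}")  # stand-in for the LLM

worker = threading.Thread(target=brain, daemon=True)
worker.start()
```

The key design choice is dropping frames instead of queueing them indefinitely: a translation a few frames stale is fine, but a frozen video feed is not.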
Accomplishments that we're proud of
- Real-Time Stability: We successfully achieved a consistent 30 FPS landmark tracking rate, ensuring the user experience feels fluid and responsive rather than laggy.
- The "Hybrid" Brain: We are particularly proud of our decision to use Llama 3.2 via Ollama. Moving beyond simple "if-this-then-that" logic allowed our system to interpret the nuances of hand speed and motion, making it feel like a true translation engine.
What we learned
- Environmental Context Matters: We learned that lighting, background clutter, and webcam quality drastically affect computer vision. This taught us the importance of data normalization and "noise filtering" in the math layer.
- Modern AI Integration: This project was a deep dive into local LLMs. We learned how to structure prompts for multimodal reasoning, turning raw coordinate data into descriptive text that an AI can understand.
- Dependency Management: We gained a new appreciation for virtual environments and for pinning library versions (e.g., NumPy < 2.0.0) when working with rapidly evolving ecosystems like Python AI tooling.
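The "coordinates into descriptive text" idea can be sketched as a prompt builder. The prompt wording, feature names, and thresholds below are all assumptions for illustration, not Echo-Sign's actual prompt:

```python
def build_prompt(fingers, speed):
    """Turn numeric hand features into a text prompt for a local LLM
    (illustrative sketch; wording and thresholds are assumptions).

    fingers: mapping of finger name -> normalized extension in [0, 1]
    speed: average landmark speed (higher = more active motion)
    """
    finger_desc = ", ".join(
        f"{name} {'extended' if ext > 0.5 else 'curled'}"
        for name, ext in fingers.items()
    )
    motion = "holding a static sign" if speed < 0.2 else "actively signing"
    return (
        "You are an ASL interpreter. The signer's hand shows: "
        f"{finger_desc}. The hand is {motion}. "
        "Reply with the most likely English word or phrase only."
    )
```

Describing the geometry in words rather than sending raw coordinates is what lets a text-only model like Llama 3.2 reason about the sign at all.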
What's next for Echo Sign
- Facial Expression Analysis: ASL relies heavily on facial grammar. Our next step is to integrate MediaPipe Face Mesh to capture eyebrow movements and mouth shapes for more accurate translation.
- Deploying on Different Devices: Shifting from laptop webcams to other cameras, such as those on mobile devices or even smart glasses.
- Broader Recognized Vocabulary: The current version recognizes only a few signs and can display a select set of common ones; we aim to make it far more comprehensive by recognizing many more ASL signs.