Inspiration

The inspiration for Echo-Sign came from a simple observation: while the world is becoming more connected through AI, the Deaf and hard-of-hearing communities still face massive barriers in everyday "hearing" environments. We wanted to build a tool that wasn't just a static dictionary, but a real-time bridge—something that could turn the fluidity of American Sign Language (ASL) into spoken English instantly, using nothing more than a standard laptop webcam.

What it does

Echo-Sign is a comprehensive accessibility platform designed to facilitate seamless, two-way communication between the Deaf and hearing communities. At its core, the application utilizes a high-speed computer vision pipeline that tracks 21 distinct 3D hand landmarks via a standard webcam. This spatial data is mathematically normalized and processed by a local Large Language Model (Llama 3.2), which interprets complex hand shapes and movement velocities into natural English. The results are displayed in a real-time live translation log and simultaneously broadcast through a Text-to-Speech engine, allowing the signer's message to be heard as well as seen.
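The page doesn't publish Echo-Sign's actual normalization code, but the "mathematically normalized" step can be sketched as follows, assuming MediaPipe's hand-landmark indexing (landmark 0 = wrist, landmark 9 = middle-finger MCP). The function name is our own illustration:

```python
import numpy as np

def normalize_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Normalize 21 (x, y, z) hand landmarks so that camera position
    and hand size no longer matter.

    In MediaPipe's hand model, landmark 0 is the wrist and landmark 9
    is the middle-finger MCP joint.
    """
    assert landmarks.shape == (21, 3)
    # Translate: make the wrist the origin, removing hand position.
    centered = landmarks - landmarks[0]
    # Scale: divide by the wrist-to-middle-MCP distance, so a hand close
    # to the camera produces the same numbers as one far away.
    scale = np.linalg.norm(centered[9])
    return centered / max(scale, 1e-6)
```

Normalizing like this means the downstream model sees the same numbers for the same handshape regardless of where the hand sits in the frame.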

To close the communication loop, Echo-Sign includes a "Reverse Translation" feature that allows non-signers to type text and receive instant visual ASL feedback. The system intelligently pulls from a library of common sign animations and provides an automated fingerspelling fallback for names or specialized vocabulary. By running entirely on local hardware, Echo-Sign ensures total user privacy while providing a robust, nuanced interpretation tool that understands the difference between static hand positions and the fluid intent of active signing.
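The "library sign, else fingerspell" fallback described above can be sketched in a few lines. The sign library and key format here are hypothetical placeholders, not the project's real asset names:

```python
# Hypothetical library of words that have full-sign animations on disk.
SIGN_LIBRARY = {"hello", "please", "help"}

def plan_playback(text: str) -> list[str]:
    """Map input text to a sequence of animation keys: a whole-word
    sign when the library has one, otherwise one fingerspelled
    letter clip per character of the word."""
    plan = []
    for word in text.lower().split():
        if word in SIGN_LIBRARY:
            plan.append(f"sign:{word}")
        else:
            # Fallback for names or specialized vocabulary.
            plan.extend(f"letter:{ch}" for ch in word if ch.isalpha())
    return plan
```

A renderer would then play each key in order, drawing either a full-sign GIF or a fingerspelling frame.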

How we built it

We designed Echo-Sign using a "Feature Extraction" architecture, which we call the Eye-Math-Brain pipeline:

- The Eye: We used MediaPipe and OpenCV to track 21 hand landmarks in 3D space at 30+ FPS.
- The Math Layer: To reduce the "noise" of human movement, we calculated spatial vectors between landmarks.
- The Brain: Instead of hardcoding every sign, we passed these normalized data points to Llama 3.2 (via Ollama). This allowed the AI to reason about the intent of the movement, providing a much more natural translation than simple pattern matching.
- The Interface: A Streamlit dashboard provides a live video feed, a real-time translation log, and a "Reverse Translation" tool that converts text back into ASL GIFs.
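The Math Layer's "spatial vectors between landmarks" could look something like this minimal sketch: a small, stable feature vector instead of 63 raw coordinates. The specific features are our illustration, assuming MediaPipe's fingertip indices, not the project's exact math:

```python
import numpy as np

# MediaPipe fingertip landmark indices (thumb, index, middle, ring, pinky).
FINGERTIPS = [4, 8, 12, 16, 20]
WRIST = 0

def handshape_features(landmarks: np.ndarray) -> np.ndarray:
    """Reduce 21 noisy 3D points to five numbers: each fingertip's
    distance from the wrist, normalized by palm size (wrist to
    middle-finger MCP, landmark 9). Extended fingers give large
    values; a closed fist gives small ones."""
    palm = np.linalg.norm(landmarks[9] - landmarks[WRIST])
    tips = landmarks[FINGERTIPS] - landmarks[WRIST]
    return np.linalg.norm(tips, axis=1) / max(palm, 1e-6)
```

Compact features like these are far less sensitive to jitter than raw pixel coordinates, which is the point of the Math Layer.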

Challenges we ran into

The biggest hurdle was "Dependency Hell." Mid-development, the NumPy 2.0 release broke the MediaPipe landmark engine. We had to deep-dive into environment management, eventually stabilizing the app by pinning specific versions:

- numpy < 2.0.0
- Python 3.11
- isolated virtual environments (venv)

We also struggled with latency. Running a Large Language Model while processing 30 video frames per second can freeze a computer. We solved this by implementing a non-blocking buffer system: the "Eye" keeps watching while the "Brain" thinks in the background, ensuring the video feed never stutters.
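The non-blocking buffer idea can be demonstrated with a standard producer/consumer pattern. This is a self-contained simulation (a sleep stands in for LLM inference), not Echo-Sign's actual code:

```python
import queue
import threading
import time

# Tiny bounded buffer: when the Brain falls behind, frames are dropped
# rather than making the capture loop wait.
frame_queue: queue.Queue = queue.Queue(maxsize=2)
results = []

def brain_worker():
    """The 'Brain': drains frames in the background at its own pace."""
    while True:
        frame = frame_queue.get()
        if frame is None:          # sentinel: shut down cleanly
            break
        time.sleep(0.01)           # stand-in for slow LLM inference
        results.append(frame)

worker = threading.Thread(target=brain_worker)
worker.start()

for frame_id in range(100):        # the 'Eye': 100 simulated camera frames
    try:
        frame_queue.put_nowait(frame_id)  # never block the video loop
    except queue.Full:
        pass                       # drop this frame; the feed stays smooth

frame_queue.put(None)              # blocking is fine here: capture is done
worker.join()
```

The capture loop finishes almost instantly even though the worker is slow; most frames are dropped, but the ones that get through are processed in order.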

Accomplishments that we're proud of

- Real-Time Stability: We successfully achieved a consistent 30 FPS landmark tracking rate, ensuring the user experience feels fluid and responsive rather than laggy.
- The "Hybrid" Brain: We are particularly proud of our decision to use Llama 3.2 via Ollama. Moving beyond simple "if-this-then-that" logic allowed our system to interpret the nuances of hand speed and motion, making it feel like a true translation engine.

What we learned

- Environmental Context Matters: We learned that lighting, background clutter, and webcam quality drastically affect computer vision. This taught us the importance of data normalization and "noise filtering" in the math layer.
- Modern AI Integration: This project was a deep dive into local LLMs. We learned how to structure prompts for multimodal reasoning, turning raw coordinate data into descriptive text that an AI can understand.
- Dependency Management: We gained a new appreciation for virtual environments and the importance of pinning library versions (e.g., numpy < 2.0.0) when working with rapidly evolving ecosystems like Python and AI.
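"Turning raw coordinate data into descriptive text" could work like the sketch below: render numeric features as plain-English observations before they reach the LLM. The feature names and prompt wording are our own illustration, not the project's actual prompt:

```python
def landmarks_to_prompt(features: dict) -> str:
    """Render numeric hand features as plain-English observations an
    LLM can reason over, instead of dumping 63 raw coordinates."""
    lines = ["You are an ASL interpreter. Observations from one video window:"]
    for finger, extension in features["finger_extension"].items():
        state = "extended" if extension > 0.7 else "curled"
        lines.append(f"- {finger} finger is {state} (extension={extension:.2f})")
    lines.append(f"- hand velocity: {features['speed']:.2f} (0=still, 1=fast)")
    lines.append("Reply with the single most likely English word being signed.")
    return "\n".join(lines)
```

The resulting string would be sent to the local model (e.g., via Ollama's chat endpoint); describing the geometry in words plays to what a text-trained LLM is actually good at.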

What's next for Echo-Sign

- Facial Expression Analysis: ASL relies heavily on facial grammar. Our next step is to integrate MediaPipe Face Mesh to capture eyebrow movements and mouth shapes for more accurate translation.
- Deploying on Different Devices: Shifting from laptop webcams to other cameras, such as those on mobile devices or even smart glasses.
- Broader Recognized Vocabulary: The current version recognizes only a few signs and can display a handful of common ones; we aim to make it far more comprehensive by expanding the recognized ASL vocabulary.

Built With

llama, mediapipe, ollama, opencv, python, streamlit
