Inspiration

Many deaf and hard-of-hearing children in the U.S.—an estimated 300,000—face significant barriers to language development, especially when they are exposed to American Sign Language (ASL) later in life. Research has shown that around 70% of children lack access to ASL (largely due to the fact that over 90% of deaf children are born to non-deaf parents) and that delayed exposure to ASL can lead to long-term challenges in language proficiency. To help address this, we built a Chrome extension and web app that generate accurate, kid-friendly 2D animations expressing ASL from any text input using ML/CV techniques. Our goal is to support early ASL acquisition by enabling young children to simultaneously read and communicate ASL, fostering stronger language foundations and bridging the gap between reading and communication.

What it does

  • 2D Animation Generation: Users interact with a friendly animated bear that visually signs the text input in real-time. The bear performs realistic body and hand movements that closely mimic how a real person signs. By combining ASL gestures with written words, the app helps kids develop both reading skills and ASL comprehension in an engaging and interactive way.
  • Web App PDF-to-Sign: Users can upload entire PDFs to our web application and can click on individual sentences within the document to generate videos of the animated bear signing those sentences, providing an accessible way to engage with ASL alongside written text of any length.
  • Chrome Extension Highlight-to-Sign: Users can highlight any sentence on a webpage, triggering the animated bear to pop up and sign the selected text in real-time, making ASL translation instantly available while browsing.

How we built it

  • English-to-ASL Gloss Translation: Converts English text into ASL gloss format using the Claude API by taking English input (from text, file, or batch) and applying structured ASL grammar rules via a specified system prompt.
  • ASL Video Webscraper: Videos of a person signing specific words are webscraped, automating the process of collecting, downloading, and organizing ASL video data mapped to meanings.
  • ASL Video Storage: The human-ASL videos are stored in a Cloudflare bucket for easy retrieval from our vector database that maps words and their embeddings to their ASL video clip.
  • ASL Gloss to Video Vector-Similarity Search: We built an encoder that converts Gloss word meanings into 300-dimensional vector embeddings using Word2Vec (an NLP model that encodes words as semantic numeric vectors). We encoded all the words in our ASL video vocabulary and stored their embeddings in a vector database on Supabase, with a key-value lookup that points to a short video of a human signing the word. During inference, we encode Gloss tokens and query Supabase to find the most semantically similar word with a matching ASL video.
  • Pose & Hand Extraction: Leveraging Google’s MediaPipe framework, we accurately detect and extract 2D keypoints for body poses and detailed hand landmarks from every frame of the webscraped videos, providing precise spatial data needed to animate sign language gestures.
  • Keypoint Interpolation: To ensure fluid and natural motion between different signing sequences, we apply interpolation techniques that smoothly blend the transition from the final frame of one video to the initial frame of the next, minimizing abrupt jumps or visual discontinuities.
  • Bear Rendering: Using OpenCV’s drawing tools, we dynamically map the extracted pose and hand keypoints onto a custom-designed animated bear character, animating its joints, arms, and fingers in each frame to visually replicate ASL signing.
  • Hugging Face Space for the Backend API: The entire inference pipeline (from text --> ASL Gloss --> vector embeddings --> human ASL video clips --> body and hand keypoint extraction --> video interpolation --> bear animation) is open-sourced as an interactive Gradio interface on Hugging Face Spaces! This allowed for rapid iteration and testing. The Hugging Face Space also acts as an API that our frontend and Chrome extension can call to access our AI-powered backend.
  • Web App Tech Stack: React, TypeScript, Tailwind CSS, Vite
  • Chrome Extension Tech Stack: JavaScript, HTML/CSS
  • Backend Tech Stack: Claude, Gensim, Word2Vec, Hugging Face
  • Storage: Supabase, Cloudflare
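As a rough illustration of the gloss-translation step, here's a minimal sketch of how an English-to-gloss request to the Claude messages API might be assembled. The system-prompt wording and model name below are illustrative assumptions, not our exact production prompt:

```python
# Sketch of the English -> ASL gloss request payload (prompt text is illustrative).

GLOSS_SYSTEM_PROMPT = (
    "You are an English-to-ASL-gloss translator. "
    "Rewrite the input in ASL gloss: uppercase signs, topic-comment order, "
    "drop articles and copulas (e.g. 'I am going to the store' -> 'STORE I GO')."
)

def build_gloss_request(english_text: str, model: str = "claude-3-haiku-20240307") -> dict:
    """Build the messages-API payload sent to Claude for gloss translation."""
    return {
        "model": model,
        "max_tokens": 256,
        "system": GLOSS_SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": english_text}],
    }

payload = build_gloss_request("I am going to the store")
```

In production this payload is sent through the Anthropic SDK; the sketch only shows how the structured ASL grammar rules live in the system prompt while each sentence arrives as a user message.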
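The vector-similarity lookup boils down to a cosine-similarity search over word embeddings. Here's a toy, self-contained sketch of the idea; the tiny hand-made vectors and file paths are stand-ins (the real system uses 300-dim Word2Vec embeddings queried through Supabase's vector database):

```python
import math

# Toy stand-in for the Supabase vector table: word -> (embedding, video clip).
VIDEO_TABLE = {
    "HAPPY": ([0.9, 0.1, 0.0], "videos/happy.mp4"),
    "SAD":   ([-0.8, 0.2, 0.1], "videos/sad.mp4"),
    "EAT":   ([0.1, 0.9, 0.2], "videos/eat.mp4"),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_video(query_vec):
    """Return the vocabulary word (and its clip) most similar to the query embedding."""
    best_word, (_, best_url) = max(
        VIDEO_TABLE.items(), key=lambda kv: cosine(query_vec, kv[1][0])
    )
    return best_word, best_url
```

This is why the bear can sign sentences containing words that aren't literally in our vocabulary: the encoder maps them near a semantically similar word that does have a clip.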
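The keypoint interpolation between clips is essentially linear blending of landmark coordinates. A minimal sketch (frame format simplified to 2D `(x, y)` points, matching what we pull from MediaPipe):

```python
def lerp_frames(frame_a, frame_b, n_steps):
    """Generate transition frames between two keypoint frames.

    frame_a, frame_b: lists of (x, y) keypoints in the same order, e.g. the
    last frame of one signing clip and the first frame of the next.
    Returns n_steps intermediate frames that bridge the two clips smoothly.
    """
    frames = []
    for i in range(1, n_steps + 1):
        t = i / (n_steps + 1)  # interpolation factor, strictly between 0 and 1
        frames.append([
            (ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(frame_a, frame_b)
        ])
    return frames
```

Each interpolated frame is then rendered onto the bear exactly like a real extracted frame, so the hand-off between words looks continuous instead of jumping.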

Challenges we ran into

  • Connecting our entire end-to-end flow was difficult due to the complexity and sheer number of steps involved. Specifically, getting the Hugging Face API's video output to display on our frontend was challenging because it required precise coordination between asynchronous API responses and real-time video rendering.
  • Finding the right video data to create our animation was really hard, since smooth consistency across frames requires as much standardization as possible. We were lucky to find a huge data source of the same person signing thousands of words!

Accomplishments that we're proud of

  • We were able to create an ML/CV pipeline for both a Chrome extension and a web app to generate a 2D animated bear from any input text in real-time. Besides looking super cute, our apps help address the nationwide problem of deaf and hard-of-hearing kids not getting early access to ASL education by combining reading skills with interactive ASL learning.
  • Our methodology of storing and querying vector embeddings ensures that the bear signs the most semantically similar ASL gloss for each word, resulting in highly accurate and contextually appropriate translations. This vector search approach allows the app to handle a wide range of vocabulary dynamically, improving the quality and reliability of the ASL animations for diverse input sentences.
  • To generate our bear, we actually ended up creating one of the largest available datasets mapping ASL Gloss meanings to videos!

What we learned

  • We gained experience reconstructing a smooth video of a custom avatar from disjoint clips using MediaPipe for body and hand mapping.
  • We learned how to create our own vector embeddings from words and leverage Supabase's vector database technologies!
  • We learned the importance of building and testing each component piece by piece first; this approach allowed us to move quickly, implement features in parallel, and efficiently connect everything later. Constant communication and pair programming were really important in helping us debug faster and work effectively as a team.
  • With our own app, we learned a ton of ASL easily!

What's next for AI-SL

  • We hope to implement more robust handling for out-of-vocabulary (OOV) words. Currently, for words that can't be encoded by our word-to-vector encoder, we spell out each letter of the word (e.g. Oski --> O S K I). To handle cases like compound words or proper nouns more smoothly, we could add an LLM agent that searches for synonyms. Additionally, we'd like to incorporate attention and context so we can encode multi-word phrases and map them to multi-word ASL videos.
  • With more time, we'd like to implement other forms of sign language, as ASL is just one of many ways that deaf/hard-of-hearing people communicate in the world. Everyone deserves a way to communicate as easily as possible, and we believe that education for it should be as accessible as possible.
  • We'd also like to give more options than just a bear for the 2D animations! Future improvements include making the generated avatar more 3D and making the avatar's actions smoother.
  • We'd like to expand our avatar generation to include ASL videos from multiple sources and multiple signers. When initially iterating, we used ASL videos from just a single person as a proof of concept to see if our video-stitching idea was feasible. But since MediaPipe is identity-agnostic, we could expand our ASL vocabulary even more by using ASL videos from multiple people if we had time to add them to our database!
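The current fingerspelling fallback for OOV words described above can be sketched in a few lines; the function and vocabulary names here are hypothetical:

```python
def gloss_to_clips(tokens, vocab):
    """Map gloss tokens to video keys, fingerspelling any out-of-vocabulary word.

    tokens: uppercase gloss tokens from the translation step.
    vocab:  set of words that have a signing clip in the database.
    """
    clips = []
    for token in tokens:
        if token in vocab:
            clips.append(token)
        else:
            # OOV fallback: spell the word letter by letter (e.g. OSKI -> O S K I),
            # relying on per-letter fingerspelling clips in the database.
            clips.extend(ch for ch in token if ch.isalpha())
    return clips
```

A synonym-searching LLM agent would slot in right before the letter-by-letter branch, giving the word one more chance to match a real clip before we fall back to fingerspelling.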
