Team Members

Cormac Collins (ccolli20), Dylan Hu (dhu24), Jenny Yu (jyu111), Tianren Dong (tdong6)

Introduction

We plan to implement an existing paper and develop a model for American Sign Language (ASL) translation. The objective of the paper is to translate sign language to text, thereby helping non-signers understand sign language, an important form of communication for people with impaired hearing and speech. We chose the paper because its model covers many concepts that we have touched on in class. Specifically, it uses a CNN to recognize visual and spatial features and an LSTM to capture temporal features across video frames. While we have implemented both CNN and LSTM models in our assignments, they were independent of each other. Implementing this model will give us the opportunity to combine different deep learning concepts, thereby solidifying our understanding.

Summary of an article relevant to ASL translation: Google has spent a lot of effort trying to make information accessible and useful in more than 150 languages. In 2021, they launched SignTown, an interactive web game that helps people learn about sign language and Deaf culture. It uses ML to detect the user’s ability to perform signs. When recognizing sign languages, it goes beyond just hands and includes body gestures and facial expressions in the process. Currently, the game works for Japanese Sign Language and Hong Kong Sign Language, but Google also open-sourced their core models so that developers and researchers can train and deploy their own sign language models.

Our insights: Surprisingly, there are as many sign languages as there are spoken languages around the world. While our implementation of the translation model may be naive, the topic is significant in its applications: it can potentially increase accessibility and facilitate communication between members of our community.

Public implementations that exist: None found (This is a student paper, so we didn’t find any existing open-source implementations)

Data

Tentatively, we decided to use the MS-ASL dataset from Microsoft Research. This dataset is a collection of YouTube links to ASL videos. There are a total of 6,452 distinct YouTube videos, and each video can contain multiple ASL words. For each word in each video, the dataset provides the start and end times and the dimensions of a bounding box.
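As a rough sketch of how we expect to consume the annotations, the snippet below models one clip record. The field names (`url`, `text`, `start_time`, `end_time`, `box`) follow our reading of the released annotation files but should be verified against the actual data; the values shown are made up for illustration.

```python
# Hypothetical sketch of one MS-ASL annotation record. Field names are
# assumptions to be checked against the released JSON files.
record = {
    "url": "https://www.youtube.com/watch?v=EXAMPLE",
    "text": "hello",               # gloss label for the sign
    "start_time": 12.4,            # clip start within the video (seconds)
    "end_time": 14.1,              # clip end within the video (seconds)
    "box": [0.25, 0.25, 0.75, 0.75],  # normalized [y0, x0, y1, x1]
}

def clip_info(rec):
    """Return (label, clip duration in seconds, bounding box)."""
    return rec["text"], rec["end_time"] - rec["start_time"], rec["box"]

label, duration, box = clip_info(record)
print(label, round(duration, 2))  # hello 1.7
```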

The preprocessing that we need to do includes:
  • Trimming each video to individual signs based on the provided start and end times (some videos contain signs for more than one word)
  • Cropping the video using the provided bounding box to remove unwanted space
  • Converting the videos to frames so they can be fed into the network

Methodology

What is the architecture of your model?

Our model will consist of a CNN (Inception or a Capsule Network) followed by a fully connected layer, then finally an LSTM (RNN) layer. The CNN layer is used to detect gestures, and the RNN layer is used to translate gestures into English text. The output of the CNN softmax and max-pooling layers is fed to the LSTM RNN for gesture recognition.
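As a shape-level sketch of this architecture in Keras, the code below wires a small stand-in per-frame CNN (a pretrained Inception could be substituted via `tf.keras.applications.InceptionV3`) through a fully connected layer into an LSTM. All sizes (frame count, resolution, vocabulary size) are placeholder assumptions, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES, H, W = 16, 64, 64   # toy clip shape
NUM_CLASSES = 100               # hypothetical gesture vocabulary size

# Per-frame CNN (stand-in for Inception / Capsule Network).
frame_in = tf.keras.Input(shape=(H, W, 3))
x = layers.Conv2D(32, 3, activation="relu")(frame_in)
x = layers.MaxPooling2D()(x)
x = layers.GlobalAveragePooling2D()(x)
frame_feat = layers.Dense(128, activation="relu")(x)  # fully connected layer
frame_cnn = tf.keras.Model(frame_in, frame_feat)

# Apply the CNN to every frame, then model the sequence with an LSTM.
clip_in = tf.keras.Input(shape=(NUM_FRAMES, H, W, 3))
seq = layers.TimeDistributed(frame_cnn)(clip_in)      # (NUM_FRAMES, 128)
seq = layers.LSTM(64)(seq)                            # temporal features
out = layers.Dense(NUM_CLASSES, activation="softmax")(seq)
model = tf.keras.Model(clip_in, out)
```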

How are you training the model?

The CNN and the LSTM RNN are trained and evaluated separately. Specifically, we train the CNN to identify gesture segments from the frames. The RNN is then trained to classify the sequences of gestures identified by the CNN into gesture classes. The LSTM is used specifically to take into account the time dependency of the gesture sequences.
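A two-stage pipeline like this might look as follows: the (pre)trained CNN is frozen and used purely as a per-frame feature extractor, and the LSTM is then trained on the resulting feature sequences. Layer sizes and data here are toy placeholders, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stage 1 stand-in: a (pre)trained per-frame CNN, frozen as a feature extractor.
cnn = models.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),      # -> 8-dim feature per frame
])
cnn.trainable = False

# Stage 2: an LSTM classifier trained on the CNN's per-frame features.
num_classes = 5
rnn = models.Sequential([
    tf.keras.Input(shape=(None, 8)),      # variable-length feature sequences
    layers.LSTM(16),
    layers.Dense(num_classes, activation="softmax"),
])
rnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy batch: 4 clips of 10 frames each.
clips = np.random.rand(4, 10, 32, 32, 3).astype("float32")
feats = cnn.predict(clips.reshape(-1, 32, 32, 3), verbose=0).reshape(4, 10, 8)
labels = np.array([0, 1, 2, 3])
rnn.fit(feats, labels, epochs=1, verbose=0)   # only the LSTM's weights update
```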

If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

For the CNN part of the model, we will use transfer learning on a pre-trained CNN. However, the RNN will be built from scratch. Implementing the details of the RNN will be the most challenging part, as the paper does not give enough detail about the LSTM RNN. Linking the LSTM to the output of the CNN will also be challenging, because we did not build the CNN ourselves. In addition, we may need to fine-tune many parameters, because the paper uses a much smaller dataset.

Metrics

What constitutes “success?” We hope to achieve a model that, based on a qualitative analysis of results, is able to recognize gestures and produce a translated sequence which at least somewhat resembles the target sentence. Ideally, we are aiming to at least match the accuracy in the paper, and we hope that the variations we implement will result in even better performance, especially for inputs not sourced from the dataset.

What experiments do you plan to run?

We plan to experiment with different types of data augmentation to increase our dataset, and we will analyze the effect of various augmentation methods on the model’s performance.
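As a sketch of the kinds of clip-level augmentations we might evaluate, the function below applies a random horizontal flip, brightness jitter, and temporal subsampling with NumPy. These choices are illustrative; notably, horizontal flipping mirrors handedness, which may change a sign's meaning, so whether it helps is exactly the sort of question these experiments would answer.

```python
import numpy as np

def augment(frames, rng):
    """Apply simple, illustrative augmentations to one clip.

    frames: (T, H, W, 3) float array with values in [0, 1].
    """
    out = frames
    if rng.random() < 0.5:
        out = out[:, :, ::-1]              # horizontal flip (mirrors handedness!)
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
    if rng.random() < 0.5:
        out = out[::2]                     # temporal subsampling (speed change)
    return out

rng = np.random.default_rng(0)
clip = np.full((16, 32, 32, 3), 0.5)
aug = augment(clip, rng)
```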

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

The traditional notion of accuracy for machine translation applies to this model, as we are translating between sequences from a source language to a target language. The same accuracy metric as in Homework 4 can be applied here.
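For concreteness, a common form of this metric is per-token accuracy with padding masked out; this is a sketch of that idea (the exact Homework 4 definition may differ, and `pad_id=0` is an assumption).

```python
import numpy as np

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Fraction of non-padding tokens predicted correctly."""
    mask = y_true != pad_id                 # ignore padding positions
    correct = (y_true == y_pred) & mask
    return correct.sum() / mask.sum()

truth = np.array([[5, 7, 2, 0, 0]])         # 3 real tokens, 2 pads
pred = np.array([[5, 9, 2, 0, 0]])          # 2 of the 3 real tokens correct
print(masked_accuracy(truth, pred))         # 0.666...
```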

If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.

In the original paper, the authors analyzed the performance of a combined CNN and RNN model to extract features from videos of American Sign Language. They compared the final translation performance between the output of the CNN softmax layer and the output of the CNN pooling layer as input to the RNN. They calculated accuracy as in Homework 4.

What are your base, target, and stretch goals?
  • Base goal: put together the data pipeline and model.
  • Target goal: achieve results similar to the paper's.
  • Stretch goal: make improvements to the model based on the improvements detailed in the paper.

Ethics

What broader societal issues are relevant to your chosen problem space? Our project is within the space of accessibility - specifically, accessibility for people who are deaf or speech impaired, especially in circumstances where direct person-to-person interpretation is not possible. An example use of a model like this would be captioning ASL gestures in English text during a video call. However, it is important to note that designing systems that are not ableist is not strictly a technical problem. Even high-performing, scalable models of the sort we are creating improve accessibility in some regards, but they must be designed and considered in a broader context.

Why is Deep Learning a good approach to this problem? Deep learning is a suitable approach because of its success in image classification and natural language processing. It is well suited to mapping video frames to English text, and this project combines many of the core concepts we have encountered in class thus far.

Checkin

Checkin 3 Reflection

Final Submission

Write Up: https://docs.google.com/document/d/1q-NIMbXl-GQOCh4751qn8gCsoeg_FHhCegY1ZQsq290/edit?usp=sharing

Built With

  • tensorflow