Title
ASLAN: American Sign LAnguage Network
Who
Emily Ye (eye4), Kevin Hsu (khsu13), Gareth Mansfield (gmansfie)
Our final writeup can be found here.
Check-in #2
Introduction
We hope to implement a convolutional neural network to classify the letters of the American Sign Language alphabet. We chose this topic because we feel that deep learning has the potential to improve quality of living for the deaf community by providing an easier way to communicate with those who may not know ASL. This problem is a classification problem. We are basing our work on A Deep Convolutional Neural Network Approach for Static Hand Gesture Recognition (2020).
Challenges
- When we implemented the proposed model architecture and hyperparameters from the research paper, we were unable to obtain the same results. Furthermore, we noticed that there were some issues with the layers in the research paper's model, such as using stride sizes larger than filter sizes (which results in loss of information) and using filters much larger than the image itself. Thus, we may need to look for a different research paper or use the Towards Data Science article as a basis for our work.
- Separately implementing the model proposed in the Towards Data Science article also failed to produce similar results, indicating that there could have been issues with preprocessing, so we have also been looking into other ways to preprocess our data or possible areas where we could have made mistakes.
- One of our datasets (Fingerspelling A) had a broken link, so we had to use `wget` to obtain the raw HTML and then `wget` again to obtain the actual data. In addition, since the dataset provided raw images rather than cleaned-up data, we had to preprocess them into a usable format.
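One of the architectural issues mentioned above can be checked directly: when a convolution's stride exceeds its filter size, some input pixels fall in the gaps between windows and never influence the output. The sizes below (28-pixel width, 3x3 filter, stride 4) are illustrative assumptions, not the paper's exact values.

```python
# Show which input columns a strided convolution never reads when
# stride > kernel size (illustrative sizes, not the paper's exact ones).
def covered_columns(width, kernel, stride):
    covered = set()
    for start in range(0, width - kernel + 1, stride):
        covered.update(range(start, start + kernel))
    return covered

width, kernel, stride = 28, 3, 4   # hypothetical layer configuration
missed = sorted(set(range(width)) - covered_columns(width, kernel, stride))
print(missed)  # columns no filter window ever touches
```

With stride 4 and a width-3 filter, every fourth column is skipped entirely, which is the information loss described above.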
Insights
Our current model is performing below expectations, but we are continuing to modify hyperparameters and the model architecture in hopes of improving performance.
Plans
- Revisit preprocessing: since we were unable to get either our original research paper's or Towards Data Science's proposed model working, we suspect there may be an issue in our underlying data, so we will continue to reevaluate our preprocessing methods.
- Modify hyperparameters and model architecture: we suspect that the original research paper may have some issues with its proposed model and hyperparameters, so we hope that we can adjust those to improve the model's performance.
- Allow for camera input and recognize words rather than just individual letters: these are next steps we hope to take so that our project can handle custom input (i.e. the user signing in front of a webcam), but they are not a priority until we have a working model.
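Since preprocessing is a suspected failure point, a minimal baseline pipeline is worth pinning down. The sketch below assumes 28x28 grayscale inputs (as in Sign Language MNIST) and does only the two steps most models expect: scale pixels to [0, 1] and add a channel axis.

```python
import numpy as np

# Minimal preprocessing sketch (assumes 28x28 grayscale images):
# scale 0-255 pixel values into [0, 1] and add a trailing channel axis.
def preprocess(images):
    x = np.asarray(images, dtype=np.float32) / 255.0
    return x.reshape(-1, 28, 28, 1)

batch = preprocess(np.random.randint(0, 256, size=(4, 28, 28)))
print(batch.shape)  # (4, 28, 28, 1)
```

Keeping this step this simple makes it easy to rule preprocessing in or out as the source of the accuracy gap before layering on augmentation.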
Check-in #1
Introduction
We hope to implement a convolutional neural network to classify the letters of the American Sign Language alphabet. We chose this topic because we feel that deep learning has the potential to improve quality of living for the deaf community by providing an easier way to communicate with those who may not know ASL. This problem is a classification problem. We are basing our work on A Deep Convolutional Neural Network Approach for Static Hand Gesture Recognition (2020).
Related Work
American Sign Language Recognition, an article from Towards Data Science, also uses a convolutional neural network to recognize the different ASL letters. They test a combination of data augmentation, batch normalization, and learning rate decay, as well as different numbers of convolution layers, to find the best performance.
Data
There are two datasets we are planning to use:
- Sign Language MNIST: a larger dataset with ~35k cases, mostly cleaner data with a clearly visible hand; used by the Towards Data Science article
- ASL Fingerspelling A: a smaller dataset with 5 users signing on slightly different and noisier backgrounds
We plan to start off with the Sign Language MNIST dataset and progress to Fingerspelling A once we reach a high enough accuracy. Both datasets omit J and Z, as the symbols for those in ASL require motion.
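For reference, Sign Language MNIST is distributed as CSV rows of a label followed by 784 pixel values (our assumption about the download format); labels run 0-25 with 9 (J) and 25 (Z) absent, per the motion restriction above. A hedged loading sketch:

```python
import numpy as np

# Hedged sketch: parse rows of [label, pixel1..pixel784] into
# 28x28 image arrays and integer labels (format is an assumption).
def load_rows(rows):
    data = np.asarray(rows, dtype=np.float32)
    labels = data[:, 0].astype(int)       # 0-25, with 9 (J) and 25 (Z) unused
    images = data[:, 1:].reshape(-1, 28, 28)
    return images, labels

imgs, labs = load_rows([[3] + [128] * 784, [7] + [0] * 784])
print(imgs.shape, labs.tolist())  # (2, 28, 28) [3, 7]
```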
Methodology
We use a 3-layer CNN with max pooling, followed by a single dense layer with a softmax output. We use SGD with momentum (SGDM), planning to use a momentum of 0.90 and a learning rate of 0.01, training for 20 epochs. The most difficult task will be tuning the hyperparameters - while the paper gives the values its authors used, those may not work as well in our setup.
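The SGDM update with the stated hyperparameters (learning rate 0.01, momentum 0.90) can be sketched directly; the gradient here is a made-up stand-in, not output from our actual model.

```python
import numpy as np

# One SGD-with-momentum step using the write-up's hyperparameters
# (lr = 0.01, momentum = 0.90); `grad` is a toy stand-in gradient.
def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.90):
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
for _ in range(2):                 # two steps against a constant gradient
    w, v = sgdm_step(w, np.ones(3), v)
print(w)                           # weights after two momentum steps
```

The accumulated velocity is why momentum can overshoot if the learning rate is too high, which is one of the knobs we will be adjusting.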
Metrics
We plan to use accuracy to assess the success of our model. Accuracy is a good way to measure the success of this project, as our goal is for the model to correctly identify the different ASL symbols. The authors of the original paper used accuracy, as well as precision, recall, and F1-score, but we feel that accuracy alone is sufficient: the original authors were trying to show their model outperformed a previous one, while we are just trying to create a functional product.
Our goals are as follows:
- Base goal: build a functional model that works on the entire ASL alphabet (removing J and Z) with accuracy > 80%
- Target goal: build a functional model that works on the entire ASL alphabet (removing J and Z) that is able to interact with the webcam for custom input
- Stretch goal: build an additional CNN to find the position of the hand in each image (i.e. preprocessing to crop the image to just the hand) OR try using Fingerspelling B dataset (a version of the Fingerspelling A dataset with noisier background images)
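The accuracy metric behind these goals is simply the fraction of predicted letters matching the labels; a toy check (the arrays here are made up, not model output):

```python
import numpy as np

# Accuracy = fraction of predictions that match the labels.
def accuracy(preds, labels):
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

print(accuracy([0, 1, 2, 3, 4], [0, 1, 2, 9, 4]))  # 0.8, the base-goal threshold
```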
Ethics
- What is your dataset?
- We include two datasets, one with cleaner backgrounds and the other with more noise. The latter, ASL Fingerspelling A, is more representative of the conditions real-life users would face. However, there is still bias in the dataset: ideally the model would parse video rather than still photos and handle the inaccuracies that come with motion. In addition, it is unclear whether different skin tones and ethnicities are equally represented in the dataset. If not, our model could perform significantly worse when used by minorities.
- What broader societal issues are relevant to your chosen problem space?
- The fact that sign language is not widely taught makes the world harder for deaf people to navigate than it needs to be. High schools offer foreign languages, but few offer ASL despite students demonstrating an interest in learning it. As a result, the deaf community has to rely on workarounds to communicate even though ASL is a viable solution. We hope that ASLAN and other deep learning research can help the deaf community, but wider education in sign language is still needed.
Division of Labor
We hope to all work together on each portion of the project. However, each person will focus primarily on one aspect:
- Kevin: preprocessing data and allowing the CNN to interact with the camera so that we can run our model on custom input
- Gareth: building out the CNN model and finetuning hyperparameters
- Emily: figuring out what word is being spelled, based on most probable letter predictions for each hand gesture
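Emily's word-spelling step can be sketched as picking the most probable letter per gesture and joining them; the letter table and the toy one-hot probabilities below are hypothetical, and both datasets' 24-letter alphabet (J and Z omitted) is assumed.

```python
import numpy as np

# 24 static ASL letters (J and Z omitted, as they require motion).
LETTERS = list("ABCDEFGHIKLMNOPQRSTUVWXY")

# Join the argmax letter of each per-gesture probability row into a word.
def spell(prob_rows):
    return "".join(LETTERS[int(np.argmax(row))] for row in prob_rows)

probs = np.zeros((3, 24))
for i, idx in enumerate([2, 0, 18]):   # toy one-hot predictions: C, A, T
    probs[i, idx] = 1.0
print(spell(probs))  # CAT
```

A later refinement could score candidate words against a dictionary using the full probability rows rather than just the argmax.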