Title
ASLAN: American Sign LAnguage Network
Who
Emily Ye (eye4), Kevin Hsu (khsu13), Gareth Mansfield (gmansfie)
Our final writeup can be found here.
Check-in #2
Introduction
We hope to implement a convolutional neural network to classify the letters of the American Sign Language alphabet. We chose this topic because we feel that deep learning has the potential to improve quality of living for the deaf community by providing an easier way to communicate with those who may not know ASL. This problem is a classification problem. We are basing our work on A Deep Convolutional Neural Network Approach for Static Hand Gesture Recognition (2020).
Challenges
- When we implemented the proposed model architecture and hyperparameters from the research paper, we were unable to obtain the same results. Furthermore, we noticed that there were some issues with the layers in the research paper's model, such as using stride sizes larger than filter sizes (which results in loss of information) and using filters much larger than the image itself. Thus, we may need to look for a different research paper or use the Towards Data Science article as a basis for our work.
- Separately implementing the model proposed in the Towards Data Science article also failed to produce similar results, indicating that there could have been issues with preprocessing, so we have also been looking into other ways to preprocess our data or possible areas where we could have made mistakes.
- One of our datasets (Fingerspelling A) had a broken link, so we had to use `wget` to obtain the raw HTML and then `wget` again to obtain the actual data. In addition, since the dataset provided raw images rather than cleaned-up data, we had to preprocess them into a usable format.
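One of the architectural issues mentioned above can be checked directly: when a convolution's stride exceeds its filter size, some input pixels fall in the gaps between windows and never influence the output. The sizes below (28-pixel width, 3x3 filter, stride 4) are illustrative assumptions, not the paper's exact values.

```python
# Show which input columns a strided convolution never reads when
# stride > kernel size (illustrative sizes, not the paper's exact ones).
def covered_columns(width, kernel, stride):
    covered = set()
    for start in range(0, width - kernel + 1, stride):
        covered.update(range(start, start + kernel))
    return covered

width, kernel, stride = 28, 3, 4   # hypothetical layer configuration
missed = sorted(set(range(width)) - covered_columns(width, kernel, stride))
print(missed)  # columns no filter window ever touches
```

With stride 4 and a width-3 filter, every fourth column is skipped entirely, which is the information loss described above.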
Insights
Our current model is performing below expectations, but we are continuing to modify hyperparameters and the model architecture in hopes of improving performance.
Plans
- Revisit preprocessing: since we were unable to get either our original research paper's or Towards Data Science's proposed model working, we suspect there may be an issue in our underlying data, so we will continue to reevaluate our preprocessing methods.
- Modify hyperparameters and model architecture: we suspect that the original research paper may have some issues with its proposed model and hyperparameters, so we hope that we can adjust those to improve the model's performance.
- Allow for camera input and recognize words rather than just individual letters: these are next steps we hope to take so that our project can handle custom input (i.e. the user signing in front of a webcam), but they are not a priority until we have a working model.
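Since preprocessing is a suspected failure point, a minimal baseline pipeline is worth pinning down. The sketch below assumes 28x28 grayscale inputs (as in Sign Language MNIST) and does only the two steps most models expect: scale pixels to [0, 1] and add a channel axis.

```python
import numpy as np

# Minimal preprocessing sketch (assumes 28x28 grayscale images):
# scale 0-255 pixel values into [0, 1] and add a trailing channel axis.
def preprocess(images):
    x = np.asarray(images, dtype=np.float32) / 255.0
    return x.reshape(-1, 28, 28, 1)

batch = preprocess(np.random.randint(0, 256, size=(4, 28, 28)))
print(batch.shape)  # (4, 28, 28, 1)
```

Keeping this step this simple makes it easy to rule preprocessing in or out as the source of the accuracy gap before layering on augmentation.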
Check-in #1
Introduction
We hope to implement a convolutional neural network to classify the letters of the American Sign Language alphabet. We chose this topic because we feel that deep learning has the potential to improve quality of living for the deaf community by providing an easier way to communicate with those who may not know ASL. This problem is a classification problem. We are basing our work on A Deep Convolutional Neural Network Approach for Static Hand Gesture Recognition (2020).
Related Work
American Sign Language Recognition, an article from Towards Data Science, also uses a convolutional neural network to recognize the different ASL letters. They test a combination of data augmentation, batch normalization, and learning rate decay, as well as different numbers of convolution layers, to find the best performance.
Data
There are two datasets we are planning to use:
- Sign Language MNIST: a larger dataset with ~35k cases, mostly cleaner data with a clearly visible hand; used by the Towards Data Science article
- ASL Fingerspelling A: a smaller dataset with 5 users signing on slightly different and noisier backgrounds
We plan to start off with the Sign Language MNIST dataset and progress to Fingerspelling A once we reach a high enough accuracy. Both datasets omit J and Z, as the symbols for those in ASL require motion.
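For reference, Sign Language MNIST is distributed as CSV rows of a label followed by 784 pixel values (our assumption about the download format); labels run 0-25 with 9 (J) and 25 (Z) absent, per the motion restriction above. A hedged loading sketch:

```python
import numpy as np

# Hedged sketch: parse rows of [label, pixel1..pixel784] into
# 28x28 image arrays and integer labels (format is an assumption).
def load_rows(rows):
    data = np.asarray(rows, dtype=np.float32)
    labels = data[:, 0].astype(int)       # 0-25, with 9 (J) and 25 (Z) unused
    images = data[:, 1:].reshape(-1, 28, 28)
    return images, labels

imgs, labs = load_rows([[3] + [128] * 784, [7] + [0] * 784])
print(imgs.shape, labs.tolist())  # (2, 28, 28) [3, 7]
```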
Methodology
We use a 3-layer CNN with max pooling, followed by a single dense layer with a softmax output. We use SGD with momentum (SGDM), planning to use a momentum of 0.90 and a learning rate of 0.01, training for 20 epochs. The most difficult task will be tuning the hyperparameters - while the paper gives the values its authors used, those may not work as well in our setup.
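The SGDM update with the stated hyperparameters (learning rate 0.01, momentum 0.90) can be sketched directly; the gradient here is a made-up stand-in, not output from our actual model.

```python
import numpy as np

# One SGD-with-momentum step using the write-up's hyperparameters
# (lr = 0.01, momentum = 0.90); `grad` is a toy stand-in gradient.
def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.90):
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
for _ in range(2):                 # two steps against a constant gradient
    w, v = sgdm_step(w, np.ones(3), v)
print(w)                           # weights after two momentum steps
```

The accumulated velocity is why momentum can overshoot if the learning rate is too high, which is one of the knobs we will be adjusting.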
Metrics
We plan to use accuracy to assess the success of our model. Accuracy is a good way to measure the success of this project, as our goal is for the model to correctly identify the different ASL symbols. The authors of the original paper used accuracy, as well as precision, recall, and F1-score, but we feel that accuracy alone is sufficient: the original authors were trying to show their model outperformed a previous one, while we are just trying to create a functional product.
Our goals are as follows:
- Base goal: build a functional model that works on the entire ASL alphabet (removing J and Z) with accuracy > 80%
- Target goal: build a functional model that works on the entire ASL alphabet (removing J and Z) that is able to interact with the webcam for custom input
- Stretch goal: build an additional CNN to find the position of the hand in each image (i.e. preprocessing to crop the image to just the hand) OR try using Fingerspelling B dataset (a version of the Fingerspelling A dataset with noisier background images)
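The accuracy metric behind these goals is simply the fraction of predicted letters matching the labels; a toy check (the arrays here are made up, not model output):

```python
import numpy as np

# Accuracy = fraction of predictions that match the labels.
def accuracy(preds, labels):
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

print(accuracy([0, 1, 2, 3, 4], [0, 1, 2, 9, 4]))  # 0.8, the base-goal threshold
```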
Ethics
- What is your dataset?
- We include two datasets, one with cleaner backgrounds and the other with more noise. The latter, ASL Fingerspelling A, is more representative of the conditions real-life users would face. However, there is still bias in the dataset: ideally the model would parse video rather than still photos and handle the inaccuracies that come with motion. In addition, it is unclear whether different skin tones and ethnicities are equally represented in the dataset. If not, our model could perform significantly worse when used by minorities.
- What broader societal issues are relevant to your chosen problem space?
- The fact that sign language is not widely taught makes the world harder for deaf people to navigate than it needs to be. High schools offer foreign languages, but few offer ASL despite students demonstrating an interest in learning it. As a result, the deaf community has to rely on workarounds to communicate even though ASL is a viable solution. We hope that ASLAN and other deep learning research can help the deaf community, but wider education in sign language is still needed.
Division of Labor
We hope to all work together on each portion of the project. However, each person will focus primarily on one aspect:
- Kevin: preprocessing data and allowing the CNN to interact with the camera so that we can run our model on custom input
- Gareth: building out the CNN model and finetuning hyperparameters
- Emily: figuring out what word is being spelled, based on most probable letter predictions for each hand gesture
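Emily's word-spelling step can be sketched as picking the most probable letter per gesture and joining them; the letter table and the toy one-hot probabilities below are hypothetical, and both datasets' 24-letter alphabet (J and Z omitted) is assumed.

```python
import numpy as np

# 24 static ASL letters (J and Z omitted, as they require motion).
LETTERS = list("ABCDEFGHIKLMNOPQRSTUVWXY")

# Join the argmax letter of each per-gesture probability row into a word.
def spell(prob_rows):
    return "".join(LETTERS[int(np.argmax(row))] for row in prob_rows)

probs = np.zeros((3, 24))
for i, idx in enumerate([2, 0, 18]):   # toy one-hot predictions: C, A, T
    probs[i, idx] = 1.0
print(spell(probs))  # CAT
```

A later refinement could score candidate words against a dictionary using the full probability rows rather than just the argmax.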