Inspiration

The COVID-19 pandemic has without a doubt affected many people negatively. However, we feel that students in particular have been hit very hard. They are missing both social contact with their peers and the benefit of close, personalised interaction with teachers. We set out to design a product that addresses this whilst keeping students' mental and physical health in mind.

What it does

EduMate is a website that lets students scan in maths questions using their webcam. Machine learning then classifies each question into one of ten topics, and EduMate finds and displays relevant educational content alongside similar questions.

It currently works only on GCSE maths questions without diagrams.

How we built it

The front end is built using Flask. Images are captured using OpenCV.

Images are then passed to the Mathpix API, which transcribes the mathematical equations in an image to LaTeX. Additionally, we extract the text elements of each question using Pytesseract. The LaTeX elements are then further transcribed to plain English text. Both sets of text are cleaned and preprocessed, and TF-IDF features are extracted from them. Finally, a random forest uses these features to classify the image.
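As a rough illustration of the step that turns LaTeX into plain English text, a minimal sketch might look like the following. The mapping table, the generic "number" token and the `latex_to_words` name are assumptions made for this example; the write-up does not specify the vocabulary EduMate actually uses.

```python
import re

# Hypothetical mapping from LaTeX commands and symbols to English words.
# The real vocabulary used in EduMate is not given in the write-up.
LATEX_WORDS = {
    r"\frac": " fraction ",
    r"\sqrt": " square root ",
    r"\times": " times ",
    r"\div": " divided by ",
    r"\pi": " pi ",
    "^": " power ",
    "=": " equals ",
    "+": " plus ",
    "-": " minus ",
}


def latex_to_words(latex: str) -> str:
    """Replace LaTeX commands, symbols and numbers with plain English tokens."""
    text = latex
    # Replace longer keys first so that no key is shadowed by a shorter one.
    for key in sorted(LATEX_WORDS, key=len, reverse=True):
        text = text.replace(key, LATEX_WORDS[key])
    # Assumption: every number collapses into one generic token.
    text = re.sub(r"\d+(\.\d+)?", " number ", text)
    # Drop braces and any other non-letter characters, then normalise spaces.
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    return " ".join(text.lower().split())


print(latex_to_words(r"\frac{x^2}{3} = 12"))
# fraction x power number number equals number
```

The resulting string can then be joined with the Pytesseract text before feature extraction.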

The transcription from image to plain text is surprisingly good. Pytesseract picks up almost all of the text and makes only small mistakes, such as transcribing a "1" as a "|". It is very poor with mathematical expressions, but this is where Mathpix is very good. Overall, almost all of the information in the original images is preserved. In theory, this allows for a very good model. In practice, sparsity is a major issue because the data set is small.

First, everything that is not text (e.g. numbers and symbols) is changed to natural language. We then extract TF-IDF features. Because the corpus is small, many tokens occur only once, while a few occur dozens of times. A high count does not necessarily make a token informative, though. For instance, most questions have numbers in them, but this gives us little information about the topic. TF-IDF weights features by their relative importance within the corpus and is thus a good fit.

The choice to use a random forest was made instinctively, as it tends to perform well on these kinds of problems. Later testing, however, showed that it outperformed the other classifiers we tried, such as logistic regression, a decision tree and an SVM.
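To make the TF-IDF argument concrete, here is a small hand-rolled sketch using the textbook weighting tf * log(N / df). The example questions are made up, and library implementations typically use smoothed variants of this formula, so this is an illustration of the idea rather than the exact features used in EduMate.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each token appear?
    df = Counter(token for tokens in tokenised for token in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights


# Made-up, already-cleaned questions (not taken from the EduMate data set).
docs = [
    "solve x plus number equals number",
    "expand bracket x plus number",
    "find area of circle radius number",
]
weights = tfidf(docs)

print(weights[0]["number"])  # 0.0, because "number" appears in every question
print(weights[0]["solve"] > weights[0]["x"])  # True, "solve" is rarer than "x"
```

The token "number" appears in every question and therefore receives zero weight, whereas a word that appears in a single question receives the highest weight.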

Then, youtube-search is used to find relevant YouTube videos, which are embedded into the website. Additionally, the database is queried for similar questions, which are also displayed. This matching is currently based only on the predicted label; in the future, we would like to cluster questions or compare the cosine similarities of their feature vectors.
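The cosine-similarity idea mentioned above could be sketched as follows. This is not part of the current prototype; the question ids, the feature weights and the function names are all invented for illustration, and the vectors are stored as sparse dictionaries to keep the example dependency-free.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {feature: weight}."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, stored, top_k=3):
    """Return the ids of the top_k stored questions, most similar first."""
    ranked = sorted(
        stored,
        key=lambda qid: cosine_similarity(query, stored[qid]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical feature vectors; ids and weights are invented for illustration.
stored = {
    "q1": {"expand": 0.7, "bracket": 0.7},
    "q2": {"circle": 0.8, "radius": 0.6},
    "q3": {"expand": 0.5, "power": 0.9},
}
query = {"expand": 0.6, "bracket": 0.8}

print(most_similar(query, stored, top_k=2))  # ['q1', 'q3']
```

Ranking by similarity rather than by label alone would let two questions from the same topic be distinguished by how closely their wording matches.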

Challenges we ran into

Using OpenCV on Linux proved to be a huge challenge. Not only are there many issues with the drivers, but many of the installation routes also provide unstable builds. Integration with Flask was especially difficult, and saving images to a specific directory (static) required a lot of troubleshooting.

On the machine learning side, the lack of data was the major obstacle. Our final model was trained on only 120 images across 10 classes. Naturally, this meant that the feature vectors were very sparse. Nonetheless, we managed a validation accuracy of 60%. During the initial stages we reached validation accuracies of 65%, but this was likely an artefact of overfitting caused by non-random sampling of images.

Whilst 60% does not sound too impressive, it is worth keeping in mind that, in our experience, even humans do not achieve accuracies above 85%. This is because the labels overlap and our selection criteria were not clearly set out. Additionally, hand labelling images introduces a certain amount of noise. These issues carry over to the model: images of type "indices" are frequently misclassified as "algebra". This is not surprising, as index problems are a subset of algebra problems, so it is likely that some of the training images were mislabelled. Even with perfect labelling, we would not expect a perfect decision boundary, because the labels are not distinct. However, this also means that the impact of a misclassification is less severe, as the resources shown will still be relevant.
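One simple way to surface confusions such as "indices" being predicted as "algebra" is to count the (true, predicted) pairs among the misclassified validation examples. The sketch below uses made-up labels purely for illustration; they are not EduMate's validation results, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_pairs(true_labels, predicted_labels):
    """Count (true, predicted) pairs for misclassified examples only."""
    return Counter(
        (truth, guess)
        for truth, guess in zip(true_labels, predicted_labels)
        if truth != guess
    )


# Made-up labels for illustration; these are not real validation results.
y_true = ["indices", "indices", "algebra", "geometry", "indices", "algebra"]
y_pred = ["algebra", "indices", "algebra", "geometry", "algebra", "algebra"]

pairs = confusion_pairs(y_true, y_pred)
print(pairs.most_common(1))  # [(('indices', 'algebra'), 2)]
```

Pairs that dominate this count point to labels that overlap and may benefit from stricter labelling guidelines.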

Accomplishments that we're proud of

We are particularly proud of the impact that we feel our solution can have. We believe that it truly has the potential to improve people's learning, and we feel that we would have benefited from it in school. Additionally, we are very happy that we produced a fully working prototype of our solution.

What we learned

We learned a lot about full-stack development. Building a prototype from start to finish was very interesting, and we went through many design cycles. Additionally, we used many tools and libraries that we had not used much before, such as the Mathpix API and OpenCV. We also learnt a lot about teamwork whilst resolving the many merge conflicts that we created!

What's next for EduMate

First, the main goal is to improve the prototype as is. Key steps for this are:

  • improve the classifier using more data (at least an order of magnitude more)
  • improve labelling using strict guidelines to improve decision boundaries
  • implement social features that allow students to connect with each other
  • implement features that remind students to take regular breaks
  • investigate the effects of data preprocessing on accuracy
  • change to a free model to convert images to LaTeX (https://github.com/harvardnlp/im2markup)

Later on, we would also like to extend this to other subjects and subject levels.
