This research project is based on the evaluation dataset of handwritten Python source code samples provided by the following study:
https://arxiv.org/abs/1706.00069
- Fine-tuning Tesseract 4's tessdata-best LSTM-based language model with Tesseract's training tools to recognize handwritten Python code. The fine-tuned model was evaluated on the evaluation dataset.
- The training files are saved in the tesseract-training/training-data directory
- The tessdata files used for training are saved in the tesseract-training/tess-data directory
- The model checkpoint with the lowest character error rate is saved in the tesseract-training/best-model-checkpoint directory
- The LSTM model extracted from that checkpoint is saved in the tesseract-training/best-model directory
Evaluating Google's Digital Ink Recognition engine's performance on the same evaluation dataset. The Android application that recognizes each writing sample is CodeGraphyEvalDataset.
To use the application:
- Choose the writing sample
- Choose the writer
- Recognize
The original dataset (saved in dataset/original-dataset) presented by the study mentioned above is in online-handwriting format (x and y coordinate stroke data). The author had to convert this data into an individual image for each writing sample to make it compatible with Tesseract. The JavaScript code that does this is saved in dataset/js-stroke2svg2jpeg. The converted images can be found in dataset/image-dataset.
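The actual conversion pipeline lives in dataset/js-stroke2svg2jpeg; the core stroke-to-SVG step can be sketched as follows, assuming each sample is an array of strokes and each stroke an array of `{x, y}` points (hypothetical field names — the real dataset's shape may differ):

```javascript
// Convert online-handwriting stroke data into an SVG string.
// Assumed shape (hypothetical): sample = [stroke, ...], stroke = [{x, y}, ...]
function strokesToSvg(sample, width, height) {
  const paths = sample
    .map(stroke => {
      // Move to the first point, then draw line segments through the rest.
      const d = stroke
        .map((p, i) => `${i === 0 ? 'M' : 'L'} ${p.x} ${p.y}`)
        .join(' ');
      return `<path d="${d}" fill="none" stroke="black" stroke-width="2"/>`;
    })
    .join('\n  ');
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">\n  ${paths}\n</svg>`;
}

// Example: a two-stroke sample
const svg = strokesToSvg(
  [[{x: 0, y: 0}, {x: 10, y: 5}], [{x: 3, y: 8}, {x: 7, y: 8}]],
  100, 50
);
```

Rasterizing the resulting SVG to JPEG would then require an external library; the repository's actual choice is whatever dataset/js-stroke2svg2jpeg uses.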
The original stroke data also had to be transformed so it could be easily loaded into the Android application; that data is in the dataset/android-stroke-data directory.
The following table reports the Character Error Rate (CER) and Word Error Rate (WER) evaluations. The Digital Ink model outperforms the fine-tuned Tesseract model, although the latter does improve over the tessdata-best baseline after fine-tuning.
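Both metrics are edit-distance based: CER is the character-level Levenshtein distance between the recognized text and the ground truth, divided by the ground-truth length, and WER is the same quantity computed over word tokens. A minimal sketch (not the project's actual evaluation script):

```javascript
// Levenshtein distance between two sequences (strings or arrays),
// using a rolling single-row DP table.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({length: n + 1}, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1,      // deletion
                         curr[j - 1] + 1,  // insertion
                         prev[j - 1] + cost); // substitution
    }
    prev = curr;
  }
  return prev[n];
}

// CER: character-level edit distance over ground-truth length.
const cer = (hyp, ref) => levenshtein([...hyp], [...ref]) / ref.length;

// WER: the same metric over whitespace-separated tokens.
const wer = (hyp, ref) => {
  const r = ref.split(/\s+/);
  return levenshtein(hyp.split(/\s+/), r) / r.length;
};

cer('print(x)', 'print(y)');      // → 0.125 (1 substitution over 8 chars)
wer('print ( x )', 'print ( y )'); // → 0.25  (1 substitution over 4 tokens)
```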
