This research project is based on the evaluation dataset of handwritten Python source code samples provided by the following study:
https://arxiv.org/abs/1706.00069
- Fine-tuning Tesseract 4's tessdata-best LSTM-based language model with Tesseract's training tools to recognize handwritten Python code. The fine-tuned model was evaluated on the evaluation dataset.
- The training files are saved in the tesseract-training/training-data directory
- The tessdata files used for training are saved in the tesseract-training/tess-data directory
- The model checkpoint with the lowest character error rate is saved in the tesseract-training/best-model-checkpoint directory
- The LSTM model extracted from that checkpoint is saved in the tesseract-training/best-model directory
Evaluating Google's Digital Ink Recognition engine's performance on the same evaluation dataset. The Android application that recognizes each writing sample is CodeGraphyEvalDataset.
To use the application:
- Choose the writing sample
- Choose the writer
- Recognize
The original dataset (saved in dataset/original-dataset) presented by the study mentioned above is in online-handwriting format (x and y coordinate stroke data). The author had to convert this data into an individual image for each writing sample to make it compatible with Tesseract. The JavaScript code that does this is saved in dataset/js-stroke2svg2jpeg. The converted images can be found in dataset/image-dataset.
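The actual conversion pipeline lives in dataset/js-stroke2svg2jpeg; the core stroke-to-SVG step can be sketched as follows, assuming each sample is an array of strokes and each stroke an array of `{x, y}` points (hypothetical field names — the real dataset's shape may differ):

```javascript
// Convert online-handwriting stroke data into an SVG string.
// Assumed shape (hypothetical): sample = [stroke, ...], stroke = [{x, y}, ...]
function strokesToSvg(sample, width, height) {
  const paths = sample
    .map(stroke => {
      // Move to the first point, then draw line segments through the rest.
      const d = stroke
        .map((p, i) => `${i === 0 ? 'M' : 'L'} ${p.x} ${p.y}`)
        .join(' ');
      return `<path d="${d}" fill="none" stroke="black" stroke-width="2"/>`;
    })
    .join('\n  ');
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">\n  ${paths}\n</svg>`;
}

// Example: a two-stroke sample
const svg = strokesToSvg(
  [[{x: 0, y: 0}, {x: 10, y: 5}], [{x: 3, y: 8}, {x: 7, y: 8}]],
  100, 50
);
```

Rasterizing the resulting SVG to JPEG would then require an external library; the repository's actual choice is whatever dataset/js-stroke2svg2jpeg uses.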
The original stroke data also had to be transformed so it could be easily loaded into the Android application; that data is in the dataset/android-stroke-data directory.
The following table reports the Character Error Rate (CER) and Word Error Rate (WER) evaluations. The Digital Ink model outperforms the fine-tuned Tesseract model, although the latter does improve over the tessdata-best baseline after fine-tuning.
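Both metrics are edit-distance based: CER is the character-level Levenshtein distance between the recognized text and the ground truth, divided by the ground-truth length, and WER is the same quantity computed over word tokens. A minimal sketch (not the project's actual evaluation script):

```javascript
// Levenshtein distance between two sequences (strings or arrays),
// using a rolling single-row DP table.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({length: n + 1}, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1,      // deletion
                         curr[j - 1] + 1,  // insertion
                         prev[j - 1] + cost); // substitution
    }
    prev = curr;
  }
  return prev[n];
}

// CER: character-level edit distance over ground-truth length.
const cer = (hyp, ref) => levenshtein([...hyp], [...ref]) / ref.length;

// WER: the same metric over whitespace-separated tokens.
const wer = (hyp, ref) => {
  const r = ref.split(/\s+/);
  return levenshtein(hyp.split(/\s+/), r) / r.length;
};

cer('print(x)', 'print(y)');      // → 0.125 (1 substitution over 8 chars)
wer('print ( x )', 'print ( y )'); // → 0.25  (1 substitution over 4 tokens)
```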
