rajeevbk/Handwritten-SourceCode-Recognition-For-Python

Handwritten Source Code Recognition For Python

This research project is based on the evaluation dataset of handwritten Python source code samples provided by the following study:

https://arxiv.org/abs/1706.00069

There are two main parts to this research project:

  1. Fine-tuning Tesseract 4's tessdata_best LSTM-based language model with Tesseract's training tools to recognize handwritten Python code. The fine-tuned model was evaluated on the evaluation dataset.
  • The training files are saved in the tesseract-training/training-data directory
  • The tessdata files used for training are saved in the tesseract-training/tess-data directory
  • The model checkpoint with the lowest character error rate is saved in the tesseract-training/best-model-checkpoint directory
  • The LSTM model extracted from that checkpoint is saved in the tesseract-training/best-model directory
  2. Evaluating the performance of Google's Digital Ink Recognition engine on the same evaluation dataset. The Android application that recognizes each writing sample is CodeGraphyEvalDataset

    To use the application:

    1. Choose a writing sample
    2. Choose a writer
    3. Recognize
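
For the first part of the project, the fine-tuning and model-extraction steps can be sketched as two `lstmtraining` invocations. This is a minimal sketch, not the author's actual commands: the flags follow the Tesseract 4 training documentation, while the function names and the default iteration count are assumptions made for illustration. The functions only build the argument lists; nothing is executed.

```python
def finetune_command(base_lstm, base_traineddata, train_listfile,
                     model_output, max_iterations=400):
    """Build an lstmtraining call that continues training from a base LSTM.

    base_lstm is the .lstm file extracted from a tessdata_best model
    (for example with `combine_tessdata -e`). All paths are supplied
    by the caller; none are taken from the repository.
    """
    return [
        "lstmtraining",
        "--continue_from", base_lstm,
        "--traineddata", base_traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


def extract_model_command(checkpoint, base_traineddata, output_traineddata):
    """Build an lstmtraining call that converts a checkpoint to a .traineddata file."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", base_traineddata,
        "--model_output", output_traineddata,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Tesseract training tools are installed.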

The Evaluation Dataset

The original dataset (saved in dataset/original-dataset) presented by the study mentioned above is in an online handwriting format (x and y coordinate stroke data). The author converted this data into an individual image for each writing sample so that it is compatible with Tesseract. The JavaScript code that does this is saved in dataset/js-stroke2svg2jpeg. The converted images can be found in dataset/image-dataset.
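
The stroke-to-image idea can be illustrated with a short Python sketch that renders strokes as SVG polylines. This is not the repository's JavaScript code. The input layout (a list of strokes, each a list of `(x, y)` tuples), the function name, and the padding and stroke-width defaults are all assumptions made for illustration. Rasterizing the SVG to JPEG is left out.

```python
def strokes_to_svg(strokes, stroke_width=2, padding=10):
    """Render strokes as an SVG string.

    strokes: list of strokes, each a list of (x, y) tuples (assumed layout).
    Coordinates are shifted so the drawing starts at the padding offset.
    """
    points = [p for stroke in strokes for p in stroke]
    if not points:
        raise ValueError("no points to draw")

    # Bounding box of all points, used to size and translate the drawing.
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    max_x = max(x for x, _ in points)
    max_y = max(y for _, y in points)
    width = max_x - min_x + 2 * padding
    height = max_y - min_y + 2 * padding

    polylines = []
    for stroke in strokes:
        coords = " ".join(
            f"{x - min_x + padding},{y - min_y + padding}" for x, y in stroke
        )
        polylines.append(
            f'  <polyline points="{coords}" fill="none" '
            f'stroke="black" stroke-width="{stroke_width}"/>'
        )

    header = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    )
    # White background so a later raster conversion is not transparent.
    background = '  <rect width="100%" height="100%" fill="white"/>'
    return "\n".join([header, background, *polylines, "</svg>"])
```

Each stroke becomes one polyline, so pen lifts between strokes are preserved in the rendered image.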

The original stroke data also had to be transformed so that it can be easily loaded into the Android application; that data is in the dataset/android-stroke-data directory.

Findings

The following table shows the Character Error Rate (CER) and Word Error Rate (WER) evaluations. The Digital Ink model outperforms the fine-tuned Tesseract model, which in turn improves on tess-best (base Tesseract) after fine-tuning.
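
For reference, CER and WER are commonly computed as the edit distance between the recognized text and the ground truth, divided by the length of the ground truth in characters or words. The sketch below shows that common definition; it is an assumption that the project's evaluation follows it exactly, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Note that `wer` splits on whitespace, so it ignores indentation; for Python source code, where indentation is significant, CER is the more sensitive of the two measures.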

[Image: table of CER and WER results]
