Data Preparation for Tesseract Training

This mini-project provides a set of Python scripts for generating synthetic text, rendering that text as images in your custom font, and augmenting those images to improve Tesseract OCR training. The main goal is to streamline the creation of training datasets for custom fonts, especially fonts derived from standard fonts by slight transformations (e.g., skew, distortion). By comparing the two fonts side by side in an image editor (e.g., Photoshop, Figma, or Affinity), you can determine the transformations that make the base font look like the desired one. Once you have determined those changes, you can generate a high-quality dataset for training Tesseract to recognise your font.

Workflow Overview

Generate Text: Use the text_generator.py script to generate random text samples in a pre-planned format (plain text, numbers, or wordlists).
Generate Font Images: Run train_data_generation.py to render images of the generated text using your custom font. The file includes an example of how to specify the tweaks to your base font.
Apply Augmentations: The augment.py script applies transformations such as rotation, noise, and distortions to enhance OCR training performance.
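
To make the first step concrete, here is a minimal sketch of what generating samples in the three formats could look like. It is not the code in text_generator.py: the `WORDS` pool and the `generate_sample` and `generate_samples` functions are hypothetical names chosen for illustration, and the "plain" mode is only an assumed mix of words and numbers.

```python
import random

# Hypothetical word pool; in practice this would come from your own documents.
WORDS = ["invoice", "total", "date", "amount", "customer", "order", "reference"]


def generate_sample(kind, rng, length=5):
    """Return one line of synthetic text of the requested kind."""
    if kind == "wordlist":
        # Words only, drawn at random from the pool.
        return " ".join(rng.choice(WORDS) for _ in range(length))
    if kind == "numbers":
        # Numbers only, each between 0 and 99999.
        return " ".join(str(rng.randint(0, 99999)) for _ in range(length))
    if kind == "plain":
        # Mostly words with an occasional number, written like a sentence.
        tokens = [
            rng.choice(WORDS) if rng.random() < 0.8 else str(rng.randint(0, 999))
            for _ in range(length)
        ]
        return " ".join(tokens).capitalize() + "."
    raise ValueError(f"unknown sample kind: {kind!r}")


def generate_samples(kind, count, seed=0):
    """Return `count` samples; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [generate_sample(kind, rng) for _ in range(count)]
```

Calling `generate_samples("numbers", 1000)` would give a thousand lines of digits only. Seeding the generator is a design choice that lets you regenerate the same text later if the rendering or augmentation settings change.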

Example Use Case

If you have a custom font that looks like a modified version of a standard font (e.g., a slightly skewed Helvetica), the steps you want to take are:

Generate thousands of text samples depending on the context of the documents you work with.
Render those samples as images in your custom font.
Apply appropriate augmentations to create a robust training dataset.
Use the resulting data to train Tesseract to recognise your specific font style accurately.
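
For the "slightly skewed" case, the tweak to the base font can be described by a single shear angle. The sketch below shows the arithmetic only, under the assumption that the skew is a pure horizontal shear. `shear_coefficients` and `apply_affine` are hypothetical helpers, not functions from this repository. The six-value layout matches what Pillow's `Image.transform` accepts with `Image.AFFINE`, but whether train_data_generation.py uses Pillow is an assumption.

```python
import math


def shear_coefficients(angle_degrees):
    """Return (a, b, c, d, e, f) describing a horizontal shear.

    The mapping is x' = a*x + b*y + c and y' = d*x + e*y + f,
    so a shear by the given angle sets b = tan(angle).
    """
    k = math.tan(math.radians(angle_degrees))
    return (1.0, k, 0.0, 0.0, 1.0, 0.0)


def apply_affine(coeffs, x, y):
    """Apply the affine coefficients to a single point."""
    a, b, c, d, e, f = coeffs
    return (a * x + b * y + c, d * x + e * y + f)


# A point 20 units up is pushed sideways by 20 * tan(angle); y is unchanged.
shifted_x, shifted_y = apply_affine(shear_coefficients(10), 0, 20)
```

Note that Pillow interprets these coefficients as a map from output pixels back to input pixels, so if you measure the skew in the forward direction in your editor, you may need to negate the angle before passing it on.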

Disclaimer

The author bears no responsibility if this code or any part thereof breaks your computer or your dreams.
