This mini-project provides a set of Python scripts for generating synthetic text, rendering that text as images in your custom font, and augmenting those images for Tesseract OCR training. The main goal is to streamline the creation of training datasets for custom fonts, especially fonts derived from a standard font with slight transformations (e.g., skew, distortion). By comparing the two fonts side by side in an editor such as Photoshop, Figma, or Affinity, you can work out which transformations make the base font match the desired one, then apply them to generate a dataset for training Tesseract to recognise your font.
- Generate Text: Use the text_generator.py script to generate random text samples in a predefined format (plain text, numbers, or wordlists).
- Generate Font Images: Run train_data_generation.py to render images of the generated text using your custom font. That file includes an example of how to specify your base font tweaks.
- Apply Augmentations: The augment.py script applies transformations such as rotation, noise, and distortion so that the trained model copes better with imperfect input.
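
As a rough illustration of the text-generation step, the sketch below produces samples in the three formats mentioned above. It is not the contents of text_generator.py: the function names, the `WORDS` list, and the `mode` values are all hypothetical and only show the general idea.

```python
import random
import string

# Hypothetical stand-in vocabulary; replace it with words from your own documents.
WORDS = ["invoice", "total", "date", "amount", "customer", "order", "reference"]


def generate_sample(mode, rng, length=5):
    """Return one line of synthetic text with `length` tokens in the given format."""
    if mode == "plain":
        # Random lowercase "words" of 3 to 8 letters each.
        return " ".join(
            "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 8)))
            for _ in range(length)
        )
    if mode == "numbers":
        # Numeric tokens of 1 to 6 digits each.
        return " ".join(str(rng.randint(0, 999999)) for _ in range(length))
    if mode == "wordlist":
        # Tokens drawn from a fixed vocabulary.
        return " ".join(rng.choice(WORDS) for _ in range(length))
    raise ValueError(f"unknown mode: {mode!r}")


def generate_samples(mode, count, seed=0):
    """Return `count` samples; a fixed seed keeps the dataset reproducible."""
    rng = random.Random(seed)
    return [generate_sample(mode, rng) for _ in range(count)]


samples = generate_samples("wordlist", 3)
for line in samples:
    print(line)
```

Seeding the generator is a deliberate choice here: it lets you regenerate exactly the same text later if you change only the rendering or augmentation settings.
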
If you have a custom font that looks like a modified version of a standard font (e.g., a slightly skewed Helvetica), the typical workflow is:
- Generate thousands of text samples depending on the context of the documents you work with.
- Render those samples as images in your custom font.
- Apply appropriate augmentations to create a robust training dataset.
- Use the resulting data to train Tesseract to recognise your specific font style accurately.
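
To make the augmentation step more concrete, here is a dependency-free sketch of two simple transformations, a horizontal shear (skew) and salt-and-pepper noise, applied to a grayscale image stored as a list of pixel rows. It is an illustration under stated assumptions rather than what augment.py does: all function names are hypothetical, and a real pipeline would more likely work on image files through an imaging library.

```python
import math
import random


def shear_rows(image, angle_deg, background=255):
    """Horizontally shear a grayscale image (list of rows) by `angle_deg`."""
    height = len(image)
    width = len(image[0])
    slope = math.tan(math.radians(angle_deg))
    sheared = []
    for y, row in enumerate(image):
        # Rows nearer the top shift further; the bottom row stays in place.
        shift = round(slope * (height - 1 - y))
        new_row = [background] * width
        for x, value in enumerate(row):
            nx = x + shift
            if 0 <= nx < width:  # pixels pushed outside the canvas are dropped
                new_row[nx] = value
        sheared.append(new_row)
    return sheared


def add_salt_and_pepper(image, amount, rng):
    """Set roughly `amount` (0..1) of the pixels to pure black or pure white."""
    noisy = [row[:] for row in image]  # copy so the input is left untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return noisy


# A 4x6 white image with a black vertical bar in column 1.
image = [[255, 0, 255, 255, 255, 255] for _ in range(4)]
skewed = shear_rows(image, 45)
augmented = add_salt_and_pepper(skewed, 0.05, random.Random(0))
```

Keeping each augmentation as a small function that returns a new image makes it easy to chain them in different orders and strengths when producing many variants of the same rendered sample.
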
The author bears no responsibility if this code or any part thereof breaks your computer or your dreams.
