This repository hosts the codebase from the paper An End-to-End OCR-Free Solution For Identity Document Information Extraction.
This repository provides a Python implementation for a synthetic ID-card generation pipeline. The core objective of this pipeline is to create synthetic images of personal identification documents (ID cards) to serve as additional training data in document information extraction tasks.
The pipeline takes as input an empty ID card template and a description of how to fill it in (backed by Python functions), and generates the final document along with its textual field annotations.
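To make the input/output contract concrete, here is a minimal sketch of the idea, not the repository's actual code. The `FieldSpec` class, the generator functions, `fill_template`, the sample values, and the pixel positions are all hypothetical names and data chosen for illustration. Rendering is left out: the sketch only collects what would be drawn and the matching annotations.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldSpec:
    """Hypothetical description of one text field on an empty template."""
    name: str                                  # annotation key, e.g. "surname"
    position: tuple[int, int]                  # (x, y) pixel offset on the template
    generate: Callable[[random.Random], str]   # function that produces the value


def random_surname(rng: random.Random) -> str:
    # Illustrative values only; a real pipeline would use a larger source.
    return rng.choice(["ROSSI", "BIANCHI", "FERRARI"])


def random_birth_date(rng: random.Random) -> str:
    day = rng.randint(1, 28)
    month = rng.randint(1, 12)
    year = rng.randint(1950, 2005)
    return f"{day:02d}.{month:02d}.{year}"


def fill_template(fields: list[FieldSpec], seed: int):
    """Return (draw_calls, annotations) for one synthetic document.

    draw_calls lists the (text, position) pairs a renderer would write
    onto the template; annotations maps field names to the same text.
    """
    rng = random.Random(seed)
    draw_calls: list[tuple[str, tuple[int, int]]] = []
    annotations: dict[str, str] = {}
    for field in fields:
        value = field.generate(rng)
        draw_calls.append((value, field.position))
        annotations[field.name] = value
    return draw_calls, annotations


# Example usage with two made-up fields.
fields = [
    FieldSpec("surname", (120, 80), random_surname),
    FieldSpec("birth_date", (120, 140), random_birth_date),
]
draw_calls, annotations = fill_template(fields, seed=0)
```

Keeping the value generators as plain functions means the same field description can drive both the rendered image and its ground-truth annotations, so the two cannot drift apart.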
The following is a visual representation of how the pipeline works, though it is better explained in the paper.
This pipeline was used in the aforementioned paper, An End-to-End OCR-Free Solution For Identity Document Information Extraction, to generate a set of synthetic ID cards that serve as pre-training data to fine-tune a Donut model in a document information extraction task. You can download the synthetic dataset and the fine-tuned model from Zenodo.
Our implementation uses Wand to write text over the templates. You need ImageMagick to run it properly.
You also have to install the custom fonts we used, which you can find in the template/fonts folder.
The face pictures we use to compile the templates come from the Face Research Lab London Set. We use the neutral front faces, as those better fit ID cards. You can download the set manually and place the images in the face_pictures template folder, or run the download script on Linux or WSL.
Create a Python environment (Python>=3.10) and install the requirements:
pip install -r requirements.txt
Finally, you have to fix a problem with one of our dependencies. Our implementation uses Augraphy and its low-light noise augmentation. We faced a NameError using the LowLightNoise class because two variables were referenced without having been declared: their declaration happens inside an if statement. We fixed the problem by moving their declaration outside the if statement.
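The pattern behind this kind of error can be sketched as follows. This is an illustration of the general bug and fix, not Augraphy's actual code: the function names, variable names, and values are hypothetical. In Python, reading a local variable that was never assigned raises UnboundLocalError, which is a subclass of NameError.

```python
def noise_params_buggy(apply_noise: bool) -> tuple[float, float]:
    if apply_noise:
        # Hypothetical variables, assigned only on this branch.
        noise_mean = 0.1
        noise_scale = 0.5
    # When apply_noise is False the names were never bound,
    # so this line raises UnboundLocalError (a NameError).
    return noise_mean, noise_scale


def noise_params_fixed(apply_noise: bool) -> tuple[float, float]:
    # Declaring the variables before the if statement guarantees
    # they exist whichever branch is taken.
    noise_mean = 0.0
    noise_scale = 0.0
    if apply_noise:
        noise_mean = 0.1
        noise_scale = 0.5
    return noise_mean, noise_scale


# Demonstrate the failure and the fix.
try:
    noise_params_buggy(False)
    buggy_raised = False
except NameError:
    buggy_raised = True

fixed_result = noise_params_fixed(False)
```

If you hit the same error, apply the equivalent change to the installed Augraphy source for LowLightNoise, using the variable names that appear in the traceback.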
The pipeline allows for the synthetic generation of 4 types of Italian ID cards.







