This project implements an OCR system that combines a ResNet34 vision backbone, a Mamba sequence encoder, and CTC decoding. Adapter-based fine-tuning keeps training of the vision backbone efficient.
- Preprocessing: Images are resized to 32px height, preserving aspect ratio, and padded.
- Vision Backbone: Uses a pre-trained ResNet34 with injected adapters. Most of the network is frozen; only the adapters, part of layer4, and the projections are trained.
- Mamba Encoder: Models the sequence of features to understand character context.
- Decoding: Uses CTC to convert the sequence into text.
- Mamba: Uses state space models instead of RNNs for better scaling.
- Adapters: Fine-tunes the backbone efficiently without breaking pre-trained weights.
- ResNet34: Strong baseline feature extractor.
- FP16: Supports mixed precision training.
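
The decoding step listed above can be illustrated with greedy CTC decoding: take the most likely class per frame, collapse consecutive repeats, then drop the blank symbol. This is a minimal sketch, not the code in utils.py. The blank index of 0, the `charset` layout (class index `i` maps to `charset[i - 1]`), and the function name `ctc_greedy_decode` are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: class 0 is the CTC blank symbol


def ctc_greedy_decode(frame_ids, charset, blank=BLANK):
    """Turn per-frame class indices into text.

    frame_ids: iterable of int, the argmax class for each time step.
    charset:   string where charset[i - 1] is the character for class i
               (hypothetical layout; the real mapping may differ).
    """
    # Step 1: collapse runs of identical predictions into one entry.
    collapsed = [class_id for class_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, then map the remaining indices to characters.
    return "".join(charset[i - 1] for i in collapsed if i != blank)


# Hypothetical three-character alphabet: 1 -> "a", 2 -> "c", 3 -> "t".
charset = "act"
frames = [2, 2, 0, 1, 1, 0, 3, 3, 3]
decoded = ctc_greedy_decode(frames, charset)
print(decoded)
```

Because repeats are collapsed before blanks are removed, a doubled letter survives only when a blank separates the two occurrences: `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.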
- configs/: Configuration settings.
- data/: Dataset loading and processing.
- models/: Model architecture (ResNet + Mamba).
- train.py: Main training script.
- infer.py: Inference script.
- utils.py: Metrics and decoding helpers.
- gen_data.py: Synthetic data generator.
Clone the repo and install dependencies:
git clone <repo_url>
cd ocr_project
pip install -r requirements.txt

Note: mamba-ssm needs a GPU.
Check configs/config.py to set your data paths and batch size.
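
As a rough idea of what such settings might look like, here is a sketch of a config object. It is not the actual contents of configs/config.py: the class name `TrainConfig`, every field name, and the placeholder paths are hypothetical. Only the 32px image height and the FP16 option come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # All names and defaults here are hypothetical placeholders.
    train_dir: str = "path/to/train"  # placeholder: point at your training data
    val_dir: str = "path/to/val"      # placeholder: point at your validation data
    batch_size: int = 32              # lower this if GPU memory runs out
    img_height: int = 32              # matches the 32px preprocessing height
    use_fp16: bool = True             # toggle mixed precision training


# Override only the fields that differ from the defaults.
cfg = TrainConfig(batch_size=16)
print(cfg)
```

Check the real file for the actual setting names before editing anything.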
Run the training script (it handles freezing automatically):
python train.py

Test on images:
python infer.py

Use gen_data.py to create synthetic training data with random fonts and noise.