A Vietnamese text normalization system for diacritic restoration and spelling error correction, built on a fine-tuned BartPho Seq2Seq Transformer with parameter-efficient adaptation.
- Seq2Seq Transformer Architecture
  - Encoder–decoder design tailored for text-to-text correction tasks
  - Handles diacritic restoration, spelling correction, and informal variants (e.g., teencode)
- Pretrained BartPho Backbone
  - Vietnamese-specific Transformer trained on large-scale monolingual corpora
  - Syllable-level tokenization to reduce OOV errors
- Parameter-Efficient Fine-Tuning (LoRA)
  - Fine-tunes only ~3–4% of total parameters
  - Enables training and inference on limited GPU resources
- Character-Level Error Sensitivity
  - Explicit focus on Character Error Rate (CER), critical for tonal languages
- Strict Quantitative Metrics
  - BLEU for sequence similarity
  - WER for word-level correctness
  - CER for fine-grained diacritic accuracy
- Exact Match Awareness
  - Tracks sentence-level exact correctness to expose error accumulation in long sequences
- Confidence-Aware Inference
  - Each corrected token is associated with a confidence score
- Visual Difference Highlighting
  - Explicit marking of inserted, modified, and deleted tokens
- Explainable Corrections
  - Token-level probability inspection to support human verification
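
The WER, CER, and exact-match metrics listed above can be sketched with a plain edit-distance routine. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, BLEU is left out, and the NFC normalization step is an assumption added so that precomposed and decomposed Vietnamese diacritics compare as equal.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Assumption: compare in NFC so "ô" is one character in both strings.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by number of reference words."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def exact_match(reference: str, hypothesis: str) -> bool:
    """Sentence-level exact match after normalization and trimming."""
    return _nfc(reference).strip() == _nfc(hypothesis).strip()


if __name__ == "__main__":
    reference = "tôi đi học"
    hypothesis = "toi đi học"  # one missing diacritic
    print(cer(reference, hypothesis), wer(reference, hypothesis),
          exact_match(reference, hypothesis))
```

A single missing diacritic changes one character out of ten here but one word out of three, which is why CER gives a finer-grained view of diacritic accuracy than WER.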
- Language: Python
- Deep Learning: PyTorch
- Modeling: Hugging Face Transformers, PEFT (LoRA)
- Serving: Flask
- Deployment: Docker & Docker Compose
Alternatively, you can access the system directly via the link. Note that the hosting space may be inactive, in which case the service is temporarily unavailable.
git clone https://github.com/yammdd/vietnamese-text-normalization.git
cd vietnamese-text-normalization

Before running the system, you need to manually download the pretrained models and place them in the correct directory structure.
Download the following models: vietnamese-error-correction, vietnamese-diacritic-restoration-v2
Create the following folders inside the project root:
models/
├── vietnamese-error-correction/
└── vietnamese-diacritic-restoration-v2/

Make sure Docker and Docker Compose are installed, then run:
docker-compose up -d --build

This command will build the image and start the inference service.
Once the containers are running, open your browser at:
http://localhost:7860
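
If the service fails to start, a missing model folder is a likely cause. The following is a small optional check, not part of the repository: it assumes only the `models/` layout described above, and the function name `missing_model_dirs` is hypothetical. It does not verify the contents of the folders.

```python
from pathlib import Path
from typing import List

# Folder names taken from the directory layout described in this README.
REQUIRED_MODELS = (
    "vietnamese-error-correction",
    "vietnamese-diacritic-restoration-v2",
)


def missing_model_dirs(project_root: str = ".") -> List[str]:
    """Return the required model folders that do not exist under models/."""
    models_dir = Path(project_root) / "models"
    return [name for name in REQUIRED_MODELS
            if not (models_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model folders under models/:", ", ".join(missing))
    else:
        print("All model folders are in place.")
```

Run it from the project root before `docker-compose up -d --build`.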
- This project is research-oriented, not production-ready
- Designed for studying:
  - Vietnamese text normalization
  - Error propagation in Seq2Seq models
  - Character-level evaluation in tonal languages
- Human verification is strongly recommended
- High confidence does not guarantee absolute correctness
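
One way to support human verification is to list every changed token together with its confidence and flag the low-confidence ones. The sketch below is illustrative only. It assumes the system returns one confidence value in [0, 1] per whitespace-separated output token, which this README does not specify. The function name `review_report` and the 0.9 threshold are hypothetical. The diff uses Python's standard `difflib`, which may differ from the project's own highlighting logic.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def review_report(source_tokens: List[str],
                  output_tokens: List[str],
                  confidences: List[float],
                  threshold: float = 0.9) -> List[Dict]:
    """List changed spans with their lowest confidence and a review flag."""
    if len(output_tokens) != len(confidences):
        raise ValueError("expected one confidence score per output token")

    rows = []
    matcher = SequenceMatcher(a=source_tokens, b=output_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged tokens are not reported
        span_conf = confidences[j1:j2]
        # Deletions produce no output tokens, so they carry no confidence.
        lowest = min(span_conf) if span_conf else None
        rows.append({
            "op": op,  # "replace", "insert", or "delete"
            "before": " ".join(source_tokens[i1:i2]),
            "after": " ".join(output_tokens[j1:j2]),
            "min_confidence": lowest,
            # Assumption: deletions and low-confidence edits need a human look.
            "flag": lowest is None or lowest < threshold,
        })
    return rows


if __name__ == "__main__":
    source = "toi đi hoc".split()
    output = "tôi đi học".split()
    scores = [0.98, 0.99, 0.71]  # made-up values for illustration
    for row in review_report(source, output, scores):
        print(row)
```

The report still includes unflagged changes, since a high score alone does not confirm that a correction is right.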
- Thanh Dan Bui (Project Manager)
- Tien Dung Pham (Member)
- Quoc Hung Pham (Member)


