Vietnamese Text Normalization

A Vietnamese text normalization system for diacritic restoration and spelling error correction, built on a fine-tuned BartPho Seq2Seq Transformer with parameter-efficient adaptation.
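
To make the diacritic restoration task concrete, the sketch below builds a (noisy input, clean target) pair by stripping diacritics from a correct sentence. This is a common way to illustrate or synthesize such pairs; it is an assumption for illustration, not a description of how this project's training data was built, and `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, keeping base Latin letters."""
    # "đ" and "Đ" have no Unicode decomposition, so map them by hand first.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn"; dropping them removes the accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


target = "Tiếng Việt có dấu"        # what the model should output
source = strip_diacritics(target)   # what the model would receive
print(source)  # Tieng Viet co dau
```

The model's job is the inverse mapping, which is ambiguous without context: the unaccented syllable "ma" alone can correspond to several distinct Vietnamese words.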

✨ Key Features

🧠 Model & Learning Strategy

Seq2Seq Transformer Architecture
- Encoder–decoder design tailored for text-to-text correction tasks
- Handles diacritic restoration, spelling correction, and informal variants (e.g., teencode)
Pretrained BartPho Backbone
- Vietnamese-specific Transformer trained on large-scale monolingual corpora
- Syllable-level tokenization to reduce OOV errors
Parameter-Efficient Fine-Tuning (LoRA)
- Fine-tunes only ~3–4% of total parameters
- Enables training and inference on limited GPU resources
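
The small trainable fraction follows from how LoRA works: each adapted weight matrix of shape d_out × d_in stays frozen, and only two low-rank factors of shapes d_out × r and r × d_in are trained. The sketch below is plain arithmetic under assumed, illustrative sizes; the layer shapes, rank, and base parameter count are hypothetical and are not taken from this project's configuration.

```python
from typing import List, Tuple


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


def trainable_fraction(base_params: int,
                       adapted_shapes: List[Tuple[int, int]],
                       rank: int) -> float:
    """Share of trainable parameters once adapters are attached to a frozen base."""
    added = sum(lora_trainable_params(d_in, d_out, rank)
                for d_in, d_out in adapted_shapes)
    return added / (base_params + added)


# Hypothetical example: 48 square projections of width 1024, rank 16,
# attached to a frozen base model with 100M parameters.
shapes = [(1024, 1024)] * 48
fraction = trainable_fraction(100_000_000, shapes, rank=16)
print(f"{fraction:.2%} of parameters are trainable")
```

Raising the rank or adapting more matrices increases the fraction linearly, which is the main knob behind a figure such as the ~3–4% quoted above.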

🧪 Evaluation-Oriented Design

Character-Level Error Sensitivity
- Explicit focus on Character Error Rate (CER), critical for tonal languages
Strict Quantitative Metrics
- BLEU for sequence similarity
- WER for word-level correctness
- CER for fine-grained diacritic accuracy
Exact Match Awareness
- Tracks sentence-level exact correctness to expose error accumulation in long sequences
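
As a reference for how these metrics relate, here is a minimal sketch of CER, WER, and exact match built on Levenshtein edit distance. It uses the standard definitions (edits divided by reference length) and is not the project's evaluation code; BLEU is omitted because it needs n-gram counting and is usually taken from an existing library.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Normalize so that each accented letter counts as a single character.
    return unicodedata.normalize("NFC", text)


def cer(ref: str, hyp: str) -> float:
    ref, hyp = _nfc(ref), _nfc(hyp)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    ref_words, hyp_words = _nfc(ref).split(), _nfc(hyp).split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def exact_match(ref: str, hyp: str) -> bool:
    return _nfc(ref) == _nfc(hyp)


ref, hyp = "tôi đi học", "tôi đi hoc"   # one missing tone mark
print(cer(ref, hyp), wer(ref, hyp), exact_match(ref, hyp))
```

In this example a single missing diacritic costs one character out of ten (CER 0.1) but one word out of three (WER about 0.33) and fails exact match outright, which is why the three metrics are reported together.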

👤 Human-in-the-Loop & Explainability

Confidence-Aware Inference
- Each corrected token is associated with a confidence score
Visual Difference Highlighting
- Explicit marking of inserted, modified, and deleted tokens
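
One way to derive such markings is to align the input and output token sequences with Python's standard `difflib`. The sketch below is an assumption about how this could be done, not the project's implementation; `diff_tokens` and the label names are hypothetical.

```python
import difflib
from typing import List, Tuple

# Map difflib opcode tags to the labels used in the feature description.
LABELS = {"replace": "modified", "insert": "inserted", "delete": "deleted"}


def diff_tokens(source: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (label, source_span, corrected_span) for every changed span."""
    src, out = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens need no highlighting
        changes.append((LABELS[tag], " ".join(src[i1:i2]), " ".join(out[j1:j2])))
    return changes


for change in diff_tokens("hom nay toi di hoc", "hôm nay tôi đi học"):
    print(change)
```

A front end can then wrap each span in a color-coded element according to its label.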
Explainable Corrections
- Token-level probability inspection to support human verification
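
A per-token confidence score is typically the softmax probability that the decoder assigned to the token it emitted. The sketch below illustrates that idea with hand-written logits standing in for real decoder outputs; the function names, the toy vocabulary of three entries, and the 0.8 review threshold are all assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one decoding step."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(tokens: List[str],
                      step_logits: List[List[float]],
                      chosen_ids: List[int]) -> List[Tuple[str, float]]:
    """Pair each emitted token with the probability of its chosen vocabulary id."""
    return [(tok, softmax(logits)[idx])
            for tok, logits, idx in zip(tokens, step_logits, chosen_ids)]


def flag_for_review(scored: List[Tuple[str, float]],
                    threshold: float = 0.8) -> List[str]:
    """Tokens whose confidence falls below the (hypothetical) review threshold."""
    return [tok for tok, prob in scored if prob < threshold]


# Hypothetical logits: the first step is decisive, the second is ambiguous.
scored = token_confidences(["tôi", "đi"],
                           [[5.0, 0.0, 0.0], [1.0, 0.9, 0.0]],
                           [0, 0])
print(flag_for_review(scored))  # ['đi']
```

A high score only says the model was certain, not that it was right, so flagged tokens are a starting point for human review rather than a complete list of errors.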

📦 Tech Stack

Language: Python
Deep Learning: PyTorch
Modeling: Hugging Face Transformers, PEFT (LoRA)
Serving: Flask
Deployment: Docker & Docker Compose

🚀 Getting Started

To try the system without a local setup, you can access the hosted version directly via the link. Note that the hosting space may be inactive, so the service might be temporarily unavailable. To run the system locally, follow the steps below.

1. Clone the Repository

git clone https://github.com/yammdd/vietnamese-text-normalization.git
cd vietnamese-text-normalization

2. Model Download & Setup

Before running the system, you need to manually download the pretrained models and place them in the correct directory structure.

Download the following models: vietnamese-error-correction, vietnamese-diacritic-restoration-v2

Create the following folders inside the project root:

models/
├── vietnamese-error-correction/
└── vietnamese-diacritic-restoration-v2/
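
The folder layout above can also be created and checked with a short script. This is a convenience sketch, not part of the repository; the helper names are hypothetical, and it only verifies that each folder is non-empty, not that the downloaded model files are valid.

```python
from pathlib import Path
from typing import List

# Folder names taken from the layout described in this README.
MODEL_DIRS = ["vietnamese-error-correction", "vietnamese-diacritic-restoration-v2"]


def ensure_model_dirs(project_root: str) -> List[Path]:
    """Create models/<name>/ for each expected model and return the paths."""
    paths = []
    for name in MODEL_DIRS:
        path = Path(project_root) / "models" / name
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


def missing_model_files(project_root: str) -> List[str]:
    """Names of model folders that are absent or still empty."""
    missing = []
    for name in MODEL_DIRS:
        path = Path(project_root) / "models" / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    ensure_model_dirs(".")
    for name in missing_model_files("."):
        print(f"models/{name}/ is empty: download the model files into it")
```

Run it from the project root before building the Docker image so that missing models are caught early.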

3. Build and Run with Docker

Make sure Docker and Docker Compose are installed, then run:

docker-compose up -d --build

This command builds the image and starts the inference service in the background (the -d flag runs the containers detached).

4. Access the Application

Once the containers are running, open your browser at:

http://localhost:7860
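
If the page does not load, a quick reachability check can tell whether the container is serving yet. The sketch below uses only the Python standard library and assumes the service answers a plain GET on the root URL; `is_service_up` is a hypothetical helper, not something shipped with the project.

```python
import urllib.error
import urllib.request


def is_service_up(url: str = "http://localhost:7860", timeout: float = 3.0) -> bool:
    """Return True if a GET on the URL succeeds with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("service reachable" if is_service_up() else "service not reachable yet")
```

Model loading can take a while after the container starts, so a negative result right after startup does not necessarily indicate a failure.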

⚠ Intended Use & Disclaimer

This project is research-oriented and not production-ready.
Designed for studying:
- Vietnamese text normalization
- Error propagation in Seq2Seq models
- Character-level evaluation in tonal languages
Human verification of the output is strongly recommended.
A high confidence score does not guarantee that a correction is right.

👥 Contributors

- Thanh Dan Bui (Project Manager)
- Tien Dung Pham (Member)
- Quoc Hung Pham (Member)
