Vietnamese GPT-2: Multi-Stage Pretraining

A clean and reproducible GPT-2 pretraining pipeline for Vietnamese, trained in two stages:

Stage 1: Pretrain GPT-2 from random initialization on mixed Vietnamese corpora.
Stage 2: Continue pretraining on a curated 5-word quatrain poem corpus for style adaptation.

This repository includes data preparation, tokenizer training, multi-stage pretraining, and text generation scripts.

Repository Structure

vietnamese-gpt2/
├── src/
│   ├── config.py               # Central configuration: paths, datasets, hyperparameters
│   ├── utils.py                # Shared helpers for normalization, callbacks, generation
│   ├── train_tokenizer.py      # Train tokenizer from Vietnamese corpora
│   ├── train_1.py              # Stage 1 pretraining from scratch
│   ├── train_2.py              # Stage 2 continued pretraining on poem corpus
│   ├── generate_base.py        # Generate text with the stage-1 model
│   └── generate_poem.py        # Generate poem-style text with the stage-2 model
├── data_prep/
│   ├── news/download_datasets.py
│   ├── wiki/crawl_vi_wiki.py
│   ├── wiki/process_vi_wiki.py
│   ├── poem/crawl_poem.py
│   ├── poem/scrape_poem_content.py
│   ├── poem/prepare_poem_data.py
│   └── deduplicate.py
├── scripts/
│   ├── train_1.sh
│   └── train_2.sh
├── artifacts/                  # Tokenizer, checkpoints, logs, final models
└── data/                       # Stage-organized datasets

Data layout

data/
├── stage_1/
│   ├── raw/                    # Stage-1 raw inputs (JSONL/Parquet)
│   └── dedup/                  # Stage-1 deduplicated Parquet files + report
└── stage_2/
    ├── raw/                    # Poem metadata CSV + processed JSONL for training
    └── dedup/                  # Stage-2 deduplicated poem Parquet file

Training Overview

Stage 1: Base Language Pretraining

Train GPT-2 from random initialization on mixed Vietnamese corpora such as news and Wikipedia.

Stage 2: Domain Adaptation for Poetry

Continue pretraining the stage-1 model on a Vietnamese poem corpus to adapt it toward generating 5-word quatrains (four-line poems with five words per line).
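To make the stage-2 target format concrete, here is a minimal sketch of a check for the 5-word quatrain shape. It assumes that a "word" is one whitespace-separated syllable and that punctuation does not count. The names `count_words` and `is_five_word_quatrain` and the sample poem are illustrative only and are not taken from the repository.

```python
import string

# Punctuation stripped from token edges before counting (ASCII set plus a few common marks).
_PUNCT = string.punctuation + "…“”‘’"


def count_words(line: str) -> int:
    """Count whitespace-separated tokens, ignoring tokens that are only punctuation."""
    tokens = (tok.strip(_PUNCT) for tok in line.split())
    return sum(1 for tok in tokens if tok)


def is_five_word_quatrain(poem: str, words_per_line: int = 5, num_lines: int = 4) -> bool:
    """Return True if the poem has exactly four non-empty lines of five words each."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    return len(lines) == num_lines and all(
        count_words(ln) == words_per_line for ln in lines
    )


# Made-up sample used only to exercise the check.
sample = """Trăng lên đầu ngọn núi,
Gió thổi qua cánh đồng.
Sông dài trôi lặng lẽ,
Đêm về nghe tiếng chuông."""

print(is_five_word_quatrain(sample))
```

A filter like this could be applied while preparing the poem corpus, but whether the repository's poem scripts do so is not stated here.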

Requirements

Python 3.11+
CUDA-compatible GPU
flash-attn (optional; requires a compatible CUDA toolchain)
uv for environment and package management

Installation

Clone the repository and install dependencies:

git clone https://github.com/duongtruongbinh/vietnamese-gpt2
cd vietnamese-gpt2
uv sync
uv pip install -e .

Run all commands from the repository root.

Run with Docker (Trainer + REST API + Next.js UI)

1) Build all services

docker compose build

2) Start the chat application stack (backend + UI)

docker compose up backend ui

Next.js UI: http://localhost:3000
FastAPI backend: http://localhost:8000
Health endpoint: http://localhost:8000/health
The UI calls the API through NEXT_PUBLIC_API_BASE_URL=http://localhost:8000 so that the browser reaches the correct host.

3) Enter the trainer container (optional, for model training)

docker compose run --rm trainer

Inside the trainer container, run pipeline commands with uv run, for example:

uv run python src/train_tokenizer.py
bash scripts/train_1.sh

Notes:

./data and ./artifacts are mounted into the containers, so outputs persist on the host machine.

The backend uses MODEL_PATH=/app/artifacts/model_stage2 by default (configured in docker-compose.yml).

If the model files are not available yet, the backend returns a mock response so the UI can still be tested.

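For GPU training, run Docker with NVIDIA Container Toolkit support and make sure the trainer service has GPU access (with Docker Compose this is typically declared as a GPU device reservation for the service in docker-compose.yml), then start it with docker compose run --rm trainer.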
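The mock fallback mentioned in the notes above can be pictured with the following minimal sketch. It is not the backend's actual code: the function name `load_generator`, the reply format, and the use of the Hugging Face `pipeline` helper are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable

# Default taken from the notes above; the real value is set in docker-compose.yml.
MODEL_PATH = "/app/artifacts/model_stage2"


def load_generator(model_path: str = MODEL_PATH) -> Callable[[str], str]:
    """Return a prompt -> reply function, falling back to a mock if loading fails."""
    try:
        if not Path(model_path).is_dir():
            raise FileNotFoundError(model_path)
        # Imported lazily so the mock path works even without transformers installed.
        from transformers import pipeline

        pipe = pipeline("text-generation", model=model_path)

        def generate(prompt: str) -> str:
            return pipe(prompt, max_new_tokens=40)[0]["generated_text"]

        return generate
    except Exception:
        # Any failure (missing files, missing library, bad weights) yields the mock.
        def mock(prompt: str) -> str:
            return f"[mock reply] {prompt}"

        return mock


reply = load_generator("/path/that/does/not/exist")("xin chào")
print(reply)
```

Catching a broad exception keeps the UI usable during development, at the cost of hiding the reason for the failure, so logging the exception would be a sensible addition in a real service.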

Chat App Architecture

ui (Next.js)  --->  backend (FastAPI)  --->  local HuggingFace model (artifacts/model_stage2)

backend/app/main.py exposes POST /api/chat and GET /health.
ui/app/page.js provides a simple chat interface and calls the backend through NEXT_PUBLIC_API_BASE_URL.
If model loading fails, the backend automatically falls back to mock replies for development.
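For testing the backend without the UI, a small client along these lines may help. The endpoints POST /api/chat and GET /health come from the description above, but the request body is an assumption: the field name `message` is hypothetical, and the real schema is defined in backend/app/main.py.

```python
import json
import urllib.request

API_BASE_URL = "http://localhost:8000"


def build_chat_request(message: str, base_url: str = API_BASE_URL) -> urllib.request.Request:
    """Build a POST /api/chat request; the 'message' field is a hypothetical schema."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_health(base_url: str = API_BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.status == 200


# Usage, with the stack started via `docker compose up backend ui`:
#   if check_health():
#       with urllib.request.urlopen(build_chat_request("xin chào")) as resp:
#           print(resp.read().decode("utf-8"))
```

Only the standard library is used, so the client has no dependencies beyond Python itself.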

Pipeline

1. Prepare raw corpora

uv run python data_prep/news/download_datasets.py
uv run python data_prep/wiki/crawl_vi_wiki.py
uv run python data_prep/wiki/process_vi_wiki.py

2. Train the tokenizer

uv run python src/train_tokenizer.py

3. Deduplicate the pretraining data

uv run python data_prep/deduplicate.py
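The method used by data_prep/deduplicate.py is not described here, so the following is only a sketch of one common approach: exact deduplication by hashing normalized text. The function names and the normalization rules (Unicode NFC, lowercasing, whitespace collapsing) are assumptions for illustration.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize_text(text: str) -> str:
    """NFC-normalize, lowercase, and collapse whitespace so trivial variants match."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, keeping the first occurrence of every normalized text."""
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha1(normalize_text(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


docs = ["Hà Nội là thủ đô.", "hà nội  là thủ đô.", "Huế nằm ở miền Trung."]
unique = list(deduplicate(docs))
print(len(unique))  # 2
```

NFC normalization matters for Vietnamese because the same accented letter can be encoded in composed or decomposed form. Near-duplicate detection (for example MinHash) would be a separate technique and is not shown.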

4. Run stage 1 pretraining

bash scripts/train_1.sh

5. Prepare the poem corpus

uv run python data_prep/poem/crawl_poem.py
uv run python data_prep/poem/scrape_poem_content.py
uv run python data_prep/poem/prepare_poem_data.py
uv run python data_prep/deduplicate_poem.py

6. Run stage 2 continued pretraining

bash scripts/train_2.sh

Text Generation

Generate text with the base model:

uv run python src/generate_base.py

Generate poem-style text with the stage-2 model:

uv run python src/generate_poem.py

Configuration

All important paths and hyperparameters are managed in:

src/config.py

This includes:

Dataset paths
Tokenizer directory
Checkpoint directory
Sequence length
Batch size
Learning rate
Training budget
Logging and runtime settings
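As an illustration of how these settings relate to one another, the sketch below derives the number of optimizer steps from a token budget. It does not reproduce src/config.py: the class name, field names, and every value are placeholders, and reading "training budget" as a token count is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # All values are placeholders, not the repository's settings.
    seq_len: int = 1024
    per_device_batch_size: int = 16
    grad_accum_steps: int = 4
    num_devices: int = 1
    learning_rate: float = 3e-4
    token_budget: int = 1_000_000_000

    @property
    def tokens_per_step(self) -> int:
        """Tokens consumed by one optimizer step across all devices."""
        return (
            self.seq_len
            * self.per_device_batch_size
            * self.grad_accum_steps
            * self.num_devices
        )

    @property
    def total_steps(self) -> int:
        """Optimizer steps needed to cover the token budget, rounded up."""
        return math.ceil(self.token_budget / self.tokens_per_step)


cfg = TrainConfig()
print(cfg.tokens_per_step)  # 65536
print(cfg.total_steps)      # 15259
```

Keeping derived quantities as properties avoids storing values that can drift out of sync when the batch size or sequence length changes.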

Outputs

Training artifacts are stored under:

artifacts/

Typical outputs include:

Trained tokenizer
Intermediate checkpoints
Final stage-1 model
Final stage-2 model
Training logs
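When resuming training or choosing a model to generate from, it can be handy to locate the most recent intermediate checkpoint. The sketch below assumes checkpoints are saved as subdirectories named checkpoint-<step>, which is the Hugging Face Trainer convention. Whether this repository follows that naming is an assumption, and `latest_checkpoint` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(run_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CKPT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demonstration on a throwaway directory rather than the real artifacts/ folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (90, 500, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    newest = latest_checkpoint(tmp)
    print(newest.name)  # checkpoint-1000
```

Comparing the step as an integer matters here: a plain string sort would rank checkpoint-90 above checkpoint-1000.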

Notes

Stage 1 is intended for general Vietnamese language modeling.
Stage 2 is intended for style adaptation, not full instruction tuning.
For best results, ensure corpus quality and deduplication are completed before training.
A GPU is strongly recommended for both tokenizer experimentation and model training.

Project Goal

This project aims to provide a simple, practical, and extensible foundation for training Vietnamese GPT-2 models from scratch and adapting them to specific text styles such as poetry.
