A clean and reproducible GPT-2 pretraining pipeline for Vietnamese, trained in two stages:
- Stage 1: Pretrain GPT-2 from random initialization on mixed Vietnamese corpora.
- Stage 2: Continue pretraining on a curated 5-word quatrain poem corpus for style adaptation.
This repository includes data preparation, tokenizer training, multi-stage pretraining, and text generation scripts.
vietnamese-gpt2/
├── src/
│ ├── config.py # Central configuration: paths, datasets, hyperparameters
│ ├── utils.py # Shared helpers for normalization, callbacks, generation
│ ├── train_tokenizer.py # Train tokenizer from Vietnamese corpora
│ ├── train_1.py # Stage 1 pretraining from scratch
│ ├── train_2.py # Stage 2 continued pretraining on poem corpus
│ ├── generate_base.py # Generate text with the stage-1 model
│ └── generate_poem.py # Generate poem-style text with the stage-2 model
├── data_prep/
│ ├── news/download_datasets.py
│ ├── wiki/crawl_vi_wiki.py
│ ├── wiki/process_vi_wiki.py
│ ├── poem/crawl_poem.py
│ ├── poem/scrape_poem_content.py
│ ├── poem/prepare_poem_data.py
│ └── deduplicate.py
├── scripts/
│ ├── train_1.sh
│ └── train_2.sh
├── artifacts/ # Tokenizer, checkpoints, logs, final models
└── data/ # Stage-organized datasets
data/
├── stage_1/
│ ├── raw/ # Stage-1 raw inputs (JSONL/Parquet)
│ └── dedup/ # Stage-1 deduplicated parquets + report
└── stage_2/
├── raw/ # Poem metadata CSV + processed jsonl for training
└── dedup/ # Stage-2 deduplicated poem parquet
Train GPT-2 from random initialization on mixed Vietnamese corpora such as news and Wikipedia.
Continue pretraining the stage-1 model on a Vietnamese poem corpus to adapt the model toward 5-word quatrain generation.
- Python 3.11+
- CUDA-compatible GPU
flash-attn(optional; requires a compatible CUDA toolchain)- uv for environment and package management
Clone the repository and install dependencies:
git clone https://github.com/duongtruongbinh/vietnamese-gpt2
cd vietnamese-gpt2
uv sync
uv pip install -e .Run all commands from the repository root.
uv run python data_prep/news/download_datasets.py
uv run python data_prep/wiki/crawl_vi_wiki.py
uv run python data_prep/wiki/process_vi_wiki.pyuv run python src/train_tokenizer.pyuv run python data_prep/deduplicate.pybash scripts/train_1.shuv run python data_prep/poem/crawl_poem.py
uv run python data_prep/poem/scrape_poem_content.py
uv run python data_prep/poem/prepare_poem_data.py
uv run python data_prep/deduplicate_poem.pybash scripts/train_2.shGenerate text with the base model:
uv run python src/generate_base.pyGenerate poem-style text with the stage-2 model:
uv run python src/generate_poem.pyAll important paths and hyperparameters are managed in:
src/config.py
This includes:
- Dataset paths
- Tokenizer directory
- Checkpoint directory
- Sequence length
- Batch size
- Learning rate
- Training budget
- Logging and runtime settings
Training artifacts are stored under:
artifacts/
Typical outputs include:
- Trained tokenizer
- Intermediate checkpoints
- Final stage-1 model
- Final stage-2 model
- Training logs
- Stage 1 is intended for general Vietnamese language modeling.
- Stage 2 is intended for style adaptation, not full instruction tuning.
- For best results, ensure corpus quality and deduplication are completed before training.
- A GPU is strongly recommended for both tokenizer experimentation and model training.
This project aims to provide a simple, practical, and extensible foundation for training Vietnamese GPT-2 models from scratch and adapting them to specific text styles such as poetry