Vietnamese GPT-2: Multi-Stage Pretraining

A clean and reproducible GPT-2 pretraining pipeline for Vietnamese, trained in two stages:

Stage 1: Pretrain GPT-2 from random initialization on mixed Vietnamese corpora.
Stage 2: Continue pretraining on a curated corpus of quatrains with five words per line, for style adaptation.

This repository includes data preparation, tokenizer training, multi-stage pretraining, and text generation scripts.

Repository Structure

vietnamese-gpt2/
├── src/
│   ├── config.py               # Central configuration: paths, datasets, hyperparameters
│   ├── utils.py                # Shared helpers for normalization, callbacks, generation
│   ├── train_tokenizer.py      # Train tokenizer from Vietnamese corpora
│   ├── train_1.py              # Stage 1 pretraining from scratch
│   ├── train_2.py              # Stage 2 continued pretraining on poem corpus
│   ├── generate_base.py        # Generate text with the stage-1 model
│   └── generate_poem.py        # Generate poem-style text with the stage-2 model
├── data_prep/
│   ├── news/download_datasets.py
│   ├── wiki/crawl_vi_wiki.py
│   ├── wiki/process_vi_wiki.py
│   ├── poem/crawl_poem.py
│   ├── poem/scrape_poem_content.py
│   ├── poem/prepare_poem_data.py
│   └── deduplicate.py
├── scripts/
│   ├── train_1.sh
│   └── train_2.sh
├── artifacts/                  # Tokenizer, checkpoints, logs, final models
└── data/                       # Stage-organized datasets

Data Layout

data/
├── stage_1/
│   ├── raw/                    # Stage-1 raw inputs (JSONL/Parquet)
│   └── dedup/                  # Stage-1 deduplicated parquets + report
└── stage_2/
    ├── raw/                    # Poem metadata CSV + processed JSONL for training
    └── dedup/                  # Stage-2 deduplicated poem parquet

Training Overview

Stage 1: Base Language Pretraining

Train GPT-2 from random initialization on mixed Vietnamese corpora such as news and Wikipedia.

Stage 2: Domain Adaptation for Poetry

Continue pretraining the stage-1 model on a Vietnamese poem corpus to adapt it toward generating quatrains with five words per line.

Requirements

Python 3.11+
CUDA-compatible GPU
flash-attn (optional; requires a compatible CUDA toolchain)
uv for environment and package management

Installation

Clone the repository and install dependencies:

git clone https://github.com/duongtruongbinh/vietnamese-gpt2
cd vietnamese-gpt2
uv sync
uv pip install -e .

Run all commands from the repository root.

Pipeline

1. Prepare raw corpora

uv run python data_prep/news/download_datasets.py
uv run python data_prep/wiki/crawl_vi_wiki.py
uv run python data_prep/wiki/process_vi_wiki.py

2. Train the tokenizer

uv run python src/train_tokenizer.py

3. Deduplicate the pretraining data

uv run python data_prep/deduplicate.py
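
To illustrate what a deduplication pass can look like, here is a minimal sketch of exact-match deduplication over normalized text. It is not taken from data_prep/deduplicate.py, which may use a different method (for example near-duplicate detection); the function names `normalize` and `dedup_exact` are hypothetical.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, lowercase, and collapse whitespace.

    Vietnamese diacritics can be encoded in composed or decomposed form,
    so NFC normalization makes visually identical strings compare equal.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def dedup_exact(docs):
    """Drop documents whose normalized text was already seen.

    Keeps the first occurrence of each document and preserves input order.
    This is exact-hash deduplication only (an assumption for illustration).
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalized text rather than storing it keeps memory use proportional to the number of documents, not their total length.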

4. Run stage 1 pretraining

bash scripts/train_1.sh
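
Pretraining from scratch usually feeds the model fixed-length blocks of token IDs. The sketch below shows one common way to build them: concatenate tokenized documents, separate them with an end-of-sequence ID, and cut the stream into blocks. Whether train_1.py packs sequences this way is an assumption; `pack_into_blocks`, `block_size`, and `eos_id` are hypothetical names.

```python
def pack_into_blocks(token_streams, block_size, eos_id):
    """Concatenate tokenized documents and split them into fixed-size blocks.

    token_streams: iterable of lists of token IDs, one list per document.
    block_size:    number of tokens per training example.
    eos_id:        ID appended after each document as a separator.

    The trailing partial block is dropped, so every returned block has
    exactly block_size tokens.
    """
    buffer = []
    blocks = []
    for ids in token_streams:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks
```

Packing avoids padding tokens, so every position in a batch contributes to the language-modeling loss.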

5. Prepare the poem corpus

uv run python data_prep/poem/crawl_poem.py
uv run python data_prep/poem/scrape_poem_content.py
uv run python data_prep/poem/prepare_poem_data.py
uv run python data_prep/deduplicate_poem.py
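
As an illustration of how a poem corpus can be restricted to the target form, the sketch below checks whether a text is a quatrain with five words per line. It is an assumption about what such a filter might look like, not the logic of prepare_poem_data.py; `is_five_word_quatrain` is a hypothetical name and the sample poem is made up for the example.

```python
def is_five_word_quatrain(poem: str) -> bool:
    """Return True if the poem has exactly 4 non-empty lines of 5 words each.

    Words are whitespace-separated tokens, which matches written Vietnamese
    where each syllable is delimited by a space. Punctuation attached to a
    word does not change the count.
    """
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    return len(lines) == 4 and all(len(line.split()) == 5 for line in lines)


# Made-up sample text, used only to exercise the check.
sample = (
    "Trăng lên trên đỉnh núi\n"
    "Gió thổi qua cánh đồng\n"
    "Sông trôi về biển lớn\n"
    "Mây bay giữa trời xanh"
)
print(is_five_word_quatrain(sample))
```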

6. Run stage 2 continued pretraining

bash scripts/train_2.sh

Text Generation

Generate text with the base model:

uv run python src/generate_base.py

Generate poem-style text with the stage-2 model:

uv run python src/generate_poem.py

Configuration

All important paths and hyperparameters are managed in:

src/config.py

This includes:

Dataset paths
Tokenizer directory
Checkpoint directory
Sequence length
Batch size
Learning rate
Training budget
Logging and runtime settings
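
For orientation, a central configuration of this kind could be expressed as a frozen dataclass. The sketch below is hypothetical: the class name, field names, the checkpoint path, and all numeric values are placeholders chosen for illustration and are not read from src/config.py.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical per-stage settings; field names are illustrative only."""

    data_dir: Path          # Dataset path
    tokenizer_dir: Path     # Tokenizer directory
    checkpoint_dir: Path    # Checkpoint directory
    seq_len: int            # Sequence length in tokens
    batch_size: int         # Sequences per optimizer step
    learning_rate: float
    max_steps: int          # Training budget
    logging_steps: int      # Log every N steps

    def tokens_per_step(self) -> int:
        """Number of tokens consumed by one optimizer step."""
        return self.seq_len * self.batch_size


# Placeholder values, not the repository's actual settings.
example = StageConfig(
    data_dir=Path("data/stage_1/dedup"),
    tokenizer_dir=Path("artifacts/tokenizer"),
    checkpoint_dir=Path("artifacts/checkpoints"),  # assumed subdirectory
    seq_len=1024,
    batch_size=8,
    learning_rate=3e-4,
    max_steps=1000,
    logging_steps=50,
)
```

Keeping the dataclass frozen prevents a training script from silently mutating shared settings at runtime.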

Outputs

Training artifacts are stored under:

artifacts/

Typical outputs include:

Trained tokenizer
Intermediate checkpoints
Final stage-1 model
Final stage-2 model
Training logs

Notes

Stage 1 is intended for general Vietnamese language modeling.
Stage 2 is intended for style adaptation, not full instruction tuning.
For best results, ensure corpus quality and deduplication are completed before training.
A GPU is strongly recommended for both tokenizer experimentation and model training.

Project Goal

This project aims to provide a simple, practical, and extensible foundation for training Vietnamese GPT-2 models from scratch and adapting them to specific text styles such as poetry.
