A research-focused implementation of GPT-style decoder-only language models (GPT-2 / GPT-3 style) with:
- model architectures implemented cleanly from scratch
- reproducible CLI workflows for train/compare/ablation/plot
- multiple dataset loaders (OpenWebText, WikiText, TinyStories, C4, Pile, RedPajama, local corpora)
- YAML-based experiment packaging
- simple Airflow orchestration for professional, reproducible pipelines
- Core models: `GPT2` and `GPT3`, implemented from scratch in `src/model/`
- Research variants: `RMSNorm`, `SwiGLU`, RoPE positional encoding, SDPA attention backend, gradient checkpointing
- Research CLIs: `scripts/train.py`, `scripts/compare_models.py`, `scripts/run_ablations.py`, `scripts/plot_results.py`, `scripts/run_pipeline_config.py`
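As a rough illustration of one of the research variants above: RMSNorm rescales activations by their root-mean-square instead of subtracting a mean and adding a bias as LayerNorm does. The repo's actual module lives in `src/model/`; this standalone pure-Python sketch is only meant to show the computation.

```python
import math

def rms_norm(x, gain=None, eps=1e-8):
    """Minimal RMSNorm sketch (not the repo's module): scale x by 1/RMS(x),
    then apply a learned per-element gain. Unlike LayerNorm, no mean is
    subtracted and no bias is added."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

# With unit gain, the output vector has root-mean-square ~= 1.
out = rms_norm([3.0, -4.0])  # input RMS is sqrt((9 + 16) / 2) ~= 3.536
```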
```
.
├── configs/              # YAML configs for train/compare/ablation/plot/pipeline
├── orchestration/
│   └── airflow/
│       ├── dags/         # Airflow DAGs
│       └── README.md     # Airflow usage notes
├── scripts/              # Project CLIs
├── src/
│   ├── data/             # Dataset registry + dataloaders
│   ├── inference/        # Generation/sampling utils
│   ├── model/            # Attention, blocks, GPT2/GPT3
│   ├── research/         # Experiment runner + plotting + pipeline config runner
│   └── training/         # Training loop, optimizer, AMP helpers
├── tests/
├── requirements.txt
└── README.md
```
Set up a virtual environment and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Notes:

- `matplotlib` is included for plotting.
- Datasets are downloaded only when you run a loader/experiment command.
List all registered datasets:

```bash
python scripts/list_datasets.py
```

Current dataset keys:

- `small`: OpenWebText-10K
- `large`: WikiText-103
- `wikitext2`: WikiText-2
- `tinystories`: TinyStories
- `openwebtext`: OpenWebText
- `c4_en`: C4 English
- `pile`: The Pile (uncopyrighted)
- `redpajama`: RedPajama 1T sample
- `local_jsonl`: local JSONL/TXT corpora
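The real registry lives in `src/data/`; as a hypothetical sketch (names and structure are assumptions, not the repo's API), a key-to-loader mapping like the one `scripts/list_datasets.py` enumerates could look like this:

```python
# Hypothetical sketch of a key -> loader registry; the repo's actual
# implementation in src/data/ may differ.
DATASET_REGISTRY = {}

def register_dataset(key, description):
    """Decorator mapping a short CLI key to a loader function."""
    def wrap(fn):
        DATASET_REGISTRY[key] = {"description": description, "loader": fn}
        return fn
    return wrap

@register_dataset("small", "OpenWebText-10K")
def load_small():
    ...  # download/tokenize lazily on first use, as noted above

@register_dataset("tinystories", "TinyStories")
def load_tinystories():
    ...

def list_datasets():
    """Return key -> human-readable description, as a listing CLI might print."""
    return {k: v["description"] for k, v in DATASET_REGISTRY.items()}
```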
Inspect dataloader shapes quickly:

```bash
python scripts/build_dataloaders.py --dataset tinystories --preview-batches 2
```

For local corpora (`local_jsonl`), configure environment variables:

```bash
export LOCAL_TEXT_DATA_PATH=/path/to/corpus.jsonl
export LOCAL_TEXT_DATA_FORMAT=json   # json | text
export LOCAL_TEXT_FIELD=text         # only for json
```

Train a research-configured GPT-3-style model:

```bash
python scripts/train.py \
  --dataset small \
  --model-version gpt3 \
  --model-preset small \
  --block-size 256 \
  --batch-size 32 \
  --epochs 3 \
  --max-steps 2000 \
  --norm-type rmsnorm \
  --mlp-type swiglu \
  --pos-encoding rope \
  --attention-impl sdpa \
  --gradient-checkpointing \
  --val-checking \
  --run-name gpt3_research_small
```

Compare models across seeds:

```bash
python scripts/compare_models.py \
  --dataset small \
  --models gpt2 gpt3 \
  --seeds 42 43 44 \
  --n-layer 8 --n-head 8 --d-model 512 \
  --epochs 3 --max-steps 2000 \
  --val-checking \
  --experiment-name gpt2_vs_gpt3_small_fair
```

Outputs include `results.jsonl` and `results.csv`.
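With one run per line in `results.jsonl`, aggregating per-seed runs by model takes only the standard library. This is a sketch, not the repo's code, and the record field names (`model`, `val_loss_best`) are assumptions based on the CLI flags, not a documented schema:

```python
import json
import statistics
from collections import defaultdict

def summarize(results_path):
    """Group per-seed runs by model and report (mean, stdev) of best val loss.

    Field names "model" and "val_loss_best" are assumptions, not a
    documented schema.
    """
    by_model = defaultdict(list)
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            by_model[rec["model"]].append(rec["val_loss_best"])
    return {
        m: (statistics.mean(v), statistics.stdev(v) if len(v) > 1 else 0.0)
        for m, v in by_model.items()
    }
```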
Run an ablation:

```bash
python scripts/run_ablations.py \
  --model-version gpt3 \
  --dataset small \
  --ablation-axis norm_type \
  --ablation-values layernorm rmsnorm \
  --seeds 42 43 \
  --n-layer 8 --n-head 8 --d-model 512 \
  --epochs 3 --max-steps 2000 \
  --val-checking \
  --experiment-name norm_ablation_gpt3_small
```

Compare plot:

```bash
python scripts/plot_results.py \
  --results research_runs/compare/gpt2_vs_gpt3_small_fair/results.jsonl \
  --kind compare \
  --metric val_loss_best \
  --output-dir research_runs/plots
```

Ablation plot:

```bash
python scripts/plot_results.py \
  --results research_runs/ablations/norm_ablation_gpt3_small/results.jsonl \
  --kind ablation \
  --metric val_loss_best \
  --output-dir research_runs/plots
```

Generate text from a checkpoint:

```bash
python scripts/generate.py \
  --checkpoint checkpoints/gpt3_research_small.last.pt \
  --tokenizer-path owt10k_tokenizer.json \
  --use-ckpt-config \
  --prompt "What's your name?" \
  --strategy topk \
  --top-k 50 \
  --max-new-tokens 64
```

The repo includes ready-to-run YAML configs:

- `configs/train/gpt3_research_small.yaml`
- `configs/compare/gpt2_vs_gpt3_small_fair.yaml`
- `configs/ablation/norm_ablation_gpt3_small.yaml`
- `configs/plot/compare_plot.yaml`
- `configs/plot/ablation_plot.yaml`
- `configs/pipeline/research_small_repro.yaml`
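As a rough idea of what a train config might contain, here is one plausible shape mirroring the `scripts/train.py` flags shown above. The actual keys in `configs/train/*.yaml` may be named differently; check the shipped files before relying on this.

```yaml
# Hypothetical shape for a train config such as
# configs/train/gpt3_research_small.yaml; keys mirror the CLI flags above
# and are assumptions, not the repo's documented schema.
dataset: small
model_version: gpt3
model_preset: small
block_size: 256
batch_size: 32
epochs: 3
max_steps: 2000
norm_type: rmsnorm
mlp_type: swiglu
pos_encoding: rope
attention_impl: sdpa
gradient_checkpointing: true
val_checking: true
run_name: gpt3_research_small
```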
Run the whole pipeline with one command:

```bash
python scripts/run_pipeline_config.py --config configs/pipeline/research_small_repro.yaml
```

Dry-run (only print commands):

```bash
python scripts/run_pipeline_config.py --config configs/pipeline/research_small_repro.yaml --dry-run
```

Run specific step(s):

```bash
python scripts/run_pipeline_config.py \
  --config configs/pipeline/research_small_repro.yaml \
  --step compare \
  --step plot_compare
```

The runner stores execution reports as JSON under `run_dir`.
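Conceptually, a config-driven runner expands named steps into CLI commands and, under `--dry-run`, prints them instead of executing. The sketch below illustrates that pattern only; the step names and command layout are assumptions, not the schema `scripts/run_pipeline_config.py` actually uses:

```python
# Illustrative dry-run pipeline runner; the real config schema may differ.
def build_commands(pipeline, only_steps=None):
    """Expand a step-name -> argv mapping into (name, argv) pairs to run.

    `only_steps` mirrors the --step flag: restrict execution to named steps.
    """
    commands = []
    for name, argv in pipeline.items():
        if only_steps and name not in only_steps:
            continue
        commands.append((name, argv))
    return commands

pipeline = {
    "train": ["python", "scripts/train.py", "--dataset", "small"],
    "compare": ["python", "scripts/compare_models.py", "--dataset", "small"],
    "plot_compare": ["python", "scripts/plot_results.py", "--kind", "compare"],
}

# Dry-run: print each command instead of executing it.
for name, argv in build_commands(pipeline, only_steps={"compare", "plot_compare"}):
    print(f"[dry-run] {name}: {' '.join(argv)}")
```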
For a professional, reproducible run, keep these artifacts per pipeline execution:

- pipeline YAML used (`configs/pipeline/*.yaml`)
- per-stage YAMLs (`configs/train|compare|ablation|plot/*.yaml`)
- metrics (`results.jsonl`, `results.csv`)
- plots (`research_runs/plots/*.png`)
- execution report (`run_dir/execution_*.json`)
- environment metadata (Python, PyTorch, CUDA, git commit)
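One way to capture the environment metadata listed above is a small helper like the sketch below (not the repo's code). The `git` call and the `torch` import are guarded so the helper degrades to `"unknown"` when either is unavailable:

```python
import platform
import subprocess

def environment_metadata():
    """Collect environment facts for reproducibility; fields degrade to
    'unknown' when git or torch are unavailable."""
    meta = {"python": platform.python_version()}
    try:  # git commit of the working tree, if inside a repo with git installed
        meta["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except Exception:
        meta["git_commit"] = "unknown"
    try:  # PyTorch / CUDA versions, if torch is installed
        import torch
        meta["torch"] = torch.__version__
        meta["cuda"] = torch.version.cuda or "cpu-only"
    except Exception:
        meta["torch"] = meta["cuda"] = "unknown"
    return meta
```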
DAG location: `orchestration/airflow/dags/gpt_research_pipeline.py`

Default pipeline config used by the DAG: `configs/pipeline/research_small_repro.yaml`

Environment variables:

```bash
export AIRFLOW_HOME=$PWD/.airflow
export PYTHONPATH=$PWD
export GPT_PIPELINE_CONFIG=$PWD/configs/pipeline/research_small_repro.yaml
export GPT_PYTHON_BIN=python
```

Then trigger the `gpt_research_pipeline` DAG from your Airflow instance. The DAG calls the same project CLIs through `scripts/run_pipeline_config.py`, keeping local and orchestrated behavior aligned.
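A DAG-side wrapper for those variables might resolve them with fallbacks along these lines. This is only a sketch of the pattern; the defaults shown and the function name are assumptions, not the actual DAG code:

```python
import os

def resolve_pipeline_settings(environ=None):
    """Resolve DAG settings from GPT_* environment variables with defaults.

    The defaults here are assumptions mirroring the exports above.
    """
    env = os.environ if environ is None else environ
    return {
        "config": env.get(
            "GPT_PIPELINE_CONFIG", "configs/pipeline/research_small_repro.yaml"
        ),
        "python_bin": env.get("GPT_PYTHON_BIN", "python"),
    }
```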
Run the tests:

```bash
python3 -m pytest -s --capture=no
```

If you use this code, consider citing the GPT-2 and GPT-3 papers:

```bibtex
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  journal={OpenAI Technical Report},
  year={2019}
}

@article{brown2020language,
  title={Language Models are Few-Shot Learners},
  author={Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and others},
  journal={Advances in Neural Information Processing Systems},
  year={2020}
}
```

MIT. See `LICENSE`.