GPT Research Suite (From Scratch)


A research-focused implementation of GPT-style decoder-only language models (GPT-2 / GPT-3) with:

  • clean architecture code written from scratch
  • reproducible CLI workflows for train/compare/ablation/plot
  • multiple dataset loaders (OpenWebText, WikiText, TinyStories, C4, The Pile, RedPajama, local corpora)
  • YAML-based experiment packaging
  • simple Airflow orchestration for reproducible pipelines

What Is Included

  • Core models: GPT2 and GPT3, implemented from scratch in src/model/
  • Research variants (RMSNorm and SwiGLU are sketched just below):
    • RMSNorm
    • SwiGLU
    • RoPE positional encoding
    • SDPA attention backend
    • gradient checkpointing
  • Research CLIs:
    • scripts/train.py
    • scripts/compare_models.py
    • scripts/run_ablations.py
    • scripts/plot_results.py
    • scripts/run_pipeline_config.py
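
The variants above are standard drop-in replacements for the vanilla GPT-2 components. As a point of reference, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward layer; class names and shapes are illustrative and do not necessarily match the code in src/model/.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean-centering or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: silu(x @ W_gate) * (x @ W_up), projected back to d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))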

Repository Layout

.
├── configs/                      # YAML configs for train/compare/ablation/plot/pipeline
├── orchestration/
│   └── airflow/
│       ├── dags/                # Airflow DAGs
│       └── README.md            # Airflow usage notes
├── scripts/                     # Project CLIs
├── src/
│   ├── data/                    # Dataset registry + dataloaders
│   ├── inference/               # Generation/sampling utils
│   ├── model/                   # Attention, blocks, GPT2/GPT3
│   ├── research/                # Experiment runner + plotting + pipeline config runner
│   └── training/                # Training loop, optimizer, AMP helpers
├── tests/
├── requirements.txt
└── README.md

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Notes:

  • matplotlib is included for plotting.
  • Datasets are downloaded only when you run a loader/experiment command.

Dataset Options

List all registered datasets:

python scripts/list_datasets.py

Current dataset keys (a sketch of the key-to-source mapping follows the list):

  • small: OpenWebText-10K
  • large: WikiText-103
  • wikitext2: WikiText-2
  • tinystories: TinyStories
  • openwebtext: OpenWebText
  • c4_en: C4 English
  • pile: The Pile (uncopyrighted)
  • redpajama: RedPajama 1T sample
  • local_jsonl: local JSONL/TXT corpora
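
Each key maps to a Hugging Face dataset (or, for local_jsonl, to a path you provide). A hypothetical sketch of what such a registry looks like, shown only to illustrate the mapping; the real registry and its exact source names live in src/data/:

# Hypothetical key -> (dataset name, config) registry; the actual one lives in src/data/.
DATASET_REGISTRY = {
    "small": ("stas/openwebtext-10k", None),
    "large": ("wikitext", "wikitext-103-raw-v1"),
    "wikitext2": ("wikitext", "wikitext-2-raw-v1"),
    "tinystories": ("roneneldan/TinyStories", None),
}

def load_texts(key: str, split: str = "train"):
    from datasets import load_dataset  # Hugging Face `datasets`
    name, config = DATASET_REGISTRY[key]
    return load_dataset(name, config, split=split)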

Inspect dataloader shapes quickly:

python scripts/build_dataloaders.py --dataset tinystories --preview-batches 2

For local corpora (local_jsonl), configure env vars:

export LOCAL_TEXT_DATA_PATH=/path/to/corpus.jsonl
export LOCAL_TEXT_DATA_FORMAT=json   # json | text
export LOCAL_TEXT_FIELD=text         # only for json
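
As a rough sketch of how these variables might be consumed (the real loader lives in src/data/; the function name here is an assumption):

import json
import os

def iter_local_corpus():
    # Hypothetical reader mirroring the env vars above.
    path = os.environ["LOCAL_TEXT_DATA_PATH"]
    fmt = os.environ.get("LOCAL_TEXT_DATA_FORMAT", "json")
    field = os.environ.get("LOCAL_TEXT_FIELD", "text")
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # JSONL: pull the configured text field; plain text: yield the line itself.
            yield json.loads(line)[field] if fmt == "json" else line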

CLI Workflows

1. Train

python scripts/train.py \
  --dataset small \
  --model-version gpt3 \
  --model-preset small \
  --block-size 256 \
  --batch-size 32 \
  --epochs 3 \
  --max-steps 2000 \
  --norm-type rmsnorm \
  --mlp-type swiglu \
  --pos-encoding rope \
  --attention-impl sdpa \
  --gradient-checkpointing \
  --val-checking \
  --run-name gpt3_research_small

2. Fair GPT-2 vs GPT-3 Comparison

python scripts/compare_models.py \
  --dataset small \
  --models gpt2 gpt3 \
  --seeds 42 43 44 \
  --n-layer 8 --n-head 8 --d-model 512 \
  --epochs 3 --max-steps 2000 \
  --val-checking \
  --experiment-name gpt2_vs_gpt3_small_fair

Outputs include results.jsonl and results.csv.
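
Each line of results.jsonl is one run's record, so per-seed results can be aggregated directly. A quick sketch with pandas (the column names "model" and "val_loss_best" are assumptions; inspect the file to confirm):

import pandas as pd

df = pd.read_json(
    "research_runs/compare/gpt2_vs_gpt3_small_fair/results.jsonl", lines=True)
# Mean and spread of the best validation loss across seeds, per model.
print(df.groupby("model")["val_loss_best"].agg(["mean", "std"]))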

3. Ablations

python scripts/run_ablations.py \
  --model-version gpt3 \
  --dataset small \
  --ablation-axis norm_type \
  --ablation-values layernorm rmsnorm \
  --seeds 42 43 \
  --n-layer 8 --n-head 8 --d-model 512 \
  --epochs 3 --max-steps 2000 \
  --val-checking \
  --experiment-name norm_ablation_gpt3_small

4. Plot Results

Compare plot:

python scripts/plot_results.py \
  --results research_runs/compare/gpt2_vs_gpt3_small_fair/results.jsonl \
  --kind compare \
  --metric val_loss_best \
  --output-dir research_runs/plots

Ablation plot:

python scripts/plot_results.py \
  --results research_runs/ablations/norm_ablation_gpt3_small/results.jsonl \
  --kind ablation \
  --metric val_loss_best \
  --output-dir research_runs/plots

5. Generate Text

python scripts/generate.py \
  --checkpoint checkpoints/gpt3_research_small.last.pt \
  --tokenizer-path owt10k_tokenizer.json \
  --use-ckpt-config \
  --prompt "What's your name?" \
  --strategy topk \
  --top-k 50 \
  --max-new-tokens 64
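
Top-k sampling keeps only the k most likely next tokens and renormalizes before drawing. A minimal sketch of one decoding step (the project's own sampling utilities live in src/inference/):

import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    # `logits`: model output for the last position, shape (batch, vocab_size).
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)      # keep the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)                  # renormalize over the survivors
    choice = torch.multinomial(probs, num_samples=1)          # sample a slot in the top-k
    return topk_idx.gather(-1, choice)                        # map back to vocabulary ids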

Config-Driven Reproducibility (YAML)

The repo includes ready-to-run YAML configs:

  • configs/train/gpt3_research_small.yaml
  • configs/compare/gpt2_vs_gpt3_small_fair.yaml
  • configs/ablation/norm_ablation_gpt3_small.yaml
  • configs/plot/compare_plot.yaml
  • configs/plot/ablation_plot.yaml
  • configs/pipeline/research_small_repro.yaml

Run the whole pipeline with one command:

python scripts/run_pipeline_config.py --config configs/pipeline/research_small_repro.yaml

Dry run (print the commands without executing them):

python scripts/run_pipeline_config.py --config configs/pipeline/research_small_repro.yaml --dry-run

Run specific step(s):

python scripts/run_pipeline_config.py \
  --config configs/pipeline/research_small_repro.yaml \
  --step compare \
  --step plot_compare

The runner stores execution reports in JSON under run_dir.
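
Conceptually, the runner just walks the configured steps in order and shells out to the per-stage CLIs, recording exit codes. A simplified sketch of that pattern (the step schema shown here is an assumption; the actual runner lives in src/research/):

import subprocess
import sys

import yaml

def run_pipeline(config_path: str, only_steps=None, dry_run=False):
    cfg = yaml.safe_load(open(config_path))
    report = []
    for step in cfg["steps"]:                       # assumed schema: steps -> [{name, command}, ...]
        if only_steps and step["name"] not in only_steps:
            continue
        cmd = [sys.executable] + list(step["command"])
        print(" ".join(cmd))
        if not dry_run:
            result = subprocess.run(cmd)
            report.append({"step": step["name"], "returncode": result.returncode})
    return report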

Packaging Checklist

For a reproducible run, keep these artifacts for each pipeline execution:

  • pipeline YAML used (configs/pipeline/*.yaml)
  • per-stage YAMLs (configs/train|compare|ablation|plot/*.yaml)
  • metrics (results.jsonl, results.csv)
  • plots (research_runs/plots/*.png)
  • execution report (run_dir/execution_*.json)
  • environment metadata (Python, PyTorch, CUDA, git commit)
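
For the last item, a small snippet that snapshots the environment next to a run's outputs (file name and fields are up to you):

import json
import platform
import subprocess

import torch

meta = {
    "python": platform.python_version(),
    "pytorch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
    "git_commit": subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True).stdout.strip(),
}
with open("environment_metadata.json", "w") as f:
    json.dump(meta, f, indent=2)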

Simple Airflow Orchestration

DAG location:

  • orchestration/airflow/dags/gpt_research_pipeline.py

Default pipeline config used by the DAG:

  • configs/pipeline/research_small_repro.yaml

Environment variables:

export AIRFLOW_HOME=$PWD/.airflow
export PYTHONPATH=$PWD
export GPT_PIPELINE_CONFIG=$PWD/configs/pipeline/research_small_repro.yaml
export GPT_PYTHON_BIN=python

Then trigger the gpt_research_pipeline DAG from your Airflow instance.

The DAG calls the same project CLIs through scripts/run_pipeline_config.py, keeping local and orchestrated behavior aligned.
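
For orientation, a DAG of this shape can be as small as one BashOperator that wraps the pipeline CLI; the real DAG in orchestration/airflow/dags/gpt_research_pipeline.py may split steps into separate tasks. A minimal sketch (Airflow 2.x API):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="gpt_research_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,       # trigger manually
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_pipeline",
        bash_command=(
            f"cd {os.environ.get('PYTHONPATH', '.')} && "
            f"{os.environ.get('GPT_PYTHON_BIN', 'python')} scripts/run_pipeline_config.py "
            f"--config {os.environ.get('GPT_PIPELINE_CONFIG', 'configs/pipeline/research_small_repro.yaml')}"
        ),
    )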

Testing

python3 -m pytest -s --capture=no

References

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  journal={OpenAI Technical Report},
  year={2019}
}

@article{brown2020language,
  title={Language Models are Few-Shot Learners},
  author={Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and others},
  journal={Advances in Neural Information Processing Systems},
  year={2020}
}

License

MIT. See LICENSE.

About

Clean-room GPT-2/GPT-3 implementation: tokenizers, architecture blocks, training loop with AdamW + cosine decay, CLI scripts, inference tools, and a pytest suite. Covers OpenWebText-10K and WikiText-103 workflows. Designed as an academic reference for understanding and scaling decoder-only transformers.
