This repo represents my Nano-GPT speedrun playground, which started with coding along to Let's Reproduce GPT-2 (124M), then moved on to further improvements (from the NanoGPT Speedrun Living Worklog and modded-nanogpt), and then to my own experiments.
Every new file in versions/ adds some new functionality reflected in its name (chosen over commits to the same file so that each version can be updated independently, in case I have to clean each file up if I ever decide to create a training set with diffs between them), while train.py holds the latest/fastest version of the code.
- Repo's best time to reach <= 3.28 val loss on 10,485,760 validation tokens (4x NVIDIA H100 SXM): 4.87 minutes. (Those defaults decompose as 4 GPUs × 8 sequences/GPU × 8,192 tokens/sequence = 262,144 tokens per mini-step, so to use 8x NVIDIA H100 we would need to change `total_tokens_per_step_train/val` (262,144), `gpu_batch_size` (8), or `seq_len_train/val` (8,192).)
- Nov 2, 2025, world record for <= 3.28 val loss on the first 10,485,760 validation tokens (8x NVIDIA H100 SXM): 2.345 minutes
- While I haven't had much time to play around with this repo lately, I have added 2 new versions, improving this repo's best time to X, now also measured on the same 10M tokens the official speedrun uses.
- In that time, the official speedrun, modded-nanogpt, has improved another 30% in 2 months and is already at 1.540 minutes (Feb 9th, 2026). Go check them out!
- As some noticeable differences, this repo tries to have more comments and be more customizable (possibly better for learning and adapting to other use cases), while the world record code is (a lot) more optimized.
- KV cache (for `sample`)
- Mixture of Experts config option
- Masked Language Model (MLM) training support (`is_causal=False` -> works as an encoder)
- Reward >1 token at the start of training (e.g., I went `swimming` -> reward `swimming`, `running`, `walking`, etc.) and sharpen the signal as training goes on (see the sketch after this list)
- ...
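As a loose illustration of that multi-token reward idea (a minimal sketch of the concept, not this repo's actual implementation; `soft_target_loss`, `alt_targets`, and the 0.5 starting mass are all hypothetical choices), a soft cross-entropy can spread probability mass over a few alternative tokens early on and anneal back to the usual one-hot target as training progresses:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, targets, alt_targets, progress):
    """Soft cross-entropy that also rewards alternative next tokens early on.

    logits:      (N, V) model outputs
    targets:     (N,)   ground-truth next-token ids
    alt_targets: (N, K) other plausible next-token ids (how to pick them
                 is left open here, e.g., embedding-space neighbors)
    progress:    float in [0, 1], fraction of training completed
    """
    # Probability mass given to the alternatives; anneals to 0,
    # at which point this reduces to plain one-hot cross-entropy
    alt_mass = 0.5 * (1.0 - progress)
    soft = torch.zeros_like(logits)
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - alt_mass)
    k = alt_targets.size(1)
    alt_vals = torch.full_like(alt_targets, alt_mass / k, dtype=logits.dtype)
    soft.scatter_add_(1, alt_targets, alt_vals)  # duplicates just accumulate
    # F.cross_entropy accepts (N, V) class probabilities as the target
    return F.cross_entropy(logits, soft)

# Example: vocab of 16, 2 positions, 3 alternatives each, start of training
logits = torch.randn(2, 16)
loss = soft_target_loss(logits, torch.tensor([5, 9]),
                        torch.tensor([[1, 2, 3], [4, 6, 7]]), progress=0.0)
```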
I currently use RunPod to run this repo (other options include Lambda and Vast.ai). If you want to run this code on another provider, or on your own hardware, feel free to skip the RunPod-specific instructions, but stick around for the `hf_user` and `hf_token` setup and the Docker image used.
On RunPod, create the following secrets if you want to push your results (and model(s) if checkpointing) to Hugging Face after pre-training.
- Click `Secrets` (left menu)
- Click `+ Create Secret`
- Type `hf_user` as Secret Name
- Paste or type your Hugging Face username as Secret Value
- Click `Create Secret` (to create the secret)
- Click `+ Create Secret`
- Type `hf_token` as Secret Name
- Paste or type your Hugging Face token as Secret Value
- Click `Create Secret` (to create the secret)
If you use your own hardware, create them as environment variables instead.
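For example, in a bash shell (a minimal sketch; fill in your own values):

```bash
# Same names as the RunPod secrets above
export hf_user="<your-hugging-face-username>"
export hf_token="<your-hugging-face-token>"
```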
I currently use:

- `anywinter4079/pytorch:2.10.0-cu128` for `versions/33-*`, `versions/34-*`, and `train.py` (PyTorch 2.10.0, CUDA 12.8)
- `anywinter4079/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04-runpod-clone` for earlier versions (PyTorch 2.8.0, CUDA 12.8.1)
The `2.10.0-cu128` image is built from `anywinter4079/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04-runpod-clone` (a clone/backup of the official `runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04`) with extra dependencies (`tiktoken`, `datasets`, `tqdm`, `huggingface_hub`, `safetensors`) and an upgrade to PyTorch 2.10.0+cu128.
To build and push the image, I used:
```bash
mkdir -p docker_build && \
cat > docker_build/Dockerfile <<'EOF'
FROM anywinter4079/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04-runpod-clone
WORKDIR /workspace
RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*
RUN python -m pip install --upgrade pip \
    && pip install tiktoken datasets tqdm huggingface_hub safetensors
# torch 2.10.0+cu128 (CUDA 12.8)
RUN pip install torch==2.10.0+cu128 \
    --index-url https://download.pytorch.org/whl/cu128 --upgrade
EOF
```
And then:
```bash
docker login
```
With:

```bash
docker buildx use amd64-builder
docker buildx build --platform linux/amd64 -t anywinter4079/pytorch:2.10.0-cu128 ./docker_build --push
```

if `amd64-builder` already exists (e.g., you have run the command before), or:

```bash
docker buildx create --use --name amd64-builder
docker buildx build --platform linux/amd64 -t anywinter4079/pytorch:2.10.0-cu128 ./docker_build --push
```

otherwise.
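If you are unsure whether the builder already exists, you can list the available builders first:

```bash
docker buildx ls
```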
If you want to reuse the existing image, there is nothing to do in this step. Otherwise, you can follow these steps to create your own whenever you want to change the image.
To create a persistent network volume to host the ~20GB of data ($1.40/mo) for instant access upon pod spinup:
Clone and cd into this repo:
```bash
git clone https://github.com/Any-Winter-4079/Nano-GPT-Speedrun-Track.git && \
cd Nano-GPT-Speedrun-Track
```
Download the FineWeb-Edu dataset (locally):
```bash
cd data && python fineweb-npy.py
```
Create a Network Volume on RunPod:
- Click `Storage` (left menu)
- Click `+ New Network Volume`
- Choose a datacenter with H100 SXM(s) available
- Set the size to 20 GB
- Click `Create Network Volume`
Start a CPU pod (cheaper) in the same datacenter where the network volume is hosted:
- Click `Storage` (left menu)
- Click the volume
- Click `Configure Pod with Volume`
- Click `Edit Pod`
- Change `Volume Mount Path` from `/workspace` to `/workspace/data/edu_fineweb10B`
- Click `Set Overrides`
- Click `Deploy On-Demand`
Then securely copy (scp) the data from your local computer to the remote network volume, replacing `<PORT>`, `<PRIVATE_KEY_FILE>`, and `<IP>` with those from Connect:
```bash
scp -P <PORT> -i ~/.ssh/<PRIVATE_KEY_FILE> -r \
    data/edu_fineweb10B/* \
    root@<IP>:/workspace/data/edu_fineweb10B/
```
To pre-train, terminate the CPU pod, spin up a pod with one or more H100 SXM(s), and add `hf_user` and `hf_token` (if you want to push your final model/training logs to the Hub), i.e.:
- Click `Edit`
- Click `Environment Variables`
- Click `+ Add Environment Variable`
- Click the Lock icon to select a secret
- Click `hf_token` to select it as value
- Type `hf_token` to choose it as key
- Click `+ Add Environment Variable`
- Click the Lock icon to select a secret
- Click `hf_user` to select it as value
- Type `hf_user` to choose it as key
- Change `Volume Mount Path` from `/workspace` to `/workspace/data/edu_fineweb10B`
- Click `Set Overrides`
- Click `Deploy On-Demand`
Connect to the pod via ssh (replacing `<CONNECTION_STRING>` and `<PRIVATE_KEY_FILE>` with those provided by RunPod):
```bash
ssh <CONNECTION_STRING>@ssh.runpod.io -i ~/.ssh/<PRIVATE_KEY_FILE>
```
Fetch the remote repo, initializing a local repo beforehand to combine the data/ folders (replacing `<your-user>` with your GitHub username), running the following within the workspace (`cd workspace` if needed):
```bash
git init && \
git remote add origin https://github.com/<your-user>/Nano-GPT-Speedrun-Track.git && \
git remote add upstream https://github.com/Any-Winter-4079/Nano-GPT-Speedrun-Track.git && \
git fetch origin && \
git checkout -b main origin/main
```
or in my case:
```bash
git init && \
git remote add origin https://github.com/Any-Winter-4079/Nano-GPT-Speedrun-Track.git && \
git fetch origin && \
git checkout -b main origin/main
```
Then either check that the network volume has been successfully attached:
```bash
ls data/edu_fineweb10B
```
Or download the data within the pod (if you skipped step 3):
```bash
cd data && \
python fineweb-npy.py && \
cd ..
```
Finally, run (replacing `4` with the number of GPUs you decided to use):

```bash
torchrun --standalone --nproc_per_node=4 train.py
```
You should see a log similar to:
```
per-gpu gradient accumulation mini-steps: 1
lr warmup steps: 0
lr warmup and cosine steps: 3,051
max train steps: 19,073
num decayed parameter tensors (AdamW): 25, with 38,633,760 parameters
num non-decayed parameter tensors (AdamW): 25, with 19,200 parameters
num Muon parameter tensors: 72, with 84,934,656 parameters
using fused AdamW: True
found 99 shards for split train
total train shard 0 tokens: 99,876,865
total gpus: 4
gpu batch size: 8 sequences
sequence length: 8,192 tokens
tokens fed to gpu per grad accum mini-step: 65,536 (4 gpus, 262,144 total tokens)
per-gpu grad accumulation mini-steps for train shard 0 (each mini-step processing 262,144 tokens): 381
found 1 shards for split val
total val shard 0 tokens: 99,876,865
total gpus: 4
gpu batch size: 8 sequences
sequence length: 8,192 tokens
tokens fed to gpu per grad accum mini-step: 65,536 (4 gpus, 262,144 total tokens)
per-gpu grad accumulation mini-steps for val shard 0 (each mini-step processing 262,144 tokens): 381
[rank 0] gets HellaSwag sentences 0 to 2,510
[rank 1] gets HellaSwag sentences 2,511 to 5,021
[rank 2] gets HellaSwag sentences 5,022 to 7,532
[rank 3] gets HellaSwag sentences 7,533 to 10,041
123,587,616 parameters
```
Then the code takes 2-4 minutes to compile, after which:

```
step: 0 | train loss: 10.97804832 | train ppl: 58,574.12 | train step time: 216.32 ms | adamw lr: 0.00500000 | tok/s: 1,211,858.15 | total toks: 262,144 | total time: 0.00 min | sw size: 128 | max q_scale raw/eff: 1.0045/1.0045 | max k_scale raw/eff: 1.0045/1.0045
step: 1 | train loss: 8.87146950 | train ppl: 7,125.74 | train step time: 153.80 ms | adamw lr: 0.00500000 | tok/s: 1,704,460.43 | total toks: 524,288 | total time: 0.01 min | sw size: 128 | max q_scale raw/eff: 1.0090/1.0090 | max k_scale raw/eff: 1.0090/1.0090
step: 2 | train loss: 7.93997669 | train ppl: 2,807.30 | train step time: 155.61 ms | adamw lr: 0.00500000 | tok/s: 1,684,600.25 | total toks: 786,432 | total time: 0.01 min | sw size: 128 | max q_scale raw/eff: 1.0130/1.0130 | max k_scale raw/eff: 1.0130/1.0130
step: 3 | train loss: 8.02733135 | train ppl: 3,063.56 | train step time: 154.30 ms | adamw lr: 0.00499999 | tok/s: 1,698,903.45 | total toks: 1,048,576 | total time: 0.01 min | sw size: 128 | max q_scale raw/eff: 1.0158/1.0158 | max k_scale raw/eff: 1.0158/1.0158
step: 4 | train loss: 7.80339432 | train ppl: 2,448.90 | train step time: 155.14 ms | adamw lr: 0.00499998 | tok/s: 1,689,747.02 | total toks: 1,310,720 | total time: 0.01 min | sw size: 128 | max q_scale raw/eff: 1.0177/1.0177 | max k_scale raw/eff: 1.0177/1.0177
step: 5 | train loss: 7.81453323 | train ppl: 2,476.33 | train step time: 154.13 ms | adamw lr: 0.00499997 | tok/s: 1,700,782.44 | total toks: 1,572,864 | total time: 0.02 min | sw size: 128 | max q_scale raw/eff: 1.0202/1.0202 | max k_scale raw/eff: 1.0201/1.0201
...
```
5. (Optional) Store stdout and stderr logs in files (if you run updates to the code and the log is too long to read in the shell)
Because error messages can be very long, you can run (replacing `--nproc_per_node=1` with your GPU count):
```bash
mkdir debug
PYTHONUNBUFFERED=1 torchrun --standalone --nproc_per_node=1 train.py \
    > >(stdbuf -oL tee -a debug/out-$(date +%F_%H-%M-%S).log) \
    2> >(stdbuf -oL tee -a debug/err-$(date +%F_%H-%M-%S).log >&2)
```
Then review the error message in chunks (replacing `err-2025-10-25_00-32-19.log` with your own log):
```bash
cd debug
awk 'NR>=0 && NR<=180' err-2025-10-25_00-32-19.log
```
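To know how many chunks to expect, you can count the lines of the log first:

```bash
wc -l err-2025-10-25_00-32-19.log
```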
To push the config, log, and model(s) (if checkpointing), find the latest timestamp, either in checkpoints/ or configs_and_logs/, e.g.:
```bash
ls checkpoints
```
Then copy the timestamp and replace (with `nano push_to_hub.py`) push_to_hub.py's hardcoded timestamp ("20251013_145053") with your timestamp (TODO: make the script ask for the timestamp as input when running)
Save (e.g., Ctrl+X, Y, Enter)
And run:
```bash
python push_to_hub.py
```
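One way to address that TODO (a sketch, not what push_to_hub.py currently does) would be for the script to take the timestamp at runtime instead of hardcoding it:

```python
import sys

# Read the timestamp from the command line, or prompt for it interactively
if len(sys.argv) > 1:
    timestamp = sys.argv[1]
else:
    timestamp = input("Checkpoint timestamp (e.g., 20251013_145053): ").strip()
```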