A from-scratch, readable implementation of a GPT-style language model in PyTorch, plus a Jupyter notebook for pretraining and text generation experiments.
- Minimal, well-structured PyTorch implementation:
  - Multi-head self-attention with causal masking
  - Pre-norm Transformer blocks with residual connections
  - Token + positional embeddings, LayerNorm, GELU feed-forward
  - Tie-able output projection head sized to the vocabulary
- Tokenization via OpenAI's GPT-2 BPE (`tiktoken`)
- Simple greedy text generation utility (`modules/GenerateSimple.py`)
- Training and sampling walkthrough in `pretraining.ipynb`
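The causal mask is what makes self-attention autoregressive: each position may attend only to itself and earlier positions. A minimal, self-contained sketch of the idea in plain PyTorch (illustrative only; the repo's `MultiHeadAttention.py` additionally splits heads and applies dropout):

```python
import torch

def causal_attention_weights(q, k):
    """Scaled dot-product attention weights with a causal (future) mask.

    q, k: (batch, seq_len, head_dim). Single-head sketch for illustration.
    """
    seq_len = q.shape[1]
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5     # (B, T, T)
    # Upper-triangular mask hides tokens that come after each position.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)                      # rows sum to 1

q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
w = causal_attention_weights(q, k)
# Position 0 can only attend to itself; all future entries are exactly 0.
assert torch.allclose(w[0, 0], torch.tensor([1.0, 0.0, 0.0, 0.0]))
```

Masking with `-inf` before the softmax (rather than zeroing afterwards) keeps each row a proper probability distribution over the visible positions.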
```text
SydsGPT-Pretraining/
├─ pretraining.ipynb         # End-to-end training + generation walkthrough
├─ model/
│  └─ SydsGPT.py             # SydsGPT model definition (Transformer LM)
├─ modules/
│  ├─ DataLoader.py          # Tiny dataset + dataloader backed by tiktoken
│  ├─ GenerateSimple.py      # Greedy decode helper (argmax sampling)
│  ├─ MultiHeadAttention.py  # Causal MHA (with mask and dropout)
│  ├─ TransformerBlock.py    # Pre-norm block: MHA + FFN + residual
│  ├─ FeedForward.py         # GELU MLP (hidden dim = 4x embedding)
│  ├─ LayerNorm.py           # Simple LayerNorm (scale/shift)
│  └─ GELU.py                # GELU activation
└─ .gitignore                # Ignores caches, venvs, checkpoints, logs, data
```
- Class: `SydsGPT` in `model/SydsGPT.py`
- Forward: `(batch, seq_len)` token IDs -> `(batch, seq_len, vocab_size)` logits
- Core config keys (dictionary):
  - `vocab_size` (int): tokenizer vocabulary size
  - `embedding_dim` (int): model hidden size
  - `context_length` (int): max sequence length (positional embedding size)
  - `num_layers` (int): number of Transformer blocks
  - `num_heads` (int): attention heads (must divide `embedding_dim`)
  - `dropout` (float): dropout probability for attention/MLP
  - `qkv_bias` (bool): add bias to the Q/K/V projections
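Because the head-divisibility and dropout constraints are easy to get wrong, it can help to sanity-check a config before constructing the model. A hypothetical helper (not shipped with the repo) that encodes the constraints listed above:

```python
def validate_config(cfg: dict) -> dict:
    """Check the config keys described above. Hypothetical helper, for
    illustration only; SydsGPT itself does not require it."""
    required = {"vocab_size", "embedding_dim", "context_length",
                "num_layers", "num_heads", "dropout", "qkv_bias"}
    missing = required - cfg.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if cfg["embedding_dim"] % cfg["num_heads"] != 0:
        raise ValueError("num_heads must divide embedding_dim")
    if not 0.0 <= cfg["dropout"] < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    return cfg

cfg = validate_config({
    "vocab_size": 50257, "embedding_dim": 256, "context_length": 128,
    "num_layers": 4, "num_heads": 4, "dropout": 0.1, "qkv_bias": False,
})
```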
This repo uses:

- `torch` (PyTorch)
- `tiktoken` (GPT-2 BPE tokenizer)
- `notebook` or `jupyterlab` (for running the notebook)
You can install these into a virtual environment. On Windows PowerShell:
```powershell
# Create and activate a virtual environment (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install --upgrade pip
pip install torch tiktoken notebook

# Optional extras
pip install jupyterlab tqdm matplotlib
```

Note: For GPU acceleration, install a CUDA-enabled PyTorch build appropriate for your system (see the official PyTorch install selector). CPU-only is fine for small experiments.
```powershell
# From the repo root
jupyter notebook pretraining.ipynb
# or
jupyter lab pretraining.ipynb
```

The notebook walks through preparing data, training, checkpointing, and generating text from the trained model.
Below is a minimal example to instantiate the model and perform greedy generation using the included helper.
```python
import torch
import tiktoken

from model.SydsGPT import SydsGPT
from modules.GenerateSimple import generate_simple

# Minimal config (tune as needed)
config = {
    "vocab_size": 50257,    # GPT-2 tokenizer size
    "embedding_dim": 256,
    "context_length": 128,
    "num_layers": 4,
    "num_heads": 4,
    "dropout": 0.1,
    "qkv_bias": False,
}

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prompt
prompt = "Once upon a time"
input_ids = torch.tensor([enc.encode(prompt)], dtype=torch.long)

# Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SydsGPT(config).to(device).eval()

# Greedy generation (argmax)
max_new_tokens = 100
context_len = config["context_length"]
with torch.no_grad():
    generated = generate_simple(model, input_ids.to(device), max_new_tokens, context_len)

print(enc.decode(generated[0].tolist()))
```

- The notebook demonstrates creating a training dataset from raw text using the GPT-2 tokenizer.
- The helper `modules/DataLoader.py` provides `create_dataloader(text, max_length=512, step_size=256, batch_size=8, ...)`: a tiny `Dataset` that yields `(input_ids, target_ids)` pairs shifted by one token.
- Loss: next-token prediction via cross-entropy on the model logits.
- Optimizer: AdamW is a typical choice (see the notebook for a full training loop).
- Checkpointing: save and load `state_dict()` for reproducible runs; include the optimizer state if you plan to resume training.
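The data and loss bullets above can be sketched end to end: sliding windows shifted by one token become `(input, target)` pairs, and the `(batch, seq_len, vocab_size)` logits are flattened for cross-entropy. A self-contained illustration with a toy token stream and random logits (the parameter names mirror `create_dataloader`, but this is not the repo's code):

```python
import torch
import torch.nn.functional as F

# Toy token stream standing in for tiktoken output.
tokens = torch.arange(20)
max_length, step_size = 8, 4

# Windows shifted by one token: target[i] is the token after input[i].
pairs = [
    (tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
    for i in range(0, len(tokens) - max_length, step_size)
]
input_ids = torch.stack([p[0] for p in pairs])    # (batch, max_length)
target_ids = torch.stack([p[1] for p in pairs])   # (batch, max_length)

# Next-token cross-entropy: flatten (B, T, V) logits against (B, T) targets.
vocab_size = 50
logits = torch.randn(input_ids.shape[0], max_length, vocab_size)
loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
```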
Example checkpoint pattern:
```python
# Save
ckpt = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "config": config,
}
torch.save(ckpt, "checkpoints/sydsgpt.pt")

# Load
ckpt = torch.load("checkpoints/sydsgpt.pt", map_location=device)
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])  # if resuming
```

Included: `modules/GenerateSimple.py`, a greedy (argmax) decoder that repeatedly feeds the last predicted token back into the model.
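The greedy loop boils down to: crop the context to the last `context_length` tokens, take the argmax of the logits for the final position, append it, and repeat. A self-contained sketch with a stand-in model callable (the real helper takes a `SydsGPT` instance and works on tensors):

```python
def generate_greedy(next_logits, token_ids, max_new_tokens, context_length):
    """Greedy (argmax) decoding loop.

    next_logits: callable mapping a list of token ids to next-token logits;
    a stand-in for a forward pass through the model.
    """
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_length:]   # crop to the model's window
        logits = next_logits(context)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
    return ids

# Toy "model": always predicts (last token + 1) mod 10.
toy = lambda ctx: [1.0 if i == (ctx[-1] + 1) % 10 else 0.0 for i in range(10)]
out = generate_greedy(toy, [3], max_new_tokens=4, context_length=8)
# out == [3, 4, 5, 6, 7]
```

Because argmax is deterministic, greedy decoding always produces the same continuation for a given prompt, which is useful for debugging but tends to be repetitive; the sampling extensions below address that.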
Extensions you can try (see the notebook):
- Temperature scaling, top-k filtering, multinomial sampling
- Early stop on EOS token
- Repetition penalties or n-gram blocking
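As a concrete starting point for the first of those extensions, here is one common way to combine temperature scaling with top-k filtering before multinomial sampling (a sketch under the shapes described above, not code from this repo):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from 1-D logits with temperature and optional top-k."""
    logits = logits / max(temperature, 1e-8)   # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]        # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([0.1, 2.0, -1.0, 1.5, 0.0])
token = sample_next_token(logits, temperature=0.8, top_k=2)
# token is always 1 or 3: only the two highest logits survive the filter.
```

Setting `temperature` close to 0 approaches greedy decoding, while `top_k=1` makes it exactly greedy.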
- CUDA out of memory: reduce `batch_size`, `context_length`, or `embedding_dim`.
- Diverging loss: start with a smaller model, reduce the learning rate, or tighten gradient clipping (a smaller max-norm).
- Tokenizer mismatch: always use the same tokenizer (`tiktoken` GPT-2) for both training and generation.
- Sequence length: ensure prompt length + new tokens never exceeds `context_length`; the helpers crop to the last `context_length` tokens.
Issues and PRs are welcome. If you add features (e.g., top-p sampling or better data pipelines), please keep the code simple and well-commented.