This project implements a complete pipeline for building, training, and fine-tuning a GPT-2 style language model from scratch using PyTorch. It covers every key stage, from data preparation and tokenization to the transformer architecture, pretraining, and downstream fine-tuning for tasks such as classification.
This implementation is inspired by the book Build a Large Language Model (From Scratch) by Sebastian Raschka (Manning, 2024).
```
LLM FROM SCRATCH/
├── __pycache__/
├── gpt2/
├── images/
│   ├── llm.png
│   └── Screenshot 2025-10-21 163332.png
├── llm/
├── sms_spam_collection/
├── .gitattributes
├── .gitignore
├── gpt_download3.py
├── LICENSE
├── LLM.tokenizer.ipynb
├── loss-plot.pdf
├── model_and_optimizer.pth
├── model.pth
├── README.md
├── sms_spam_collection.zip
├── temperature-plot.pdf
├── test.csv
├── the-verdict.txt
├── train.csv
├── validation.csv
└── verdict.txt
```
The project follows a systematic three-stage pipeline (building, pretraining, and fine-tuning) to produce a deployable GPT-2 style model via transfer learning.
- **Data Preparation & Sampling**
  - Tokenization of raw text using OpenAI's `tiktoken` library (GPT-2 BPE encoder).
  - Creation of input-target pairs using a sliding window.
  - Custom PyTorch `DataLoader` for efficient batching.
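The sliding-window pair creation can be sketched as follows (`create_pairs` is a hypothetical helper name; in the project the text is first encoded with `tiktoken`, so the real inputs are GPT-2 BPE token IDs rather than the dummy IDs used here):

```python
# Sketch of sliding-window input-target pair creation.
def create_pairs(token_ids, context_length, stride):
    """Slide a window of size context_length over token_ids;
    the target is the input window shifted one position to the right."""
    pairs = []
    for i in range(0, len(token_ids) - context_length, stride):
        x = token_ids[i : i + context_length]
        y = token_ids[i + 1 : i + context_length + 1]
        pairs.append((x, y))
    return pairs

# Example with dummy token IDs instead of real BPE output:
ids = list(range(10))
pairs = create_pairs(ids, context_length=4, stride=4)
# pairs[0] == ([0, 1, 2, 3], [1, 2, 3, 4])
```

With `stride` equal to `context_length`, windows do not overlap; a smaller stride trades more training examples for some redundancy between batches.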
- **Attention Mechanism**
  - Causal multi-head self-attention implementation.
  - Scaled dot-product attention with causal masking.
  - Multiple attention heads to capture diverse semantic relationships.
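A minimal single-head sketch of the causal scaled dot-product attention described above (the function name is illustrative; the project wraps this logic in a multi-head module):

```python
import math
import torch

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask:
    position i may only attend to positions <= i."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T, T)
    T = q.shape[-2]
    # True above the diagonal -> masked out (future positions)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 5, 8)             # (batch, seq_len, head_dim)
out, w = causal_attention(x, x, x)
# The upper triangle of w is zero: no attention to future tokens.
```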
- **LLM Architecture**
  - Transformer model with token and positional embedding layers.
  - Stacked transformer blocks with attention and feed-forward layers.
  - Layer normalization and residual connections for training stability.
- **Pretraining**
  - Training on the `the-verdict.txt` corpus using next-token prediction.
  - The model learns language structure, grammar, and semantics.
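Next-token prediction reduces to a cross-entropy loss between the logits at each position and the token one step ahead (the shifted targets produced by the sliding-window loader). A sketch with random stand-in logits:

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                        # GPT-2 BPE vocabulary
logits = torch.randn(2, 4, vocab_size)    # (batch, seq_len, vocab) stand-in
targets = torch.randint(0, vocab_size, (2, 4))

# Flatten batch and sequence dimensions, then score each position.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
# A model guessing uniformly scores about ln(50257) ~= 10.8,
# which is why untrained loss starts near 9-11.
```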
- **Training Loop**
  - PyTorch-based optimization and loss calculation.
  - AdamW optimizer with gradient clipping and a learning-rate scheduler.
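A minimal training-step sketch combining the three ingredients above; the stand-in model, learning rate, and cosine scheduler are illustrative assumptions, not the project's exact settings:

```python
import torch

model = torch.nn.Linear(8, 8)              # stand-in for GPTModel
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()              # stand-in for the LM loss
loss.backward()

# Clip the global gradient norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()                           # decay the learning rate
optimizer.zero_grad()
```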
- **Model Evaluation**
  - Validation on held-out data for convergence tracking.
  - Checkpointing to save training progress.
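Checkpointing can be sketched as below; the `model_and_optimizer.pth` file in the repository suggests both model and optimizer state are saved so training can resume exactly where it stopped (an in-memory buffer stands in for the file here):

```python
import io
import torch

model = torch.nn.Linear(4, 4)              # stand-in for GPTModel
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

# Save both state dicts in one checkpoint.
buf = io.BytesIO()
torch.save({"model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict()}, buf)

# Restore: rebuild the objects, then load the states back in.
buf.seek(0)
checkpoint = torch.load(buf)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```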
- **Load Pretrained Weights**
  - Automatic download and loading of the official GPT-2 (124M) weights via `gpt_download3.py`.
- **Fine-Tuning for Classification**
  - Training on the SMS Spam Collection dataset for spam detection.
  - The LM head is replaced with a classification head.
  - Most transformer layers are frozen for parameter-efficient adaptation.
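Layer freezing can be sketched with a stand-in model as below; in the project the frozen part is the transformer stack rather than a single linear layer, and typically the last block, the final LayerNorm, and the new head remain trainable:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),    # stand-in for the pretrained transformer layers
    torch.nn.Linear(8, 2),    # stand-in for the new classification head
)

# Freeze everything, then unfreeze only the parts that must adapt.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
# Only the head's parameters receive gradient updates.
```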
- **Fine-Tuning for Chat Tasks**
  - Adaptation to conversational (instruction-following) datasets.
  - The model learns context retention and response generation.
GPT-2 Small (124M) Configuration:

| Parameter | Value |
|---|---|
| `vocab_size` | 50257 |
| `context_length` | 1024 |
| `emb_dim` | 768 |
| `n_heads` | 12 |
| `n_layers` | 12 |
| `drop_rate` | 0.1 |
| `qkv_bias` | False |
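The table corresponds to a configuration dictionary like the one below; the dictionary name and keys match those referenced elsewhere in this README (e.g. `GPT_CONFIG_124M["emb_dim"]` in the fine-tuning example):

```python
# GPT-2 Small (124M) configuration, matching the table above.
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary size
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # embedding / hidden dimension
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # number of transformer blocks
    "drop_rate": 0.1,        # dropout probability
    "qkv_bias": False,       # no bias in query/key/value projections
}
# Note: emb_dim must be divisible by n_heads (768 / 12 = 64 per head).
```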
- Token & Position Embeddings: Map tokens and sequence positions to dense vectors.
- Transformer Blocks: Multi-head self-attention + MLP layers with GELU activation.
- Residuals & LayerNorm: Ensure gradient stability and fast convergence.
- Training loss decreases from ~9.5 to below 1.0 within 10 epochs.
- Validation loss stabilizes around 6.5.
- Clear convergence after epoch 6.
| Metric | Value |
|---|---|
| Training Accuracy | 100% |
| Validation Accuracy | 97.5% |
- Supports temperature scaling and top-k sampling for output control.
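Temperature scaling and top-k sampling can be sketched as follows (the helper name is hypothetical; the project's generation code applies the same two transforms to the model's logits):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from a 1-D logits vector."""
    if top_k is not None:
        # Keep only the k largest logits; mask the rest to -inf.
        top_vals, _ = torch.topk(logits, top_k)
        logits = torch.where(logits < top_vals[-1],
                             torch.tensor(float("-inf")), logits)
    if temperature > 0:
        # Low temperature sharpens the distribution; high flattens it.
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    return torch.argmax(logits).item()     # temperature 0 -> greedy decoding

torch.manual_seed(123)
logits = torch.tensor([4.0, 2.0, 1.0, 0.5])
token = sample_next_token(logits, temperature=0.8, top_k=2)
# With top_k=2, only token IDs 0 and 1 can ever be sampled here.
```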
Install dependencies and download the pretrained GPT-2 weights:

```bash
pip install torch tiktoken pandas matplotlib tensorflow tqdm
python gpt_download3.py
```

Then launch the notebook:

```bash
jupyter notebook LLM.tokenizer.ipynb
```

The notebook covers:
- Data preparation & tokenization
- Model architecture implementation
- Pretraining
- Fine-tuning for spam detection
- Text generation
```python
# Load pretrained model
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load('model.pth'))

# Add a custom classification head
num_classes = your_num_classes
model.out_head = torch.nn.Linear(
    in_features=GPT_CONFIG_124M["emb_dim"],
    out_features=num_classes
)

# Fine-tune on your dataset
train_classifier(model, your_train_loader, your_val_loader)
```

- Complete Implementation from Scratch
- Modular & Extensible Design
- Pretrained Weight Compatibility
- Multi-Task Fine-Tuning
- Parameter-Efficient Training
- Flexible Text Generation
- Production-Ready Code
| Component | Description | Benefit |
|---|---|---|
| Attention Mechanism | Causal masking + multi-head attention | Enables autoregressive generation |
| Training Optimization | AdamW + learning rate scheduling | Stable convergence |
| Memory Efficiency | Batch processing + gradient clipping | Handles large batches safely |
- GPT-2 Medium / Large / XL Variants
- Multi-GPU Distributed Training
- Advanced Decoding (Nucleus, Beam Search)
- Integration with Instruction-Tuning Datasets
- RLHF Pipeline (Reinforcement Learning from Human Feedback)
- Model Quantization for Lightweight Deployment
- Vizauara Labs AI
- Raschka, S. (2024). *Build a Large Language Model (From Scratch)*. Manning. (Reference)
- Vaswani, A., et al. (2017). Attention Is All You Need.
- SMS Spam Collection Dataset, UCI Machine Learning Repository.
This project is licensed under the terms specified in the LICENSE file.
- OpenAI: GPT-2 architecture and pretrained weights
- Sebastian Raschka: educational materials and guidance
- PyTorch Team: deep learning framework
- SMS Spam Collection Dataset contributors
