Manish3451/LLM-From-Scratch

🧠 Building a Large Language Model (LLM) from Scratch

LLM Project Banner


📘 Overview

This project implements a complete pipeline for building, training, and fine-tuning a GPT-2 style language model from scratch using PyTorch. It covers all key stages, from data preparation and tokenization through the transformer architecture and pretraining, to downstream fine-tuning for tasks such as classification.

This implementation is inspired by the book Build a Large Language Model (From Scratch) by Sebastian Raschka (Manning, 2024).


📂 Project Structure

```
LLM FROM SCRATCH/
├── __pycache__/
├── gpt2/
├── images/
│   ├── llm.png
│   └── Screenshot 2025-10-21 163332.png
├── llm/
├── sms_spam_collection/
├── .gitattributes
├── .gitignore
├── gpt_download3.py
├── LICENSE
├── LLM.tokenizer.ipynb
├── loss-plot.pdf
├── model_and_optimizer.pth
├── model.pth
├── README.md
├── sms_spam_collection.zip
├── temperature-plot.pdf
├── test.csv
├── the-verdict.txt
├── train.csv
├── validation.csv
└── verdict.txt
```

🔄 Three-Stage Training Pipeline

The project follows a systematic three-stage pipeline to build, pretrain, and adapt a GPT-2 style model using transfer learning.

Stage 1: Building the LLM

  1. Data Preparation & Sampling

    • Tokenization of raw text using OpenAI's tiktoken library (GPT-2 BPE encoder).
    • Creation of input-target pairs using a sliding window.
    • Custom PyTorch DataLoader for efficient batching.
  2. Attention Mechanism

    • Causal multi-head self-attention implementation.
    • Scaled dot-product attention with causal masking.
    • Multiple attention heads capture semantic relationships.
  3. LLM Architecture

    • Transformer model with embedding layers.
    • Stacked transformer blocks with attention & feed-forward layers.
    • Layer normalization and residual connections for training stability.
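
The sliding-window pairing from step 1 can be sketched in a few lines of plain Python (the function name is illustrative, not from the repository; in the project, `token_ids` would come from tiktoken's GPT-2 encoder and the resulting pairs would feed a PyTorch Dataset):

```python
def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over token_ids, pairing each chunk with the same
    chunk shifted one position right (the next-token targets)."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[i : i + context_length])
        targets.append(token_ids[i + 1 : i + context_length + 1])
    return inputs, targets

# With stride == context_length the windows do not overlap:
xs, ys = make_input_target_pairs(list(range(10)), context_length=4, stride=4)
# xs == [[0, 1, 2, 3], [4, 5, 6, 7]]
# ys == [[1, 2, 3, 4], [5, 6, 7, 8]]
```

A smaller stride produces overlapping windows, trading more training examples for some redundancy between them.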

Stage 2: Pretraining the Foundation Model

  1. Pretraining

    • Train on the-verdict.txt corpus using next-token prediction.
    • Learns language structure, grammar, and semantics.
  2. Training Loop

    • Implements PyTorch-based optimization and loss calculation.
    • AdamW optimizer with gradient clipping and scheduler.
  3. Model Evaluation

    • Validation on held-out data for convergence tracking.
    • Checkpointing to save progress.
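
One step of the training loop described above can be sketched as follows, assuming the model returns `(batch, seq_len, vocab)` logits; the function name and toy model are illustrative, not the repository's code:

```python
import torch
import torch.nn.functional as F

def train_step(model, inputs, targets, optimizer, max_norm=1.0):
    """One pretraining step: next-token cross-entropy loss, backprop,
    gradient clipping, and an optimizer update."""
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

# Smoke test with a toy "model" (embedding + linear head over a 50-token vocab):
model = torch.nn.Sequential(torch.nn.Embedding(50, 16), torch.nn.Linear(16, 50))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
x = torch.randint(0, 50, (2, 8))
loss = train_step(model, x[:, :-1], x[:, 1:], optimizer)
```

A learning-rate scheduler (e.g. `torch.optim.lr_scheduler.CosineAnnealingLR`, one common choice) would be stepped alongside `optimizer.step()`.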

Stage 3: Fine-Tuning for Downstream Tasks

  1. Load Pretrained Weights

    • Load official GPT-2 (124M) weights automatically via gpt_download3.py.
  2. Fine-Tuning for Classification

    • Train on SMS Spam Collection dataset for spam detection.
    • Replace LM head with a classification head.
    • Freeze most transformer layers for efficient adaptation.
  3. Fine-Tuning for Chat Tasks

    • Adaptation for conversational datasets (instruction-following).
    • Model learns context retention and response generation.
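
The freeze-and-replace strategy from step 2 can be sketched as below; `trf_blocks` and `out_head` are assumed attribute names (they follow the book's naming and may differ in this repository):

```python
import torch.nn as nn

def adapt_for_classification(model, emb_dim, num_classes):
    """Freeze all pretrained weights, unfreeze the last transformer block,
    and swap the LM head for a small classification head.
    Assumes the model exposes `trf_blocks` (a ModuleList) and `out_head`."""
    for param in model.parameters():
        param.requires_grad = False            # freeze everything
    for param in model.trf_blocks[-1].parameters():
        param.requires_grad = True             # adapt only the last block
    model.out_head = nn.Linear(emb_dim, num_classes)  # new head, trainable
    return model
```

Only the last block and the new head receive gradients, which is what makes the adaptation parameter-efficient.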

🧱 Model Architecture

GPT-2 Small (124M) Configuration:

| Parameter | Value |
| --- | --- |
| vocab_size | 50257 |
| context_length | 1024 |
| emb_dim | 768 |
| n_heads | 12 |
| n_layers | 12 |
| drop_rate | 0.1 |
| qkv_bias | False |
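
The table maps directly to a plain configuration dictionary; `GPT_CONFIG_124M` is the name used in the Usage section below:

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary size
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # embedding / hidden dimension
    "n_heads": 12,           # attention heads per block
    "n_layers": 12,          # number of transformer blocks
    "drop_rate": 0.1,        # dropout probability
    "qkv_bias": False,       # no bias terms in the QKV projections
}
```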

Core Components

  • Token & Position Embeddings: Map tokens and sequence positions to dense vectors.
  • Transformer Blocks: Multi-head self-attention + MLP layers with GELU activation.
  • Residuals & LayerNorm: Ensure gradient stability and fast convergence.
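
A minimal sketch of the causal multi-head self-attention component described above, assuming a fused QKV projection (the repository's implementation may split the three projections):

```python
import torch
import torch.nn as nn

class CausalMultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, n_heads,
                 dropout=0.0, qkv_bias=False):
        super().__init__()
        assert d_out % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_out // n_heads
        self.W_qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask hides future positions (causal masking)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1).bool())

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.W_qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```

Because of the mask, the output at each position depends only on that position and earlier ones, which is what enables autoregressive generation.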

📈 Results and Performance

Pretraining Loss

  • Training loss decreases from ~9.5 to below 1.0 within 10 epochs.
  • Validation loss plateaus around 6.5, which is expected when pretraining on such a small corpus.
  • Training loss shows clear convergence after epoch 6.

Fine-Tuning Performance

| Metric | Value |
| --- | --- |
| Training Accuracy | 100% |
| Validation Accuracy | 97.5% |

Text Generation with Temperature Control

  • Supports temperature scaling and top-k sampling for output control.
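
Both controls can be sketched as a single sampling helper (the function name is illustrative, not the repository's):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from last-position logits, with optional
    top-k filtering and temperature scaling (temperature=0 -> greedy)."""
    if top_k is not None:
        # Mask out everything below the k-th largest logit
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if temperature == 0:
        return torch.argmax(logits).item()
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

Lower temperatures sharpen the distribution toward the most likely token; higher temperatures flatten it, producing more varied (and riskier) text.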

⚙️ Usage

Environment Setup

```shell
pip install torch tiktoken pandas matplotlib tensorflow tqdm
```

Download Pretrained Weights

```shell
python gpt_download3.py
```

Training from Scratch

```shell
jupyter notebook LLM.tokenizer.ipynb
```

Covers:

  • Data preparation & tokenization
  • Model architecture implementation
  • Pretraining
  • Fine-tuning for spam detection
  • Text generation

Fine-Tuning for Custom Tasks

```python
import torch

# GPTModel, GPT_CONFIG_124M, and train_classifier are defined in the notebook

# Load pretrained model
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))

# Add custom classification head
num_classes = your_num_classes  # e.g. 2 for spam vs. ham
model.out_head = torch.nn.Linear(
    in_features=GPT_CONFIG_124M["emb_dim"],
    out_features=num_classes,
)

# Fine-tune on your dataset
train_classifier(model, your_train_loader, your_val_loader)
```

🌟 Key Features

  • ✅ Complete Implementation from Scratch
  • 🧩 Modular & Extensible Design
  • 🔄 Pretrained Weight Compatibility
  • 🧠 Multi-Task Fine-Tuning
  • ⚡ Parameter-Efficient Training
  • ✨ Flexible Text Generation
  • 🧑‍💻 Production-Ready Code

🧮 Technical Highlights

| Component | Description | Benefit |
| --- | --- | --- |
| Attention Mechanism | Causal masking + multi-head attention | Enables autoregressive generation |
| Training Optimization | AdamW + learning rate scheduling | Stable convergence |
| Memory Efficiency | Batch processing + gradient clipping | Handles large batches safely |

🚀 Future Enhancements

  • GPT-2 Medium / Large / XL Variants
  • Multi-GPU Distributed Training
  • Advanced Decoding (Nucleus, Beam Search)
  • Integration with Instruction-Tuning Datasets
  • RLHF Pipeline (Reinforcement Learning from Human Feedback)
  • Model Quantization for Lightweight Deployment

📚 References

  • Vizauara Labs AI
  • Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.
  • Vaswani, A., et al. (2017). Attention Is All You Need.
  • Almeida, T. A., & Gómez Hidalgo, J. M. SMS Spam Collection Dataset, UCI Machine Learning Repository.

🪪 License

This project is licensed under the terms specified in the LICENSE file.


🙌 Acknowledgments

  • OpenAI, for the GPT-2 architecture and pretrained weights
  • Sebastian Raschka, for educational materials and guidance
  • PyTorch Team, for the deep learning framework
  • SMS Spam Collection Dataset contributors
