siddsachar/Pretraining-LargeCorpus


SydsGPT Pretraining on Large Datasets

A practical, memory-safe pipeline to pretrain a ~164M-parameter GPT-style model (SydsGPT/SydsGPTv2) on large text corpora using Hugging Face Datasets, Parquet sharding, and PyTorch optimizations such as torch.compile, gradient accumulation, and rotating checkpoints. The workspace targets Windows + PowerShell (pwsh), with CUDA acceleration when available.

  • Project root: e:\Code\SydsGPT-Pretraining-LargeDS
  • Primary notebook: pretraining-largecorpus.ipynb
  • Model code: model/, modules/
  • Data caches: data/

Key Features

  • Large corpus ingestion: FineWeb, Wikipedia, ArXiv via Hugging Face.
  • Tokenization with tiktoken (GPT-2 vocab) and persisted HF datasets.
  • Memory-safe fixed-length chunking to Parquet shards (e.g., 2048 tokens/chunk).
  • Flexible DataLoaders with external label shift (no attention mask for fixed chunks).
  • Training loop with cosine LR schedule + warmup, gradient clipping, gradient accumulation.
  • torch.compile integration (Inductor) for speed improvements.
  • Optimizer swap to GaLoreAdamW (with safe fallback to AdamW).
  • Rotating checkpoints (current plus two previous copies) to avoid accidental overwrites.
  • Simple text generation utilities for validation.

Repository Structure

pretraining-largecorpus.ipynb
model/
  SydsGPT.py
modules/
  DataLoader.py
  DataLoaderv2.py
  FeedForward.py
  GELU.py
  Generate.py
  LayerNorm.py
  Loss.py
  MultiHeadAttention.py
  Training.py
  TransformerBlock.py
data/
  combined_dataset/ ...
  combined_tokenized_dataset/ ...
  combined_chunks_train_parquet/ ...
  combined_chunks_val_parquet/ ...

Environment Setup (Windows + pwsh)

Use Python 3.12 (recommended). If you plan to use CUDA, install a matching PyTorch build.

# From repo root
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Basic dependencies
pip install --upgrade pip wheel setuptools
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install datasets tiktoken pyarrow

# Optional: GaLore optimizer (one of these may work depending on availability)
pip install galore-torch || pip install galore

Notes:

  • If you encounter space issues on C: drive, set pip cache/temp to a larger drive before installs:
    $env:PIP_CACHE_DIR = "E:\pip-cache"
    $env:TEMP = "E:\temp"; $env:TMP = "E:\temp"
  • Verify CUDA availability in Python: import torch; torch.cuda.is_available().
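The CUDA check from the note above, as a runnable snippet:

```python
import torch

# Confirm the installed build and whether it can see a CUDA device.
print(torch.__version__)
print(torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```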

Data Preparation Workflow

All steps are demonstrated in pretraining-largecorpus.ipynb.

  1. Load datasets via HF:
    • FineWeb: HuggingFaceFW/fineweb (e.g., sample-10BT)
    • Wikipedia: wikimedia/wikipedia (e.g., 20231101.en)
    • ArXiv: timaeus/pile-arxiv
  2. Retain only text column, save to disk under data/.
  3. Reload, shuffle, optionally trim, and concatenate into data/combined_dataset.
  4. Tokenize with tiktoken and save to data/combined_tokenized_dataset including input_ids and length.
  5. Stream chunking to Parquet:
    • Flatten tokens in a buffer and write fixed 2048-token chunks to shards.
    • Approximate train/val split per-chunk (e.g., 80/20) into combined_chunks_{train,val}_parquet.

Why Parquet shards?

  • Avoids out-of-memory by not materializing the full flattened token list.
  • Enables memory-mapped reads and scalable DataLoader construction.

Training DataLoaders

For fixed 2048-token chunks, we omit attention masks and shift labels externally.

  • Collate behavior (collate_shift):
    • Inputs: input_ids tensor of shape (B, 2048)
    • Targets: the inputs shifted by one position so each target is the next token; the final position is set to -100, which the loss ignores.
  • Batch size (BATCH_SIZE) can be changed freely; shards remain valid.
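A minimal sketch of a collate function with this behavior (`collate_shift` here is illustrative, not necessarily the notebook's exact implementation):

```python
import torch

def collate_shift(batch):
    """Build (inputs, targets) from fixed-length chunks; no attention mask needed.

    batch: list of dicts with an 'input_ids' list of fixed length (e.g., 2048).
    Each target is the next input token; the final slot is -100, which
    cross-entropy skips via ignore_index=-100.
    """
    input_ids = torch.tensor([ex["input_ids"] for ex in batch], dtype=torch.long)
    targets = torch.full_like(input_ids, -100)
    targets[:, :-1] = input_ids[:, 1:]
    return input_ids, targets
```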

Models

  • SydsGPT and SydsGPTv2 are defined in model/ and notebook cells.
  • FlashAttention-style block uses PyTorch scaled_dot_product_attention with causal masking.

Training Loop

Features implemented in the notebook:

  • Cosine decay with warmup:
    • Initial LR → peak LR during warmup.
    • Cosine schedule towards a minimum LR.
  • Gradient clipping after warmup (max_norm=1.0).
  • Gradient accumulation:
    • Scale loss by 1/grad_accum_steps per batch.
    • Call optimizer.step() every grad_accum_steps batches.
  • Checkpoint rotation:
    • autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth
    • autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth
    • autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth
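The gradient-accumulation logic above can be sketched as a single cycle. This is a minimal illustration, not the notebook's loop: `run_accumulation_cycle` is a hypothetical helper, and the loss uses `ignore_index=-100` to match the collate behavior.

```python
import torch
import torch.nn.functional as F

def run_accumulation_cycle(model, optimizer, micro_batches,
                           grad_accum_steps=4, max_norm=1.0):
    # micro_batches: iterable of (input_ids, targets) pairs from the DataLoader.
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(micro_batches, start=1):
        logits = model(inputs)                        # (B, T, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1),  # (B*T, vocab)
                               targets.flatten(),
                               ignore_index=-100)
        (loss / grad_accum_steps).backward()          # scale so grads average
        if step % grad_accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            optimizer.zero_grad()
```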

Learning Rate Guidance (example for ~164M params)

  • initial_lr: 1e-6
  • peak_lr: 1e-4
  • min_lr: 1e-5 (10% of peak)
  • Warmup: 1–2% of total steps (tune per hardware and stability).
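One common way to realize this schedule, using the values above as defaults (`get_lr` is an illustrative helper, not the notebook's exact function):

```python
import math

def get_lr(step, total_steps, warmup_steps,
           initial_lr=1e-6, peak_lr=1e-4, min_lr=1e-5):
    if step < warmup_steps:
        # Linear warmup: initial_lr -> peak_lr.
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay: peak_lr -> min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The value is typically written into each `param_group["lr"]` once per optimizer step.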

torch.compile Integration

  • Enabled with Inductor backend for performance acceleration.
  • Recommended flags on Ampere+ GPUs:
    • torch.backends.cuda.matmul.allow_tf32 = True
    • torch.backends.cudnn.allow_tf32 = True
    • torch.backends.cudnn.benchmark = True
  • Choose precision preference: torch.set_float32_matmul_precision('high').
  • Compile:
    model = torch.compile(model, backend='inductor', mode='default', dynamic=False)
  • Note: First step pays compile cost; subsequent iterations are faster.
  • If batch sizes change frequently, consider dynamic=True (may reduce potential speedups).
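Putting the flags and the compile call together (a sketch; `compile_model` is an illustrative helper):

```python
import torch
import torch.nn as nn

# Speed-oriented settings for Ampere+ GPUs (harmless no-ops on CPU).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision("high")

def compile_model(model: nn.Module) -> nn.Module:
    # dynamic=False assumes fixed (B, 2048) input shapes; pass dynamic=True
    # if batch size or sequence length varies between steps.
    return torch.compile(model, backend="inductor", mode="default", dynamic=False)
```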

Optimizer: GaLoreAdamW

  • The notebook attempts to import GaLoreAdamW from galore_torch or galore, with a fallback to torch.optim.AdamW.
  • Typical instantiation (with the fallback made explicit):
    try:
        from galore_torch import GaLoreAdamW  # or: from galore import GaLoreAdamW
        optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
    except ImportError:
        optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)

Checkpointing & Resume

  • Rotating checkpoints (current plus two previous copies) avoid accidental loss of progress.
  • Resume example:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
    checkpoint = torch.load('autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth', map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    for state in optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)
    model.to(device)
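A possible save-side counterpart implementing the rotation (a sketch; `save_rotating_checkpoint` is a hypothetical helper matching the autosave file names listed above):

```python
import os
import torch

CKPT  = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
PREV1 = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
PREV2 = "autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth"

def save_rotating_checkpoint(model, optimizer):
    # Shift older copies down before writing the newest file; the oldest
    # (prev2) is overwritten, so at most three files ever exist on disk.
    if os.path.exists(PREV1):
        os.replace(PREV1, PREV2)
    if os.path.exists(CKPT):
        os.replace(CKPT, PREV1)
    torch.save({"model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict()}, CKPT)
```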

Generate Text (Quick Sanity Check)

Use modules/Generate.py helpers to produce sample text.

from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken

enc = tiktoken.get_encoding('gpt2')
input_text = 'Once upon a time there was a kingdom far away where'
input_tokens = text_to_tokens(input_text, enc).to(device)
output_tokens = generate(model, input_tokens, 200, SYDSGPT_CONFIG_V2_164M['context_length'], temperature=1.5, top_k=40)
print(tokens_to_text(output_tokens, enc))

Quick Start (Notebook)

  1. Activate environment and start VS Code:
    .\.venv\Scripts\Activate.ps1
    code e:\Code\SydsGPT-Pretraining-LargeDS
  2. Open pretraining-largecorpus.ipynb and run cells in order:
    • Load, sanitize, save datasets.
    • Tokenize and persist tokenized dataset.
    • Stream chunk to Parquet shards.
    • Build DataLoaders.
    • Initialize model and compile.
    • Select optimizer (GaLore/AdamW) and start training.

Troubleshooting

  • CUDA not available:
    • Ensure NVIDIA drivers + CUDA toolkit compatible with your PyTorch build.
    • Reinstall PyTorch with the correct CUDA wheel (cu124, etc.).
  • OOM during flattening:
    • Use the Parquet shard pipeline (already implemented).
    • Reduce SHARD_SIZE_CHUNKS or BATCH_SIZE; increase grad_accum_steps.
  • Slow compile step:
    • Expected on first iteration; try mode='reduce-overhead' if startup dominates.
  • Tokenization stalls:
    • Set num_proc appropriately; ensure enough RAM; consider batching (batch_size in map).

License

This repository does not include a license file. If you plan to distribute, add an appropriate license.

Acknowledgments

  • Hugging Face Datasets
  • tiktoken
  • PyTorch & Inductor
  • GaLore optimizer (if available)
