A practical, memory-safe pipeline to pretrain a 164M/165M-parameter GPT-style model (SydsGPT/SydsGPTv2) on large text corpora using Hugging Face Datasets, Parquet sharding, and PyTorch optimizations such as torch.compile, gradient accumulation, and rotating checkpoints. The workspace is designed for Windows + PowerShell (pwsh) and CUDA acceleration when available.
- Project root: `e:\Code\SydsGPT-Pretraining-LargeDS`
- Primary notebook: `pretraining-largecorpus.ipynb`
- Model code: `model/`, `modules/`
- Data caches: `data/`
- Large corpus ingestion: FineWeb, Wikipedia, ArXiv via Hugging Face.
- Tokenization with `tiktoken` (GPT-2 vocab) and persisted HF datasets.
- Memory-safe fixed-length chunking to Parquet shards (e.g., 2048 tokens/chunk).
- Flexible DataLoaders with external label shift (no attention mask for fixed chunks).
- Training loop with cosine LR schedule + warmup, gradient clipping, gradient accumulation.
- `torch.compile` integration (Inductor) for speed improvements.
- Optimizer swap to GaLoreAdamW (with safe fallback to AdamW).
- Rotating checkpoints (keep last two) to avoid accidental overwrites.
- Simple text generation utilities for validation.
pretraining-largecorpus.ipynb
model/
SydsGPT.py
modules/
DataLoader.py
DataLoaderv2.py
FeedForward.py
GELU.py
Generate.py
LayerNorm.py
Loss.py
MultiHeadAttention.py
Training.py
TransformerBlock.py
data/
combined_dataset/ ...
combined_tokenized_dataset/ ...
combined_chunks_train_parquet/ ...
combined_chunks_val_parquet/ ...
Use Python 3.12 (recommended). If you plan to use CUDA, install a matching PyTorch build.
```powershell
# From repo root
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Basic dependencies
pip install --upgrade pip wheel setuptools
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install datasets tiktoken pyarrow

# Optional: GaLore optimizer (one of these may work depending on availability)
pip install galore-torch || pip install galore
```

Notes:
- If you encounter space issues on the `C:` drive, set the pip cache/temp to a larger drive before installs:

  ```powershell
  $env:PIP_CACHE_DIR = "E:\pip-cache"
  $env:TEMP = "E:\temp"; $env:TMP = "E:\temp"
  ```

- Verify CUDA availability in Python: `import torch; torch.cuda.is_available()`.
All steps are demonstrated in pretraining-largecorpus.ipynb.
- Load datasets via HF:
  - FineWeb: `HuggingFaceFW/fineweb` (e.g., `sample-10BT`)
  - Wikipedia: `wikimedia/wikipedia` (e.g., `20231101.en`)
  - ArXiv: `timaeus/pile-arxiv`
- Retain only the `text` column; save to disk under `data/`.
- Reload, shuffle, optionally trim, and concatenate into `data/combined_dataset`.
- Tokenize with `tiktoken` and save to `data/combined_tokenized_dataset`, including `input_ids` and `length` columns.
- Stream chunking to Parquet:
  - Flatten tokens into a buffer and write fixed 2048-token chunks to shards.
  - Approximate per-chunk train/val split (e.g., 80/20) into `combined_chunks_{train,val}_parquet`.
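The streaming step above can be sketched as a small generator (a minimal illustration; the notebook's actual shard writer additionally batches chunks and writes them to Parquet via `pyarrow`, and the function name here is hypothetical):

```python
def stream_chunks(token_lists, chunk_len=2048):
    """Yield fixed-length token chunks from a stream of tokenized documents.

    Only a small rolling buffer is held in memory, so the full flattened
    token list is never materialized. Leftover tokens shorter than
    chunk_len are dropped at the end (a simplification)."""
    buffer = []
    for ids in token_lists:
        buffer.extend(ids)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
```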
Why Parquet shards?
- Avoids out-of-memory by not materializing the full flattened token list.
- Enables memory-mapped reads and scalable DataLoader construction.
For fixed 2048-token chunks, we omit attention masks and shift labels externally.
- Collate behavior (`collate_shift`):
  - Inputs: `input_ids` tensor of shape `(B, 2048)`.
  - Targets: same shape, right-shifted by one position; the final position is set to `-100` (ignored by the loss).
- Batch size (`BATCH_SIZE`) can be changed freely; shards remain valid.
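A minimal sketch of such a collate function, assuming each dataset item is a dict with an equal-length `input_ids` list (the real `collate_shift` lives in the notebook/`modules/`):

```python
import torch

def collate_shift(batch):
    # Stack fixed-length chunks into a (B, T) tensor.
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype=torch.long)
    # Build next-token targets by shifting left; the final position gets -100
    # so the loss ignores it. No attention mask is needed for fixed chunks.
    targets = torch.full_like(input_ids, -100)
    targets[:, :-1] = input_ids[:, 1:]
    return input_ids, targets
```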
- `SydsGPT` and `SydsGPTv2` are defined in `model/` and notebook cells.
- The FlashAttention-style block uses PyTorch `scaled_dot_product_attention` with causal masking.
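For reference, the causal-attention call looks like this (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 16, 32  # batch, heads, sequence length, head dim
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# is_causal=True applies the causal mask inside the fused kernel,
# so no (T, T) mask tensor needs to be materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```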
Features implemented in the notebook:
- Cosine decay with warmup:
  - Initial LR → peak LR during warmup.
  - Cosine schedule down to a minimum LR.
- Gradient clipping after warmup (`max_norm=1.0`).
- Gradient accumulation:
  - Scale the loss by `1/grad_accum_steps` per batch.
  - Call `optimizer.step()` every `grad_accum_steps` batches.
- Checkpoint rotation:
  - `autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth`
  - `autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth`
  - `autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth`
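The accumulation and clipping steps can be sketched with a toy model (the LR schedule and real DataLoader are omitted; `grad_accum_steps=4` and the toy batches are illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
grad_accum_steps = 4

optimizer.zero_grad()
for step in range(8):                                     # 8 toy "batches"
    batch = torch.randn(4, 8)
    loss = model(batch).pow(2).mean() / grad_accum_steps  # scale the loss
    loss.backward()                                       # gradients accumulate
    if (step + 1) % grad_accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```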
- `initial_lr`: `1e-6`
- `peak_lr`: `1e-4`
- `min_lr`: `1e-5` (10% of peak)
- Warmup: 1–2% of total steps (tune per hardware and stability).
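The schedule can be written as a pure function of the step (the function name and signature are illustrative, not the notebook's exact code):

```python
import math

def lr_at(step, total_steps, warmup_steps,
          initial_lr=1e-6, peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup from initial_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```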
- Enabled with the Inductor backend for performance.
- Recommended flags on Ampere+ GPUs:

  ```python
  torch.backends.cuda.matmul.allow_tf32 = True
  torch.backends.cudnn.allow_tf32 = True
  torch.backends.cudnn.benchmark = True
  ```

- Choose a precision preference: `torch.set_float32_matmul_precision('high')`.
- Compile: `model = torch.compile(model, backend='inductor', mode='default', dynamic=False)`
- Note: the first step pays the compile cost; subsequent iterations are faster.
- If batch sizes change frequently, consider `dynamic=True` (may reduce the potential speedup).
- The notebook attempts to import GaLoreAdamW from `galore_torch` or `galore`, with a fallback to `torch.optim.AdamW`.
- Typical instantiation:

  ```python
  from galore_torch import GaLoreAdamW  # if available
  optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
  # Fallback:
  # optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)
  ```
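The import fallback can be sketched as follows (package names per the source; which branch succeeds depends on your environment, and the stand-in `Linear` model is only for illustration):

```python
try:
    from galore_torch import GaLoreAdamW as OptimizerCls  # preferred package
except ImportError:
    try:
        from galore import GaLoreAdamW as OptimizerCls    # alternate package name
    except ImportError:
        from torch.optim import AdamW as OptimizerCls     # safe fallback

import torch

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = OptimizerCls(model.parameters(), weight_decay=0.05)
```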
- Rotating checkpoints (keeping the last two) avoid accidental loss of progress.
- Resume example:

  ```python
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
  checkpoint = torch.load('autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth', map_location=device)
  model.load_state_dict(checkpoint['model_state_dict'])
  optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)
  optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  for state in optimizer.state.values():
      for k, v in state.items():
          if isinstance(v, torch.Tensor):
              state[k] = v.to(device)
  model.to(device)
  ```
Use `modules/Generate.py` helpers to produce sample text:

```python
from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken

enc = tiktoken.get_encoding('gpt2')
input_text = 'Once upon a time there was a kingdom far away where'
input_tokens = text_to_tokens(input_text, enc).to(device)
output_tokens = generate(model, input_tokens, 200, SYDSGPT_CONFIG_V2_164M['context_length'], temperature=1.5, top_k=40)
print(tokens_to_text(output_tokens, enc))
```

- Activate the environment and start VS Code:

  ```powershell
  .\.venv\Scripts\Activate.ps1
  code e:\Code\SydsGPT-Pretraining-LargeDS
  ```
- Open `pretraining-largecorpus.ipynb` and run the cells in order:
  1. Load, sanitize, and save datasets.
  2. Tokenize and persist the tokenized dataset.
  3. Stream-chunk to Parquet shards.
  4. Build DataLoaders.
  5. Initialize the model and compile.
  6. Select the optimizer (GaLore/AdamW) and start training.
- CUDA not available:
  - Ensure NVIDIA drivers and a CUDA toolkit compatible with your PyTorch build.
  - Reinstall PyTorch with the correct CUDA wheel (`cu124`, etc.).
- OOM during flattening:
  - Use the Parquet shard pipeline (already implemented).
  - Reduce `SHARD_SIZE_CHUNKS` or `BATCH_SIZE`; increase `grad_accum_steps`.
- Slow compile step:
  - Expected on the first iteration; try `mode='reduce-overhead'` if startup time dominates.
- Tokenization stalls:
  - Set `num_proc` appropriately; ensure enough RAM; consider batching (`batch_size` in `map`).
This repository does not include a license file. If you plan to distribute, add an appropriate license.
- Hugging Face Datasets
- tiktoken
- PyTorch & Inductor
- GaLore optimizer (if available)