A practical, memory-safe pipeline to pretrain a 164M/165M-parameter GPT-style model (SydsGPT/SydsGPTv2) on large text corpora using Hugging Face Datasets, Parquet sharding, and PyTorch optimizations such as torch.compile, gradient accumulation, and rotating checkpoints. The workspace is designed for Windows + PowerShell (pwsh) and CUDA acceleration when available.
- Project root: `e:\Code\SydsGPT-Pretraining-LargeDS`
- Primary notebook: `pretraining-largecorpus.ipynb`
- Model code: `model/`, `modules/`
- Data caches: `data/`
- Large corpus ingestion: FineWeb, Wikipedia, ArXiv via Hugging Face.
- Tokenization with `tiktoken` (GPT-2 vocab) and persisted HF datasets.
- Memory-safe fixed-length chunking to Parquet shards (e.g., 2048 tokens/chunk).
- Flexible DataLoaders with external label shift (no attention mask for fixed chunks).
- Training loop with cosine LR schedule + warmup, gradient clipping, gradient accumulation.
- `torch.compile` integration (Inductor) for speed improvements.
- Optimizer swap to GaLoreAdamW (with safe fallback to AdamW).
- Rotating checkpoints (keep last two) to avoid accidental overwrites.
- Simple text generation utilities for validation.
pretraining-largecorpus.ipynb
model/
SydsGPT.py
modules/
DataLoader.py
DataLoaderv2.py
FeedForward.py
GELU.py
Generate.py
LayerNorm.py
Loss.py
MultiHeadAttention.py
Training.py
TransformerBlock.py
data/
combined_dataset/ ...
combined_tokenized_dataset/ ...
combined_chunks_train_parquet/ ...
combined_chunks_val_parquet/ ...
Use Python 3.12 (recommended). If you plan to use CUDA, install a matching PyTorch build.
```powershell
# From repo root
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Basic dependencies
pip install --upgrade pip wheel setuptools
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install datasets tiktoken pyarrow

# Optional: GaLore optimizer (one of these may work depending on availability)
pip install galore-torch || pip install galore
```

Notes:
- If you encounter space issues on the `C:` drive, set the pip cache/temp to a larger drive before installs:

  ```powershell
  $env:PIP_CACHE_DIR = "E:\pip-cache"
  $env:TEMP = "E:\temp"; $env:TMP = "E:\temp"
  ```

- Verify CUDA availability in Python: `import torch; torch.cuda.is_available()`.
All steps are demonstrated in pretraining-largecorpus.ipynb.
- Load datasets via HF:
  - FineWeb: `HuggingFaceFW/fineweb` (e.g., `sample-10BT`)
  - Wikipedia: `wikimedia/wikipedia` (e.g., `20231101.en`)
  - ArXiv: `timaeus/pile-arxiv`
- Retain only the `text` column; save to disk under `data/`.
- Reload, shuffle, optionally trim, and concatenate into `data/combined_dataset`.
- Tokenize with `tiktoken` and save to `data/combined_tokenized_dataset`, including `input_ids` and `length` columns.
- Stream chunking to Parquet:
  - Flatten tokens into a buffer and write fixed 2048-token chunks to shards.
  - Approximate per-chunk train/val split (e.g., 80/20) into `combined_chunks_{train,val}_parquet`.
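The streaming step above can be sketched as a small generator (a minimal illustration; the notebook's actual shard writer additionally batches chunks and writes them to Parquet via `pyarrow`, and the function name here is hypothetical):

```python
def stream_chunks(token_lists, chunk_len=2048):
    """Yield fixed-length token chunks from a stream of tokenized documents.

    Only a small rolling buffer is held in memory, so the full flattened
    token list is never materialized. Leftover tokens shorter than
    chunk_len are dropped at the end (a simplification)."""
    buffer = []
    for ids in token_lists:
        buffer.extend(ids)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
```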
Why Parquet shards?
- Avoids out-of-memory by not materializing the full flattened token list.
- Enables memory-mapped reads and scalable DataLoader construction.
For fixed 2048-token chunks, we omit attention masks and shift labels externally.
- Collate behavior (`collate_shift`):
  - Inputs: `input_ids` tensor of shape `(B, 2048)`.
  - Targets: same shape, right-shifted by one position; the final position is set to `-100` (ignored by the loss).
- Batch size (`BATCH_SIZE`) can be changed freely; shards remain valid.
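A minimal sketch of such a collate function, assuming each dataset item is a dict with an equal-length `input_ids` list (the real `collate_shift` lives in the notebook/`modules/`):

```python
import torch

def collate_shift(batch):
    # Stack fixed-length chunks into a (B, T) tensor.
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype=torch.long)
    # Build next-token targets by shifting left; the final position gets -100
    # so the loss ignores it. No attention mask is needed for fixed chunks.
    targets = torch.full_like(input_ids, -100)
    targets[:, :-1] = input_ids[:, 1:]
    return input_ids, targets
```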
- `SydsGPT` and `SydsGPTv2` are defined in `model/` and notebook cells.
- The FlashAttention-style block uses PyTorch `scaled_dot_product_attention` with causal masking.
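For reference, the causal-attention call looks like this (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 16, 32  # batch, heads, sequence length, head dim
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# is_causal=True applies the causal mask inside the fused kernel,
# so no (T, T) mask tensor needs to be materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```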
Features implemented in the notebook:
- Cosine decay with warmup:
  - Initial LR → peak LR during warmup.
  - Cosine schedule down to a minimum LR.
- Gradient clipping after warmup (`max_norm=1.0`).
- Gradient accumulation:
  - Scale the loss by `1/grad_accum_steps` per batch.
  - Call `optimizer.step()` every `grad_accum_steps` batches.
- Checkpoint rotation:
  - `autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth`
  - `autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth`
  - `autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth`
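The accumulation and clipping steps can be sketched with a toy model (the LR schedule and real DataLoader are omitted; `grad_accum_steps=4` and the toy batches are illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
grad_accum_steps = 4

optimizer.zero_grad()
for step in range(8):                                     # 8 toy "batches"
    batch = torch.randn(4, 8)
    loss = model(batch).pow(2).mean() / grad_accum_steps  # scale the loss
    loss.backward()                                       # gradients accumulate
    if (step + 1) % grad_accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```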
- `initial_lr`: `1e-6`
- `peak_lr`: `1e-4`
- `min_lr`: `1e-5` (10% of peak)
- Warmup: 1–2% of total steps (tune per hardware and stability).
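The schedule can be written as a pure function of the step (the function name and signature are illustrative, not the notebook's exact code):

```python
import math

def lr_at(step, total_steps, warmup_steps,
          initial_lr=1e-6, peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup from initial_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```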
- Enabled with the Inductor backend for performance.
- Recommended flags on Ampere+ GPUs:

  ```python
  torch.backends.cuda.matmul.allow_tf32 = True
  torch.backends.cudnn.allow_tf32 = True
  torch.backends.cudnn.benchmark = True
  ```

- Choose a precision preference: `torch.set_float32_matmul_precision('high')`.
- Compile: `model = torch.compile(model, backend='inductor', mode='default', dynamic=False)`
- Note: the first step pays the compile cost; subsequent iterations are faster.
- If batch sizes change frequently, consider `dynamic=True` (may reduce the potential speedup).
- The notebook attempts to import GaLoreAdamW from `galore_torch` or `galore`, with a fallback to `torch.optim.AdamW`.
- Typical instantiation:

  ```python
  from galore_torch import GaLoreAdamW  # if available
  optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
  # Fallback:
  # optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)
  ```
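The import fallback can be sketched as follows (package names per the source; which branch succeeds depends on your environment, and the stand-in `Linear` model is only for illustration):

```python
try:
    from galore_torch import GaLoreAdamW as OptimizerCls  # preferred package
except ImportError:
    try:
        from galore import GaLoreAdamW as OptimizerCls    # alternate package name
    except ImportError:
        from torch.optim import AdamW as OptimizerCls     # safe fallback

import torch

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = OptimizerCls(model.parameters(), weight_decay=0.05)
```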
- Rotating checkpoints (keeping the last two) avoid accidental loss of progress.
- Resume example:

  ```python
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
  checkpoint = torch.load('autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth', map_location=device)
  model.load_state_dict(checkpoint['model_state_dict'])
  optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.05)
  optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  for state in optimizer.state.values():
      for k, v in state.items():
          if isinstance(v, torch.Tensor):
              state[k] = v.to(device)
  model.to(device)
  ```
Use `modules/Generate.py` helpers to produce sample text:

```python
from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken

enc = tiktoken.get_encoding('gpt2')
input_text = 'Once upon a time there was a kingdom far away where'
input_tokens = text_to_tokens(input_text, enc).to(device)
output_tokens = generate(model, input_tokens, 200, SYDSGPT_CONFIG_V2_164M['context_length'], temperature=1.5, top_k=40)
print(tokens_to_text(output_tokens, enc))
```

- Activate the environment and start VS Code:

  ```powershell
  .\.venv\Scripts\Activate.ps1
  code e:\Code\SydsGPT-Pretraining-LargeDS
  ```
- Open `pretraining-largecorpus.ipynb` and run the cells in order:
  1. Load, sanitize, and save datasets.
  2. Tokenize and persist the tokenized dataset.
  3. Stream-chunk to Parquet shards.
  4. Build DataLoaders.
  5. Initialize the model and compile.
  6. Select the optimizer (GaLore/AdamW) and start training.
- CUDA not available:
  - Ensure NVIDIA drivers and a CUDA toolkit compatible with your PyTorch build.
  - Reinstall PyTorch with the correct CUDA wheel (`cu124`, etc.).
- OOM during flattening:
  - Use the Parquet shard pipeline (already implemented).
  - Reduce `SHARD_SIZE_CHUNKS` or `BATCH_SIZE`; increase `grad_accum_steps`.
- Slow compile step:
  - Expected on the first iteration; try `mode='reduce-overhead'` if startup time dominates.
- Tokenization stalls:
  - Set `num_proc` appropriately; ensure enough RAM; consider batching (`batch_size` in `map`).
This repository does not include a license file. If you plan to distribute, add an appropriate license.
- Hugging Face Datasets
- tiktoken
- PyTorch & Inductor
- GaLore optimizer (if available)