
MethylGPT

(Schematic overview of MethylGPT)

MethylGPT: a foundation model for the DNA methylome


Overview

MethylGPT is a transformer-based foundation model pretrained on 226,555 human DNA methylation profiles from 5,281 datasets. It learns universal representations of the methylome that transfer to downstream tasks including:

  • Embedding extraction -- dense sample-level representations for clustering, visualization, and classification
  • Age prediction -- biological age estimation from methylation data
  • Disease prediction -- disease risk stratification from methylation profiles
  • Imputation -- recovery of missing CpG site values

Installation

Requirements: Python 3.9, 3.10, or 3.11 | PyTorch >= 2.0

From PyPI (recommended)

pip install methylgpt

From requirements.txt (pinned versions)

git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -r requirements.txt
pip install -e .

From conda (environment.yml)

git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
conda env create -f environment.yml
conda activate methylgpt
pip install -e .

From source (for development)

git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -e ".[dev]"

Optional: flash attention

For faster training and inference on supported GPUs:

pip install flash-attn --no-build-isolation

Quick Start

import torch
from methylgpt.model.methyl_model import MethylGPTModel
from methylgpt.model.methyl_vocab import MethylVocab

# Load vocabulary
vocab = MethylVocab(
    probe_id_dir="data/probe_ids_type3.csv",
    pad_token="<pad>",
    special_tokens=["<pad>", "<cls>", "<eoc>"],
    save_dir=None,
)

# Load pretrained model
config = {
    "layer_size": 128, "nhead": 4, "nlayers": 6,
    "dropout": 0.0, "fast_transformer": False, "pre_norm": False,
    "load_model": True,
    "pretrained_file": "pretrained_models/methylgpt-medium/best_model.pt",
}
model = MethylGPTModel.from_pretrained(config, vocab)
model.eval()

# Extract embeddings (gene_ids and values from your tokenized data)
with torch.no_grad():
    embeddings = model.get_cell_embeddings(gene_ids, values)

For complete working examples, see the tutorials below.
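The extracted embeddings are ordinary float arrays, so any downstream tool works on them. A minimal sketch of the clustering/silhouette analysis mentioned in the Overview, using a random array as a stand-in for real model output (methylgpt-medium embeddings would be 128-dimensional):

```python
# Downstream use of sample embeddings: KMeans clustering + silhouette score.
# A random (n_samples, 128) array stands in for real MethylGPT output here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128)).astype(np.float32)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(embeddings)
score = silhouette_score(embeddings, kmeans.labels_)
print(f"{len(set(kmeans.labels_))} clusters, silhouette = {score:.3f}")
```

In practice you would pass the tensor returned by `get_cell_embeddings` (converted via `.cpu().numpy()`) instead of the random array.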

Pretrained Models

Model             Embedding Dim  Layers  Heads  Parameters  Download
methylGPT-base    64             6       4      3M          Download
methylGPT-medium  128            6       4      7M          Download
methylGPT-large   256            6       4      15M         Download
  • base: Lightweight experiments and quick prototyping.
  • medium: Balanced performance for most applications (recommended).
  • large: Comprehensive studies and maximum accuracy.
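Per the table, the three checkpoints appear to differ only in embedding dimension. A small helper in the style of the Quick Start config (this is an assumption about the checkpoints, not a documented API; the shared `nhead`/`nlayers` values are taken from the Quick Start example):

```python
# Hypothetical per-size configs: embedding dim varies per the table above,
# all other settings are copied from the Quick Start config (an assumption).
SIZES = {"base": 64, "medium": 128, "large": 256}

def make_config(size: str, pretrained_file: str) -> dict:
    return {
        "layer_size": SIZES[size],  # embedding dimension from the table
        "nhead": 4, "nlayers": 6,
        "dropout": 0.0, "fast_transformer": False, "pre_norm": False,
        "load_model": True,
        "pretrained_file": pretrained_file,
    }

cfg = make_config("medium", "pretrained_models/methylgpt-medium/best_model.pt")
```

Verify the actual layer sizes against each checkpoint's config before relying on this mapping.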

Pretraining Data

The pretraining corpus comprises 226,555 human DNA methylation profiles from 5,281 datasets across EWAS Data Hub and Clockbase.

  • Preprocessed dataset (type3, default): Download
  • CpG probe IDs (type3, default): Download

You can also load these data from a LaminDB instance: https://lamin.ai/laminlabs/methyldata

Tutorials

Tutorial              Description                                              Format
Quickstart            Get started in 5 minutes                                 Notebook
Embedding extraction  Extract cell-level embeddings from the pretrained model  Notebook + script
Embedding analysis    UMAP visualization, clustering, silhouette scores        Notebook
Age prediction        Finetune for biological age prediction                   Notebook + script
Disease prediction    Disease risk prediction with Ridge regression            Notebook
Imputation            Recover missing CpG values                               Notebook
Pretraining           Train MethylGPT from scratch on methylation data         Script
CpG selection         Feature selection analysis                               Notebook

All notebooks include a Colab setup cell and work both locally and on Google Colab.
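The disease-prediction tutorial pairs model embeddings with Ridge regression. A minimal sketch on synthetic data (the real input would be MethylGPT embeddings and cohort labels; names and data here are illustrative only):

```python
# Disease risk prediction via a Ridge model on embeddings: synthetic sketch.
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))                              # stand-in embeddings
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)   # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RidgeClassifier(alpha=1.0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

See the disease-prediction notebook for the actual preprocessing and evaluation used in the paper.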

Documentation

Document         Description
Inference Guide  Loading models, extracting embeddings, imputation, GPU/CPU tips
API Reference    Full public API with signatures and descriptions
Troubleshooting  Common errors and solutions (flash-attn, CUDA OOM, torchtext, Colab)

Hardware Requirements

  • Inference / embedding extraction: 1 GPU with >= 8 GB VRAM (or CPU, slower)
  • Finetuning: 1 GPU with >= 16 GB VRAM recommended
  • Pretraining: 1+ GPUs with >= 40 GB VRAM recommended (H100, A100)

Contributing

We welcome contributions. Please open an issue for questions or problems, and submit a pull request for fixes or improvements.

Acknowledgements

MethylGPT's backend architecture is based on scGPT (Wang Lab).

Citing MethylGPT

@article{ying2024methylgpt,
  title={MethylGPT: a foundation model for the DNA methylome},
  author={Ying, Kejun and Song, Jinyeop and Cui, Haotian and Zhang, Yikun and Li, Siyuan and Chen, Xingyu and Liu, Hanna and Eames, Alec and McCartney, Daniel L and Marioni, Riccardo E and others},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

If you use the GEO metadata (derived from ClockBase), please also cite:

@article{ying2023clockbase,
  title={ClockBase: a comprehensive platform for biological age profiling in human and mouse},
  author={Ying, K. and Tyshkovskiy, A. and Trapp, A. and Liu, H. and Moqri, M. and Kerepesi, C. and Gladyshev, V.N.},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.02.28.530532}
}

License

Apache 2.0

About

This is the official codebase for MethylGPT: a foundation model for the DNA methylome.
