# MethylGPT: a foundation model for the DNA methylome
MethylGPT is a transformer-based foundation model pretrained on 226,555 human DNA methylation profiles from 5,281 datasets. It learns universal representations of the methylome that transfer to downstream tasks including:
- Embedding extraction -- dense sample-level representations for clustering, visualization, and classification
- Age prediction -- biological age estimation from methylation data
- Disease prediction -- disease risk stratification from methylation profiles
- Imputation -- recovery of missing CpG site values
## Installation

Requirements: Python 3.9, 3.10, or 3.11 | PyTorch >= 2.0

Install from PyPI:

```bash
pip install methylgpt
```

Or install from source:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -r requirements.txt
pip install -e .
```

Or use a conda environment:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
conda env create -f environment.yml
conda activate methylgpt
pip install -e .
```

For development:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -e ".[dev]"
```

For faster training and inference on supported GPUs:

```bash
pip install flash-attn --no-build-isolation
```

## Quick start

```python
import torch

from methylgpt.model.methyl_model import MethylGPTModel
from methylgpt.model.methyl_vocab import MethylVocab

# Load vocabulary
vocab = MethylVocab(
    probe_id_dir="data/probe_ids_type3.csv",
    pad_token="<pad>",
    special_tokens=["<pad>", "<cls>", "<eoc>"],
    save_dir=None,
)

# Load pretrained model
config = {
    "layer_size": 128, "nhead": 4, "nlayers": 6,
    "dropout": 0.0, "fast_transformer": False, "pre_norm": False,
    "load_model": True,
    "pretrained_file": "pretrained_models/methylgpt-medium/best_model.pt",
}
model = MethylGPTModel.from_pretrained(config, vocab)
model.eval()

# Extract embeddings (gene_ids and values from your tokenized data)
with torch.no_grad():
    embeddings = model.get_cell_embeddings(gene_ids, values)
```

For complete working examples, see the tutorials below.
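Once extracted, the embeddings are plain vectors and can feed any downstream estimator. A minimal sketch of that downstream step, using synthetic arrays as a stand-in for real MethylGPT embeddings and a scikit-learn classifier (the data and task here are illustrative assumptions, not part of the MethylGPT API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for sample-level embeddings (n_samples x 128)
n_samples, dim = 200, 128
labels = rng.integers(0, 2, size=n_samples)
embeddings = rng.normal(size=(n_samples, dim)) + labels[:, None] * 0.5

# Train a simple classifier on the embedding vectors
X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

In practice you would replace the synthetic array with the `embeddings` tensor returned by `get_cell_embeddings` (moved to NumPy via `.cpu().numpy()`).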
## Pretrained models

| Model | Embedding Dim | Layers | Heads | Parameters | Download |
|---|---|---|---|---|---|
| methylGPT-base | 64 | 6 | 4 | 3M | Download |
| methylGPT-medium | 128 | 6 | 4 | 7M | Download |
| methylGPT-large | 256 | 6 | 4 | 15M | Download |
- base: Lightweight experiments and quick prototyping.
- medium: Balanced performance for most applications (recommended).
- large: Comprehensive studies and maximum accuracy.
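The parameter counts in the table are dominated by the CpG probe-embedding table rather than the transformer layers. A rough back-of-envelope estimate illustrates this; the ~49k vocabulary size below is an assumption for illustration, not the exact MethylGPT figure:

```python
def approx_params(d_model, n_layers, vocab_size=49_000):
    """Rough transformer parameter estimate, ignoring biases and layer norms."""
    embedding = vocab_size * d_model   # probe-embedding table
    per_layer = 12 * d_model ** 2      # attention (~4 d^2) + feed-forward (~8 d^2)
    return embedding + n_layers * per_layer

for name, d in [("base", 64), ("medium", 128), ("large", 256)]:
    print(f"{name}: ~{approx_params(d, 6) / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land in the same ballpark as the table, and show why doubling the embedding dimension roughly doubles the model size.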
## Pretraining data

The pretraining corpus comprises 226,555 human DNA methylation profiles from 5,281 datasets across EWAS Data Hub and Clockbase.
You can also load these data from a LaminDB instance: https://lamin.ai/laminlabs/methyldata
## Tutorials

| Tutorial | Description | Format |
|---|---|---|
| Quickstart | Get started in 5 minutes | Notebook |
| Embedding extraction | Extract cell-level embeddings from pretrained model | Notebook + script |
| Embedding analysis | UMAP visualization, clustering, silhouette scores | Notebook |
| Age prediction | Finetune for biological age prediction | Notebook + script |
| Disease prediction | Disease risk prediction with Ridge regression | Notebook |
| Imputation | Recover missing CpG values | Notebook |
| Pretraining | Train MethylGPT from scratch on methylation data | Script |
| CpG selection | Feature selection analysis | Notebook |
All notebooks include a Colab setup cell and work both locally and on Google Colab.
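As in the age- and disease-prediction tutorials, a simple linear head on top of the embeddings is often a strong baseline. A sketch with Ridge regression, using synthetic embeddings with a planted age signal (all data here is simulated; real usage would substitute MethylGPT embeddings and chronological ages):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Simulated embeddings with an age signal planted in the first few dimensions
n_samples, dim = 300, 128
age = rng.uniform(20, 80, size=n_samples)
embeddings = rng.normal(size=(n_samples, dim))
embeddings[:, :4] += (age[:, None] - 50) / 10.0

# Cross-validated R^2 of a Ridge head predicting age from embeddings
r2 = cross_val_score(Ridge(alpha=1.0), embeddings, age,
                     cv=5, scoring="r2").mean()
print(f"cross-validated R^2: {r2:.2f}")
```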
## Documentation

| Document | Description |
|---|---|
| Inference Guide | Loading models, extracting embeddings, imputation, GPU/CPU tips |
| API Reference | Full public API with signatures and descriptions |
| Troubleshooting | Common errors and solutions (flash-attn, CUDA OOM, torchtext, Colab) |
## Hardware requirements

- Inference / embedding extraction: 1 GPU with >= 8 GB VRAM (or CPU, slower)
- Finetuning: 1 GPU with >= 16 GB VRAM recommended
- Pretraining: 1+ GPUs with >= 40 GB VRAM recommended (H100, A100)
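Before loading a model you can check what hardware is available. A small sketch using standard PyTorch calls:

```python
import torch

# Pick a device and report GPU memory before loading the model
if torch.cuda.is_available():
    device = torch.device("cuda")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Using {torch.cuda.get_device_name(0)} with {total_gb:.1f} GB VRAM")
else:
    device = torch.device("cpu")
    print("No GPU detected; falling back to CPU (inference will be slower)")
```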
## Contributing

We welcome contributions. Please submit a pull request for ideas or bug fixes, and open an issue for questions or problems.
## Acknowledgments

MethylGPT's backend architecture is based on scGPT (Wang Lab).

## Citation

```bibtex
@article{ying2024methylgpt,
  title={MethylGPT: a foundation model for the DNA methylome},
  author={Ying, Kejun and Song, Jinyeop and Cui, Haotian and Zhang, Yikun and Li, Siyuan and Chen, Xingyu and Liu, Hanna and Eames, Alec and McCartney, Daniel L and Marioni, Riccardo E and others},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```

If you use the GEO metadata (derived from ClockBase), please also cite:
```bibtex
@article{ying2023clockbase,
  title={ClockBase: a comprehensive platform for biological age profiling in human and mouse},
  author={Ying, K. and Tyshkovskiy, A. and Trapp, A. and Liu, H. and Moqri, M. and Kerepesi, C. and Gladyshev, V.N.},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.02.28.530532}
}
```