# MethylGPT: a foundation model for the DNA methylome
MethylGPT is a transformer-based foundation model pretrained on 226,555 human DNA methylation profiles from 5,281 datasets. It learns universal representations of the methylome that transfer to downstream tasks including:
- Embedding extraction -- dense sample-level representations for clustering, visualization, and classification
- Age prediction -- biological age estimation from methylation data
- Disease prediction -- disease risk stratification from methylation profiles
- Imputation -- recovery of missing CpG site values
## Installation

Requirements: Python 3.9, 3.10, or 3.11 | PyTorch >= 2.0

Install from PyPI:

```bash
pip install methylgpt
```

Or install from source:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -r requirements.txt
pip install -e .
```

Or use a conda environment:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
conda env create -f environment.yml
conda activate methylgpt
pip install -e .
```

For development:

```bash
git clone https://github.com/albert-ying/MethylGPT.git
cd MethylGPT
pip install -e ".[dev]"
```

For faster training and inference on supported GPUs:

```bash
pip install flash-attn --no-build-isolation
```

## Quick start

```python
import torch

from methylgpt.model.methyl_model import MethylGPTModel
from methylgpt.model.methyl_vocab import MethylVocab

# Load vocabulary
vocab = MethylVocab(
    probe_id_dir="data/probe_ids_type3.csv",
    pad_token="<pad>",
    special_tokens=["<pad>", "<cls>", "<eoc>"],
    save_dir=None,
)

# Load pretrained model
config = {
    "layer_size": 128, "nhead": 4, "nlayers": 6,
    "dropout": 0.0, "fast_transformer": False, "pre_norm": False,
    "load_model": True,
    "pretrained_file": "pretrained_models/methylgpt-medium/best_model.pt",
}
model = MethylGPTModel.from_pretrained(config, vocab)
model.eval()

# Extract embeddings (gene_ids and values from your tokenized data)
with torch.no_grad():
    embeddings = model.get_cell_embeddings(gene_ids, values)
```

For complete working examples, see the tutorials below.
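Once extracted, the embeddings are plain vectors and can feed any downstream estimator. A minimal sketch of that downstream step, using synthetic arrays as a stand-in for real MethylGPT embeddings and a scikit-learn classifier (the data and task here are illustrative assumptions, not part of the MethylGPT API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for sample-level embeddings (n_samples x 128)
n_samples, dim = 200, 128
labels = rng.integers(0, 2, size=n_samples)
embeddings = rng.normal(size=(n_samples, dim)) + labels[:, None] * 0.5

# Train a simple classifier on the embedding vectors
X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

In practice you would replace the synthetic array with the `embeddings` tensor returned by `get_cell_embeddings` (moved to NumPy via `.cpu().numpy()`).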
## Pretrained models

| Model | Embedding Dim | Layers | Heads | Parameters | Download |
|---|---|---|---|---|---|
| methylGPT-base | 64 | 6 | 4 | 3M | Download |
| methylGPT-medium | 128 | 6 | 4 | 7M | Download |
| methylGPT-large | 256 | 6 | 4 | 15M | Download |
- base: Lightweight experiments and quick prototyping.
- medium: Balanced performance for most applications (recommended).
- large: Comprehensive studies and maximum accuracy.
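The parameter counts in the table are dominated by the CpG probe-embedding table rather than the transformer layers. A rough back-of-envelope estimate illustrates this; the ~49k vocabulary size below is an assumption for illustration, not the exact MethylGPT figure:

```python
def approx_params(d_model, n_layers, vocab_size=49_000):
    """Rough transformer parameter estimate, ignoring biases and layer norms."""
    embedding = vocab_size * d_model   # probe-embedding table
    per_layer = 12 * d_model ** 2      # attention (~4 d^2) + feed-forward (~8 d^2)
    return embedding + n_layers * per_layer

for name, d in [("base", 64), ("medium", 128), ("large", 256)]:
    print(f"{name}: ~{approx_params(d, 6) / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land in the same ballpark as the table, and show why doubling the embedding dimension roughly doubles the model size.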
## Pretraining data

The pretraining corpus comprises 226,555 human DNA methylation profiles from 5,281 datasets across EWAS Data Hub and Clockbase.
You can also load these data from a LaminDB instance: https://lamin.ai/laminlabs/methyldata
## Tutorials

| Tutorial | Description | Format |
|---|---|---|
| Quickstart | Get started in 5 minutes | Notebook |
| Embedding extraction | Extract cell-level embeddings from pretrained model | Notebook + script |
| Embedding analysis | UMAP visualization, clustering, silhouette scores | Notebook |
| Age prediction | Finetune for biological age prediction | Notebook + script |
| Disease prediction | Disease risk prediction with Ridge regression | Notebook |
| Imputation | Recover missing CpG values | Notebook |
| Pretraining | Train MethylGPT from scratch on methylation data | Script |
| CpG selection | Feature selection analysis | Notebook |
All notebooks include a Colab setup cell and work both locally and on Google Colab.
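As in the age- and disease-prediction tutorials, a simple linear head on top of the embeddings is often a strong baseline. A sketch with Ridge regression, using synthetic embeddings with a planted age signal (all data here is simulated; real usage would substitute MethylGPT embeddings and chronological ages):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Simulated embeddings with an age signal planted in the first few dimensions
n_samples, dim = 300, 128
age = rng.uniform(20, 80, size=n_samples)
embeddings = rng.normal(size=(n_samples, dim))
embeddings[:, :4] += (age[:, None] - 50) / 10.0

# Cross-validated R^2 of a Ridge head predicting age from embeddings
r2 = cross_val_score(Ridge(alpha=1.0), embeddings, age,
                     cv=5, scoring="r2").mean()
print(f"cross-validated R^2: {r2:.2f}")
```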
## Documentation

| Document | Description |
|---|---|
| Inference Guide | Loading models, extracting embeddings, imputation, GPU/CPU tips |
| API Reference | Full public API with signatures and descriptions |
| Troubleshooting | Common errors and solutions (flash-attn, CUDA OOM, torchtext, Colab) |
## Hardware requirements

- Inference / embedding extraction: 1 GPU with >= 8 GB VRAM (or CPU, slower)
- Finetuning: 1 GPU with >= 16 GB VRAM recommended
- Pretraining: 1+ GPUs with >= 40 GB VRAM recommended (H100, A100)
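Before loading a model you can check what hardware is available. A small sketch using standard PyTorch calls:

```python
import torch

# Pick a device and report GPU memory before loading the model
if torch.cuda.is_available():
    device = torch.device("cuda")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Using {torch.cuda.get_device_name(0)} with {total_gb:.1f} GB VRAM")
else:
    device = torch.device("cpu")
    print("No GPU detected; falling back to CPU (inference will be slower)")
```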
## Contributing

We welcome contributions. Please submit a pull request for ideas or bug fixes, and open an issue for questions or problems.
## Acknowledgments

MethylGPT's backend architecture is based on scGPT (Wang Lab).

## Citation

```bibtex
@article{ying2024methylgpt,
  title={MethylGPT: a foundation model for the DNA methylome},
  author={Ying, Kejun and Song, Jinyeop and Cui, Haotian and Zhang, Yikun and Li, Siyuan and Chen, Xingyu and Liu, Hanna and Eames, Alec and McCartney, Daniel L and Marioni, Riccardo E and others},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```

If you use the GEO metadata (derived from ClockBase), please also cite:
```bibtex
@article{ying2023clockbase,
  title={ClockBase: a comprehensive platform for biological age profiling in human and mouse},
  author={Ying, K. and Tyshkovskiy, A. and Trapp, A. and Liu, H. and Moqri, M. and Kerepesi, C. and Gladyshev, V.N.},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.02.28.530532}
}
```