Introduction

Source code for Seq2Symm: Rapid and accurate prediction of protein homo-oligomer symmetry, published here:

Meghana Kshirsagar, Artur Meller, Ian R. Humphreys, Samuel Sledzieski, Yixi Xu, Rahul Dodhia, Eric Horvitz, Bonnie Berger, Gregory R. Bowman, Juan Lavista Ferres, David Baker, Minkyung Baek. Nature Communications (2025).

https://www.nature.com/articles/s41467-025-57148-3

Seq2Symm takes a protein sequence as input and predicts its homo-oligomer symmetry, choosing among 17 symmetry types such as 'C1', 'C2', 'C5', 'D2', 'D3', 'D5', 'H', 'O', 'T' and 'I'.
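As a reading aid for the labels above, the sketch below maps a point-group label to the number of identical chains it implies, using the standard conventions (cyclic Cn has n subunits, dihedral Dn has 2n, tetrahedral T has 12, octahedral O has 24, icosahedral I has 60). The helper name subunit_count is hypothetical and is not part of the Seq2Symm code; helical ('H') and any other label return None because they do not fix a subunit count.

```python
import re

# Subunit counts for the polyhedral point groups (standard group orders).
POLYHEDRAL_SUBUNITS = {"T": 12, "O": 24, "I": 60}


def subunit_count(label):
    """Return the number of identical chains implied by a symmetry label.

    Hypothetical helper for illustration only. Returns None for labels
    that do not fix a subunit count, such as 'H' (helical) or 'Other'.
    """
    if label in POLYHEDRAL_SUBUNITS:
        return POLYHEDRAL_SUBUNITS[label]
    match = re.fullmatch(r"([CD])(\d+)", label)
    if match is None:
        return None
    family, order = match.group(1), int(match.group(2))
    # Cyclic Cn: n subunits. Dihedral Dn: 2n subunits.
    return order if family == "C" else 2 * order


for label in ["C1", "C2", "C5", "D2", "D3", "D5", "H", "O", "T", "I"]:
    print(label, subunit_count(label))
```

Under these conventions 'C1' corresponds to a monomer and 'D3' to a hexamer.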

Getting Started

There are two options: run inference in Google Colab, or install locally to run inference, training, or both.

Google Colab notebook for small-scale inference

This self-contained notebook runs inference on the protein sequences you provide:

https://colab.research.google.com/drive/1ptQTyC22ExxJ3BnSK6dnPaCeg7J8i3le#scrollTo=CJm12VgOiDfj

Conda environment for local installation and large-scale inference and training

Dependencies are listed in the YAML file esm2_finetune.yaml. Create the environment with:

conda env create --name esm2 --file=esm2_finetune.yaml

Downloads

1. Latest model, trained in March 2025 (see item 3 for the model used for all results in the paper): https://drive.google.com/file/d/1Hk4idiCGNnn5B0KUTCsnMtv1vtvWb6lh/view?usp=sharing
2. Predictions from the model on various datasets and on proteomes: http://files.ipd.uw.edu/pub/seq2symm/predictions.zip
3. Seq2Symm model from the Nature Communications 2025 paper: http://files.ipd.uw.edu/pub/seq2symm/ESM2_model.ckpt
4. Data download links: https://github.com/microsoft/seq2symm/tree/main/datasets

The code, datasets, model, and predictions are also available on Zenodo: http://doi.org/10.5281/zenodo.14659968

Training the model

python src/finetune.py \
    --meta_data_file ../datasets/homomer_pdbids_hash_clusterid_labels_sampled.csv \
    --data_dir ../datasets/ \
    --model_dir models/ \
    --output_model seq2symm \
    --output_dir outputs/seq2symm \
    --bs 16 \
    --data_splits_file ../datasets/train_val_test_splits.pkl \
    --limit 65536 \
    --granularity 3 \
    --n_classes 20 \
    --n_epoch 100 \
    --weighted_sampler 1

Predicting using the model

Jupyter notebook

The Jupyter notebook src/load_chkpt_and_predict.ipynb shows how to load a checkpoint and predict, with examples for two different input file formats.

Predicting via command line

This script is available courtesy of Moritz Ertlet.

The src/predict_oligomerization.py script predicts the oligomerization states of protein sequences provided in a FASTA file. For each sequence it outputs the probability of every symmetry state whose probability is >= 1%.

python predict_oligomerization.py -input_file <input_fasta> -chkpt_file <model_checkpoint> [-output_file <results.csv>] [-batch_size <n>]

-input_file: Path to the input FASTA file containing the sequences for prediction (required).
-chkpt_file: Path to the model checkpoint file (required).
-output_file: Path to save the prediction results as a CSV file (default: ./results.csv).
-batch_size: Batch size for processing sequences (default: 1).
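For batch jobs it can be convenient to assemble this command from Python. The sketch below is a minimal illustration that uses only the four options documented above. The helper names build_predict_command and run_prediction, and the example file names proteins.fasta and ESM2_model.ckpt, are assumptions made for illustration and are not part of the repository.

```python
import subprocess
import sys


def build_predict_command(input_fasta, chkpt_file, output_file="./results.csv",
                          batch_size=1, script="predict_oligomerization.py"):
    """Build the argument list for the prediction script (hypothetical helper).

    The flags mirror the usage line above; the defaults repeat the
    documented defaults for -output_file and -batch_size.
    """
    return [
        sys.executable, script,
        "-input_file", str(input_fasta),
        "-chkpt_file", str(chkpt_file),
        "-output_file", str(output_file),
        "-batch_size", str(batch_size),
    ]


def run_prediction(**kwargs):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_predict_command(**kwargs), check=True)


# Only builds and prints the command; call run_prediction(...) to execute it.
cmd = build_predict_command("proteins.fasta", "ESM2_model.ckpt", batch_size=4)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.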

Example output:

fasta_id, predicted_oligomerization_state
ID1, "{'C1': 0.9, 'C2': 0.1, 'Other': 0.001}"
ID2, "{'C4': 0.85, 'C6': 0.15, 'D3': 0.01}"

You can load the results with pandas like this:

import pandas as pd
import ast
df = pd.read_csv("results.csv")
df['predicted_oligomerization_state'] = df['predicted_oligomerization_state'].apply(ast.literal_eval)
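Once each row holds a dictionary of probabilities, a common next step is to keep only the most probable state per sequence. The sketch below shows one way to do that with the standard library alone. It parses the text of the example output shown earlier rather than a real results file, and the helper name top_states is hypothetical. Because that example has a space after each comma, the reader is created with skipinitialspace=True; a real results.csv may not need it.

```python
import ast
import csv
import io

# Text copied from the example output above; a real run would open results.csv.
SAMPLE = """fasta_id, predicted_oligomerization_state
ID1, "{'C1': 0.9, 'C2': 0.1, 'Other': 0.001}"
ID2, "{'C4': 0.85, 'C6': 0.15, 'D3': 0.01}"
"""


def top_states(handle):
    """Map each fasta_id to its (most probable state, probability) pair."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    best_by_id = {}
    for row in reader:
        # The second column is a Python dict literal stored as a string.
        probs = ast.literal_eval(row["predicted_oligomerization_state"])
        best = max(probs, key=probs.get)
        best_by_id[row["fasta_id"]] = (best, probs[best])
    return best_by_id


top = top_states(io.StringIO(SAMPLE))
for fasta_id, (state, prob) in top.items():
    print(fasta_id, state, prob)
```

To apply this to a real file, pass an open file handle instead of the io.StringIO wrapper, for example open("results.csv", newline="").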
