Pathogenicity detection using protein language models
This project presents a machine learning model designed to analyze human protein sequences and classify single amino-acid polymorphisms as either benign or pathogenic. By training on curated datasets of known benign and pathogenic variants, the model provides a tool for researchers in R&D, medical geneticists, and clinicians engaged in embryo screening or infant genetic testing. Its key functionality lies in detecting mutations within amino acid sequences and predicting their potential pathogenic impact. The model leverages the transformers library (built on PyTorch) for sequence analysis and employs Weights & Biases for performance tracking and visualization. This approach offers a scalable, data-driven method to assist in genetic screening and early diagnosis, with potential applications in precision medicine and healthcare research.
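As a minimal illustration of the kind of input the model reasons about, the sketch below locates a single amino-acid substitution between a reference and a variant sequence and prints it in the usual `A7V` style. It is not taken from this project's code: the function names `find_single_substitution` and `format_substitution` and the toy sequences are assumptions made for the example, and the actual preprocessing in `Demo.py` may differ.

```python
from typing import Optional, Tuple

# Hypothetical helper, not part of this repository.
def find_single_substitution(
    reference: str, variant: str
) -> Optional[Tuple[int, str, str]]:
    """Return (1-based position, reference residue, variant residue)
    when the sequences differ at exactly one position, otherwise None."""
    # Insertions and deletions change the length; they are out of scope here.
    if len(reference) != len(variant):
        return None
    diffs = [
        (i + 1, ref_aa, var_aa)
        for i, (ref_aa, var_aa) in enumerate(zip(reference, variant))
        if ref_aa != var_aa
    ]
    return diffs[0] if len(diffs) == 1 else None


# Hypothetical helper, not part of this repository.
def format_substitution(sub: Tuple[int, str, str]) -> str:
    """Format a substitution as <ref><position><variant>, e.g. 'A7V'."""
    position, ref_aa, var_aa = sub
    return f"{ref_aa}{position}{var_aa}"


if __name__ == "__main__":
    # Toy sequences chosen only for illustration.
    reference = "MKTAYIAK"
    variant = "MKTAYIVK"
    sub = find_single_substitution(reference, variant)
    if sub is not None:
        print(format_substitution(sub))
```

Returning `None` for sequences of unequal length or with more than one difference keeps the helper limited to single substitutions, which is the variant type the model is described as classifying.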
To install with conda, run:
conda create -n demo python
conda activate demo
pip install -r Demo/requirements.txt
# To see a pre-selected set of protein sequence predictions, run:
python ./Demo/Demo.py
# To predict pathogenicity for your own sequences, enter them in the console when prompted, separated by commas:
python ./Demo/Demo.py --interact
# A trained model can be selected as follows:
python ./Demo/Demo.py --model_type [esm-fz, esm-ft, esm-fz+mf, or esm-ft+mf]
# Inference can be run on available GPUs with:
python ./Demo/Demo.py --device gpu
This repo is organized into:
- Data, datasets used to train the model.
- Demo, a notebook and CLI to run existing models.
- Models, the actual models being run.
- Training, the script defining model classes and training procedure.
Contributions are welcome! If you'd like to contribute, please open an issue or submit a pull request. See the contribution guidelines for more information.
If you have any issues or need help, please open an issue or contact the project maintainers.
This project is licensed under the MIT License.