Inspiration
As machine learning finds new diagnostic uses (for example, in radiology), a natural next step is applying it to biochemical diagnostics. Some of us have personal experience with intrinsically disordered regions (IDRs) and know that pathogenic mutations in IDRs are important but poorly understood.
What it does
Predicts a pathogenicity score for a given mutant protein sequence.
How we built it
Our architecture combines ESM2 (a protein language model) with a molecular features model. We experimented with fine-tuning versus freezing ESM2, and with including versus omitting the molecular features calculation. We used the transformers library (on PyTorch) to train our models and Weights & Biases to visualize training progress and performance.
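As a rough illustration of what a molecular features calculation could look like, here is a minimal sketch in plain Python. The specific features (change in mean Kyte-Doolittle hydropathy and change in a crude net charge) and the function names `mean_hydropathy`, `net_charge`, and `mutation_features` are assumptions made for this example, not necessarily the features our model used.

```python
# Hypothetical molecular features for a wild-type / mutant sequence pair.
# The choice of features here is an assumption for illustration only.

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude side-chain charges near neutral pH (ignores His and the termini).
SIDE_CHAIN_CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def mean_hydropathy(seq: str) -> float:
    """Average Kyte-Doolittle hydropathy over all residues."""
    return sum(KD_HYDROPATHY[aa] for aa in seq) / len(seq)


def net_charge(seq: str) -> float:
    """Sum of the crude side-chain charges defined above."""
    return sum(SIDE_CHAIN_CHARGE.get(aa, 0.0) for aa in seq)


def mutation_features(wild_type: str, mutant: str) -> list[float]:
    """Return [delta mean hydropathy, delta net charge] for a substitution."""
    if len(wild_type) != len(mutant):
        raise ValueError("this sketch only handles substitutions of equal length")
    return [
        mean_hydropathy(mutant) - mean_hydropathy(wild_type),
        net_charge(mutant) - net_charge(wild_type),
    ]
```

A feature vector like this could be concatenated with a pooled ESM2 embedding before a final prediction layer, which is one common way to combine a language model with hand-computed features.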
Challenges we ran into
Obtaining pre-processed datasets, figuring out the AWS ecosystem around SageMaker notebooks, and delegating work.
Accomplishments that we're proud of
We finally got AWS to run a variant of our training script. In addition, we developed a flexible script for training four different related model architectures, and managed to train four models for a non-trivial number of epochs. We believe that with more time for hyperparameter tuning, we could get better performance.
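The four related architectures come from two on/off choices: fine-tuning versus freezing ESM2, and including versus omitting the molecular features. A minimal sketch of how one script might enumerate them follows; the `ModelVariant` class, its field names, and the run-name format are hypothetical and not taken from our actual training script.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelVariant:
    """One cell of the 2x2 experiment grid (hypothetical naming)."""
    finetune_esm: bool            # False means ESM2 weights stay frozen
    use_molecular_features: bool  # False means ESM2 embeddings only

    @property
    def name(self) -> str:
        # Build a readable run name, e.g. for an experiment tracker.
        base = "esm-finetuned" if self.finetune_esm else "esm-frozen"
        return base + ("+molfeat" if self.use_molecular_features else "")


def all_variants() -> list[ModelVariant]:
    """Enumerate every combination of the two boolean choices."""
    return [ModelVariant(f, m) for f, m in product((False, True), repeat=2)]


if __name__ == "__main__":
    for variant in all_variants():
        print(variant.name)
```

Driving the training loop from a small configuration object like this keeps the four runs comparable, since everything except the two flags stays identical.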
What we learned
While AWS was a large hurdle for a few of us, we are all walking away with a greater appreciation for the technology. Several members of our team gained their first exposure to working on problems within bioinformatics and AI - in particular, learning about IDRs.
What's next for patho-detection-plm
We plan to train the models for longer to see whether the loss truly continues to decrease, and to test whether adjusting the hyperparameters improves their performance.
Once performance is reasonable, we will compare the models against other baseline models. We may also train an IDP-only (intrinsically disordered protein) language model that is even more specialized for this task.
Built With
- python
- transformers
- wandb