Inspiration

As machine learning finds new diagnostic uses (for example, in radiology), a natural next step is applying it to biochemical diagnostics. Some of us have personal experience with intrinsically disordered regions (IDRs) and know that pathogenic mutations in IDRs are important but poorly understood.

What it does

Predicts a pathogenicity score for a given mutant protein sequence.

How we built it

Our architecture combines ESM2 (a protein language model) with a molecular features model. We experimented with fine-tuning vs. freezing the ESM2 model and with including vs. omitting the molecular features calculation. We used the transformers library (built on PyTorch) to train our models and Weights & Biases to visualize training progress and performance.
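
As a rough illustration of the two experimental axes above (fine-tuned vs. frozen ESM2, with vs. without molecular features), here is a minimal standard-library sketch of how the resulting configurations could be enumerated. The names `Variant`, `molecular_features`, `feature_delta`, and `head_input_dim` are hypothetical, and the two features shown (mean Kyte-Doolittle hydropathy and net charge per residue) are assumptions chosen for illustration, not necessarily the features our model used.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Kyte-Doolittle hydropathy scale (standard published values).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge near neutral pH (assumption: histidine is neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


@dataclass(frozen=True)
class Variant:
    """One experimental configuration (hypothetical naming)."""
    finetune_esm: bool      # False means the ESM2 weights stay frozen
    use_mol_features: bool  # False means only the ESM2 embedding is used

    @property
    def name(self) -> str:
        esm = "finetuned" if self.finetune_esm else "frozen"
        feats = "with-features" if self.use_mol_features else "no-features"
        return f"esm2-{esm}-{feats}"


def all_variants() -> List[Variant]:
    """Enumerate every combination of the two on/off choices (2 x 2 = 4)."""
    return [Variant(f, m) for f, m in product([False, True], repeat=2)]


def molecular_features(seq: str) -> List[float]:
    """Illustrative features: mean hydropathy and net charge per residue."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    hydropathy = sum(KD_HYDROPATHY[aa] for aa in seq) / len(seq)
    charge = sum(CHARGE.get(aa, 0) for aa in seq) / len(seq)
    return [hydropathy, charge]


def feature_delta(wild_type: str, mutant: str) -> List[float]:
    """Change in each feature caused by the mutation (mutant minus wild type)."""
    wt = molecular_features(wild_type)
    mt = molecular_features(mutant)
    return [m - w for w, m in zip(wt, mt)]


def head_input_dim(variant: Variant, esm_hidden_size: int, n_features: int = 2) -> int:
    """Input width of the prediction head when features are concatenated."""
    extra = n_features if variant.use_mol_features else 0
    return esm_hidden_size + extra


if __name__ == "__main__":
    for v in all_variants():
        print(v.name)
```

Keeping both choices as plain boolean flags is what would let a single training script cover all the configurations; the actual models would replace the feature functions here with whatever molecular descriptors are appropriate.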

Challenges we ran into

Obtaining pre-processed datasets, figuring out the AWS ecosystem surrounding SageMaker notebooks, and delegating work.

Accomplishments that we're proud of

We finally got AWS to run a variant of our training script. In addition, we developed a flexible script for training four related model architectures and managed to train all four models for a non-trivial number of epochs. We believe that with more time for hyperparameter tuning, we could get better performance.

What we learned

While AWS was a large hurdle for a few of us, we are all walking away with a greater appreciation for the technology. Several members of our team gained their first exposure to working on problems in bioinformatics and AI, in particular learning about IDRs.

What's next for patho-detection-plm

We plan to train the models for longer to see whether the loss truly continues to decrease, and to see whether adjusting the hyperparameters improves model performance.
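
One simple way to check whether the loss is still decreasing is to compare the average loss over the most recent epochs with the average over the epochs just before them. The sketch below is a hypothetical helper (`still_improving`, with assumed default values for `window` and `min_rel_drop`), not part of our training script.

```python
from statistics import mean
from typing import Sequence


def still_improving(losses: Sequence[float], window: int = 5,
                    min_rel_drop: float = 0.01) -> bool:
    """Return True if the mean loss over the last `window` epochs is lower
    than the mean over the preceding `window` epochs by more than
    `min_rel_drop` (as a fraction of the earlier mean)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    previous = mean(losses[-2 * window:-window])
    recent = mean(losses[-window:])
    return (previous - recent) > min_rel_drop * abs(previous)
```

Averaging over a window rather than comparing single epochs makes the check less sensitive to epoch-to-epoch noise in the logged loss.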

Once the models reach reasonable performance, we will compare them against baseline models. We may also train a protein language model only on intrinsically disordered proteins (IDPs), so that it is even more specialized for this task.
