Skip to content

aloehdew/slipslop

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

Inspiration

1-Dimensional Convolutional Neural Networks (1dCNNs) are incredibly powerful tools for the study of nucleotide sequences. In searching for applications, we could find none that had been developed for the specified purpose of detecting Programmed Ribosomal Frameshifting (PRF), an important translation phenomenon in the understanding of virology.

What it does

  • SlipSlop takes a 250bp nucleotide sequence and returns the probability that the sequence contains a PRF site
  • Particularly accurate for -1 PRF frameshifts that contain the slippery site heptamer motif.

How we built it

  • After collecting ~1250 known slip-sites from NCBI databases, augmenting data to construct larger training sets, and pulling negative examples, we ended up with ~75,000 data-points that were passed into a convolutional deep neural network for training.
  • Done in Jupyter Notebooks using PyTorch for network construction and Biopython for data collection. chart1

Challenges we ran into

  • When augmenting data, we had to make sure not to introduce any augmentation artifacts that would bias results.
  • Since the majority of existing data comes from just a few viral families, it is hard to properly cluster data for splitting and we relied instead on random splitting.
  • This data is also heavily biased toward -1 frameshift slippage sites, leading that to be the main strength of the model's recognition.

Accomplishments that we're proud of

precision last
  • When testing against the same known viral -1 PRF site sequences, our model outperformed the only other tool for detecting these sites, PRFect.

  • With a Recall of 40% compared to their 8% and precision of 90% compared to their 97%, we argue that our tool has better performance since False Positives are easier to disprove through other tools while False Negatives may lead to a critical omission of translational behavior. ============================================================

    HEAD-TO-HEAD: RECODE −1 frameshifts

============================================================

PRFect recall: 56/244 (23.0%)

SlipSlop recall: 40/52 (76.9%)

============================================================

What we learned

  • First and foremost, we learned a tremendous amount about model design and data curation.
  • Further, we were able to gain a much deeper understanding of why existing tools have the limitation that they do; particularly, that the availability of datasets for niche biological phenomenon are usually not well maintained and often difficult to find or interpret.

What's next for SlipSlop

  • The model would benefit heavily from longer training and more data, as any neural network would.
  • In order to create a viable product we must institute proper data clustering, something that would only be possible if a wider variety of positive data existed.
  • If such data existed, the overall ambition for this tool would be to expand its functionality to detect all forms of PRF, even though they can use diverse and vastly different mechanisms. This would likely involve scaling the depth, width, and resolution of our model.

About

A convolutional neural network trained on NCBI data to detect programmed ribosomal frameshifting (PRF) sites. The model specializes in viral -1 slips, but has been trained on a variety of slippage sites.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors