Molecode

Protein sequencing at your fingertips
90% accuracy!

💡 Inspiration

Drug discovery is at the forefront of biomedical research.

Discovering new drugs allows us to combat various diseases, improve patient outcomes, and extend human lifespan. For instance, certain microorganisms produce specialized enzymes that break down cholesterol through targeted reactions, which could benefit patients with high cholesterol. Identifying these cholesterol-degrading microorganisms in the environment would greatly improve the management of cholesterol-related conditions, abating the increased risk of diabetes, kidney failure, and liver dysfunction.

However, traditional methods to identify enzymes targeting specific substrates are costly, involving equipment such as NMR spectrometers, chromatography, X-ray crystallography, and mass spectrometry. NMR spectrometers range from $35,000 to $150,000 to purchase, with analysis costing $17 to $50 per hour. Mass spectrometry rates range from $60 to $560.

This is the inspiration behind our solution, Molecode, which uses machine learning to substantially cut these expenses.

✨ What it does

Molecode is an API to an ML model we developed. It lets users rapidly analyse protein sequences remotely and predict protein function, with 90% accuracy on our dataset (recognizing cholesterol-degrading proteins).

Process

Upload: Upload a file in FASTA format.
Decode: Molecode’s AI software analyzes the provided enzyme sequences.
Discover: Receive results at 90% accuracy!
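
The Upload step expects FASTA input, where each record is a `>` header line followed by one or more sequence lines. As a rough illustration of what reading such a file involves, here is a minimal pure-Python sketch. The function name `parse_fasta` and the example headers are hypothetical and not taken from the Molecode codebase; in practice, BioPython's `SeqIO.parse(handle, "fasta")` covers the same ground.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line.upper()
    return records


# Hypothetical two-record FASTA input (made-up sequences).
example = """>hypothetical_enzyme_1
MKTAYIAKQR
QISFVKSHFS
>hypothetical_enzyme_2
GSHMLEDPVA
""".splitlines()

records = parse_fasta(example)
print(len(records))  # 2
```

Each parsed sequence could then be passed on to the Decode step for classification.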

🛠 How we built it

Backend: Flask, Python
Frontend: ReactJS, HTML, CSS
Machine Learning: TensorFlow, Keras
Data Engineering: NumPy, BioPython

📊 Data understanding

We used a dataset of 1977 protein sequences: 727 enzymes in the cholesterol oxidase (oxidoreductase) class and 1250 related enzymes that do not exhibit cholesterol-degrading activity. From this data, we trained a classification model that assigns each uploaded protein to one of two classes: cholesterol-degrading or non-degrading.

We labelled each protein sequence 1 or 0 according to its ability to break down cholesterol. Sequences contain at most 1670 amino acids; any sequence shorter than 1670 residues was padded to fill the remaining positions.
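
To make the padding step concrete, here is a minimal sketch of one way to turn a sequence into a fixed-length numeric vector. The integer encoding, the use of 0 as the padding value, and the names `AA_TO_INDEX` and `encode_and_pad` are assumptions for illustration; the write-up does not say which encoding or padding value was actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Assumption: index 0 is reserved for padding (and unknown residues).
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 1670  # longest sequence length stated in the text


def encode_and_pad(sequence: str, max_len: int = MAX_LEN) -> list:
    """Map residues to integers and right-pad with zeros to max_len."""
    if len(sequence) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    encoded = [AA_TO_INDEX.get(aa, 0) for aa in sequence.upper()]
    return encoded + [0] * (max_len - len(encoded))


vector = encode_and_pad("ACD")
print(len(vector), vector[:5])  # 1670 [1, 2, 3, 0, 0]
```

Padding every input to the same length lets the network take a fixed-size input regardless of the original protein length.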

80% of our data points were used for training the model, 10% for validation, and the remaining 10% for testing. Our model consists of an input layer, a dense layer with 16 nodes, another dense layer with 8 nodes, and a flatten layer, followed by a final dense layer that classifies each sequence as 1 or 0.
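
The 80/10/10 split can be sketched as follows. This is an illustrative pure-Python version, not the project's actual code: the helper name `split_dataset`, the fixed seed, and the choice to give rounding leftovers to the test set are all assumptions.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle items and split them 80/10/10 into train, val and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 8 // 10  # 80% for training, rounded down
    n_val = n // 10        # 10% for validation, rounded down
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Using indices 0..1976 as stand-ins for the 1977 labelled sequences.
train, val, test = split_dataset(range(1977))
print(len(train), len(val), len(test))  # 1581 197 199
```

Under these rounding assumptions, 1977 sequences yield 1581 training, 197 validation, and 199 test examples.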

We used the sigmoid function as our activation function at the output layer and used binary cross entropy as our loss function. The model was trained for 50 epochs, with a batch size of 10.
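
For readers unfamiliar with these two functions, the sketch below computes them by hand. It is a plain-Python illustration of the standard definitions, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a raw score into (0, 1); written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


labels = [1, 0]  # 1 = degrades cholesterol, 0 = does not
confident_right = binary_cross_entropy(labels, [sigmoid(4.0), sigmoid(-4.0)])
confident_wrong = binary_cross_entropy(labels, [sigmoid(-4.0), sigmoid(4.0)])
print(confident_right < confident_wrong)  # True
```

The loss is small when the predicted probability agrees with the label and grows sharply when the model is confidently wrong, which is what drives training toward correct 1/0 predictions.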

🤖 Machine learning modelling

We developed a neural network classifier in TensorFlow that reaches 90% accuracy in predicting whether a protein exhibits cholesterol-degrading activity. We then brought the model to production as a web application using Flask and React.js.

💪🏼 Challenges we ran into

Connecting the final front end to the back end and deploying the website on Vercel was one challenge.

The selection of data was another challenge. We sourced the enzyme sequences for our training data from the online protein database RCSB. While our team got the machine learning model to a stage where predictions reached 90% accuracy, there are some scientific edge cases that the training data does not address. Given more time for research, a more robust dataset would be ideal. To elaborate, the training data consisted of enzyme sequences of cholesterol oxidases that are phylogenetically related. That is, they evolved from common precursor proteins (most likely lipid-degrading oxidases/hydroxylases). Given how diverse life is, there may be many other cholesterol-degrading proteins that do not share evolutionarily related structures, and our model would not pick up on them. With more time, we would be able to continue increasing the accuracy and predict more enzymes.

📈 Scalability/Business Practicality

Applications in Biopharma and Biomedical Engineering

Our model allows biotechnology researchers to identify cholesterol-degrading enzymes, based on their proteomes, in organisms shown to degrade cholesterol in vitro. This data can then be used to identify target genes for genetic engineering or synthetic microorganisms for clinical use (e.g., biotechnology-derived probiotics).

Economics

With Molecode, cholesterol can be used as a feedstock in bioreactors for biomanufacturing of plastics, cellular agriculture, and pharmaceutical biomanufacturing. In addition, our contribution to the biotech research scene in Canada further invigorates the current ‘bioeconomy’, which is growing each year as more startups and venture capitalists invest their time and money into biomanufacturing and fermentation-based solutions.

🏆 Accomplishments that we're proud of

With our current training and data sets, we reached 90% accuracy in predicting whether a protein is likely to degrade cholesterol!

The web application, with a React.js front end and a Flask back end, was also built entirely from scratch.

🌙 Originality

As of 2023, to our knowledge, no comprehensive commercial solutions exist for identifying cholesterol-degrading proteins or protein sequences. While some research papers and articles explore this concept theoretically, we believe Molecode is the first practical model, with accuracy already reaching 90%.

💠 Implications

Health Implications: Cholesterol is a vital component of cell membranes and a precursor to important molecules such as hormones and bile acids. However, high levels of cholesterol in the blood are associated with atherosclerosis and an increased risk of cardiovascular diseases. Bacteria capable of degrading cholesterol could potentially play a role in reducing cholesterol levels in the body, thereby contributing to better heart health.
Bioremediation: Cholesterol is a natural compound found in many environments, including soil and water. Some industrial processes can lead to the accumulation of cholesterol in the environment, which can be harmful. Cholesterol-degrading bacteria can be harnessed for bioremediation purposes, helping to break down cholesterol and reduce its negative impact on ecosystems.
Biosynthesis: Cholesterol-degrading bacteria can potentially be used to produce valuable compounds derived from cholesterol. These compounds could have applications in pharmaceuticals, cosmetics, and other industries. Studying these bacteria could also inform biotechnological processes, expanding our understanding of microbial ecology and metabolism.
Probiotics and Health Supplements: Identifying and characterizing cholesterol-degrading bacteria could open up possibilities for the development of probiotics or health supplements. These products could potentially help individuals manage their cholesterol levels naturally.
Drug Development: Cholesterol metabolism is interconnected with various cellular processes, and understanding how bacteria degrade cholesterol could provide new targets for drug development. Enzymes involved in cholesterol degradation could be targeted to develop drugs for managing cholesterol-related diseases.
Microbiome Research: The human gut microbiome plays a significant role in overall health. Identifying bacteria capable of cholesterol degradation within the gut microbiome could contribute to understanding the complex interactions between gut bacteria and host health.

💫 The Future of Molecode

We would like to extend our software and train it to recognize and predict protein sequences with functions other than cholesterol degradation. By accelerating the drug discovery process, Molecode increases the chances that vital medicines are found and produced.