Inspiration

Protein language models are an exciting new area of development. We wanted to see how they could be applied beyond producing protein embeddings, specifically to generate meaningful proteins. Enzymes are almost unparalleled in their ubiquity and usefulness throughout nature, so generating enzymes was a natural fit for our project.

What it does

HackEnzyme takes a molecule written as a SMILES string and outputs the predicted amino acid sequence of an enzyme that breaks down or combines that substrate.

How we built it

Our first goal was to acquire a dataset. We scraped multiple REST APIs to collect labeled pairs: each enzyme sequence together with the SMILES strings of its corresponding chemical compounds. We then used models to map the SMILES strings into embeddings and attached each embedding to its corresponding enzyme sequence.
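The pairing and embedding steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily the pipeline used in the project: the record fields, `tokenize_smiles`, `toy_embed`, and `build_dataset` are hypothetical names, the sequences are made-up placeholders, and the hashed bag-of-tokens vector is a stand-in for a learned SMILES embedding model.

```python
import re
from zlib import crc32

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# single-letter atoms, bond/branch symbols, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[()=#+\-\\/.:@%]|\d"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def toy_embed(smiles, dim=16):
    """Stand-in embedding: hashed token counts (NOT a learned model)."""
    vec = [0.0] * dim
    for tok in tokenize_smiles(smiles):
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def build_dataset(records, dim=16):
    """Attach an embedding of each substrate to its enzyme sequence."""
    dataset = []
    for rec in records:
        dataset.append({
            "substrate_embedding": toy_embed(rec["smiles"], dim),
            "enzyme_sequence": rec["sequence"],
        })
    return dataset


# Hypothetical scraped records; the sequences are made-up placeholders.
records = [
    {"smiles": "CC(=O)O", "sequence": "MKTAYIAKQR"},
    {"smiles": "c1ccccc1", "sequence": "MSLLNVPAGK"},
]
dataset = build_dataset(records)
```

In a real pipeline, `toy_embed` would be replaced by whatever chemistry model produces the SMILES embeddings; the surrounding pairing logic would stay the same.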

We fine-tuned Rostlab's Prot_T5_Large on the bioinformatics dataset we had created and integrated the resulting model into our web application.
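Rostlab's ProtT5 checkpoints expect amino acids separated by spaces, with the rare residues U, Z, O, and B mapped to X, so enzyme sequences generally need this formatting before tokenization. A minimal sketch of that one step, assuming plain one-letter sequences as input (the function name is hypothetical, and the fine-tuning loop itself is omitted):

```python
import re


def format_for_prot_t5(sequence):
    """Upper-case, map rare residues U/Z/O/B to X, and space-separate."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    return " ".join(cleaned)


formatted = format_for_prot_t5("MKTUZ")
print(formatted)  # M K T X X
```

The formatted string can then be passed to the model's tokenizer as the target sequence for each training pair.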

We integrated ESMFold, which predicts 3D structures from protein sequences, to help us better understand the output of our model.
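One way to summarize a predicted structure is to average the per-residue confidence. The sketch below assumes the structure is available as PDB text with the confidence score (pLDDT) stored in the B-factor column, which is how ESMFold-style outputs are commonly written; the scale of that score depends on the tool and version, and `mean_ca_bfactor` is a hypothetical helper rather than part of any library.

```python
def mean_ca_bfactor(pdb_text):
    """Average the B-factor column over C-alpha atoms in PDB text."""
    values = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name is 13-16, B-factor is 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha ATOM records found")
    return sum(values) / len(values)
```

A low average suggests the generated sequence may not fold into a confident structure, which is a useful sanity check before inspecting it in a 3D viewer.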

Challenges we ran into

- Lack of datasets
- Mapping SMILES sequences to embedding space
- File type embedding complications
- Problem with implementing the model
- Bypassing encoders
- Training the model
- Lack of computational power
- API integration with the front-end
- The model
- 3D protein sequence modelling
- ChatGPT LLM integration

Accomplishments that we're proud of

We are proud of our ability to collaborate on a topic that most of the team knew little about. Because of our lack of confidence, we often discussed switching to a different project idea; however, our perseverance and restlessness are what pulled the project through.

What we learned

- Better understood transformer layers
- 3D protein modelling and full-stack integration
- A lot of biology

What's next for HackEnzyme

- Optimizing the base model
- Using more parameters
- Training the model on a bigger dataset
- Fine-tuning on multiple sequence alignments, chemical properties, and functional annotations to create a model that can be used to generate enzymes with specific properties outside of their input and output
