MiniFold

Inspiration

This project was inspired by advances in machine learning that have opened it up to fields such as biology and bioinformatics, and by the question of how existing systems like AlphaFold 3 could be optimized to widen access to these tools.

What it does

MiniFold is a lightweight AI system for protein structure analysis, designed to make computational biology tools accessible without large-scale computing resources. Compared with systems such as AlphaFold 3, it targets similar functionality with far fewer model parameters, efficient performance on commodity hardware, and affordable access. It integrates the industry-standard PDB dataset, the pre-trained ESM-2 protein language model, 3D molecular visualization, and a combined transformer and diffusion architecture for protein structure prediction. A Streamlit interface supports interaction, and the entire system stays under 50M parameters to keep it computationally efficient.

This project seeks to democratize protein research by combining a lightweight implementation with extensive configuration options, while requiring minimal technical expertise from users.

How we built it

The project is organized into distinct modules (embedding, transformer, diffusion, UI) using industry-standard tools (PDB, ESM, PyTorch).

The protein data embedding stage is optimized through parallel computation and batch downloading. While a batch is being processed and embedded, the next batch is downloaded in parallel using Python's multiprocessing module. The download operation itself is also parallelized through ThreadPoolExecutor, which further saves time in large batches by downloading multiple files at once. The output of embedding is one .pt file per batch, with each file containing one tensor per protein in the batch. Embedding is a standalone script that only needs to be run once prior to training since all output data is saved on disk.
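The overlap between downloading and embedding can be sketched roughly as follows. This is a minimal illustration, not the project's script: `fetch_structure` and `embed_batch` are hypothetical stand-ins that avoid network access and the ESM-2 model, and a single background thread replaces the multiprocessing-based prefetch described above so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_structure(pdb_id: str) -> str:
    """Hypothetical stand-in for downloading one PDB file (no network access)."""
    return f"contents of {pdb_id}"


def download_batch(pdb_ids: list[str], max_workers: int = 8) -> list[str]:
    """Download every file in a batch concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_structure, pdb_ids))


def embed_batch(files: list[str]) -> list[int]:
    """Hypothetical stand-in for the embedding step (here: just the text length)."""
    return [len(text) for text in files]


def run_pipeline(batches: list[list[str]]) -> list[list[int]]:
    """Embed batch i while batch i + 1 downloads in the background."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(download_batch, batches[0])
        for next_ids in batches[1:] + [None]:
            files = pending.result()  # wait for the current batch to arrive
            if next_ids is not None:
                # Start fetching the next batch before embedding this one.
                pending = prefetcher.submit(download_batch, next_ids)
            results.append(embed_batch(files))  # overlaps with the download
    return results


embeddings = run_pipeline([["1ABC", "2XYZ"], ["3DEF"]])
```

In a real run, each entry of `results` would presumably be written to disk as one file per batch rather than kept in memory.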

The output of the embedding stage is fed into the transformer network, which analyzes interactions between tokens in the protein’s representation. The transformer is built with PyTorch; its forward pass takes the embeddings and returns a classification token for each protein, which is then used by the diffusion model.
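One common way to obtain a classification token is to prepend a learnable vector to each sequence and read it back after encoding. The sketch below assumes that approach; the class name `ClsTransformerSketch`, the dimensions, and the layer counts are illustrative choices and do not describe the project's actual transformer.

```python
import torch
from torch import nn


class ClsTransformerSketch(nn.Module):
    """Illustrative encoder that summarizes residue embeddings in one token."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        # Learnable classification token, prepended to every sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=4 * embed_dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, sequence_length, embed_dim)
        cls = self.cls_token.expand(embeddings.size(0), -1, -1)
        tokens = torch.cat([cls, embeddings], dim=1)
        encoded = self.encoder(tokens)
        # Position 0 holds the classification token after self-attention.
        return encoded[:, 0]  # (batch, embed_dim)


model = ClsTransformerSketch().eval()
with torch.no_grad():
    cls_tokens = model(torch.randn(3, 10, 64))  # 3 toy proteins, 10 residues each
```

Because self-attention lets the prepended token attend to every residue, its final state can serve as a fixed-size summary regardless of sequence length.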

The diffusion model uses a graph neural network to accurately capture relationships between atoms in protein structures. Originally, we had planned to use a Multi-Layer Perceptron (MLP) for this task, but switched to a GNN-based model for improved structural comprehension. This model also incorporates bonding information and amino acid element data for more accurate structure prediction. The input of the diffusion model is the classification token from the transformer and protein complex embeddings, and its output is the predicted protein complex structure (e.g., coordinates, structure file, or visualization).
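As a rough illustration of how a graph-based denoiser might be conditioned on the classification token, the sketch below runs one round of message passing over a bond adjacency matrix and predicts the noise added to atom coordinates. Everything here is an assumption for illustration: the noise-prediction training objective, the class name `GraphDenoiserSketch`, the feature sizes, and the toy chain of bonds are not taken from the project.

```python
import torch
from torch import nn


class GraphDenoiserSketch(nn.Module):
    """Illustrative noise predictor: one round of message passing over bonds."""

    def __init__(self, atom_dim: int = 8, cond_dim: int = 64, hidden: int = 32):
        super().__init__()
        # Per-atom input: noisy xyz + atom features + conditioning vector + timestep.
        self.encode = nn.Linear(3 + atom_dim + cond_dim + 1, hidden)
        self.message = nn.Linear(hidden, hidden)
        self.decode = nn.Linear(hidden, 3)

    def forward(self, noisy_xyz, atom_feats, adjacency, cond, t):
        n_atoms = noisy_xyz.size(0)
        # Broadcast the protein-level condition and timestep to every atom.
        context = torch.cat([cond.expand(n_atoms, -1), t.expand(n_atoms, 1)], dim=1)
        h = torch.relu(self.encode(torch.cat([noisy_xyz, atom_feats, context], dim=1)))
        # Average the messages arriving from bonded neighbours.
        degree = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        h = h + torch.relu(adjacency @ self.message(h) / degree)
        return self.decode(h)  # predicted noise, shape (n_atoms, 3)


torch.manual_seed(0)
n_atoms = 5
xyz = torch.randn(n_atoms, 3)             # clean coordinates (toy data)
atom_feats = torch.randn(n_atoms, 8)      # e.g. element or residue features
adjacency = torch.zeros(n_atoms, n_atoms)
for i in range(n_atoms - 1):              # a simple chain of bonds
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
cond = torch.randn(64)                    # stands in for the classification token

alpha_bar = torch.tensor(0.7)             # cumulative noise-schedule term (assumed)
noise = torch.randn_like(xyz)
noisy_xyz = alpha_bar.sqrt() * xyz + (1 - alpha_bar).sqrt() * noise

model = GraphDenoiserSketch()
predicted_noise = model(noisy_xyz, atom_feats, adjacency, cond, t=torch.tensor([0.5]))
loss = nn.functional.mse_loss(predicted_noise, noise)
```

Unlike an MLP over flattened coordinates, the adjacency-based update only mixes information between bonded atoms, which is one plausible reason a GNN captures structure better.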

The final stage of the project is the UI, which is built with Streamlit for intuitive visualization of the predicted protein structure. It renders the model's predicted structure in an interactive 3D viewer.

Challenges we ran into

One challenge was deciding whether to use ESM2, ESM3, or METL for the embedding stage. After further research, we found that ESM3 required an API key and that METL serves a different purpose than ESM, so we chose ESM2.

Accomplishments that we're proud of

We're proud of our ability to design the model's complex architecture and pipeline, then program it and put it together in one day.

What we learned

Over the course of this project, we learned about the architecture of protein structure prediction systems, including how they use transformers and diffusion models; how to integrate a database with an AI system; and the variety of tools and resources available in computational protein research, including PDB and ESM.

What's next for MiniFold

We are developing a streamlined user interface that lowers the barrier to biomolecular modeling. MiniFold will provide an intuitive web-based dashboard and API, allowing users to submit sequences, ligands, or complexes and receive predicted structures, docking scores, and visualizations without coding.

To ensure product–market fit, we are working closely with early users in biotech startups, CROs, and academic labs. Their feedback shapes the workflows, from screening large compound libraries to visualizing protein–ligand binding sites.

Our first pilot program will focus on compound screening in early-stage drug discovery. This setting is an ideal proving ground: users face urgent needs, limited compute budgets, and high value from faster, affordable prediction. Demonstrating real gains in this context will validate MiniFold’s impact and open the door to broader adoption.

Built With

biopython
esm-2
pdb
python
pytorch

Submitted to

Toronto Bioinformatics Hackathon 2025

Created by

I contributed to the design and integration of the lightweight Mini-AlphaFold pipeline by focusing on the transformer interaction model.

Lizzy-Ok Oke
pwatana
deleted deleted
Aidan Sun
Ruby511
Donghoon Lee
josephDURAISINGH Duraisingh

Updates

deleted deleted started this project — Sep 21, 2025 12:56 PM EDT


Lizzy-Ok Oke posted an update — Sep 21, 2025 12:53 PM EDT

    # Usage:
    import torch

    def get_classification_token(batch) -> torch.Tensor:
        """Process embeddings and return classification token."""
        # ProteinInteractionTransformer is the project's transformer module.
        model = ProteinInteractionTransformer()
        return model(batch, num_embeddings=len(batch), embedding_dimensions=batch[0].size(0))
