Welcome to MiniFold! This repository contains a modular pipeline for protein structure prediction and analysis, designed for rapid prototyping and bioinformatics research.
pip install -r requirements.txt
MiniFold is designed as a linear, modular pipeline so each stage can be developed, tested, and swapped independently. Below is a detailed walkthrough of each step, what it expects as input, what it produces as output, and implementation notes.
METL Embedder (models/metl_embedder.py)
- Input: one or more protein sequences (FASTA or raw strings) provided by data_loader.
- Purpose: convert variable-length amino-acid sequences into fixed-size numerical embeddings suitable for downstream models, using a pre-trained model.
- Output: per-sequence embeddings (Tensor shape: [num_sequences, embedding_dim]).
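To make the input/output contract above concrete, here is a minimal sketch in plain Python. It is not the METL model: `parse_fasta`, `composition_embedding`, and `embed_sequences` are hypothetical names, and the "embedding" is just a mean-pooled one-hot vector (amino-acid composition) standing in for a learned, pre-trained embedder. The point is only that sequences of any length map to a [num_sequences, embedding_dim] result.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else f"seq{len(records)}"
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def composition_embedding(sequence: str) -> List[float]:
    """Mean-pool one-hot residue vectors into a fixed 20-dim vector.

    Stand-in for a learned embedder: the output length does not depend on
    the sequence length. Non-standard residues are skipped.
    """
    counts = [0.0] * len(AMINO_ACIDS)
    for aa in sequence:
        idx = AA_INDEX.get(aa)
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def embed_sequences(sequences: List[str]) -> List[List[float]]:
    """Return a [num_sequences, embedding_dim] nested list."""
    return [composition_embedding(s) for s in sequences]


fasta = ">chainA\nMKTAYIAK\n>chainB\nGGSGG\n"
records = parse_fasta(fasta)
embeddings = embed_sequences(list(records.values()))
print(len(embeddings), len(embeddings[0]))  # 2 20
```

A real embedder would return a tensor from the pre-trained model instead of composition frequencies, but downstream stages only rely on the fixed output shape.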
Interaction Transformer (models/interaction_transformer.py)
- Input: embeddings from the METL embedder. For complexes, embeddings from each chain are combined (concatenation or pairwise features).
- Purpose: model pairwise and higher-order interactions between residues or chains. Produces a pooled classification token that summarizes interaction context.
- Output: classification token per complex (Vector shape: [batch, hidden_dim]).
- Notes: use a small transformer encoder with a learnable CLS token. Support batching and attention masks.
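The batching and masking mentioned in the notes can be sketched without any deep-learning dependency. The helper below is hypothetical (`build_batch` is not a name from this repository): it prepends a CLS slot to each complex's chain embeddings, pads every complex to the same length, and builds a padding mask. The mask uses True for padded positions, which matches the convention of `src_key_padding_mask` in PyTorch's `nn.TransformerEncoder`. In a real model the CLS vector would be a learnable parameter rather than a fixed list.

```python
from typing import List, Tuple

Vector = List[float]


def build_batch(
    complexes: List[List[Vector]], cls_token: Vector
) -> Tuple[List[List[Vector]], List[List[bool]]]:
    """Prepend a CLS slot, pad to a common length, and build a padding mask.

    complexes[b] holds one embedding vector per chain of complex b.
    mask[b][t] is True where position t of complex b is padding and
    should be ignored by attention.
    """
    dim = len(cls_token)
    max_len = 1 + max(len(chains) for chains in complexes)
    batch, mask = [], []
    for chains in complexes:
        tokens = [list(cls_token)] + [list(v) for v in chains]
        n_pad = max_len - len(tokens)
        batch.append(tokens + [[0.0] * dim for _ in range(n_pad)])
        mask.append([False] * len(tokens) + [True] * n_pad)
    return batch, mask


cls = [0.0, 0.0, 0.0]
dimer = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
trimer = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
batch, mask = build_batch([dimer, trimer], cls)
print(mask[0])  # [False, False, False, True]
```

After the encoder runs, the output at position 0 of each row would serve as the classification token of shape [batch, hidden_dim].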
Similarity Search (models/similarity_search.py)
- Input: classification token(s) or pooled embeddings from the transformer.
- Purpose: retrieve similar known complexes from data/reference_db/ by comparing embeddings (cosine similarity or L2 distance).
- Output: top-k neighbor metadata and/or neighbor embeddings used as templates or conditioning signals.
- Notes: Precompute a reference embedding matrix (e.g., numpy or torch file). For speed, use FAISS for large DBs; for prototyping, simple vector search suffices.
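For prototyping, the "simple vector search" option can be as small as the brute-force sketch below. The function names and the reference IDs are hypothetical, and the reference set is an in-memory dict rather than a precomputed matrix file; a FAISS index would replace the linear scan for large databases.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(
    query: Sequence[float], reference: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k reference entries most similar to the query, best first."""
    scored = (
        (cosine_similarity(query, vec), ref_id) for ref_id, vec in reference.items()
    )
    best = heapq.nlargest(k, scored)
    return [(ref_id, score) for score, ref_id in best]


reference_db = {
    "ref_a": [1.0, 0.0],
    "ref_b": [0.0, 1.0],
    "ref_c": [0.7, 0.7],
}
hits = top_k_neighbors([1.0, 0.1], reference_db, k=2)
print([ref_id for ref_id, _ in hits])  # ['ref_a', 'ref_c']
```

Switching to L2 distance only changes the scoring function and the sort direction (smallest distance first).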
Diffusion / Structure Generator (models/diffusion.py)
- Input: classification token and retrieved neighbor/template embeddings. May also accept raw sequences if the diffusion model requires them.
- Purpose: generate 3D coordinates for the protein complex using a diffusion-based generative model (DiffDock-style) or another structure generator.
- Output: predicted structure (PDB or coordinate array), optionally confidence scores.
- Notes: This module now implements GNNConditionalDDPM, a Graph Neural Network-based diffusion model that utilizes bonding information and amino acid element data for more accurate structure prediction.
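The internals of GNNConditionalDDPM are not described here, so the following is only a generic illustration of the forward (noising) half of a DDPM applied to 3D coordinates. All names are hypothetical, the linear schedule from 1e-4 to 0.02 is a common default rather than this project's setting, and the learned reverse (denoising) network, which is where the GNN and the conditioning would come in, is omitted entirely.

```python
import math
import random
from typing import List

Coords = List[List[float]]  # one [x, y, z] triple per atom


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Noise variances beta_1..beta_T, evenly spaced between the two ends."""
    if num_steps < 2:
        return [beta_start] * num_steps
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(coords: Coords, t: int, alpha_bars: List[float],
              rng: random.Random) -> Coords:
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [[signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom]
            for atom in coords]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
backbone = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0]]  # arbitrary toy coordinates
slightly_noisy = add_noise(backbone, 10, alpha_bars, rng)
nearly_gaussian = add_noise(backbone, 999, alpha_bars, rng)
```

At small t the sample stays close to the input structure; near the last step almost no signal remains, which is the starting point for generation in the reverse direction.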
Output and Evaluation
- Output: a PDB file or coordinate arrays saved to output/ (or data/predictions/).
- Evaluation: use utils/evaluation.py to compute RMSD, TM-score, and other metrics against reference structures when available.
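As a reference point for the RMSD metric, here is a minimal sketch. It is not the code in utils/evaluation.py, and it assumes the predicted and reference coordinates are already superimposed and listed in the same atom order; structural RMSD is normally reported after an optimal superposition (for example with the Kabsch algorithm), which this sketch does not perform.

```python
import math
from typing import Sequence


def rmsd(predicted: Sequence[Sequence[float]],
         reference: Sequence[Sequence[float]]) -> float:
    """Root-mean-square deviation between matched atoms, without superposition."""
    if len(predicted) != len(reference):
        raise ValueError("coordinate sets must contain the same number of atoms")
    if not predicted:
        raise ValueError("coordinate sets must not be empty")
    squared = 0.0
    for p, r in zip(predicted, reference):
        squared += sum((a - b) ** 2 for a, b in zip(p, r))
    return math.sqrt(squared / len(predicted))


reference = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
predicted = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]  # same shape, shifted 2 units along z
print(rmsd(predicted, reference))  # 2.0
```

The example shows why superposition matters: the two structures are identical up to a translation, yet the unaligned RMSD is 2.0.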
The project includes a comprehensive test suite to ensure functionality and reliability:
- test/test_gnn_diffusion.py: Comprehensive tests for the GNN-based diffusion model
- test/test_coordinate_to_cif.py: Tests for coordinate to CIF conversion utilities
- test/test_pdb_helpers.py: Tests for PDB/mmCIF parsing utilities
- test/test_plddt.py: Tests for pLDDT (predicted Local Distance Difference Test) computation
- test/test_plddt_comprehensive.py: Extended pLDDT tests with larger structures
- test/test_pipeline.py: End-to-end pipeline integration test
- test/test_diffuse_end_to_end.py: Complete diffusion model workflow test
- test/test_gnn_diffusion_with_noise.py: Test script that generates structures with different noise levels for visualization
- test/test_2RPP_cif.py: Test with real protein data from PDB
Run all tests with:
cd test && python -m pytest -v
Or run individual test scripts directly:
cd test && python test_gnn_diffusion.py
The project is organized as follows:
Coordinates the overall workflow, integrating all modules for end-to-end protein analysis.
Defines global constants, paths, and model names for easy customization and reproducibility.
- sequences/: Place your input FASTA or .txt files here. (Initially empty)
- reference_db/: Contains pre-downloaded PDBs and metadata for reference.
- metl_embedder.py: Module 1 — Generates METL embeddings for input sequences.
- interaction_transformer.py: Module 2 — Applies a transformer to model pairwise interactions.
- similarity_search.py: Module 3 — Finds similar protein structures using embeddings.
- diffusion.py: Module 4 — Implements DiffDock/diffusion-based structure prediction.
- training.py: Utilities for training models.
- evaluation.py: Evaluation metrics and analysis tools.
- data_loader.py: Functions for downloading and loading data.
- pdb_helpers.py: PDB/mmCIF parsing and export utilities.
- coordinate_to_cif.py: Coordinate to CIF conversion utilities.
- test_*.py: Unit and integration tests for various components.
- output/: Directory for test output files.
List of required Python packages for the project.
- Clone the repository.
- Install dependencies from requirements.txt.
- Add your input data to data/sequences/.
- Run minifold.py to start the pipeline.
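How minifold.py wires the stages together is not shown in this README. As a rough illustration of the linear, swappable design described in the walkthrough, a pipeline can be modeled as an ordered list of named callables, each consuming the previous stage's output. `run_pipeline` and the two placeholder stages below are hypothetical and do no real modeling.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], data: Any, verbose: bool = False) -> Any:
    """Feed data through each stage in order and return the final result."""
    for name, stage in stages:
        data = stage(data)
        if verbose:
            print(f"finished stage: {name}")
    return data


def fake_embed(sequences: List[str]) -> List[float]:
    """Placeholder embedder: one number (the length) per sequence."""
    return [float(len(s)) for s in sequences]


def fake_pool(values: List[float]) -> float:
    """Placeholder pooling step: average the per-sequence values."""
    return sum(values) / len(values)


stages = [("embed", fake_embed), ("pool", fake_pool)]
print(run_pipeline(stages, ["MKT", "GGSGG"]))  # 4.0
```

Swapping a stage then means replacing one entry in the list, as long as the replacement keeps the same input and output shapes.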
Generated structures can be visualized using PyMOL:
pymol data/predictions/*.cif
The test scripts generate CIF files containing the backbone atoms (N, CA, C, O), which PyMOL can connect to display alpha helices and other secondary structure.
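For orientation, the sketch below writes backbone atoms as a minimal mmCIF `atom_site` loop. It is not the implementation in coordinate_to_cif.py: the function name, the tuple layout, and the coordinates are made up for illustration, and only a small subset of the standard `_atom_site` columns is emitted. Whether a given viewer accepts such a reduced column set is not guaranteed, so a real exporter may need more fields.

```python
from typing import List, Tuple

# (atom_name, residue_name, residue_number, chain_id, x, y, z)
Atom = Tuple[str, str, int, str, float, float, float]

ATOM_SITE_COLUMNS = [
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_comp_id",
    "label_asym_id", "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
]


def backbone_to_mmcif(atoms: List[Atom], data_name: str = "minifold_sketch") -> str:
    """Format backbone atoms as a minimal mmCIF atom_site loop."""
    lines = [f"data_{data_name}", "#", "loop_"]
    lines += [f"_atom_site.{col}" for col in ATOM_SITE_COLUMNS]
    for serial, (name, res, seq, chain, x, y, z) in enumerate(atoms, start=1):
        element = name[0]  # N, CA, C and O all start with their element symbol
        lines.append(
            f"ATOM {serial} {element} {name} {res} {chain} {seq} "
            f"{x:.3f} {y:.3f} {z:.3f}"
        )
    lines.append("#")
    return "\n".join(lines) + "\n"


# Arbitrary illustrative coordinates for one residue, not real structure data.
residue = [
    ("N", "ALA", 1, "A", 0.000, 0.000, 0.000),
    ("CA", "ALA", 1, "A", 1.458, 0.000, 0.000),
    ("C", "ALA", 1, "A", 2.009, 1.420, 0.000),
    ("O", "ALA", 1, "A", 1.251, 2.390, 0.000),
]
print(backbone_to_mmcif(residue))
```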
See CONTRIBUTING.md for guidelines.
Distributed under the terms of the license in LICENSE.