MiniFold

Welcome to MiniFold! This repository contains a modular pipeline for protein structure prediction and analysis, designed for rapid prototyping and bioinformatics research.

Installation

pip install -r requirements.txt

Pipeline — step-by-step

MiniFold is designed as a linear, modular pipeline so each stage can be developed, tested, and swapped independently. Below is a detailed walkthrough of each step, what it expects as input, what it produces as output, and implementation notes.

METL Embedder (models/metl_embedder.py)
- Input: one or more protein sequences (FASTA or raw strings) provided by data_loader.
- Purpose: convert variable-length amino-acid sequences into fixed-size numerical embeddings suitable for downstream models, using a pre-trained model.
- Output: per-sequence embeddings (Tensor shape: [num_sequences, embedding_dim]).
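
To illustrate the shape contract only (variable-length sequences in, a `[num_sequences, embedding_dim]` array out), here is a minimal sketch. It is not the METL model: it assumes a toy stand-in that mean-pools one-hot residue vectors, and the function name `embed_sequences` is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def embed_sequences(sequences):
    """Toy stand-in for a pre-trained embedder (NOT METL).

    Each sequence becomes the mean of its one-hot residue vectors, so
    every row has the same length regardless of sequence length.
    """
    out = np.zeros((len(sequences), len(AMINO_ACIDS)), dtype=np.float32)
    for row, seq in enumerate(sequences):
        for aa in seq.upper():
            idx = AA_INDEX.get(aa)
            if idx is not None:  # skip non-standard symbols such as X
                out[row, idx] += 1.0
        total = out[row].sum()
        if total > 0:
            out[row] /= total  # mean pooling over residues
    return out


embeddings = embed_sequences(["MKTAYIAK", "GSHMLEDPVAGG"])
print(embeddings.shape)  # (2, 20)
```

A real pre-trained embedder would replace the one-hot step with learned per-residue features; the pooling step that produces one fixed-size row per sequence stays conceptually the same.
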
Interaction Transformer (models/interaction_transformer.py)
- Input: embeddings from the METL embedder. For complexes, embeddings from each chain are combined (concatenation or pairwise features).
- Purpose: model pairwise and higher-order interactions between residues or chains. Produces a pooled classification token that summarizes interaction context.
- Output: classification token per complex (Vector shape: [batch, hidden_dim]).
- Notes: use a small transformer encoder with a learnable CLS token. Support batching and attention masks.
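
The notes above (a small encoder, a learnable CLS token, batching, attention masks) could be sketched in PyTorch roughly as follows. The class name `ClsTransformer`, the layer sizes, and the choice to treat each chain embedding as one token are assumptions for illustration, not the contents of models/interaction_transformer.py.

```python
import torch
import torch.nn as nn


class ClsTransformer(nn.Module):
    """Hypothetical encoder: prepends a learnable CLS token and returns it."""

    def __init__(self, input_dim=20, hidden_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=4 * hidden_dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )

    def forward(self, chain_embeddings, padding_mask=None):
        # chain_embeddings: [batch, num_chains, input_dim]
        # padding_mask: [batch, num_chains] bool, True marks padded slots
        batch = chain_embeddings.shape[0]
        x = self.proj(chain_embeddings)
        x = torch.cat([self.cls.expand(batch, -1, -1), x], dim=1)
        if padding_mask is not None:
            # The CLS position is never masked out.
            cls_mask = torch.zeros(
                batch, 1, dtype=torch.bool, device=padding_mask.device
            )
            padding_mask = torch.cat([cls_mask, padding_mask], dim=1)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        return x[:, 0]  # [batch, hidden_dim]


model = ClsTransformer().eval()
emb = torch.randn(3, 4, 20)  # 3 complexes, up to 4 chains each
mask = torch.tensor(
    [
        [False, False, True, True],    # 2 real chains, 2 padded
        [False, False, False, False],  # 4 real chains
        [False, False, False, True],   # 3 real chains, 1 padded
    ]
)
with torch.no_grad():
    token = model(emb, mask)
print(token.shape)  # torch.Size([3, 64])
```

Padding lets complexes with different chain counts share a batch; the mask keeps the padded slots from contributing to attention.
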
Similarity Search (models/similarity_search.py)
- Input: classification token(s) or pooled embeddings from the transformer.
- Purpose: retrieve similar known complexes from data/reference_db/ by comparing embeddings (cosine similarity or L2 distance).
- Output: top-k neighbor metadata and/or neighbor embeddings used as templates or conditioning signals.
- Notes: Precompute a reference embedding matrix (e.g., numpy or torch file). For speed, use FAISS for large DBs; for prototyping, simple vector search suffices.
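
For the prototyping case, a brute-force cosine search over a precomputed reference matrix might look like the sketch below. The function name `top_k_cosine` and the tiny in-memory matrix are illustrative assumptions; the actual interface of models/similarity_search.py is not shown in this README.

```python
import numpy as np


def top_k_cosine(query, reference, k=3):
    """Return (indices, scores) of the k reference rows most similar to query."""
    q = query / np.linalg.norm(query)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    scores = r @ q                    # cosine similarity per reference row
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]   # highest similarity first
    return order, scores[order]


# Hypothetical reference embeddings, one row per known complex.
reference_db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, sims = top_k_cosine(np.array([2.0, 0.0]), reference_db, k=2)
print(idx)  # [0 2]
```

The returned indices would then be used to look up neighbor metadata or embeddings. This scan is linear in the size of the database, which is why an approximate index becomes attractive for large reference sets.
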
Diffusion / Structure Generator (models/diffusion.py)
- Input: classification token and retrieved neighbor/template embeddings. May also accept raw sequences if the diffusion model requires them.
- Purpose: generate 3D coordinates for the protein complex using a diffusion-based generative model (DiffDock-style) or another structure generator.
- Output: predicted structure (PDB or coordinate array), optionally confidence scores.
- Notes: This module implements GNNConditionalDDPM, a graph neural network-based diffusion model that conditions on bonding information and amino-acid element data, with the aim of improving structure prediction accuracy.
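
As background for the DDPM part of the name, the standard forward (noising) step that a denoiser is trained to invert can be sketched as below. The linear beta schedule, the step count, and the random stand-in coordinates are assumptions; the schedule and the GNN denoiser actually used in models/diffusion.py are not described here.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(coords, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(coords.shape)
    x_t = np.sqrt(alpha_bar[t]) * coords + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
coords = rng.standard_normal((8, 3))  # stand-in for 8 centred atom positions
noisy, eps = add_noise(coords, t=500, alpha_bar=alpha_bar, rng=rng)
print(noisy.shape)  # (8, 3)
```

During training, a denoising network would receive `noisy`, the step `t`, and the conditioning signals, and would be asked to predict `eps`. Larger `t` means a smaller `alpha_bar[t]` and therefore a noisier structure.
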
Output and Evaluation
- Output: a PDB file or coordinate arrays saved to output/ (or data/predictions/).
- Evaluation: use utils/evaluation.py to compute RMSD, TM-score, and other metrics against reference structures when available.
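
For reference, RMSD after optimal rigid superposition (the Kabsch algorithm) can be computed as in this sketch. The function name `kabsch_rmsd` is hypothetical and the code is not taken from utils/evaluation.py; it assumes both inputs are `[N, 3]` arrays with atoms in the same order.

```python
import numpy as np


def kabsch_rmsd(pred, ref):
    """RMSD between two [N, 3] coordinate sets after optimal superposition."""
    p = pred - pred.mean(axis=0)  # remove translation
    q = ref - ref.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)  # SVD of the 3x3 covariance matrix
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = p @ rotation.T
    return float(np.sqrt(np.mean(np.sum((aligned - q) ** 2, axis=1))))


ref = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]])
theta = np.pi / 3
rot_z = np.array(
    [
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
pred = ref @ rot_z.T + np.array([5.0, -2.0, 1.0])  # rigidly moved copy
print(kabsch_rmsd(pred, ref))  # close to 0 for a rigidly moved copy
```

Without the superposition step, two identical structures in different orientations would report a large RMSD, so alignment comes first.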

Testing

The project includes a test suite covering unit, integration, and specialized cases:

Unit Tests

test/test_gnn_diffusion.py: Comprehensive tests for the GNN-based diffusion model
test/test_coordinate_to_cif.py: Tests for coordinate to CIF conversion utilities
test/test_pdb_helpers.py: Tests for PDB/mmCIF parsing utilities
test/test_plddt.py: Tests for pLDDT (predicted Local Distance Difference Test) computation
test/test_plddt_comprehensive.py: Extended pLDDT tests with larger structures

Integration Tests

test/test_pipeline.py: End-to-end pipeline integration test
test/test_diffuse_end_to_end.py: Complete diffusion model workflow test

Specialized Tests

test/test_gnn_diffusion_with_noise.py: Test script that generates structures with different noise levels for visualization
test/test_2RPP_cif.py: Test with real protein data from PDB

Run all tests with:

cd test && python -m pytest -v

Or run individual test scripts directly:

cd test && python test_gnn_diffusion.py

Project Structure

The project is organized as follows:

minifold.py (Main Pipeline)

Coordinates the overall workflow, integrating all modules for end-to-end protein analysis.

config.py (Configuration)

Defines global constants, paths, and model names for easy customization and reproducibility.

data/

sequences/: Place your input FASTA or .txt files here. (Initially empty)
reference_db/: Contains pre-downloaded PDBs and metadata for reference.

models/

metl_embedder.py: Module 1 — Generates METL embeddings for input sequences.
interaction_transformer.py: Module 2 — Applies a transformer to model pairwise interactions.
similarity_search.py: Module 3 — Finds similar protein structures using embeddings.
diffusion.py: Module 4 — Implements DiffDock/diffusion-based structure prediction.

utils/

training.py: Utilities for training models.
evaluation.py: Evaluation metrics and analysis tools.
data_loader.py: Functions for downloading and loading data.
pdb_helpers.py: PDB/mmCIF parsing and export utilities.
coordinate_to_cif.py: Coordinate to CIF conversion utilities.

test/

test_*.py: Unit and integration tests for various components.
output/: Directory for test output files.

requirements.txt

List of required Python packages for the project.

Getting Started

1. Clone the repository.
2. Install dependencies from requirements.txt.
3. Add your input data to data/sequences/.
4. Run minifold.py to start the pipeline.

Visualization

Generated structures can be visualized using PyMOL:

pymol data/predictions/*.cif

The test scripts generate CIF files containing the backbone atoms (N, CA, C, O), which PyMOL can connect to display alpha helices and other secondary structures.
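
To show what a backbone-only CIF file of this kind can contain, here is a minimal sketch that writes a reduced mmCIF `atom_site` loop. The function `write_backbone_cif` and its input layout are hypothetical and not the API of utils/coordinate_to_cif.py. The column names are standard mmCIF items, but whether a given viewer accepts such a reduced record set is an assumption.

```python
import io

BACKBONE = ("N", "CA", "C", "O")
COLUMNS = (
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_comp_id",
    "label_asym_id", "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
)


def write_backbone_cif(handle, residues, chain_id="A", block_name="prediction"):
    """Write backbone atoms as a reduced mmCIF atom_site loop.

    residues: list of (three_letter_name, {atom_name: (x, y, z)}) tuples.
    """
    handle.write(f"data_{block_name}\n#\nloop_\n")
    for column in COLUMNS:
        handle.write(f"_atom_site.{column}\n")
    serial = 1
    for seq_id, (res_name, atoms) in enumerate(residues, start=1):
        for atom_name in BACKBONE:
            if atom_name not in atoms:
                continue  # tolerate residues with missing atoms
            x, y, z = atoms[atom_name]
            element = atom_name[0]  # N, C or O for backbone atoms
            handle.write(
                f"ATOM {serial} {element} {atom_name} {res_name} "
                f"{chain_id} {seq_id} {x:.3f} {y:.3f} {z:.3f}\n"
            )
            serial += 1
    handle.write("#\n")


# Made-up coordinates for a single alanine, for illustration only.
residues = [
    ("ALA", {"N": (0.0, 0.0, 0.0), "CA": (1.458, 0.0, 0.0),
             "C": (2.009, 1.42, 0.0), "O": (1.251, 2.39, 0.0)}),
]
buffer = io.StringIO()
write_backbone_cif(buffer, residues)
cif_text = buffer.getvalue()
print(cif_text)
```

Writing to a file handle rather than a path keeps the sketch easy to test in memory; passing an opened file instead of `io.StringIO()` would produce a file on disk.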

Contributing

See CONTRIBUTING.md for guidelines.

License

Distributed under the terms of the license in LICENSE.
