
MiniFold

Welcome to MiniFold! This repository contains a modular pipeline for protein structure prediction and analysis, designed for rapid prototyping and bioinformatics research.

Installation

pip install -r requirements.txt

Pipeline — step-by-step

MiniFold is designed as a linear, modular pipeline so each stage can be developed, tested, and swapped independently. Below is a detailed walkthrough of each step, what it expects as input, what it produces as output, and implementation notes.

  1. METL Embedder (models/metl_embedder.py)

    • Input: one or more protein sequences (FASTA or raw strings) provided by data_loader.
    • Purpose: convert variable-length amino-acid sequences into fixed-size numerical embeddings suitable for downstream models, using a pre-trained METL model.
    • Output: per-sequence embeddings (Tensor shape: [num_sequences, embedding_dim]).
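The embedding contract above can be illustrated with a minimal stand-in. This is not the METL model itself; `embed_sequence` and the mean-pooled one-hot scheme are hypothetical placeholders that only demonstrate the variable-length-in, fixed-size-out shape behaviour:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def embed_sequence(seq: str) -> np.ndarray:
    """Mean-pooled one-hot embedding: a toy stand-in for a pre-trained METL model."""
    onehot = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot.mean(axis=0)  # fixed-size vector regardless of sequence length

def embed_batch(seqs) -> np.ndarray:
    """Stack per-sequence embeddings into [num_sequences, embedding_dim]."""
    return np.stack([embed_sequence(s) for s in seqs])
```

Any real embedder can be swapped in behind the same interface, as long as it returns a `[num_sequences, embedding_dim]` array.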
  2. Interaction Transformer (models/interaction_transformer.py)

    • Input: embeddings from the METL embedder. For complexes, embeddings from each chain are combined (concatenation or pairwise features).
    • Purpose: model pairwise and higher-order interactions between residues or chains. Produces a pooled classification token that summarizes interaction context.
    • Output: classification token per complex (Vector shape: [batch, hidden_dim]).
    • Notes: use a small transformer encoder with a learnable CLS token. Support batching and attention masks.
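The CLS-token pooling idea in the notes above can be sketched as a single (untrained) scaled dot-product attention step; `cls_pool` and its unparameterized attention are illustrative, not the module's actual API:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cls_pool(embeddings: np.ndarray, cls_token: np.ndarray) -> np.ndarray:
    """One attention step: a learnable CLS token attends over residue/chain embeddings.

    embeddings: [num_tokens, hidden_dim]; cls_token: [hidden_dim].
    Returns the updated CLS row as the pooled classification token.
    """
    tokens = np.vstack([cls_token, embeddings])  # prepend the CLS token
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))  # scaled dot-product attention
    out = attn @ tokens
    return out[0]  # row 0 is the CLS token's contextualized summary
```

In the real module this would be a multi-layer transformer encoder with learned projections, batching, and attention masks; the pooling mechanism is the same.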
  3. Similarity Search (models/similarity_search.py)

    • Input: classification token(s) or pooled embeddings from the transformer.
    • Purpose: retrieve similar known complexes from data/reference_db/ by comparing embeddings (cosine similarity or L2 distance).
    • Output: top-k neighbor metadata and/or neighbor embeddings used as templates or conditioning signals.
    • Notes: Precompute a reference embedding matrix (e.g., numpy or torch file). For speed, use FAISS for large DBs; for prototyping, simple vector search suffices.
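For prototyping, the simple vector search mentioned in the notes can be brute-force cosine similarity over the precomputed reference matrix. `top_k_neighbors` is a hypothetical helper sketching the idea, not the module's API:

```python
import numpy as np

def top_k_neighbors(query: np.ndarray, reference_db: np.ndarray, k: int = 5):
    """Brute-force cosine-similarity search over a precomputed embedding matrix.

    query: [dim]; reference_db: [num_refs, dim].
    Returns (indices, similarities) of the top-k nearest references.
    """
    q = query / np.linalg.norm(query)
    refs = reference_db / np.linalg.norm(reference_db, axis=1, keepdims=True)
    sims = refs @ q                      # cosine similarity per reference
    idx = np.argsort(-sims)[:k]          # highest similarity first
    return idx, sims[idx]
```

For large databases, the same interface can be backed by a FAISS index without changing the callers.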
  4. Diffusion / Structure Generator (models/diffusion.py)

    • Input: classification token and retrieved neighbor/template embeddings. May also accept raw sequences if the diffusion model requires them.
    • Purpose: generate 3D coordinates for the protein complex using a diffusion-based generative model (DiffDock-style) or another structure generator.
    • Output: predicted structure (PDB or coordinate array), optionally confidence scores.
    • Notes: This module implements GNNConditionalDDPM, a graph-neural-network-based diffusion model that uses bonding information and amino-acid element data for more accurate structure prediction.
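As background for the DDPM in this module, here is a minimal sketch of the forward (noising) process that a diffusion model learns to invert, assuming a standard linear beta schedule. The function names are illustrative and do not come from diffusion.py:

```python
import numpy as np

def linear_beta_schedule(T: int, beta_start=1e-4, beta_end=0.02) -> np.ndarray:
    """Standard linear variance schedule over T diffusion timesteps."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(coords: np.ndarray, t: int, alphas_cumprod: np.ndarray, rng):
    """Sample q(x_t | x_0): noise clean coordinates [num_atoms, 3] to timestep t.

    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps,  eps ~ N(0, I)
    Returns (noised coordinates, the noise), which form a training pair for
    the denoising network.
    """
    noise = rng.standard_normal(coords.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * coords + np.sqrt(1.0 - a_bar) * noise, noise
```

At inference time the trained network runs this process in reverse, starting from pure noise and conditioning each denoising step on the classification token and retrieved templates.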
  5. Output and Evaluation

    • Output: a PDB file or coordinate arrays saved to output/ (or data/predictions/).
    • Evaluation: use utils/evaluation.py to compute RMSD, TM-score, and other metrics against reference structures when available.
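Coordinate RMSD, the simplest of the metrics above, can be sketched as follows. This assumes the two structures are already superposed (no Kabsch alignment is performed here); the actual implementation in utils/evaluation.py may differ:

```python
import numpy as np

def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """Root-mean-square deviation between matched coordinate arrays [N, 3].

    Assumes a one-to-one atom correspondence and prior superposition.
    """
    assert pred.shape == ref.shape, "coordinate arrays must be the same shape"
    return float(np.sqrt(((pred - ref) ** 2).sum(axis=1).mean()))
```

TM-score additionally normalizes by structure length and requires an optimal superposition, so it is best computed with a dedicated tool or the project's evaluation utilities.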

Testing

The project includes a comprehensive test suite to ensure functionality and reliability:

Unit Tests

  • test/test_gnn_diffusion.py: Comprehensive tests for the GNN-based diffusion model
  • test/test_coordinate_to_cif.py: Tests for coordinate to CIF conversion utilities
  • test/test_pdb_helpers.py: Tests for PDB/mmCIF parsing utilities
  • test/test_plddt.py: Tests for pLDDT (predicted Local Distance Difference Test) computation
  • test/test_plddt_comprehensive.py: Extended pLDDT tests with larger structures

Integration Tests

  • test/test_pipeline.py: End-to-end pipeline integration test
  • test/test_diffuse_end_to_end.py: Complete diffusion model workflow test

Specialized Tests

  • test/test_gnn_diffusion_with_noise.py: Test script that generates structures with different noise levels for visualization
  • test/test_2RPP_cif.py: Test with real protein data from PDB

Run all tests with:

cd test && python -m pytest -v

Or run individual test scripts directly:

cd test && python test_gnn_diffusion.py

Project Structure

The project is organized as follows:

minifold.py (Main Pipeline)

Coordinates the overall workflow, integrating all modules for end-to-end protein analysis.

config.py (Configuration)

Defines global constants, paths, and model names for easy customization and reproducibility.

data/

  • sequences/: Place your input FASTA or .txt files here. (Initially empty)
  • reference_db/: Contains pre-downloaded PDBs and metadata for reference.

models/

  • metl_embedder.py: Module 1 — Generates METL embeddings for input sequences.
  • interaction_transformer.py: Module 2 — Applies a transformer to model pairwise interactions.
  • similarity_search.py: Module 3 — Finds similar protein structures using embeddings.
  • diffusion.py: Module 4 — Implements DiffDock/diffusion-based structure prediction.

utils/

  • training.py: Utilities for training models.
  • evaluation.py: Evaluation metrics and analysis tools.
  • data_loader.py: Functions for downloading and loading data.
  • pdb_helpers.py: PDB/mmCIF parsing and export utilities.
  • coordinate_to_cif.py: Coordinate to CIF conversion utilities.

test/

  • test_*.py: Unit and integration tests for various components.
  • output/: Directory for test output files.

requirements.txt

List of required Python packages for the project.

Getting Started

  1. Clone the repository.
  2. Install dependencies from requirements.txt.
  3. Add your input data to data/sequences/.
  4. Run minifold.py to start the pipeline.

Visualization

Generated structures can be visualized using PyMOL:

pymol data/predictions/*.cif

The test scripts generate CIF files containing the backbone atoms (N, CA, C, O), which PyMOL can connect to render alpha helices and other secondary-structure elements.

Contributing

See CONTRIBUTING.md for guidelines.

License

Distributed under the terms of the license in LICENSE.

About

Lightweight MiniAlphaFold Implementation
