Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

This is the official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.

News

We have updated the repository and the Hugging Face checkpoints: the current code and weights correspond to the improved pipeline, which gives better results. The previous version remains available under the tag v1 (git checkout v1).

Overview

Matcha is a molecular docking pipeline that combines multi-stage flow matching with physical validity filtering. It consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on the appropriate geometric space (R³, SO(3), and SO(2)). Physical validity filters eliminate unrealistic poses, and GNINA minimization and scoring rank the final predictions.

Compared with other docking approaches, Matcha achieves superior docking success rates and physical plausibility on the Astex and PDBBind test sets. It also runs approximately 31× faster than modern large-scale co-folding models.

Installation

# Install with uv
uv sync

Or with pip:

pip install -e .

CLI usage

The matcha CLI wraps the full inference pipeline (ESM embeddings, 3-stage docking, PoseBusters filtering) with GNINA minimization and scoring into a single command.

Single ligand

uv run matcha -r protein.pdb -l ligand.sdf -o results/

Batch mode (multi-ligand file or directory)

uv run matcha -r protein.pdb --ligand-dir ligands.sdf -o results/

All molecules are processed in a single pipeline pass (native batching).
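
Before a batch run it can help to check how many molecules a multi-ligand .sdf file holds. The sketch below relies only on the SDF convention that each record ends with a line containing `$$$$`; the helper name `count_sdf_records` is hypothetical and not part of Matcha.

```python
import io


def count_sdf_records(handle):
    """Count molecule records in an SDF text stream.

    Each SDF record is terminated by a line that holds only '$$$$'.
    """
    count = 0
    pending_content = False
    for line in handle:
        stripped = line.strip()
        if stripped == "$$$$":
            count += 1
            pending_content = False
        elif stripped:
            pending_content = True
    # Count a trailing record even if its terminator is missing.
    if pending_content:
        count += 1
    return count


# Tiny in-memory stand-in for a multi-ligand file (illustrative content only).
example = io.StringIO("mol1\nM  END\n$$$$\nmol2\nM  END\n$$$$\n")
n_ligands = count_sdf_records(example)
print(f"{n_ligands} ligands found")
```

For a file on disk, the same function can be called as `count_sdf_records(open("ligands.sdf"))`.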

Key options

| Flag | Description |
| --- | --- |
| `-r`, `--receptor` | Protein structure (`.pdb`) |
| `-l`, `--ligand` | Single ligand (`.sdf`/`.mol`/`.mol2`/`.pdb`) |
| `--ligand-dir` | Multi-ligand `.sdf` file or directory |
| `-o`, `--out` | Output directory |
| `-g`, `--device` | `auto`, `cpu`, `cuda`, `cuda:N`, or `mps` (Apple Metal) |
| `--gpus` | Multi-GPU batch sharding, e.g. `--gpus 2,3` (batch dir mode only) |
| `--n-samples` | Poses per ligand (default: 40) |
| `--scorer` | `gnina` (default), `custom`, or `none` |
| `--scorer-minimize` / `--no-scorer-minimize` | GNINA minimization (default: on) |
| `--autobox-ligand` | Box center from reference ligand |
| `--center-x/y/z` | Manual box center (Å) |
| `--overwrite` | Overwrite existing run |

Run matcha --help for the full list.
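
When driving the CLI from a script, the documented flags can be assembled into an argument list. The following is a minimal sketch that uses only flags listed above; the helper `build_matcha_command` is hypothetical, and passing each value as a separate argument after its flag is an assumption about the CLI's parsing. The command is only built, not executed.

```python
def build_matcha_command(receptor, out_dir, ligand=None, ligand_dir=None,
                         n_samples=40, scorer="gnina", device="auto",
                         overwrite=False):
    """Assemble a `matcha` invocation from the documented flags (hypothetical helper)."""
    # Exactly one ligand source is expected: a single file or a batch input.
    if (ligand is None) == (ligand_dir is None):
        raise ValueError("pass exactly one of ligand or ligand_dir")
    if scorer not in {"gnina", "custom", "none"}:
        raise ValueError(f"unknown scorer: {scorer}")

    cmd = ["uv", "run", "matcha", "-r", str(receptor)]
    if ligand is not None:
        cmd += ["-l", str(ligand)]
    else:
        cmd += ["--ligand-dir", str(ligand_dir)]
    cmd += ["-o", str(out_dir),
            "--n-samples", str(n_samples),
            "--scorer", scorer,
            "--device", device]
    if overwrite:
        cmd.append("--overwrite")
    return cmd


cmd = build_matcha_command("protein.pdb", "results/", ligand="ligand.sdf")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.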

Multi-GPU batch mode (2/3 GPUs)

For large ligand directories, Matcha can shard ligands across multiple GPUs by launching one process per GPU.

# 2 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 2,3 --box-json target_box.json -o out_2gpu

# 3 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 1,2,3 --box-json target_box.json -o out_3gpu

Outputs are merged into:

out_<...>/<run-name>/merged/
out_<...>/<run-name>/benchmark_summary.json
out_<...>/<run-name>/benchmark_summary.md
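
To illustrate what sharding ligands across GPUs means, here is a minimal round-robin sketch. It is an assumption for illustration only: the strategy Matcha actually uses to split ligands between processes is not described here, and `shard_round_robin` is a hypothetical name.

```python
def shard_round_robin(items, gpu_ids):
    """Distribute items over GPU ids in round-robin order (illustrative only)."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    shards = {gpu: [] for gpu in gpu_ids}
    for index, item in enumerate(items):
        shards[gpu_ids[index % len(gpu_ids)]].append(item)
    return shards


# Five hypothetical ligand files split over the GPUs named in `--gpus 2,3`.
ligands = [f"lig_{i}.sdf" for i in range(5)]
shards = shard_round_robin(ligands, [2, 3])
for gpu, files in shards.items():
    print(gpu, files)
```

Round-robin keeps shard sizes within one item of each other, which is a reasonable default when per-ligand cost is unknown in advance.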

Search space

Blind docking (default): searches the entire protein surface.
Autobox: --autobox-ligand ref.sdf — centers the search on a reference ligand.
Manual box: --center-x X --center-y Y --center-z Z — explicit coordinates.
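
If manual coordinates are needed, one way to obtain them is the geometric centroid of a reference ligand. The sketch below parses a V2000 molblock using the fixed-width columns of the MDL format. Whether `--autobox-ligand` uses exactly this centroid is an assumption, and `molblock_centroid` and the two-atom example are hypothetical.

```python
def molblock_centroid(molblock):
    """Return the mean (x, y, z) of the atoms in a V2000 molblock."""
    lines = molblock.splitlines()
    counts = lines[3]  # three header lines precede the counts line
    if "V2000" not in counts:
        raise ValueError("only V2000 molblocks are handled in this sketch")
    n_atoms = int(counts[0:3])
    if n_atoms == 0:
        raise ValueError("molblock has no atoms")
    # Atom lines store x, y, z in three 10-character columns.
    coords = [(float(l[0:10]), float(l[10:20]), float(l[20:30]))
              for l in lines[4:4 + n_atoms]]
    return tuple(sum(axis) / n_atoms for axis in zip(*coords))


def atom_line(x, y, z, symbol):
    """Format one V2000 atom line (coordinates and element only)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0"


# Hypothetical two-atom reference ligand, built in memory.
example = "\n".join([
    "ref", "  sketch", "",
    f"{2:3d}{1:3d}  0  0  0  0  0  0  0  0999 V2000",
    atom_line(1.0, 2.0, 3.0, "C"),
    atom_line(3.0, 4.0, 5.0, "O"),
    "  1  2  1  0",
    "M  END",
])
cx, cy, cz = molblock_centroid(example)
print(f"--center-x {cx:.3f} --center-y {cy:.3f} --center-z {cz:.3f}")
```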

Output

Single mode produces <run-name>_best.sdf (top pose) and <run-name>_poses.sdf (all ranked poses). Batch mode creates best_poses/ and all_poses/ directories with per-ligand SDFs. A detailed log file is written alongside the results.

Datasets

Existing datasets

Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.

Adding a new dataset

Use a dataset folder with the following structure:

dataset_path/
    uid1/
        uid1_protein.pdb
        uid1_ligand.sdf
    uid2/
        uid2_protein.pdb
        uid2_ligand.sdf
    ...
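
A quick check of this layout can catch missing files before inference. The sketch below is not part of the repository; `find_incomplete_complexes` is a hypothetical helper that reports, for each uid folder, which of the two expected files is absent.

```python
import tempfile
from pathlib import Path


def find_incomplete_complexes(dataset_path):
    """Map each uid folder to the expected files it is missing."""
    problems = {}
    for entry in sorted(Path(dataset_path).iterdir()):
        if not entry.is_dir():
            continue
        uid = entry.name
        expected = [f"{uid}_protein.pdb", f"{uid}_ligand.sdf"]
        missing = [name for name in expected if not (entry / name).is_file()]
        if missing:
            problems[uid] = missing
    return problems


# Demonstration on a throwaway directory: uid2 lacks its ligand file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {"uid1": ["uid1_protein.pdb", "uid1_ligand.sdf"],
              "uid2": ["uid2_protein.pdb"]}
    for uid, files in layout.items():
        (root / uid).mkdir()
        for name in files:
            (root / uid / name).touch()
    problems = find_incomplete_complexes(root)

print(problems)
```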

Preparing the config file

1. Edit configs/paths/paths.yaml: set posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir (or any_data_dir for a new dataset). Comment out unneeded entries in test_dataset_types.
2. Set paths for intermediate and final data:
   - cache_path, data_folder, inference_results_folder
   - preprocessed_receptors_base: root directory for preprocessed protein structures used by the GNINA affinity scripts (see Protein preprocessing). Required when using GNINA steps; layout: {preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb.
3. Download checkpoints from Hugging Face (LigandPro/Matcha) (the matcha_pipeline folder). Set checkpoints_folder in paths.yaml to the folder that contains it.
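
The preprocessed receptor layout above can be expressed as a small path helper. This is a sketch of the stated pattern only; the function name and the example values (`/data/receptors`, `astex`, `1abc`) are hypothetical.

```python
from pathlib import PurePosixPath


def preprocessed_receptor_path(base, dataset, uid):
    """Build {base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb."""
    stem = f"{dataset}_{uid}"
    return PurePosixPath(base) / stem / f"{stem}_protein.pdb"


# Illustrative values; substitute your own base directory, dataset name, and uid.
path = preprocessed_receptor_path("/data/receptors", "astex", "1abc")
print(path)
```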

Protein preprocessing (for GNINA)

Protein structures used by the GNINA affinity scripts must be preprocessed (hydrogenation, PDBQT, etc.). We use the dockprep-pipeline for receptor and ligand preparation; see that repository for a minimal pipeline (Reduce/OpenMM hydrogenation, Meeko PDBQT). Further details are in the paper.

Running inference step-by-step

Preprocessing

uv run python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 uv run python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml

Matcha inference

CUDA_VISIBLE_DEVICES=<gpu_device_id> uv run python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n-samples 20

Pose selection and filtration

To run the full pipeline including GNINA affinity, minimization, top-pose selection, and metrics:

uv run bash scripts/final_inference_pipeline.sh -n inference_folder_name -c configs/base.yaml -p configs/paths/paths.yaml -d <gpu_device_id> -s 20 -g </path/to/gnina_executable> [--compute_final_metrics]

You must set preprocessed_receptors_base in paths.yaml (or provide preprocessed structures as required by the GNINA scripts) and pass -g with the path to your GNINA runner script. If you pass --compute_final_metrics, the script also computes dataset-level metrics for the top-1 pose of each complex, including symmetry-corrected RMSD and PoseBusters filters.
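
For readers unfamiliar with symmetry-corrected RMSD: it is the RMSD minimized over atom relabelings that leave the molecular graph unchanged, so that swapping chemically equivalent atoms is not penalized. The sketch below is illustrative and is not the repository's implementation; the automorphisms are supplied by hand here, whereas in practice they would be derived from the molecular graph.

```python
import math


def rmsd(pred, ref):
    """Plain RMSD between two equally ordered lists of (x, y, z) coordinates."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - r) ** 2
                  for a, b in zip(pred, ref)
                  for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))


def symmetry_corrected_rmsd(pred, ref, automorphisms):
    """Minimum RMSD over the given atom permutations of the prediction."""
    return min(rmsd([pred[i] for i in perm], ref) for perm in automorphisms)


# Toy molecule: atoms 1 and 2 are equivalent and appear swapped in the prediction.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
automorphisms = [(0, 1, 2), (0, 2, 1)]  # identity and the swap of atoms 1 and 2

plain = rmsd(pred, ref)
corrected = symmetry_corrected_rmsd(pred, ref, automorphisms)
print(plain, corrected)
```

The plain RMSD is non-zero even though the pose is chemically identical to the reference, while the corrected value is zero.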

Benchmarking and pocket-aligned RMSD computation

For other docking methods, prepare a folder of predictions with the structure described in the script. Then:

uv run python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>

Set methods_data and dataset_names inside the script as needed. For each method in methods_data, set the has_predicted_proteins flag, which indicates whether the method's protein PDB has coordinates that differ from the original holo structure. Choose between base and pocket alignment (see Appendix G in the paper). By default, we use pocket alignment for methods that predict protein structures (e.g., AlphaFold3) and base alignment for rigid docking methods (e.g., DiffDock). For rigid docking, no alignment is performed; the results are only rearranged for the subsequent metrics computation. The resulting structures are written to inference_results_folder/<baseline_method_name>_<pocket_alignment_type>.
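
The default alignment rule and the output folder naming described above can be summarized in a few lines. This is a sketch under stated assumptions: the real structure of methods_data inside the script is not shown here, so the dictionary layout is hypothetical, and `pocket` is assumed to be the alignment name that pairs with the `base` value used in the command above.

```python
def choose_alignment(has_predicted_proteins):
    """Default rule: pocket alignment for predicted proteins, base otherwise."""
    return "pocket" if has_predicted_proteins else "base"


def results_folder_name(method_name, has_predicted_proteins):
    """Folder name following <baseline_method_name>_<pocket_alignment_type>."""
    return f"{method_name}_{choose_alignment(has_predicted_proteins)}"


# Hypothetical layout; the script's actual methods_data may differ.
methods_data = {
    "AlphaFold3": {"has_predicted_proteins": True},
    "DiffDock": {"has_predicted_proteins": False},
}
folders = {name: results_folder_name(name, cfg["has_predicted_proteins"])
           for name, cfg in methods_data.items()}
print(folders)
```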

After aligning the predicted structures to the original holo protein structure, metrics from the best SDF predictions can be computed with:

uv run python scripts/compute_metrics_from_sdf.py -p configs/paths/paths.yaml -n <baseline_method_name>_<pocket_alignment_type> --prediction-type best_base_predictions

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

If you use Matcha in your work, please cite:

@misc{frolova2025matchamultistageriemannianflow,
      title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking}, 
      author={Daria Frolova and Talgat Daulbaev and Egor Sevriugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
      year={2025},
      eprint={2510.14586},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.14586}, 
}
