This is an official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.
We have updated the repository and the Hugging Face checkpoints: the current code and weights correspond to the improved pipeline and much better results! The previous version, however, remains available as tag v1 (git checkout v1).
Matcha is a molecular docking pipeline that combines multi-stage flow matching with physical validity filtering. It consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on the appropriate geometric space (R³, SO(3), and SO(2)). Physical validity filters eliminate unrealistic poses, and GNINA minimization and scoring rank the final predictions.
Compared with a range of other approaches, Matcha achieves superior docking success rates and physical plausibility on the Astex and PDBBind test sets. It also runs approximately 31× faster than modern large-scale co-folding models.
- Installation
- CLI usage
- Datasets
- Preparing the config file
- Protein preprocessing (for GNINA)
- Running inference step-by-step
- Benchmarking and pocket-aligned RMSD computation
- License
- Citation
# Install with uv
uv sync
Or with pip:
pip install -e .
The matcha CLI wraps the full inference pipeline (ESM embeddings, 3-stage docking, PoseBusters filtering) with GNINA minimization and scoring into a single command.
uv run matcha -r protein.pdb -l ligand.sdf -o results/
uv run matcha -r protein.pdb --ligand-dir ligands.sdf -o results/
All molecules are processed in a single pipeline pass (native batching).
| Flag | Description |
|---|---|
| -r, --receptor | Protein structure (.pdb) |
| -l, --ligand | Single ligand (.sdf/.mol/.mol2/.pdb) |
| --ligand-dir | Multi-ligand .sdf file or directory |
| -o, --out | Output directory |
| -g, --device | auto, cpu, cuda, cuda:N, or mps (Apple Metal) |
| --gpus | Multi-GPU batch sharding, e.g. --gpus 2,3 (batch dir mode only) |
| --n-samples | Poses per ligand (default: 40) |
| --scorer | gnina (default), custom, or none |
| --scorer-minimize / --no-scorer-minimize | GNINA minimization (default: on) |
| --autobox-ligand | Box center from reference ligand |
| --center-x/y/z | Manual box center (Å) |
| --overwrite | Overwrite existing run |
Run matcha --help for the full list.
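When Matcha is driven from a Python script, the command line can be assembled programmatically. The sketch below only builds an argument list from flags listed in the table above. The helper name `build_matcha_command` is hypothetical and not part of Matcha, and it assumes `--overwrite` is a plain switch that takes no value.

```python
import shlex
from typing import List, Optional


def build_matcha_command(
    receptor: str,
    ligand: str,
    out_dir: str,
    n_samples: int = 40,
    device: str = "auto",
    autobox_ligand: Optional[str] = None,
    overwrite: bool = False,
) -> List[str]:
    """Assemble a single-ligand `matcha` invocation (hypothetical helper)."""
    cmd = [
        "uv", "run", "matcha",
        "-r", receptor,
        "-l", ligand,
        "-o", out_dir,
        "--n-samples", str(n_samples),
        "--device", device,
    ]
    # Optional: center the search box on a reference ligand.
    if autobox_ligand is not None:
        cmd += ["--autobox-ligand", autobox_ligand]
    # Assumption: --overwrite is a boolean switch without an argument.
    if overwrite:
        cmd.append("--overwrite")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_matcha_command("protein.pdb", "ligand.sdf", "results/")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.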
For large ligand directories, Matcha can shard ligands across multiple GPUs by launching one process per GPU.
# 2 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 2,3 --box-json target_box.json -o out_2gpu
# 3 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 1,2,3 --box-json target_box.json -o out_3gpu
Outputs are merged into:
- out_<...>/<run-name>/merged/
- out_<...>/<run-name>/benchmark_summary.json
- out_<...>/<run-name>/benchmark_summary.md
The search region can be set in one of three ways:
- Blind docking (default): searches the entire protein surface.
- Autobox (--autobox-ligand ref.sdf): centers the search on a reference ligand.
- Manual box (--center-x X --center-y Y --center-z Z): uses explicit coordinates.
Single mode produces <run-name>_best.sdf (top pose) and <run-name>_poses.sdf (all ranked poses). Batch mode creates best_poses/ and all_poses/ directories with per-ligand SDFs. A detailed log file is written alongside the results.
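To inspect the ranked poses programmatically, the output file can be split into individual records. The following is a minimal sketch that assumes `<run-name>_poses.sdf` is a standard multi-record SDF, in which each record ends with a line containing only `$$$$`, and that records appear in rank order. The function names are illustrative and not part of Matcha.

```python
from pathlib import Path
from typing import List


def split_sdf_records(text: str) -> List[str]:
    """Split multi-record SDF text into one string per molecule.

    The `$$$$` terminator lines are dropped from the returned records.
    Any trailing text without a terminator is ignored.
    """
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def count_poses(sdf_path: str) -> int:
    """Count the poses stored in an SDF file on disk."""
    return len(split_sdf_records(Path(sdf_path).read_text()))
```

Under the ordering assumption above, `split_sdf_records(...)[0]` would be the top-ranked pose, which should match the contents of `<run-name>_best.sdf`.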
Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.
Use a dataset folder with the following structure:
dataset_path/
    uid1/
        uid1_protein.pdb
        uid1_ligand.sdf
    uid2/
        uid2_protein.pdb
        uid2_ligand.sdf
    ...
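Before running inference on a new dataset, it can help to check that every complex folder follows this layout. The sketch below is a standalone check based only on the naming pattern shown above. The function name `find_incomplete_complexes` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Dict, List


def find_incomplete_complexes(dataset_path: str) -> Dict[str, List[str]]:
    """Map each uid folder to the expected files it is missing."""
    problems: Dict[str, List[str]] = {}
    for uid_dir in sorted(Path(dataset_path).iterdir()):
        if not uid_dir.is_dir():
            continue  # ignore stray files at the top level
        uid = uid_dir.name
        expected = [f"{uid}_protein.pdb", f"{uid}_ligand.sdf"]
        missing = [name for name in expected if not (uid_dir / name).is_file()]
        if missing:
            problems[uid] = missing
    return problems


if __name__ == "__main__":
    # "dataset_path" is a placeholder; point it at your own dataset folder.
    for uid, missing in find_incomplete_complexes("dataset_path").items():
        print(f"{uid}: missing {', '.join(missing)}")
```

An empty result means every folder contains both the protein and the ligand file.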
- Edit configs/paths/paths.yaml: set posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir (or any_data_dir for a new dataset). Comment out unneeded entries in test_dataset_types.
- Set paths for intermediate and final data:
  - cache_path, data_folder, inference_results_folder
  - preprocessed_receptors_base: root directory for preprocessed protein structures used by the GNINA affinity scripts (see Protein preprocessing). Required when using GNINA steps; layout: {preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb.
- Download checkpoints from Hugging Face (LigandPro/Matcha) (the matcha_pipeline folder). Set checkpoints_folder in paths.yaml to the folder that contains it.
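The layout expected under preprocessed_receptors_base can be written as a small path helper. This sketch follows only the pattern stated above. The function name and the example values (`/data/prep`, `astex`, `1abc`) are hypothetical, and the exact dataset strings used by the GNINA scripts are not confirmed here.

```python
from pathlib import Path


def preprocessed_receptor_path(base: str, dataset: str, uid: str) -> Path:
    """Build {base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb."""
    name = f"{dataset}_{uid}"
    return Path(base) / name / f"{name}_protein.pdb"


if __name__ == "__main__":
    # Example values are placeholders, not real dataset entries.
    print(preprocessed_receptor_path("/data/prep", "astex", "1abc"))
```

Checking `preprocessed_receptor_path(...).is_file()` for every complex before launching the GNINA steps can surface missing receptors early.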
Protein structures used by the GNINA affinity scripts must be preprocessed (hydrogenation, PDBQT, etc.). We use the dockprep-pipeline for receptor and ligand preparation; see that repository for a minimal pipeline (Reduce/OpenMM hydrogenation, Meeko PDBQT). Further details are in the paper.
uv run python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 uv run python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=<gpu_device_id> uv run python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n-samples 20
To run the full pipeline including GNINA affinity, minimization, top-pose selection, and metrics:
uv run bash scripts/final_inference_pipeline.sh -n inference_folder_name -c configs/base.yaml -p configs/paths/paths.yaml -d <gpu_device_id> -s 20 -g </path/to/gnina_executable> [--compute_final_metrics]
You must set preprocessed_receptors_base in paths.yaml (or provide preprocessed structures as required by the GNINA scripts) and pass -g with the path to your GNINA runner script.
If you pass --compute_final_metrics, the script computes dataset-level metrics for the top-1 pose of each complex.
The metrics include symmetry-corrected RMSD and PoseBusters filters.
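For readers unfamiliar with symmetry-corrected RMSD, the idea is to take the minimum RMSD over all atom relabelings that leave the molecular graph unchanged, so that swapping equivalent atoms (for example, the two oxygens of a carboxylate) is not penalized. The sketch below illustrates the concept only and is not the implementation used by the Matcha scripts. It assumes the symmetry mappings are supplied by the caller, typically from a graph-automorphism search in a cheminformatics toolkit.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pred: Sequence[Coord], ref: Sequence[Coord]) -> float:
    """Plain RMSD between two equally ordered coordinate lists (no alignment)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equally long")
    sq = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(sq / len(pred))


def symmetry_corrected_rmsd(
    pred: Sequence[Coord],
    ref: Sequence[Coord],
    symmetry_maps: List[Sequence[int]],
) -> float:
    """Minimum RMSD over the supplied atom index mappings.

    symmetry_maps[k][i] is the index of the predicted atom matched to
    reference atom i. Include the identity mapping in the list.
    """
    return min(rmsd([pred[j] for j in mapping], ref) for mapping in symmetry_maps)
```

With a two-atom symmetric molecule whose predicted atoms are swapped relative to the reference, the plain RMSD is nonzero, while passing both `[0, 1]` and `[1, 0]` as mappings yields zero.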
For other docking methods, prepare a folder of predictions with the structure described in the script. Then:
uv run python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>
Set methods_data and dataset_names inside the script as needed.
For each method in methods_data, set the has_predicted_proteins flag, which indicates whether the method's protein PDB has coordinates that differ from the original holo structure.
Choose between base and pocket alignment (see Appendix G in the paper).
By default, we use pocket alignment for methods that predict protein structures (e.g., AlphaFold3) and base alignment for rigid docking methods (e.g., DiffDock). For rigid docking, no alignment is performed; the results are only rearranged for the subsequent metrics computation.
The resulting structures will appear in inference_results_folder/<baseline_method_name>_<pocket_alignment_type>.
After aligning the predicted structures to the original holo protein structure, metrics from the best SDF predictions can be computed with:
uv run python scripts/compute_metrics_from_sdf.py -p configs/paths/paths.yaml -n <baseline_method_name>_<pocket_alignment_type> --prediction-type best_base_predictions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
If you use Matcha in your work, please cite:
@misc{frolova2025matchamultistageriemannianflow,
title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking},
author={Daria Frolova and Talgat Daulbaev and Egor Sevriugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
year={2025},
eprint={2510.14586},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.14586},
}


