
gbyuvd/chemembed-faiss-demo


About The Project

ChEmbed v0.1 is a sentence-transformers model based on MiniLM-L6-H384-uncased, fine-tuned on around 1 million pairs of valid natural compounds' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021). It maps compounds' Self-Referencing Embedded Strings (SELFIES) into a 768-dimensional dense vector space and can potentially be used for similarity search, classification, clustering, and more.
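As a rough illustration of the mapping described above, the following sketch turns SELFIES strings into unit-length vectors. The helper `embed_selfies` and the `DummyEncoder` class are hypothetical names introduced only for this example; the dummy encoder stands in for the real model so the snippet runs offline, and its 8-dimensional output is arbitrary. The commented lines show how the real model would presumably be plugged in via `sentence_transformers`, assuming the model id from the citation URL further down; whether the model expects any extra preprocessing of the SELFIES string is not covered here.

```python
import zlib

import numpy as np


def embed_selfies(selfies_strings, model):
    """Encode SELFIES strings with any object exposing `.encode(list_of_str)`.

    Rows are L2-normalized so that an inner product between two rows
    equals their cosine similarity.
    """
    vectors = np.asarray(model.encode(list(selfies_strings)), dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


class DummyEncoder:
    """Hypothetical offline stand-in for the real model (NOT ChEmbed)."""

    def __init__(self, dim=8):
        self.dim = dim

    def encode(self, texts):
        # Deterministic pseudo-embedding: seed a generator from each string.
        rows = []
        for text in texts:
            rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
            rows.append(rng.normal(size=self.dim))
        return np.stack(rows)


# Ethanol and benzene written as SELFIES.
demo_selfies = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]
demo_vectors = embed_selfies(demo_selfies, DummyEncoder())
print(demo_vectors.shape)

# With the real model (needs network access and sentence_transformers):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("gbyuvd/ChemEmbed-v01")
# vectors = embed_selfies(demo_selfies, model)
```

Normalizing at embedding time is a design choice of this sketch, not something the project states; it keeps later similarity computations down to plain dot products.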

Here, we use the above model to embed ~407K natural product SELFIES representations (from COCONUTDB) and use Meta's FAISS to index them and run fast searches for structurally similar compounds based on one or more input molecules.
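The indexing and search step can be sketched as follows. This is a minimal example under stated assumptions, not the notebook's actual code: it uses an exact inner-product index (`faiss.IndexFlatIP`) over L2-normalized vectors, random vectors in place of real embeddings, and an arbitrary dimension of 64. The names `normalize_rows`, `build_index`, and `NumpyFlatIP` are hypothetical; the NumPy class is a brute-force fallback with the same `search` signature so the sketch still runs where `faiss` is not installed.

```python
import numpy as np


def normalize_rows(x):
    """Return a float32 copy of x with every row scaled to unit length."""
    x = np.ascontiguousarray(x, dtype="float32")
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)


class NumpyFlatIP:
    """Hypothetical brute-force fallback mimicking `index.search(q, k)`."""

    def __init__(self, db):
        self.db = db

    def search(self, queries, k):
        scores = queries @ self.db.T
        ids = np.argsort(-scores, axis=1)[:, :k]
        return np.take_along_axis(scores, ids, axis=1), ids


def build_index(db_vectors):
    """Build an exact inner-product index over normalized database vectors."""
    db = normalize_rows(db_vectors)
    try:
        import faiss  # optional dependency

        index = faiss.IndexFlatIP(db.shape[1])
        index.add(db)
        return index
    except ImportError:
        return NumpyFlatIP(db)


# Random stand-ins for molecule embeddings (assumption: 1000 vectors, dim 64).
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))
index = build_index(database)

# Query with a slightly perturbed copy of database entry 42.
query = normalize_rows(database[42:43] + 0.01 * rng.normal(size=(1, 64)))
scores, ids = index.search(query, 5)
print(ids[0], scores[0])
```

Because the vectors are normalized, the returned scores are cosine similarities, and the perturbed query retrieves entry 42 as its nearest neighbor. For a collection of a few hundred thousand vectors an exact flat index is often still practical; approximate index types trade some recall for speed.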

How This Works

Disclaimer: For Academic Purposes Only

The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.

Getting Started

Clone or download this repo, then either (1) extract the archives provided in ./db_raw if you want to prepare everything locally, or (2) download the pre-embedded FAISS index here. After that, see Demo.ipynb for a tutorial on:

  • Part I: Importing and Loading Model
  • Part II: Preparing the Dataset and its FAISS-Index (You can skip Part II for direct use)
  • Part III: Querying One Molecule as Input
  • Part IV: Querying Multiple Molecules and Using an Averaged Embedding
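The averaged-embedding idea from Part IV can be illustrated with a small sketch. It assumes the query is formed by normalizing each input embedding, taking the mean, and re-normalizing the result; the notebook's exact procedure may differ. The function name `averaged_query` and the toy 3-dimensional vectors are hypothetical and only serve the illustration.

```python
import numpy as np


def averaged_query(embeddings):
    """Combine several embeddings into one unit-length query vector."""
    e = np.asarray(embeddings, dtype="float32")
    # Normalize each input so that no single molecule dominates the mean.
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    mean = e.mean(axis=0, keepdims=True)
    # Re-normalize so inner products stay comparable to cosine similarity.
    return mean / np.linalg.norm(mean, axis=1, keepdims=True)


# Two toy "molecule" embeddings and a tiny database of three unit vectors.
mol_a = np.array([1.0, 0.0, 0.0])
mol_b = np.array([0.0, 1.0, 0.0])
toy_db = np.eye(3, dtype="float32")

combined = averaged_query([mol_a, mol_b])
toy_scores = (combined @ toy_db.T)[0]
print(toy_scores)
```

The combined query scores equally against both inputs and lower against the unrelated third vector, which is the behavior one would want from a multi-molecule search. Note that inputs pointing in exactly opposite directions would average to the zero vector, so a real implementation should guard against a zero norm.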

Prerequisites

sentence_transformers, pandas, rdkit, tqdm, selfies, numpy, scikit-learn, faiss, pyarrow, matplotlib, pickle (part of the Python standard library)

Data Attribution

COCONUTDB

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

License

Creative Commons Attribution Share Alike 3.0

Contact

GP Bayu - HF:@gbyuvd - e-mail:[email protected]


Acknowledgments

Citations

If you find this project useful in your research and wish to cite it, please use the following BibTeX entries:

ChEmbed-v0.1

@software{chembed_selfies01,
  author = {GP Bayu},
  title = {{ChEmbed}: Fine-tuning A Lightweight Sentence Transformer Model on Molecular SELFIES},
  url = {https://huggingface.co/gbyuvd/ChemEmbed-v01},
  version = {0.1},
  year = {2024},
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

COCONUTDB

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

SELFIES

@article{krenn2020selfies,
  title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
  author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
  journal={Machine Learning: Science and Technology},
  volume={1},
  number={4},
  pages={045024},
  year={2020},
  doi={10.1088/2632-2153/aba947}
}

FAISS

@article{douze2024faiss,
      title={The Faiss library},
      author={Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre-Emmanuel Mazaré and Maria Lomeli and Lucas Hosseini and Hervé Jégou},
      year={2024},
      eprint={2401.08281},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

About

Demonstration of HF:ChemEmbed-v01's Use for Fast Molecular Similarity Search on Large Natural Product SELFIES Dataset
