CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

Paper | License: MIT

🚀 Overview

CheMixHub is the first comprehensive benchmark suite for machine learning on chemical mixtures. It provides curated datasets, robust data splitting strategies, and baseline models to accelerate research in predicting and optimizing mixture properties.

This repository contains all datasets, data processing scripts, and baseline code presented in our paper, "CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction".

✨ Features & Key Contributions

  • Curated Datasets (13 Tasks):
    • Standardized 11 tasks from 7 diverse public datasets.
    • New: 2 large-scale tasks (116,896 data points) from ILThermo, larger than any existing public mixture dataset.
    • See Dataset Details below for a list of all included datasets and sources.
  • Robust Generalization Splits:
    • Random
    • Unseen Chemical Components
    • Varying Mixture Size/Composition
    • Extrapolation Context (e.g., temperature)
  • Baseline Models & Benchmarks:
    • Implementations of representative ML models.
    • Established initial performance benchmarks for easy comparison.
  • 🥐 Croissant Metadata: Each dataset includes a croissant.json file, providing rich, standardized metadata for improved findability, usability, and interoperability with ML tools. (See datasets/<dataset_name>/croissant.json)
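
To make the "Unseen Chemical Components" split listed above more concrete, here is a minimal sketch of the general idea: hold out a subset of components and send every mixture that contains one of them to the test set. This is an illustration only, not the splitting code shipped in this repository; the function name `unseen_component_split`, its parameters, and the tuple-of-identifiers input format are all assumptions made for the example.

```python
import random


def unseen_component_split(mixtures, test_fraction=0.2, seed=0):
    """Hypothetical helper: split mixtures by held-out components.

    `mixtures` is a list of tuples of component identifiers (for example
    SMILES strings). `test_fraction` is the fraction of *components*
    (not mixtures) to hold out.
    Returns (train_indices, test_indices, held_out_components).
    """
    # Collect every distinct component and shuffle deterministically.
    components = sorted({c for mix in mixtures for c in mix})
    rng = random.Random(seed)
    rng.shuffle(components)

    # Hold out at least one component.
    n_held_out = max(1, int(len(components) * test_fraction))
    held_out = set(components[:n_held_out])

    train_idx, test_idx = [], []
    for i, mix in enumerate(mixtures):
        # Any mixture touching a held-out component goes to the test set,
        # so the training set never sees those components.
        if held_out.intersection(mix):
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx, held_out
```

Because the split is driven by components rather than by rows, the resulting test set size depends on how often the held-out components appear in the data.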

🏁 Getting Started

  1. Clone the repository:
    git clone https://github.com/chemcognition-lab/chemixhub.git
    cd chemixhub
  2. Install dependencies:
    pip install -e .
  3. Explore the datasets:
    • Datasets are located in the datasets/ directory. Each has a README.md with specific details.
    • Processed data (processed_data.csv and compounds.csv) and data splits are in datasets/<dataset_name>/processed_data/.
  4. (Optional) Run baseline models:
    • Scripts for training and evaluation are in scripts/. (See scripts/README.md for more details).
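
As a starting point for step 3, the sketch below reads the two processed tables and, if present, the Croissant metadata for one dataset using only the Python standard library. The file locations follow the layout described above; the helper name `load_dataset` and the returned dictionary keys are assumptions for illustration, and no column names are assumed since they vary by dataset (see each dataset's README.md).

```python
import csv
import json
from pathlib import Path


def load_dataset(dataset_dir):
    """Hypothetical helper: load processed tables for one dataset folder.

    `dataset_dir` is a path such as datasets/<dataset_name>.
    Each CSV is returned as a list of dicts keyed by its header row.
    """
    dataset_dir = Path(dataset_dir)
    processed = dataset_dir / "processed_data"

    def read_csv(path):
        with open(path, newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))

    data = {
        "mixtures": read_csv(processed / "processed_data.csv"),
        "compounds": read_csv(processed / "compounds.csv"),
        "metadata": None,
    }

    # croissant.json is plain JSON, so the standard json module can read it.
    croissant = dataset_dir / "croissant.json"
    if croissant.exists():
        data["metadata"] = json.loads(croissant.read_text(encoding="utf-8"))
    return data
```

All values come back as strings; convert numeric columns as needed, or load the same files with a dataframe library if you prefer.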

📁 Repository Structure

  • datasets/: Contains all curated datasets.
  • config/: Basic config files used to train models.
  • scripts/: Utility scripts (training, evaluation, feature precomputation, etc.).
  • src/: Core library code.

Check out the separate READMEs in each folder for more details!

📊 Dataset Details

CheMixHub consolidates and standardizes data from the following sources:

  • Miscible Solvents: Density, enthalpy of mixing, partial molar enthalpy. Source Paper
  • ILThermo: Transport properties (ionic conductivity, viscosity) for ionic liquid mixtures. Source Paper | Includes 2 new large-scale tasks. | To fetch the dataset, run datasets/ionic-liquids/raw_data/fetch_ilthermo.py
  • NIST Viscosity: Viscosity for organic mixtures from NIST ThermoData Engine (via Zenodo). Source Paper | Link to Dataset | Data file within the Zenodo archive: nist_dippr_source/NIST_Visc_Data.csv | Download the file and place it in datasets/nist-logV/raw_data, then run datasets/nist-logV/processed_data/data_processing.py to generate the processed dataset.
  • Drug Solubility: Drug solubility in solvent mixtures. Source Paper
  • Solid Polymer Electrolyte Ionic Conductivity (SPEs): Ionic conductivity for polymer–salt mixtures. Source Paper
  • Olfactory Similarity: Perceptual similarity scores for mixtures. Source Paper
  • Motor Octane Number (MON): Octane numbers for hydrocarbons and fuels. Source Paper

Citing CheMixHub

If you use CheMixHub in your research, please cite our paper:

@article{rajaonson2025chemixhub,
  title={CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction},
  author={Rajaonson, Ella Miray and Kochi, Mahyar Rajabi and Mendoza, Luis Martin Mejia and Moosavi, Seyed Mohamad and Sanchez-Lengeling, Benjamin},
  journal={arXiv preprint arXiv:2506.12231},
  year={2025}
}
