CheMixHub is the first comprehensive benchmark suite for machine learning on chemical mixtures. It provides curated datasets, robust data splitting strategies, and baseline models to accelerate research in predicting and optimizing mixture properties.
This repository contains all datasets, data processing scripts, and code for the baselines presented in our paper, "CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction".
- Curated Datasets (13 Tasks):
  - Standardized 11 tasks from 7 diverse public datasets.
  - New: 2 large-scale tasks (116,896 data points) from ILThermo, larger than any existing public mixture dataset.
  - See Dataset Details below for a list of all included datasets and sources.
- Robust Generalization Splits:
  - Random
  - Unseen Chemical Components
  - Varying Mixture Size/Composition
  - Extrapolation Context (e.g., temperature)
- Baseline Models & Benchmarks:
  - Implementations of representative ML models.
  - Established initial performance benchmarks for easy comparison.
- 🥐 Croissant Metadata: Each dataset includes a `croissant.json` file, providing rich, standardized metadata for improved findability, usability, and interoperability with ML tools. (See `datasets/<dataset_name>/croissant.json`)
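To make the "Unseen Chemical Components" split concrete, here is a minimal sketch of the idea. It is not the repository's implementation: the function name, the `(mixture_id, components)` record layout, and the `holdout_fraction` parameter are assumptions chosen for illustration. A random subset of components is held out, and any mixture containing at least one of them goes to the test set, so every test mixture includes a component the model never saw during training.

```python
import random
from typing import List, Sequence, Set, Tuple

# Hypothetical record layout: (mixture_id, identifiers of its components).
Mixture = Tuple[str, Sequence[str]]


def unseen_component_split(
    mixtures: List[Mixture],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str], Set[str]]:
    """Return (train_ids, test_ids, held_out_components).

    A mixture is assigned to the test set if it contains at least one
    held-out component; otherwise it is assigned to the training set.
    """
    # Collect the unique components in a deterministic order before shuffling.
    components = sorted({c for _, comps in mixtures for c in comps})
    rng = random.Random(seed)
    rng.shuffle(components)

    # Hold out at least one component.
    n_holdout = max(1, int(len(components) * holdout_fraction))
    held_out = set(components[:n_holdout])

    train_ids: List[str] = []
    test_ids: List[str] = []
    for mixture_id, comps in mixtures:
        if held_out.intersection(comps):
            test_ids.append(mixture_id)
        else:
            train_ids.append(mixture_id)
    return train_ids, test_ids, held_out


# Toy example with made-up component labels.
example = [
    ("m1", ["A", "B"]),
    ("m2", ["B", "C"]),
    ("m3", ["C", "D"]),
    ("m4", ["A", "D"]),
    ("m5", ["E"]),
]
train_ids, test_ids, held_out = unseen_component_split(example)
```

Because frequently used components appear in many mixtures, the resulting test set can be much larger or smaller than `holdout_fraction` of the data. A practical version would likely need to balance set sizes, which this sketch does not attempt.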
- Clone the repository:
      git clone https://github.com/chemcognition-lab/chemixhub.git
      cd chemixhub
- Install dependencies:
      pip install -e .
- Explore the datasets:
  - Datasets are located in the `datasets/` directory. Each has a `README.md` with specific details.
  - Processed data (`processed_data.csv` and `compounds.csv`) and data splits are in `datasets/<dataset_name>/processed_data/`.
- (Optional) Run baseline models:
  - Scripts for training and evaluation are in `scripts/`. (See `scripts/README.md` for more details.)
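As a starting point for exploring the processed files described above, the following sketch reads `processed_data.csv` and `compounds.csv` for one dataset using only the Python standard library. The helper names (`processed_dir`, `read_csv_rows`, `load_dataset`) are hypothetical and not part of the CheMixHub package, and no column names are assumed, since those are documented in each dataset's own README.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def processed_dir(repo_root: PathLike, dataset_name: str) -> Path:
    """Build datasets/<dataset_name>/processed_data/ under the repository root."""
    return Path(repo_root) / "datasets" / dataset_name / "processed_data"


def read_csv_rows(path: PathLike) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_dataset(repo_root: PathLike, dataset_name: str) -> Dict[str, List[Dict[str, str]]]:
    """Load the two processed files named in this README for one dataset."""
    folder = processed_dir(repo_root, dataset_name)
    return {
        "processed_data": read_csv_rows(folder / "processed_data.csv"),
        "compounds": read_csv_rows(folder / "compounds.csv"),
    }
```

For example, `load_dataset(".", "nist-logV")` run from the repository root would return both tables as lists of dictionaries, provided the processed files for that dataset have already been generated. All values are returned as strings, so numeric columns need to be converted before use.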
- `datasets/`: Contains all curated datasets.
- `config/`: Basic config files used to train models.
- `scripts/`: Utility scripts (training, evaluation, feature precomputation, etc.).
- `src/`: Core library code.
Check out the separate READMEs in each folder for more details!
CheMixHub consolidates and standardizes data from the following sources:
- Miscible Solvents: Density, enthalpy of mixing, partial molar enthalpy. Source Paper
- ILThermo: Transport properties (ionic conductivity, viscosity) for ionic liquid mixtures. Includes 2 new large-scale tasks. Source Paper
  - To get the dataset, run the following script: `datasets/ionic-liquids/raw_data/fetch_ilthermo.py`
- NIST Viscosity: Viscosity for organic mixtures from NIST ThermoData Engine (via Zenodo). Source Paper | Link to Dataset
  - Path to the data file in Zenodo: `nist_dippr_source/NIST_Visc_Data.csv`
  - Download the file and put it in `datasets/nist-logV/raw_data`. To get the processed dataset, run the following script: `datasets/nist-logV/processed_data/data_processing.py`
- Drug Solubility: Drug solubility in solvent mixtures. Source Paper
- Solid Polymer Electrolyte Ionic Conductivity (SPEs): Ionic conductivity for polymer–salt mixtures. Source Paper
- Olfactory Similarity: Perceptual similarity scores for mixtures. Source Paper
- Motor Octane Number (MON): Octane numbers for hydrocarbons and fuels. Source Paper
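The manual steps for the NIST Viscosity dataset (place the downloaded file in the raw data folder, then run the processing script) could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and running the script from the repository root with the current Python interpreter is a guess about how `data_processing.py` expects to be invoked. Only the paths themselves come from this README.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]

NIST_FILE = "NIST_Visc_Data.csv"  # file name given in this README


def stage_nist_file(downloaded_file: PathLike, repo_root: PathLike) -> Path:
    """Copy the downloaded Zenodo file into datasets/nist-logV/raw_data/."""
    raw_dir = Path(repo_root) / "datasets" / "nist-logV" / "raw_data"
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / NIST_FILE
    shutil.copy(downloaded_file, target)
    return target


def processing_command(repo_root: PathLike) -> List[str]:
    """Build the command that runs the processing script named in this README."""
    script = (
        Path(repo_root) / "datasets" / "nist-logV" / "processed_data" / "data_processing.py"
    )
    return [sys.executable, str(script)]


def run_processing(repo_root: PathLike) -> None:
    """Run the processing script; the working directory is an assumption."""
    subprocess.run(processing_command(repo_root), cwd=str(repo_root), check=True)
```

A typical call would be `stage_nist_file("Downloads/NIST_Visc_Data.csv", ".")` followed by `run_processing(".")`. If the script resolves paths relative to its own folder, the `cwd` argument may need to change.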
If you use CheMixHub in your research, please cite our paper:
@article{rajaonson2025chemixhub,
title={CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction},
author={Rajaonson, Ella Miray and Kochi, Mahyar Rajabi and Mendoza, Luis Martin Mejia and Moosavi, Seyed Mohamad and Sanchez-Lengeling, Benjamin},
journal={arXiv preprint arXiv:2506.12231},
year={2025}
}