CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

🚀 Overview

CheMixHub is the first comprehensive benchmark suite for machine learning on chemical mixtures. It provides curated datasets, robust data splitting strategies, and baseline models to accelerate research in predicting and optimizing mixture properties.

This repository contains all datasets, data processing scripts, and baseline code presented in our paper: "CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction".

✨ Features & Key Contributions

Curated Datasets (13 Tasks):
- Standardized 11 tasks from 7 diverse public datasets.
- New: 2 large-scale tasks (116,896 data points) from ILThermo, larger than any existing public mixture dataset.
- See Dataset Details below for a list of all included datasets and sources.
Robust Generalization Splits:
- Random
- Unseen Chemical Components
- Varying Mixture Size/Composition
- Extrapolation Context (e.g., temperature)
Baseline Models & Benchmarks:
- Implementations of representative ML models.
- Established initial performance benchmarks for easy comparison.
🥐 Croissant Metadata: Each dataset includes a croissant.json file, providing rich, standardized metadata for improved findability, usability, and interoperability with ML tools. (See datasets/<dataset_name>/croissant.json)
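
Of the generalization splits listed above, the "Unseen Chemical Components" strategy can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `unseen_component_split`, its parameters, and the representation of a mixture as a list of component identifiers are all assumptions made for illustration. The idea is to hold out a subset of components and send every mixture that contains any of them to the test set.

```python
import random
from typing import List, Sequence, Tuple


def unseen_component_split(
    mixtures: Sequence[Sequence[str]],
    held_out_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[str]]:
    """Return (train_indices, test_indices, held_out_components).

    Hypothetical helper: a mixture goes to the test set if it contains
    at least one held-out component, so no training mixture ever
    includes a held-out component.
    """
    # Collect the unique components in a deterministic order.
    components = sorted({c for mix in mixtures for c in mix})
    rng = random.Random(seed)
    rng.shuffle(components)

    # Hold out a fraction of the components (at least one).
    n_held_out = max(1, round(held_out_fraction * len(components)))
    held_out = set(components[:n_held_out])

    train, test = [], []
    for idx, mix in enumerate(mixtures):
        if held_out.intersection(mix):
            test.append(idx)
        else:
            train.append(idx)
    return train, test, sorted(held_out)
```

Note that `held_out_fraction` here is a fraction of components, not of data points, so the resulting test set size depends on how often the held-out components appear in mixtures.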

🏁 Getting Started

Clone the repository:

git clone https://github.com/chemcognition-lab/chemixhub.git
cd chemixhub

Install dependencies:
```
pip install -e .
```
Explore the datasets:
- Datasets are located in the datasets/ directory. Each has a README.md with specific details.
- Processed data (processed_data.csv and compounds.csv) and data splits are in datasets/<dataset_name>/processed_data/.
(Optional) Run baseline models:
- Scripts for training and evaluation are in scripts/. (See scripts/README.md for more details).
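
As a rough sketch of the exploration step, the files named above can be read with the Python standard library alone. The helper name `load_dataset` is hypothetical, and the code makes no assumption about column names; it relies only on the file locations stated in this README (`processed_data.csv` and `compounds.csv` under `processed_data/`, and `croissant.json` in the dataset folder).

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def load_dataset(repo_root: str, dataset_name: str) -> Dict[str, object]:
    """Load the processed CSV files and Croissant metadata for one dataset.

    Hypothetical helper, not part of the CheMixHub library.
    """
    dataset_dir = Path(repo_root) / "datasets" / dataset_name
    processed_dir = dataset_dir / "processed_data"

    def read_csv(path: Path) -> List[Dict[str, str]]:
        # Each row becomes a dict keyed by the CSV header; values stay strings.
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    with (dataset_dir / "croissant.json").open(encoding="utf-8") as handle:
        metadata = json.load(handle)

    return {
        "mixtures": read_csv(processed_dir / "processed_data.csv"),
        "compounds": read_csv(processed_dir / "compounds.csv"),
        "metadata": metadata,
    }
```

Calling `load_dataset(".", "<dataset_name>")` from the repository root would return the rows of both CSV files plus the parsed metadata; each dataset's own README.md remains the reference for what the columns mean.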

📁 Repository Structure

datasets/: Contains all curated datasets.
config/: Basic config files used to train models.
scripts/: Utility scripts (training, evaluation, feature precomputation, etc.).
src/: Core library code.

Check out the separate READMEs in each folder for more details!

📊 Dataset Details

CheMixHub consolidates and standardizes data from the following sources:

Miscible Solvents: Density, enthalpy of mixing, partial molar enthalpy. Source Paper
ILThermo: Transport properties (ionic conductivity, viscosity) for ionic liquid mixtures. Source Paper | Includes 2 new large-scale tasks. | To fetch the dataset, run datasets/ionic-liquids/raw_data/fetch_ilthermo.py
NIST Viscosity: Viscosity for organic mixtures from NIST ThermoData Engine (via Zenodo). Source Paper | Link to Dataset | Path to the data file in Zenodo: nist_dippr_source/NIST_Visc_Data.csv | To build the processed dataset, download this file, place it in datasets/nist-logV/raw_data, then run datasets/nist-logV/processed_data/data_processing.py
Drug Solubility: Drug solubility in solvent mixtures. Source Paper
Solid Polymer Electrolyte Ionic Conductivity (SPEs): Ionic conductivity for polymer–salt mixtures. Source Paper
Olfactory Similarity: Perceptual similarity scores for mixtures. Source Paper
Motor Octane Number (MON): Octane numbers for hydrocarbons and fuels. Source Paper
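
For the NIST Viscosity entry above, the manual "download and place the file" step can be sketched as a small helper. The function name `stage_nist_file` is hypothetical, and the sketch assumes the processing script looks for the file under its original name in `datasets/nist-logV/raw_data`; that is not confirmed here, so check the dataset's README before relying on it.

```python
import shutil
from pathlib import Path


def stage_nist_file(downloaded_csv: str, repo_root: str) -> Path:
    """Copy a downloaded NIST_Visc_Data.csv into the expected raw_data folder.

    Hypothetical helper: keeps the original file name and creates the
    target directory if it does not exist yet.
    """
    raw_dir = Path(repo_root) / "datasets" / "nist-logV" / "raw_data"
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / Path(downloaded_csv).name
    shutil.copy2(downloaded_csv, target)
    return target
```

Once the file is staged, the processing script named above (datasets/nist-logV/processed_data/data_processing.py) can be run; whether it expects a particular working directory or arguments is not stated in this README.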

Citing CheMixHub

If you use CheMixHub in your research, please cite our paper:

@article{rajaonson2025chemixhub,
  title={CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction},
  author={Rajaonson, Ella Miray and Kochi, Mahyar Rajabi and Mendoza, Luis Martin Mejia and Moosavi, Seyed Mohamad and Sanchez-Lengeling, Benjamin},
  journal={arXiv preprint arXiv:2506.12231},
  year={2025}
}

