Multi-target regression predicting 5 polymer properties from molecular SMILES — NeurIPS 2025 competition.

This project predicts five polymer properties (glass transition temperature, fractional free volume, thermal conductivity, density, and radius of gyration) from molecular SMILES strings for the NeurIPS Open Polymer Prediction 2025 competition. It features three tiers of molecular representations: basic SMILES parsing, RDKit descriptors, and ChemBERTa transformer embeddings.
| Property | RMSE | R² | Samples |
|---|---|---|---|
| Glass transition temp (Tg) | 74.30 °C | 0.535 | 557 |
| Fractional free volume (FFV) | 0.0168 | 0.667 | 7,892 |
| Thermal conductivity (Tc) | 0.0325 W/m·K | 0.832 | 1,611 |
| Density | 0.0939 g/cm³ | 0.580 | 613 |
| Radius of gyration (Rg) | 3.26 Å | 0.490 | 614 |
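For reference, the RMSE and R² figures in the table can be computed as below (a minimal NumPy sketch with placeholder arrays, not the project's actual evaluation code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Placeholder predictions for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```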
Optuna Bayesian optimization yielded a 13.6% improvement on Tc over GridSearchCV.
- Python 3.11+
- Optional: RDKit, PyTorch + transformers (for advanced solutions)
```bash
git clone https://github.com/YOUR_USERNAME/kaggle-polymer-properties.git
cd kaggle-polymer-properties
pip install -r requirements.txt
```

Download competition data from Kaggle into the project root, then run:

```bash
python kaggle_universal_solution.py
```

Project layout:

```
kaggle-polymer-properties/
├── kaggle_universal_solution.py           # Baseline RF + 13 SMILES features
├── kaggle_advanced_optimization_fixed.py  # GridSearchCV + Optuna HPO
├── kaggle_transformer_solution.py         # Full ChemBERTa pipeline (GPU)
├── kaggle_transformer_lite.py             # Lightweight with graceful fallbacks
├── requirements.txt                       # Python dependencies
├── OPTIMAL_HYPERPARAMETERS.md             # Best Optuna results
├── OPTIMIZATION_SUMMARY.md                # GridSearchCV comparison
├── TRANSFORMER_SOLUTION_GUIDE.md          # Transformer implementation guide
├── WORKFLOW_DIAGRAM.md                    # Execution flow diagrams
├── information/                           # Competition documentation
├── train.csv                              # Main training data
├── test.csv                               # Test data
└── train_supplement/                      # Supplementary datasets
```
Data: Polymer SMILES strings with 5 target properties at varying completeness (557-7,892 samples per target). Supplementary datasets add up to 54% more samples for some targets.
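Folding supplementary rows into the main training set can be sketched as follows (the in-memory DataFrames, SMILES, and column names are illustrative stand-ins, not the competition's actual files from `train_supplement/`):

```python
import pandas as pd

# Stand-ins for train.csv and one supplementary file (illustrative data)
train = pd.DataFrame({"SMILES": ["*CC*", "*CC(C)*"], "Tc": [0.21, 0.25]})
supplement = pd.DataFrame({"SMILES": ["*CC(C)*", "*CCO*"], "Tc": [0.26, 0.30]})

# Append supplementary rows, keeping the main dataset's value on SMILES collisions
combined = pd.concat([train, supplement], ignore_index=True)
combined = combined.drop_duplicates(subset="SMILES", keep="first")
print(len(train), "->", len(combined))  # more Tc samples after merging
```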
Features: Three tiers — (1) 13 basic SMILES-derived features (atomic counts, structural patterns, ratios), (2) 25 RDKit molecular descriptors (MW, LogP, TPSA, etc.), (3) 768D ChemBERTa embeddings from a transformer pre-trained on 77M molecules.
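Tier 1 needs no cheminformatics libraries at all. A minimal sketch of the kind of SMILES-derived features involved (the actual 13 features are defined in `kaggle_universal_solution.py`; these counts are illustrative):

```python
def basic_smiles_features(smiles: str) -> dict:
    """Illustrative tier-1 features parsed directly from the SMILES string."""
    return {
        "length": len(smiles),
        # Lowercase letters denote aromatic atoms in SMILES
        "n_carbon": smiles.count("C") + smiles.count("c"),
        "n_oxygen": smiles.count("O") + smiles.count("o"),
        "n_nitrogen": smiles.count("N") + smiles.count("n"),
        # Ring-closure digits appear in pairs (open/close)
        "n_rings": sum(smiles.count(d) for d in "123456789") // 2,
        "n_branches": smiles.count("("),
        "n_double_bonds": smiles.count("="),
        "aromatic_ratio": sum(ch.islower() for ch in smiles) / max(len(smiles), 1),
    }

print(basic_smiles_features("*CC(=O)Oc1ccccc1*"))
```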
Models: Random Forest with Optuna Bayesian optimization (100 trials per target). Four solution variants from baseline to full transformer, each self-contained and auto-detecting data paths.
```bash
# Baseline (works everywhere)
python kaggle_universal_solution.py

# With hyperparameter optimization
python kaggle_advanced_optimization_fixed.py

# Full transformer pipeline (needs GPU + transformers)
python kaggle_transformer_solution.py

# Auto-detects available libraries
python kaggle_transformer_lite.py
```

Training time: ~30 s for the baseline on CPU. The transformer solution requires a GPU.
- Supplementary data matters — Tc gained 54% more samples, improving R² from 0.801 to 0.832
- Optuna >> GridSearchCV — Bayesian optimization found better params in fewer trials (13.6% Tc gain)
- Dataset size is the bottleneck — Tg and Rg (<615 samples) remained limited regardless of model complexity
- Different targets need different configs — Tg needs depth limits (15); FFV works best unlimited
- Graceful fallbacks are essential — Kaggle notebooks may lack transformers/RDKit
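The fallback approach is roughly the standard optional-import idiom (a sketch under assumed tier names, not the exact code in `kaggle_transformer_lite.py`):

```python
# Detect optional dependencies at import time and degrade gracefully
try:
    from rdkit import Chem  # noqa: F401
    HAS_RDKIT = True
except ImportError:
    HAS_RDKIT = False

try:
    import transformers  # noqa: F401
    HAS_TRANSFORMERS = True
except ImportError:
    HAS_TRANSFORMERS = False

def select_feature_tier() -> str:
    """Pick the richest representation the current environment supports."""
    if HAS_TRANSFORMERS:
        return "chemberta_embeddings"
    if HAS_RDKIT:
        return "rdkit_descriptors"
    return "basic_smiles"

print(select_feature_tier())
```

This way the same script runs on a bare Kaggle notebook (tier 1) or a full GPU environment (tier 3) without code changes.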
- NeurIPS Open Polymer Prediction 2025
- ChemBERTa pre-trained molecular model