
Kaggle Polymer Properties

Multi-target regression predicting 5 polymer properties from molecular SMILES — NeurIPS 2025 competition.


Overview

Predicts 5 polymer properties (glass transition temperature, fractional free volume, thermal conductivity, density, radius of gyration) from molecular SMILES strings for the NeurIPS Open Polymer Prediction 2025 competition. Features three tiers of molecular representations: basic SMILES parsing, RDKit descriptors, and ChemBERTa transformer embeddings.

Highlights

| Property | RMSE | R² | Samples |
|---|---|---|---|
| Glass transition temp (Tg) | 74.30 °C | 0.535 | 557 |
| Fractional free volume (FFV) | 0.0168 | 0.667 | 7,892 |
| Thermal conductivity (Tc) | 0.0325 W/m·K | 0.832 | 1,611 |
| Density | 0.0939 g/cm³ | 0.580 | 613 |
| Radius of gyration (Rg) | 3.26 Å | 0.490 | 614 |

Optuna Bayesian optimization yielded a 13.6% improvement on Tc over GridSearchCV.

Getting Started

Prerequisites

  • Python 3.11+
  • Optional: RDKit, PyTorch + transformers (for advanced solutions)

Installation

git clone https://github.com/YOUR_USERNAME/kaggle-polymer-properties.git
cd kaggle-polymer-properties
pip install -r requirements.txt

Download competition data from Kaggle into the project root.

Quick Start

python kaggle_universal_solution.py

Project Structure

kaggle-polymer-properties/
├── kaggle_universal_solution.py          # Baseline RF + 13 SMILES features
├── kaggle_advanced_optimization_fixed.py # GridSearchCV + Optuna HPO
├── kaggle_transformer_solution.py        # Full ChemBERTa pipeline (GPU)
├── kaggle_transformer_lite.py            # Lightweight with graceful fallbacks
├── requirements.txt                      # Python dependencies
├── OPTIMAL_HYPERPARAMETERS.md            # Best Optuna results
├── OPTIMIZATION_SUMMARY.md               # GridSearchCV comparison
├── TRANSFORMER_SOLUTION_GUIDE.md         # Transformer implementation guide
├── WORKFLOW_DIAGRAM.md                   # Execution flow diagrams
├── information/                          # Competition documentation
├── train.csv                             # Main training data
├── test.csv                              # Test data
└── train_supplement/                     # Supplementary datasets

Methodology

Data: Polymer SMILES strings with 5 target properties at varying completeness (557-7,892 samples per target). Supplementary datasets add up to 54% more samples for some targets.
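
Because each target is labeled for a different subset of molecules, a natural pattern is to train one model per property on only the rows where that label exists. A minimal sketch (the `per_target_subsets` helper and target column names are assumptions, not code from the repo):

```python
# Sketch: split sparsely labeled training data into one subset per target,
# keeping only rows where that target's label is present.
import pandas as pd

TARGETS = ["Tg", "FFV", "Tc", "Density", "Rg"]

def per_target_subsets(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Return, for each target column present, the rows with a non-null label."""
    return {t: df[df[t].notna()] for t in TARGETS if t in df.columns}
```

Each subset then feeds its own regressor, which is why sample counts per property range from 557 to 7,892.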

Features: Three tiers — (1) 13 basic SMILES-derived features (atomic counts, structural patterns, ratios), (2) 25 RDKit molecular descriptors (MW, LogP, TPSA, etc.), (3) 768D ChemBERTa embeddings from a transformer pre-trained on 77M molecules.
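
Tier 1 needs no chemistry library at all; it can be approximated with plain string counts over the SMILES text. The sketch below is illustrative only — the helper name and the exact choice of 13 features are assumptions, not necessarily the repo's feature set:

```python
# Illustrative tier-1 featurisation: 13 features computed directly from the
# SMILES string, no molecular parser required.
def basic_smiles_features(smiles: str) -> dict:
    n_cl = smiles.count("Cl")
    n_br = smiles.count("Br")
    # Crude element counts ('C' alone would also match 'Cl', so subtract it).
    n_c = smiles.count("C") - n_cl + smiles.count("c")
    n_n = smiles.count("N") + smiles.count("n")
    n_o = smiles.count("O") + smiles.count("o")
    n_aromatic = sum(ch.islower() and ch in "cnos" for ch in smiles)
    n_rings = sum(ch.isdigit() for ch in smiles) // 2  # ring-closure digits pair up
    heavy = n_c + n_n + n_o + n_cl + n_br
    return {
        "length": len(smiles),
        "n_C": n_c, "n_N": n_n, "n_O": n_o, "n_Cl": n_cl, "n_Br": n_br,
        "n_aromatic": n_aromatic, "n_rings": n_rings,
        "n_branches": smiles.count("("),
        "n_double": smiles.count("="),
        "n_triple": smiles.count("#"),
        "heavy_atoms": heavy,
        "o_ratio": n_o / max(heavy, 1),  # oxygen fraction of heavy atoms
    }
```

String counting of this kind is deliberately crude (it ignores bracketed atoms, for instance), but it runs anywhere, which matters in restricted Kaggle environments.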

Models: Random Forest with Optuna Bayesian optimization (100 trials per target). Four solution variants from baseline to full transformer, each self-contained and auto-detecting data paths.
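
An Optuna study of this shape can be sketched as follows; the search-space bounds and the `rf_objective` name are hypothetical, not taken from the repo's code:

```python
# Sketch of a per-target Optuna objective: sample an RF config, score it by
# cross-validated R², and let the study maximize that score.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def rf_objective(trial, X, y):
    """Return mean 3-fold CV R² for a trial-sampled Random Forest."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 5, 30),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestRegressor(random_state=42, **params)
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

# Usage (requires optuna):
# import optuna
# study = optuna.create_study(direction="maximize")
# study.optimize(lambda t: rf_objective(t, X_train, y_train), n_trials=100)
```

Because Optuna's sampler concentrates trials in promising regions, 100 trials can outperform an exhaustive grid of the same budget — consistent with the 13.6% Tc gain over GridSearchCV reported above.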

Usage

# Baseline (works everywhere)
python kaggle_universal_solution.py

# With hyperparameter optimization
python kaggle_advanced_optimization_fixed.py

# Full transformer pipeline (needs GPU + transformers)
python kaggle_transformer_solution.py

# Auto-detects available libraries
python kaggle_transformer_lite.py

Reproducibility

pip install -r requirements.txt
python kaggle_universal_solution.py

Training time: ~30s for baseline on CPU. Transformer solution requires GPU.

Key Findings

  1. Supplementary data matters — Tc gained 54% more samples, improving R² from 0.801 to 0.832
  2. Optuna >> GridSearchCV — Bayesian optimization found better params in fewer trials (13.6% Tc gain)
  3. Dataset size is the bottleneck — performance on Tg and Rg (<615 samples) remains limited regardless of model complexity
  4. Different targets need different configs — Tg needs depth limits (15); FFV works best unlimited
  5. Graceful fallbacks are essential — Kaggle notebooks may lack transformers/RDKit
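
The fallback pattern in finding 5 amounts to optional imports plus tier selection. A minimal sketch (flag and function names are assumptions, not the repo's identifiers):

```python
# Sketch: probe for optional dependencies at import time, then pick the
# richest feature tier the environment can support.
try:
    from rdkit import Chem  # noqa: F401
    HAS_RDKIT = True
except ImportError:
    HAS_RDKIT = False

try:
    import transformers  # noqa: F401
    HAS_TRANSFORMERS = True
except ImportError:
    HAS_TRANSFORMERS = False

def pick_feature_tier() -> str:
    """Choose the best available representation: ChemBERTa > RDKit > basic."""
    if HAS_TRANSFORMERS:
        return "chemberta"
    if HAS_RDKIT:
        return "rdkit"
    return "basic"
```

With this guard in place, the same script runs on a bare Kaggle kernel (basic string features) or a full GPU environment (ChemBERTa embeddings) without edits.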

Acknowledgments