Regression on synthetic flood data — R² = 0.869 on private leaderboard.
Predicts flood probability from 20 synthetic environmental features for the Kaggle Playground Series S4E5 competition. The key insight is that the target is approximately a weighted sum of all features — giving tree models explicit access to this sum via aggregate features was the single biggest lever.
| Metric | Score |
|---|---|
| Public LB | R² = 0.86911 |
| Private LB | R² = 0.86871 |
| Local CV (5-fold) | R² = 0.86919 |
The CV-to-public-LB gap is just 0.00008, confirming a reliable validation setup.
Impact by phase:
| Phase | Technique | R² Gain |
|---|---|---|
| Feature engineering | Aggregate features (sum, std, skew...) | +0.028 |
| Model selection | LightGBM over Ridge baseline | +0.024 |
| Hyperparameter tuning | lr=0.02, 63 leaves, depth=8 | +0.0001 |
| Ensembling | Ridge stacking on OOF predictions | +0.00002 |
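The tuned LightGBM settings from the table above can be written as a params dict. Only `learning_rate`, `num_leaves`, and `max_depth` come from the table; the remaining keys are illustrative assumptions, not values from this repo:

```python
# Tuned LightGBM settings from the table above, as a params dict.
# learning_rate, num_leaves, max_depth are from the report; the other
# keys are assumed for illustration and may differ from the repo's.
lgbm_params = {
    "objective": "regression",
    "learning_rate": 0.02,   # low lr pairs with many boosting rounds
    "num_leaves": 63,
    "max_depth": 8,
    "n_estimators": 2000,    # assumed, not from the report
    "random_state": 42,      # seeds throughout this repo are fixed at 42
}
print(lgbm_params["num_leaves"], lgbm_params["max_depth"])
```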
- Python 3.11+
```bash
git clone https://github.com/YOUR_USERNAME/kaggle-flood-prediction.git
cd kaggle-flood-prediction
pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy matplotlib
```

Download the competition data from Kaggle into `data/`.
```bash
python train_flood.py --features agg --model lgbm
```

```
kaggle-flood-prediction/
├── train_flood.py            # Main training script (feature eng + k-fold CV)
├── solution.py               # Initial baseline (4 models + ensemble)
├── advanced_experiments.py   # Target encoding, residual boosting
├── ensemble.py               # Weighted averaging + stacking
├── run_azure.py              # Azure ML job submission
├── figures.py                # Publication-quality visualizations
├── REPORT.md                 # Detailed competition report
├── azure_config.example.json # Azure ML config template
├── data/                     # Competition data (gitignored)
├── results/                  # 35 experiment runs (gitignored)
└── figures/                  # Output plots
```
Data: 1.1M training rows, 745k test rows, 20 integer-valued features, continuous target. Linear regression revealed the target is ~0.00565 * sum(all features) + noise, explaining 84.5% of variance.
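The structure check described above can be sketched as follows. Since the competition CSVs are not bundled, simulated data with the stated generating form stands in for them; the coefficient and noise level here are illustrative, so the variance explained will not match the reported 84.5% exactly:

```python
# Sketch of the "regress the target on the feature sum" check.
# The simulated X and noise scale are assumptions standing in for the
# real competition data, which is not included in the repo.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.integers(0, 20, size=(10_000, 20)).astype(float)
y = 0.00565 * X.sum(axis=1) + rng.normal(0, 0.02, size=10_000)

# A single-column regression on sum(all features) already explains
# most of the target's variance
feature_sum = X.sum(axis=1, keepdims=True)
model = LinearRegression().fit(feature_sum, y)
r2 = r2_score(y, model.predict(feature_sum))
print(f"R² of sum-only regression: {r2:.3f}")
```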
Approach: Feature engineering focused on aggregate features (sum, mean, std, min, max, range, skew, kurtosis) to give tree models direct access to the linear component. LightGBM on 31 aggregate features captured 99.97% of the final ensemble's performance.
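A minimal sketch of the aggregate-feature construction, assuming generic column names `f0..f19` (the real competition columns are named differently):

```python
# Row-wise aggregate features over the 20 raw columns, giving tree
# models direct access to the dominant linear component (the sum).
# Column names f0..f19 are placeholders for the competition's columns.
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

def add_aggregate_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    out = df.copy()
    vals = df[cols].to_numpy(dtype=float)
    out["sum"] = vals.sum(axis=1)      # the key feature: ~0.00565 * sum ≈ target
    out["mean"] = vals.mean(axis=1)
    out["std"] = vals.std(axis=1)
    out["min"] = vals.min(axis=1)
    out["max"] = vals.max(axis=1)
    out["range"] = out["max"] - out["min"]
    out["skew"] = skew(vals, axis=1)
    out["kurtosis"] = kurtosis(vals, axis=1)
    return out

cols = [f"f{i}" for i in range(20)]
df = pd.DataFrame(np.random.default_rng(0).integers(0, 10, (5, 20)), columns=cols)
agg = add_aggregate_features(df, cols)
print(agg[["sum", "std", "skew"]])
```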
Validation: 5-fold CV with seed 42. Ensembling via Nelder-Mead optimized weighted average and Ridge stacking on OOF predictions.
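The two ensembling strategies can be sketched as below, with toy OOF arrays standing in for the saved model predictions (the real script's file I/O and fold handling are omitted):

```python
# Sketch of Nelder-Mead weight optimization and Ridge stacking on
# out-of-fold (OOF) predictions. The three simulated "models" are
# stand-ins for the repo's saved OOF outputs.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
y = rng.normal(0.5, 0.1, 1_000)
# Each column: target plus model-specific noise
oof = np.column_stack([y + rng.normal(0, s, 1_000) for s in (0.02, 0.03, 0.05)])

# 1) Weighted average: minimize negative R² over normalized blend weights
def neg_r2(w):
    w = np.abs(w) / np.abs(w).sum()
    return -r2_score(y, oof @ w)

res = minimize(neg_r2, x0=np.ones(3) / 3, method="Nelder-Mead")
weights = np.abs(res.x) / np.abs(res.x).sum()

# 2) Ridge stacking: a small linear meta-model fit on the OOF matrix
stacker = Ridge(alpha=1.0).fit(oof, y)

print("blend weights:", np.round(weights, 3))
print("blend R²:", round(-res.fun, 4))
print("stack R²:", round(r2_score(y, stacker.predict(oof)), 4))
```

In the real pipeline both steps would run on OOF predictions only, so the learned weights are never fit on data the base models saw at training time.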
```bash
# Train with different feature sets
python train_flood.py --features raw --model ridge
python train_flood.py --features agg --model lgbm
python train_flood.py --features full --model xgboost

# Run advanced experiments
python advanced_experiments.py

# Build ensemble from all results
python ensemble.py

# Generate report figures
python figures.py
```

Azure ML is optional — only needed if you want to run training on cloud compute.
```bash
cp azure_config.example.json azure_config.json
# Fill in your Azure subscription_id, resource_group, workspace_name
python run_azure.py
```

```bash
pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy
# Download data into data/
python train_flood.py --features agg --model lgbm
python ensemble.py
```

All random seeds are fixed at 42. Expected runtime: ~4 min for LightGBM 5-fold CV on CPU.
- Understand the data first — 5 min of linear regression was worth more than all subsequent modeling
- Feature engineering >> model selection >> HPO — impact ratio ~280:1:0.02
- Trees need help with linear patterns — an explicit `sum(features)` boosted R² by 0.028
- Simpler feature sets beat complex ones — 31 agg features > 100 full features
- Know when to stop — feature engineering yielded 96% of final score improvement
- Kaggle Playground Series S4E5
- Azure ML for cloud compute (~$0.26 total)