Kaggle Flood Prediction

Regression on synthetic flood data — R² = 0.869 on private leaderboard.


Overview

Predicts flood probability from 20 synthetic environmental features for the Kaggle Playground Series S4E5 competition. The key insight is that the target is approximately a weighted sum of all features — giving tree models explicit access to this sum via aggregate features was the single biggest lever.

Highlights

| Metric             | Score        |
|--------------------|--------------|
| Public LB          | R² = 0.86911 |
| Private LB         | R² = 0.86871 |
| Local CV (5-fold)  | R² = 0.86919 |

CV-to-public-LB gap of just 0.00008, confirming a reliable validation setup.

Impact by phase:

| Phase                | Technique                               | R² Gain  |
|----------------------|-----------------------------------------|----------|
| Feature engineering  | Aggregate features (sum, std, skew, …)  | +0.028   |
| Model selection      | LightGBM over Ridge baseline            | +0.024   |
| Hyperparameter tuning| lr=0.02, 63 leaves, depth=8             | +0.0001  |
| Ensembling           | Ridge stacking on OOF predictions       | +0.00002 |
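For reference, the tuned values in the table map onto LightGBM's parameter names roughly as follows. This is a sketch, not the repo's actual config: the objective and metric entries are assumptions, and the seed follows the README's "all seeds fixed at 42" convention.

```python
# Hypothetical LightGBM config matching the tuned values above.
# objective/metric are assumptions; lr, leaves, depth come from the table.
lgbm_params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.02,  # "lr=0.02" in the table
    "num_leaves": 63,
    "max_depth": 8,
    "seed": 42,             # the README fixes all random seeds at 42
}
```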

Getting Started

Prerequisites

  • Python 3.11+

Installation

git clone https://github.com/YOUR_USERNAME/kaggle-flood-prediction.git
cd kaggle-flood-prediction
pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy matplotlib

Download competition data from Kaggle into data/.

Quick Start

python train_flood.py --features agg --model lgbm

Project Structure

kaggle-flood-prediction/
├── train_flood.py             # Main training script (feature eng + k-fold CV)
├── solution.py                # Initial baseline (4 models + ensemble)
├── advanced_experiments.py    # Target encoding, residual boosting
├── ensemble.py                # Weighted averaging + stacking
├── run_azure.py               # Azure ML job submission
├── figures.py                 # Publication-quality visualizations
├── REPORT.md                  # Detailed competition report
├── azure_config.example.json  # Azure ML config template
├── data/                      # Competition data (gitignored)
├── results/                   # 35 experiment runs (gitignored)
└── figures/                   # Output plots

Methodology

Data: 1.1M training rows, 745k test rows, 20 integer-valued features, continuous target. Linear regression revealed the target is ~0.00565 * sum(all features) + noise, explaining 84.5% of variance.
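The structure check above can be reproduced on a synthetic stand-in (this is illustrative only, not the competition data): if the target really is a scaled feature sum plus noise, a one-variable least-squares fit on the row sum recovers the slope and most of the variance.

```python
import numpy as np

# Synthetic stand-in for the competition data: 20 integer features whose
# scaled sum drives the target, as the linear-regression probe found.
rng = np.random.default_rng(42)
X = rng.integers(0, 20, size=(10_000, 20)).astype(float)
y = 0.00565 * X.sum(axis=1) + rng.normal(0.0, 0.01, size=10_000)

s = X.sum(axis=1)                       # the single aggregate that matters
coef, intercept = np.polyfit(s, y, 1)   # slope recovers ~0.00565
resid = y - (coef * s + intercept)
r2 = 1.0 - resid.var() / y.var()        # near 1 on this synthetic data
```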

Approach: Feature engineering focused on aggregate features (sum, mean, std, min, max, range, skew, kurtosis) to give tree models direct access to the linear component. LightGBM on 31 aggregate features captured 99.97% of the final ensemble's performance.
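The aggregate-feature step might look like the sketch below (function and column names are assumptions, not the repo's actual API): append row-wise statistics so tree models can split directly on the near-linear sum component instead of having to approximate it across 20 raw columns.

```python
import pandas as pd

# Hypothetical helper: append row-wise aggregate statistics
# (sum, mean, std, min, max, range, skew, kurtosis) to the raw features.
def add_agg_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["f_sum"] = df.sum(axis=1)    # the dominant near-linear signal
    out["f_mean"] = df.mean(axis=1)
    out["f_std"] = df.std(axis=1)
    out["f_min"] = df.min(axis=1)
    out["f_max"] = df.max(axis=1)
    out["f_range"] = out["f_max"] - out["f_min"]
    out["f_skew"] = df.skew(axis=1)
    out["f_kurt"] = df.kurt(axis=1)
    return out
```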

Validation: 5-fold CV with seed 42. Ensembling via Nelder-Mead optimized weighted average and Ridge stacking on OOF predictions.
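The weighted-average leg of the ensemble can be sketched as follows (shapes and names are assumptions): search blend weights over the out-of-fold prediction matrix with Nelder-Mead, minimizing squared error, which is equivalent to maximizing R² for a fixed target.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical blend: oof has shape (n_samples, n_models), y is the target.
# Nelder-Mead searches model weights that minimize squared error.
def blend_weights(oof: np.ndarray, y: np.ndarray) -> np.ndarray:
    def loss(w):
        resid = y - oof @ w
        return float(resid @ resid)
    w0 = np.full(oof.shape[1], 1.0 / oof.shape[1])  # start from equal weights
    return minimize(loss, w0, method="Nelder-Mead").x
```

The Ridge-stacking leg is the same idea with a regularized linear model fit on the OOF matrix instead of a direct weight search.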

Usage

# Train with different feature sets
python train_flood.py --features raw --model ridge
python train_flood.py --features agg --model lgbm
python train_flood.py --features full --model xgboost

# Run advanced experiments
python advanced_experiments.py

# Build ensemble from all results
python ensemble.py

# Generate report figures
python figures.py

Configuration

Azure ML is optional — only needed if you want to run training on cloud compute.

cp azure_config.example.json azure_config.json
# Fill in your Azure subscription_id, resource_group, workspace_name
python run_azure.py

Reproducibility

pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy
# Download data into data/
python train_flood.py --features agg --model lgbm
python ensemble.py

All random seeds fixed at 42. Expected runtime: ~4 min for LightGBM 5-fold CV on CPU.

Key Findings

  1. Understand the data first — 5 min of linear regression was worth more than all subsequent modeling
  2. Feature engineering >> model selection >> HPO — impact ratio ~280:1:0.02
  3. Trees need help with linear patterns — explicit sum(features) boosted R² by 0.028
  4. Simpler feature sets beat complex ones — 31 agg features > 100 full features
  5. Know when to stop — feature engineering yielded 96% of final score improvement
