Kaggle Flood Prediction

Regression on synthetic flood data — R² = 0.869 on private leaderboard.


Overview

Predicts flood probability from 20 synthetic environmental features for the Kaggle Playground Series S4E5 competition. The key insight is that the target is approximately a weighted sum of all features — giving tree models explicit access to this sum via aggregate features was the single biggest lever.

Highlights

| Metric             | Score        |
|--------------------|--------------|
| Public LB          | R² = 0.86911 |
| Private LB         | R² = 0.86871 |
| Local CV (5-fold)  | R² = 0.86919 |

CV-to-public-LB gap of just 0.00008, confirming a reliable validation setup.

Impact by phase:

| Phase                | Technique                               | R² Gain  |
|----------------------|-----------------------------------------|----------|
| Feature engineering  | Aggregate features (sum, std, skew, …)  | +0.028   |
| Model selection      | LightGBM over Ridge baseline            | +0.024   |
| Hyperparameter tuning| lr=0.02, 63 leaves, depth=8             | +0.0001  |
| Ensembling           | Ridge stacking on OOF predictions       | +0.00002 |
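For reference, the tuned values in the table map onto LightGBM's parameter names roughly as follows. This is a sketch, not the repo's actual config: the objective and metric entries are assumptions, and the seed follows the README's "all seeds fixed at 42" convention.

```python
# Hypothetical LightGBM config matching the tuned values above.
# objective/metric are assumptions; lr, leaves, depth come from the table.
lgbm_params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.02,  # "lr=0.02" in the table
    "num_leaves": 63,
    "max_depth": 8,
    "seed": 42,             # the README fixes all random seeds at 42
}
```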

Getting Started

Prerequisites

  • Python 3.11+

Installation

git clone https://github.com/YOUR_USERNAME/kaggle-flood-prediction.git
cd kaggle-flood-prediction
pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy matplotlib

Download competition data from Kaggle into data/.

Quick Start

python train_flood.py --features agg --model lgbm

Project Structure

kaggle-flood-prediction/
├── train_flood.py             # Main training script (feature eng + k-fold CV)
├── solution.py                # Initial baseline (4 models + ensemble)
├── advanced_experiments.py    # Target encoding, residual boosting
├── ensemble.py                # Weighted averaging + stacking
├── run_azure.py               # Azure ML job submission
├── figures.py                 # Publication-quality visualizations
├── REPORT.md                  # Detailed competition report
├── azure_config.example.json  # Azure ML config template
├── data/                      # Competition data (gitignored)
├── results/                   # 35 experiment runs (gitignored)
└── figures/                   # Output plots

Methodology

Data: 1.1M training rows, 745k test rows, 20 integer-valued features, continuous target. Linear regression revealed the target is ~0.00565 * sum(all features) + noise, explaining 84.5% of variance.
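The structure check above can be reproduced on a synthetic stand-in (this is illustrative only, not the competition data): if the target really is a scaled feature sum plus noise, a one-variable least-squares fit on the row sum recovers the slope and most of the variance.

```python
import numpy as np

# Synthetic stand-in for the competition data: 20 integer features whose
# scaled sum drives the target, as the linear-regression probe found.
rng = np.random.default_rng(42)
X = rng.integers(0, 20, size=(10_000, 20)).astype(float)
y = 0.00565 * X.sum(axis=1) + rng.normal(0.0, 0.01, size=10_000)

s = X.sum(axis=1)                       # the single aggregate that matters
coef, intercept = np.polyfit(s, y, 1)   # slope recovers ~0.00565
resid = y - (coef * s + intercept)
r2 = 1.0 - resid.var() / y.var()        # near 1 on this synthetic data
```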

Approach: Feature engineering focused on aggregate features (sum, mean, std, min, max, range, skew, kurtosis) to give tree models direct access to the linear component. LightGBM on 31 aggregate features captured 99.97% of the final ensemble's performance.
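The aggregate-feature step might look like the sketch below (function and column names are assumptions, not the repo's actual API): append row-wise statistics so tree models can split directly on the near-linear sum component instead of having to approximate it across 20 raw columns.

```python
import pandas as pd

# Hypothetical helper: append row-wise aggregate statistics
# (sum, mean, std, min, max, range, skew, kurtosis) to the raw features.
def add_agg_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["f_sum"] = df.sum(axis=1)    # the dominant near-linear signal
    out["f_mean"] = df.mean(axis=1)
    out["f_std"] = df.std(axis=1)
    out["f_min"] = df.min(axis=1)
    out["f_max"] = df.max(axis=1)
    out["f_range"] = out["f_max"] - out["f_min"]
    out["f_skew"] = df.skew(axis=1)
    out["f_kurt"] = df.kurt(axis=1)
    return out
```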

Validation: 5-fold CV with seed 42. Ensembling via Nelder-Mead optimized weighted average and Ridge stacking on OOF predictions.
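The weighted-average leg of the ensemble can be sketched as follows (shapes and names are assumptions): search blend weights over the out-of-fold prediction matrix with Nelder-Mead, minimizing squared error, which is equivalent to maximizing R² for a fixed target.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical blend: oof has shape (n_samples, n_models), y is the target.
# Nelder-Mead searches model weights that minimize squared error.
def blend_weights(oof: np.ndarray, y: np.ndarray) -> np.ndarray:
    def loss(w):
        resid = y - oof @ w
        return float(resid @ resid)
    w0 = np.full(oof.shape[1], 1.0 / oof.shape[1])  # start from equal weights
    return minimize(loss, w0, method="Nelder-Mead").x
```

The Ridge-stacking leg is the same idea with a regularized linear model fit on the OOF matrix instead of a direct weight search.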

Usage

# Train with different feature sets
python train_flood.py --features raw --model ridge
python train_flood.py --features agg --model lgbm
python train_flood.py --features full --model xgboost

# Run advanced experiments
python advanced_experiments.py

# Build ensemble from all results
python ensemble.py

# Generate report figures
python figures.py

Configuration

Azure ML is optional — only needed if you want to run training on cloud compute.

cp azure_config.example.json azure_config.json
# Fill in your Azure subscription_id, resource_group, workspace_name
python run_azure.py

Reproducibility

pip install scikit-learn lightgbm xgboost catboost pandas numpy scipy
# Download data into data/
python train_flood.py --features agg --model lgbm
python ensemble.py

All random seeds fixed at 42. Expected runtime: ~4 min for LightGBM 5-fold CV on CPU.

Key Findings

  1. Understand the data first — 5 min of linear regression was worth more than all subsequent modeling
  2. Feature engineering >> model selection >> HPO — impact ratio ~280:1:0.02
  3. Trees need help with linear patterns — explicit sum(features) boosted R² by 0.028
  4. Simpler feature sets beat complex ones — 31 agg features > 100 full features
  5. Know when to stop — feature engineering yielded 96% of final score improvement
