Binary classification for telecom churn — AUC = 0.914 on public leaderboard.
Predicts telecom customer churn for the Kaggle Playground Series S6E3 competition. 75 experiments across 6 model families, distilled into an 11-model greedy ensemble. Raw features consistently beat engineered ones — the gains came from ensemble diversity, not feature engineering.
| Method | CV AUC | Public LB |
|---|---|---|
| Best single model (CatBoost) | 0.91621 | 0.91370 |
| Top-20 average ensemble | 0.91672 | 0.91401 |
| Greedy forward selection (11 models) | 0.91691 | ~0.91410 |
- Python 3.11+
git clone https://github.com/YOUR_USERNAME/kaggle-customer-churn.git
cd kaggle-customer-churn
pip install scikit-learn lightgbm xgboost catboost optuna pandas numpy

Download competition data from Kaggle into data/.
python train_churn.py --features raw --model catboost --seeds 3

kaggle-customer-churn/
├── train_churn.py # Main training script (5-fold CV)
├── run_azure.py # Azure ML job orchestrator
├── advanced_experiments.py # Target/frequency encoding experiments
├── optuna_hpo.py # CatBoost/LightGBM Bayesian HPO
├── lgbm_hpo.py # Focused LightGBM + XGBoost HPO
├── diverse_models.py # HistGBT, ExtraTrees, Ridge diversity
├── fast_ensemble.py # Simple averaging, rank averaging
├── greedy_ensemble.py # Greedy forward selection + stacking
├── pseudo_label.py # Semi-supervised pseudo-labeling
├── figures.py # Visualizations
├── improvement_log.md # Detailed experiment log
├── session_log.md # Azure ML session log
├── azure_config.example.json # Azure config template
├── data/ # Competition data (gitignored)
└── results/ # 75 experiment runs (gitignored)
Data: 594k training rows, 255k test rows, 19 features (demographics, services, billing). Binary target: churn yes/no.
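Many of the 19 features are categorical. In the "raw" treatment they are simply mapped to integer codes and fed to the tree models as-is (no one-hot, no target encoding). A minimal sketch — the category values below are hypothetical, not taken from the actual dataset:

```python
# Minimal label-encoding sketch: each distinct category becomes a stable
# integer code in first-seen order; tree models consume the codes directly.
def label_encode(values):
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

contract = ["month-to-month", "two-year", "month-to-month", "one-year"]
print(label_encode(contract))  # [0, 1, 0, 2]
```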
Approach: Tested 6 model families (CatBoost, LightGBM, XGBoost, HistGBT, ExtraTrees, Ridge) with raw label-encoded features. Bayesian HPO via Optuna (80 LightGBM + 40 XGBoost trials). Final ensemble built via greedy forward selection — iteratively adding models only if they improved CV AUC, selecting 11 from 75 candidates.
Validation: 5-fold stratified CV. OOF predictions saved for offline ensemble experimentation.
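The OOF scheme above can be sketched as follows, assuming scikit-learn; a logistic regression on synthetic data stands in for the actual GBDT models:

```python
# Out-of-fold (OOF) predictions: each sample is scored by the one fold-model
# that never trained on it, so the OOF vector supports unbiased offline
# ensemble experiments.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def oof_predictions(X, y, n_splits=5, seed=42):
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)  # stand-in for CatBoost etc.
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    return oof

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
oof = oof_predictions(X, y)
score = roc_auc_score(y, oof)
print(f"OOF AUC: {score:.3f}")
```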
# Train with different models
python train_churn.py --features raw --model lgbm --seeds 5
python train_churn.py --features raw --model xgboost --seeds 3
# Run Bayesian HPO
python lgbm_hpo.py
python optuna_hpo.py
# Add ensemble diversity
python diverse_models.py
# Build final ensemble
python greedy_ensemble.py

Azure ML is optional.
cp azure_config.example.json azure_config.json
python run_azure.py

pip install scikit-learn lightgbm xgboost catboost optuna pandas numpy
python train_churn.py --features raw --model catboost --seeds 3
python greedy_ensemble.py

Random seed: 42. Multi-seed training (3-5 seeds) for variance estimation.
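The variance-reduction effect of multi-seed training can be illustrated with synthetic predictions — the same "signal" plus independent per-seed noise, then a plain mean:

```python
# Multi-seed averaging sketch: training the same model under several seeds
# and averaging the predicted probabilities cancels seed-dependent noise.
import numpy as np

def average_seeds(pred_per_seed):
    """Mean of per-seed probability vectors -> lower-variance ensemble."""
    return np.mean(pred_per_seed, axis=0)

rng = np.random.default_rng(42)
true_prob = rng.uniform(size=1000)
# Five "seeds": the same underlying signal plus independent noise per seed.
seeds = [np.clip(true_prob + rng.normal(scale=0.1, size=1000), 0, 1)
         for _ in range(5)]
avg = average_seeds(seeds)

single_err = np.abs(seeds[0] - true_prob).mean()
avg_err = np.abs(avg - true_prob).mean()
print(f"single-seed error {single_err:.4f}, 5-seed average {avg_err:.4f}")
```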
- Raw features beat engineered features — trees handled categoricals better as simple label-encoded integers
- Ensemble diversity > individual quality — weak models (HistGBT, ExtraTrees) contributed via architectural diversity
- Greedy selection > simple averaging — selecting 11 from 75 models outperformed blindly averaging top-20
- Multi-seed training is low-hanging fruit — averaging 5 LightGBM seeds already beats the single best model
- HPO needs enough trials — 12 CatBoost trials in a 10-dimensional search space were insufficient; 50+ are needed
- Kaggle Playground Series S6E3
- Azure ML for cloud compute (~$0.27 total)