# Kaggle Customer Churn

Binary classification for telecom churn — AUC = 0.914 on public leaderboard.


## Overview

Predicts telecom customer churn for the Kaggle Playground Series S6E3 competition. 75 experiments across 6 model families, distilled into an 11-model greedy ensemble. Raw features consistently beat engineered ones — the gains came from ensemble diversity, not feature engineering.

## Highlights

| Method | CV AUC | Public LB |
|---|---|---|
| Best single model (CatBoost) | 0.91621 | 0.91370 |
| Top-20 average ensemble | 0.91672 | 0.91401 |
| Greedy forward selection (11 models) | 0.91691 | ~0.91410 |

## Getting Started

### Prerequisites

- Python 3.11+

### Installation

```bash
git clone https://github.com/YOUR_USERNAME/kaggle-customer-churn.git
cd kaggle-customer-churn
pip install scikit-learn lightgbm xgboost catboost optuna pandas numpy
```

Download the competition data from Kaggle into `data/`.

### Quick Start

```bash
python train_churn.py --features raw --model catboost --seeds 3
```

## Project Structure

```
kaggle-customer-churn/
├── train_churn.py             # Main training script (5-fold CV)
├── run_azure.py               # Azure ML job orchestrator
├── advanced_experiments.py    # Target/frequency encoding experiments
├── optuna_hpo.py              # CatBoost/LightGBM Bayesian HPO
├── lgbm_hpo.py                # Focused LightGBM + XGBoost HPO
├── diverse_models.py          # HistGBT, ExtraTrees, Ridge diversity
├── fast_ensemble.py           # Simple averaging, rank averaging
├── greedy_ensemble.py         # Greedy forward selection + stacking
├── pseudo_label.py            # Semi-supervised pseudo-labeling
├── figures.py                 # Visualizations
├── improvement_log.md         # Detailed experiment log
├── session_log.md             # Azure ML session log
├── azure_config.example.json  # Azure config template
├── data/                      # Competition data (gitignored)
└── results/                   # 75 experiment runs (gitignored)
```

## Methodology

**Data:** 594k training rows, 255k test rows, 19 features (demographics, services, billing). Binary target: churn yes/no.

**Approach:** Tested 6 model families (CatBoost, LightGBM, XGBoost, HistGBT, ExtraTrees, Ridge) with raw label-encoded features. Bayesian HPO via Optuna (80 LightGBM + 40 XGBoost trials). The final ensemble was built via greedy forward selection: models were added iteratively only if they improved CV AUC, yielding 11 of 75 candidates.
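The greedy forward-selection step can be sketched as follows. This is an illustrative implementation, not the repo's actual code: it assumes per-model OOF probability arrays keyed by name, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_forward_selection(oof_preds, y, max_models=11):
    """Greedily build an equal-weight blend of OOF predictions.

    oof_preds: dict of model name -> 1-D array of OOF probabilities
    y: true binary labels
    Each round adds the candidate that most improves blend AUC;
    stops when no candidate improves it or max_models is reached.
    """
    selected, running_sum, best_auc = [], None, -np.inf
    for _ in range(max_models):
        best_name, best_trial = None, None
        for name, p in oof_preds.items():
            if name in selected:
                continue
            # Equal-weight average of the current blend plus this candidate
            trial = p if running_sum is None else (running_sum + p) / (len(selected) + 1)
            auc = roc_auc_score(y, trial)
            if auc > best_auc:
                best_name, best_auc, best_trial = name, auc, trial
        if best_name is None:  # no candidate improves CV AUC
            break
        selected.append(best_name)
        running_sum = best_trial * len(selected)  # back to a sum for the next round
    return selected, best_auc
```

Because the first pick is simply the best single model, the blend's AUC can never end up below the best individual model's AUC.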

**Validation:** 5-fold stratified CV. OOF predictions saved for offline ensemble experimentation.

## Usage

```bash
# Train with different models
python train_churn.py --features raw --model lgbm --seeds 5
python train_churn.py --features raw --model xgboost --seeds 3

# Run Bayesian HPO
python lgbm_hpo.py
python optuna_hpo.py

# Add ensemble diversity
python diverse_models.py

# Build final ensemble
python greedy_ensemble.py
```

## Configuration

Azure ML is optional. To use it, copy the config template and launch jobs:

```bash
cp azure_config.example.json azure_config.json
python run_azure.py
```

## Reproducibility

```bash
pip install scikit-learn lightgbm xgboost catboost optuna pandas numpy
python train_churn.py --features raw --model catboost --seeds 3
python greedy_ensemble.py
```

Random seed: 42. Multi-seed training (3-5 seeds) is used for variance estimation.
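A sketch of the multi-seed idea, using ExtraTrees (one of the repo's model families) for illustration; the helper name and defaults are hypothetical, not the training script's API:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def seed_averaged_predictions(X_tr, y_tr, X_te, seeds=(42, 43, 44)):
    """Train one model per seed; return the averaged probabilities
    and the per-row spread across seeds (a cheap variance estimate)."""
    per_seed = np.array([
        ExtraTreesClassifier(n_estimators=200, random_state=s)
        .fit(X_tr, y_tr)
        .predict_proba(X_te)[:, 1]
        for s in seeds
    ])
    return per_seed.mean(axis=0), per_seed.std(axis=0)
```

The averaged predictions smooth out seed-dependent noise, while the per-row standard deviation flags where the model is unstable.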

## Key Findings

1. **Raw features beat engineered features**: trees handled categoricals better as simple label-encoded integers.
2. **Ensemble diversity > individual quality**: weaker models (HistGBT, ExtraTrees) still contributed via architectural diversity.
3. **Greedy selection > simple averaging**: selecting 11 of 75 models outperformed blindly averaging the top 20.
4. **Multi-seed training is low-hanging fruit**: averaging 5 LightGBM seeds already beats the single best model.
5. **HPO needs enough trials**: 12 CatBoost trials in a 10-dimensional space was insufficient; 50+ are needed.
