
# Olist Seller Churn Prediction

Python 3.11+ uv Docker License: MIT

End-to-end machine learning pipeline to predict seller churn for the Olist e-commerce marketplace, enabling proactive retention strategies and measurable revenue protection.


## 🎯 Business Problem

Olist's marketplace depends on an active seller base. High churn erodes GMV and increases acquisition costs. This project builds a dual-stage predictive system that flags at-risk sellers before they churn, giving account managers a prioritised intervention list with estimated revenue impact.

**Key Results** (842 sellers, Jun 2017 – Aug 2018)

| Metric | Value |
|---|---|
| Overall Churn Rate | 85.0% |
| Never-Activated Sellers | 515 (61.2%) – onboarding failure |
| Dormant Sellers | 201 (23.9%) – retention failure |
| Active Sellers | 140 (16.6%) |
| Revenue at Risk | R$ 272,607 |
| High / Critical Risk Sellers | 669 (79.5%) |
| Pre-Activation Model AUC | 0.975 (Logistic Regression) |
| Retention Model AUC | 0.706 (Gradient Boosting) |
| Active Sellers Targeted for Intervention | 33 |
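
The three seller states above follow directly from order history. As an illustration only (this is not the project's actual code, and the `seller_id` / `order_date` column names are assumptions), the labelling logic can be sketched as:

```python
import pandas as pd

DORMANCY_DAYS = 60  # threshold the project uses for "dormant"

def label_seller_status(sellers: pd.DataFrame, orders: pd.DataFrame,
                        cutoff: pd.Timestamp) -> pd.DataFrame:
    """Classify each seller as never_activated, dormant, or active.

    Assumes `sellers` has a `seller_id` column and `orders` has
    `seller_id` + `order_date` columns (hypothetical names).
    """
    last_order = orders.groupby("seller_id")["order_date"].max()
    out = sellers.copy()
    out["last_order"] = out["seller_id"].map(last_order)
    days_silent = (cutoff - out["last_order"]).dt.days
    out["status"] = "active"
    out.loc[days_silent > DORMANCY_DAYS, "status"] = "dormant"
    out.loc[out["last_order"].isna(), "status"] = "never_activated"
    return out
```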

πŸ—οΈ Architecture

Raw CSVs
   β”‚
   β–Ό
DataLoader ──► DataPreprocessor ──► ChurnAnalyzer (labels + cohorts)
                                          β”‚
                                          β–Ό
                                  FeatureEngineer
                                   /            \
                        Pre-Activation         Retention
                          Features              Features
                              β”‚                    β”‚
                              β–Ό                    β–Ό
                          ChurnModeler ──────► ChurnModeler
                       (LogReg / RF / GBM)  (LogReg / RF / GBM)
                              β”‚                    β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β–Ό
                                 ModelEvaluator
                              (ROC Β· PR Β· CM Β· FI)
                                       β”‚
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β–Ό                         β–Ό
                   Risk Scoring              InsightsReporter
                (overall_churn_risk)    (churn_insights_report.md)
                          β”‚
                          β–Ό
               InterventionPrioritizer
              (intervention_priority_list.csv)
                          β”‚
                          β–Ό
               πŸ“Š Interactive Dashboard
              (dashboard/index.html)

## 🚀 Quick Start

### Option A: Local (uv)

```bash
# 1. Clone and enter the project
git clone <repo>
cd olist-ecommerce

# 2. Copy and configure environment variables
cp .env.example .env   # set DATA_PATH to your Olist CSV folder

# 3. Install dependencies (uv required)
make install

# 4. Run the full pipeline
make run-pipeline

# 5. Generate the interactive dashboard
make dashboard         # → opens dashboard/index.html

# 6. View generated reports
ls outputs/            # seller_master, risk_scores, segments, cohorts
ls outputs/figures/    # ROC, PR, confusion matrix, feature importance
```

### Option B: Docker

```bash
# 1. Clone and configure env (same as above)
git clone <repo>
cd olist-ecommerce
cp .env.example .env

# 2. Build the image (one-time, ~2 min)
make docker-build

# 3. Run the full training pipeline
make docker-pipeline

# 4. Launch the MLflow UI → http://localhost:5000
make docker-mlflow
```

Raw CSVs are read from `./data/raw/` on your host via a bind mount; no copying into the image is required.


## 🐳 Docker

The project ships a multi-stage `Dockerfile` and a `docker-compose.yml` that orchestrates three independent services.

### How it works

```text
┌──────────────────────────────────────────────────────────┐
│  builder stage (python:3.10-slim + build-essential)      │
│   └─ uv sync → resolves & installs deps into .venv       │
└────────────────────────────┬─────────────────────────────┘
                             │  COPY .venv only
┌────────────────────────────▼─────────────────────────────┐
│  runtime stage (python:3.10-slim, no compiler tools)     │
│   ├─ runs as non-root user (appuser)                     │
│   ├─ bind mount: ./data  → /app/data  (host CSVs)        │
│   └─ named volumes: outputs · models · mlruns · logs     │
└──────────────────────────────────────────────────────────┘
```

### Services

| Service | Profile | Description |
|---|---|---|
| `pipeline` | `pipeline` | Runs the full training pipeline |
| `inference` | `inference` | Scores sellers with saved models |
| `mlflow` | `mlflow` | Experiment tracking UI on port 5000 |

Services use Compose profiles, so nothing starts by default; activate the one you need.

### Commands

```bash
make docker-build      # Build the image (needed once, or after dep changes)
make docker-pipeline   # Train: reads ./data/raw, writes to volumes
make docker-inference  # Score: uses saved models from the models volume
make docker-mlflow     # Start MLflow UI → http://localhost:5000
make docker-down       # Stop & remove containers
make docker-clean      # ⚠ Remove containers, volumes AND the image
```

### When to rebuild

You only need `make docker-build` after:

- Changing `Dockerfile`
- Updating `pyproject.toml` or `uv.lock` (dependency changes)
- Modifying files in `src/`, `scripts/`, or `config/`

Changes to `docker-compose.yml` never require a rebuild.


## 📊 Interactive Dashboard

After running the pipeline, generate a self-contained stakeholder dashboard:

```bash
make dashboard
# → dashboard/index.html
```

The dashboard is a single HTML file (no server required) with five tabs:

| Tab | Contents |
|---|---|
| 🏠 Overview | 6 KPI cards · key insight pills · status & risk donuts |
| 📅 Cohort Analysis | Monthly churn vs activation trend · GMV by cohort |
| 🗂️ Segmentation | Churn by business segment · lead type · behaviour profile · state |
| 🤖 Model Performance | Metric bars per model · cross-model comparison chart · pipeline architecture |
| 🎯 Interventions | Searchable & filterable priority table for 30 high-risk sellers |

Open `dashboard/index.html` in any browser; no dependencies, no server.
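
Under the hood, `scripts/generate_dashboard.py` reads the `outputs/` CSVs and inlines their data into `dashboard/template.html`. A minimal illustration of that single-file pattern (the `__DATA__` placeholder is hypothetical, not the template's real marker):

```python
import json

def render_dashboard(template: str, data: dict) -> str:
    """Inline pipeline outputs into a single-file HTML dashboard.

    The real script reads outputs/*.csv; here `data` is any
    JSON-serialisable dict, and `__DATA__` is an assumed placeholder.
    """
    return template.replace("__DATA__", json.dumps(data))

# Usage with a tiny template standing in for dashboard/template.html
template = "<script>const DATA = __DATA__;</script>"
html = render_dashboard(template, {"churn_rate": 0.85})
```

Because everything is embedded at generation time, the resulting file needs no server or network access, which is what makes it easy to email to stakeholders.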


πŸ“ Project Structure

olist-ecommerce/
β”œβ”€β”€ πŸ“ config/
β”‚   └── settings.py              # Centralised config via pydantic-settings (.env)
β”œβ”€β”€ πŸ“ src/                      # Core library β€” importable modules
β”‚   β”œβ”€β”€ pipeline.py              # End-to-end orchestrator + all pipeline classes
β”‚   β”œβ”€β”€ features.py              # Feature engineering (pre-activation + retention)
β”‚   β”œβ”€β”€ models.py                # Model training (LogReg, RF, GBM comparison)
β”‚   β”œβ”€β”€ evaluation.py            # Evaluation + chart generation β†’ outputs/figures/
β”‚   β”œβ”€β”€ reports.py               # Stakeholder Markdown report generation
β”‚   └── validation/
β”‚       └── schemas.py           # Pydantic data validation schemas
β”œβ”€β”€ πŸ“ scripts/
β”‚   β”œβ”€β”€ run_pipeline.py          # Entry point: runs src/pipeline.main()
β”‚   └── generate_dashboard.py    # Reads outputs/ CSVs β†’ dashboard/index.html
β”œβ”€β”€ πŸ“ dashboard/                # Dashboard source (index.html is gitignored)
β”‚   └── template.html            # HTML/JS/CSS template (Chart.js, dark theme)
β”œβ”€β”€ πŸ“ notebooks/
β”‚   └── poc_churn_analysis.ipynb # Exploratory analysis
β”œβ”€β”€ πŸ“ tests/
β”‚   └── test_pipeline.py         # Unit tests
β”œβ”€β”€ πŸ“ docs/                     # Project documentation
β”‚   β”œβ”€β”€ dataset.md               # Dataset description
β”‚   β”œβ”€β”€ glossary.md              # Domain glossary
β”‚   β”œβ”€β”€ project.md               # Architecture & design notes
β”‚   └── poc.md                   # POC findings
β”œβ”€β”€ πŸ“ outputs/                  # ← Generated on pipeline run (gitignored)
β”‚   β”œβ”€β”€ figures/                 # Charts: ROC, PR, confusion matrix, feature importance
β”‚   β”œβ”€β”€ seller_master.csv        # Full seller dataset with churn labels
β”‚   β”œβ”€β”€ seller_risk_scores.csv   # Per-seller risk scores
β”‚   β”œβ”€β”€ cohort_analysis.csv      # Monthly cohort stats
β”‚   β”œβ”€β”€ segment_analysis_*.csv   # Churn by segment, lead type, state, profile
β”‚   β”œβ”€β”€ intervention_priority_list.csv
β”‚   └── analysis_summary.txt
β”œβ”€β”€ πŸ“ models/                   # ← Generated on pipeline run (gitignored)
β”‚   β”œβ”€β”€ pre_activation_model.joblib
β”‚   └── retention_model.joblib
β”œβ”€β”€ πŸ“ data/                     # (gitignored) β€” place Olist CSVs here
β”‚   └── raw/
β”œβ”€β”€ .env.example                 # Required environment variables template
β”œβ”€β”€ Dockerfile                   # Multi-stage image (builder β†’ runtime)
β”œβ”€β”€ docker-compose.yml           # pipeline Β· inference Β· mlflow services
β”œβ”€β”€ .dockerignore                # Keeps build context lean
β”œβ”€β”€ Makefile                     # Developer shortcuts (local + Docker)
└── requirements.txt             # Pinned dependencies

## 🤖 Model Details

### Stage 1: Pre-Activation Model

Predicts whether a newly onboarded seller will never make a sale (61% of the dataset).

| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression ✅ | 0.975 | 0.930 | 0.944 |
| Random Forest | 0.962 | 0.937 | 0.947 |
| Gradient Boosting | 0.953 | 0.918 | 0.933 |
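
The three-model bake-off behind these tables (ChurnModeler trains Logistic Regression, Random Forest, and Gradient Boosting and keeps the best AUC) can be sketched with scikit-learn on synthetic data. This is a sketch only; the hyperparameters are illustrative, not the project's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered pre-activation features
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
    "gbm": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the churn class
    scores[name] = roc_auc_score(y_te, proba)

best = max(scores, key=scores.get)  # champion model by held-out AUC
```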

### Stage 2: Retention Model

Predicts whether an activated seller will go dormant (60+ days without an order).

| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression | 0.647 | 0.597 | 0.638 |
| Random Forest | 0.673 | 0.597 | 0.638 |
| Gradient Boosting ✅ | 0.706 | 0.645 | 0.694 |
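
Downstream, risk scoring merges the two stages into a single `overall_churn_risk` per seller. A plausible routing sketch (the exact combination scheme and the `has_activated` column name are assumptions, not the project's confirmed logic):

```python
import numpy as np
import pandas as pd

def overall_churn_risk(sellers: pd.DataFrame,
                       p_never_activate: np.ndarray,
                       p_go_dormant: np.ndarray) -> pd.Series:
    """Route each seller to the relevant stage model's probability.

    Pre-activation sellers (no sales yet) get the Stage 1 score;
    activated sellers get the Stage 2 score.
    """
    risk = np.where(sellers["has_activated"], p_go_dormant, p_never_activate)
    return pd.Series(risk, index=sellers.index, name="overall_churn_risk")

def risk_band(score: float) -> str:
    """Bucket a score into a band. Thresholds here are hypothetical."""
    if score >= 0.8:
        return "critical"
    if score >= 0.6:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```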

Charts for each model (ROC curve, Precision-Recall, Confusion Matrix, Feature Importance) are saved to `outputs/figures/` on every run.


πŸ› οΈ Developer Commands

Local

make install           # Install dependencies via uv
make run-pipeline      # Run the full end-to-end pipeline
make run-inference     # Score sellers using saved models
make dashboard         # Generate dashboard/index.html from latest outputs
make test              # Run unit tests
make test-coverage     # Run tests with HTML coverage report
make lint              # flake8 + black + isort + bandit checks
make format            # Auto-format with black + isort
make ci-check          # Full local CI simulation (lint β†’ typecheck β†’ tests)
make pre-commit-run    # Run all pre-commit hooks over the codebase
make mlflow-ui         # Open MLflow UI at http://localhost:5000
make clean             # Remove __pycache__, .pytest_cache, htmlcov
make clean-outputs     # Remove generated CSVs, reports, and figures
make setup-dirs        # Create required directories from scratch

### Docker

```bash
make docker-build      # Build the Docker image
make docker-pipeline   # Run training pipeline in a container
make docker-inference  # Run inference in a container
make docker-mlflow     # Start MLflow UI at http://localhost:5000
make docker-down       # Stop & remove containers
make docker-clean      # ⚠ Remove containers, volumes AND the image
```

βš™οΈ Configuration

All settings are managed through config/settings.py and read from .env. No hardcoded paths anywhere in the codebase.

# .env
DATA_PATH=./data/raw          # Olist raw CSV files
OUTPUT_PATH=./outputs         # Generated CSVs and text files
MODELS_PATH=./models          # Saved .joblib models
FIGURES_PATH=./outputs/figures # Charts and plots

See .env.example for the full list of configurable values.
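
The project's `config/settings.py` does this with pydantic-settings. To avoid asserting that library's exact API here, a dependency-free stdlib sketch of the same pattern (field names mirror the variables above; the defaults are illustrative):

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

@dataclass(frozen=True)
class Settings:
    """Stdlib stand-in for the pydantic-settings class in config/settings.py.

    Each field falls back to a default when its env var is unset,
    so the pipeline runs out of the box yet stays configurable.
    """
    data_path: Path = field(
        default_factory=lambda: Path(os.getenv("DATA_PATH", "./data/raw")))
    output_path: Path = field(
        default_factory=lambda: Path(os.getenv("OUTPUT_PATH", "./outputs")))
    models_path: Path = field(
        default_factory=lambda: Path(os.getenv("MODELS_PATH", "./models")))
    figures_path: Path = field(
        default_factory=lambda: Path(os.getenv("FIGURES_PATH", "./outputs/figures")))

settings = Settings()  # importable singleton, as in the real config module
```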


## 🧪 Testing

```bash
make test             # Run all unit tests
make test-coverage    # Run with coverage (open htmlcov/index.html)
```

## 📧 Contact

Cairo Cananea