End-to-end machine learning pipeline to predict seller churn for the Olist e-commerce marketplace, enabling proactive retention strategies and measurable revenue protection.
Olist's marketplace depends on an active seller base. High churn erodes GMV and increases acquisition costs. This project builds a dual-stage predictive system that flags at-risk sellers before they churn, giving account managers a prioritised intervention list with estimated revenue impact.
| Metric | Value |
|---|---|
| Overall Churn Rate | 85.0% |
| Never-Activated Sellers | 515 (61.2%) – onboarding failure |
| Dormant Sellers | 201 (23.9%) – retention failure |
| Active Sellers | 140 (16.6%) |
| Revenue at Risk | R$ 272,607 |
| High / Critical Risk Sellers | 669 (79.5%) |
| Pre-Activation Model AUC | 0.975 (Logistic Regression) |
| Retention Model AUC | 0.706 (Gradient Boosting) |
| Active Sellers Targeted for Intervention | 33 |
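The three seller statuses above can be derived with a labelling rule along these lines. This is a hypothetical sketch: the column names (`seller_id`, `purchase_ts`) are illustrative, not the project's actual schema; the 60-day dormancy window matches the retention definition used in this project.

```python
import pandas as pd

def label_sellers(sellers: pd.DataFrame, orders: pd.DataFrame,
                  dormancy_days: int = 60) -> pd.DataFrame:
    """Classify each seller as never_activated, dormant, or active."""
    ref_date = orders["purchase_ts"].max()            # analysis cut-off date
    last_sale = orders.groupby("seller_id")["purchase_ts"].max()
    out = sellers.copy()
    out["last_sale"] = out["seller_id"].map(last_sale)
    out["status"] = "active"
    days_quiet = (ref_date - out["last_sale"]).dt.days
    out.loc[days_quiet > dormancy_days, "status"] = "dormant"      # retention failure
    out.loc[out["last_sale"].isna(), "status"] = "never_activated" # onboarding failure
    return out

# Tiny illustrative example
sellers = pd.DataFrame({"seller_id": ["a", "b", "c"]})
orders = pd.DataFrame({
    "seller_id": ["a", "a", "b"],
    "purchase_ts": pd.to_datetime(["2018-08-01", "2018-08-20", "2018-05-01"]),
})
labeled = label_sellers(sellers, orders)
```

Here seller `a` sold on the cut-off date (active), `b` has been quiet for over 60 days (dormant), and `c` never sold at all (never_activated).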
```
Raw CSVs
    │
    ▼
DataLoader ──▶ DataPreprocessor ──▶ ChurnAnalyzer (labels + cohorts)
                                          │
                                          ▼
                                   FeatureEngineer
                                    /           \
                        Pre-Activation           Retention
                           Features              Features
                              │                      │
                              ▼                      ▼
                         ChurnModeler           ChurnModeler
                     (LogReg / RF / GBM)    (LogReg / RF / GBM)
                              │                      │
                              └──────────┬───────────┘
                                         ▼
                                   ModelEvaluator
                                (ROC · PR · CM · FI)
                                         │
                           ┌─────────────┴─────────────┐
                           ▼                           ▼
                     Risk Scoring               InsightsReporter
                 (overall_churn_risk)     (churn_insights_report.md)
                           │
                           ▼
                InterventionPrioritizer
             (intervention_priority_list.csv)
                           │
                           ▼
                  Interactive Dashboard
                 (dashboard/index.html)
```
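The tail of the pipeline (risk scoring feeding the intervention list) can be sketched as follows. The column names and the expected-revenue-loss scoring rule are illustrative assumptions, not the project's actual API.

```python
import pandas as pd

def prioritize(risk_scores: pd.DataFrame, top_n: int = 30) -> pd.DataFrame:
    """Rank sellers by expected revenue loss = P(churn) x trailing revenue."""
    ranked = risk_scores.copy()
    ranked["revenue_at_risk"] = ranked["churn_prob"] * ranked["trailing_gmv"]
    return (ranked.sort_values("revenue_at_risk", ascending=False)
                  .head(top_n)
                  .reset_index(drop=True))

# Illustrative data: the highest churn probability is not necessarily
# the highest priority once revenue is factored in.
scores = pd.DataFrame({
    "seller_id": ["s1", "s2", "s3"],
    "churn_prob": [0.9, 0.4, 0.7],
    "trailing_gmv": [1000.0, 5000.0, 2000.0],
})
top = prioritize(scores, top_n=2)
```

Ranking by expected loss rather than raw probability is what turns model output into an account-manager-friendly intervention list.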
```bash
# 1. Clone and enter the project
git clone <repo>
cd olist-ecommerce

# 2. Copy and configure environment variables
cp .env.example .env   # set DATA_PATH to your Olist CSV folder

# 3. Install dependencies (uv required)
make install

# 4. Run the full pipeline
make run-pipeline

# 5. Generate the interactive dashboard
make dashboard         # → opens dashboard/index.html

# 6. View generated reports
ls outputs/            # seller_master, risk_scores, segments, cohorts
ls outputs/figures/    # ROC, PR, confusion matrix, feature importance
```

Docker quick start:

```bash
# 1. Clone and configure env (same as above)
git clone <repo>
cd olist-ecommerce
cp .env.example .env

# 2. Build the image (one-time, ~2 min)
make docker-build

# 3. Run the full training pipeline
make docker-pipeline

# 4. Launch the MLflow UI → http://localhost:5000
make docker-mlflow
```

Raw CSVs are read from `./data/raw/` on your host via a bind mount, so no copying into the image is required.
The project ships a multi-stage Dockerfile and a docker-compose.yml that orchestrates three independent services.
```
┌───────────────────────────────────────────────────────────┐
│ builder stage (python:3.10-slim + build-essential)        │
│  └─ uv sync → resolves & installs deps into .venv         │
└─────────────────────────────┬─────────────────────────────┘
                              │ COPY .venv only
┌─────────────────────────────▼─────────────────────────────┐
│ runtime stage (python:3.10-slim, no compiler tools)       │
│  ├─ runs as non-root user (appuser)                       │
│  ├─ bind mount: ./data → /app/data (host CSVs)            │
│  └─ named volumes: outputs · models · mlruns · logs       │
└───────────────────────────────────────────────────────────┘
```
| Service | Profile | Description |
|---|---|---|
| `pipeline` | `pipeline` | Runs the full training pipeline |
| `inference` | `inference` | Scores sellers with saved models |
| `mlflow` | `mlflow` | Experiment tracking UI on port 5000 |
Services use Compose profiles, so nothing starts by default. Activate the one you need.
```bash
make docker-build      # Build the image (needed once, or after dep changes)
make docker-pipeline   # Train – reads ./data/raw, writes to volumes
make docker-inference  # Score – uses saved models from the models volume
make docker-mlflow     # Start MLflow UI → http://localhost:5000
make docker-down       # Stop & remove containers
make docker-clean      # ⚠ Remove containers, volumes AND the image
```

You only need `make docker-build` after:

- Changing `Dockerfile`
- Updating `pyproject.toml` or `uv.lock` (dependency changes)
- Modifying files in `src/`, `scripts/`, or `config/`

Changes to `docker-compose.yml` never require a rebuild.
After running the pipeline, generate a self-contained stakeholder dashboard:
```bash
make dashboard
# → dashboard/index.html
```

The dashboard is a single HTML file (no server required) with five tabs:

| Tab | Contents |
|---|---|
| Overview | 6 KPI cards · key insight pills · status & risk donuts |
| Cohort Analysis | Monthly churn vs activation trend · GMV by cohort |
| Segmentation | Churn by business segment · lead type · behaviour profile · state |
| Model Performance | Metric bars per model · cross-model comparison chart · pipeline architecture |
| Interventions | Searchable & filterable priority table for 30 high-risk sellers |

Open `dashboard/index.html` in any browser: no dependencies, no server.
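Under the hood, `scripts/generate_dashboard.py` reads the pipeline's `outputs/` CSVs and produces `dashboard/index.html` from the template. A minimal sketch of that pattern is shown below; the placeholder token, file contents, and column names are illustrative assumptions, not the project's actual template.

```python
import json
import pandas as pd

def render_dashboard(template_html: str, risk_scores: pd.DataFrame) -> str:
    """Serialise pipeline output to JSON and splice it into an HTML template,
    producing a single self-contained file that needs no server."""
    payload = json.dumps(risk_scores.to_dict(orient="records"))
    return template_html.replace("/*__DATA__*/", f"const DATA = {payload};")

# Illustrative template and data
template = "<html><script>/*__DATA__*/</script></html>"
scores = pd.DataFrame({"seller_id": ["s1"], "risk": [0.87]})
html = render_dashboard(template, scores)
```

Embedding the data at build time is what makes the dashboard portable: stakeholders can open the file directly, with no Python environment on their side.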
```
olist-ecommerce/
├── config/
│   └── settings.py               # Centralised config via pydantic-settings (.env)
├── src/                          # Core library – importable modules
│   ├── pipeline.py               # End-to-end orchestrator + all pipeline classes
│   ├── features.py               # Feature engineering (pre-activation + retention)
│   ├── models.py                 # Model training (LogReg, RF, GBM comparison)
│   ├── evaluation.py             # Evaluation + chart generation → outputs/figures/
│   ├── reports.py                # Stakeholder Markdown report generation
│   └── validation/
│       └── schemas.py            # Pydantic data validation schemas
├── scripts/
│   ├── run_pipeline.py           # Entry point: runs src/pipeline.main()
│   └── generate_dashboard.py     # Reads outputs/ CSVs → dashboard/index.html
├── dashboard/                    # Dashboard source (index.html is gitignored)
│   └── template.html             # HTML/JS/CSS template (Chart.js, dark theme)
├── notebooks/
│   └── poc_churn_analysis.ipynb  # Exploratory analysis
├── tests/
│   └── test_pipeline.py          # Unit tests
├── docs/                         # Project documentation
│   ├── dataset.md                # Dataset description
│   ├── glossary.md               # Domain glossary
│   ├── project.md                # Architecture & design notes
│   └── poc.md                    # POC findings
├── outputs/                      # Generated on pipeline run (gitignored)
│   ├── figures/                  # Charts: ROC, PR, confusion matrix, feature importance
│   ├── seller_master.csv         # Full seller dataset with churn labels
│   ├── seller_risk_scores.csv    # Per-seller risk scores
│   ├── cohort_analysis.csv       # Monthly cohort stats
│   ├── segment_analysis_*.csv    # Churn by segment, lead type, state, profile
│   ├── intervention_priority_list.csv
│   └── analysis_summary.txt
├── models/                       # Generated on pipeline run (gitignored)
│   ├── pre_activation_model.joblib
│   └── retention_model.joblib
├── data/                         # (gitignored) – place Olist CSVs here
│   └── raw/
├── .env.example                  # Required environment variables template
├── Dockerfile                    # Multi-stage image (builder → runtime)
├── docker-compose.yml            # pipeline · inference · mlflow services
├── .dockerignore                 # Keeps build context lean
├── Makefile                      # Developer shortcuts (local + Docker)
└── requirements.txt              # Pinned dependencies
```
Predicts whether a newly onboarded seller will never make a sale (61% of the dataset).
| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression ✓ | 0.975 | 0.930 | 0.944 |
| Random Forest | 0.962 | 0.937 | 0.947 |
| Gradient Boosting | 0.953 | 0.918 | 0.933 |
Predicts whether an activated seller will go dormant (60+ days without an order).
| Model | AUC-ROC | Accuracy | F1 |
|---|---|---|---|
| Logistic Regression | 0.647 | 0.597 | 0.638 |
| Random Forest | 0.673 | 0.597 | 0.638 |
| Gradient Boosting ✓ | 0.706 | 0.645 | 0.694 |
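The three-way comparison can be reproduced in miniature with scikit-learn on synthetic data. This is illustrative only: the real pipeline trains on engineered seller features, so these AUCs will not match the tables above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered seller features
X, y = make_classification(n_samples=800, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
}

# Train each model and compare held-out AUC-ROC on predicted probabilities
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
```

Selecting the winner by AUC-ROC (rather than accuracy) is the sensible choice here, since the churn classes are imbalanced and the downstream consumer is a ranked risk list.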
Charts for each model (ROC curve, Precision-Recall, Confusion Matrix, Feature Importance) are saved to `outputs/figures/` on every run.
```bash
make install          # Install dependencies via uv
make run-pipeline     # Run the full end-to-end pipeline
make run-inference    # Score sellers using saved models
make dashboard        # Generate dashboard/index.html from latest outputs
make test             # Run unit tests
make test-coverage    # Run tests with HTML coverage report
make lint             # flake8 + black + isort + bandit checks
make format           # Auto-format with black + isort
make ci-check         # Full local CI simulation (lint → typecheck → tests)
make pre-commit-run   # Run all pre-commit hooks over the codebase
make mlflow-ui        # Open MLflow UI at http://localhost:5000
make clean            # Remove __pycache__, .pytest_cache, htmlcov
make clean-outputs    # Remove generated CSVs, reports, and figures
make setup-dirs       # Create required directories from scratch
```

```bash
make docker-build     # Build the Docker image
make docker-pipeline  # Run training pipeline in a container
make docker-inference # Run inference in a container
make docker-mlflow    # Start MLflow UI at http://localhost:5000
make docker-down      # Stop & remove containers
make docker-clean     # ⚠ Remove containers, volumes AND the image
```

All settings are managed through `config/settings.py` and read from `.env`.
No hardcoded paths anywhere in the codebase.

```bash
# .env
DATA_PATH=./data/raw            # Olist raw CSV files
OUTPUT_PATH=./outputs           # Generated CSVs and text files
MODELS_PATH=./models            # Saved .joblib models
FIGURES_PATH=./outputs/figures  # Charts and plots
```

See `.env.example` for the full list of configurable values.
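A minimal sketch of the pydantic-settings pattern described above; the field names mirror the `.env` keys, while the defaults shown are illustrative and the project's actual `config/settings.py` may differ.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values in .env (or real environment variables) override these defaults
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    DATA_PATH: str = "./data/raw"
    OUTPUT_PATH: str = "./outputs"
    MODELS_PATH: str = "./models"
    FIGURES_PATH: str = "./outputs/figures"

settings = Settings()
```

Centralising paths this way is what lets the same code run locally and inside the Docker containers: only the `.env` (or the container's environment) changes, never the source.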
```bash
make test             # Run all unit tests
make test-coverage    # Run with coverage (open htmlcov/index.html)
```

Cairo Cananea
- Blog: cairocananea.com.br
- LinkedIn: Cairo Cananea
- GitHub: Cairo Cananea