This repository presents a research-oriented, reproducible default prediction study over two heterogeneous credit-risk problems:
- LendingClub installment loans (US lending platform data)
- UCI credit card default benchmark (Taiwan card clients)
The objective is not just to train a classifier, but to show a full machine-learning research workflow that is audit-friendly and replication-ready:
- clear EDA and data diagnostics,
- leakage-aware preprocessing and feature engineering,
- model training and hyperparameter tuning in notebooks,
- fixed best-parameter pipelines in standalone Python scripts,
- artifact outputs that can be re-generated from code.
- `pipelines/card_pipeline.py`: End-to-end UCI credit-card default pipeline.
  - Trains Logistic Regression, Random Forest, and XGBoost (if available).
  - Writes `results/card/metrics.json` plus model diagnostics under `results/card/plots/` and `results/card/predictions/`.
- `pipelines/loan_pipeline.py`: End-to-end LendingClub loan default pipeline.
  - Chunked loading for the large raw CSV.
  - Terminal-status target mapping with leakage-safe defaults.
  - Trains Logistic Regression, Random Forest, and XGBoost (if available).
  - Writes `results/loan/metrics.json` plus model diagnostics under `results/loan/plots/` and `results/loan/predictions/`.
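The chunked loading and terminal-status mapping could be sketched as follows. This is illustrative, not the pipeline's exact code: the `loan_status` column name, the status strings, and the file path are assumptions about the LendingClub schema.

```python
import pandas as pd

# Hypothetical terminal-status mapping: only loans with a final outcome
# receive a label; in-flight statuses (e.g. "Current") are dropped so no
# post-outcome information leaks into training.
TERMINAL_STATUS = {
    "Fully Paid": 0,
    "Charged Off": 1,
    "Default": 1,
}

def load_loans(path, max_rows=30_000, chunksize=10_000):
    """Stream the raw CSV in chunks, keeping only terminal-status rows."""
    kept, n = [], 0
    for chunk in pd.read_csv(path, chunksize=chunksize, low_memory=False):
        chunk = chunk[chunk["loan_status"].isin(TERMINAL_STATUS)].copy()
        chunk["target"] = chunk["loan_status"].map(TERMINAL_STATUS)
        kept.append(chunk)
        n += len(chunk)
        if n >= max_rows:  # honor the row cap without reading the whole file
            break
    return pd.concat(kept, ignore_index=True).head(max_rows)
```

Streaming with `chunksize` keeps peak memory bounded even when the raw accepted-loans file is multiple gigabytes.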
- `main.py`: Run one problem or both via CLI flags.
  - Writes a merged run summary to `results/metrics.json`.
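The CLI surface could be declared roughly as below; the flag names match the usage examples in this README, while the defaults shown are illustrative assumptions.

```python
import argparse

def parse_args(argv=None):
    """Minimal sketch of main.py's argument parsing (defaults assumed)."""
    p = argparse.ArgumentParser(description="Run default-prediction pipelines.")
    p.add_argument("--problem", choices=["card", "loan", "both"], default="both",
                   help="Which problem(s) to run.")
    p.add_argument("--max-loan-rows", type=int, default=None,
                   help="Cap on LendingClub rows (None = all rows).")
    p.add_argument("--loan-chunksize", type=int, default=100_000,
                   help="Chunk size for streaming the raw loan CSV.")
    return p.parse_args(argv)
```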
- `notebooks/01_card_default_EDA.ipynb`
- `notebooks/02_card_preprocessing_feature_engineering.ipynb`
- `notebooks/03_card_modeling.ipynb`
- `notebooks/04_card_evaluation_analysis.ipynb`
- `notebooks/01_loan_default_EDA.ipynb`
- `notebooks/02_loan_preprocessing_feature_engineering.ipynb`
- `notebooks/03_loan_modeling.ipynb`
- `notebooks/04_loan_evaluation_analysis.ipynb`
Both modeling notebooks contain explicit hyperparameter tuning sections and export tuned parameter files under `results/`.
- UCI Default of Credit Card Clients
- LendingClub historical loan records
Relevant data paths:
- `data/default_of_credit_card_clients.csv`
- `data/raw/accepted_2007_to_2018Q4.csv`
- `data/raw/rejected_2007_to_2018Q4.csv`
Large files are expected to be tracked through Git LFS.
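If LFS tracking still needs to be set up, a `.gitattributes` along these lines would route the large CSVs through LFS (a sketch based on the data paths listed above; adjust to the repo's actual tracking rules):

```
data/raw/*.csv filter=lfs diff=lfs merge=lfs -text
data/default_of_credit_card_clients.csv filter=lfs diff=lfs merge=lfs -text
```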
```
uv sync
```

If you use pip instead of uv, install the dependencies from `pyproject.toml` manually.
Run both problems:

```
python main.py --problem both
```

Run only card default:

```
python main.py --problem card
```

Run only loan default with a controlled sample size:

```
python main.py --problem loan --max-loan-rows 120000 --loan-chunksize 100000
```

Outputs:

- `results/card/metrics.json`, `results/loan/metrics.json`, and `results/metrics.json`
- Diagnostic PNG visualizations per model and dataset under `results/card/plots/` and `results/loan/plots/`
- Validation prediction exports under `results/card/predictions/` and `results/loan/predictions/`
Each metrics file includes class distribution, ranking metrics (ROC-AUC, PR-AUC), threshold metrics (F1/precision/recall), top-decile capture rate, and per-model classification reports.
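The top-decile capture rate named above can be computed along these lines; this is a sketch of the metric's definition (the pipelines' exact tie-handling may differ).

```python
import numpy as np

def top_decile_capture(y_true, y_score):
    """Fraction of all positives found in the top 10% highest-scored rows.

    Answers the operational question: if a risk team can review only the
    riskiest decile, what share of actual defaults does it catch?
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    k = max(1, int(np.ceil(0.10 * len(y_true))))  # size of the top decile
    top_idx = np.argsort(-y_score)[:k]            # indices of highest scores
    return y_true[top_idx].sum() / max(1, y_true.sum())
```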
The following values were generated by running:

```
python main.py --problem both --max-loan-rows 30000 --loan-chunksize 50000
```

| Problem | Best model | ROC-AUC | PR-AUC | F1 | Precision | Recall | Top-decile capture |
|---|---|---|---|---|---|---|---|
| Credit card default (30,000 rows) | XGBoost | 0.7761 | 0.5557 | 0.4666 | 0.6522 | 0.3632 | 0.3090 |
| LendingClub loan default (30,000-row bounded run) | Random Forest | 0.7366 | 0.4207 | 0.4489 | 0.3570 | 0.6046 | 0.2549 |
These numbers are expected to shift when `--max-loan-rows` changes or when the full-data notebook tuning is rerun.
Research workflow design:

- Tune in notebooks (`03_*_modeling.ipynb`) using cross-validated PR-AUC.
- Persist the best parameters to JSON under `results/`.
- Promote stable best parameters into the replication scripts: `pipelines/card_pipeline.py` and `pipelines/loan_pipeline.py`.

This keeps experimentation flexible while maintaining production-like reproducibility in script form.
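The persist-then-promote handoff might look like the sketch below. The file name is hypothetical; the notebooks' actual export paths under `results/` may differ.

```python
import json
from pathlib import Path

# Illustrative export path for tuned hyperparameters.
PARAMS_PATH = Path("results/card_best_params.json")

def save_best_params(params: dict, path: Path = PARAMS_PATH) -> None:
    """Notebook side: write the cross-validated best parameters to JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(params, indent=2, sort_keys=True))

def load_best_params(path: Path = PARAMS_PATH, fallback=None) -> dict:
    """Script side: prefer tuned params, fall back to pinned defaults."""
    if path.exists():
        return json.loads(path.read_text())
    return fallback or {}
```

Keeping the fallback inside the script means a fresh clone replicates results even before any notebook has been run.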
Key principles used throughout this repo:
- remove/avoid post-outcome leakage columns for loan modeling,
- map targets using terminal outcomes only,
- use stratified splits,
- prioritize ranking metrics for imbalanced default prediction,
- include top-decile capture as an operational risk metric.
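Two of these principles, stratified splitting and ranking-metric evaluation, can be shown in a few lines. The data here is synthetic stand-in data, not either project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (~15% positives) as a stand-in.
X, y = make_classification(n_samples=2000, weights=[0.85], random_state=42)

# stratify=y preserves the minority-class rate in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_va)[:, 1]

# PR-AUC (average precision) ranks models by how well they order
# positives ahead of negatives, which matters more than accuracy here.
pr_auc = average_precision_score(y_va, scores)
```

Under heavy class imbalance, a model predicting "no default" everywhere scores high accuracy but near-baseline PR-AUC, which is why the repo prioritizes ranking metrics.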
This project demonstrates both research thinking and engineering discipline:
- cross-dataset validation mindset,
- notebook-first experimentation and script-first replication,
- explicit metric governance for imbalanced risk tasks,
- deterministic seeds and structured output artifacts,
- practical CLI controls for dataset scale and runtime.
- Run EDA notebooks (`01_*`).
- Run preprocessing notebooks (`02_*`).
- Run modeling notebooks (`03_*`) and their tuning sections.
- Run evaluation notebooks (`04_*`).
- Execute the script pipelines via `main.py` for clean replication.
Please review the licensing terms of each upstream dataset before redistribution.