This repository presents a research-oriented, reproducible default prediction study over two heterogeneous credit-risk problems:
- LendingClub installment loans (US lending platform data)
- UCI credit card default benchmark (Taiwan card clients)
The objective is not just to train a classifier, but to show a full machine-learning research workflow that is audit-friendly and replication-ready:
- clear EDA and data diagnostics,
- leakage-aware preprocessing and feature engineering,
- model training and hyperparameter tuning in notebooks,
- fixed best-parameter pipelines in standalone Python scripts,
- artifact outputs that can be re-generated from code.
- `pipelines/card_pipeline.py`: End-to-end UCI credit-card default pipeline.
  - Trains Logistic Regression, Random Forest, and XGBoost (if available).
  - Writes `results/card/metrics.json` plus model diagnostics under `results/card/plots/` and `results/card/predictions/`.
- `pipelines/loan_pipeline.py`: End-to-end LendingClub loan default pipeline.
  - Chunked loading for the large raw CSV.
  - Terminal-status target mapping with leakage-safe defaults.
  - Trains Logistic Regression, Random Forest, and XGBoost (if available).
  - Writes `results/loan/metrics.json` plus model diagnostics under `results/loan/plots/` and `results/loan/predictions/`.
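The chunked loading and terminal-status mapping could be sketched as follows. This is illustrative, not the pipeline's exact code: the `loan_status` column name, the status strings, and the file path are assumptions about the LendingClub schema.

```python
import pandas as pd

# Hypothetical terminal-status mapping: only loans with a final outcome
# receive a label; in-flight statuses (e.g. "Current") are dropped so no
# post-outcome information leaks into training.
TERMINAL_STATUS = {
    "Fully Paid": 0,
    "Charged Off": 1,
    "Default": 1,
}

def load_loans(path, max_rows=30_000, chunksize=10_000):
    """Stream the raw CSV in chunks, keeping only terminal-status rows."""
    kept, n = [], 0
    for chunk in pd.read_csv(path, chunksize=chunksize, low_memory=False):
        chunk = chunk[chunk["loan_status"].isin(TERMINAL_STATUS)].copy()
        chunk["target"] = chunk["loan_status"].map(TERMINAL_STATUS)
        kept.append(chunk)
        n += len(chunk)
        if n >= max_rows:  # honor the row cap without reading the whole file
            break
    return pd.concat(kept, ignore_index=True).head(max_rows)
```

Streaming with `chunksize` keeps peak memory bounded even when the raw accepted-loans file is multiple gigabytes.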
- `main.py`: Run one problem or both via CLI flags.
  - Writes a merged run summary to `results/metrics.json`.
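The CLI surface could be declared roughly as below; the flag names match the usage examples in this README, while the defaults shown are illustrative assumptions.

```python
import argparse

def parse_args(argv=None):
    """Minimal sketch of main.py's argument parsing (defaults assumed)."""
    p = argparse.ArgumentParser(description="Run default-prediction pipelines.")
    p.add_argument("--problem", choices=["card", "loan", "both"], default="both",
                   help="Which problem(s) to run.")
    p.add_argument("--max-loan-rows", type=int, default=None,
                   help="Cap on LendingClub rows (None = all rows).")
    p.add_argument("--loan-chunksize", type=int, default=100_000,
                   help="Chunk size for streaming the raw loan CSV.")
    return p.parse_args(argv)
```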
- `notebooks/01_card_default_EDA.ipynb`
- `notebooks/02_card_preprocessing_feature_engineering.ipynb`
- `notebooks/03_card_modeling.ipynb`
- `notebooks/04_card_evaluation_analysis.ipynb`
- `notebooks/01_loan_default_EDA.ipynb`
- `notebooks/02_loan_preprocessing_feature_engineering.ipynb`
- `notebooks/03_loan_modeling.ipynb`
- `notebooks/04_loan_evaluation_analysis.ipynb`
Both modeling notebooks contain explicit hyperparameter tuning sections and export tuned parameter files under `results/`.
- UCI Default of Credit Card Clients
- LendingClub historical loan records
Relevant data paths:
- `data/default_of_credit_card_clients.csv`
- `data/raw/accepted_2007_to_2018Q4.csv`
- `data/raw/rejected_2007_to_2018Q4.csv`
Large files are expected to be tracked through Git LFS.
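If LFS tracking still needs to be set up, a `.gitattributes` along these lines would route the large CSVs through LFS (a sketch based on the data paths listed above; adjust to the repo's actual tracking rules):

```
data/raw/*.csv filter=lfs diff=lfs merge=lfs -text
data/default_of_credit_card_clients.csv filter=lfs diff=lfs merge=lfs -text
```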
```
uv sync
```

If you use pip instead of uv, install the dependencies from `pyproject.toml` manually.
Run both problems:

```
python main.py --problem both
```

Run only card default:

```
python main.py --problem card
```

Run only loan default with a controlled sample size:

```
python main.py --problem loan --max-loan-rows 120000 --loan-chunksize 100000
```

Outputs:

- `results/card/metrics.json`, `results/loan/metrics.json`, and `results/metrics.json`
- Diagnostic PNG visualizations per model and dataset under `results/card/plots/` and `results/loan/plots/`
- Validation prediction exports under `results/card/predictions/` and `results/loan/predictions/`
Each metrics file includes class distribution, ranking metrics (ROC-AUC, PR-AUC), threshold metrics (F1/precision/recall), top-decile capture rate, and per-model classification reports.
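The top-decile capture rate named above can be computed along these lines; this is a sketch of the metric's definition (the pipelines' exact tie-handling may differ).

```python
import numpy as np

def top_decile_capture(y_true, y_score):
    """Fraction of all positives found in the top 10% highest-scored rows.

    Answers the operational question: if a risk team can review only the
    riskiest decile, what share of actual defaults does it catch?
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    k = max(1, int(np.ceil(0.10 * len(y_true))))  # size of the top decile
    top_idx = np.argsort(-y_score)[:k]            # indices of highest scores
    return y_true[top_idx].sum() / max(1, y_true.sum())
```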
The following values were generated by running:

```
python main.py --problem both --max-loan-rows 30000 --loan-chunksize 50000
```

| Problem | Best model | ROC-AUC | PR-AUC | F1 | Precision | Recall | Top-decile capture |
|---|---|---|---|---|---|---|---|
| Credit card default (30,000 rows) | XGBoost | 0.7761 | 0.5557 | 0.4666 | 0.6522 | 0.3632 | 0.3090 |
| LendingClub loan default (30,000-row bounded run) | Random Forest | 0.7366 | 0.4207 | 0.4489 | 0.3570 | 0.6046 | 0.2549 |
These numbers are expected to shift when `--max-loan-rows` changes or when the full-data notebook tuning is rerun.
Research workflow design:

- Tune in notebooks (`03_*_modeling.ipynb`) using cross-validated PR-AUC.
- Persist the best parameters to JSON under `results/`.
- Promote stable best parameters into the replication scripts: `pipelines/card_pipeline.py` and `pipelines/loan_pipeline.py`.

This keeps experimentation flexible while maintaining production-like reproducibility in script form.
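The persist-then-promote handoff might look like the sketch below. The file name is hypothetical; the notebooks' actual export paths under `results/` may differ.

```python
import json
from pathlib import Path

# Illustrative export path for tuned hyperparameters.
PARAMS_PATH = Path("results/card_best_params.json")

def save_best_params(params: dict, path: Path = PARAMS_PATH) -> None:
    """Notebook side: write the cross-validated best parameters to JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(params, indent=2, sort_keys=True))

def load_best_params(path: Path = PARAMS_PATH, fallback=None) -> dict:
    """Script side: prefer tuned params, fall back to pinned defaults."""
    if path.exists():
        return json.loads(path.read_text())
    return fallback or {}
```

Keeping the fallback inside the script means a fresh clone replicates results even before any notebook has been run.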
Key principles used throughout this repo:
- remove/avoid post-outcome leakage columns for loan modeling,
- map targets using terminal outcomes only,
- use stratified splits,
- prioritize ranking metrics for imbalanced default prediction,
- include top-decile capture as an operational risk metric.
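Two of these principles, stratified splitting and ranking-metric evaluation, can be shown in a few lines. The data here is synthetic stand-in data, not either project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (~15% positives) as a stand-in.
X, y = make_classification(n_samples=2000, weights=[0.85], random_state=42)

# stratify=y preserves the minority-class rate in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_va)[:, 1]

# PR-AUC (average precision) ranks models by how well they order
# positives ahead of negatives, which matters more than accuracy here.
pr_auc = average_precision_score(y_va, scores)
```

Under heavy class imbalance, a model predicting "no default" everywhere scores high accuracy but near-baseline PR-AUC, which is why the repo prioritizes ranking metrics.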
This project demonstrates both research thinking and engineering discipline:
- cross-dataset validation mindset,
- notebook-first experimentation and script-first replication,
- explicit metric governance for imbalanced risk tasks,
- deterministic seeds and structured output artifacts,
- practical CLI controls for dataset scale and runtime.
- Run EDA notebooks (`01_*`).
- Run preprocessing notebooks (`02_*`).
- Run modeling notebooks (`03_*`) and their tuning sections.
- Run evaluation notebooks (`04_*`).
- Execute the script pipelines via `main.py` for clean replication.
Please review the licensing terms of each upstream dataset before redistribution.