riju-talk/loan-defaulter-prediction-study

Loan Defaulter Prediction Study

This repository presents a research-oriented, reproducible default prediction study over two heterogeneous credit-risk problems:

  1. LendingClub installment loans (US lending platform data)
  2. UCI credit card default benchmark (Taiwan card clients)

The objective is not just to train a classifier, but to show a full machine-learning research workflow that is audit-friendly and replication-ready:

  • clear EDA and data diagnostics,
  • leakage-aware preprocessing and feature engineering,
  • model training and hyperparameter tuning in notebooks,
  • fixed best-parameter pipelines in standalone Python scripts,
  • artifact outputs that can be re-generated from code.

What Is Implemented

1) Two single-file replication pipelines

  • pipelines/card_pipeline.py

    • End-to-end UCI credit-card default pipeline.
    • Trains Logistic Regression, Random Forest, and XGBoost (if available).
    • Writes results/card/metrics.json plus model diagnostics under results/card/plots/ and results/card/predictions/.
  • pipelines/loan_pipeline.py

    • End-to-end LendingClub loan default pipeline.
    • Chunked loading for large raw CSV.
    • Terminal-status target mapping with leakage-safe defaults.
    • Trains Logistic Regression, Random Forest, and XGBoost (if available).
    • Writes results/loan/metrics.json plus model diagnostics under results/loan/plots/ and results/loan/predictions/.
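
The chunked loading with a row cap mentioned for the loan pipeline can be sketched as follows. This is a minimal illustration and not the code in pipelines/loan_pipeline.py: the function name iter_csv_chunks and its parameters are hypothetical, and it uses only the standard-library csv module so it runs without extra dependencies (a pandas-based pipeline would more typically pass chunksize to pandas.read_csv).

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List, Optional


def iter_csv_chunks(
    path: str,
    chunksize: int = 100_000,
    max_rows: Optional[int] = None,
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts, each holding at most `chunksize` rows.

    Iteration stops once `max_rows` rows have been yielded in total.
    Pass max_rows=None to read the whole file.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    remaining = max_rows
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while remaining is None or remaining > 0:
            # Never read more rows than the cap still allows.
            take = chunksize if remaining is None else min(chunksize, remaining)
            chunk = list(islice(reader, take))
            if not chunk:
                break  # end of file
            yield chunk
            if remaining is not None:
                remaining -= len(chunk)
```

Only one chunk is held in memory at a time, so the cap and the chunk size can be chosen independently of the raw file size.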

2) Unified execution entrypoint

  • main.py
    • Run one problem or both with CLI flags.
    • Writes merged run summary to results/metrics.json.

3) Notebook research workflow (four notebooks per problem)

Credit card track

  • notebooks/01_card_default_EDA.ipynb
  • notebooks/02_card_preprocessing_feature_engineering.ipynb
  • notebooks/03_card_modeling.ipynb
  • notebooks/04_card_evaluation_analysis.ipynb

Loan track

  • notebooks/01_loan_default_EDA.ipynb
  • notebooks/02_loan_preprocessing_feature_engineering.ipynb
  • notebooks/03_loan_modeling.ipynb
  • notebooks/04_loan_evaluation_analysis.ipynb

Both modeling notebooks contain explicit hyperparameter tuning sections and export the tuned parameter files under results/.

Data Sources

  1. UCI Default of Credit Card Clients
  2. LendingClub historical loan records

Relevant data paths:

  • data/default_of_credit_card_clients.csv
  • data/raw/accepted_2007_to_2018Q4.csv
  • data/raw/rejected_2007_to_2018Q4.csv

Large files are expected to be tracked through Git LFS.

Reproducibility Protocol

Environment setup

uv sync

If you use pip instead of uv, install the dependencies listed in pyproject.toml manually.

Run pipelines from CLI

Run both problems:

python main.py --problem both

Run only card default:

python main.py --problem card

Run only loan default with a capped sample size:

python main.py --problem loan --max-loan-rows 120000 --loan-chunksize 100000
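
For orientation, the three flags used in the commands above could be parsed along these lines. This is a sketch, not the actual contents of main.py: the flag names come from the commands shown, while the defaults, the helper names build_parser and select_problems, and the description text are assumptions.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the example commands."""
    parser = argparse.ArgumentParser(description="Run default-prediction pipelines.")
    parser.add_argument("--problem", choices=["card", "loan", "both"], default="both")
    # The defaults below are assumptions, not values confirmed by the repo.
    parser.add_argument("--max-loan-rows", type=int, default=None)
    parser.add_argument("--loan-chunksize", type=int, default=100_000)
    return parser


def select_problems(problem: str) -> List[str]:
    """Expand 'both' into the two individual problem names."""
    return ["card", "loan"] if problem == "both" else [problem]
```

Calling build_parser().parse_args() and passing args.problem to select_problems gives the list of pipelines to run; argparse exposes the dashed flags as args.max_loan_rows and args.loan_chunksize.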

Artifacts generated

  • results/card/metrics.json
  • results/loan/metrics.json
  • results/metrics.json
  • Diagnostic PNG visualizations per model and dataset under results/card/plots/ and results/loan/plots/
  • Validation prediction exports under results/card/predictions/ and results/loan/predictions/

Each metrics file includes class distribution, ranking metrics (ROC-AUC, PR-AUC), threshold metrics (F1/precision/recall), top-decile capture rate, and per-model classification reports.
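
As an illustration of the less standard metric in that list, top-decile capture can be computed as below. The definition used here (the share of all actual defaulters that land in the 10% of cases with the highest predicted scores) is an assumption about what the pipelines report, and the function name is hypothetical.

```python
from typing import Sequence


def top_decile_capture(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Share of all positives found in the top 10% of cases ranked by score."""
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    total_positives = sum(y_true)
    if total_positives == 0:
        return 0.0  # no defaulters to capture
    # Integer arithmetic avoids float rounding; always keep at least one case.
    n_top = max(1, len(y_true) // 10)
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    captured = sum(y_true[i] for i in order[:n_top])
    return captured / total_positives
```

A model that ranks at random captures roughly 10% of defaulters under this definition, which gives a simple baseline for reading the reported values.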

Latest validated benchmark snapshot

The following values were generated by running:

python main.py --problem both --max-loan-rows 30000 --loan-chunksize 50000

| Problem | Best model | ROC-AUC | PR-AUC | F1 | Precision | Recall | Top-decile capture |
|---|---|---|---|---|---|---|---|
| Credit card default (30,000 rows) | XGBoost | 0.7761 | 0.5557 | 0.4666 | 0.6522 | 0.3632 | 0.3090 |
| LendingClub loan default (30,000-row bounded run) | Random Forest | 0.7366 | 0.4207 | 0.4489 | 0.3570 | 0.6046 | 0.2549 |

These numbers are expected to shift when --max-loan-rows changes or when full-data notebook tuning is rerun.

Hyperparameter Strategy

Research workflow design:

  1. Tune hyperparameters in the modeling notebooks (03_*_modeling.ipynb), selecting on cross-validated PR-AUC.
  2. Persist best params to JSON in results/.
  3. Promote stable best params into the replication scripts:
    • pipelines/card_pipeline.py
    • pipelines/loan_pipeline.py

This keeps experimentation flexible while maintaining production-like reproducibility in script form.
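
Step 2 of this workflow, persisting tuned parameters to JSON, might look like the sketch below. The helper names and the idea of reading the file back over a set of defaults are illustrative assumptions; the repo itself promotes stable parameters into the scripts by hand, and the exact file names under results/ are not specified here.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def save_best_params(params: Dict[str, Any], path: Union[str, Path]) -> None:
    """Write tuned parameters as sorted, indented JSON so diffs stay readable."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(params, indent=2, sort_keys=True), encoding="utf-8")


def load_best_params(path: Union[str, Path], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return `defaults` overridden by tuned values when the file exists."""
    source = Path(path)
    if not source.exists():
        return dict(defaults)
    tuned = json.loads(source.read_text(encoding="utf-8"))
    return {**defaults, **tuned}
```

Writing with sort_keys=True keeps the file stable across reruns, which makes it easier to see in version control when a tuning run actually changed a parameter.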

Leakage and Imbalance Handling

Key principles used throughout this repo:

  • remove/avoid post-outcome leakage columns for loan modeling,
  • map targets using terminal outcomes only,
  • use stratified splits,
  • prioritize ranking metrics for imbalanced default prediction,
  • include top-decile capture as an operational risk metric.
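
The first two principles can be sketched together for the loan data. Everything specific in this example is an assumption made for illustration: which status labels count as terminal, which columns are treated as post-outcome leakage, and the helper names. The actual lists used in pipelines/loan_pipeline.py may differ.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed terminal outcomes; loans still in progress carry no label yet.
DEFAULT_STATUSES = frozenset({"Charged Off", "Default"})
REPAID_STATUSES = frozenset({"Fully Paid"})

# Assumed examples of columns that are only known after the outcome.
LEAKAGE_COLUMNS = frozenset(
    {"recoveries", "collection_recovery_fee", "total_pymnt", "last_pymnt_d"}
)


def map_terminal_target(status: str) -> Optional[int]:
    """Return 1 for default, 0 for repaid, None for non-terminal statuses."""
    cleaned = status.strip()
    if cleaned in DEFAULT_STATUSES:
        return 1
    if cleaned in REPAID_STATUSES:
        return 0
    return None


def build_labeled_rows(
    rows: Iterable[Dict[str, str]],
    status_col: str = "loan_status",
) -> Tuple[List[Dict[str, str]], List[int]]:
    """Keep only terminal loans and strip the status and leakage columns."""
    features: List[Dict[str, str]] = []
    targets: List[int] = []
    for row in rows:
        target = map_terminal_target(row.get(status_col) or "")
        if target is None:
            continue  # e.g. a loan that is still current
        features.append(
            {
                key: value
                for key, value in row.items()
                if key != status_col and key not in LEAKAGE_COLUMNS
            }
        )
        targets.append(target)
    return features, targets
```

Dropping non-terminal loans rather than labeling them as non-defaults avoids treating loans that have not yet had time to default as good outcomes.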

Why This Repo Is Hiring-Grade

This project demonstrates both research thinking and engineering discipline:

  • cross-dataset validation mindset,
  • notebook-first experimentation and script-first replication,
  • explicit metric governance for imbalanced risk tasks,
  • deterministic seeds and structured output artifacts,
  • practical CLI controls for dataset scale and runtime.

Suggested Execution Order

  1. Run EDA notebooks (01_*).
  2. Run preprocessing notebooks (02_*).
  3. Run modeling notebooks (03_*), including their tuning sections.
  4. Run evaluation notebooks (04_*).
  5. Execute script pipelines with main.py for clean replication.

License and Usage Notes

Please review the licensing terms of each upstream dataset before redistribution.

About

Implemented a supervised machine learning pipeline for loan default prediction using public loan-level data, inspired by prior ACM research. Performed leakage-aware preprocessing, feature engineering across borrower and loan attributes, and handled class imbalance using cost-sensitive learning.
