Our goal was to build a classification model that predicts which borrowers will default and which will not. We also wanted to create data visualizations of the model's findings to serve as the basis for a stakeholder dashboard.
We started with a dataset of credit applications that includes gender, income, credit allocation, and default status. Preprocessing included handling missing values, converting categorical variables into numerical representations, and splitting the data into training and testing sets. We then fit logistic regression and Histogram Gradient Boosting Classification and Regression models.
The resulting decision-tree model achieved 91% accuracy and an AUC-ROC of 0.72.
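As a small illustration of how these two metrics are computed, the sketch below uses scikit-learn's `accuracy_score` and `roc_auc_score` on invented toy labels and scores (they are not the project's results). The 0.5 decision threshold and the assumption that defaults are the minority class are ours, not stated in the write-up.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy data, invented for illustration: nine non-defaults (0) and one default (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_score = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.30, 0.35, 0.45, 0.40]

# Accuracy compares thresholded class labels with the true labels.
y_pred = [int(score >= 0.5) for score in y_score]
accuracy = accuracy_score(y_true, y_pred)

# AUC-ROC uses the raw scores and measures how often a default
# is ranked above a non-default, independent of any threshold.
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.2f}, auc={auc:.2f}")
```

In this toy case no borrower is flagged as a default, yet accuracy is still 0.90 because nine of the ten labels are non-defaults. That is why accuracy and AUC-ROC are worth reading together when one class is rare.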
Some of the challenges we faced during this project included cleaning the data and deciding which variables to include in our logistic-regression model and in the dataframe used for the training and testing sets.
Data quality: Many columns were complete, but dozens of variables, often describing the characteristics of homes, had more than half of their values missing. What began as 122 columns was eventually reduced to 43. Statistical analysis: Multicollinearity was an issue because many of the demographic variables were correlated with each other. Interpretation:
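The data-quality and multicollinearity checks mentioned above could be approached along these lines. This is a hedged sketch with pandas on synthetic data: the column names are hypothetical, dropping every column that is more than half null is one plausible reading of the cleaning step, and the 0.8 correlation cutoff is an assumption rather than a value from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data with one mostly-null column and one deliberately correlated pair.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, size=n)
df = pd.DataFrame({
    "age": age,
    "years_employed": age - 20 + rng.normal(0, 2, size=n),  # built to track age
    "income": rng.normal(50_000, 15_000, size=n),
    "house_floor_area": rng.normal(80, 20, size=n),
})
df.loc[df.index[:150], "house_floor_area"] = np.nan  # 75% of values missing

# Data quality: drop columns where more than half of the values are null.
null_share = df.isna().mean()
mostly_null = null_share[null_share > 0.5].index.tolist()
df_reduced = df.drop(columns=mostly_null)

# Multicollinearity: list variable pairs with a high absolute correlation.
corr = df_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]  # assumed cutoff

print("Dropped:", mostly_null)
print(high_pairs)
```

Pairs flagged this way are candidates for dropping or combining one of the two variables before fitting the logistic regression, whose coefficients become unstable when predictors are strongly correlated.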
If we had more time, we would consider the following: What other information, especially behavioural or at least non-demographic information, could we incorporate into our model? Such information could identify more precisely what influences the chance of default and give individual borrowers more actionable guidance for lowering their risk.