A regression-based ML project that identifies the key drivers of credit card spending across 5,000 bank customers, enabling data-driven credit limit calculation for new applicants.
A global bank wants to understand what factors drive credit card spend across its customer base. This understanding is critical for:
- Setting accurate credit limits for new applicants
- Identifying high-value customer segments
- Reducing credit risk by predicting spend behaviour upfront
The bank surveyed 5,000 customers, collecting demographic, financial, and behavioural data. This project builds a predictive model to:
- Identify the top drivers of total credit card spend (Primary + Secondary card combined)
- Predict the total spend for new applicants given their profile
- Use those predictions to recommend credit limits at the individual level
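The third goal, turning a predicted spend figure into a recommended limit, could look like the following sketch. The function name, buffer multiplier, and rounding increment are illustrative assumptions, not the project's actual policy:

```python
import numpy as np

def recommend_credit_limit(predicted_log_spend, multiplier=2.5, round_to=500):
    """Hypothetical mapping from a log-scale spend prediction to a credit limit.

    Back-transforms the model's log-scale prediction, applies a buffer
    multiplier for headroom, and rounds up to a customer-friendly increment.
    """
    predicted_spend = np.exp(predicted_log_spend)          # undo the log transform
    raw_limit = predicted_spend * multiplier               # headroom above typical spend
    return int(np.ceil(raw_limit / round_to) * round_to)   # round up to nearest increment

# Example: model predicts log-spend of 6.2 (~493 in original units)
print(recommend_credit_limit(6.2))  # 1500
```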
| Stakeholder | Value Delivered |
|---|---|
| Credit Risk Team | Data-driven credit limit setting, reducing over/under-lending |
| Marketing Team | Identify high-spend customer profiles to target premium products |
| Operations | Predict customer lifetime value early in the relationship |
| Product Team | Understand which customer segments drive the most revenue |
- Source: Internal bank customer survey
- Size: 5,000 customers
- Format: `.xlsx` with accompanying data dictionary
- Target Variable: `total_spent` (Primary Card spend + Secondary Card spend, log-transformed)
| Category | Variables |
|---|---|
| Demographics | Age (`agecat`), Education (`edcat`), Marital status (`spoused`) |
| Financial | Income category (`inccat`), Credit limit, Minimum payments |
| Behavioural | Tenure (`tenure`, `card2tenure`), Equipment months (`equipmon`) |
| Geographic | Address duration (`addresscat`), Commute category (`commutecat`) |
| Communication | Toll months (`tollmon`), Wireless months (`wiremon`) |
Note: Data is numerically encoded. Categorical variables require careful separation before modelling.
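One simple way to perform that separation is a unique-count heuristic: columns with few distinct values are treated as encoded categoricals. A minimal sketch, where the threshold of 3 and the toy data are illustrative (the notebook may rely on the data dictionary instead):

```python
import pandas as pd

def split_variable_types(df, max_levels=10):
    """Heuristically split numerically encoded columns into categorical
    vs continuous, based on the number of distinct values."""
    categorical, continuous = [], []
    for col in df.columns:
        if df[col].nunique(dropna=True) <= max_levels:
            categorical.append(col)
        else:
            continuous.append(col)
    return categorical, continuous

# Toy example: 'edcat' has a few coded levels, 'income' is genuinely continuous
toy = pd.DataFrame({'edcat': [1, 2, 3, 2, 1],
                    'income': [31.0, 45.2, 88.9, 12.4, 60.1]})
cats, conts = split_variable_types(toy, max_levels=3)
print(cats, conts)  # ['edcat'] ['income']
```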
```
Predicting-Credit-Card-Spend/
├── data/
│   ├── credit_card_data.xlsx             # Raw customer survey data
│   └── Data_Dictionary.xlsx              # Variable definitions
├── outputs/
│   ├── Decile_analysis_train.csv         # Decile analysis on training set
│   ├── Decile_analysis_test.csv          # Decile analysis on test set
│   └── python_FA.xls                     # Factor analysis output
├── Predicting Credit Card Spend.ipynb    # Full EDA + Modelling notebook
├── requirements.txt
└── README.md
```
Key challenges tackled:
- Numerical encoding masks categorical variables, requiring careful type separation
- Missing value treatment: variables with >25% missing values dropped; the remainder imputed
- Outlier treatment using 1st-99th percentile clipping to reduce skew
```python
# Outlier clipping at the 1st and 99th percentiles
credit[numerical_var] = credit[numerical_var].apply(
    lambda x: x.clip(lower=x.quantile(0.01),
                     upper=x.quantile(0.99))
)
```

Variables dropped due to multicollinearity:

```python
drop_col = ['addresscat', 'agecat', 'cardtenurecat', 'card2tenurecat',
            'cardtenure', 'commutecat', 'edcat', 'equipmon', 'inccat',
            'lnlongten', 'longten', 'spoused', 'spousedcat', 'tenure',
            'tollmon', 'wiremon']
```

- Log transformation of the target variable (`total_spent_ln`) to normalise the right-skewed spend distribution
- Dummy variable creation for categorical variables (excluding boolean variables, which are already binary)
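A minimal sketch of these two transformation steps on a toy frame. Only `total_spent` matches the real data; the `region` column is invented for illustration, and `log1p` is used as a zero-safe variant of the log transform:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (the 'region' column is illustrative)
credit = pd.DataFrame({
    'total_spent': [120.0, 560.5, 980.2],
    'region': pd.Categorical([1, 2, 1]),   # numerically encoded categorical
})

# Log-transform the right-skewed target (log1p is safe if zero spend occurs)
credit['total_spent_ln'] = np.log1p(credit['total_spent'])

# One-hot encode categoricals; drop_first avoids the dummy-variable trap in regression
credit = pd.get_dummies(credit, columns=['region'], drop_first=True)
print(credit.columns.tolist())
```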
- Missing value analysis: calculated the % of missing values per variable to guide the imputation strategy

```python
# Percentage of missing values per continuous variable (NMISS count / 5,000 customers)
Missing_values = pd.DataFrame(
    credit_conti_var.apply(continuous_var_summary).T.round(1)['NMISS'] / 5000 * 100
)

# Flag variables with >25% missing values for dropping
M_V = Missing_values[Missing_values['NMISS'] > 25]
```

- Linear Regression as the primary model: interpretable, and maps directly to business insight
- Train/test split for out-of-sample validation
- Focus on coefficient interpretation to identify the key spend drivers
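On synthetic data, the modelling approach above reduces to a few lines of scikit-learn. The feature names and coefficients here are made up; only the workflow mirrors the project:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and log-spend target
rng = np.random.default_rng(0)
X = pd.DataFrame({'income': rng.normal(50, 15, 500),
                  'card2tenure': rng.integers(1, 60, 500)})
y = 0.04 * X['income'] + 0.01 * X['card2tenure'] + rng.normal(0, 0.1, 500)

# Hold out a test set for out-of-sample validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Coefficients map directly to spend drivers (log-scale effect per unit change)
coefs = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(coefs)
print('Test R^2:', round(model.score(X_test, y_test), 3))
```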
A key differentiator of this project: decile analysis validates model performance in business terms.

```python
# Group predictions into deciles and compare predicted vs actual average spend
Predicted_avg = train[['Deciles', 'pred_tot_spend']].groupby(
    train.Deciles).mean().sort_index(ascending=False)['pred_tot_spend']
Actual_avg = train[['Deciles', 'total_spent_ln']].groupby(
    train.Deciles).mean().sort_index(ascending=False)['total_spent_ln']
```

This shows whether the model correctly ranks customers from highest to lowest spenders, which is critical for credit limit assignment.
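The `Deciles` column assumed in that snippet can be built with `pandas.qcut`; a sketch on synthetic predictions (the data and noise level are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic predictions standing in for the training frame
rng = np.random.default_rng(1)
train = pd.DataFrame({'pred_tot_spend': np.linspace(4.0, 8.0, 100)})
train['total_spent_ln'] = train['pred_tot_spend'] + rng.normal(0, 0.2, 100)

# Cut predictions into 10 equal-sized bins: decile 10 = highest predicted spenders
train['Deciles'] = pd.qcut(train['pred_tot_spend'], 10, labels=range(1, 11))

# Average predicted and actual spend per decile, top decile first
avg = train.groupby('Deciles', observed=True)[['pred_tot_spend', 'total_spent_ln']].mean()
print(avg.sort_index(ascending=False))
```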
- Income category is the single strongest predictor of total spend: higher-income customers spend significantly more
- Card tenure (how long the customer has held the card) correlates positively with spend: loyal customers spend more
- Secondary card usage is a strong signal: customers using both primary and secondary cards show 2x higher total spend
- Customers with longer address tenure tend to be more financially stable, with predictable spend patterns
- Log transformation of the target variable significantly improved model fit by reducing skew
| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| Data Processing | Pandas, NumPy |
| Visualisation | Matplotlib, Seaborn |
| Modelling | Scikit-learn (Linear Regression) |
| Statistical Analysis | SciPy, StatsModels |
| Output | Excel (openpyxl / xlwt) for decile reports |
| Notebook | Jupyter Notebook |
```bash
# Clone the repository
git clone https://github.com/vicky60629/Predicting-Credit-Card-Spend.git
cd Predicting-Credit-Card-Spend

# Install dependencies
pip install -r requirements.txt

# Launch the notebook
jupyter notebook "Predicting Credit Card Spend.ipynb"
```

The model's business performance is validated through a decile-level comparison of predicted vs actual spend: a standard technique in banking and insurance for evaluating regression models.
| Decile | Avg Predicted Spend | Avg Actual Spend |
|---|---|---|
| 1 (Top) | Highest | Highest |
| 2 | ↓ | ↓ |
| ... | ... | ... |
| 10 (Bottom) | Lowest | Lowest |
A well-performing model shows monotonically decreasing averages from Decile 1 to 10, confirming it correctly ranks customers by spend potential.
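That ranking property can be checked programmatically. With illustrative decile averages (the numbers below are made up):

```python
import pandas as pd

# Average predicted spend per decile, ordered Decile 1 (top) to Decile 10 (bottom)
decile_avg = pd.Series([7.9, 7.4, 7.0, 6.6, 6.3, 6.0, 5.7, 5.3, 4.9, 4.2],
                       index=range(1, 11), name='pred_tot_spend')

# A well-performing model ranks correctly: averages fall monotonically
print(decile_avg.is_monotonic_decreasing)  # True
```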
What worked well:
- Decile analysis goes beyond standard RMSE/R²: it validates the model in business language that stakeholders understand
- Careful missing value and outlier treatment significantly improved model stability
- Log transformation of spend target made residuals more normally distributed
Future enhancements:
- Try XGBoost / LightGBM for potentially better predictive accuracy
- Add SHAP values for better feature importance explainability
- Build a Streamlit dashboard for interactive credit limit prediction
- Experiment with Ridge / Lasso regression for automatic feature selection
- Extend to credit limit recommendation engine using predicted spend buckets
- Track experiments with MLflow
Vicky Gupta - Data Engineering Analyst @ Accenture (4.5 years) | Aspiring Data Scientist
Experienced in PySpark, ETL pipelines, and building end-to-end ML solutions across banking, retail, and NLP domains.
This project is licensed under the MIT License; see the LICENSE file for details.
⭐ If you found this useful, please consider starring the repository!