
💳 Predicting Credit Card Spend & Identifying Key Drivers


A regression-based ML project that identifies the key drivers of credit card spending across 5,000 bank customers, enabling data-driven credit limit calculation for new applicants.


📌 Problem Statement

A global bank wants to understand what factors drive credit card spend across its customer base. This understanding is critical for:

  • Setting accurate credit limits for new applicants
  • Identifying high-value customer segments
  • Reducing credit risk by predicting spend behaviour upfront

The bank surveyed 5,000 customers, collecting demographic, financial, and behavioural data. This project builds a predictive model to:

  1. Identify the top drivers of total credit card spend (Primary + Secondary card combined)
  2. Predict the total spend for new applicants given their profile
  3. Use those predictions to recommend credit limits at the individual level

🎯 Business Impact

| Stakeholder | Value Delivered |
|---|---|
| Credit Risk Team | Data-driven credit limit setting, reducing over/under-lending |
| Marketing Team | Identify high-spend customer profiles to target premium products |
| Operations | Predict customer lifetime value early in the relationship |
| Product Team | Understand which customer segments drive the most revenue |

📊 Dataset

  • Source: Internal bank customer survey
  • Size: 5,000 customers
  • Format: .xlsx with accompanying data dictionary
  • Target Variable: total_spent (Primary Card spend + Secondary Card spend, log-transformed)

Key Features

| Category | Variables |
|---|---|
| Demographics | Age (agecat), Education (edcat), Marital status (spoused) |
| Financial | Income category (inccat), Credit limit, Minimum payments |
| Behavioural | Tenure (tenure, card2tenure), Equipment months (equipmon) |
| Geographic | Address duration (addresscat), Commute category (commutecat) |
| Communication | Toll months (tollmon), Wireless months (wiremon) |

Note: Data is numerically encoded. Categorical variables require careful separation before modelling.
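A minimal sketch of that separation step (the threshold and toy columns here are illustrative assumptions, not the notebook's exact logic):

```python
import pandas as pd

# Toy stand-in for the survey data: every column arrives as numbers
credit = pd.DataFrame({
    "income": [45.0, 120.0, 30.5, 88.0],   # genuinely continuous
    "inccat": [2, 5, 1, 4],                # numerically encoded category
})

# One pragmatic heuristic: low-cardinality, whole-number columns are
# treated as categorical; everything else stays continuous
categorical_vars = [
    c for c in credit.columns
    if credit[c].nunique() <= 5 and (credit[c].dropna() % 1 == 0).all()
]
continuous_vars = [c for c in credit.columns if c not in categorical_vars]
```

In practice the data dictionary, not a heuristic alone, should confirm which codes are categories.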


πŸ—οΈ Project Architecture

Predicting-Credit-Card-Spend/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ credit_card_data.xlsx        # Raw customer survey data
β”‚   └── Data_Dictionary.xlsx         # Variable definitions
β”œβ”€β”€ outputs/
β”‚   β”œβ”€β”€ Decile_analysis_train.csv    # Decile analysis on training set
β”‚   β”œβ”€β”€ Decile_analysis_test.csv     # Decile analysis on test set
β”‚   └── python_FA.xls                # Factor analysis output
β”œβ”€β”€ Predicting Credit Card Spend.ipynb   # Full EDA + Modelling notebook
β”œβ”€β”€ requirements.txt
└── README.md

πŸ” Approach

1. Data Preparation & Cleaning

Key challenges tackled:

  • Numerical encoding masks categorical variables, so careful type separation was required
  • Missing value treatment: variables with >25% missing dropped; the remainder imputed
  • Outlier treatment using 1st–99th percentile clipping to reduce skew
# Clip outliers at the 1st and 99th percentiles
# (quantile() skips NaNs by default, so no explicit dropna() is needed)
credit[numerical_var] = credit[numerical_var].apply(
    lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))
)

Variables dropped due to multicollinearity:

drop_col = ['addresscat', 'agecat', 'cardtenurecat', 'card2tenurecat',
            'cardtenure', 'commutecat', 'edcat', 'equipmon', 'inccat',
            'lnlongten', 'longten', 'spoused', 'spousedcat', 'tenure',
            'tollmon', 'wiremon']

2. Feature Engineering

  • Log transformation of the target variable (total_spent_ln) to normalize the right-skewed spend distribution
  • Dummy variable creation for categorical variables (excluding boolean variables, which are already binary)
  • Missing value analysis: calculated % missing per variable to guide the imputation strategy
# Percent of values missing per continuous variable
# (the NMISS column holds the raw missing counts out of 5,000 rows)
Missing_values = pd.DataFrame(
    credit_conti_var.apply(continuous_var_summary).T.round(1)['NMISS'] / 5000 * 100
)
# Flag variables with more than 25% missing values for dropping
M_V = Missing_values[Missing_values['NMISS'] > 25]
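The log-transform and dummy-encoding steps above can be sketched as follows (column names like cardspent and card2spent are assumed stand-ins for the actual primary/secondary spend fields):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the survey data
credit = pd.DataFrame({
    "cardspent": [300.0, 1200.0, 150.0],   # primary card spend (assumed name)
    "card2spent": [50.0, 400.0, 100.0],    # secondary card spend (assumed name)
    "inccat": [2, 5, 1],                   # income category code
})

# Target: log of combined primary + secondary card spend
credit["total_spent"] = credit["cardspent"] + credit["card2spent"]
credit["total_spent_ln"] = np.log(credit["total_spent"])

# Dummy-encode a numerically coded categorical, dropping one level as baseline
dummies = pd.get_dummies(credit["inccat"], prefix="inccat", drop_first=True)
credit = pd.concat([credit.drop(columns="inccat"), dummies], axis=1)
```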

3. Modelling

  • Linear Regression as the primary model: interpretable, and its coefficients map directly to business insight
  • Train/test split for out-of-sample validation
  • Focus on coefficient interpretation to identify key spend drivers
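A minimal, self-contained sketch of that fit-and-validate loop on synthetic data (feature names and coefficients are illustrative, not the notebook's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: log-spend driven mainly by income
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "income": rng.normal(50, 15, 500),
    "card2tenure": rng.integers(0, 10, 500).astype(float),
})
y = 0.03 * X["income"] + 0.05 * X["card2tenure"] + rng.normal(0, 0.1, 500)

# Hold out 30% for out-of-sample validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# The coefficients are the "drivers": sign and magnitude per feature
coefs = pd.Series(model.coef_, index=X.columns)
test_r2 = r2_score(y_test, model.predict(X_test))
```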

4. Decile Analysis (Business Validation)

Decile Analysis, a key differentiator of this project, validates model performance in business terms:

# Group predictions into deciles and compare predicted vs actual spend
Predicted_avg = (train.groupby('Deciles')['pred_tot_spend']
                 .mean().sort_index(ascending=False))

Actual_avg = (train.groupby('Deciles')['total_spent_ln']
              .mean().sort_index(ascending=False))

This shows whether the model correctly ranks customers from highest to lowest spenders, which is critical for credit limit assignment.
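The snippet above assumes a Deciles column already exists; it can be built from the predictions with pd.qcut, sketched here on synthetic data (decile 1 = highest predicted spend):

```python
import numpy as np
import pandas as pd

# Synthetic predictions and actuals on the log scale
rng = np.random.default_rng(7)
train = pd.DataFrame({"pred_tot_spend": rng.normal(6.5, 0.8, 1000)})
train["total_spent_ln"] = train["pred_tot_spend"] + rng.normal(0, 0.3, 1000)

# 10 equal-sized bins on predicted spend; label 1 = top decile, 10 = bottom
train["Deciles"] = pd.qcut(
    train["pred_tot_spend"], 10, labels=list(range(10, 0, -1))
).astype(int)
```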


📈 Key Findings

  • Income category is the single strongest predictor of total spend: higher-income customers spend significantly more
  • Card tenure (how long the customer has held the card) positively correlates with spend: loyal customers spend more
  • Secondary card usage is a strong signal: customers using both primary and secondary cards show 2x higher total spend
  • Customers with longer address tenure tend to be more financially stable, with predictable spend patterns
  • Log transformation of the target variable significantly improved model fit by reducing skew

🛠️ Tech Stack

| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| Data Processing | Pandas, NumPy |
| Visualisation | Matplotlib, Seaborn |
| Modelling | Scikit-learn (Linear Regression) |
| Statistical Analysis | SciPy, StatsModels |
| Output | Excel (openpyxl / xlwt) for decile reports |
| Notebook | Jupyter Notebook |

🚀 Running Locally

# Clone the repository
git clone https://github.com/vicky60629/Predicting-Credit-Card-Spend.git
cd Predicting-Credit-Card-Spend

# Install dependencies
pip install -r requirements.txt

# Launch the notebook
jupyter notebook "Predicting Credit Card Spend.ipynb"

📊 Decile Analysis Output

The model's business performance is validated through decile-level comparison of predicted vs actual spend, a standard technique used in banking and insurance to evaluate regression models.

| Decile | Avg Predicted Spend | Avg Actual Spend |
|---|---|---|
| 1 (Top) | Highest | Highest |
| 2 | ↓ | ↓ |
| ... | ... | ... |
| 10 (Bottom) | Lowest | Lowest |

A well-performing model shows monotonically decreasing averages from Decile 1 to 10, confirming it correctly ranks customers by spend potential.
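That ranking property can be checked programmatically; a small sketch using pandas on made-up decile averages:

```python
import pandas as pd

# Made-up decile-level averages, ordered decile 1 (top) through 10 (bottom)
actual_avg = pd.Series(
    [8.1, 7.6, 7.2, 6.9, 6.6, 6.3, 6.0, 5.6, 5.1, 4.3],
    index=range(1, 11),
    name="avg_actual_spend",
)

# A well-ranked model shows strictly falling averages down the deciles
is_well_ranked = actual_avg.is_monotonic_decreasing
```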


💡 Key Learnings & Future Improvements

What worked well:

  • Decile analysis goes beyond standard RMSE/R², validating the model in business language that stakeholders understand
  • Careful missing value and outlier treatment significantly improved model stability
  • Log transformation of the spend target made the residuals more normally distributed

Future enhancements:

  • Try XGBoost / LightGBM for potentially better predictive accuracy
  • Add SHAP values for better feature importance explainability
  • Build a Streamlit dashboard for interactive credit limit prediction
  • Experiment with Ridge / Lasso regression for automatic feature selection
  • Extend to credit limit recommendation engine using predicted spend buckets
  • Track experiments with MLflow

👨‍💻 About the Author

Vicky Gupta, Data Engineering Analyst @ Accenture (4.5 years) | Aspiring Data Scientist

Experienced in PySpark, ETL pipelines, and building end-to-end ML solutions across banking, retail, and NLP domains.

🔗 LinkedIn | GitHub


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


⭐ If you found this useful, please consider starring the repository!
