This project addresses a business problem in a retail marketing campaign. Sci Mart is a retail business whose last marketing campaign performed poorly, and it wants to solve this problem with data science. The available resource is historical customer marketing campaign data.
Based on the data, Sci Mart has a low response rate (14.90%) and a negative profit (-$3,046).
Goals:
- Increase response rate
- Increase profit
Objective :
- Create a model to predict customer response
Business metric :
- Response rate
To run this project you will need Jupyter Notebook. The project runs in several stages, all combined in source_code_master.ipynb.
These are the libraries you need to run the project. The pip install command is listed for each one to make setup easy.
- Pandas
pip install pandas
- Matplotlib
pip install matplotlib
- Seaborn
pip install seaborn
- scikit-learn
pip install scikit-learn
The data has 2240 rows and 29 columns. Features fall into 5 types: demographic, spending (monetary), frequency, campaign, and parameter. Demographic features contain customer information such as birth date, kidhome, education, marital status, etc. Monetary features contain total spending on specific product types. Frequency features contain customer activity data such as web visits, catalog purchases, etc. Campaign features record which previous campaigns the customer accepted. Parameter features contain data such as complain, response, and recency.
There are 24 null values in the Income feature; these rows are dropped during processing. Irrelevant features such as ID, Z_Cost, Z_Revenue, and Dt_customer are dropped in the pre-processing step. Z_Cost and Z_Revenue are constants used for the profit calculation.
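The two business figures can be reproduced with a few lines of arithmetic. The sketch below assumes that profit is revenue per response times the number of responders, minus cost per contact times the number of customers contacted; this formula, the function name `campaign_metrics`, and the sample numbers are illustrative assumptions, not values taken from the notebook.

```python
def campaign_metrics(n_contacted, n_responded, cost_per_contact, revenue_per_response):
    """Return (response_rate, profit) for one campaign.

    Assumed formula: profit = responders * revenue - contacts * cost.
    """
    if n_contacted <= 0:
        raise ValueError("n_contacted must be positive")
    response_rate = n_responded / n_contacted
    profit = n_responded * revenue_per_response - n_contacted * cost_per_contact
    return response_rate, profit


# Hypothetical numbers for illustration only (not the project's data).
rate, profit = campaign_metrics(
    n_contacted=1000,
    n_responded=150,
    cost_per_contact=3,       # plays the role of Z_Cost
    revenue_per_response=11,  # plays the role of Z_Revenue
)
print(f"response rate: {rate:.2%}, profit: {profit}")
```

Under this formula a campaign loses money whenever the response rate is below cost divided by revenue, which is why raising the response rate and raising profit go together.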
There are some outliers in the Year_Birth and Income columns that are handled in the pre-processing step.
The Mnt (spending) columns are mostly positively skewed and need to be transformed in the pre-processing step.
For categorical columns, customers are concentrated in a few groups: Graduation for education, Married for marital status, and 0 for both Kidhome and Teenhome.
The target distribution (Response column) is not balanced, so the training data is oversampled in the pre-processing step. Otherwise the model would tend to favor the majority class.
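As a minimal sketch of balancing the training data, the snippet below uses plain random oversampling with pandas: minority rows are resampled with replacement until both classes have the same size. The README does not say which oversampling technique the notebook uses, so random oversampling, the helper name `random_oversample`, and the toy data are assumptions.

```python
import pandas as pd


def random_oversample(train_df, target="Response", random_state=42):
    """Duplicate minority-class rows until every class matches the largest one."""
    counts = train_df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, size in counts.items():
        group = train_df[train_df[target] == label]
        if size < majority_size:
            # Sample with replacement to fill the gap to the majority size.
            extra = group.sample(
                n=majority_size - size, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy training set (hypothetical values): 4 non-responders, 2 responders.
train = pd.DataFrame(
    {"Income": [10, 20, 30, 40, 50, 60], "Response": [0, 0, 0, 0, 1, 1]}
)
balanced = random_oversample(train)
print(balanced["Response"].value_counts())
```

Oversampling should be applied to the training split only, so that the test set keeps the original class ratio and the evaluation stays realistic.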
Correlations between the features and the target are mostly weak, which suggests that a model able to capture non-linear relationships is needed for this case.
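A quick way to inspect these correlations is to take the Response column of the correlation matrix. The sketch below uses a tiny made-up frame, so the values are not the project's; it assumes Pearson correlation, which measures linear association only.

```python
import pandas as pd

# Toy data (hypothetical values), using column names mentioned in this README.
df = pd.DataFrame(
    {
        "Income": [1, 2, 3, 4],
        "Recency": [4, 3, 2, 1],
        "Response": [0, 0, 1, 1],
    }
)

# Pearson correlation of every feature with the target, target itself removed.
target_corr = df.corr()["Response"].drop("Response")
print(target_corr)
```

Because Pearson correlation only captures linear relationships, weak values here do not rule out a useful non-linear signal, which is the motivation for trying tree-based models.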
Response distribution across categorical features: there is a pattern in the Kidhome and Teenhome features, where customers with fewer kids or teens at home are more likely to respond.
For numerical features, the distributions for some products and activities differ between responders and non-responders.
Several steps prepare the data for modelling.
- Cleansing: remove missing values, duplicated values, and outliers.
- Encoding: encode categorical features.
- Transformation: apply a log transformation to the Mnt (spending) features.
- Feature extraction: create new features such as age and dependents.
- Class imbalance: balance the data with an oversampling technique.
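The steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the notebook's code: the column names Education, Marital_Status and MntWines, the new feature names Age and Dependents, the use of `log1p`, one-hot encoding, and the `reference_year` parameter are all assumed. Outlier removal and oversampling are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd


def preprocess(df, reference_year):
    """Clean, transform, and encode the raw campaign data (illustrative)."""
    # Cleansing: drop rows with missing Income and exact duplicate rows.
    out = df.dropna(subset=["Income"]).drop_duplicates().copy()
    # Drop columns the README lists as irrelevant for modelling.
    out = out.drop(
        columns=["ID", "Z_Cost", "Z_Revenue", "Dt_customer"], errors="ignore"
    )
    # Transformation: log(1 + x) reduces the positive skew of spending columns.
    mnt_cols = [c for c in out.columns if c.startswith("Mnt")]
    if mnt_cols:
        out[mnt_cols] = np.log1p(out[mnt_cols])
    # Feature extraction: age and number of dependents at home.
    out["Age"] = reference_year - out["Year_Birth"]
    out["Dependents"] = out["Kidhome"] + out["Teenhome"]
    # Encoding: one-hot encode the categorical columns that are present.
    cat_cols = [c for c in ("Education", "Marital_Status") if c in out.columns]
    return pd.get_dummies(out, columns=cat_cols)


# Toy input (hypothetical values).
raw = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Year_Birth": [1970, 1985, 1990],
        "Education": ["Graduation", "PhD", "Graduation"],
        "Income": [50000.0, None, 72000.0],
        "Kidhome": [1, 0, 0],
        "Teenhome": [1, 0, 0],
        "MntWines": [0, 10, 99],
    }
)
clean = preprocess(raw, reference_year=2020)
print(clean)
```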
For this classification task the model metric is precision. Precision is chosen because reducing false positives (customers predicted to respond who do not) directly raises the response rate of the contacted group. The models used in the modelling step are: DecisionTree Classifier, RandomForest Classifier, XGB Classifier, LGBM Classifier, and Logistic Regression. The models were tuned by adjusting hyperparameters, and the three best-performing models were selected.
Feature importances are generated from the model.
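A minimal sketch of evaluating one of the listed model types on precision and reading its feature importances is shown below. It uses synthetic data from `make_classification` rather than the campaign data, and the hyperparameters and feature names are placeholders, so the printed numbers say nothing about the project's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the campaign data (about 15% positives).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X = pd.DataFrame(X, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Precision = true positives / all predicted positives.
y_pred = model.predict(X_test)
precision = precision_score(y_test, y_pred, zero_division=0)
print(f"precision: {precision:.3f}")

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```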
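The simulation uses the test set (X_test) from the modelling step. Note that the sample sizes differ: the baseline is computed on the 2240 customers in the original data, while the simulation covers only the 103 test-set customers predicted as responders (class 1). In the simulation, the response rate increases to 51.61% and profit increases to $249.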
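The simulation idea can be sketched as: contact only the customers the model predicts as responders, then compute the response rate and profit on that group. The profit formula, the function name `simulate_targeted_campaign`, and the toy labels below are assumptions for illustration.

```python
import numpy as np


def simulate_targeted_campaign(y_true, y_pred, cost_per_contact, revenue_per_response):
    """Contact only predicted responders; return (n_contacted, rate, profit)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    contacted = y_pred == 1
    n_contacted = int(contacted.sum())
    if n_contacted == 0:
        return 0, 0.0, 0.0
    n_responded = int(y_true[contacted].sum())
    response_rate = n_responded / n_contacted
    # Assumed formula: revenue from responders minus cost of every contact.
    profit = n_responded * revenue_per_response - n_contacted * cost_per_contact
    return n_contacted, response_rate, profit


# Toy labels (hypothetical): 1 = responded / predicted to respond.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
n, rate, profit = simulate_targeted_campaign(
    y_true, y_pred, cost_per_contact=3, revenue_per_response=11
)
print(n, f"{rate:.2%}", profit)
```

The response rate of the contacted group is exactly the model's precision on that data, which is why precision was chosen as the model metric.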
There are several recommendations to increase the response rate:
- Targeted marketing: some features are distributed differently between responders and non-responders and should be considered when choosing targets. Good targets are customers who have a recency of 0 - 20, or a total of accepted campaigns >= 2, or an income > 70,000.
- Discount: offer discounts to encourage customers to buy again within 20 days, which keeps their recency low and makes them more likely to respond to the campaign.
- Product recommendation: recommend more meat and wine products, because spending on these products is distributed differently between responders and non-responders.
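The targeting rule in the first recommendation can be expressed as a simple filter. In this sketch the column name `TotalAcceptedCmp` (number of previously accepted campaigns) and the helper `select_targets` are hypothetical, and the customer rows are made up.

```python
import pandas as pd


def select_targets(customers):
    """Keep customers matching at least one of the three targeting rules."""
    mask = (
        customers["Recency"].between(0, 20)      # recency of 0 - 20
        | (customers["TotalAcceptedCmp"] >= 2)   # accepted 2+ past campaigns
        | (customers["Income"] > 70000)          # income above 70,000
    )
    return customers[mask]


# Toy customers (hypothetical values).
customers = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Recency": [5, 50, 60, 45],
        "TotalAcceptedCmp": [0, 3, 0, 1],
        "Income": [30000, 40000, 90000, 50000],
    }
)
targets = select_targets(customers)
print(targets["ID"].tolist())
```

The three conditions are combined with OR, following the wording of the recommendation, so a customer qualifies by meeting any one of them.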










