Final Project Ironhack Data Bootcamp

Paula Gual

Data Part Time Barcelona May 2019

Content

Project Description

In this project we will create a Machine Learning model to predict whether a visit to the Google Merchandising store will end in a purchase or not.

Dataset

The dataset I use for the project is the data from the Google Analytics account of the Google Merchandising store. This dataset is available on Kaggle.

[Google Analytics from Google Merchandising Store](https://www.kaggle.com/colinpearse/ga-analytics-with-json-columns)

Workflow

Import Data

I've imported the whole dataset into a pandas DataFrame, reading it in chunks, as the original CSV file is huge.
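
As a rough illustration of chunked reading, here is a minimal sketch. The source does not give the file name or the chunk size, so the in-memory CSV, the `read_in_chunks` helper and `chunksize=4` are assumptions made only to keep the example runnable.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the large CSV (hypothetical rows).
csv_text = "fullVisitorId,channelGrouping,visitNumber\n" + "\n".join(
    f"{i:04d},Organic Search,{i % 3 + 1}" for i in range(10)
)


def read_in_chunks(source, chunksize):
    """Read a CSV piece by piece and concatenate the pieces into one DataFrame."""
    chunks = []
    # Reading the ID column as str keeps leading zeros and avoids
    # precision problems with very long numeric IDs.
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype={"fullVisitorId": str}):
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)


df = read_in_chunks(io.StringIO(csv_text), chunksize=4)
print(df.shape)
```

With a real file, the path of the CSV would take the place of the `io.StringIO` object, and each chunk could be filtered or reduced before being appended.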

Preprocessing

I've classified all the variables on my dataset:

CATEGORICAL:

channelGrouping: Channel of arrival of the visit
fullVisitorId: Visitor ID
sessionId: Session ID (contains Visitor ID)
socialEngagementType: Type of engagement
deviceBrowser: Name of the Browser
deviceCategory: Name of the device category
deviceOS: The OS of the device
geoCity: City of the visit
geoContinent: Continent of the visit
geoCountry: Country of the visit
geoMetro: Metro of the visit
geoDomain: Domain of the visit
geoRegion: Region of the visit
geoSubContinent: Subcontinent of the visit
trafficSourceAdContent: Source of the Ads
adwordsClickAdNetworkType: Ad Source
adwordsClickGclId: Ad Click ID
adwordsClickInfo.slot: Place of the ad
trafficSourceCampaign: Name of the source campaign
trafficSourceKeyword: Name of the source Keyword
trafficSourceMedium: Name of the source Medium
trafficSourceReferralPath: Referral Path
trafficSourceSource: Name of the source

BOOLEAN:

deviceIsMobile: is the device mobile
totalsBounces: Is bounce
totalsNewVisits: Is new visit
adwordsClickIsVideoAd: Is from ad video
trafficSourceIsTrueDirect: Is it a true direct visit

ORDINAL:

visitId: Visit ID, an integer (the visit date in unix time)
visitNumber: The visit number of one user

NUMERICAL:

INTEGER:

totalsHits: number of hits
totalsPageviews: number of pageviews
totalsTransactionRevenue: (TARGET) logarithm of the revenue
totalsVisits: Number of visits
adwordsClickInfoPage: click from ad in page

DATE:

date: Date of the visit
visitStartTime: unix time of the visit
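
The two date variables can be turned into proper timestamps. The sketch below assumes that `date` is stored as a `YYYYMMDD` integer (the usual Google Analytics export format, not confirmed by this README) and that `visitStartTime` is in seconds. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "date": [20170801, 20170802],
    "visitStartTime": [1501583974, 1501632000],
})

# Assumed YYYYMMDD layout for the date column.
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
# Unix time in seconds converted to a timestamp (UTC, timezone-naive).
df["visitStartTime"] = pd.to_datetime(df["visitStartTime"], unit="s")

print(df.dtypes)
```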

I've preprocessed and cleaned all the variables, mainly by replacing the NaNs using different strategies depending on the data. I have also fixed the incorrect types of some variables.
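
The README does not list which strategy was applied to which variable, so the following is only a sketch of what "different strategies" could look like. The choice of 0 for a missing bounce, the median for pageviews and the `"unknown"` label for keywords are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values.
df = pd.DataFrame({
    "totalsBounces": [1.0, np.nan, np.nan, 1.0],
    "totalsPageviews": [3.0, np.nan, 5.0, 1.0],
    "trafficSourceKeyword": ["shirt", None, None, "mug"],
})

# Flag-like column: a missing value is assumed to mean "no bounce",
# and the type is fixed from float to bool.
df["totalsBounces"] = df["totalsBounces"].fillna(0).astype(bool)

# Count column: missing values are filled with the median (assumed strategy),
# and the type is fixed from float to int.
median_pageviews = df["totalsPageviews"].median()
df["totalsPageviews"] = df["totalsPageviews"].fillna(median_pageviews).astype(int)

# Categorical column: missing values get an explicit placeholder label.
df["trafficSourceKeyword"] = df["trafficSourceKeyword"].fillna("unknown")

print(df)
```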

EDA

I have divided the EDA into two parts: the EDA of the numeric vars and the EDA of the categorical vars.

For the EDA of the numeric vars, I have first analysed the target. Then I have analysed the other numeric variables, both on their own (distribution) and compared to the target. I have also visualized all the data, mainly with scatterplots and boxplots.

I have also checked the correlations, and deleted a variable (totalsHits) that was highly correlated with another (totalsPageviews).
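
A correlation check of this kind could be sketched as follows. The sample values and the 0.9 threshold are assumptions; the README only states that hits was dropped in favour of pageviews.

```python
import pandas as pd

# Hypothetical sample in which hits and pageviews move together.
df = pd.DataFrame({
    "totalsHits": [1, 4, 9, 15, 30],
    "totalsPageviews": [1, 3, 8, 12, 25],
    "visitNumber": [3, 1, 2, 1, 4],
})

threshold = 0.9  # assumed cut-off for "highly correlated"
corr = df.corr()
cols = list(corr.columns)

# Collect every pair of distinct columns whose absolute correlation is above the threshold.
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)

# As in the project, keep pageviews and drop hits.
df_reduced = df.drop(columns=["totalsHits"])
```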

For the EDA of the categorical vars, I have studied the different values of each one, and also compared the counts of those values depending on the target var. I have also visualized all the data, mainly with bar charts.

In some cases, as the variable had many possible values, I have created a new var with only the most frequent values, or the values most related to the target.
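
One way to keep only the most frequent values of a categorical variable is sketched below. The helper name `keep_top_values`, the `"Other"` label and the sample browsers are hypothetical; the README does not say how many values were kept for each variable.

```python
import pandas as pd

# Hypothetical sample of a high-cardinality categorical variable.
browsers = pd.Series(
    ["Chrome", "Chrome", "Safari", "Chrome", "Firefox", "Safari", "Opera", "Edge"],
    name="deviceBrowser",
)


def keep_top_values(series, n, other_label="Other"):
    """Keep the n most frequent values and replace the rest with other_label."""
    # value_counts() sorts by frequency, so the first n index entries are the top values.
    top = series.value_counts().head(n).index
    return series.where(series.isin(top), other_label)


browser_top = keep_top_values(browsers, n=2)
print(browser_top.value_counts())
```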

Feature Engineering and Feature Selection

I have selected the features that, based on the EDA, I believe are most related to the target.

I have encoded the categorical vars and created dummies.
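
Creating dummies with pandas could look like this minimal sketch. The sample data and the `continent` prefix are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric feature.
df = pd.DataFrame({
    "geoContinent": ["Americas", "Europe", "Americas", "Asia"],
    "totalsPageviews": [10, 2, 7, 1],
})

# One 0/1 column per category; the original categorical column is removed.
encoded = pd.get_dummies(df, columns=["geoContinent"], prefix="continent", dtype=int)
print(encoded.columns.tolist())
```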

Model

In the model selection process I will take into account:

1.- It is a binary classification problem
2.- The data is sparse
3.- The target is very unbalanced (1.29% vs 98.71%)
4.- There is a lot of data: 903455 rows and 48 columns

Model Selection

1.- I'll use a tree-based model
2.- I'll try to balance the classes by downsampling the data
3.- I'll use the Area Under the ROC Curve (AUROC) as the metric
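
Downsampling the majority class could be sketched as below. The `bought` column (a binary flag for visits with revenue), the helper name `downsample_majority` and the sample sizes are hypothetical; the README does not show how the balancing was implemented.

```python
import pandas as pd

# Hypothetical unbalanced sample: 5 buyers and 95 non-buyers.
df = pd.DataFrame({
    "totalsPageviews": range(100),
    "bought": [1] * 5 + [0] * 95,
})


def downsample_majority(data, target, random_state=0):
    """Sample every class down to the size of the smallest class."""
    minority_size = data[target].value_counts().min()
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in data.groupby(target)
    ]
    # Shuffle so the classes are not stored in blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


balanced = downsample_majority(df, "bought")
print(balanced["bought"].value_counts())
```

A common practice is to downsample only the training split, so that the evaluation reflects the real class ratio. The README does not say which split was downsampled.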

Algorithm: Random Forest

Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': 9, 'max_features': 'auto', 'n_estimators': 100}
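
A sketch of training a Random Forest with these parameters in scikit-learn and scoring it with AUROC follows. The synthetic data from `make_classification` is a stand-in for the real features. Note one swap: recent scikit-learn releases no longer accept `max_features='auto'`, so the sketch uses `'sqrt'`, which is what `'auto'` meant for classifiers in older releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Google Analytics features (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(
    bootstrap=True,
    criterion="gini",
    max_depth=9,
    max_features="sqrt",  # stands in for the 'auto' value listed above
    n_estimators=100,
    random_state=0,  # added only to make the sketch reproducible
)
model.fit(X_train, y_train)

# AUROC is computed from the predicted probability of the positive class.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.2f}")

# Importances sum to 1 and can be ranked to see which features matter most.
importances = model.feature_importances_
```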

Results

The final results of the model are:

F1: 0.96

Confusion Matrix:
(2144, 159)
(30, 2273)

AUC: 0.96
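
The reported F1 can be cross-checked from the confusion matrix. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes, and the second class is the positive one), which the README does not state.

```python
# Confusion matrix from the results, read under the assumed layout.
tn, fp = 2144, 159
fn, tp = 30, 2273

precision = tp / (tp + fp)  # share of predicted buyers that really bought
recall = tp / (tp + fn)     # share of real buyers that were detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.93 recall=0.99 f1=0.96 accuracy=0.96
```

The result agrees with the reported F1 of 0.96. Both classes have 2303 rows in this matrix, so the evaluation set is balanced.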

Feature Importance

Importance / Feature
0.5045 / totalsPageviews
0.1228 / totalsBounces
0.0820 / continet_Americas
0.0374 / source_Googleplex
0.0315 / referral_Home

Conclusions

The conclusion is that it is possible to predict conversion in an e-commerce store using only the Google Analytics data of its visits.

