🎯 Problem statement
Conduct a thorough exploratory data analysis (EDA) and hypothesis testing on a dataset of customer sessions on an online shopping site, where each row records one visit and whether it ended in a purchase.

| Feature | Description |
|---|---|
| Administrative | Number of administrative pages visited (e.g., account, cart, orders). |
| Administrative_Duration | Time spent on administrative pages. |
| Informational | Number of informational pages visited. |
| Informational_Duration | Time spent on informational pages. |
| ProductRelated | Number of product-related pages visited. |
| ProductRelated_Duration | Time spent on product-related pages. |
| BounceRates | % of visitors who exit without further interaction. |
| ExitRates | % of pageviews ending on a specific page. |
| PageValues | Average value of the page relative to transaction completion. |
| SpecialDay | Proximity of browsing date to special days, e.g., holidays. |
| Month | Month of the pageview (string format). |
| OperatingSystems | Integer representing the user's operating system. |
| Browser | Integer representing the user's browser. |
| Region | Integer representing the user's location region. |
| TrafficType | Integer categorizing the traffic type. |
| VisitorType | Visitor status: New, Returning, or Other. |
| Weekend | Boolean indicating if the session occurred on a weekend. |
| Revenue | Boolean indicating if the user completed a purchase. |
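
As a rough sketch of how the schema above might be encoded in pandas, the snippet below casts the integer-coded and label columns to `category` and the two flags to `bool`. The helper name `apply_schema` and the three-row sample are hypothetical and exist only for illustration; the real dataset's file name and raw encodings are not given here.

```python
import pandas as pd

# Columns the table describes as codes or labels rather than quantities.
CATEGORICAL_COLUMNS = ["Month", "OperatingSystems", "Browser", "Region",
                       "TrafficType", "VisitorType"]
BOOLEAN_COLUMNS = ["Weekend", "Revenue"]


def apply_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with categorical and boolean columns typed explicitly."""
    out = df.copy()
    for col in CATEGORICAL_COLUMNS:
        if col in out.columns:
            out[col] = out[col].astype("category")
    for col in BOOLEAN_COLUMNS:
        if col in out.columns:
            # Assumes flags arrive as 0/1 or True/False; string flags such as
            # "TRUE"/"FALSE" would need an explicit mapping instead.
            out[col] = out[col].astype(bool)
    return out


# Made-up sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "ProductRelated": [12, 3, 40],
    "Month": ["Feb", "Mar", "Nov"],
    "Region": [1, 3, 1],
    "VisitorType": ["Returning", "New", "Returning"],
    "Weekend": [0, 1, 0],
    "Revenue": [0, 0, 1],
})
typed = apply_schema(sample)
print(typed.dtypes)
```

Typing `Region`, `Browser`, and similar columns as categories keeps them out of numeric summaries, where averaging arbitrary integer codes would be meaningless.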

- Data Preprocessing: Handled missing values, formatted data types, and ensured all necessary transformations were made for consistency in the dataset.
- Univariate Analysis: Plotted histograms and box plots for each numerical feature to identify distribution shapes and detect outliers in features like `PageValues`, `BounceRates`, and `ExitRates`.
- Correlation Analysis: Calculated correlations between numerical features to detect potential relationships, focusing on features like `PageValues`, `Revenue`, and `Duration`.
- Visualizations: Created scatter plots, pair plots, and heatmaps to visualize relationships between key numerical variables, such as `PageViews`, `Duration`, and `Revenue`.
- Class Distribution: Examined the distribution of the target variable (`Revenue`) to assess class balance and evaluate potential bias in the dataset.
- Page Category Analysis: Summarized page views, session durations, and bounce/exit rates for different page categories to identify user behavior patterns on each page type.
- SpecialDay Analysis: Investigated the distribution of the `SpecialDay` feature and analyzed its correlation with `Revenue` to understand how special events influence conversions.
- Binary Feature Creation: Generated a binary feature indicating whether a user visited all three page categories (`Informational`, `ProductRelated`, `Administrative`) during their session.
- PageValues and Behavior Analysis: Explored the relationship between `PageValues` and factors like `TrafficType`, `VisitorType`, and `Region`, highlighting engagement and purchase behavior differences.
- User Session Length Impact: Analyzed user session lengths to determine their influence on conversion rates, identifying any trends between longer sessions and higher purchase likelihood.
- User Grouping by Behavior: Grouped users based on `VisitorType`, `OperatingSystems`, and `Region` to identify behavioral differences and their impact on conversion rates.
- Traffic Type Segmentation: Segmented users by `TrafficType` and analyzed engagement patterns, exploring the impact of different traffic sources on purchase probability and session behaviors.

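
Two of the steps listed above, the binary "visited all three page categories" feature and the per-group conversion rates, could be sketched in pandas roughly as follows. The column name `VisitedAllCategories`, the helper names, and the four sample rows are assumptions made for illustration, not the project's actual code or data.

```python
import pandas as pd

PAGE_COUNT_COLUMNS = ["Administrative", "Informational", "ProductRelated"]


def add_visited_all_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Flag sessions with at least one page view in every page category."""
    out = df.copy()
    # Hypothetical feature name; True only when all three counts are > 0.
    out["VisitedAllCategories"] = (out[PAGE_COUNT_COLUMNS] > 0).all(axis=1)
    return out


def conversion_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of sessions with Revenue == True within each group of `column`."""
    return df.groupby(column)["Revenue"].mean().sort_values(ascending=False)


# Made-up sessions, for illustration only.
sessions = pd.DataFrame({
    "Administrative": [2, 0, 1, 0],
    "Informational": [1, 0, 0, 0],
    "ProductRelated": [10, 5, 8, 1],
    "VisitorType": ["Returning", "New", "Returning", "New"],
    "Revenue": [True, False, False, False],
})
sessions = add_visited_all_categories(sessions)

overall_rate = sessions["Revenue"].mean()           # class balance of the target
rates = conversion_rate_by(sessions, "VisitorType")  # grouping step
print(overall_rate)
print(rates)
```

The same `conversion_rate_by` call could be pointed at `TrafficType`, `Region`, or `OperatingSystems` to cover the other grouping steps, since each one reduces to a group-wise mean of the boolean `Revenue` column.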
This comprehensive approach prepared the data for in-depth analysis and built the foundation for actionable insights.
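
The problem statement also calls for hypothesis testing, but the specific tests are not listed here. As one possible example, the sketch below runs a Pearson chi-square test of independence between `Weekend` and `Revenue`. It is hand-rolled for the 2x2 case (one degree of freedom, no continuity correction) so that it needs only pandas and the standard library; the function name `chi_square_2x2` and the sample counts are made up for illustration.

```python
import math

import pandas as pd


def chi_square_2x2(df: pd.DataFrame, row: str, col: str) -> tuple[float, float]:
    """Pearson chi-square test of independence for two two-valued columns.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    observed = pd.crosstab(df[row], df[col]).to_numpy(dtype=float)
    if observed.shape != (2, 2):
        raise ValueError("both columns must take exactly two values")
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: row total * column total / N.
    expected = row_totals @ col_totals / observed.sum()
    statistic = float(((observed - expected) ** 2 / expected).sum())
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: 20 weekend sessions (10 purchases), 20 weekday sessions (5).
toy = pd.DataFrame({
    "Weekend": [True] * 20 + [False] * 20,
    "Revenue": [True] * 10 + [False] * 10 + [True] * 5 + [False] * 15,
})
statistic, p_value = chi_square_2x2(toy, "Weekend", "Revenue")
print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
```

For tables larger than 2x2, such as `VisitorType` against `Revenue`, the closed-form p-value above no longer applies, and a library routine such as `scipy.stats.chi2_contingency` would be the more natural choice.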