Machine Learning & Mathematical Models for eCommerce Marketing ROI
Standard analytics platforms usually credit 100% of a sale to the last advertisement a customer clicked before purchasing. This "Last-Touch" model severely undervalues "discovery" channels (like TikTok Ads or Display) and over-credits "closing" channels (like Direct Traffic or Retargeting Emails).
This system applies Data Science & Machine Learning to map every step of the customer journey, mathematically distributing revenue credit to the specific marketing ads that actually drove the sale.
By deploying algorithmic ML attribution over this dataset, we discovered a 42.5% Misallocation Index. Out of a $509,237 total portfolio revenue pool, $216,318 is currently being credited to the wrong channels.
How is this mathematically calculated?
- Absolute Disparity: We compute the absolute dollar difference between the Naive "Last Touch" model and the predictive "XGBoost + SHAP" model for every channel.
- Sum & Halve: We sum the absolute differences and divide by 2. Rerouting $1 from Facebook to TikTok shows up twice, as a -$1 difference for Facebook and a +$1 difference for TikTok, so the raw sum counts every misallocated dollar twice.
- Index Percentage: The de-duplicated volume of misallocated dollars ($216k) is divided by the total portfolio revenue ($509k) to get the Misallocation Index.
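The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's `evaluator.py`: the function name `misallocation_index` and the per-channel figures are hypothetical, and it assumes both models distribute the same total revenue (otherwise halving the sum would not de-duplicate cleanly).

```python
def misallocation_index(last_touch, model):
    """Return (misallocated_dollars, index) for two per-channel revenue dicts."""
    channels = set(last_touch) | set(model)
    # Step 1: absolute dollar disparity per channel.
    abs_diff = {c: abs(last_touch.get(c, 0.0) - model.get(c, 0.0)) for c in channels}
    # Step 2: sum and halve, because every moved dollar appears once as a
    # loss for one channel and once as a gain for another.
    misallocated = sum(abs_diff.values()) / 2
    # Step 3: express the misallocated dollars as a share of total revenue.
    total = sum(last_touch.values())
    return misallocated, misallocated / total


# Hypothetical per-channel revenue (both models sum to $1,000).
last_touch = {"Facebook": 500.0, "TikTok": 100.0, "Email": 400.0}
model = {"Facebook": 300.0, "TikTok": 350.0, "Email": 350.0}

dollars, index = misallocation_index(last_touch, model)
print(dollars, index)  # 250.0 0.25

# The headline figure quoted above follows the same final step.
print(round(216_318 / 509_237 * 100, 1))  # 42.5
```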
If marketing decisions remain reliant on traditional analytics:
- Over 40% of revenue will continue to be credited to the wrong channels, so budget decisions rest on a distorted picture.
- TikTok Ads drive 440% more top-of-funnel revenue than Last-Touch credits them with, yet they are being starved of budget.
- Email and Display are heavily undervalued (+318% and +161% predictive lift respectively), with their influence absorbed by the channels that happen to close the sale.
- Facebook Ads and Organic Search take excessive credit, often claiming roughly 60-70% more revenue than the model attributes to them.
These attribution estimates indicate where the next $100k in marketing budget should go to maximize ROAS.
The project builds up model complexity in four stages to compare how attribution shifts from one model to the next:
- Heuristics (The Analytics Baseline)
  - Applying industry-standard logic: First-Touch, Last-Touch, Linear, U-Shape.
  - Used to understand raw disparities between "Awareness" drivers vs. "Closing" drivers.
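As a rough sketch of how the four heuristics assign credit, the snippet below weights each touchpoint of an ordered journey. The function names are hypothetical, and the 40/20/40 split for U-Shape is a common convention assumed here; the project may use different weights.

```python
from collections import defaultdict


def heuristic_weights(path, model):
    """Return credit weights (summing to 1) for one ordered journey."""
    n = len(path)
    if model == "first_touch":
        return [1.0] + [0.0] * (n - 1)
    if model == "last_touch":
        return [0.0] * (n - 1) + [1.0]
    if model == "linear":
        return [1.0 / n] * n
    if model == "u_shape":
        if n == 1:
            return [1.0]
        if n == 2:
            return [0.5, 0.5]
        # Assumed convention: 40% first, 40% last, 20% spread over the middle.
        middle = 0.2 / (n - 2)
        return [0.4] + [middle] * (n - 2) + [0.4]
    raise ValueError(f"unknown model: {model}")


def attribute(journeys, model):
    """journeys: list of (ordered channel path, revenue). Returns revenue per channel."""
    credit = defaultdict(float)
    for path, revenue in journeys:
        for channel, weight in zip(path, heuristic_weights(path, model)):
            credit[channel] += weight * revenue
    return dict(credit)


# Hypothetical single journey worth $100.
journeys = [(["TikTok", "Email", "Direct"], 100.0)]
first = attribute(journeys, "first_touch")
last = attribute(journeys, "last_touch")
u_shape = attribute(journeys, "u_shape")
print(first["TikTok"], last["Direct"])
```

Comparing `first` against `last` on the same journeys is what exposes the gap between "Awareness" and "Closing" channels.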
- Absorbing Markov Chains (Graph Theory)
  - Maps all customer traffic into a massive probability transition network.
  - Calculates the true "Removal Effect": If we turn off Facebook Ads, what exact percentage of total conversions completely collapses across our network?
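The removal effect can be illustrated with a small standard-library sketch. It is not the project's `markov_model.py`: the state names, function names, and toy journeys are hypothetical, and the conversion probability is approximated by simple fixed-point iteration rather than a matrix solve.

```python
from collections import defaultdict

START, CONV, NULL = "(start)", "(conversion)", "(null)"


def transition_probs(journeys):
    """journeys: list of (path, converted). Returns {state: {next_state: prob}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for path, converted in journeys:
        states = [START, *path, CONV if converted else NULL]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}


def conversion_prob(probs, removed=None, iters=200):
    """P(reaching CONV from START); traffic into `removed` is treated as lost."""
    p = defaultdict(float)  # unknown states (including NULL) default to 0
    p[CONV] = 1.0
    for _ in range(iters):
        for state, nxt in probs.items():
            if state == removed:
                continue
            p[state] = sum(w * p[t] for t, w in nxt.items() if t != removed)
    return p[START]


def removal_effect(probs, channel):
    """Share of conversions lost when `channel` is removed from the network."""
    return 1 - conversion_prob(probs, removed=channel) / conversion_prob(probs)


# Hypothetical journeys: 2 of 4 convert.
journeys = [
    (["TikTok", "Email"], True),
    (["Facebook"], False),
    (["Email"], True),
    (["TikTok"], False),
]
probs = transition_probs(journeys)
effects = {c: removal_effect(probs, c) for c in ("TikTok", "Email", "Facebook")}
print(effects)
```

In this toy data every conversion passes through Email, so removing it wipes out all conversions, while removing Facebook changes nothing. Attribution shares are typically obtained by normalizing the removal effects so they sum to 1.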
- Cooperative Game Theory (Shapley Values)
  - Computes all $2^N$ marketing combinations (coalitions) mathematically to define the exact marginal contribution of an individual channel toward shared revenue pools.
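A minimal sketch of exact Shapley attribution follows. The coalition value function used here (revenue from journeys whose channels all belong to the coalition) is an assumption made for illustration; the project's `shapley_model.py` may define coalition worth differently. Enumerating every coalition is only practical for a small number of channels.

```python
from itertools import combinations
from math import factorial


def shapley_values(channels, value):
    """Exact Shapley values; value(frozenset) returns the worth of a coalition."""
    n = len(channels)
    phi = {}
    for c in channels:
        others = [x for x in channels if x != c]
        total = 0.0
        for k in range(n):
            # Probability weight of a coalition of size k forming before c joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of c to this coalition.
                total += weight * (value(s | {c}) - value(s))
        phi[c] = total
    return phi


# Hypothetical journeys: (set of channels touched, revenue).
journeys = [({"TikTok", "Email"}, 100.0), ({"Email"}, 50.0)]


def coalition_value(coalition):
    """Assumed worth: revenue of journeys fully covered by the coalition."""
    return sum(rev for touched, rev in journeys if touched <= coalition)


phi = shapley_values(["TikTok", "Email"], coalition_value)
print(phi)  # {'TikTok': 50.0, 'Email': 100.0}
```

The values sum to the total revenue of $150, which is the efficiency property that makes Shapley values suitable for splitting a shared revenue pool.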
- Data-Driven ML (Hyper-Tuned XGBoost + SHAP Values) 🚀
  - A binary classification model (XGBoost) predicts conversion likelihood, capturing non-linear synergies between channels (e.g. Organic Search -> Email -> Sale).
  - Handles severe class imbalance (95% non-conversions) using `scale_pos_weight` and `RandomizedSearchCV` hyperparameter tuning scored on F1.
  - SHAP (SHapley Additive exPlanations) is applied over an engineered feature space (channel frequency, first-touch flags) to calculate per-user predictive lift for converted users.
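To make the feature space and the imbalance handling concrete, here is a small standard-library sketch. The feature names (`count_*`, `first_touch_*`) and both helper functions are hypothetical, not taken from `ml_model.py`. The weight is the ratio of negative to positive examples, which is the usual starting value for XGBoost's `scale_pos_weight` parameter.

```python
def build_features(path, channels):
    """Channel frequency counts plus a first-touch flag for one ordered journey."""
    row = {}
    for c in channels:
        row[f"count_{c}"] = path.count(c)
        row[f"first_touch_{c}"] = int(bool(path) and path[0] == c)
    return row


def scale_pos_weight(labels):
    """Ratio of negatives to positives, used to up-weight the rare class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


channels = ["Organic Search", "Email", "TikTok"]
row = build_features(["Organic Search", "Email", "Email"], channels)
print(row)

# With 95% non-conversions the weight is 95 / 5 = 19.
labels = [1] * 5 + [0] * 95
spw = scale_pos_weight(labels)
print(spw)  # 19.0

# The value would then be passed to the classifier, for example
# xgboost.XGBClassifier(scale_pos_weight=spw); the model itself is omitted
# here to keep the sketch dependency-free.
```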
The pipeline automatically generates a presentation-ready suite of charts and interactive web-based storytelling tools.
- `1a_user_journey_sankey.html` & `1b_user_journey_sunburst.html`: Interactive `Plotly` diagrams allowing you to explore hierarchically exactly how users flow through complex discovery and closing channels over time.
- `3b_markov_network_interactive.html`: A full physics-engine powered `PyVis` Network Sandbox. Channel nodes structurally pull toward each other on your screen based on their transition probabilities inside the Markov Matrix.
- Marketing Attribution Baseline Comparison: A strictly business-focused benchmark contrasting ML & Markov against standard Last-Touch.
- Diverging ROI Shift Chart: A stakeholder-friendly horizontal diverging bar chart visually translating the Misallocation Index into exact dollars overfunded and underfunded per channel.
- Markov Removal Effects: Waterfall breakdown computing the exact drop in total pipeline conversion probability if a channel node literally vanishes.
- SHAP Value Beeswarm: Every dot is a user's probability lift, visualizing whether a specific touchpoint was a positive or negative driving anchor in predicting a sale.
- Kaplan-Meier Survival Curves & Hazard Ratios: Survival curves and forest plots generated via `lifelines`, showing which channels act as "Accelerators" (shortening sales cycles) versus "Draggers" (elongating the timeline).
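For readers unfamiliar with the Kaplan-Meier estimator behind the survival curves, the standalone sketch below shows the calculation itself. The pipeline uses `lifelines` for this; the function and data here are hypothetical and only illustrate the idea, with "duration" assumed to be days from first touch and an "observed" event meaning the user converted (unobserved users are censored).

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time with at least one observed event."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Multiply by the share of at-risk users who did NOT convert at t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Both converted and censored users leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical data: days to conversion, with 0 marking censored users.
durations = [1, 2, 2, 3, 5]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, observed)
print(curve)
```

Here S(t) is the estimated probability that a user has not yet converted by day t, so a curve that drops faster for one channel suggests a shorter sales cycle.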
Key Finding Breakdowns:
- Email is a Force-Multiplier: The ML model detects massive non-linear synergy. While an email blast alone converts poorly, when combined with top-of-funnel discovery campaigns, it mathematically multiplies conversion probability, causing XGBoost to allocate it massive predictive value.
- Direct is a Byproduct of Prior Marketing: Up to 26% of observed "Direct Traffic Sales" fundamentally belong to the Paid and Organic campaigns that drove the brand discovery steps ahead of the user manually typing the URL to checkout.
The codebase is split into focused Python modules:
- `colors.py`: Centralized universal color styling applied to every channel ensuring cross-chart coherence.
- `data_loader.py`: Data ingestion, traversal trajectory sorting, and user aggregation.
- `ml_model.py` / `markov_model.py` / `shapley_model.py`: Algorithmic sub-classes cleanly abstracting the math.
- `evaluator.py`: Dataframe alignment rendering exact financial deviations.
- `visualizer.py` & `story_visualizer.py`: End-to-end HTML/PNG asset generators powering the final slide-decks.
Clone the repository and execute the pipeline:
```bash
# Make sure to set up a Python virtual environment (e.g. uv or base venv)
pip install -r requirements.txt
# Run purely statistical validations (cross-validation, Accuracy, F1, C-Index)
python run_performance.py
# Run standard structural pipeline tracing and visualizations
python run_pipeline.py
python run_survival.py
```

The script will output all model deviations locally to `data/processed/` and auto-generate the presentation charts in `reports/figures/`.
Documentation Note: Read `docs/attribution_model_decisions.md` to review the entire development journal containing statistical rationale, formula layouts, and codebase decisions.

