Data-Adds: IAP Revenue Prediction Challenge

A machine learning solution for predicting 7-day In-App Purchase (IAP) revenue using advanced gradient boosting models. This project was developed for a datathon competition focused on user monetization prediction in mobile advertising.

📊 Project Overview

The goal is to predict iap_revenue_d7 (7-day IAP revenue) for mobile app users based on:

  • User behavior features (session activity, retention, engagement)
  • Device information (make, model, OS)
  • Advertiser data (bundle, category, taxonomy)
  • Historical purchase patterns
  • Temporal features (time of day, weekday patterns)
  • Geographic data (country, region)

Final Performance: MSLE of 0.160197 on the validation set

🎯 Solution Approach

Model Architecture

The final solution uses XGBoost (Extreme Gradient Boosting) as the primary model, reaching its final validation score through:

  1. Optimized Hyperparameters:

    • Learning rate: 0.01
    • Max depth: 8
    • Min child weight: 20
    • Subsample: 0.7
    • Column sample by tree: 0.7
    • Tree method: histogram-based
    • Early stopping: 100 rounds
  2. Target Transformation: Log1p transformation (np.log1p()) to handle the skewed revenue distribution

  3. Training Strategy:

    • 2000 boost rounds with early stopping
    • RMSE on the log1p-transformed target as the evaluation metric (see the equation below)
    • ~1215 iterations to convergence
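
Because training, early stopping, and evaluation all happen in log1p space, minimizing RMSE there is the same as minimizing the square root of the competition's MSLE metric:

$$\mathrm{MSLE}=\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i)-\log(1+y_i)\bigr)^2,\qquad \mathrm{RMSE}_{\log}=\sqrt{\mathrm{MSLE}}$$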

Data Processing Pipeline

1. Data Loading

  • Efficient parquet file loading with Dask
  • Selective column loading (62 most relevant features)
  • Sampling strategy: 10% of the training data for faster iteration (see the sketch below)
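
A minimal sketch of this loading step; the path and abbreviated column list are illustrative (the notebooks select 62 columns):

import dask.dataframe as dd

# Read only the columns the model needs (full 62-column list abbreviated here)
COLUMNS = ['iap_revenue_d7', 'num_buys_bundle', 'country', 'dev_make', 'datetime']
ddf = dd.read_parquet('train/', columns=COLUMNS)

# Take a 10% row sample for faster iteration, then materialize in memory
train_df = ddf.sample(frac=0.10, random_state=42).compute()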

2. Feature Engineering

Numerical Features:

  • Aggregation of list-type columns (summing tuple values):
    • iap_revenue_usd_bundle
    • num_buys_bundle
    • rwd_prank
    • whale_users_bundle_num_buys_prank
    • whale_users_bundle_revenue_prank

Categorical Features (Label Encoding):

  • advertiser_bundle
  • advertiser_category
  • advertiser_subcategory
  • country, region
  • dev_make, dev_model, dev_os, dev_osv
  • carrier

Temporal Features:

  • hour, weekday
  • weekend_ratio, hour_ratio
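
The hour and weekday fields can be derived from the event timestamp, as in the sketch below (the exact definitions of the ratio features are not documented here):

# Derive simple temporal features from the event timestamp
train_df['hour'] = train_df['datetime'].dt.hour
train_df['weekday'] = train_df['datetime'].dt.weekday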

User Behavior Features:

  • avg_act_days, avg_daily_sessions
  • avg_days_ins, avg_duration
  • weeks_since_first_seen
  • wifi_ratio
  • retentiond7

3. Preprocessing Steps

  1. Handle list/tuple columns by summing values
  2. Label encoding for categorical variables
  3. Fill missing values with 0
  4. Convert to categorical dtype for XGBoost optimization
  5. Log1p transformation on the target variable (all five steps are sketched below)
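
A minimal sketch of steps 1-5, assuming a pandas DataFrame from the loading step and abbreviated column lists:

import numpy as np
from sklearn.preprocessing import LabelEncoder

LIST_COLS = ['iap_revenue_usd_bundle', 'num_buys_bundle', 'rwd_prank']  # abbreviated
CAT_COLS = ['advertiser_bundle', 'country', 'dev_make', 'carrier']      # abbreviated

# 1. Collapse list/tuple columns into scalar sums
for col in LIST_COLS:
    train_df[col] = train_df[col].apply(
        lambda v: sum(v) if isinstance(v, (list, tuple)) else v
    )

# 2. Label-encode categoricals (in practice, fit on train and test together
#    so unseen test categories get consistent codes)
for col in CAT_COLS:
    train_df[col] = LabelEncoder().fit_transform(train_df[col].astype(str))

# 3. Fill remaining missing values with 0
train_df = train_df.fillna(0)

# 4. Mark categoricals so XGBoost can use native categorical splits
for col in CAT_COLS:
    train_df[col] = train_df[col].astype('category')

# 5. Log1p-transform the target
y_train_log = np.log1p(train_df.pop('iap_revenue_d7'))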

4. Train/Validation Split

  • Time-based split using datetime column
  • Ensures temporal consistency (no data leakage); see the sketch below
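
A sketch of the split; the 80/20 cut point is an assumption:

# Sort chronologically so validation rows come strictly after training rows
train_df = train_df.sort_values('datetime')

cutoff = int(len(train_df) * 0.8)  # illustrative split ratio
train_part = train_df.iloc[:cutoff]
val_part = train_df.iloc[cutoff:]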

📁 Project Structure

datathon/
├── bestprepro.ipynb              # LightGBM preprocessing experiments
├── sergi/
│   ├── GXBOOST.ipynb            # Final XGBoost implementation ⭐
│   ├── GXBOOST copy.ipynb       # XGBoost variant
│   └── ...
├── submission_xgboost_fast2.csv # Final submission file
├── train/                        # Training data (parquet files)
└── test/                         # Test data (parquet files)

🔧 Key Technologies

  • XGBoost: Primary gradient boosting framework
  • Pandas & Dask: Data manipulation and distributed computing
  • NumPy: Numerical operations
  • Scikit-learn: Preprocessing and evaluation metrics
  • PyArrow: Fast parquet file reading

📈 Model Performance

Metric            Value
Validation MSLE   0.160197
Validation RMSE   $330,317.43
Best Iteration    ~1215
Training Time     ~2-3 minutes per fold

🚀 Usage

Training the Model

import xgboost as xgb
import numpy as np
import pandas as pd

# Optimized parameters
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth': 8,
    'min_child_weight': 20,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'tree_method': 'hist',
    'device': 'cpu'
}

# Create DMatrix with categorical features
dtrain = xgb.DMatrix(X_train, label=y_train_log, enable_categorical=True)
dval = xgb.DMatrix(X_val, label=y_val_log, enable_categorical=True)

# Train model
model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=100,
    verbose_eval=50
)

Making Predictions

# Predict on test set
dtest = xgb.DMatrix(X_test, enable_categorical=True)
pred_log = model.predict(dtest)

# Inverse log transformation
pred_revenue = np.expm1(pred_log).clip(0, None)

# Create submission and write it to disk
submission = pd.DataFrame({
    'row_id': test_row_ids,
    'iap_revenue_d7': pred_revenue
})
submission.to_csv('submission_xgboost_fast2.csv', index=False)

📊 Feature Importance

Top features contributing to predictions (a snippet for extracting importances follows the list):

  1. Historical purchase data (num_buys_bundle, iap_revenue_usd_bundle)
  2. User engagement metrics (avg_daily_sessions, avg_duration)
  3. Retention indicators (retentiond7, weeks_since_first_seen)
  4. Device characteristics (dev_make, dev_model)
  5. Geographic location (country, region)
  6. Temporal patterns (hour, weekday, weekend_ratio)
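
A quick way to inspect gain-based importances, assuming the `model` booster from the training snippet above:

# Print the ten highest-gain features from the trained booster
importance = model.get_score(importance_type='gain')
for name, gain in sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f'{name}: {gain:.1f}')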

🎓 Key Learnings

  1. Data Quality Over Quantity: a 10% sample with better feature engineering outperformed training on larger slices of the raw data
  2. Target Transformation: the log1p transformation was crucial for handling the heavy-tailed revenue distribution
  3. Categorical Handling: XGBoost's native categorical support (via enable_categorical=True) improved performance
  4. Early Stopping: Prevented overfitting and reduced training time
  5. List Aggregation: Properly aggregating list-type features captured more signal than dropping them

🔬 Alternative Approaches Explored

LightGBM

  • Tested alongside XGBoost
  • Similar performance, but XGBoost had an edge on this dataset
  • Faster training but required more hyperparameter tuning
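
A sketch of a comparable LightGBM configuration (the parameter mapping is an assumption; the actual experiments live in bestprepro.ipynb):

import lightgbm as lgb

# Roughly equivalent settings to the XGBoost configuration above
lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth': 8,
    'bagging_fraction': 0.7,
    'bagging_freq': 1,
    'feature_fraction': 0.7,
}
dtrain_lgb = lgb.Dataset(X_train, label=y_train_log)
dval_lgb = lgb.Dataset(X_val, label=y_val_log, reference=dtrain_lgb)
booster = lgb.train(
    lgb_params, dtrain_lgb, num_boost_round=2000,
    valid_sets=[dval_lgb],
    callbacks=[lgb.early_stopping(100)],
)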

Teacher-Student Distillation

  • PyTorch-based neural network approach
  • Teacher model: Larger network with dual heads (regression + classification), sketched below
  • Student model: Compressed version for faster inference
  • Not used in final submission but valuable for deployment scenarios
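
A minimal PyTorch sketch of such a dual-head teacher; layer sizes and head definitions are illustrative assumptions, not the team's exact architecture:

import torch.nn as nn

class Teacher(nn.Module):
    # Shared trunk feeding two heads: revenue regression and buyer classification
    def __init__(self, n_features, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.reg_head = nn.Linear(hidden, 1)  # predicts log1p(revenue)
        self.cls_head = nn.Linear(hidden, 1)  # logit for "user makes any purchase"

    def forward(self, x):
        h = self.trunk(x)
        return self.reg_head(h).squeeze(-1), self.cls_head(h).squeeze(-1)

# A smaller student network would then be trained to match the teacher's
# outputs (e.g. MSE against the teacher's regression head) for cheaper inference.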

⚙️ Requirements

pandas>=1.3.0
numpy>=1.21.0
dask>=2021.10.0
xgboost>=1.6.0
scikit-learn>=1.0.0
pyarrow>=6.0.0

👥 Team

Project developed for the Data-Adds datathon competition.

📝 License

This project was developed for educational and competition purposes.
