A machine learning solution for predicting 7-day In-App Purchase (IAP) revenue using advanced gradient boosting models. This project was developed for a datathon competition focused on user monetization prediction in mobile advertising.
The goal is to predict `iap_revenue_d7` (7-day IAP revenue) for mobile app users based on:
- User behavior features (session activity, retention, engagement)
- Device information (make, model, OS)
- Advertiser data (bundle, category, taxonomy)
- Historical purchase patterns
- Temporal features (time of day, weekday patterns)
- Geographic data (country, region)
Final Performance: MSLE of 0.160197 on the validation set
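For reference, validation MSLE can be computed with scikit-learn (a sketch; `y_val` and `pred_revenue` stand for the raw-dollar targets and predictions):

```python
from sklearn.metrics import mean_squared_log_error

# MSLE compares log1p of targets and predictions, so both must be >= 0
msle = mean_squared_log_error(y_val, pred_revenue)
print(f"Validation MSLE: {msle:.6f}")
```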
The final solution uses XGBoost (Extreme Gradient Boosting) as the primary model. Its strong results come from:

- Optimized Hyperparameters:
  - Learning rate: 0.01
  - Max depth: 8
  - Min child weight: 20
  - Subsample: 0.7
  - Column sample by tree: 0.7
  - Tree method: histogram-based (`hist`)
  - Early stopping: 100 rounds
- Target Transformation: `np.log1p()` applied to the target to handle the skewed revenue distribution
- Training Strategy:
  - 2000 boost rounds with early stopping
  - RMSE as the evaluation metric
  - ~1215 iterations to convergence
- Data Loading (see the Dask sketch below):
  - Efficient parquet file loading with Dask
  - Selective column loading (62 most relevant features)
  - Sampling strategy: 10% of the training data for faster iteration
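A minimal sketch of this loading strategy, assuming the parquet files live under `train/` and using a hypothetical subset of the 62 selected columns (the full list comes from the notebooks):

```python
import dask.dataframe as dd

# Hypothetical subset of the 62 selected columns
COLUMNS = ["datetime", "country", "dev_make", "avg_daily_sessions", "iap_revenue_d7"]

# Lazily read only the needed columns from the parquet files
ddf = dd.read_parquet("train/", columns=COLUMNS)

# Sample 10% of the rows for faster iteration, then materialize as pandas
df = ddf.sample(frac=0.10, random_state=42).compute()
```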
Numerical Features:
- List-type columns aggregated by summing their tuple values (see the sketch below): `iap_revenue_usd_bundle`, `num_buys_bundle`, `rwd_prank`, `whale_users_bundle_num_buys_prank`, `whale_users_bundle_revenue_prank`
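A minimal sketch of that aggregation, assuming each cell holds a list/tuple of numbers (toy data; the real columns come from the parquet files):

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({"num_buys_bundle": [(1, 2, 3), (4,), None]})

def sum_list_cell(value):
    # Collapse a list/tuple cell into a single scalar; pass scalars/NaN through
    if isinstance(value, (list, tuple)):
        return sum(value)
    return value

df["num_buys_bundle"] = df["num_buys_bundle"].apply(sum_list_cell)
```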
Categorical Features (Label Encoding):
`advertiser_bundle`, `advertiser_category`, `advertiser_subcategory`, `country`, `region`, `dev_make`, `dev_model`, `dev_os`, `dev_osv`, `carrier`
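A minimal sketch of the encoding step with scikit-learn's `LabelEncoder` (`df` as loaded above; an illustrative subset of the columns):

```python
from sklearn.preprocessing import LabelEncoder

CAT_COLS = ["advertiser_bundle", "country", "dev_make"]  # illustrative subset

for col in CAT_COLS:
    le = LabelEncoder()
    # Cast to str first so NaN and mixed types get a stable encoding
    df[col] = le.fit_transform(df[col].astype(str))
```

In practice the encoder should be fit on train and test together (or its mapping reused) so both sides share the same integer codes.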
Temporal Features:
`hour`, `weekday`, `weekend_ratio`, `hour_ratio`
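`hour` and `weekday` can be derived straight from the `datetime` column, as sketched below; the exact definitions of `weekend_ratio` and `hour_ratio` live in the notebooks:

```python
import pandas as pd

df["datetime"] = pd.to_datetime(df["datetime"])
df["hour"] = df["datetime"].dt.hour           # 0-23
df["weekday"] = df["datetime"].dt.dayofweek   # 0 = Monday .. 6 = Sunday
```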
User Behavior Features:
`avg_act_days`, `avg_daily_sessions`, `avg_days_ins`, `avg_duration`, `weeks_since_first_seen`, `wifi_ratio`, `retentiond7`
- Handle list/tuple columns by summing values
- Label encoding for categorical variables
- Fill missing values with 0
- Convert to categorical dtype for XGBoost optimization
- Log1p transformation on target variable
- Time-based split on the `datetime` column, ensuring temporal consistency (no data leakage); see the sketch below
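A minimal sketch of the fill, dtype conversion, target transform, and time-based split; the 80/20 cutoff is an assumption for illustration (the actual split point is set in the notebooks):

```python
import numpy as np

# Fill missing numeric values with 0
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(0)

# Mark categoricals so XGBoost's native categorical handling can use them
for col in ["country", "dev_make"]:  # illustrative subset
    df[col] = df[col].astype("category")

# Log1p-transform the target to tame its heavy right tail
df["target_log"] = np.log1p(df["iap_revenue_d7"])

# Time-based split: train on the earliest 80%, validate on the most recent 20%
cutoff = df["datetime"].quantile(0.8)
train_df, val_df = df[df["datetime"] <= cutoff], df[df["datetime"] > cutoff]
```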
```
datathon/
├── bestprepro.ipynb              # LightGBM preprocessing experiments
├── sergi/
│   ├── GXBOOST.ipynb             # Final XGBoost implementation ⭐
│   ├── GXBOOST copy.ipynb        # XGBoost variant
│   └── ...
├── submission_xgboost_fast2.csv  # Final submission file
├── train/                        # Training data (parquet files)
└── test/                         # Test data (parquet files)
```
- XGBoost: Primary gradient boosting framework
- Pandas & Dask: Data manipulation and distributed computing
- NumPy: Numerical operations
- Scikit-learn: Preprocessing and evaluation metrics
- PyArrow: Fast parquet file reading
| Metric | Value |
|---|---|
| Validation MSLE | 0.160197 |
| Validation RMSE | $330,317.43 |
| Best Iteration | ~1215 |
| Training Time | ~2-3 minutes per fold |
```python
import xgboost as xgb
import numpy as np
import pandas as pd

# Optimized parameters
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth': 8,
    'min_child_weight': 20,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'tree_method': 'hist',
    'device': 'cpu'
}

# Create DMatrix with categorical features
dtrain = xgb.DMatrix(X_train, label=y_train_log, enable_categorical=True)
dval = xgb.DMatrix(X_val, label=y_val_log, enable_categorical=True)

# Train model
model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=100,
    verbose_eval=50
)

# Predict on test set
dtest = xgb.DMatrix(X_test, enable_categorical=True)
pred_log = model.predict(dtest)

# Inverse log transformation (undo log1p, clip negatives to zero)
pred_revenue = np.expm1(pred_log).clip(0, None)

# Create submission
submission = pd.DataFrame({
    'row_id': test_row_ids,
    'iap_revenue_d7': pred_revenue
})
```

Top features contributing to predictions:
- Historical purchase data (`num_buys_bundle`, `iap_revenue_usd_bundle`)
- User engagement metrics (`avg_daily_sessions`, `avg_duration`)
- Retention indicators (`retentiond7`, `weeks_since_first_seen`)
- Device characteristics (`dev_make`, `dev_model`)
- Geographic location (`country`, `region`)
- Temporal patterns (`hour`, `weekday`, `weekend_ratio`)
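One way to inspect such rankings from the trained booster (a sketch; `model` is the booster trained in the code example above):

```python
# Gain-based feature importance from the trained booster
importance = model.get_score(importance_type="gain")
for feat, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{feat}: {score:.1f}")
```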
- Data Quality Over Quantity: Using 10% of data with better feature engineering outperformed larger datasets
- Target Transformation: Log1p transformation crucial for handling heavy-tailed revenue distribution
- Categorical Handling: XGBoost's native categorical support (via `enable_categorical=True`) improved performance
- Early Stopping: Prevented overfitting and reduced training time
- List Aggregation: Properly aggregating list-type features captured more signal than dropping them
LightGBM (alternative gradient boosting model):
- Tested alongside XGBoost
- Similar performance, but XGBoost had the edge on this dataset
- Faster training, but required more hyperparameter tuning
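For reference, a minimal LightGBM training sketch mirroring the XGBoost setup (the parameter values are assumptions carried over from the XGBoost configuration, not tuned LightGBM values):

```python
import lightgbm as lgb

lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth': 8,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
}

train_set = lgb.Dataset(X_train, label=y_train_log)
val_set = lgb.Dataset(X_val, label=y_val_log, reference=train_set)

lgb_model = lgb.train(
    lgb_params,
    train_set,
    num_boost_round=2000,
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(50)],
)
```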
Teacher-student distillation (experimental):
- PyTorch-based neural network approach
- Teacher model: larger network with dual heads (regression + classification)
- Student model: compressed version for faster inference
- Not used in the final submission, but valuable for deployment scenarios
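A compact sketch of the idea under stated assumptions: the layer sizes, the 62-feature input width, and the loss weighting `alpha` are illustrative, not the project's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Teacher(nn.Module):
    """Wider network with two heads: log-revenue regression + buyer classification."""
    def __init__(self, n_features=62):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                  nn.Linear(256, 128), nn.ReLU())
        self.reg_head = nn.Linear(128, 1)  # predicts log1p(revenue)
        self.cls_head = nn.Linear(128, 1)  # predicts logit of P(any purchase)

    def forward(self, x):
        h = self.body(x)
        return self.reg_head(h), self.cls_head(h)

# Student: smaller single-head network for fast inference
student = nn.Sequential(nn.Linear(62, 64), nn.ReLU(), nn.Linear(64, 1))

def distill_loss(student_out, teacher_reg, target, alpha=0.5):
    # Blend the ground-truth loss with imitation of the teacher's regression head
    return (alpha * F.mse_loss(student_out, target)
            + (1 - alpha) * F.mse_loss(student_out, teacher_reg.detach()))
```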
```
pandas>=1.3.0
numpy>=1.21.0
dask>=2021.10.0
xgboost>=1.6.0
scikit-learn>=1.0.0
pyarrow>=6.0.0
```
Developed for the Data-Adds datathon competition, for educational and competition purposes.