This repository contains the analysis pipeline for the SenseWhy study, focusing on predicting, clustering, and visualizing patterns of overeating behavior using sensor features and EMA (Ecological Momentary Assessment) data.
The workflow integrates:
- Classification models (XGBoost, SVM, Naive Bayes, MLP) to predict overeating events.
- Representation learning + clustering (MLP hidden layers → UMAP → KMeans) to identify latent overeating clusters.
- Post-hoc analysis (z-score polar plots) to interpret principal indicators driving overeating clusters.
SenseWhy/
│
├── classification/ # Predictive modeling pipeline
│ └── overeating_sensor_ema_model_eval.ipynb
│
├── clustering/ # Representation learning + UMAP + clustering
│ └── sensewhy_overeating_umap_clusters.ipynb
│
├── posthoc/ # Z-score analysis & polar plots
│ └── overeating_clusters_zscore_posthoc_analysis.ipynb
│
├── data/ (not tracked) # Place input CSV/XLSX files here
│ ├── overeating_sensor_features.csv
│ ├── overeating_ema_features.csv
│ └── ema_posthoc.xlsx
│
└── README.md

Notebook: classification/overeating_sensor_ema_model_eval.ipynb
- Input:
  - overeating_sensor_features.csv
  - overeating_ema_features.csv
- Models implemented:
  - XGBoost (Bayesian optimization for hyperparameters)
  - Support Vector Machine (GridSearchCV tuned)
  - Logistic Regression, Naive Bayes (baseline)
- Evaluation metrics:
  - 5-fold Stratified CV
  - ROC (AUROC ± SD)
  - Precision-Recall (AUPRC ± SD)
  - Calibration curve + Brier score
- Outputs:
  - Comparative plots of classifiers (result/val_test_*.pdf/png)
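The evaluation protocol listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper name `cross_validate_metrics`, the synthetic data from `make_classification`, and the logistic-regression baseline are all assumptions chosen so the example runs without the study data. AUPRC is approximated here with scikit-learn's `average_precision_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validate_metrics(model, X, y, n_splits=5, seed=0):
    """Hypothetical helper: return {metric: (mean, sd)} over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"auroc": [], "auprc": [], "brier": []}
    for train_idx, test_idx in cv.split(X, y):
        # Fit a fresh copy of the model on each training fold.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        prob = fitted.predict_proba(X[test_idx])[:, 1]
        scores["auroc"].append(roc_auc_score(y[test_idx], prob))
        scores["auprc"].append(average_precision_score(y[test_idx], prob))
        scores["brier"].append(brier_score_loss(y[test_idx], prob))
    return {name: (float(np.mean(v)), float(np.std(v))) for name, v in scores.items()}


# Synthetic, imbalanced stand-in for the sensor/EMA feature tables (assumption).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5,
    n_clusters_per_class=1, weights=[0.8, 0.2], random_state=0,
)
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
results = cross_validate_metrics(baseline, X, y)
for name, (mean, sd) in results.items():
    print(f"{name}: {mean:.3f} ± {sd:.3f}")
```

Any estimator that exposes `predict_proba` can be passed in place of the baseline, so the same loop could be reused for each classifier being compared.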
Notebook: clustering/sensewhy_overeating_umap_clusters.ipynb
- Steps:
  - Train an MLP on EMA features with 10-fold CV (hidden layers: (200, 100, 50, 25, 5)).
  - Extract hidden representations from the penultimate layer (Dim_1–Dim_5).
  - Project embeddings into 2D using UMAP.
  - Cluster UMAP embeddings using KMeans (default: 30 clusters).
  - Cluster labels stored in res_df_nnout_umap2d_clustered.csv.
- Outputs:
  - 2D UMAP embeddings colored by cluster.
  - Clustered DataFrame with overeating labels + cluster IDs.
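The steps above can be sketched as follows, under several stated assumptions. The data is synthetic, the MLP is fit once on all samples rather than with 10-fold CV, and the helper `last_hidden_activations` is hypothetical (it recomputes the forward pass from scikit-learn's fitted `coefs_` and `intercepts_`, assuming ReLU activations). To keep the example runnable without umap-learn, PCA is used as a stand-in for the 2D projection; it is not the method described in this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def last_hidden_activations(mlp, X):
    """Hypothetical helper: forward X through all hidden layers of a fitted
    MLPClassifier (ReLU assumed) and return the last hidden layer's output."""
    a = X
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)
    return a


# Synthetic stand-in for the EMA feature table (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=0)
X = StandardScaler().fit_transform(X)

mlp = MLPClassifier(
    hidden_layer_sizes=(200, 100, 50, 25, 5),
    activation="relu", max_iter=300, random_state=0,
)
mlp.fit(X, y)

# Five-dimensional representation, corresponding to Dim_1..Dim_5.
embeddings = last_hidden_activations(mlp, X)

# Stand-in 2D projection; swap in UMAP for the pipeline described above.
reducer = PCA(n_components=2, random_state=0)
coords_2d = reducer.fit_transform(embeddings)

# KMeans on the 2D coordinates, using the default of 30 clusters.
labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(coords_2d)
print(embeddings.shape, coords_2d.shape, len(set(labels)))
```

With umap-learn installed, `umap.UMAP(n_components=2, random_state=0)` exposes the same `fit_transform` interface and can replace the PCA reducer.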
Notebook: posthoc/overeating_clusters_zscore_posthoc_analysis.ipynb
- Objective: Interpret overeating clusters using z-scores of EMA features to identify principal indicators.
- Steps:
  - Compute z-scores for each feature across clusters.
  - Identify features with |Z-score| ≥ 1.
  - Visualize per-cluster feature profiles using pyCirclize polar plots.
  - Add annotations & thresholds (Z = ±1).
- Outputs:
  - polarplot_updated_10_21.pdf → circos-style visualization of cluster-specific principal indicators.
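One plausible reading of the z-score steps is sketched below. It assumes that a cluster's z-score for a feature is the cluster mean minus the overall mean, divided by the overall standard deviation; the notebook may define it differently. The function names, the `cluster` column name, and the toy features `stress` and `hunger` are hypothetical.

```python
import pandas as pd


def cluster_zscores(df, feature_cols, cluster_col="cluster"):
    """Z-score of each cluster's feature mean relative to the whole sample
    (assumed definition; population SD, ddof=0)."""
    overall_mean = df[feature_cols].mean()
    overall_std = df[feature_cols].std(ddof=0)
    cluster_means = df.groupby(cluster_col)[feature_cols].mean()
    return (cluster_means - overall_mean) / overall_std


def principal_indicators(z, threshold=1.0):
    """Map each cluster ID to the features whose |z| meets the threshold."""
    return {
        int(c): [f for f in z.columns if abs(z.loc[c, f]) >= threshold]
        for c in z.index
    }


# Toy data with made-up feature names, for illustration only.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "stress": [5, 5, 5, 1, 1, 1],
    "hunger": [3, 4, 5, 3, 4, 5],
})
z = cluster_zscores(df, ["stress", "hunger"])
indicators = principal_indicators(z, threshold=1.0)
print(z)
print(indicators)
```

On this toy data, `stress` is flagged for both clusters (z = +1 and z = -1), while `hunger` is not (z = 0 in both).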
Clone this repository and install dependencies:
git clone https://github.com/HAbitsLab/SenseWhy.git
cd SenseWhy
This project was developed and tested on Python 3.9+.
Core dependencies:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- imbalanced-learn
- xgboost
- bayesian-optimization
- umap-learn
- pyCirclize
- joblib
Install all dependencies with:
pip install -r requirements.txt
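After installation, a quick check like the one below can confirm that each core dependency is importable. This is a small sketch and not part of the repository: the helper name `missing_packages` is hypothetical, and the mapping reflects the usual import names for these packages, several of which differ from the PyPI name.

```python
from importlib.util import find_spec

# PyPI package name -> top-level module name used in import statements.
IMPORT_NAMES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "xgboost": "xgboost",
    "bayesian-optimization": "bayes_opt",
    "umap-learn": "umap",
    "pyCirclize": "pycirclize",
    "joblib": "joblib",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the package names whose module cannot be located."""
    return [pkg for pkg, module in import_names.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core dependencies are importable.")
```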
- Run classification models (XGBoost, SVM, Naive Bayes, Logistic Regression):
  jupyter notebook classification/overeating_sensor_ema_model_eval.ipynb
- Run UMAP + clustering (MLP hidden representations → UMAP → KMeans):
  jupyter notebook clustering/sensewhy_overeating_umap_clusters.ipynb
- Run post-hoc z-score analysis (polar plots of principal indicators):
  jupyter notebook posthoc/overeating_clusters_zscore_posthoc_analysis.ipynb

If you use this code, please cite both the SenseWhy paper and the underlying tools/packages.
Shahabi F, Wei B, Romano C, McCloskey R, Lin AW, Pedram M, Schauer J, Stump T, Alshurafa N.
Unveiling overeating patterns within digital longitudinal data on eating behaviors and contexts.
npj Digital Medicine 8, 567 (2025).
BibTeX:
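@article{shahabi2025sensewhy,
  author  = {Shahabi, F. and Wei, B. and Romano, C. and McCloskey, R. and Lin, A. W. and Pedram, M. and Schauer, J. and Stump, T. and Alshurafa, N.},
  title   = {Unveiling overeating patterns within digital longitudinal data on eating behaviors and contexts},
  journal = {npj Digital Medicine},
  volume  = {8},
  pages   = {567},
  year    = {2025},
  doi     = {10.1038/s41746-025-01698-9}
}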
- pyCirclize
  moshi4. (2025, August 23). pyCirclize (Version 1.10.0) [Computer software].
  GitHub repository: https://github.com/moshi4/pyCirclize
- UMAP
  McInnes, L., Healy, J., & Melville, J. (2018).
  UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
  arXiv preprint: arXiv:1802.03426
- XGBoost
  Chen, T., & Guestrin, C. (2016).
  XGBoost: A scalable tree boosting system.
  In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
  https://doi.org/10.1145/2939672.2939785

This project is released under the MIT License.
See the LICENSE file for full details.