View the Oink Oink Project on Devpost
This repository contains our final demand forecasting pipeline for Mango, developed for the Datathon FME 2025.
Goal: predict the optimal production quantity of garments for the next season.
The final implementation is in model.py (cleaned and modularized), with EDA.py and inference.py helpers. The training pipeline produces an ensemble that was used in the original experiments.
```
.
├── data/
│   ├── train.csv     # Historical training data (semicolon-separated)
│   └── test.csv      # Test data for prediction
├── EDA.py            # Exploratory Data Analysis script (plots & visualization)
├── model.py          # Main training pipeline (single-run script)
├── inference.py      # Simple CLI to predict using saved models from a JSON input
└── outputs/          # Output models, artifacts and submissions
```
- **Library imports**
  - pandas, numpy, sklearn, catboost, etc.
- **Global configuration**
  - Paths, PCA parameters, cross-validation settings, ensemble weights
- **Feature engineering**
  - Data cleaning and aggregation
  - Parsing and PCA of image embeddings
  - Aggregated features by family, category, and attributes
  - Logarithmic normalization of numerical features
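The embedding-PCA and log-normalization steps above can be sketched as follows. The toy DataFrame and column names (`embedding`, `num_sales`) are made up for illustration and do not reflect the real dataset schema.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy data standing in for the real semicolon-separated CSVs.
df = pd.DataFrame({
    "embedding": ["0.1,0.2,0.3,0.4", "0.5,0.1,0.0,0.9", "0.2,0.2,0.2,0.2"],
    "num_sales": [10, 1000, 50],
})

# Parse stringified image embeddings into a numeric matrix.
emb = np.array([list(map(float, s.split(","))) for s in df["embedding"]])

# Reduce embedding dimensionality with PCA (n_components is illustrative).
pca = PCA(n_components=2)
emb_pca = pca.fit_transform(emb)
for i in range(emb_pca.shape[1]):
    df[f"emb_pca_{i}"] = emb_pca[:, i]

# Log-normalize a skewed numeric feature.
df["num_sales_log"] = np.log1p(df["num_sales"])
```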
- **Training of finalist models**
  - Model A: alpha=0.78, learning_rate=0.01 (more stable)
  - Model B: alpha=0.75, learning_rate=0.03 (more aggressive)
  - CatBoost with K-Fold CV to select the optimal number of iterations
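The idea of using K-Fold CV to pick the number of boosting iterations can be sketched like this. To keep the example self-contained, scikit-learn's `GradientBoostingRegressor` with a quantile loss stands in for CatBoost (which the project actually uses); the data, alpha, and iteration budget are all illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3 + rng.normal(size=200)

ALPHA, MAX_ITERS = 0.78, 100          # quantile level and iteration budget

def pinball(y_true, y_pred, alpha):
    """Quantile (pinball) loss, the objective a quantile model optimizes."""
    d = y_true - y_pred
    return np.mean(np.maximum(alpha * d, (alpha - 1) * d))

# Accumulate validation loss per iteration count across folds.
fold_losses = np.zeros(MAX_ITERS)
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for tr, va in kf.split(X):
    model = GradientBoostingRegressor(
        loss="quantile", alpha=ALPHA,
        n_estimators=MAX_ITERS, learning_rate=0.03,
    )
    model.fit(X[tr], y[tr])
    # staged_predict yields predictions after each boosting iteration.
    for i, pred in enumerate(model.staged_predict(X[va])):
        fold_losses[i] += pinball(y[va], pred, ALPHA)

best_iters = int(np.argmin(fold_losses)) + 1   # iteration count to retrain with
```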
- **Weighted ensemble**
  - 60% Model A + 40% Model B
  - Inverse log1p transformation to obtain real-scale predictions
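The 60/40 blend and the inverse log1p step amount to a few lines of numpy; the prediction arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical model outputs on the log1p scale.
pred_a_log = np.array([2.0, 3.5, 1.2])   # Model A (alpha=0.78, more stable)
pred_b_log = np.array([2.2, 3.1, 1.0])   # Model B (alpha=0.75, more aggressive)

# Weighted ensemble: 60% Model A + 40% Model B.
blend_log = 0.6 * pred_a_log + 0.4 * pred_b_log

# Invert the log1p transform to get real-scale quantities.
final_pred = np.expm1(blend_log)
```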
- **Submission generation**
  - Writes `submission.csv` at the repository root
  - Writes model artifacts to `outputs/`
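Writing the submission file is a straightforward pandas export; the `ID`/`TARGET` column names follow the output format that `inference.py` also produces, and the values here are placeholders.

```python
import pandas as pd

# Hypothetical final predictions keyed by ID (values are placeholders).
submission = pd.DataFrame({"ID": [1, 2, 3], "TARGET": [7.2, 32.1, 2.3]})

# Write submission.csv at the repository root, without the index column.
submission.to_csv("submission.csv", index=False)
```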
- Python >= 3.9
- pandas
- numpy
- scikit-learn
- catboost
```bash
pip install pandas numpy scikit-learn catboost
```

- Place `train.csv` and `test.csv` in the `data/` folder
- Run the training pipeline (saves models to `outputs/` and writes `submission.csv` at the repo root):

```bash
python model.py
```

- You will get `submission.csv` and model artifacts under `outputs/`.
`EDA.py` generates a set of plots to inspect the training dataset. To execute it (and open the plots), run:

```bash
python EDA.py
```

If running in a headless environment, you may redirect or save each plot; the script prints status messages as it runs.
If you want to test the model on a JSON payload mirroring `data/test.csv`, use `inference.py`:

```bash
python inference.py --input-json sample_input.json --output-json predictions.json
```

This outputs a list of `{ ID, TARGET }` objects. The script reuses the same feature pipeline and needs `data/train.csv` present for the group aggregations.
- Install requirements:

```bash
pip install -r requirements.txt
```

- Place CSVs under `data/` (semicolon-separated): `train.csv`, `test.csv`.
- Train and generate the submission:

```bash
python model.py
```

- The submission is saved as `submission.csv` at the project root.
- The CatBoost model ensemble achieved a score of 55.57900
- Robust feature engineering was more decisive than hyperparameter tuning of complex models
- Combination of image embeddings, categorical attributes, and multi-season historical data was key
- Temporal validation (TimeSeriesSplit) avoided data leakage and enabled generalizable models
- Train our own visual embeddings
- Explore TabNet or LightGBM with automatic tuning
- Add interpretability to the pipeline to understand which attributes generate more demand
- Automate the entire workflow for real production
Team Oink Oink – AI Students UPC, Datathon FME 2025.