This project explores the problem of rank aggregation—combining multiple ranked lists into a single consensus ranking—under conditions of partial or missing data. It implements and evaluates classical algorithms such as:
- Naive Aggregation
- Fagin’s Algorithm
- Threshold Algorithm (TA)
- No Random Access (NRA)
To address inefficiencies in NRA, particularly when the number of desired results $k$ is large, we introduce a variant called NRA w/ Imputer, which imputes missing scores so that the algorithm can stop earlier.
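For readers new to these algorithms, here is a minimal sketch of the textbook Threshold Algorithm on complete data. It is not this repository's implementation: the function name `threshold_algorithm`, the dict-based list format, and the toy scores are assumptions made for illustration. It also counts sorted and random accesses, the two costs these algorithms are usually compared on.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def threshold_algorithm(
    lists: Sequence[Dict[str, float]],
    k: int,
    agg: Callable[[Sequence[float]], float] = lambda xs: sum(xs) / len(xs),
) -> Tuple[List[Tuple[str, float]], int, int]:
    """Textbook TA. Assumes every list scores the same items and agg is monotone."""
    # Sorted access reads each list in descending score order.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen: Dict[str, float] = {}
    top: List[Tuple[str, float]] = []
    sorted_accesses = random_accesses = 0

    for pos in range(len(sorted_lists[0]) if sorted_lists else 0):
        last_seen = []
        for ranked in sorted_lists:
            item, score = ranked[pos]
            sorted_accesses += 1
            last_seen.append(score)
            if item not in seen:
                # Random access: fetch the item's score from every other list.
                random_accesses += len(lists) - 1
                seen[item] = agg([l[item] for l in lists])
        # No unseen item can score above the aggregate of the last-seen scores.
        threshold = agg(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, sorted_accesses, random_accesses


# Hypothetical toy data: two ranked lists over three items.
list_a = {"a": 0.9, "b": 0.8, "c": 0.1}
list_b = {"a": 0.7, "b": 0.9, "c": 0.2}
top, sa, ra = threshold_algorithm([list_a, list_b], k=1)
print(top, sa, ra)
```

NRA differs in that it never performs the random-access step and instead maintains lower and upper bounds on each item's aggregate score.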
To get started, ensure you're using Python 3.8, and follow these steps:
- Create a virtual environment (recommended):
python3.8 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
This installs the libraries needed to run the aggregation experiments and visualizations, including pandas, scikit-learn, matplotlib, and numpy.
You can run any experiment using a YAML config file or by explicitly passing arguments on the command line.
python main.py --config configs/config_toy_fagin.yaml
Example for the NRA w/ Imputer algorithm:
python main.py --config configs/exp_all_algs/config_nra_impute.yaml
This trains the imputers and saves them in the .cache directory.
To pass the arguments explicitly instead of using a config file:
python main.py \
--agg_function_name avg \
--aggregator_name fagin \
--dataset_name toy \
--imputer_name basic \
--p_erase 0.5 \
--k 3 \
--seed 42 \
--output_path outputs/toy_fagin_p_0_5
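As a rough illustration of what an erasure probability such as `--p_erase` could mean, the sketch below hides each score independently with probability `p_erase`, using a seeded generator so that runs are reproducible. This interpretation, the helper name `erase_scores`, and the toy scores are assumptions; the repository's actual masking logic may differ.

```python
import random
from typing import Dict, List, Optional


def erase_scores(
    scores: Dict[str, List[float]], p_erase: float, seed: int
) -> Dict[str, List[Optional[float]]]:
    """Replace each score with None independently with probability p_erase."""
    rng = random.Random(seed)  # seeded so the same mask is drawn on every run
    return {
        item: [None if rng.random() < p_erase else s for s in row]
        for item, row in scores.items()
    }


# Hypothetical toy scores: one row per item, one column per ranked list.
toy = {"a": [0.9, 0.7], "b": [0.8, 0.9], "c": [0.1, 0.2]}
masked = erase_scores(toy, p_erase=0.5, seed=42)
print(masked)
```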
💡 For a full list of supported arguments and defaults, refer to config.py, which defines all experiment options using a structured dataclass.
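The dataclass-driven pattern can be sketched as follows. This is not the contents of config.py: the class name `ExperimentConfig`, the helper `parse_config`, and the defaults (copied from the example command above) are assumptions for illustration only.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExperimentConfig:  # hypothetical name; field names mirror the CLI flags
    agg_function_name: str = "avg"
    aggregator_name: str = "fagin"
    dataset_name: str = "toy"
    imputer_name: str = "basic"
    p_erase: float = 0.5
    k: int = 3
    seed: int = 42
    output_path: str = "outputs/toy_fagin_p_0_5"


def parse_config(argv: Optional[List[str]] = None) -> ExperimentConfig:
    """Build one --flag per dataclass field, typed after the field's default."""
    parser = argparse.ArgumentParser()
    for f in fields(ExperimentConfig):
        parser.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
    return ExperimentConfig(**vars(parser.parse_args(argv)))


# Override two options and keep the remaining defaults.
cfg = parse_config(["--k", "5", "--p_erase", "0.25"])
print(cfg)
```

Keeping the options in a single dataclass means the defaults, the types, and the command-line flags stay in sync.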
After running an experiment, results will be saved inside the folder specified by output_path. You will find:
- results.csv: Contains the number of sorted accesses and random accesses for each algorithm and value of $k$.
- metrics.csv: Includes evaluation metrics such as set accuracy (correct top-$k$ elements) and exact match rate (correct order among top-$k$).
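One plausible reading of the two metrics is sketched below. The exact definitions used by the project are not stated here, so treat these as assumptions: set accuracy is taken as the fraction of the true top-$k$ items that appear in the predicted top-$k$, and an exact match requires the same items in the same order. The function names are hypothetical.

```python
from typing import Sequence


def set_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of the true top-k items recovered, ignoring order."""
    if not truth:
        raise ValueError("truth must contain at least one item")
    return len(set(predicted) & set(truth)) / len(truth)


def exact_match(predicted: Sequence[str], truth: Sequence[str]) -> bool:
    """True only if the predicted top-k equals the true top-k, in order."""
    return list(predicted) == list(truth)


# Hypothetical top-3 lists: two of three items overlap, and the order differs.
truth = ["a", "c", "d"]
predicted = ["a", "b", "c"]
acc = set_accuracy(predicted, truth)
match = exact_match(predicted, truth)
print(acc, match)
```

Averaging `exact_match` over repeated runs would give an exact match rate.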
To generate plots of the data distribution, run:
python data/visualize.py
The dataset is taken from https://www.kaggle.com/datasets/gregorut/videogamesales/data.
The visualizations are saved in the outputs/data directory.
If you find this work useful, please consider citing:
@misc{yehezkel2025mlrankagg,
title={Rank Aggregation with ML-Based Imputation},
author={Shai Yehezkel},
year={2025},
howpublished={\url{https://github.com/kariander1/RankAggML}},
note={Accessed: 2025-04-21}
}