A machine learning pipeline to predict NIRF (National Institutional Ranking Framework) scores/ranks of Indian institutions using historical data (2017–2024), OCR-extracted parameters from images, fuzzy matching, and XGBoost models. Includes SHAP and permutation importance for explainability.
- End-to-End Pipeline: Automated data scraping, cleaning, fuzzy matching, and multi-year merging for consistent analytics.
- Predictive Modeling: XGBoost regression trained on NIRF parameters to estimate scores and ranks with high accuracy.
- Explainability: SHAP & permutation importance visualizations to interpret feature contributions to rankings.
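The score-to-rank step can be sketched as follows. This is a minimal illustration with made-up scores and a hypothetical helper; the repo's actual logic lives in `scripts/nirf_rank_prediction.py`:

```python
def scores_to_ranks(scores):
    """Convert predicted NIRF scores to ranks: the highest score gets rank 1.

    `scores` maps institution name -> predicted score (illustrative data).
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name: rank for rank, (name, _) in enumerate(ordered, start=1)}

# Made-up predicted scores, not real NIRF output.
predicted = {"Inst A": 82.1, "Inst B": 90.4, "Inst C": 75.0}
print(scores_to_ranks(predicted))  # {'Inst B': 1, 'Inst A': 2, 'Inst C': 3}
```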
- Scrape NIRF HTML tables and parse ranking tables (`scripts/scraper.py` + `scripts/parser.py`)
- Download parameter images and extract sub-parameters via OCR (`scripts/img_download.py` + `scripts/image_data_extract.py`)
- Merge OCR-derived parameter scores into yearly CSVs (`scripts/merge_parameter_scores.py`)
- Consolidate multi-year data and create a combined CSV (`scripts/main.py`)
- Train an XGBoost model and predict 2024 scores/ranks (`scripts/nirf_rank_prediction.py`)
- Explain model predictions using SHAP and permutation importance (`scripts/nirf_shap_and_permutation.py`)
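The fuzzy-matching step aligns institution names that differ across years and OCR output. The pipeline uses FuzzyWuzzy; the dependency-free sketch below uses the standard library's `difflib` instead, with an illustrative cutoff:

```python
from difflib import get_close_matches

def match_institution(name, canonical_names, cutoff=0.85):
    """Map a scraped/OCR'd institution name to its canonical form.

    The repo uses FuzzyWuzzy for this; difflib stands in here so the
    sketch has no third-party dependency. `cutoff` is an assumed threshold.
    """
    hits = get_close_matches(name, canonical_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

canon = ["Indian Institute of Technology Madras", "Indian Institute of Science"]
print(match_institution("Indian Inst. of Technology Madras", canon))
```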
Important execution order:

```bash
python scripts/img_download.py               # download images, extract with image_data_extract
python scripts/main.py                       # scrape/merge/generate combined CSV (calls merge_parameter_scores)
python scripts/nirf_rank_prediction.py       # train & predict
python scripts/nirf_shap_and_permutation.py  # SHAP & permutation analysis
```
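A small driver script (hypothetical, not part of the repo) could chain these stages in order and stop at the first failure:

```python
import subprocess
import sys

# Pipeline stages, in the order they must run.
STAGES = [
    "scripts/img_download.py",
    "scripts/main.py",
    "scripts/nirf_rank_prediction.py",
    "scripts/nirf_shap_and_permutation.py",
]

def run_pipeline(stages=STAGES):
    """Run each stage with the current interpreter; raise on the first failure."""
    completed = []
    for script in stages:
        print(f"running {script}")
        subprocess.run([sys.executable, script], check=True)
        completed.append(script)
    return completed

# Inside the repo you would call:
# run_pipeline()
```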
```text
nirf-rank-predictor/
├── README.md
├── requirements.txt
├── LICENSE
├── .gitignore
├── data/        # small sample dataset (NOT full/raw data)
├── scripts/     # all Python scripts (scraper, parser, OCR, models)
├── outputs/     # generated outputs (predictions, SHAP figures)
└── notebooks/   # optional analysis notebooks
```
```bash
# 1) Clone the repo (or use your local folder)
git clone https://github.com/Amang2711/Nirf-Rank-Predictor.git
cd Nirf-Rank-Predictor

# 2) Create a virtual environment & install dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 3) System dependencies (for OCR)
# Ubuntu / Debian:
sudo apt-get install tesseract-ocr libtesseract-dev
# macOS (Homebrew):
brew install tesseract

# 4) Run the pipeline (order matters):
python scripts/img_download.py
python scripts/main.py
python scripts/nirf_rank_prediction.py
python scripts/nirf_shap_and_permutation.py
```

- Languages: Python
- Libraries: Pandas, XGBoost, SHAP, scikit-learn, Matplotlib
- Data Processing: OCR (pytesseract), FuzzyWuzzy for name matching
- Data Source: Official NIRF website (2017–2024)
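The merge of OCR-derived parameter scores into the yearly ranking tables (the job of `scripts/merge_parameter_scores.py`) can be sketched with pandas. The column names below are hypothetical; the CSVs the scripts actually produce may differ:

```python
import pandas as pd

# Illustrative stand-ins for a yearly ranking CSV and its OCR parameter scores.
rankings = pd.DataFrame({
    "institution": ["Inst A", "Inst B"],
    "year": [2023, 2023],
    "score": [82.1, 75.0],
})
ocr_params = pd.DataFrame({
    "institution": ["Inst A", "Inst B"],
    "year": [2023, 2023],
    "TLR": [70.2, 61.5],   # Teaching, Learning & Resources sub-score
})

# Left-join so every ranked institution is kept even if OCR missed it.
merged = rankings.merge(ocr_params, on=["institution", "year"], how="left")
print(merged)
```

A left join (rather than inner) keeps rows whose parameter images failed OCR, leaving their sub-scores as NaN for later inspection.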
- Add only sample data to the repo. Do not push large CSVs, downloaded images, or outputs — .gitignore already handles this.
- For pushing to GitHub via HTTPS you'll need a Personal Access Token (PAT) instead of a password; or configure SSH keys.
- Tweak `scripts/nirf_rank_prediction.py` to change model hyperparameters or training years.
- `scripts/image_data_extract.py` contains OCR heuristics for edge cases; keep them if they work for your dataset.
- For future runs, update the target year in each script: the current pipeline only covers data through 2024.
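Since the year currently has to be edited in every file, one option is to centralize it. The names and path convention below are illustrative, not taken from the repo:

```python
# Hypothetical refactor: keep the year range in one place instead of
# editing each script separately.
TRAIN_YEARS = range(2017, 2024)   # years used for training (2017-2023)
PREDICT_YEAR = 2024               # year whose scores/ranks are predicted

def yearly_csv_path(year):
    """Assumed naming convention for the per-year CSVs."""
    return f"data/nirf_{year}.csv"

print([yearly_csv_path(y) for y in TRAIN_YEARS])
```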
This project is licensed under the MIT License — see the LICENSE file for details.