Amang2711/NIRF-Rank-Predictor

NIRF Rank Predictor & Analysis

Machine learning pipeline to predict NIRF (National Institutional Ranking Framework) scores/ranks of Indian institutions using historical data (2017–2024), OCR-extracted parameters from images, fuzzy matching and XGBoost models. Includes SHAP and permutation importance for explainability.

Key Highlights

  • End-to-End Pipeline: Automated data scraping, cleaning, fuzzy matching, and multi-year merging for consistent analytics.
  • Predictive Modeling: XGBoost regression trained on NIRF parameters to estimate scores and ranks.
  • Explainability: SHAP & permutation importance visualizations to interpret feature contributions to rankings.
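To make the explainability bullet concrete, here is a minimal from-scratch sketch of permutation importance: shuffle one feature column at a time and measure how much the error grows. The model (`toy_model`), the data, and all function names are hypothetical stand-ins, not code from this repository. scikit-learn ships a ready-made `permutation_importance` in `sklearn.inspection`; whether the repository's script uses it is not stated here.

```python
import random


def mean_abs_error(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_abs_error(targets, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving the others untouched.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            score = mean_abs_error(targets, [predict(r) for r in permuted])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in model: depends on feature 0, ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [toy_model(r) for r in rows]
imp = permutation_importance(toy_model, rows, targets)
print(imp)  # feature 0 gets a positive importance, feature 1 gets 0.0
```

A feature the model never reads keeps an importance of exactly zero, which is what makes this a useful sanity check alongside SHAP plots.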

📊 Project Overview

  • Scrape NIRF HTML tables and parse ranking tables (scripts/scraper.py + scripts/parser.py)
  • Download parameter images and extract sub-parameters via OCR (scripts/img_download.py + scripts/image_data_extract.py)
  • Merge OCR-derived parameter scores into yearly CSVs (scripts/merge_parameter_scores.py)
  • Consolidate multi-year data and create a combined CSV (scripts/main.py)
  • Train an XGBoost model and predict 2024 scores/ranks (scripts/nirf_rank_prediction.py)
  • Explain model predictions using SHAP and permutation importance (scripts/nirf_shap_and_permutation.py)
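The prediction step above yields scores and ranks. As a rough illustration of how ranks can follow from predicted scores, the sketch below sorts institutions by score and assigns rank 1 to the highest. The function name, the sample values, and the tie rule (tied scores share a rank, the next rank is skipped) are all assumptions for illustration; the actual logic lives in scripts/nirf_rank_prediction.py and may differ.

```python
def ranks_from_scores(scores):
    """Map institution -> rank, with the highest score ranked 1.

    Assumption: ties share a rank (competition ranking: 1, 2, 2, 4).
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Made-up scores, for illustration only.
predicted = {
    "Institute A": 81.2,
    "Institute B": 76.5,
    "Institute C": 76.5,
    "Institute D": 60.0,
}
ranks = ranks_from_scores(predicted)
print(ranks)
```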

Execution order (the order matters):

  1. python scripts/img_download.py # download images, extract with image_data_extract
  2. python scripts/main.py # scrape/merge/generate combined CSV (calls merge_parameter_scores)
  3. python scripts/nirf_rank_prediction.py # train & predict
  4. python scripts/nirf_shap_and_permutation.py # SHAP & permutation analysis
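Since the order matters, the four steps can be wrapped in a small runner that stops at the first failure. This is a sketch only: the runner and its function names (`build_commands`, `run_pipeline`) are hypothetical and not part of the repository. The script paths are the ones listed above.

```python
import subprocess
import sys

# Script paths in the order given in this README.
PIPELINE = [
    "scripts/img_download.py",
    "scripts/main.py",
    "scripts/nirf_rank_prediction.py",
    "scripts/nirf_shap_and_permutation.py",
]


def build_commands(python=sys.executable):
    """Return one command per pipeline step, using the current interpreter."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(runner=subprocess.run):
    """Run each step in order; `runner` is injectable to allow dry runs."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later steps never run on stale inputs.
        runner(cmd, check=True)


# Call run_pipeline() from the repository root to execute all four steps.
```

Using `sys.executable` keeps every step inside the active virtual environment instead of whichever `python` happens to be first on the PATH.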

📂 Project Structure

nirf-rank-predictor/
├── README.md
├── requirements.txt
├── LICENSE
├── .gitignore
├── data/                  # small sample dataset (NOT full/raw data)
├── scripts/               # all python scripts (scraper, parser, OCR, models)
├── outputs/               # generated outputs (predictions, shap figures)
└── notebooks/             # optional analysis notebooks

Quick setup (Linux / macOS / WSL)

# 1) Clone the repository (or work from an existing local copy)
git clone https://github.com/Amang2711/Nirf-Rank-Predictor.git
cd Nirf-Rank-Predictor

# 2) Create virtual env & install
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 3) System dependencies (for OCR)
# Ubuntu / Debian:
sudo apt-get install tesseract-ocr libtesseract-dev
# macOS (brew):
brew install tesseract

# 4) Run the pipeline (order matters):
python scripts/img_download.py
python scripts/main.py
python scripts/nirf_rank_prediction.py
python scripts/nirf_shap_and_permutation.py

🛠️ Tech Stack

  • Languages: Python
  • Libraries: Pandas, XGBoost, SHAP, scikit-learn, Matplotlib
  • Data Processing: OCR (pytesseract), FuzzyWuzzy for name matching
  • Data Source: Official NIRF website (2017–2024)
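Name matching is needed because the same institution can be spelled slightly differently from year to year. The sketch below illustrates the idea with `difflib` from the standard library as a stand-in, so it runs without extra installs; the project itself lists FuzzyWuzzy for this job. The helper names, the 0.85 cutoff, and the sample names are assumptions for illustration.

```python
import difflib


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def best_match(name, candidates, cutoff=0.85):
    """Return the candidate most similar to `name`, or None if below cutoff."""
    normalized = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize(name), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Made-up institution names, for illustration only.
previous_year = [
    "Example Institute of Technology Madras",
    "Example Institute of Science",
]
print(best_match("Example Institute of Technology, Madras", previous_year))
print(best_match("Some Unrelated College", previous_year))
```

Returning `None` below the cutoff is deliberate: an unmatched name is easier to review by hand than a wrong match silently merged into the multi-year data.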

Notes & tips

  • Add only sample data to the repo. Do not push large CSVs, downloaded images, or outputs — .gitignore already handles this.
  • For pushing to GitHub via HTTPS you'll need a Personal Access Token (PAT) instead of a password; or configure SSH keys.
  • Tweak scripts/nirf_rank_prediction.py if you want to change model hyperparameters or training years.
  • scripts/image_data_extract.py contains OCR heuristics for edge cases; leave them in place if they work for your dataset.
  • To run the pipeline for later years, update the year values in each script. The current code only covers data up to 2024.
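One way to avoid editing the year in every file is to keep it in a single shared module that the scripts import. The sketch below is a suggestion, not how the repository is organised today: the constant names, the helper functions, and in particular the `nirf_<year>.csv` naming scheme are hypothetical.

```python
# Hypothetical shared settings module, e.g. scripts/config.py.
FIRST_YEAR = 2017
LAST_YEAR = 2024  # bump this when data for a new NIRF edition is added


def data_years(first=FIRST_YEAR, last=LAST_YEAR):
    """All years covered by the pipeline, inclusive of both ends."""
    return list(range(first, last + 1))


def yearly_csv_path(year, data_dir="data"):
    """Path of one yearly CSV (assumed naming scheme, adjust to your files)."""
    return f"{data_dir}/nirf_{year}.csv"


for year in data_years():
    print(yearly_csv_path(year))
```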

License

This project is licensed under the MIT License — see the LICENSE file for details.

About

NIRF Rank Predictor: web scraping, OCR, model training with XGBoost, and feature-importance analysis using SHAP and permutation importance.
