rasyosef/amharic-neural-ir

AmharicIR: A Unified Resource for Amharic Neural Retrieval

This repository accompanies the SIGIR 2026 paper “AmharicIR: A Unified Resource for Amharic Neural Retrieval Models and Benchmarks.” It provides notebook-based training and evaluation workflows for dense retrieval, late interaction (ColBERT-style), sparse retrieval (SPLADE-style), and cross-encoder reranking in Amharic.

Core artifacts

  • Benchmark: Amharic Passage Retrieval Dataset V2 with a fixed 90/10 train–test split (68,000 query–passage pairs).
  • Model suite: Amharic-specific checkpoints spanning dense bi-encoders, late-interaction (ColBERT-style), learned sparse retrievers (SPLADE-style), and cross-encoder rerankers.
  • Workflows: notebook implementations for preprocessing, training, and evaluation.

Hugging Face resources

Models used in current notebooks (examples)

Notebook-first workflow

This codebase is organized primarily as Jupyter notebooks (rather than standalone .py scripts). The goal is to keep the full pipeline easy to follow and modify step-by-step, especially for practitioners. Because the dataset used in these workflows is relatively small, we keep the main experiments and analysis in notebook format for clarity and quick iteration.

Practical notes

  • Run notebooks from the repository root so relative paths resolve correctly.
  • Each notebook is intended to be runnable end-to-end, in the order described below.
  • If you prefer scripts, you can export notebooks with:
    • jupyter nbconvert --to script <notebook-path>.ipynb

Quickstart

Recommended (Conda, GPU-friendly)

Create a conda environment from amharicir-environment.yml:

conda env create -f amharicir-environment.yml
conda activate amharicir
jupyter lab

Then open one of:

  • evaluation/evaluate-amharic-embedding-passage-retrieval.ipynb
  • evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb
  • evaluation/evaluate-amharic-splade-passage-retrieval.ipynb
  • evaluation/evaluate-amharic-rerankers-passage-retrieval.ipynb

Optional (venv + requirements.txt, pip-only)

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install jupyter
jupyter lab

Then open one of the evaluation notebooks listed above.

Installation

The canonical environment setup for this repo:

  • conda with amharicir-environment.yml (Python 3.10)
conda env create -f amharicir-environment.yml
conda activate amharicir

For pip-only workflows, use requirements.txt with a virtual environment.

Python version:

  • The environment file pins Python to 3.10.
  • Notebook metadata in this repo includes multiple runtime versions (3.10.12, 3.11, 3.12.12), so exact results may vary across runtimes.
  • Dependencies are pinned in both amharicir-environment.yml and requirements.txt.

Usage

1) Evaluate pretrained retrieval models

Dense embedding retrieval:

jupyter lab evaluation/evaluate-amharic-embedding-passage-retrieval.ipynb
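
For intuition, dense retrieval of this kind ranks passages by the similarity of their embeddings to the query embedding. A minimal, self-contained sketch with toy vectors (the real notebooks use the Amharic bi-encoder checkpoints and the full corpus):

```python
import numpy as np

def cosine_scores(query: np.ndarray, passages: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of passage vectors."""
    q = query / np.linalg.norm(query)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    return p @ q

# Toy embeddings standing in for bi-encoder outputs
query = np.array([1.0, 0.0, 1.0])
passages = np.array([
    [1.0, 0.0, 1.0],   # same direction as the query -> cosine 1.0
    [0.0, 1.0, 0.0],   # orthogonal to the query -> cosine 0.0
])
scores = cosine_scores(query, passages)
ranking = np.argsort(-scores)  # best-scoring passage first
print(scores, ranking)
```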

ColBERT-style retrieval:

jupyter lab evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb
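
Late interaction (ColBERT-style) keeps per-token embeddings and scores a passage by summing, over query tokens, the maximum similarity to any passage token (MaxSim). A toy sketch of that scoring rule:

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, p_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take its max dot product over passage token embeddings, then sum."""
    sim = q_tokens @ p_tokens.T          # (num_query_tokens, num_passage_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

q = np.array([[1.0, 0.0], [0.0, 1.0]])                 # 2 toy query tokens
p_good = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # matches both tokens
p_bad = np.array([[-1.0, 0.0], [0.0, -1.0]])             # matches neither
print(maxsim_score(q, p_good), maxsim_score(q, p_bad))
```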

SPLADE-style retrieval:

jupyter lab evaluation/evaluate-amharic-splade-passage-retrieval.ipynb
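
Learned sparse retrieval (SPLADE-style) encodes queries and passages as sparse term-weight vectors over the vocabulary; relevance is the dot product over shared terms. A toy sketch (the Amharic terms and weights below are illustrative, not actual model outputs):

```python
def splade_score(query_terms: dict, passage_terms: dict) -> float:
    """Dot product of two sparse term-weight vectors (vocabulary-indexed)."""
    shared = query_terms.keys() & passage_terms.keys()
    return sum(query_terms[t] * passage_terms[t] for t in shared)

# Hypothetical term weights a SPLADE encoder might emit
query = {"አዲስ": 1.2, "አበባ": 0.9}
passage = {"አዲስ": 0.8, "አበባ": 1.1, "ከተማ": 0.5}
print(splade_score(query, passage))  # 1.2*0.8 + 0.9*1.1 = 1.95
```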

Two-stage retrieval + reranking:

jupyter lab evaluation/evaluate-amharic-rerankers-passage-retrieval.ipynb
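
The two-stage setup retrieves top-k candidates with a fast first-stage scorer, then rescores only those candidates with a more expensive reranker. A toy sketch with a stand-in cross-encoder (the `ce_scores` table is hypothetical, standing in for model logits):

```python
import numpy as np

def two_stage(query_vec, passage_vecs, rerank_fn, k=2):
    """Stage 1: top-k passages by dot product.
    Stage 2: reorder those candidates by a (stand-in) cross-encoder score."""
    first_stage = passage_vecs @ query_vec
    candidates = np.argsort(-first_stage)[:k]
    reranked = sorted(candidates, key=lambda i: -rerank_fn(i))
    return [int(i) for i in reranked]

passages = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])
# Hypothetical cross-encoder scores that flip the first-stage order
ce_scores = {0: 0.3, 1: 0.9, 2: 0.1}
print(two_stage(query, passages, ce_scores.get))
```

Stage 1 keeps passages 0 and 1; the stand-in reranker then promotes passage 1 above passage 0.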

2) Preprocess / hard-negative mining

jupyter lab preprocessing/hard-negatives-mining-amharic-retrieval-dataset.ipynb
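
Hard-negative mining selects non-relevant passages that a retriever nevertheless scores highly, which makes for more informative training negatives than random ones. A toy sketch of the selection step (real mining uses model embeddings over the full corpus):

```python
import numpy as np

def mine_hard_negatives(query_vec, passage_vecs, positive_ids, num_negatives=2):
    """Rank all passages by similarity and keep the top-scoring ones that are
    NOT labeled positive -- these are the 'hard' negatives."""
    scores = passage_vecs @ query_vec
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked if int(i) not in positive_ids][:num_negatives]

passages = np.array([[1.0, 0.0], [0.95, 0.05], [0.6, 0.4], [0.0, 1.0]])
query = np.array([1.0, 0.0])
print(mine_hard_negatives(query, passages, positive_ids={0}))
```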

3) Train models

Embeddings:

  • training/embeddings-amharic/train-roberta-amharic-embed-base.ipynb
  • training/embeddings-amharic/train-roberta-amharic-embed-medium.ipynb

ColBERT:

  • training/colbert-amharic/train-colbert-amharic-base.ipynb
  • training/colbert-amharic/train-colbert-amharic-medium.ipynb

SPLADE:

  • training/splade-amharic/train-splade-roberta-amharic-base.ipynb
  • training/splade-amharic/train-splade-roberta-amharic-medium.ipynb

Cross-encoder reranker:

  • training/crossencoder-amharic/train-roberta-amharic-reranker-base.ipynb
  • training/crossencoder-amharic/train-roberta-amharic-reranker-medium.ipynb

Reproducibility

Data contract (as used in notebooks)

  • Evaluation notebooks load: rasyosef/Amharic-Passage-Retrieval-Dataset-V2
  • Training/preprocessing notebooks load: yosefw/amharic-news-retrieval-dataset-v2-with-negatives-V2 or rasyosef/amharic-passage-retrieval-dataset-v2
  • Common ID fields in workflows: query_id, passage_id
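
As a lightweight sanity check of this contract, a record can be validated for the shared ID fields. The text field names below are illustrative; only query_id and passage_id are taken from the workflows:

```python
# Hypothetical record; "query" and "passage" field names are illustrative
record = {
    "query_id": "q_000123",    # ID field used across workflows
    "passage_id": "p_045678",  # ID field used across workflows
    "query": "...",
    "passage": "...",
}

# The check the contract actually requires: both ID fields are present
assert {"query_id", "passage_id"} <= record.keys()
```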

Seeds

  • Multiple training notebooks use seed=42 (for example in dataset shuffling and training arguments).
  • A few evaluation paths are notebook-interactive and rely on per-cell execution order; rerunning from top to bottom is recommended.
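
A minimal seeding pattern consistent with the seed=42 used in the training notebooks (the torch branch is guarded so the snippet also runs in environments without torch installed):

```python
import random
import numpy as np

SEED = 42  # matches the seed used in several training notebooks

random.seed(SEED)
np.random.seed(SEED)
try:
    import torch
    torch.manual_seed(SEED)  # only if torch is installed
except ImportError:
    pass

# With the seed fixed, shuffles and permutations are reproducible
print(np.random.permutation(5))
```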

Runtime and hardware notes

  • Experiments in this repository were run in GPU-backed environments (A100 and T4 GPUs).
  • Some code paths explicitly set device="cuda" or device="cuda" if torch.cuda.is_available() else "cpu".
  • A recorded run in evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb shows the corpus chunking stage taking about 14 minutes 52 seconds.
  • Runtime depends on hardware, model choice, and batch size.
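
The CUDA fallback above can be written defensively; the try/except around the import is an addition here so the same pattern also works where torch is not installed:

```python
# Prefer CUDA when available, otherwise fall back to CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed: CPU-only fallback
    device = "cpu"

print(device)
```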

Known nondeterminism / caveats

  • GPU execution, FAISS-based retrieval, and notebook-interactive execution can introduce run-to-run variation.
  • Results can vary slightly across hardware and drivers even with fixed seeds and pinned software.

Project Structure

.
├── evaluation/
│   ├── evaluate-amharic-colbert-passage-retrieval.ipynb
│   ├── evaluate-amharic-embedding-passage-retrieval.ipynb
│   ├── evaluate-amharic-rerankers-passage-retrieval.ipynb
│   └── evaluate-amharic-splade-passage-retrieval.ipynb
├── preprocessing/
│   └── hard-negatives-mining-amharic-retrieval-dataset.ipynb
├── training/
│   ├── colbert-amharic/
│   │   ├── train-colbert-amharic-base.ipynb
│   │   └── train-colbert-amharic-medium.ipynb
│   ├── crossencoder-amharic/
│   │   ├── train-roberta-amharic-reranker-base.ipynb
│   │   └── train-roberta-amharic-reranker-medium.ipynb
│   ├── embeddings-amharic/
│   │   ├── train-roberta-amharic-embed-base.ipynb
│   │   └── train-roberta-amharic-embed-medium.ipynb
│   └── splade-amharic/
│       ├── train-splade-roberta-amharic-base.ipynb
│       └── train-splade-roberta-amharic-medium.ipynb
├── LICENSE
├── CITATION.cff
├── README.md
├── amharicir-environment.yml
└── requirements.txt

License

This project is released under the MIT License. See LICENSE.

Citation

GitHub citation metadata is available in CITATION.cff.

If you use this repository, please cite:

@misc{alemneh2026amharicir,
  title        = {AmharicIR: A Unified Resource for Amharic Neural Retrieval Models and Benchmarks},
  author       = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  year         = {2026},
  note         = {Manuscript under review},
  howpublished = {GitHub repository},
}

Troubleshooting / FAQ

Q: ModuleNotFoundError when opening notebooks.
A: Activate the virtual environment and reinstall dependencies:

source .venv/bin/activate
python -m pip install -r requirements.txt

Q: Notebook tries to run on CUDA but no GPU is available.
A: Use notebooks/cells that already support fallback (torch.cuda.is_available()), or set model/device cells to CPU explicitly where needed.

Q: Results differ from previous runs.
A: Check seed usage (seed=42 appears in several training notebooks), hardware differences (A100/T4/CPU), and CUDA/driver differences.

Q: Where are CLI scripts/config files for end-to-end pipeline runs?
A: They are not present in the current repository layout.
