This repository accompanies the SIGIR 2026 paper “AmharicIR: A Unified Resource for Amharic Neural Retrieval Models and Benchmarks.” It provides notebook-based training and evaluation workflows for dense retrieval, late interaction (ColBERT-style), sparse retrieval (SPLADE-style), and cross-encoder reranking in Amharic.
## Core artifacts
- Benchmark: Amharic Passage Retrieval Dataset V2 with a fixed 90/10 train–test split (68,000 query–passage pairs).
- Model suite: Amharic-specific checkpoints spanning dense bi-encoders, late-interaction (ColBERT-style), learned sparse retrievers (SPLADE-style), and cross-encoder rerankers.
- Workflows: notebook implementations for preprocessing, training, and evaluation.
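The retrieval paradigms above differ mainly in how a query–passage score is computed. As a toy illustration with random vectors (not the repository's actual models): a dense bi-encoder compares one pooled vector per text, while a ColBERT-style model keeps per-token vectors and sums per-query-token maximum similarities (MaxSim):

```python
import numpy as np

def dense_score(q_vec, p_vec):
    # Bi-encoder scoring: cosine similarity between two pooled embeddings.
    return float(np.dot(q_vec, p_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(p_vec)))

def maxsim_score(q_tokens, p_tokens):
    # ColBERT-style late interaction (MaxSim): for each query token vector,
    # take the maximum similarity over all passage token vectors, then sum.
    sims = q_tokens @ p_tokens.T  # (num_query_tokens, num_passage_tokens)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(42)
q_vec, p_vec = rng.normal(size=64), rng.normal(size=64)             # pooled vectors
q_tok, p_tok = rng.normal(size=(4, 64)), rng.normal(size=(12, 64))  # token matrices

dense = dense_score(q_vec, p_vec)   # one similarity per query–passage pair
late = maxsim_score(q_tok, p_tok)   # token-level interaction score
```

SPLADE-style retrievers instead score a sparse dot product over vocabulary dimensions; the evaluation notebooks implement the real versions of all three paradigms.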
## Hugging Face resources
- Dataset: rasyosef/Amharic-Passage-Retrieval-Dataset-V2
- Model collection: rasyosef/amharic-neural-ir-models
## Models used in current notebooks (examples)
- rasyosef/RoBERTa-Amharic-Embed-Base
- rasyosef/RoBERTa-Amharic-Embed-Medium
- rasyosef/ColBERT-Amharic-Base
- rasyosef/ColBERT-Amharic-Medium
- rasyosef/SPLADE-RoBERTa-Amharic-Base
- rasyosef/SPLADE-RoBERTa-Amharic-Medium
- rasyosef/RoBERTa-Amharic-Reranker-Base
- rasyosef/RoBERTa-Amharic-Reranker-Medium
This codebase is organized primarily as Jupyter notebooks (rather than standalone .py scripts). The goal is to keep the full pipeline easy to follow and modify step-by-step, especially for practitioners. Because the dataset used in these workflows is relatively small, we keep the main experiments and analysis in notebook format for clarity and quick iteration.
## Practical notes
- Run notebooks from the repository root so relative paths resolve correctly.
- Each notebook is intended to be runnable end-to-end, in the order described below.
- If you prefer scripts, you can export notebooks with:

  ```shell
  jupyter nbconvert --to script <notebook-path>.ipynb
  ```
## Quick start

Using conda, create an environment from `amharicir-environment.yml`:

```shell
conda env create -f amharicir-environment.yml
conda activate amharicir
jupyter lab
```

Then open one of:

- `evaluation/evaluate-amharic-embedding-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-splade-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-rerankers-passage-retrieval.ipynb`
Alternatively, with pip, create a virtual environment and install dependencies:

```shell
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install jupyter
jupyter lab
```

Then open one of:

- `evaluation/evaluate-amharic-embedding-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-splade-passage-retrieval.ipynb`
- `evaluation/evaluate-amharic-rerankers-passage-retrieval.ipynb`
## Environment

The canonical environment for this repo is conda with `amharicir-environment.yml` (Python 3.10):

```shell
conda env create -f amharicir-environment.yml
conda activate amharicir
```

For pip-only workflows, use `requirements.txt` with a virtual environment.
Python version:

- The environment file pins Python to 3.10.
- Notebook metadata in this repo includes multiple runtime versions (3.10.12, 3.11, 3.12.12), so exact results may vary across runtimes.
- Dependencies are pinned in both `amharicir-environment.yml` and `requirements.txt`.
## Running the notebooks

Evaluation:

- Dense embedding retrieval: `jupyter lab evaluation/evaluate-amharic-embedding-passage-retrieval.ipynb`
- ColBERT-style retrieval: `jupyter lab evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb`
- SPLADE-style retrieval: `jupyter lab evaluation/evaluate-amharic-splade-passage-retrieval.ipynb`
- Two-stage retrieval + reranking: `jupyter lab evaluation/evaluate-amharic-rerankers-passage-retrieval.ipynb`

Preprocessing:

- Hard-negatives mining: `jupyter lab preprocessing/hard-negatives-mining-amharic-retrieval-dataset.ipynb`

Training:

- Embeddings:
  - `training/embeddings-amharic/train-roberta-amharic-embed-base.ipynb`
  - `training/embeddings-amharic/train-roberta-amharic-embed-medium.ipynb`
- ColBERT:
  - `training/colbert-amharic/train-colbert-amharic-base.ipynb`
  - `training/colbert-amharic/train-colbert-amharic-medium.ipynb`
- SPLADE:
  - `training/splade-amharic/train-splade-roberta-amharic-base.ipynb`
  - `training/splade-amharic/train-splade-roberta-amharic-medium.ipynb`
- Cross-encoder reranker:
  - `training/crossencoder-amharic/train-roberta-amharic-reranker-base.ipynb`
  - `training/crossencoder-amharic/train-roberta-amharic-reranker-medium.ipynb`
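The two-stage retrieval + reranking flow can be sketched with stand-in scorers (token overlap in place of the repository's trained models; every name here is illustrative, not from the notebooks):

```python
def retrieve(query, corpus, first_stage_score, k=100):
    # Stage 1: score every passage with a cheap function, keep the top-k candidates.
    ranked = sorted(corpus, key=lambda p: first_stage_score(query, p), reverse=True)
    return ranked[:k]

def rerank(query, candidates, cross_encoder_score, k=10):
    # Stage 2: re-score only the candidates with a stronger (slower) model.
    ranked = sorted(candidates, key=lambda p: cross_encoder_score(query, p), reverse=True)
    return ranked[:k]

# Stand-in scorers: token overlap for stage 1, a toy "cross-encoder" for stage 2.
def overlap(q, p):
    return len(set(q.split()) & set(p.split()))

def toy_cross_encoder(q, p):
    return overlap(q, p) / (1 + abs(len(q.split()) - len(p.split())))

corpus = ["addis ababa news", "amharic passage retrieval", "retrieval of amharic news passages"]
cands = retrieve("amharic retrieval", corpus, overlap, k=2)
top = rerank("amharic retrieval", cands, toy_cross_encoder, k=1)
```

In the real notebooks, stage 1 is one of the dense/ColBERT/SPLADE retrievers and stage 2 is a `RoBERTa-Amharic-Reranker` cross-encoder.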
## Data

- Evaluation notebooks load: `rasyosef/Amharic-Passage-Retrieval-Dataset-V2`
- Training/preprocessing notebooks load: `yosefw/amharic-news-retrieval-dataset-v2-with-negatives-V2` or `rasyosef/amharic-passage-retrieval-dataset-v2`
- Common ID fields in workflows: `query_id`, `passage_id`
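Given rankings and relevance judgments keyed by these IDs, a metric such as MRR@10 takes only a short helper. This is a generic sketch; the `run`/`qrels` shapes below are assumptions for illustration, not the notebooks' exact data structures:

```python
def mrr_at_k(run, qrels, k=10):
    # run:   query_id -> ranked list of passage_ids (best first)
    # qrels: query_id -> set of relevant passage_ids
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    return total / len(run)

qrels = {"q1": {"p9"}, "q2": {"p3"}}
run = {"q1": ["p1", "p9", "p2"], "q2": ["p3", "p7"]}
score = mrr_at_k(run, qrels)  # q1 hits at rank 2 (0.5), q2 at rank 1 (1.0) -> 0.75
```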
## Reproducibility

- Multiple training notebooks use `seed=42` (for example, in dataset shuffling and training arguments).
- A few evaluation paths are notebook-interactive and rely on per-cell execution order; rerunning from top to bottom is recommended.
- Experiments in this repository were run in GPU-backed environments (A100 and T4 GPUs).
- Some code paths explicitly set `device="cuda"` or `device="cuda" if torch.cuda.is_available() else "cpu"`.
- A recorded run in `evaluation/evaluate-amharic-colbert-passage-retrieval.ipynb` shows a corpus chunk stage of about 14:52.
- Runtime depends on hardware, model choice, and batch size.
- GPU execution, FAISS-based retrieval, and notebook-interactive execution can introduce run-to-run variation.
- Results can vary slightly across hardware and drivers even with fixed seeds and pinned software.
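A minimal seeding helper in the spirit of the notebooks' `seed=42` usage (the numpy/torch calls are guarded so this sketch also runs where those libraries are absent; `set_seed` is an illustrative name, not a function from the notebooks):

```python
import random

def set_seed(seed=42):
    # Seed Python's RNG; also seed numpy/torch when they are installed.
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]  # same seed, identical draws: a == b
```

Note that seeding alone does not remove all variation: GPU kernels and FAISS indexing can still differ run to run, as the bullets above describe.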
## Repository layout

```
.
├── evaluation/
│   ├── evaluate-amharic-colbert-passage-retrieval.ipynb
│   ├── evaluate-amharic-embedding-passage-retrieval.ipynb
│   ├── evaluate-amharic-rerankers-passage-retrieval.ipynb
│   └── evaluate-amharic-splade-passage-retrieval.ipynb
├── preprocessing/
│   └── hard-negatives-mining-amharic-retrieval-dataset.ipynb
├── training/
│   ├── colbert-amharic/
│   │   ├── train-colbert-amharic-base.ipynb
│   │   └── train-colbert-amharic-medium.ipynb
│   ├── crossencoder-amharic/
│   │   ├── train-roberta-amharic-reranker-base.ipynb
│   │   └── train-roberta-amharic-reranker-medium.ipynb
│   ├── embeddings-amharic/
│   │   ├── train-roberta-amharic-embed-base.ipynb
│   │   └── train-roberta-amharic-embed-medium.ipynb
│   └── splade-amharic/
│       ├── train-splade-roberta-amharic-base.ipynb
│       └── train-splade-roberta-amharic-medium.ipynb
├── LICENSE
├── CITATION.cff
├── README.md
├── amharicir-environment.yml
└── requirements.txt
```
## License

This project is released under the MIT License. See `LICENSE`.
GitHub citation metadata is available in `CITATION.cff`.
## Citation

If you use this repository, please cite:

```bibtex
@misc{alemneh2026amharicir,
  title = {AmharicIR: A Unified Resource for Amharic Neural Retrieval Models and Benchmarks},
  author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  year = {2026},
  note = {Manuscript under review},
  howpublished = {GitHub repository},
}
```

## FAQ

Q: `ModuleNotFoundError` when opening notebooks.
A: Activate the virtual environment and reinstall dependencies:

```shell
source .venv/bin/activate
python -m pip install -r requirements.txt
```

Q: Notebook tries to run on CUDA but no GPU is available.
A: Use notebooks/cells that already support fallback (`torch.cuda.is_available()`), or explicitly set the device to CPU in model/device cells where needed.
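That fallback can be written defensively so it also works when torch itself is not installed (`pick_device` is an illustrative helper, not a function from the notebooks):

```python
def pick_device():
    # Prefer CUDA when torch is installed and a GPU is visible; else fall back to CPU.
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

device = pick_device()  # pass this to model/tensor .to(device) calls
```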
Q: Results differ from previous runs.
A: Check seed usage (seed=42 appears in several training notebooks), hardware differences (A100/T4/CPU), and CUDA/driver differences.
Q: Where are CLI scripts/config files for end-to-end pipeline runs?
A: They are not present in the current repository layout.