A word-level inverted index builder and search tool written in Python.
This is an inverted index implementation I wrote for an Information Retrieval course. It builds a word-level index from a corpus of HTML or text documents, enabling fast full-text search. The pipeline includes HTML parsing with BeautifulSoup, tokenization and stopword removal with NLTK, and stemming with Snowball. A sample corpus is included in the corpus/ directory.
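As a rough illustration of what a word-level index stores, the sketch below maps each term to the documents it occurs in and the token positions within each one. It uses only the standard library: the regex tokenizer and the tiny `STOPWORDS` set are hypothetical stand-ins for the NLTK tokenizer and stopword list, stemming is left out, and the function names are not taken from `indexer.py`.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [token positions]} (a word-level index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token in STOPWORDS:
                continue  # stopwords are not indexed but still occupy a position
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "doc1": "The cat sat in the hat",
    "doc2": "A cat and a dog",
}
index = build_index(docs)
print(index["cat"])  # positions of "cat" in each document
```

Storing positions rather than only document IDs is what makes the index word-level. It costs more space, but it allows phrase and proximity queries later.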
- Builds a word-level inverted index from a directory of documents
- Parses HTML documents using BeautifulSoup
- Tokenization, stopword removal, and Snowball stemming via NLTK
- Ranked retrieval using TF-IDF weighting
- Interactive menu: search an existing index, rebuild from a corpus, or exit
- Supports nested corpus directories (subdirectories are merged during preprocessing)
- Rich terminal UI with progress bars and formatted output
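The TF-IDF ranking listed above can be sketched as follows. This is a minimal version under stated assumptions: it uses the raw term count as TF and `log(N / df)` as IDF, which is one common variant. The exact weighting used by this project is not specified here, and the `rank` function and sample documents are hypothetical.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Score each document by the sum of tf * idf over the query terms."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # term is not in the corpus, so it contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Documents are assumed to be already tokenized and stemmed.
docs = {
    "doc1": ["cat", "sat", "hat"],
    "doc2": ["cat", "dog", "dog"],
    "doc3": ["bird", "sang"],
}
results = rank(["cat", "dog"], docs)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In this example `doc2` ranks first because "dog" is rarer across the corpus than "cat" and occurs twice in that document.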
- Python 3
- NLTK — tokenization, stemming, stopwords
- Beautiful Soup — HTML parsing
- Rich — terminal formatting and progress bars
git clone https://github.com/mosamaasif/Inverted_Indexer.git
cd Inverted_Indexer
pip3 install nltk rich beautifulsoup4
python3 indexer.py
On first run, you may need to download the NLTK data:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
The program presents a menu with three options:
- Search Only: search an existing index (one must have been built previously)
- Rebuild Index and Search: point it to a corpus directory (e.g., corpus/), rebuild the index, then search
- Exit
Note: The corpus can have subdirectories. Provide the path to the root directory and each subdirectory will be processed and merged automatically.
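One way to gather documents from a nested corpus is to walk the root directory recursively. The sketch below is an assumption about how that could be done, not the project's actual code: `collect_documents` and its extension filter are hypothetical names and values, chosen to match the HTML and text inputs described above.

```python
from pathlib import Path


def collect_documents(root, extensions=(".html", ".htm", ".txt")):
    """Return every matching file under root, including subdirectories."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")  # rglob descends into all subdirectories
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Calling `collect_documents("corpus/")` would then return a single sorted list of file paths, regardless of how deeply they are nested.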
Distributed under the MIT License. See LICENSE for details.




