Inverted Indexer

A word-level inverted index builder and search tool written in Python.

About

This is an inverted index implementation I wrote for an Information Retrieval course. It builds a word-level index from a corpus of HTML or text documents, enabling fast full-text search. The pipeline includes HTML parsing with BeautifulSoup, tokenization and stopword removal with NLTK, and stemming with Snowball. A sample corpus is included in the corpus/ directory.
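To make "word-level" concrete, here is a minimal sketch of the data structure, not the code in indexer.py. It maps each term to the documents that contain it and the word positions where it occurs. The names `tokenize`, `build_index`, and `STOPWORDS` are hypothetical, and a regex tokenizer plus a tiny hand-written stopword set stand in for the NLTK tokenizer, NLTK stopword list, and Snowball stemmer that the project uses.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Positions are counted before stopword removal, so they refer to
        # word offsets in the original token stream.
        for position, token in enumerate(tokenize(text)):
            if token in STOPWORDS:
                continue
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "d1": "The quick brown fox",
    "d2": "The lazy dog and the quick cat",
}
index = build_index(docs)
print(index["quick"])
```

Looking up a term is then a single dictionary access, which is what makes search fast compared with scanning every document.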

Features

Builds a word-level inverted index from a directory of documents
Parses HTML documents using BeautifulSoup
Tokenization, stopword removal, and Snowball stemming via NLTK
Ranked retrieval using TF-IDF weighting
Interactive menu: search an existing index, rebuild from a corpus, or exit
Supports nested corpus directories (subdirectories are merged during preprocessing)
Rich terminal UI with progress bars and formatted output
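
As an illustration of TF-IDF ranking, the sketch below scores documents by summing the TF-IDF weight of each query term. This README does not say which TF-IDF variant indexer.py uses, so the choices here are assumptions: term frequency is normalized by document length and the inverse document frequency is a natural logarithm. The function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Rank documents by summed TF-IDF weight of the query terms.

    docs maps doc_id -> list of already preprocessed tokens.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # term not in the corpus, contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                tf = counts[term] / len(docs[doc_id])
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": ["quick", "brown", "fox"],
    "d2": ["lazy", "dog", "quick", "cat"],
    "d3": ["brown", "dog", "brown"],
}
results = rank(["brown", "fox"], docs)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Here "fox" appears in only one document, so it carries more weight than "brown", and d1 ranks above d3 even though d3 mentions "brown" twice.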

Built With

Python 3
NLTK — tokenization, stemming, stopwords
Beautiful Soup — HTML parsing
Rich — terminal formatting and progress bars

Getting Started

Prerequisites

Python 3

Installation & Running

git clone https://github.com/mosamaasif/Inverted_Indexer.git
cd Inverted_Indexer
pip3 install nltk rich beautifulsoup4
python3 indexer.py

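On first run, you may need to download the NLTK data (if NLTK then reports a missing punkt_tab resource, which newer releases use, download that as well):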

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Usage

The program presents a menu with three options:

Search Only — search an existing index (must have been built previously)
Rebuild Index and Search — point it to a corpus directory (e.g., corpus/), rebuild the index, then search
Exit

Note: The corpus can have subdirectories. Provide the path to the root directory and each subdirectory will be processed and merged automatically.
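
One way to picture the merge of nested directories is a recursive walk that flattens every subdirectory into a single list of documents. This is a sketch under assumptions, not the logic in indexer.py: the function name `collect_documents` and the set of accepted file extensions are hypothetical.

```python
import tempfile
from pathlib import Path


def collect_documents(root, extensions=(".html", ".htm", ".txt")):
    """Return a sorted, flat list of document paths under root.

    Subdirectories are searched recursively; files with other
    extensions are skipped.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha")
    (Path(tmp) / "sub" / "b.html").write_text("<p>beta</p>")
    (Path(tmp) / "sub" / "notes.md").write_text("ignored")
    found = [p.relative_to(tmp).as_posix() for p in collect_documents(tmp)]

print(found)
```

Sorting the paths keeps document order stable between runs, which helps if document IDs are assigned by position in the list.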

Screenshots

License

Distributed under the MIT License. See LICENSE for details.
