Assignment1

InvertedIndexCreation

Build a unigram inverted index on the 20 Newsgroups dataset (https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups).

Task - 1

Provide support for the following commands:

x OR y
x AND y
x AND NOT y
x OR NOT y

where x and y are taken as input from the user.

Output:

the number of documents retrieved
the minimum number of total comparisons done (if any)
the list of documents retrieved
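The four commands can be sketched as two-pointer merges over sorted posting lists, each returning the result together with the number of comparisons it made. This is a minimal illustration, not the code in Task1.py: the function names and the `all_docs` parameter (the sorted list of every document ID, needed to complement y) are assumptions, and the count reported is simply what a plain merge performs.

```python
def intersect(p1, p2):
    """x AND y: documents present in both sorted posting lists."""
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result, comparisons


def union(p1, p2):
    """x OR y: documents present in either sorted posting list."""
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other needs no comparisons.
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result, comparisons


def and_not(p1, p2):
    """x AND NOT y: documents in p1 that are absent from p2."""
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            j += 1
    result.extend(p1[i:])
    return result, comparisons


def or_not(p1, p2, all_docs):
    """x OR NOT y: p1 merged with the complement of p2 within all_docs."""
    not_p2, c1 = and_not(all_docs, p2)
    result, c2 = union(p1, not_p2)
    return result, c1 + c2


x = [1, 3, 5, 7]
y = [3, 4, 7, 8]
docs, comparisons = intersect(x, y)
print(len(docs), comparisons, docs)
```

Note that `x OR NOT y` is the only command that depends on the whole collection, which is why it takes the extra argument.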

METHODOLOGY:

Read all the files from all the folders and store them as a list of tuples, each of the form (filename, file contents as a string).
For each file in the list, perform the following steps:
- Tokenize the file contents.
- Remove punctuation from the tokens.
- Lemmatize each word.
- Remove stop words.
- Keep a list of the unique words in the document.
- Iterate over the unique word list and update the inverted index. The inverted index is a dictionary keyed by word; each word maps to a tuple (frequency, posting list), so the final structure is {word: (frequency, posting list)}.
- Finally, sort the posting list of each word.
Lemmatize the input query string.
Evaluate the query according to the priority of the operators.
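The index-building steps above can be sketched as follows. Several parts are stand-ins and should be read as assumptions: `STOPWORDS` is a tiny illustrative set, `lemmatize` is a placeholder that only lowercases and strips a trailing plural "s" (a real lemmatizer from an NLP library would replace it), and "frequency" is taken to mean the number of documents containing the word.

```python
import string

# Tiny illustrative stop word set (assumption, not the list used in the repo).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def lemmatize(token):
    """Placeholder lemmatizer (assumption): lowercase, drop a plural 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, strip punctuation, lemmatize, then remove stop words."""
    tokens = text.split()
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [lemmatize(t) for t in tokens if t]
    return [t for t in tokens if t not in STOPWORDS]


def build_inverted_index(documents):
    """documents: list of (filename, text) tuples.

    Returns {word: (frequency, sorted posting list)}.
    """
    postings = {}
    for filename, text in documents:
        # set() keeps each word once per document.
        for word in set(preprocess(text)):
            postings.setdefault(word, []).append(filename)
    return {word: (len(files), sorted(files)) for word, files in postings.items()}


documents = [
    ("b.txt", "The bikes are fast."),
    ("a.txt", "A fast bike, the graphics card."),
]
index = build_inverted_index(documents)
print(index["bike"])
```

Sorting each posting list once at the end keeps the later merge-based query operations linear in the list lengths.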

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization
Stopword Removal

ASSUMPTIONS:

All files have distinct names; no file name is duplicated across the folders.
Since NOT cannot appear as a standalone unary operator, the operator priority is: AND NOT > OR NOT > AND > OR
The input query contains no stopwords.
The operators AND, OR and NOT are entered in uppercase in the input query.
Every word in the input query is present in the documents.
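To illustrate how the stated priority (AND NOT > OR NOT > AND > OR) changes the result of a multi-operator query, here is a small sketch that fuses "AND NOT" and "OR NOT" into single binary operators and then reduces the query one priority level at a time. It uses Python sets for brevity and does not count comparisons; `parse`, `apply_op`, `evaluate` and the toy `index` are hypothetical names, not taken from the repository.

```python
PRIORITY = ["AND NOT", "OR NOT", "AND", "OR"]  # highest priority first


def parse(query):
    """Split a query into operands and operators, fusing 'AND NOT' / 'OR NOT'."""
    parts = query.split()
    tokens = []
    i = 0
    while i < len(parts):
        if parts[i] in ("AND", "OR") and i + 1 < len(parts) and parts[i + 1] == "NOT":
            tokens.append(parts[i] + " NOT")
            i += 2
        else:
            tokens.append(parts[i])
            i += 1
    return tokens


def apply_op(op, left, right, all_docs):
    """Apply one binary operator to two sets of document IDs."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "AND NOT":
        return left - right
    if op == "OR NOT":
        return left | (all_docs - right)
    raise ValueError("unknown operator: " + op)


def evaluate(query, index, all_docs):
    """Evaluate a well-formed query: operand (operator operand)*."""
    items = [t if t in PRIORITY else set(index.get(t, ())) for t in parse(query)]
    for op in PRIORITY:
        i = 1  # operators sit at odd positions
        while i < len(items) - 1:
            if items[i] == op:
                # Replace "left op right" with its result; stay at the same i.
                items[i - 1:i + 2] = [apply_op(op, items[i - 1], items[i + 1], all_docs)]
            else:
                i += 2
    return sorted(items[0])


# Toy index mapping words to document IDs (illustrative data only).
index = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 5}}
all_docs = {1, 2, 3, 4, 5}
print(evaluate("a OR b AND NOT c", index, all_docs))
```

In the example query, `b AND NOT c` is reduced first, so the result differs from a strict left-to-right evaluation.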

Task - 2

Provide support for searching for phrase queries using Positional Indexes. (Build index only on comp.graphics and rec.motorcycles)

METHODOLOGY:

Read all the files from the folders (comp.graphics and rec.motorcycles) and store them as a list of tuples, each of the form (filename, file contents as a string).
For each file in the list, perform the following steps:
- Tokenize the file contents.
- Remove punctuation from the tokens.
- Lemmatize each word.
- Iterate over the lemmatized word list and update the positional index. The positional index is a dictionary keyed by word; each word maps to a tuple (frequency, dictionary of documents), where the dictionary of documents has the docID as key and the list of positions within that document as value. The final structure is {word: (frequency, {docID: position list})}.
- Remove the empty-string key from the dictionary.
Evaluate the phrase query two words at a time, matching a pair only when the second word occurs immediately after the first.
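A minimal sketch of the positional index and the pairwise phrase match described above. Lemmatization is replaced by simple lowercasing here, "frequency" is assumed to be the number of documents containing the word, and the function names are hypothetical rather than those used in Task2.py.

```python
import string


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, drop empty tokens."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return [t for t in tokens if t]


def build_positional_index(documents):
    """documents: list of (doc_id, text) tuples.

    Returns {word: (frequency, {doc_id: [positions]})}.
    """
    index = {}
    for doc_id, text in documents:
        for position, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return {word: (len(docs), docs) for word, docs in index.items()}


def phrase_query(phrase, index):
    """Return the sorted doc IDs in which the words of phrase occur consecutively."""
    words = tokenize(phrase)
    if len(words) < 2:
        raise ValueError("a phrase query needs at least 2 words")
    # current maps doc_id -> positions where the phrase matched so far ends.
    current = dict(index.get(words[0], (0, {}))[1])
    for word in words[1:]:
        next_docs = index.get(word, (0, {}))[1]
        matched = {}
        for doc_id, positions in current.items():
            following = set(next_docs.get(doc_id, ()))
            # Keep only positions where this word directly follows the previous one.
            hits = [p + 1 for p in positions if p + 1 in following]
            if hits:
                matched[doc_id] = hits
        current = matched
    return sorted(current)


documents = [
    ("d1", "The graphics card is fast"),
    ("d2", "fast graphics on a card"),
    ("d3", "a new graphics card, really"),
]
pos_index = build_positional_index(documents)
print(phrase_query("graphics card", pos_index))
```

Because tokens are stripped and filtered before positions are assigned, no empty-string key is created in this sketch, so the separate cleanup step is not needed here.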

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization

ASSUMPTIONS:

All files have distinct names; no file name is duplicated across the folders.
The input phrase query contains at least 2 words.
