Build a unigram inverted index on the 20 Newsgroups dataset [https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups].
Provide support for the following commands:
- x OR y
- x AND y
- x AND NOT y
- x OR NOT y

Here x and y are taken as input from the user. Output:
- the number of docs retrieved
- the minimum number of total comparisons done (if any)
- the list of documents retrieved.
- Reading all the files from all the folders and storing them as a list of tuples, each of the form (filename, data as a string).
- For each file stored in the list, performing the following steps:
- Tokenizing the data of each file.
- From these tokens, removing the punctuation.
- Lemmatizing each of the words.
- Removal of stop words.
- Keeping a list of unique words per document.
- Iterating over the unique word list and creating the inverted index. The inverted index is a dictionary keyed by word; each word maps to a tuple of the form (frequency, posting list). The final structure of the inverted index is: {word: (frequency, posting list)}
- Finally, sorting the posting list of each word.
- Lemmatizing the input query string.
- Evaluating the query according to the priority of the operators.
- Tokenization
- Punctuation Removal
- Lemmatization
- Stopword Removal
- All files have distinct names. No file name is duplicated across the folders as a whole.
- Since NOT cannot appear as a unary operator, the operator priority is: AND NOT > OR NOT > AND > OR
- The input string contains no stopwords.
- The operators AND, OR, and NOT are always entered in uppercase in the input string.
- The input query consists only of words present in the documents.
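
The steps and assumptions above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word set, the identity `lemmatize` function, and the names `preprocess`, `build_inverted_index`, `merge`, and `evaluate` are all hypothetical, and "frequency" is assumed to mean the number of documents containing the word. The comparison count is simply one per step of a linear merge over two sorted posting lists, and only a single operator between two already-normalized words is handled.

```python
import re

# Hypothetical stand-ins for a real stop-word list and lemmatizer.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "to", "in", "is"}

def lemmatize(token):
    return token  # placeholder: a real lemmatizer would map "cars" to "car"

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize, drop punctuation
    lemmas = [lemmatize(t) for t in tokens]
    return [w for w in lemmas if w not in STOP_WORDS]

def build_inverted_index(docs):
    """docs is a list of (filename, text) tuples."""
    postings = {}
    for name, text in docs:
        for word in set(preprocess(text)):  # unique words per document
            postings.setdefault(word, []).append(name)
    # Assumption: "frequency" is the number of documents containing the word.
    return {w: (len(p), sorted(p)) for w, p in postings.items()}

def merge(p1, p2, keep_both, keep_left, keep_right):
    """Walk two sorted posting lists once, counting one comparison per step."""
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            if keep_both:
                result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if keep_left:
                result.append(p1[i])
            i += 1
        else:
            if keep_right:
                result.append(p2[j])
            j += 1
    if keep_left:
        result.extend(p1[i:])
    if keep_right:
        result.extend(p2[j:])
    return result, comparisons

def evaluate(index, all_docs, x, op, y):
    """Evaluate 'x op y' for two already-normalized words."""
    px = index.get(x, (0, []))[1]
    py = index.get(y, (0, []))[1]
    if op == "AND":
        return merge(px, py, True, False, False)
    if op == "OR":
        return merge(px, py, True, True, True)
    if op == "AND NOT":
        return merge(px, py, False, True, False)
    if op == "OR NOT":
        # NOT y is the complement of y's postings within the sorted document list.
        not_y, c1 = merge(all_docs, py, False, True, False)
        result, c2 = merge(px, not_y, True, True, True)
        return result, c1 + c2
    raise ValueError(f"unsupported operator: {op}")

docs = [("d1", "Cars and bikes."), ("d2", "Bikes on the road!"), ("d3", "Cars on the road.")]
index = build_inverted_index(docs)
all_docs = sorted(name for name, _ in docs)
print(evaluate(index, all_docs, "cars", "AND", "bikes"))      # (['d1'], 2)
print(evaluate(index, all_docs, "cars", "AND NOT", "bikes"))  # (['d3'], 2)
```

Keeping every posting list sorted is what lets each operator run as a single linear pass; a query with several operators would apply `evaluate`-style steps in the priority order listed above.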
Provide support for phrase queries using positional indexes. (Build the index only on comp.graphics and rec.motorcycles.)
- Reading all the files from the folders (comp.graphics and rec.motorcycles) and storing them as a list of tuples, each of the form (filename, data as a string).
- For each file stored in the list, performing the following steps:
- Tokenizing the data of each file.
- From these tokens, removing the punctuation.
- Lemmatizing each of the words.
- Iterating over the lemmatized word list and creating the positional index. The positional index is a dictionary keyed by word; each word maps to a tuple of the form (frequency, dictionary of documents). The dictionary of documents has the docID as key and the list of positions within that document as value. The final structure of the positional index is: {word: (frequency, {docID: positional list})}
- Removing the empty key from the dictionary.
- Evaluating the query two words at a time, keeping a document only when the two words occur at adjacent positions.
- Tokenization
- Punctuation Removal
- Lemmatization
- All files have distinct names. No file name is duplicated across the folders as a whole.
- The input phrase query contains at least 2 words.
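
A rough sketch of the positional index and the pairwise phrase matching described above is given below. It is illustrative only: `preprocess`, `build_positional_index`, and `phrase_query` are hypothetical names, lemmatization is omitted, and "frequency" is assumed to be the number of documents containing the word. Because the regex tokenizer used here never produces empty tokens, the empty-key removal step has no counterpart in this sketch.

```python
import re

def preprocess(text):
    # Simplified stand-in: lowercase alphanumeric tokens, stop words kept,
    # no lemmatization.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """docs is a list of (filename, text) tuples."""
    positions = {}
    for name, text in docs:
        for pos, word in enumerate(preprocess(text)):
            positions.setdefault(word, {}).setdefault(name, []).append(pos)
    # Shape: {word: (frequency, {docID: [positions]})}
    return {w: (len(d), d) for w, d in positions.items()}

def phrase_query(index, phrase):
    words = preprocess(phrase)
    if len(words) < 2 or any(w not in index for w in words):
        return []
    # current maps docID -> positions where the phrase matched so far ends.
    current = {doc: list(pos) for doc, pos in index[words[0]][1].items()}
    for word in words[1:]:
        next_docs = index[word][1]
        matched = {}
        for doc, ends in current.items():
            following = set(next_docs.get(doc, []))
            # Keep only matches where the next word sits right after the previous one.
            hits = [p + 1 for p in ends if p + 1 in following]
            if hits:
                matched[doc] = hits
        current = matched
    return sorted(current)

docs = [
    ("g1", "fast graphics card"),
    ("m1", "fast motorcycle engine"),
    ("m2", "card for fast graphics"),
]
index = build_positional_index(docs)
print(phrase_query(index, "fast graphics"))       # ['g1', 'm2']
print(phrase_query(index, "fast graphics card"))  # ['g1']
```

Processing the phrase one adjacent pair at a time means a longer phrase only needs the positions where the previous pair ended, so no document text has to be rescanned.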