Assignment2

DocumentRanking

For the first three tasks, download the dataset from [http://archives.textfiles.com/stories.zip].

Task - 1 : Jaccard Coefficient based document retrieval

For each query, your system will output the top k documents based on Jaccard score.

METHODOLOGY:

Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store them as a list of tuples, each of the form (filename, set of words in the file).
For each word set stored in a tuple, perform the following steps:
- Check each word of the input query string against the word set.
  - If the word is present in the word set, increment the counter.
- The counter then holds the number of query words present in the document, i.e. the size of the intersection.
- Find the union of the words in the document and the words in the query.
- The Jaccard coefficient, the intersection size divided by the union size, can then be stored against each filename in a dictionary: {filename, Jaccard_coefficient_val}
Find the top K documents by sorting the Jaccard coefficient values from highest to lowest.
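The steps above can be sketched as follows. This is a minimal illustration, not the code in Task1.py: the function names `jaccard` and `top_k_jaccard` and the toy corpus are hypothetical, and the query is assumed to be already preprocessed into a set of words.

```python
def jaccard(query_words, doc_words):
    """Jaccard coefficient |Q ∩ D| / |Q ∪ D| for two sets of words."""
    union = query_words | doc_words
    if not union:
        return 0.0  # both sets empty: define the score as 0
    return len(query_words & doc_words) / len(union)


def top_k_jaccard(query_words, docs, k):
    """docs is a list of (filename, set_of_words) tuples."""
    scores = {name: jaccard(query_words, words) for name, words in docs}
    # Highest score first; ties are broken by filename so the order is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy corpus, not taken from the stories dataset.
docs = [
    ("a.txt", {"the", "old", "man", "sea"}),
    ("b.txt", {"the", "young", "man"}),
    ("c.txt", {"river", "bank"}),
]
print(top_k_jaccard({"old", "man"}, docs, k=2))
```

Using set intersection and union directly replaces the explicit counter loop but computes the same quantity.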

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization
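
A possible shape for these three steps, using only the standard library, is shown below. The README does not say which tokenizer or lemmatizer is used, so `preprocess` and its `lemmatize` hook are hypothetical; lowercasing is an added assumption. A real lemmatizer (for example the `lemmatize` method of NLTK's `WordNetLemmatizer`) could be passed in place of the identity default.

```python
import string


def preprocess(text, lemmatize=lambda word: word):
    """Tokenize, remove punctuation, and lemmatize a piece of text.

    `lemmatize` is a hypothetical hook; the default leaves words unchanged.
    """
    # Punctuation removal: map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.translate(table).lower()  # lowercasing is an assumption
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Lemmatization: apply the supplied function to each token.
    return [lemmatize(token) for token in tokens]


print(preprocess("The cats, they say, don't sleep!"))
```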

ASSUMPTIONS:

All files in the dataset have distinct names; no file name is duplicated across the folders.

Task - 2 : Tf-Idf based document retrieval

For each query, your system will output top k documents based on tf-idf-matching-score. Implement different versions of Tf-Idf based document retrieval then compare and analyze which performs better and why. Give special attention to the terms in the document title and analyze the change in result with and without attention to terms in title.

METHODOLOGY:

Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store the term frequency of each term in each document as a dictionary of dictionaries. The data structure looks like this: {filename, {term, term_frequency}}
For each document, store the document length in a dictionary: {filename, document_length}
Find the document frequency of every term across all the documents and store it in a dictionary: {word, doc_freq}
For each term present in the input query, find the unweighted tf-idf score using the formula:
- Tf-idf = (tf / doc_length) × (1 + log(Total no. of docs / df)) {log is base 10}
Sum the tf-idf scores of all the terms present in the query and store the total against each document.
Find the top K documents by sorting the tf-idf scores from highest to lowest. This is unweighted tf-idf based document retrieval with the normalized tf and normalized idf formula.
To give weightage to the title, parse the index.html documents in the SRE folder as well as outside it, and keep track of each document's title in a dictionary.
For each term in the input query, if the term is present in a document's title, increase its tf-idf score for that document to 1.5 times the original value. This gives weighted tf-idf based document retrieval.
This process is repeated with each of the following tf-idf variants in place of the formula given above.
Variant – 2
- For each term in a document, use the log-normalized tf:
  - Tf = 1 + log(raw tf)
- Replace the idf with the smoothed idf:
  - Idf = log(Total docs / (1 + df))
Variant – 3
- For each term in a document, use the double-normalized K tf:
  - Tf = K + (1 - K) × (raw tf / max raw tf in the document) [K = 0.5]
- Replace the idf with idf max:
  - Idf = log(max(df) / (1 + df))
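
The three variants and the title weighting can be sketched together as below. This is an illustrative outline rather than the code in Task2.py: the names `build_index`, `term_weight` and `score_documents` are hypothetical, `titles` is assumed to map each filename to the set of words in its title, and base-10 logarithms are assumed for variants 2 and 3 (the README states the base only for the first formula).

```python
import math
from collections import Counter


def build_index(docs):
    """Build the three dictionaries from {filename: list of preprocessed tokens}."""
    tf = {name: Counter(tokens) for name, tokens in docs.items()}
    doc_len = {name: len(tokens) for name, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # a document counts once per distinct term
    return tf, doc_len, df


def term_weight(variant, count, length, max_count, df_t, n_docs, max_df, k=0.5):
    """tf-idf weight of one term in one document (count must be > 0)."""
    if variant == 1:  # length-normalized tf, 1 + log idf
        return (count / length) * (1 + math.log10(n_docs / df_t))
    if variant == 2:  # log-normalized tf, smoothed idf
        return (1 + math.log10(count)) * math.log10(n_docs / (1 + df_t))
    if variant == 3:  # double-normalized K tf, idf max
        return (k + (1 - k) * count / max_count) * math.log10(max_df / (1 + df_t))
    raise ValueError(f"unknown variant: {variant}")


def score_documents(query, tf, doc_len, df, variant=1, titles=None, boost=1.5):
    """Sum the query-term weights per document, with an optional title boost."""
    n_docs, max_df = len(tf), max(df.values(), default=0)
    scores = {}
    for name, counts in tf.items():
        max_count = max(counts.values(), default=0)
        total = 0.0
        for term in query:
            count = counts.get(term, 0)
            if count == 0:
                continue  # term absent from this document
            w = term_weight(variant, count, doc_len[name], max_count,
                            df[term], n_docs, max_df)
            if titles and term in titles.get(name, set()):
                w *= boost  # term also appears in the document title
            total += w
        scores[name] = total
    return scores


# Hypothetical toy corpus, not taken from the stories dataset.
docs = {"a.txt": ["cat", "sat", "cat"], "b.txt": ["dog", "sat"], "c.txt": ["bird"]}
tf, doc_len, df = build_index(docs)
plain = score_documents(["cat"], tf, doc_len, df)
boosted = score_documents(["cat"], tf, doc_len, df, titles={"a.txt": {"cat"}})
print(plain["a.txt"], boosted["a.txt"])
```

Note that the idf in variants 2 and 3 becomes zero or negative for terms whose document frequency approaches the total number of documents or max(df), which is worth keeping in mind when comparing the rankings.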

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization

ASSUMPTIONS:

All files in the dataset have distinct names; no file name is duplicated across the folders.

Task - 3 : Tf-Idf based vector space document retrieval

For each query, your system will output top k documents based on a cosine similarity between query and document vector. Give special attention to the terms in the document title and analyze the change in result with and without attention to terms in title.

METHODOLOGY:

Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store the term frequency of each term in each document as a dictionary of dictionaries. The data structure looks like this: {filename, {term, term_frequency}}
For each document, store the document length in a dictionary: {filename, document_length}
Find the document frequency of every term across all the documents and store it in a dictionary: {word, doc_freq}
For each term present in the input query, find the unweighted tf-idf score using the formula:
- Tf-idf = (tf / doc_length) × (1 + log(Total no. of docs / df)) {log is base 10}
Store the tf-idf score of every query term for each document in a dictionary keyed by document and term. This works as the document vector.
Find the query vector.
Find the cosine similarity between the query vector and each document vector, and store it against each document.
Find the top K documents by sorting the cosine similarity values from highest to lowest. This is unweighted tf-idf based document retrieval with the normalized tf and normalized idf formula.
To give weightage to the title, parse the index.html documents in the SRE folder as well as outside it, and keep track of each document's title in a dictionary.
For each term in the input query, if the term is present in a document's title, increase its tf-idf score for that document to 1.5 times the original value, and then compute the cosine similarity as in the cosine similarity step above. This gives weighted tf-idf based document retrieval.
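
The cosine similarity and ranking steps can be sketched as below. The names `cosine` and `rank_by_cosine` are hypothetical, the vectors are assumed to be sparse {term: weight} dictionaries, and the weights in the toy data are made up for illustration. The README does not say how the query vector is weighted, so the example simply gives every query term a weight of 1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs, k, titles=None, boost=1.5):
    """Rank documents by cosine similarity, optionally boosting title terms."""
    scores = {}
    for name, vec in doc_vecs.items():
        if titles:
            title = titles.get(name, set())
            # Multiply the weight of every term that also occurs in the title.
            vec = {t: w * boost if t in title else w for t, w in vec.items()}
        scores[name] = cosine(query_vec, vec)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical vectors with made-up weights.
query_vec = {"cat": 1.0, "sat": 1.0}
doc_vecs = {
    "a.txt": {"cat": 0.9, "sat": 0.1},
    "b.txt": {"sat": 0.5},
    "c.txt": {},
}
print(rank_by_cosine(query_vec, doc_vecs, k=2))
print(rank_by_cosine(query_vec, doc_vecs, k=2, titles={"a.txt": {"sat"}}))
```

Because cosine similarity is invariant to scaling a whole vector, the title boost changes a document's score only when it alters the relative proportions of its term weights.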

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization

ASSUMPTIONS:

All files in the dataset have distinct names; no file name is duplicated across the folders.
Each word occurs only once in the input query.

Task - 4 : Edit Distance

Download the dictionary from [http://www.gwicks.net/dictionaries.htm] (UK ENGLISH - 65,000 words). Take a sentence as input from the user. For each non-dictionary word present in the sentence, suggest the top k words on the basis of minimum edit distance. The cost of each operation is defined as:

Insert: 2
Delete: 1
Replace: 3
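
With these costs, the edit distance can be computed by the usual dynamic-programming table. The sketch below is one way to do it, not necessarily how Task4.py does; the function name `edit_distance` is hypothetical. Since insert and delete have different costs, the direction matters: the code assumes the distance is the cost of turning `source` (the misspelled word) into `target` (the dictionary word).

```python
def edit_distance(source, target, ins=2, dele=1, rep=3):
    """Minimum cost of turning `source` into `target` with weighted operations."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Turning a prefix of source into the empty string: deletions only.
    for i in range(1, rows):
        dist[i][0] = i * dele
    # Turning the empty string into a prefix of target: insertions only.
    for j in range(1, cols):
        dist[0][j] = j * ins
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + dele,                      # delete from source
                dist[i][j - 1] + ins,                       # insert into source
                dist[i - 1][j - 1] + (0 if same else rep),  # match or replace
            )
    return dist[-1][-1]


print(edit_distance("cat", "cart"))  # one insertion
print(edit_distance("cart", "cat"))  # one deletion
```

With these particular costs a replacement (3) costs exactly as much as a deletion followed by an insertion (1 + 2), so both routes give the same total.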

METHODOLOGY:

Read the file from the folder (english2) and store the words in a list (namely, wordList).
For the input query, perform tokenization, punctuation removal and lemmatization.
For each word in the input string:
- Check whether the word is stored in wordList.
- If the word is not in wordList, calculate its edit distance from every word in wordList and store the results in a dictionary that maps each word to its edit distance from the current input word.
- Sort the dictionary by value (the editDistance value) in ascending order.
- From the sorted dictionary, select the top K words and display them.
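
The suggestion loop above might look like the following sketch. The names `weighted_distance` and `suggest` and the small word list are hypothetical, and the input is assumed to be already tokenized. Two substitutions are made relative to the description: the distance uses a two-row table to save memory, and `heapq.nsmallest` is used in place of sorting the full dictionary, which returns the same K smallest entries.

```python
import heapq


def weighted_distance(source, target, ins=2, dele=1, rep=3):
    """Two-row weighted edit distance from `source` to `target`."""
    prev = [j * ins for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [i * dele]
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + dele,                                # delete
                curr[j - 1] + ins,                             # insert
                prev[j - 1] + (0 if s_char == t_char else rep),  # match/replace
            ))
        prev = curr
    return prev[-1]


def suggest(tokens, word_list, k):
    """Map each non-dictionary token to its k closest dictionary words."""
    known = set(word_list)  # set membership is faster than scanning the list
    suggestions = {}
    for word in tokens:
        if word in known:
            continue
        # Ties in distance are broken alphabetically.
        suggestions[word] = heapq.nsmallest(
            k, word_list, key=lambda cand: (weighted_distance(word, cand), cand)
        )
    return suggestions


# Hypothetical miniature word list standing in for the 65,000-word dictionary.
word_list = ["cat", "cart", "care", "dog", "cot"]
print(suggest(["cat", "catt"], word_list, k=2))
```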

PREPROCESSING STEPS:

Tokenization
Punctuation Removal
Lemmatization
