For the first three tasks, download the dataset from [http://archives.textfiles.com/stories.zip].
For each query, your system will output the top k documents based on Jaccard score.
- Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store them as a list of tuples, where each tuple is (filename, set of words in the file).
- For each word set stored in a tuple, perform the following steps:
- For each word in the input query string, check whether it is present in the word set.
- If the word is present in the word set, increment a counter.
- The counter then holds the number of query words present in the document, i.e. the size of the intersection.
- Find the union of the words in the document and the words in the query.
- The Jaccard coefficient (intersection size divided by union size) can then be stored against each filename in a dictionary: {filename, Jaccard_coefficient_val}
- Find the top K documents, ranked from highest to lowest Jaccard coefficient.
- Tokenization
- Punctuation Removal
- Lemmatization
- Every file in the folder has a distinct name. No file name is duplicated across the folders as a whole.
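The Jaccard steps above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function names (`preprocess`, `jaccard`, `top_k_jaccard`) and the tiny in-memory corpus are assumptions made for the example, and lemmatization is left out so the snippet runs with the standard library only (a lemmatizer such as NLTK's `WordNetLemmatizer` could be applied to each token inside `preprocess`).

```python
import string

def preprocess(text):
    """Lowercase, remove punctuation, and tokenize on whitespace.
    Lemmatization is intentionally omitted in this sketch."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def jaccard(query_words, doc_words):
    """Jaccard coefficient = |intersection| / |union| of two word sets."""
    union = query_words | doc_words
    if not union:
        return 0.0
    return len(query_words & doc_words) / len(union)

def top_k_jaccard(query, docs, k):
    """docs is a list of (filename, set_of_words) tuples, as in the steps above."""
    query_words = set(preprocess(query))
    scores = {name: jaccard(query_words, words) for name, words in docs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical corpus standing in for the files read from disk.
docs = [
    ("a.txt", set(preprocess("The cat sat on the mat."))),
    ("b.txt", set(preprocess("A dog barked at the cat!"))),
    ("c.txt", set(preprocess("Completely unrelated words here"))),
]
result = top_k_jaccard("the cat sat", docs, k=2)
print(result)
```

Using set intersection directly replaces the explicit counter loop; both give the same intersection size.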
For each query, your system will output the top k documents based on tf-idf matching score. Implement different versions of tf-idf based document retrieval, then compare them and analyze which performs better and why. Give special attention to the terms in the document title, and analyze how the results change with and without attention to the title terms.
- Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store them as a dictionary of dictionaries holding the term frequency for each term and document. The data structure would look something like this: {filename, {term, term_frequency}}
- For each document, store the document length in a dictionary: {filename, document_length}
- Find the document frequency of every term present in the documents and store it in a dictionary: {word, doc_freq}
- For each term present in the input query, find the unweighted tf-idf score using the formula:
- tf-idf = (tf / doc_length) × (1 + log(N / df)), where N is the total number of documents and df is the document frequency of the term {log is base 10}
- Sum the tf-idf scores of all the terms present in the query and store the total against each document.
- Find the top K documents, ranked from highest to lowest tf-idf score. This gives unweighted tf-idf based document retrieval with the normalized tf and normalized idf formulas.
- To give weight to the title, parse the index.html file in the SRE folder as well as the one outside it, and keep track of each document's title in a dictionary.
- For each term in the input query, if that term is present in the title of a document, increase its tf-idf score for that document to 1.5 times the original value. This gives weighted tf-idf based document retrieval.
- This process is repeated with the different tf-idf variants in Step 4.
- Variant – 2
- For each term in a document, log normalize the tf:
- Tf = 1 + log(tf)
- Modify the idf to idf smooth:
- Idf = log(Total docs/(1 + df))
- Variant – 3
- For each term in a document, apply double normalization K to the tf:
- Tf = K + (1-K)(tf/max(tf in doc.)) [K = 0.5]
- Modify the idf to idf max:
- Idf = log((max(df))/(1+df))
- Tokenization
- Punctuation Removal
- Lemmatization
- Every file in the folder has a distinct name. No file name is duplicated across the folders as a whole.
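A rough sketch of the tf-idf scoring and its three variants is given below. Several things here are assumptions for illustration rather than part of the original write-up: the function names (`build_index`, `weight`, `top_k_tfidf`), the use of base-10 logarithms for Variants 2 and 3 (the text states the base only for the first formula), the `titles` mapping (assumed to be already parsed from index.html into {filename: set of title terms}), and the choice to apply the 1.5 boost to the weight of each query term that appears in the title.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: {filename: list of preprocessed tokens}.
    Returns term frequencies, document lengths, and document frequencies."""
    tf = {name: Counter(tokens) for name, tokens in docs.items()}
    doc_len = {name: len(tokens) for name, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, doc_len, df

def weight(variant, raw_tf, doc_length, max_tf, df_t, n_docs, max_df):
    """tf-idf weight of one term in one document (raw_tf must be > 0)."""
    if variant == 1:  # length-normalized tf, 1 + log idf
        return (raw_tf / doc_length) * (1 + math.log10(n_docs / df_t))
    if variant == 2:  # log-normalized tf, smoothed idf (can be negative)
        return (1 + math.log10(raw_tf)) * math.log10(n_docs / (1 + df_t))
    if variant == 3:  # double normalization K = 0.5, idf max
        return (0.5 + 0.5 * raw_tf / max_tf) * math.log10(max_df / (1 + df_t))
    raise ValueError("unknown variant")

def top_k_tfidf(query_terms, docs, k, variant=1, titles=None, title_boost=1.5):
    tf, doc_len, df = build_index(docs)
    n_docs = len(docs)
    max_df = max(df.values(), default=1)
    scores = {}
    for name, counts in tf.items():
        max_tf = max(counts.values(), default=1)
        title_terms = titles.get(name, set()) if titles else set()
        total = 0.0
        for term in query_terms:
            raw_tf = counts.get(term, 0)
            if raw_tf == 0:
                continue  # term absent from this document
            w = weight(variant, raw_tf, doc_len[name], max_tf,
                       df[term], n_docs, max_df)
            if term in title_terms:
                w *= title_boost  # assumed per-term title weighting
            total += w
        scores[name] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical corpus and title index.
docs = {
    "a.txt": ["cat", "sat", "mat"],
    "b.txt": ["dog", "sat", "log", "dog"],
    "c.txt": ["bird", "flew"],
}
plain = dict(top_k_tfidf(["dog"], docs, k=3))
boosted = dict(top_k_tfidf(["dog"], docs, k=3, titles={"b.txt": {"dog"}}))
print(plain["b.txt"], boosted["b.txt"])
```

Passing `variant=2` or `variant=3` reruns the same ranking with the other formulas, which makes a side-by-side comparison of the variants straightforward.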
For each query, your system will output the top k documents based on the cosine similarity between the query vector and the document vector. Give special attention to the terms in the document title, and analyze how the results change with and without attention to the title terms.
- Read all the files (except .descs, .header, .musings and index.html) from all the folders (except FARNON) and store them as a dictionary of dictionaries holding the term frequency for each term and document. The data structure would look something like this: {filename, {term, term_frequency}}
- For each document, store the document length in a dictionary: {filename, document_length}
- Find the document frequency of every term present in the documents and store it in a dictionary: {word, doc_freq}
- For each term present in the input query, find the unweighted tf-idf score using the formula:
- tf-idf = (tf / doc_length) × (1 + log(N / df)), where N is the total number of documents and df is the document frequency of the term {log is base 10}
- Store the tf-idf scores of all the terms present in the query, per document and per term, in a dictionary. This serves as the document vector.
- Find the query vector.
- Find the cosine similarity between the query and each document, and store it against each document.
- Find the top K documents, ranked from highest to lowest cosine similarity. This gives unweighted tf-idf based document retrieval with the normalized tf and normalized idf formulas.
- To give weight to the title, parse the index.html file in the SRE folder as well as the one outside it, and keep track of each document's title in a dictionary.
- For each term in the input query, if that term is present in the title of a document, increase its tf-idf score for that document to 1.5 times the original value, and then find the cosine similarity as in Step 7. This gives weighted tf-idf based document retrieval.
- Tokenization
- Punctuation Removal
- Lemmatization
- Every file in the folder has a distinct name. No file name is duplicated across the folders as a whole.
- Each word appears only once in the input query.
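The cosine-similarity ranking can be sketched as below. The text does not say how the query vector is built, so this example assumes each query term has tf = 1 (consistent with every word appearing once in the query) and is weighted by the same idf as the documents. The function names, the `titles` mapping, and the sample corpus are likewise assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_cosine(query_terms, docs, k, titles=None, title_boost=1.5):
    """docs: {filename: list of preprocessed tokens}."""
    n_docs = len(docs)
    tf = {name: Counter(tokens) for name, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    # idf per query term; terms unseen in the corpus get weight 0.
    idf = {t: (1 + math.log10(n_docs / df[t])) if df[t] else 0.0
           for t in query_terms}
    # Assumed query vector: tf = 1 for each query term, times idf.
    query_vec = [idf[t] for t in query_terms]
    scores = {}
    for name, tokens in docs.items():
        length = len(tokens) or 1  # guard against empty documents
        title_terms = titles.get(name, set()) if titles else set()
        doc_vec = []
        for t in query_terms:
            w = (tf[name][t] / length) * idf[t]
            if t in title_terms:
                w *= title_boost  # assumed per-term title weighting
            doc_vec.append(w)
        scores[name] = cosine(query_vec, doc_vec)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical corpus.
docs = {
    "a.txt": ["cat", "sat", "mat"],
    "b.txt": ["cat", "dog", "dog", "dog"],
    "c.txt": ["bird", "flew"],
}
ranking = top_k_cosine(["cat", "sat"], docs, k=2)
print(ranking)
```

One consequence of restricting the vectors to query terms is that a single-term query gives every document containing that term a similarity of 1, so this setup is only informative for queries with two or more terms.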
Download the dictionary from [http://www.gwicks.net/dictionaries.htm] (UK ENGLISH - 65,000 words). Take a sentence as input from the user. For each non-dictionary word present in the sentence, suggest the top k words on the basis of minimum edit distance. The cost of each operation is defined as:
- Insert: 2
- Delete: 1
- Replace: 3
- Read the file from the folder (english2) and store its words in a list (namely, wordList).
- For the input query, perform tokenization, punctuation removal and lemmatization.
- For each word in the input string:
- Check whether the word is stored in wordList.
- If the word is not in wordList, calculate its edit distance to every word in wordList and store the results in a dictionary mapping each candidate word to its edit distance from the current input word.
- Sort the dictionary by value (the editDistance value) in ascending order.
- From the sorted dictionary, select the top K entries and display them.
- Tokenization
- Punctuation Removal
- Lemmatization
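The weighted edit distance and the suggestion step can be sketched as follows. The text does not state the direction of the edit, so this example assumes the input word is transformed into the dictionary word (insert costs 2, delete costs 1, replace costs 3). The function names and the short word list are hypothetical stand-ins for the real wordList.

```python
def edit_distance(source, target, ins_cost=2, del_cost=1, rep_cost=3):
    """Minimum cost to turn `source` into `target` with weighted operations."""
    m, n = len(source), len(target)
    # dp[i][j] = cost of converting source[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost  # delete every source character
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost  # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + del_cost,                      # delete
                dp[i][j - 1] + ins_cost,                      # insert
                dp[i - 1][j - 1] + (0 if same else rep_cost)  # match or replace
            )
    return dp[m][n]

def suggest(tokens, word_list, k):
    """For each token missing from word_list, return the k closest words."""
    vocabulary = set(word_list)  # set lookup for the membership check
    suggestions = {}
    for word in tokens:
        if word in vocabulary:
            continue
        distances = {cand: edit_distance(word, cand) for cand in word_list}
        ranked = sorted(distances.items(), key=lambda kv: kv[1])
        suggestions[word] = ranked[:k]
    return suggestions

# Hypothetical word list standing in for the downloaded dictionary.
word_list = ["cat", "cart", "coat", "dog"]
out = suggest(["caat", "dog"], word_list, k=2)
print(out)
```

Because the costs are asymmetric, the distance depends on direction: turning "caat" into "cat" needs one delete, while turning "cat" into "caat" needs one insert. With replace priced at 3, a replacement never costs less than a delete followed by an insert.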