rastogi-s/Search-Engine-Implementation
Final Project
==============

The main task of the project is to design and build our own information retrieval systems, then evaluate and compare them in terms of retrieval effectiveness.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SYNOPSIS
=========

Four distinct retrieval models were used to build our search engines:

• BM25 Retrieval Model
• TF-IDF Retrieval Model
• Smoothed Query Likelihood Model
• Lucene’s default Retrieval Model

We report the top 100 ranked documents for each query (one list per run per retrieval system). We also take the ranked list generated by the BM25 retrieval model and apply pseudo relevance feedback on top of it, using a modified version of Rocchio's algorithm, to perform query enrichment.

We also applied the stopping and stemming text transformation techniques to the given CASM corpus, then used the BM25, TF-IDF and Query Likelihood retrieval models to generate the top 100 ranked documents.

Snippet generation and a query term highlighting mechanism were also implemented in this project.

The retrieval effectiveness of the various runs was analysed by computing the following measures:

• Mean Average Precision
• Mean Reciprocal Rank
• Precision @ rank K, where K = 5 and K = 20
• Precision and Recall

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

GENERAL USAGE NOTES
====================

1. This file contains steps and instructions about what software needs to be installed and how the programs need to be run in a Windows environment.
2. The project is implemented in a Windows environment and may or may not work on other operating systems.
3. The instructions in this file might not match the procedure for other operating systems such as Mac OS, Ubuntu OS, etc.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

INSTALLATION GUIDE
===================

> Download Python 2.7 from : "https://www.python.org/download/releases/2.7/"

> Set Environment variables for Python
  [for detailed steps refer : "https://docs.python.org/2/using/windows.html" ]

> Install BeautifulSoup by following the steps below:
  1. Open command prompt (cmd) in Windows.
  2. Run Command : 'pip install BeautifulSoup4'

> Install lxml by following the steps below:
  1. Open command prompt (cmd) in Windows.
  2. Run Command : 'pip install lxml'

> Lucene 4.7.2
  Download and install Lucene from
  https://lucene.apache.org/
  https://archive.apache.org/dist/lucene/java/4.7.2/

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

COMPILE AND RUN INSTRUCTIONS
=============================

Open the project; all relevant files are present there.

PHASE 1
=========

TASK 1
======

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console then prints the following menu:

   Enter 1 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) on CASM corpus without any Text Transformation(no stemming and no stopping)
   Enter 2 : To Perform One BaseLine Run(Default BM25) on CASM corpus and perform query enrichment
   Enter 3 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) on CASM corpus with Text Transformation(only stopping)
   Enter 4 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) on CASM corpus with Text Transformation(only stemming)
   Enter 5 : Display snippets and query highlighting for results of one of the BaseLine Run (Default BM25)
   Enter 6 : Evaluate the performance of each run performed above in terms of effectiveness(except stemming run)
   Enter 7 : Induce noise in the casm queries and perform one of the baseline run and compare the overall effectiveness with other baseline runs
   Enter 8 : Perform Soft matching query handling
   Enter 9 : Perform all tasks and generate all outputs
   Enter 10 : To Initialize the project i.e. delete all output files and folders
   Enter 11 : To exit!!!!

5. Enter 1 to perform Task 1.
6. The program generates the tokenized corpus in the path ../SearchEngiene/CorpusGeneration/TokenizedCorpus and writes 3 files to the path ../SearchEngine/Retrieval/No Text Transformation Runs Output. These are three of the four baseline runs (the fourth, Lucene, is produced separately as described below):

   BM25   - "Top_100_Query_Result_BM25.txt"
   TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
   SQLM   - "Top_100_Query_Result_QueryLikelihoodModel.txt"

Lucene
======

1. Create a new Java project using Eclipse or any other IDE.
2. Create this project in the same directory where the previous project (SearchEngiene project) is located, and add 3 referenced libraries:
   1. lucene-core-VERSION.jar
   2. lucene-queryparser-VERSION.jar
   3. lucene-analyzers-common-VERSION.jar
3. Run LuceneImplementation.java, which is placed in the "Retrieval" folder, either with Eclipse's Run command or from the command line. Note that "javac LuceneImplementation.java" only compiles the file; the compiled class must then be started with "java", with the three Lucene jars on the classpath.
4. The console will ask for three paths:
   a. The folder where the index will be created
   b. The location of the corpus (same path as in the SearchEngiene project)
   c. The location of the query log (same path as in the SearchEngiene project)
5. After reading these inputs, the program tokenizes and indexes the textual corpus and ranks the documents for each query.
6. The output is written to a file named "Top_100_Query_Result_Lucene.txt".

TASK 2
======

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 2 to perform Task 2.
6. The program fetches the query map from "RetrievalModels.py" in the "Retrieval" folder, generates the document scores per query and performs pseudo relevance feedback on the BM25 model.
7. It writes the output to the folder "../Search Engiene/SearchEngine/Retrieval/Query Enrichment Runs Output", which contains one file named "Top_100_Query_Result_BM25.txt".
8. It also generates a file named "EnrichedCasmQueries.txt" in the path ../SearchEngine/QueryEnhancement, which contains the enriched (modified) queries.
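
The query enrichment step in Task 2 can be pictured with the small sketch below. It is not the code in this repository: the README does not say how Rocchio's algorithm was modified, so the sketch assumes the common pseudo relevance feedback variant that treats the top k documents as relevant and drops the non-relevant term. The function name `rocchio_expand` and the parameters `k`, `n_terms`, `alpha` and `beta` are hypothetical.

```python
from collections import Counter


def rocchio_expand(query_terms, ranked_docs, k=10, n_terms=5,
                   alpha=1.0, beta=0.75):
    """Return the query terms plus the n_terms best feedback terms.

    ranked_docs is a list of token lists, best-scoring document first.
    Assumption: no negative (non-relevant) component is used.
    """
    # Original query vector: term counts weighted by alpha.
    weights = Counter()
    for term in query_terms:
        weights[term] += alpha

    # Add the centroid of the top-k documents, which are assumed relevant.
    top_docs = ranked_docs[:k]
    for doc_tokens in top_docs:
        for term, tf in Counter(doc_tokens).items():
            weights[term] += beta * tf / float(len(top_docs))

    # Keep the highest-weighted terms that are not already in the query.
    # Sorting on (-weight, term) makes ties deterministic.
    candidates = [(w, t) for t, w in weights.items() if t not in query_terms]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(query_terms) + [t for _, t in candidates[:n_terms]]
```

The enriched query is then run through BM25 a second time; in practice the feedback documents would usually be stopped first so that very common words do not dominate the expansion terms.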
TASK 3A
=======

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 3 to perform Task 3A.
6. The program runs the BM25, TF-IDF and SQLM models and produces the following files in the program-generated folder "Stopped Baseline Runs Output" in the path ../SearchEngine/Retrieval/Stopped Baseline Runs Output:
   1. For BM25   - "Top_100_Query_Result_BM25.txt"
   2. For SQLM   - "Top_100_Query_Result_QueryLikelihoodModel.txt"
   3. For TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
7. It also generates another folder named "TokenizedCorpusWithStopping", which contains the tokenized stopped corpus, in the path ../SearchEngine/CorpusGeneration/TokenizedCorpusWithStopping.
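
Stopping itself is a simple filter over the tokenized text. The sketch below shows the idea only; the function name `remove_stopwords` is hypothetical, and the stop list is passed in as a plain list because this README does not name the stop list file the project reads.

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop list."""
    stop_set = set(stopwords)  # set lookup keeps the filter linear in len(tokens)
    return [t for t in tokens if t not in stop_set]


# Example: the same filter has to be applied to the queries as well,
# otherwise stopped documents are matched against unstopped queries.
print(remove_stopwords(["the", "time", "sharing", "system"], ["the", "of"]))
```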
TASK 3B
=======

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 4 to perform Task 3B.
6. The program runs the BM25, TF-IDF and SQLM models and produces the following files in the program-generated folder "Stemmed Baseline Runs Output" in the path ../SearchEngine/Retrieval/Stemmed Baseline Runs Output:
   1. For BM25   - "Top_100_Query_Result_BM25.txt"
   2. For SQLM   - "Top_100_Query_Result_QueryLikelihoodModel.txt"
   3. For TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
7. It also generates another folder named "TokenizedCorpusWithStemming", which contains the tokenized stemmed corpus, in the path ../SearchEngine/CorpusGeneration/TokenizedCorpusWithStemming.
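
Tasks 1, 3A and 3B all repeat the same scoring runs on different versions of the corpus. As an illustration of what one such run computes, here is a minimal BM25 sketch. It is written under assumptions, not taken from RetrievalModels.py: it uses the non-negative IDF form ln(1 + (N - n + 0.5) / (n + 0.5)), the usual defaults k1 = 1.2 and b = 0.75, and no relevance information or query term frequency factor. The name `bm25_scores` and the dictionary layout are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document for one query.

    docs maps a document id to its list of tokens.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / float(n_docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Length normalisation component shared by all terms of this document.
        norm = k1 * ((1 - b) + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores
```

Sorting the returned dictionary by score in descending order and keeping the first 100 entries gives a ranked list of the same shape as the "Top_100_Query_Result_BM25.txt" files.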
PHASE 2
========

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 5 to perform the Phase 2 task.
6. The program internally runs DisplayResult.py and generates a folder named "Retrieved_Docments_with_snippets" in the Display folder, in the path ../SearchEngine/Display/Retrieved_Docments_with_snippets. This folder contains the snippets for each document per query, along with query highlighting; the highlighted words are shown in uppercase.
7. The snippets are generated for the BM25 run.
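
The README does not describe how DisplayResult.py chooses a snippet, so the following is only a sketch of one simple approach: slide a fixed-size window over the document, keep the window that contains the most query terms, and upper-case those terms as described above. The name `make_snippet` and the `window` parameter are hypothetical.

```python
def make_snippet(doc_tokens, query_terms, window=10):
    """Return the window of tokens with the most query-term hits,
    with the hits shown in uppercase."""
    query_set = set(t.lower() for t in query_terms)
    best_start, best_hits = 0, -1
    last_start = max(len(doc_tokens) - window, 0)
    for start in range(last_start + 1):
        hits = sum(1 for t in doc_tokens[start:start + window]
                   if t.lower() in query_set)
        # Strict comparison keeps the earliest window among equal ones.
        if hits > best_hits:
            best_start, best_hits = start, hits
    chosen = doc_tokens[best_start:best_start + window]
    return " ".join(t.upper() if t.lower() in query_set else t
                    for t in chosen)


tokens = "the system uses time sharing for many users at once".split()
print(make_snippet(tokens, ["time", "sharing"], window=4))
# prints: system uses TIME SHARING
```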
PHASE 3
=======

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 6 to perform the Phase 3 task.
6. The program generates the folder "OutputFiles" in the path ../SearchEngine/Evaluation/OutputFiles. It contains the following files:
   1. EvalFileFor-NoTextTran-BM25
   2. EvalFileFor-NoTextTran-Lucene
   3. EvalFileFor-NoTextTran-SQLM
   4. EvalFileFor-NoTextTran-TF-IDF
   5. EvalFileFor-QueryRefinement-BM25
   6. EvalFileFor-Stopped-BM25
   7. EvalFileFor-Stopped-SQLM
   8. EvalFileFor-Stopped-TF-IDF
7. Each file contains, per query, the rank of each document with its precision and recall, the P@5 and P@20 values, and the overall MAP and MRR values across all 64 queries.
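
The measures written to these files follow their standard definitions, which the sketch below spells out. It is an illustration rather than the project's evaluation code: the function names and the input layout (ranked lists and relevance sets keyed by query id) are hypothetical, and skipping queries that have no relevance judgments is an assumption.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / float(k)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / float(rank)
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, judgments):
    """runs: query id -> ranked doc ids; judgments: query id -> set of relevant ids."""
    # Assumption: queries without any relevance judgment are left out of the means.
    qids = [q for q in runs if judgments.get(q)]
    if not qids:
        return {"MAP": 0.0, "MRR": 0.0}
    n = float(len(qids))
    return {
        "MAP": sum(average_precision(runs[q], judgments[q]) for q in qids) / n,
        "MRR": sum(reciprocal_rank(runs[q], judgments[q]) for q in qids) / n,
    }
```

Recall at a given rank is computed the same way as precision, with the number of relevant documents for the query as the denominator instead of the rank.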
EXTRA CREDIT(Task1)
===================

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 7 to perform Task 1 of Extra credit.
6. The program runs the file SpellingErrorGenerator.py.
7. It generates a text file called "SpellingErrorInducedQueries.txt" inside a folder called SpellingErrorGenerator in the path ../SearchEngine/SpellErrorGenerator.
8. It also generates a file named "EvalFileFor-ErrorInducedQuery-BM25" in the folder path ../SearchEngine/Evaluation/OutputFiles.
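
This README does not state which noise model SpellingErrorGenerator.py uses, so the sketch below is purely illustrative. It assumes one plausible model: in a chosen fraction of the longer query words, two adjacent interior characters are swapped while the first and last characters stay in place. The name `induce_spelling_errors` and the parameters `error_rate` and `seed` are hypothetical.

```python
import random


def induce_spelling_errors(query, error_rate=0.4, seed=None):
    """Swap two adjacent interior characters in a fraction of the longer words."""
    rng = random.Random(seed)
    words = query.split()
    # Only words with at least two interior characters can be altered.
    eligible = [i for i, w in enumerate(words) if len(w) >= 4]
    n_errors = int(round(error_rate * len(eligible)))
    for i in rng.sample(eligible, n_errors):
        chars = list(words[i])
        # pos and pos + 1 are both interior, so first and last characters stay.
        pos = rng.randint(1, len(chars) - 3)
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
        words[i] = "".join(chars)
    return " ".join(words)
```

Passing a fixed `seed` makes the noisy query set reproducible, which matters when the noisy run is compared against the clean baseline runs.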
EXTRA CREDIT(Task2)
==================

1. Open Windows PowerShell.
2. Perform the following steps in sequence.
3. Run Pipeline.py using the command "python Pipeline.py".
4. The console prints the same menu as shown under PHASE 1, TASK 1.
5. Enter 8 to perform Task 2 of Extra credit.
6. The program runs the file SoftMatchingQuerHandler.py.
7. It generates a file named "EvalFileFor-SoftQueryMatching-BM25" in the folder path ../SearchEngine/Evaluation/OutputFiles.
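
The citations list the Damerau-Levenshtein distance, so soft matching presumably maps a misspelled query term to the closest term in the index vocabulary. The sketch below shows that idea under stated assumptions: it uses the restricted (optimal string alignment) form of the distance, and the function names `osa_distance` and `closest_term` as well as the `max_distance` cut-off are hypothetical, not taken from SoftMatchingQuerHandler.py.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def closest_term(term, vocabulary, max_distance=2):
    """Return the vocabulary term nearest to term, or term itself if none is close."""
    if term in vocabulary or not vocabulary:
        return term
    # Ties on distance are broken alphabetically to keep the result stable.
    best = min(vocabulary, key=lambda v: (osa_distance(term, v), v))
    return best if osa_distance(term, best) <= max_distance else term
```

Each query term would be passed through `closest_term` before the BM25 run; comparing against every vocabulary term is quadratic per pair and linear in vocabulary size, which is acceptable for a small corpus but would need an index for a larger one.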
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

CONTRIBUTORS AND CITATIONS
===========================

https://en.wikipedia.org/wiki/Stemming
https://en.wikipedia.org/wiki/Search_engine_indexing
https://en.wikipedia.org/wiki/Apache_Lucene
https://en.wikipedia.org/wiki/Okapi_BM25
https://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
https://en.wikipedia.org/?title=Mean_average_precision&redirect=no
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)
https://en.wikipedia.org/wiki/BM25
https://en.wikipedia.org/wiki/Mean_reciprocal_rank
https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
http://www.pythonforbeginners.com/beautifulsoup/
https://dl.acm.org/citation.cfm?id=1165776
https://lucene.apache.org/core/
https://lucene.apache.org/core/3_5_0/scoring.html
https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
https://www.python.org/downloads/
http://www.cs.ucr.edu/~vagelis/classes/CS172/publications/jasistSalton1990.pdf
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Shubham Rastogi
Phone - +1-857-352-3693
Email - [email protected]

Abhidipta Sengupta
Phone - +1-617-817-3640
Email - [email protected]