GitHub - rastogi-s/Search-Engine-Implementation: Search engine implementation using BM25, TF-IDF, Smoothed Query Likelihood Retrieval models in python


Final Project
==============
The main task of the project is to design and build our own information retrieval systems, and to
evaluate and compare their retrieval effectiveness.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SYNOPSIS
=========

Our search engines are built on four distinct retrieval models:

•	BM25 Retrieval Model
•	TF-IDF Retrieval Model
•	Smoothed Query Likelihood Model
•	Lucene’s default Retrieval Model
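
As an illustration of how the first of these models ranks documents, here is a minimal sketch of BM25
scoring over a toy in-memory collection. The function name, the toy documents, the parameter values
k1 = 1.2 and b = 0.75, and the "+ 1" inside the logarithm (which keeps the IDF non-negative) are
assumptions made for this example; the query term frequency factor is left out. It is not the code from
the Retrieval folder.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (doc_id -> list of tokens) for a query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / float(n_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        # Longer-than-average documents get a larger normalizer, hence a lower score.
        length_norm = k1 * ((1 - b) + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores

# Hypothetical three-document collection.
docs = {
    "D1": "information retrieval system".split(),
    "D2": "database system design".split(),
    "D3": "retrieval of information from text".split(),
}
ranked = sorted(bm25_scores(["information", "retrieval"], docs).items(),
                key=lambda item: item[1], reverse=True)
```

D1 and D3 both contain each query term once, but D1 is shorter than the average document, so the length
normalization ranks it first; D2 contains neither term and scores 0.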

We report the top 100 ranked documents for each query (one list per run per retrieval system). We then
take the ranked list generated by the BM25 retrieval model and apply pseudo relevance feedback on top of
it, using a modified version of Rocchio's algorithm, to perform query enrichment.
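
The query enrichment idea can be sketched as follows. This is a plain Rocchio-style expansion, not the
modified version used in the project: the top-ranked documents are assumed to be relevant, the rest are
treated as non-relevant, and the highest-weighted new terms are appended to the query. The function name,
the weights alpha, beta and gamma, and the cut-offs are all hypothetical.

```python
from collections import Counter

def expand_query(query_terms, ranked_doc_ids, docs, top_k=10, n_new_terms=5,
                 alpha=1.0, beta=0.75, gamma=0.15):
    """Return the query plus the highest-weighted feedback terms."""
    relevant = ranked_doc_ids[:top_k]        # assumed relevant (pseudo feedback)
    non_relevant = ranked_doc_ids[top_k:]    # everything ranked below the cut-off
    weights = Counter()
    for term, count in Counter(query_terms).items():
        weights[term] += alpha * count
    for doc_id in relevant:
        for term, count in Counter(docs[doc_id]).items():
            weights[term] += beta * count / len(relevant)
    for doc_id in non_relevant:
        for term, count in Counter(docs[doc_id]).items():
            weights[term] -= gamma * count / len(non_relevant)
    # Keep only new terms with a positive weight, best first (ties broken alphabetically).
    candidates = [t for t in weights if t not in query_terms and weights[t] > 0]
    candidates.sort(key=lambda t: (-weights[t], t))
    return list(query_terms) + candidates[:n_new_terms]

# Hypothetical ranked list and tokenized documents.
docs = {
    "D1": ["parallel", "sort", "merge"],
    "D2": ["parallel", "merge", "merge"],
    "D3": ["database", "index"],
}
expanded = expand_query(["parallel"], ["D1", "D2", "D3"], docs, top_k=2, n_new_terms=1)
```

Here "merge" occurs three times in the two feedback documents, so it receives the largest weight among the
new terms and is the one added to the query.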

We also applied the stopping and stemming text transformation techniques to the given CASM corpus and then
used the BM25, TF-IDF and Query Likelihood retrieval models to generate the top 100 ranked documents.
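
The step of producing a top 100 list can be illustrated with TF-IDF, the simplest of the three models. The
sketch below assumes length-normalized term frequency and a natural-log IDF; the weighting variant used in
the project may differ, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_top_k(query_terms, docs, k=100):
    """Return the k best (doc_id, score) pairs; docs maps doc_id -> list of tokens."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / float(len(tokens))            # normalized term frequency
                idf = math.log(n_docs / float(doc_freq[term]))    # inverse document frequency
                score += tf * idf
        scores[doc_id] = score
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

docs = {
    "D1": ["parallel", "algorithm"],
    "D2": ["parallel", "parallel", "sort"],
    "D3": ["sort", "algorithm"],
}
top = tfidf_top_k(["parallel"], docs, k=100)
```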

A snippet generation and query term highlighting mechanism was also implemented in this project.

We also analyzed the retrieval effectiveness of the various runs of the retrieval models 
by computing the following measures:

•	Mean Average Precision
•	Mean Reciprocal Rank
•	Precision @ rank K where K = 5 and K = 20
•	Precision and Recall
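
A minimal sketch of how these measures can be computed from a ranked list and a set of relevant document
ids is given below. The function names and the toy data are assumptions for illustration and are not taken
from the Evaluation folder.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k documents that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / float(k)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / float(rank)
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Hypothetical runs: query id -> (ranked list, set of relevant documents).
runs = {
    "Q1": (["D3", "D1", "D7", "D2"], {"D1", "D2"}),
    "Q2": (["D5", "D9"], {"D5"}),
}
map_score = mean(average_precision(r, rel) for r, rel in runs.values())
mrr_score = mean(reciprocal_rank(r, rel) for r, rel in runs.values())
p_at_2 = precision_at_k(runs["Q1"][0], runs["Q1"][1], 2)
```

For Q1 the relevant documents appear at ranks 2 and 4, giving an average precision of (1/2 + 2/4) / 2 = 0.5;
Q2 has its only relevant document at rank 1, so MAP and MRR over the two queries are both 0.75.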


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

GENERAL USAGE NOTES
====================
1. This file lists the software that needs to be installed and explains how to run the programs
in a Windows environment.
2. The project was implemented in a Windows environment and may or may not work on other operating systems.
3. The instructions in this file might not match the procedure for other operating 
systems such as Mac OS or Ubuntu.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

INSTALLATION GUIDE
===================

		> Download Python 2.7 from : "https://www.python.org/download/releases/2.7/"
		> Set Environment variables for Python [for detailed steps refer : 
		  "https://docs.python.org/2/using/windows.html" ]
		> Install BeautifulSoup by following the steps below:
		       1. Open command prompt (cmd) in Windows.
		       2. Run Command : 'pip install BeautifulSoup4'
			   
		> Install lxml by following the steps below:
		       1. Open command prompt (cmd) in Windows.
		       2. Run Command : 'pip install lxml'
				
		
		> Lucene 4.7.2
			Download and install Lucene from 
			https://lucene.apache.org/
			https://archive.apache.org/dist/lucene/java/4.7.2/
			
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

COMPILE AND RUN INSTRUCTIONS
=============================
Open the project folder; all relevant files are present there.

PHASE 1
=========
	TASK 1
	======
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the following
			Enter 1 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) 
                                  on CASM corpus without any Text Transformation(no stemming and no stopping)
			Enter 2 : To Perform One BaseLine Run(Default BM25) on CASM corpus and perform query
                                  enrichment
			Enter 3 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) on CASM
                                  corpus with Text Transformation(only stopping)
			Enter 4 : To Perform BaseLine Runs(BM25, TF-IDF, Smoothed Query Likelihood) on CASM
                                  corpus with Text Transformation(only stemming)
			Enter 5 : Display snippets and query highlighting for results of one of the BaseLine
                                  Run (Default BM25)
			Enter 6 : Evaluate the performance of each run performed above in terms of 
                                  effectiveness(except stemming run)
			Enter 7 : Induce noise in the casm queries and perform one of the baseline run and 
                                  compare the overall effectiveness with other baseline runs
			Enter 8 : Perform Soft matching query handling
			Enter 9 : Perform all tasks and generate all outputs
			Enter 10 : To Initialize the project i.e. delete all output files and folders
			Enter 11 : To exit!!!!
	5.We enter 1 to perform Task 1
	6.The program will generate the tokenized corpus in the path
          ../SearchEngiene/CorpusGeneration/TokenizedCorpus and it will generate 3 files in the path
          "../SearchEngine/Retrieval/No Text Transformation Runs Output".
	  The folder will contain the three baseline runs, named as follows:
	  BM25 - "Top_100_Query_Result_BM25.txt"
	  TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
	  SQLM - "Top_100_Query_Result_QueryLikelihoodModel.txt"
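
	  The third run above, the smoothed query likelihood model, can be sketched as follows. This README
	  does not say which smoothing method is used; Jelinek-Mercer smoothing with lambda = 0.35 is an
	  assumption made for this example, as are the function name and the toy documents. Documents are
	  assumed to be non-empty.

```python
import math
from collections import Counter

def query_likelihood_scores(query_terms, docs, lam=0.35):
    """Log-likelihood of the query under each document's smoothed language model."""
    collection_counts = Counter()
    for tokens in docs.values():
        collection_counts.update(tokens)
    collection_len = float(sum(collection_counts.values()))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if collection_counts[term] == 0:
                continue  # term never seen in the collection: skipped in this sketch
            p_doc = counts[term] / float(len(tokens))          # document model
            p_coll = collection_counts[term] / collection_len  # collection model
            score += math.log((1 - lam) * p_doc + lam * p_coll)
        scores[doc_id] = score
    return scores

docs = {
    "D1": ["parallel", "algorithm"],
    "D2": ["database", "index"],
}
scores = query_likelihood_scores(["parallel", "algorithm"], docs)
```

	  Because of the collection model, D2 still receives a finite (but lower) score even though it
	  contains neither query term.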

	  
	Lucene
	======
	1.For this, we need to create a new Java project using Eclipse or any other IDE.
	2.Create this project in the same directory where the previous project (SearchEngiene project)
          is located, and add 3 referenced libraries:
		1.lucene-core-VERSION.jar
		2.lucene-queryparser-VERSION.jar
		3.lucene-analyzers-common-VERSION.jar
	3.Then run LuceneImplementation.java, which is placed in the "Retrieval" folder, either by 
          using Eclipse run or by compiling it with the command javac LuceneImplementation.java
          and then running the compiled class with the java command (with the 3 libraries above
          on the classpath).
	4.The console will ask for three paths:
		a.The folder where the index will be created
		b.The location of the corpus (same path as in the SearchEngiene project)
		c.The location of the query log (same path as in the SearchEngiene project)
	5.After getting the input from the user, the program will tokenize and index the textual
          corpus and rank the documents according to each query.
	6.The output will be generated in a file named "Top_100_Query_Result_Lucene.txt".
		
	 
	TASK 2
	======
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under TASK 1
	5.We enter 2 to perform Task 2
	6.The program then fetches the query map from "RetrievalModels.py" in the "Retrieval" 
          folder, generates the document scores per query and performs pseudo relevance feedback
          on the BM25 model.
	7.It generates the output in the folder
          "../Search Engiene/SearchEngine/Retrieval/Query Enrichment Runs Output",
          which will contain one file named "Top_100_Query_Result_BM25.txt".
	8.It also generates a file named "EnrichedCasmQueries.txt" in the path ../SearchEngine/QueryEnhancement
          which contains the enriched or modified queries.
	
	TASK 3A
	=======
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under TASK 1
	5.We enter 3 to perform Task 3A
	6.The program will run the BM25, TF-IDF and SQLM models and produce the following 
          files in the program-generated folder "Stopped Baseline Runs Output", in the path 
          "../SearchEngine/Retrieval/Stopped Baseline Runs Output".
		1.For BM25 - "Top_100_Query_Result_BM25.txt"
		2.For SQLM - "Top_100_Query_Result_QueryLikelihoodModel.txt"
		3.For TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
	7.It also generates another folder named "TokenizedCorpusWithStopping", which contains 
          the tokenized stopped corpus, in the path
          ../SearchEngine/CorpusGeneration/TokenizedCorpusWithStopping.
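
	Stopping itself is a simple filter over the tokens of each document. The sketch below assumes a
	case-insensitive comparison and uses a short hypothetical stop list; the function name is also an
	assumption, and the stop list supplied with the corpus is not reproduced here.

```python
def apply_stopping(tokens, stop_words):
    """Drop every token that appears in the stop list (case-insensitive)."""
    stop_set = set(word.lower() for word in stop_words)
    return [token for token in tokens if token.lower() not in stop_set]

# Hypothetical stop list and document text.
stop_words = ["the", "of", "and", "a", "in"]
tokens = "the design of a parallel algorithm".split()
stopped = apply_stopping(tokens, stop_words)
```

	The same filter has to be applied to the queries, so that query terms and index terms stay
	comparable.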
	
		
	TASK 3B
	=======
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under TASK 1
	5.We enter 4 to perform Task 3B
	6.The program will run the BM25, TF-IDF and SQLM models and produce the 
          following files in the program-generated folder "Stemmed Baseline Runs Output", in the path 
          "../SearchEngine/Retrieval/Stemmed Baseline Runs Output".
		1.For BM25 - "Top_100_Query_Result_BM25.txt"
		2.For SQLM - "Top_100_Query_Result_QueryLikelihoodModel.txt" 
		3.For TF-IDF - "Top_100_Query_Result_TF-IDF.txt"
	7.It also generates another folder named "TokenizedCorpusWithStemming", which contains 
          the tokenized stemmed corpus, in the path
          ../SearchEngine/CorpusGeneration/TokenizedCorpusWithStemming.
		
PHASE 2
========
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under
          PHASE 1, TASK 1
	5.We enter 5 to perform the task in Phase 2
	6.The program will internally run DisplayResult.py and generate a folder named 
          "Retrieved_Docments_with_snippets" in the Display folder, in the path
          ../SearchEngine/Display/Retrieved_Docments_with_snippets. This folder will contain 
          the snippets for each document per query along with query highlighting; the highlighted 
          words are shown in upper case.
	7.The snippets are generated for the results of the BM25 model.
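
	One simple way to produce such a snippet is to pick the window of words that contains the most query
	terms and to upper-case the matching words. The sketch below follows that idea; the window size,
	the punctuation handling and the function name are assumptions, and DisplayResult.py may select
	its snippets differently.

```python
def make_snippet(text, query_terms, window=8):
    """Return the `window`-word span with the most query-term hits, hits in upper case."""
    words = text.split()
    query = set(term.lower() for term in query_terms)

    def is_hit(word):
        # Ignore case and surrounding punctuation when matching a query term.
        return word.lower().strip(".,;:()\"'") in query

    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(1 for word in words[start:start + window] if is_hit(word))
        if hits > best_hits:  # the earliest best window wins
            best_start, best_hits = start, hits
    span = words[best_start:best_start + window]
    return " ".join(word.upper() if is_hit(word) else word for word in span)

text = "This paper describes a time sharing system for small computers and reviews earlier work."
snippet = make_snippet(text, ["time", "sharing", "system"], window=5)
```

	For the example text, the first five-word window that covers all three query terms is returned
	as "describes a TIME SHARING SYSTEM".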
	  
PHASE 3
=======
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under
          PHASE 1, TASK 1
	5.We enter 6 to perform the task in Phase 3.
	6.It will generate the folder "OutputFiles" in the path ../SearchEngine/Evaluation/OutputFiles. 
          It will contain the following files:
			1.EvalFileFor-NoTextTran-BM25
			2.EvalFileFor-NoTextTran-Lucene
			3.EvalFileFor-NoTextTran-SQLM
			4.EvalFileFor-NoTextTran-TF-IDF
			5.EvalFileFor-QueryRefinement-BM25
			6.EvalFileFor-Stopped-BM25
			7.EvalFileFor-Stopped-SQLM
			8.EvalFileFor-Stopped-TF-IDF
	7.Each file will contain, for each of the 64 queries, the rank of each document with its precision
          and recall, the P@5 and P@20 values, and the overall MAP and MRR values for the run.
	
EXTRA CREDIT(Task1)
===================

	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under
          PHASE 1, TASK 1
			
	5.We enter 7 to perform Task 1 of Extra credit
	6.It will run the file SpellingErrorGenerator.py
	7.It will generate a text file called "SpellingErrorInducedQueries.txt" inside the folder
          SpellErrorGenerator, in the path ../SearchEngine/SpellErrorGenerator.
	8.It will also generate a file named "EvalFileFor-ErrorInducedQuery-BM25" in the 
          folder path ../SearchEngine/Evaluation/OutputFiles.
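
	This README does not describe the noise model. As one possible illustration, the sketch below swaps
	two adjacent interior letters in a fraction of the longer query words, keeping the first and last
	letters in place. The function name, the error rate and the choice of transposition errors are all
	assumptions, not the behaviour of SpellingErrorGenerator.py.

```python
import random

def add_spelling_noise(query, error_rate=0.4, seed=0):
    """Swap two adjacent interior letters in roughly `error_rate` of the longer words."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    noisy_words = []
    for word in query.split():
        if len(word) > 3 and rng.random() < error_rate:
            i = rng.randint(1, len(word) - 3)  # keeps the first and last letters fixed
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            word = "".join(chars)
        noisy_words.append(word)
    return " ".join(noisy_words)

query = "parallel algorithms for matrix computation"
noisy = add_spelling_noise(query, error_rate=1.0, seed=1)
```

	Words of three letters or fewer are left untouched, and a swap of two identical letters leaves
	the word unchanged.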
	
EXTRA CREDIT(Task2)
==================
	1.Open Windows PowerShell
	2.Perform the following steps in sequence
	3.Run Pipeline.py using the command "python Pipeline.py"
	4.The console then prints the same menu of options (Enter 1 to Enter 11) that is listed under
          PHASE 1, TASK 1
			
	5.We enter 8 to perform Task 2 of Extra credit.
	6.It will run the file SoftMatchingQuerHandler.py
	7.It will generate a file named "EvalFileFor-SoftQueryMatching-BM25" in the folder
          path ../SearchEngine/Evaluation/OutputFiles.
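
	The citations below include the Damerau-Levenshtein distance, so one plausible form of soft matching
	is to replace a query term that is missing from the index vocabulary with its closest vocabulary
	term. The sketch below implements the restricted variant (optimal string alignment); the function
	names, the distance threshold and the toy vocabulary are assumptions, not the contents of
	SoftMatchingQuerHandler.py.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]

def soft_match(term, vocabulary, max_distance=2):
    """Return the closest vocabulary term within max_distance, else the term itself."""
    if term in vocabulary or not vocabulary:
        return term
    best = min(sorted(vocabulary), key=lambda v: osa_distance(term, v))
    return best if osa_distance(term, best) <= max_distance else term

vocabulary = {"algorithm", "logarithm", "system"}
corrected = soft_match("algoritmh", vocabulary)
```

	A swap of two adjacent letters counts as a single edit, which is why "algoritmh" is at distance 1
	from "algorithm".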
	


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

CONTRIBUTORS AND CITATIONS
===========================
https://en.wikipedia.org/wiki/Stemming
https://en.wikipedia.org/wiki/Search_engine_indexing
https://en.wikipedia.org/wiki/Apache_Lucene
https://en.wikipedia.org/wiki/Okapi_BM25
https://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
https://en.wikipedia.org/?title=Mean_average_precision&redirect=no
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)
https://en.wikipedia.org/wiki/BM25
https://en.wikipedia.org/wiki/Mean_reciprocal_rank
https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
http://www.pythonforbeginners.com/beautifulsoup/
https://dl.acm.org/citation.cfm?id=1165776
https://lucene.apache.org/core/
https://lucene.apache.org/core/3_5_0/scoring.html
https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
https://www.python.org/downloads/
http://www.cs.ucr.edu/~vagelis/classes/CS172/publications/jasistSalton1990.pdf
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


Shubham Rastogi
Phone - +1-857-352-3693
Email - [email protected]

Abhidipta Sengupta
Phone - +1-617-817-3640
Email - [email protected]