This system is built for the Precision Medicine (PM) track of TREC and focuses on retrieving scientific abstracts. It uses external knowledge bases for query expansion.
The dataset comes from the TREC 2017 PM track: 889 XML files plus a collection of extra topics in plain-text (.txt) format.
Some documents in the collection share the same ID. The copies differ only in publication date (the differences in abstract content are small enough to ignore), so we keep the abstract from the first hit. If duplicate IDs are not handled, the retrieved results can contain the same ID more than once, and the evaluation script will then fail to run.
- Call GetDuplicateDocumentID.GetAllDocumentIDCollection() to get all the document IDs.
- Call GetDuplicateDocumentID.GetDuplicateDocumentIDs() to get a file with all the duplicate document IDs, i.e. the IDs that appear more than once in the collection.
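
The duplicate handling described above can be sketched as follows. This is a minimal illustration, not the repository's `GetDuplicateDocumentID` class: the class name `DuplicateIdSketch` and both method names are hypothetical, and the input is assumed to be an in-memory list rather than the XML collection.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only.
class DuplicateIdSketch {

    // Returns every ID that occurs more than once in the collection.
    static Set<String> findDuplicateIds(List<String> allIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : allIds) {
            // Set.add returns false when the ID was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    // Keeps the abstract of the first hit for each ID and ignores later copies.
    // Each input element is assumed to be a two-element array {id, abstract}.
    static Map<String, String> keepFirstHit(List<String[]> idAbstractPairs) {
        Map<String, String> firstHits = new LinkedHashMap<>();
        for (String[] pair : idAbstractPairs) {
            firstHits.putIfAbsent(pair[0], pair[1]);
        }
        return firstHits;
    }
}
```

Because later copies are dropped, each ID can appear at most once in a ranked list built from the resulting map.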
Call GetFilePath.GetFilePaths(url) to get an ArrayList of the file paths.
- Call XMLParser.ReadIDAndAbstract(url) to get a <ID, Abstract> map.
- Call XMLParser.ReadIDAndTitle(url) to get an <ID, Title> map.
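
As a rough sketch of what building such a map could look like with the JDK's built-in DOM parser: the element names `PubmedArticle`, `PMID`, `ArticleTitle` and `AbstractText` are assumptions based on the MEDLINE XML layout, and `XmlMapSketch.readIdAndField` is a hypothetical name, not the repository's `XMLParser` API.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class XmlMapSketch {

    // Builds an <ID, text> map, where the text comes from all elements named
    // fieldTag (e.g. "AbstractText" or "ArticleTitle") inside each article.
    static Map<String, String> readIdAndField(InputStream xml, String fieldTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML input", e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        NodeList articles = doc.getElementsByTagName("PubmedArticle"); // assumed element name
        for (int i = 0; i < articles.getLength(); i++) {
            Element article = (Element) articles.item(i);
            NodeList ids = article.getElementsByTagName("PMID"); // assumed element name
            if (ids.getLength() == 0) {
                continue; // no ID, nothing to key on
            }
            String id = ids.item(0).getTextContent().trim();
            NodeList parts = article.getElementsByTagName(fieldTag);
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < parts.getLength(); j++) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(parts.item(j).getTextContent().trim());
            }
            // Keep the first hit only, matching the duplicate-ID policy.
            result.putIfAbsent(id, text.toString());
        }
        return result;
    }
}
```

Calling it once with `"AbstractText"` and once with `"ArticleTitle"` would give the two maps; a DOM parser loads the whole file into memory, so a streaming parser may be preferable for large files.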
Call CreateIndexWithTitle to create indexes.
- Import the duplicate ID collection while indexing to make sure these duplicate documents are indexed only once.
- Set the similarity at this stage.
- Set the parameters for each field, e.g. whether it is stored or tokenized, its index options, etc.
Inspection of the extra topic files showed that the first line of every file is meeting information, so the parser skips it. Read the title (remember to remove the "Title:" prefix) and the abstract from each file, then index these files.
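
A minimal sketch of this parsing step is shown below. It assumes that the title is the first line starting with "Title:" and that every remaining non-empty line belongs to the abstract; the exact file layout is not specified here, and `ExtraTopicSketch.parse` is a hypothetical name.

```java
import java.util.List;

// Hypothetical helper for illustration only.
class ExtraTopicSketch {

    // Returns {title, abstract} for one extra topic file given as a list of lines.
    static String[] parse(List<String> lines) {
        String title = "";
        StringBuilder abstractText = new StringBuilder();
        // Start at index 1 to skip the first line (meeting info).
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue;
            }
            if (title.isEmpty() && line.startsWith("Title:")) {
                // Strip the "Title:" prefix before storing the title.
                title = line.substring("Title:".length()).trim();
            } else {
                if (abstractText.length() > 0) {
                    abstractText.append(' ');
                }
                abstractText.append(line);
            }
        }
        return new String[] {title, abstractText.toString()};
    }
}
```

The lines of a file can be obtained with `java.nio.file.Files.readAllLines(path)` and passed to `parse` directly.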
See the Python scripts under the query expansion folder. Expansion terms are generated for two aspects: disease and gene. Knowledge bases used: Wikidata, NCI, MeSH, and UniProt.
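
To illustrate how expansion terms might be folded into a query on the Java side, here is a hedged sketch that joins a term and its synonyms into an OR group using Lucene's classic query syntax. It is not taken from the Python scripts: the class and method names are hypothetical, and combining the disease and gene groups with AND is an assumption rather than the system's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class QueryExpansionSketch {

    // Wraps multi-word terms in quotes so they are treated as phrases.
    static String quoteIfPhrase(String term) {
        String t = term.trim();
        return t.contains(" ") ? "\"" + t + "\"" : t;
    }

    // Joins the original term and its expansion terms into one OR group,
    // skipping exact repeats.
    static String orGroup(String original, List<String> expansions) {
        List<String> parts = new ArrayList<>();
        parts.add(quoteIfPhrase(original));
        for (String expansion : expansions) {
            String quoted = quoteIfPhrase(expansion);
            if (!parts.contains(quoted)) {
                parts.add(quoted);
            }
        }
        return "(" + String.join(" OR ", parts) + ")";
    }

    // Assumption: both aspects must match, so the two groups are joined with AND.
    static String expandTopic(String disease, List<String> diseaseTerms,
                              String gene, List<String> geneTerms) {
        return orGroup(disease, diseaseTerms) + " AND " + orGroup(gene, geneTerms);
    }
}
```

Terms containing characters that are special in the query syntax would need escaping before being passed to a query parser; the sketch does not handle that.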
Create an ArrayList of queries, then call BM25Retrieval.SearchMethod to run them. Be sure to set the correct clause parameters.
Please see the ReRanking repo.