arthur0804/TREC-PrecisionMedicine

Precision Medicine

Overview

This system was built for the Precision Medicine (PM) track in TREC, which focuses on retrieving scientific abstracts. It draws on external knowledge bases (Wikidata, NCI, MeSH, UniProt) for query expansion.

Data Description

The dataset comes from the TREC 2017 PM track: 889 XML files plus a collection of extra topics in plain-text (.txt) format.

Preparation

The collection contains some duplicate documents that share the same ID. The copies differ only in publishing date (any changes in the abstract text are small enough to ignore), so we keep the abstract from the first occurrence. If duplicate IDs are not handled, the retrieved results can contain the same ID more than once, which makes the evaluation script fail to run.

Get Duplicate Document ID Collection

  1. Call GetDuplicateDocumentID.GetAllDocumentIDCollection() to get all the document IDs.
  2. Call GetDuplicateDocumentID.GetDuplicateDocumentIDs() to get a file with all the duplicate document IDs, i.e. the IDs that appear more than once in the collection.
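
The duplicate check itself can be sketched as below. This is a minimal illustration, not the repository's GetDuplicateDocumentID class: the class name DuplicateIdSketch and the method findDuplicateIds are hypothetical, and the sketch works on an in-memory list rather than writing a file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the repository's implementation.
final class DuplicateIdSketch {
    private DuplicateIdSketch() {}

    /** Returns every ID that occurs more than once, in order of first repeat. */
    static List<String> findDuplicateIds(List<String> allIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : allIds) {
            // Set.add returns false when the ID was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return new ArrayList<>(duplicates);
    }
}
```

Each duplicate ID is reported once, however many copies exist, so the output can be loaded directly into a set at indexing time.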

Indexing

Loading File Paths

Call GetFilePath.GetFilePaths(url) to get an ArrayList of the file paths.
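
One way to collect the paths, assuming the XML files sit in a directory tree and end in .xml, is sketched below. FilePathSketch and listXmlFiles are hypothetical names; the repository's GetFilePaths may work differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not the repository's GetFilePath class.
final class FilePathSketch {
    private FilePathSketch() {}

    /** Walks root recursively and returns the paths of all .xml files, sorted. */
    static ArrayList<String> listXmlFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                // Assumption: collection files are identified by their extension.
                .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".xml"))
                .map(Path::toString)
                .sorted()
                .collect(Collectors.toCollection(ArrayList::new));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Sorting the paths keeps the processing order stable between runs, which matters when the first occurrence of a duplicate ID is the one that gets kept.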

Parse XML

  1. Call XMLParser.ReadIDAndAbstract(url) to get a <ID, Abstract> map.
  2. Call XMLParser.ReadIDAndTitle(url) to get an <ID, Title> map.
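
A rough sketch of how such a map could be built with the JDK's DOM parser follows. The README does not describe the XML schema, so the tag names used here (PubmedArticle, PMID, ArticleTitle, AbstractText) are assumptions based on the usual PubMed/MEDLINE layout; XmlParseSketch and readIdAndField are hypothetical names. The sketch parses a string rather than a file path to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not the repository's XMLParser class.
final class XmlParseSketch {
    private XmlParseSketch() {}

    /** Maps each document ID to the text of fieldTag, e.g. "AbstractText". */
    static Map<String, String> readIdAndField(String xml, String fieldTag) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            // Assumed tag names; adjust them to the actual collection schema.
            NodeList articles = doc.getElementsByTagName("PubmedArticle");
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                NodeList ids = article.getElementsByTagName("PMID");
                NodeList fields = article.getElementsByTagName(fieldTag);
                if (ids.getLength() == 0 || fields.getLength() == 0) {
                    continue; // skip records without an ID or the requested field
                }
                // Join multi-part fields (e.g. structured abstracts) with spaces.
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < fields.getLength(); j++) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(fields.item(j).getTextContent().trim());
                }
                // putIfAbsent keeps the first hit when an ID appears twice.
                result.putIfAbsent(ids.item(0).getTextContent().trim(), text.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling readIdAndField(xml, "AbstractText") and readIdAndField(xml, "ArticleTitle") would give the two maps described above.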

Create Index

Call CreateIndexWithTitle to create indexes.

  1. Import the duplicate ID collection while indexing to make sure these duplicate documents are indexed only once.
  2. Set the similarity at this stage.
  3. Set the parameters for each field, e.g. whether it is stored, whether it is tokenized, and its index options.
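
Step 1 can be sketched independently of the search library. In the illustration below, IndexOnceSketch, Doc, and indexAll are hypothetical names, and the indexer callback stands in for whatever actually writes a document to the index; none of this is taken from CreateIndexWithTitle.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of "index each duplicate ID only once".
final class IndexOnceSketch {
    private IndexOnceSketch() {}

    /** Minimal document record: ID, title, and abstract. */
    static final class Doc {
        final String id;
        final String title;
        final String abstractText;

        Doc(String id, String title, String abstractText) {
            this.id = id;
            this.title = title;
            this.abstractText = abstractText;
        }
    }

    /** Feeds docs to indexer, skipping repeat copies of known duplicate IDs. */
    static int indexAll(List<Doc> docs, Set<String> duplicateIds, Consumer<Doc> indexer) {
        Set<String> indexedDuplicates = new HashSet<>();
        int count = 0;
        for (Doc doc : docs) {
            // Only IDs in the duplicate collection need tracking.
            if (duplicateIds.contains(doc.id) && !indexedDuplicates.add(doc.id)) {
                continue; // a copy with this ID was already indexed
            }
            indexer.accept(doc);
            count++;
        }
        return count;
    }
}
```

Because only the known duplicate IDs are tracked, the bookkeeping set stays small even for a large collection.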

Extra Topics

A check of all the extra-topic files showed that the first line of each one is meeting information, so we skip the first line when parsing. We then read the title (removing the "Title:" prefix) and the abstract, and index these files as well.
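
The parsing rule above might look like the sketch below. It assumes that the title sits on a single line starting with "Title:" and that every other non-empty line after the first belongs to the abstract; the README does not spell out the file layout beyond the first line and the "Title:" prefix. ExtraTopicSketch, Topic, and parse are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch of parsing one extra-topic .txt file.
final class ExtraTopicSketch {
    private ExtraTopicSketch() {}

    /** Title and abstract extracted from one file. */
    static final class Topic {
        final String title;
        final String abstractText;

        Topic(String title, String abstractText) {
            this.title = title;
            this.abstractText = abstractText;
        }
    }

    /** Parses the lines of one file, e.g. as returned by Files.readAllLines. */
    static Topic parse(List<String> lines) {
        String title = "";
        StringBuilder abstractText = new StringBuilder();
        // Start at index 1: the first line holds meeting info and is skipped.
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue;
            }
            if (title.isEmpty() && line.startsWith("Title:")) {
                // Drop the "Title:" prefix before storing the title.
                title = line.substring("Title:".length()).trim();
            } else {
                // Assumption: all remaining lines form the abstract.
                if (abstractText.length() > 0) {
                    abstractText.append(' ');
                }
                abstractText.append(line);
            }
        }
        return new Topic(title, abstractText.toString());
    }
}
```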

Query

Query Expansion

See the Python scripts under the query expansion folder. Expansion terms are generated for two aspects: disease and gene. Knowledge bases used: Wikidata, NCI, MeSH, and UniProt.

Run Retrievals

Create an ArrayList of queries, then call BM25Retrieval.SearchMethod to run them. Be sure to set the correct clause parameters for each query.
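
For readers unfamiliar with BM25, the per-term score behind this kind of ranking can be sketched as follows. This is the textbook formula with a non-negative IDF variant, not code from BM25Retrieval; the class name Bm25Sketch is hypothetical, and the values k1 = 1.2 and b = 0.75 mentioned below are common defaults rather than settings confirmed by this repository.

```java
// Hypothetical sketch of BM25 term scoring; not the repository's implementation.
final class Bm25Sketch {
    private Bm25Sketch() {}

    /** Inverse document frequency; the "1 +" keeps the value non-negative. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score contribution of one query term for one document.
     * tf: term frequency in the document; docLen and avgDocLen: lengths in tokens;
     * k1 controls term-frequency saturation; b controls length normalization.
     */
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docCount, long docFreq, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + norm);
    }
}
```

A document's score for a query is the sum of termScore over the query terms. With k1 = 1.2 and b = 0.75, repeated occurrences of a term add less and less to the score, and longer-than-average documents are penalized.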

ReRanking

Please see the ReRanking repo.

About

A semantically enhanced information retrieval system developed for the TREC PM track.
