Precision Medicine

Overview

This system was built for the TREC Precision Medicine track, which focuses on retrieving scientific abstracts. It uses external knowledge bases for query expansion.

Data Description

The dataset comes from the TREC 2017 PM track: 889 XML files plus a collection of extra topics in plain-text (.txt) format.

Preparation

The collection contains some duplicate documents that share the same ID. The only difference between them is the publishing date (the changes in abstract content are small enough to be ignored), so we simply use the abstract from the first hit. If these duplicate document IDs are not handled, the retrieved results can contain duplicates, which makes the evaluation script unable to run.

Get Duplicate Document ID Collection

Call GetDuplicateDocumentID.GetAllDocumentIDCollection() to get all the document IDs.
Call GetDuplicateDocumentID.GetDuplicateDocumentIDs() to get a file with all the duplicate document IDs, i.e. the IDs that appear more than once in the collection.
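
The duplicate detection described above can be sketched in a few lines of plain Java. This is a minimal illustration, not the code in GetDuplicateDocumentID: the class name DuplicateIdSketch, the method findDuplicateIds, and the sample IDs are all hypothetical, and it returns a set in memory rather than writing a file.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collect every document ID that occurs more than once.
class DuplicateIdSketch {

    // Returns the IDs seen at least twice, ordered by their first repeat.
    static Set<String> findDuplicateIds(List<String> allIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : allIds) {
            // Set.add returns false when the ID was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up IDs, for illustration only.
        List<String> ids = List.of("1001", "1002", "1001", "1003", "1002", "1001");
        System.out.println(findDuplicateIds(ids)); // prints [1001, 1002]
    }
}
```

The resulting set can then be consulted during indexing so that only the first occurrence of each duplicated ID is kept.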

Indexing

Loading File Paths

Call GetFilePath.GetFilePaths(url) to get an ArrayList of the file paths.
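
As a rough idea of what collecting the file paths involves, the sketch below walks a directory tree and gathers the XML files into an ArrayList. It is an assumption about the behaviour, not the actual GetFilePath code: the names FilePathSketch and listXmlFiles are hypothetical, and it assumes the argument is a local directory path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: list every .xml file below a root directory.
class FilePathSketch {

    static ArrayList<String> listXmlFiles(String rootDir) throws IOException {
        // Files.walk must be closed, hence the try-with-resources block.
        try (Stream<Path> paths = Files.walk(Paths.get(rootDir))) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .filter(p -> p.toLowerCase().endsWith(".xml"))
                        .sorted() // stable order makes indexing runs reproducible
                        .collect(Collectors.toCollection(ArrayList::new));
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        String root = args.length > 0 ? args[0] : ".";
        for (String path : listXmlFiles(root)) {
            System.out.println(path);
        }
    }
}
```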

Parse XML

Call XMLParser.ReadIDAndAbstract(url) to get a <ID, Abstract> map.
Call XMLParser.ReadIDAndTitle(url) to get an <ID, Title> map.
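
The sketch below shows one way such an <ID, Abstract> map could be built with the JDK's DOM parser. It is not the XMLParser implementation. The element names PubmedArticle, PMID, and AbstractText are assumptions about the XML layout, the class and method names are hypothetical, and the method reads from an InputStream so the example stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: map each document ID to its abstract text.
class XmlParseSketch {

    static Map<String, String> readIdAndAbstract(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        // Assumed layout: one <PubmedArticle> element per document.
        NodeList articles = doc.getElementsByTagName("PubmedArticle");
        Map<String, String> idToAbstract = new LinkedHashMap<>();
        for (int i = 0; i < articles.getLength(); i++) {
            Element article = (Element) articles.item(i);
            NodeList ids = article.getElementsByTagName("PMID");
            if (ids.getLength() == 0) {
                continue; // no ID, nothing to key the abstract on
            }
            String id = ids.item(0).getTextContent().trim();
            // An abstract may be split into several <AbstractText> sections.
            NodeList parts = article.getElementsByTagName("AbstractText");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < parts.getLength(); j++) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(parts.item(j).getTextContent().trim());
            }
            // putIfAbsent keeps the first hit when an ID appears more than once.
            idToAbstract.putIfAbsent(id, text.toString());
        }
        return idToAbstract;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data, for illustration only.
        String sample = "<PubmedArticleSet>"
                + "<PubmedArticle><PMID>1</PMID><AbstractText>First version.</AbstractText></PubmedArticle>"
                + "<PubmedArticle><PMID>1</PMID><AbstractText>Later version.</AbstractText></PubmedArticle>"
                + "</PubmedArticleSet>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readIdAndAbstract(in));
    }
}
```

An <ID, Title> map would follow the same pattern with the title element in place of the abstract element.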

Create Index

Call CreateIndexWithTitle to create indexes.

Import the duplicate ID collection while indexing to make sure these duplicate documents are indexed only once.
Set the similarity at this stage.
Set the parameters for each field, e.g. whether it is stored, whether it is tokenized, its index options, etc.

Extra Topics

A check of all the extra topics files showed that the first line of each one is meeting information, so we skip the first line when parsing. Then simply read the titles (remember to remove the "Title:" prefix from each title) and the abstracts, and index these files.
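
The parsing rule above can be sketched as follows. This is an illustration under assumptions, not the repository's parser: it assumes the title sits on a single line starting with "Title:" and that every remaining non-empty line belongs to the abstract. The class name ExtraTopicSketch, the method parse, and the sample lines are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: parse one extra-topic text file into title and abstract.
class ExtraTopicSketch {

    // Returns a two-element array: {title, abstract}.
    static String[] parse(List<String> lines) {
        String title = "";
        StringBuilder body = new StringBuilder();
        // Start at index 1: the first line holds meeting info and is skipped.
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue;
            }
            if (title.isEmpty() && line.startsWith("Title:")) {
                // Drop the "Title:" prefix before storing the title.
                title = line.substring("Title:".length()).trim();
            } else {
                if (body.length() > 0) {
                    body.append(' ');
                }
                body.append(line);
            }
        }
        return new String[] { title, body.toString() };
    }

    public static void main(String[] args) {
        // Made-up file content, for illustration only.
        List<String> lines = List.of(
                "Meeting: Example Annual Meeting",
                "Title: An example title",
                "",
                "First sentence of the abstract.",
                "Second sentence of the abstract.");
        String[] parsed = parse(lines);
        System.out.println("title    = " + parsed[0]);
        System.out.println("abstract = " + parsed[1]);
    }
}
```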

Query

Query Expansion

See the Python scripts under the QueryExpansion folder. Expansion terms are generated for two aspects: disease and gene. Knowledge base resources: Wikidata, NCI, MeSH, and UniProt.

Run Retrievals

Create an ArrayList of queries, then call BM25Retrieval.SearchMethod to run them. Be sure to set the correct clause parameters.
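
For readers unfamiliar with BM25, the sketch below computes the classic Okapi BM25 contribution of a single query term in plain Java. It only illustrates the formula and is not the code behind BM25Retrieval.SearchMethod; the class and method names are hypothetical. The values k1 = 1.2 and b = 0.75 used in main are common defaults and are assumed here, since the README does not state which values the system uses.

```java
// Hypothetical sketch: classic Okapi BM25 score for one term in one document.
class Bm25Sketch {

    // Inverse document frequency, in the smoothed form that never goes negative.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf: term frequency in the document; docLen and avgDocLen: lengths in tokens.
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docCount, long docFreq, double k1, double b) {
        // Length normalisation: longer-than-average documents are penalised.
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;  // assumed default
        double b = 0.75;  // assumed default
        // Made-up statistics: 1000 documents, the term occurs in 10 of them.
        double shortDoc = termScore(3, 80, 100, 1000, 10, k1, b);
        double longDoc = termScore(3, 200, 100, 1000, 10, k1, b);
        System.out.println("short document score: " + shortDoc);
        System.out.println("long document score:  " + longDoc);
    }
}
```

With the same term frequency, the shorter document receives the higher score, which is the length normalisation controlled by b. A full query score is the sum of these per-term scores over the query terms.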

ReRanking

Please see the ReRanking repo.
