
Commit 28de8f1

Merge remote-tracking branch 'origin/dev' into 22

2 parents: 363829b + 9679e20

78 files changed: 38,108 additions & 1,592 deletions


.github/workflows/pull-request-testing.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -35,4 +35,4 @@ jobs:
     flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
 - name: Test with pytest
   run: |
-    pytest
+    pytest --cov=src -v
```
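The changed step above turns on coverage measurement: `--cov=src` asks the pytest-cov plugin to measure coverage over the `src/` package, and `-v` enables verbose per-test output. This assumes pytest-cov is available in the workflow environment (it is not visible in this diff). Reconstructed with plausible indentation, the updated step in the workflow file likely reads:

```yaml
# Hypothetical reconstruction of the modified workflow step; exact
# indentation and surrounding keys are not visible in this diff view.
- name: Test with pytest
  run: |
    pytest --cov=src -v
```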

.gitignore

Lines changed: 0 additions & 1 deletion

```diff
@@ -1,6 +1,5 @@
 # ignore files in models folder but keep .gitkeep
 models/ner/*
-results/ner/*
 !.gitkeep
 
 # exclude all txt files in data
```
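The pattern pair kept above uses a common git idiom: ignore a directory's contents while negating a placeholder file so the otherwise-empty directory stays tracked. A minimal sketch of the idiom (the explicit path below is illustrative; the repository's actual file uses a bare `!.gitkeep`):

```gitignore
# Ignore everything inside models/ner/ ...
models/ner/*
# ...but re-include the placeholder so the (otherwise empty) directory
# remains in the repository. The negation works because the wildcard
# excludes the directory's contents, not the directory itself.
!models/ner/.gitkeep
```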

CODE_OF_CONDUCT.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -34,7 +34,7 @@ This Code of Conduct applies both within project spaces and in public spaces whe
 
 ## Enforcement
 
-Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at [email protected]. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
 
 Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
```

README.md

Lines changed: 79 additions & 15 deletions

```diff
@@ -6,24 +6,53 @@
 
 # **MetaExtractor: Finding Fossils in the Literature**
 
-This project aims to identify research articles which are relevant to the [*Neotoma Paleoecological Database*](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards then submitted to Neotoma. It is being completed as part of the *University of British Columbia (UBC)* [*Masters of Data Science (MDS)*](https://masterdatascience.ubc.ca/) program in partnership with the [*Neotoma Paleoecological Database*](http://neotomadb.org).
+This project aims to identify research articles which are relevant to the [_Neotoma Paleoecological Database_](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards then submitted to Neotoma. It is being completed as part of the _University of British Columbia (UBC)_ [_Masters of Data Science (MDS)_](https://masterdatascience.ubc.ca/) program in partnership with the [_Neotoma Paleoecological Database_](http://neotomadb.org).
+
+**Table of Contents**
+
+- [**MetaExtractor: Finding Fossils in the Literature**](#metaextractor-finding-fossils-in-the-literature)
+- [**Article Relevance Prediction**](#article-relevance-prediction)
+- [**Data Extraction Pipeline**](#data-extraction-pipeline)
+- [**Data Review Tool**](#data-review-tool)
+- [How to use this repository](#how-to-use-this-repository)
+- [Entity Extraction Model Training](#entity-extraction-model-training)
+- [Data Review Tool](#data-review-tool-1)
+- [Data Requirements](#data-requirements)
+- [Article Relevance Prediction](#article-relevance-prediction-1)
+- [Data Extraction Pipeline](#data-extraction-pipeline-1)
+- [Development Workflow Overview](#development-workflow-overview)
+- [Analysis Workflow Overview](#analysis-workflow-overview)
+- [System Requirements](#system-requirements)
+- [**Directory Structure and Description**](#directory-structure-and-description)
+- [**Contributors**](#contributors)
+- [Tips for Contributing](#tips-for-contributing)
 
 There are 3 primary components to this project:
+
 1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing.
-2. **MetaData Extraction Pipeline** - extract relevant entities from the article including geographic locations, taxa, etc.
+2. **MetaData Extraction Pipeline** - extract relevant entities from the article including geographic locations, taxa, etc.
 3. **Data Review Tool** - this takes the extracted data and allows the user to review and correct it for submission to Neotoma.
 
 ![](assets/project-flow-diagram.png)
 
 ## **Article Relevance Prediction**
 
-The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not. The predicted articles are then submitted to the Data Extraction Pipeline for processing.
+The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
+
+The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its outstanding performance and interpretability.
+
+Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.
+
+![](assets/article_prediction_flow.png)
+
+To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](docker/article-relevance/README.md).
 
 ## **Data Extraction Pipeline**
 
-The full text is provided by the xDD team for the articles that are deemed to be relevant and a custom trained **Named Entity Recognition (NER)** model is used to extract entities of interest from the article.
+The full text is provided by the xDD team for the articles that are deemed to be relevant, and a custom-trained **Named Entity Recognition (NER)** model is used to extract entities of interest from the article.
 
 The entities extracted by this model are:
+
 - **SITE**: name of the excavation site
 - **REGION**: more general region names to provide context for where sites are located
 - **TAXA**: plant or animal fossil names
```
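The README text added above describes a logistic regression relevance classifier trained on roughly 900 positive and 3,500 negative examples, about a 1:4 class imbalance. One standard mitigation (an assumption here, not something the commit states the project uses) is inverse-frequency class weighting, the same formula scikit-learn applies for `class_weight="balanced"`. A minimal sketch of that calculation:

```python
# Inverse-frequency ("balanced") class weights for an imbalanced training set.
# The example counts come from the README text above; the weighting scheme
# itself is an illustrative assumption, not confirmed by this commit.

def balanced_class_weights(n_pos, n_neg):
    """Return {class_label: weight} using n_samples / (n_classes * n_class)."""
    n_samples = n_pos + n_neg
    n_classes = 2
    return {
        1: n_samples / (n_classes * n_pos),  # positive (relevant) class
        0: n_samples / (n_classes * n_neg),  # negative (irrelevant) class
    }

weights = balanced_class_weights(n_pos=900, n_neg=3500)
print(weights)  # the minority (relevant) class receives the larger weight
```

With these counts the positive class is weighted roughly 2.4 and the negative class roughly 0.63, so misclassifying a rare relevant article costs the model proportionally more during training.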
````diff
@@ -45,8 +74,41 @@ Finally, the extracted data is loaded into the Data Review Tool where members of
 
 ## How to use this repository
 
-WIP
+First, begin by installing the requirements and Docker if not already installed ([Docker install instructions](https://docs.docker.com/get-docker/)):
+
+```bash
+pip install -r requirements.txt
+```
+
+A conda environment file will be provided in the final release.
+
+### Entity Extraction Model Training
+
+The Entity Extraction Models can be trained using the HuggingFace API by following the instructions in the [Entity Extraction Training README](src/entity_extraction/training/hf_token_classification/README.md).
+
+The spaCy model training documentation is a WIP.
+
+### Data Review Tool
+
+The Data Review Tool can be launched by running the following command from the root directory of this repository:
 
+```bash
+docker-compose up --build data-review-tool
+```
+
+Once the image is built and the container is running, the Data Review Tool can be accessed at http://localhost:8050/. There is a sample "extracted entities" JSON file provided for demo purposes.
+
+### Data Requirements
+
+Each of the components of this project has different data requirements. The data requirements for each component are outlined below.
+
+#### Article Relevance Prediction
+
+The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. TODO: Setup public link for data download from project GDrive.
+
+#### Data Extraction Pipeline
+
+As the full-text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests please contact Ty Andrews at [email protected].
 
 ### Development Workflow Overview
 
````
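The Data Review Tool section added above mentions a sample "extracted entities" JSON file. The commit does not show its schema, so the record sketched below is purely hypothetical: every field name and value is invented to illustrate the kind of output an NER pipeline over the listed entity types (SITE, REGION, TAXA, and so on) might produce.

```python
import json

# Hypothetical extracted-entities record; the real schema lives in the
# sample JSON file shipped with the Data Review Tool and may differ.
record = {
    "article_doi": "10.1000/example-doi",  # invented identifier
    "entities": [
        {"text": "Lake Example", "label": "SITE", "start": 120, "end": 132},
        {"text": "British Columbia", "label": "REGION", "start": 140, "end": 156},
        {"text": "Picea glauca", "label": "TAXA", "start": 201, "end": 213},
    ],
}

# Serialize the record the way a pipeline stage might hand it to the
# review tool for stewards to confirm or correct each extracted span.
payload = json.dumps(record, indent=2)
print(payload)
```

Character offsets (`start`/`end`) are included because a reviewer needs to locate each span in the source text; whether the real tool stores offsets, sentence indices, or page coordinates is not visible in this diff.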
````diff
@@ -60,16 +122,16 @@ WIP
 
 WIP
 
-### Data Requirements
-
-WIP
-
 ### **Directory Structure and Description**
 
 ```
 ├── .github/ <- Directory for GitHub files
 │ ├── workflows/ <- Directory for workflows
 ├── assets/ <- Directory for assets
+├── docker/ <- Directory for docker files
+│ ├── article-relevance/ <- Directory for docker files related to article relevance prediction
+│ ├── data-review-tool/ <- Directory for docker files related to data review tool
+│ ├── entity-extraction/ <- Directory for docker files related to named entity recognition
 ├── data/ <- Directory for data
 │ ├── entity-extraction/ <- Directory for named entity extraction data
 │ │ ├── raw/ <- Raw unprocessed data
@@ -94,15 +156,16 @@ WIP
 ├── src/ <- Directory for source code
 │ ├── entity_extraction/ <- Directory for named entity recognition code
 │ ├── article_relevance/ <- Directory for article relevance prediction code
-│ └── data_review_tool/ <- Directory for data review tool code
+│ └── data_review_tool/ <- Directory for data review tool code
 ├── reports/ <- Directory for reports
 ├── tests/ <- Directory for tests
 ├── Makefile <- Makefile with commands to perform analysis
 └── README.md <- The top-level README for developers using this project.
 ```
+
 ## **Contributors**
 
-This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
+This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
 
 The UBC MDS project team consists of:
@@ -112,12 +175,13 @@ The UBC MDS project team consists of:
 - **Shaun Hutchinson**
 
 Sponsors from Neotoma supporting the project are:
-* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez Vidana](https://ht-data.com/)
-* [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)
+
+- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--7926--4935-brightgreen.svg)](https://orcid.org/0000-0002-7926-4935) [Socorro Dominguez Vidana](https://ht-data.com/)
+- [![ORCID](https://img.shields.io/badge/orcid-0000--0002--2700--4605-brightgreen.svg)](https://orcid.org/0000-0002-2700-4605) [Simon Goring](http://www.goring.org)
 
 ### Tips for Contributing
 
-Issues and bug reports are always welcome. Code clean-up, and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/MetaExtractor/network/members) or [project branches](https://github.com/NeotomaDB/MetaExtractor/branches).
+Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to [project forks](https://github.com/NeotomaDB/MetaExtractor/network/members) or [project branches](https://github.com/NeotomaDB/MetaExtractor/branches).
 
 All products of the Neotoma Paleoecology Database are licensed under an [MIT License](LICENSE) unless otherwise noted.
@@ -130,4 +194,4 @@ All products of the Neotoma Paleoecology Database are licensed under an [MIT Lic
 [issues-shield]: https://img.shields.io/github/issues/NeotomaDB/MetaExtractor.svg?style=for-the-badge
 [issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues
 [license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge
-[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
+[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
````
