Commit aaa4875

Merge branch 'dev' into 30-build-ner-extraction-pipeline
2 parents d64c8f2 + 213c8a9 commit aaa4875

37 files changed: +7737 -124 lines changed

.github/workflows/pull-request-testing.yml

Lines changed: 19 additions & 19 deletions
@@ -17,22 +17,22 @@ jobs:
     runs-on: ubuntu-latest

     steps:
-    - uses: actions/checkout@v3
-    - name: Set up Python 3.10
-      uses: actions/setup-python@v3
-      with:
-        python-version: "3.10"
-    - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install flake8 pytest
-        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-    - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-    - name: Test with pytest
-      run: |
-        pytest
+    - uses: actions/checkout@v3
+    - name: Set up Python 3.10
+      uses: actions/setup-python@v3
+      with:
+        python-version: "3.10"
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install flake8 pytest
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+    - name: Lint with flake8
+      run: |
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
+        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+    - name: Test with pytest
+      run: |
+        pytest

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+# ignore files in models folder but keep .gitkeep
+models/ner/*
+results/ner/*
+!.gitkeep
+
 # exclude all txt files in data
 data/**/*.txt
 # include all json files in data

README.md

Lines changed: 61 additions & 6 deletions
@@ -1,7 +1,47 @@
+[![Contributors][contributors-shield]][contributors-url]
+[![Forks][forks-shield]][forks-url]
+[![Stargazers][stars-shield]][stars-url]
+[![Issues][issues-shield]][issues-url]
+[![MIT License][license-shield]][license-url]
+
 # MetaExtractor: Finding Fossils in the Literature

 This project aims to identify research articles which are relevant to the [Neotoma Paleoecological Database](http://neotomadb.org) (Neotoma), extract data relevant to Neotoma from the article, and provide a mechanism for the data to be reviewed by Neotoma data stewards then submitted to Neotoma. It is being completed as part of the University of British Columbia (UBC) [Masters of Data Science (MDS) program](https://masterdatascience.ubc.ca/) in partnership with the [Neotoma Paleoecological Database](http://neotomadb.org).

+There are 3 primary components to this project:
+1. **Article Relevance Prediction** - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing
+2. **MetaData Extraction Pipeline** - extract relevant metadata from the article including geographic locations, taxa present, etc.
+3. **Data Review Tool** - this takes the extracted data and allows a user to review and correct it for submission to Neotoma
+
+![](assets/project-flow-diagram.png)
+
+## Article Relevance Prediction
+
+The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public [xDD API](https://geodeepdive.org/) to regularly get recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not. The predicted articles are then submitted to the MetaData Extraction Pipeline for processing.
+
+## MetaData Extraction Pipeline
+
+The predicted relevant articles have their full text provided by the xDD team and a custom trained Named Entity Recognition (NER) model is used to extract relevant data from the article.
+
+The entities detected by this model are:
+- **AGE**: when historical ages are mentioned such as 1234 AD or 4567 BP (before present)
+- **TAXA**: plant or animal taxa names indicating what samples contained
+- **GEOG**: geographic coordinates indicating where samples were excavated from, e.g. 12'34"N 34'23"W
+- **SITE**: site names for where samples were excavated from
+- **REGION**: more general regions to provide context for where sites are located
+- **EMAIL**: researcher emails in the articles able to be used for follow-up contact
+- **ALTI**: altitudes of sites from where samples were excavated, e.g. 123 m a.s.l (above sea level)
+
+The model was trained on ~40 existing Paleoecology articles manually annotated by the team consisting of ~60,000 tokens with ~4,500 tagged entities.
+
+The trained model is available for inference and re-use on huggingface.co [here](https://huggingface.co/finding-fossils/metaextractor).
+![](assets/hugging-face-metaextractor.png)
+
+## Data Review Tool
+
+Finally, the extracted data is loaded into the Data Review Tool where members of the Neotoma community can review the data and make any corrections necessary before submitting to Neotoma. The Data Review Tool is a web application built using the [Plotly Dash](https://dash.plotly.com/) framework. The tool allows users to view the extracted data, make corrections, and submit the data to be entered into Neotoma.
+
+![](assets/data-review-tool.png)
 ## Contributors

 This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a [code of conduct](https://github.com/NeotomaDB/MetaExtractor/blob/main/CODE_OF_CONDUCT.md). Please review and follow this code of conduct as part of your contribution.
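
Reviewer note on the README hunk above: the new text lists seven entity labels (AGE, TAXA, GEOG, SITE, REGION, EMAIL, ALTI) but does not show what downstream code receives from the NER step. The sketch below is one way token-level tags could be merged into labelled spans. It assumes a BIO tagging scheme (`B-AGE`, `I-AGE`, `O`), which the README does not state, and `group_entities` is a hypothetical helper, not a function from this repository or from the published model.

```python
from typing import List, Tuple

# Entity labels as listed in the README; the BIO scheme itself is an assumption.
LABELS = {"AGE", "TAXA", "GEOG", "SITE", "REGION", "EMAIL", "ALTI"}


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (label, text) entity spans.

    Hypothetical helper for illustration only.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span currently being built, if there is one.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in LABELS:
            flush()  # a new entity starts, so close any open one
            current_label, current_tokens = label, [token]
        elif prefix == "I" and label == current_label:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O", an unknown label, or a stray "I-" tag ends the span
    flush()
    return entities


tokens = ["Pollen", "of", "Pinus", "sylvestris", "dated", "to",
          "4567", "BP", "at", "123", "m", "a.s.l"]
tags = ["O", "O", "B-TAXA", "I-TAXA", "O", "O",
        "B-AGE", "I-AGE", "O", "B-ALTI", "I-ALTI", "I-ALTI"]
print(group_entities(tokens, tags))
```

A stray `I-` tag with no preceding `B-` tag is dropped here rather than promoted to a new entity. That is a conservative choice for data that a steward will review later, and a real pipeline may prefer the opposite.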
@@ -24,30 +64,31 @@ All products of the Neotoma Paleoecology Database are licensed under an [MIT Lic

 ## How to use this repository

-TBD
+WIP


 ### Development Workflow Overview

-TBD
+WIP

 ### Analysis Workflow Overview

-TBD
+WIP

 ### System Requirements

-TBD
+WIP

 ### Data Requirements

-TBD
+WIP

 ### Directory Structure and Description

 ```
 ├── .github/ <- Directory for GitHub files
 │ ├── workflows/ <- Directory for workflows
+├── assets/ <- Directory for assets
 ├── data/ <- Directory for data
 │ ├── entity-extraction/ <- Directory for named entity extraction data
 │ │ ├── raw/ <- Raw unprocessed data
@@ -65,6 +106,9 @@ TBD
 │ ├── article-relevance/ <- Directory for results related to article relevance prediction
 │ ├── ner/ <- Directory for results related to named entity recognition
 │ └── data-review-tool/ <- Directory for results related to data review tool
+├── models/ <- Directory for models
+│ ├── entity-extraction/ <- Directory for named entity recognition models
+│ ├── article-relevance/ <- Directory for article relevance prediction models
 ├── notebooks/ <- Directory for notebooks
 ├── src/ <- Directory for source code
 │ ├── entity_extraction/ <- Directory for named entity recognition code
@@ -74,4 +118,15 @@ TBD
 ├── tests/ <- Directory for tests
 ├── Makefile <- Makefile with commands to perform analysis
 └── README.md <- The top-level README for developers using this project.
-```
+```
+
+[contributors-shield]: https://img.shields.io/github/contributors/NeotomaDB/MetaExtractor.svg?style=for-the-badge
+[contributors-url]: https://github.com/NeotomaDB/MetaExtractor/graphs/contributors
+[forks-shield]: https://img.shields.io/github/forks/NeotomaDB/MetaExtractor.svg?style=for-the-badge
+[forks-url]: https://github.com/NeotomaDB/MetaExtractor/network/members
+[stars-shield]: https://img.shields.io/github/stars/NeotomaDB/MetaExtractor.svg?style=for-the-badge
+[stars-url]: https://github.com/NeotomaDB/MetaExtractor/stargazers
+[issues-shield]: https://img.shields.io/github/issues/NeotomaDB/MetaExtractor.svg?style=for-the-badge
+[issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues
+[license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge
+[license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
assets/data-review-tool.png

332 KB

assets/ffossils-logo-text.png

137 KB
118 KB

assets/project-flow-diagram.png

63.1 KB

models/.gitkeep

Whitespace-only changes.

models/ner/.gitkeep

Whitespace-only changes.
