Commit eb1193d

Merge branch 'dev' into data-review-tool-jenit

2 parents f8437ee + 96a45e4
8 files changed (+133, -57 lines)

.github/workflows/pull-request-testing.yml

Lines changed: 7 additions & 1 deletion
@@ -35,4 +35,10 @@ jobs:
           flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
       - name: Test with pytest
         run: |
-          pytest --cov=src -v
+          pytest --cov=src --cov-report=xml
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v3
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
+        with:
+          files: ./coverage.xml # coverage report
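Stitched together from the added lines, the resulting test-and-upload steps would look roughly like this; the indentation is reconstructed, since the diff view above does not preserve it:

```yaml
# Sketch of the updated workflow steps (reconstructed, not verbatim)
- name: Test with pytest
  run: |
    pytest --cov=src --cov-report=xml
- name: Upload coverage reports to Codecov
  uses: codecov/codecov-action@v3
  env:
    CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
  with:
    files: ./coverage.xml # coverage report
```

Switching `-v` for `--cov-report=xml` is what produces the `coverage.xml` file the Codecov action uploads.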

README.md

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,7 @@
 [![Stargazers][stars-shield]][stars-url]
 [![Issues][issues-shield]][issues-url]
 [![MIT License][license-shield]][license-url]
+[![codecov][codecov-shield]][codecov-url]
 
 # **MetaExtractor: Finding Fossils in the Literature**
 
@@ -192,3 +193,5 @@ All products of the Neotoma Paleoecology Database are licensed under an [MIT Lic
 [issues-url]: https://github.com/NeotomaDB/MetaExtractor/issues
 [license-shield]: https://img.shields.io/github/license/NeotomaDB/MetaExtractor.svg?style=for-the-badge
 [license-url]: https://github.com/NeotomaDB/MetaExtractor/blob/master/LICENSE.txt
+[codecov-shield]: https://img.shields.io/codecov/c/github/NeotomaDB/MetaExtractor?style=for-the-badge
+[codecov-url]: https://codecov.io/gh/NeotomaDB/MetaExtractor

docker-compose.yml

Lines changed: 2 additions & 5 deletions
@@ -13,13 +13,10 @@ services:
     volumes:
       - ./data/data-review-tool:/MetaExtractor/inputs:rw
   entity-extraction-pipeline:
-    image: metaextractor-entity-extraction-pipeline:v0.0.2
-    build:
+    image: metaextractor-entity-extraction-pipeline:v0.0.3
+    build:
       dockerfile: ./docker/entity-extraction-pipeline/Dockerfile
       context: .
-      args:
-        HF_NER_MODEL_NAME: "roberta-finetuned-v3"
-        SPACY_NER_MODEL_NAME: "spacy-transformer-v3"
     ports:
       - "5000:5000"
     volumes:

docker/data-review-tool/README.md

Lines changed: 23 additions & 16 deletions
@@ -2,15 +2,22 @@
 
 This docker image contains `Finding Fossils`, a data review tool built using Dash, Python. It is used to visualize the outputs of the models and verify the extracted entities for inclusion in the Neotoma Database.
 
-## Docker Compose Setup
+The expected inputs are mounted onto the newly created container as volumes and can be dumped in any folder. An environment variable is set up to provide the path to this folder. It assumes the following:
+1. A parquet file containing the outputs from the article relevance prediction component.
+2. A zipped file containing the outputs from the named entity extraction component.
+3. Once the articles have been verified, the same parquet file referenced by the environment variable `ARTICLE_RELEVANCE_BATCH` is updated with the entities verified by the steward and the review status of the article.
 
-We first build the docker image to install the required dependencies that can be run using `docker-compose` as follows:
-```bash
-docker-compose build
-docker-compose up data-review-tool
-```
+## Additional Options Enabled by Environment Variables
+
+The following environment variables can be set to change the behavior of the pipeline:
+- `ARTICLE_RELEVANCE_BATCH`: the name of the article relevance output parquet file.
+- `ENTITY_EXTRACTION_BATCH`: the name of the entity extraction compressed output file.
+
+## Sample Docker Compose Setup
+
+Update the environment variables and the volume paths defined under the `data-review-tool` service in the `docker-compose.yml` file under the root directory. The volume paths are:
 
-This is the basic docker compose configuration for running the image.
+`INPUT_PATH`: the path to the directory where the data is dumped, e.g. `./data/data-review-tool` (recommended)
 
 ```yaml
 version: "3.9"
@@ -21,13 +28,13 @@ services:
     ports:
       - "8050:8050"
     volumes:
-      - ./data/data-review-tool:/MetaExtractor/data/data-review-tool
+      - {INPUT_PATH}:/MetaExtractor/inputs
+    environment:
+      - ARTICLE_RELEVANCE_BATCH=sample_parquet_output.parquet
+      - ENTITY_EXTRACTION_BATCH=sample_ner_output.zip
+```
+Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
+```bash
+docker-compose build
+docker-compose up data-review-tool
 ```
-
-### Input
-The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The artifacts required by the data review tool to verify a batch of processed articles are:
-- A parquet file containing the outputs from the article relevance prediction component.
-- A zipped file containing the outputs from the named entity extraction component.
-
-### Output
-Once the articles have been verified and the container has been destroyed, we update the same parquet file referenced in the `Input` with the extracted (predicted by the model) and verified (correct by data steward) entities.

docker/entity-extraction-pipeline/Dockerfile

Lines changed: 9 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -10,25 +10,23 @@ COPY docker/entity-extraction-pipeline/requirements.txt .
1010
# Install the required Python packages
1111
RUN pip install --no-cache-dir -r requirements.txt
1212
RUN python -m nltk.downloader stopwords
13+
RUN pip install https://huggingface.co/finding-fossils/metaextractor-spacy/resolve/main/en_metaextractor_spacy-any-py3-none-any.whl
14+
# install git-lfs to be able to clone model weights from huggingface
15+
RUN apt-get update && apt-get install -y git-lfs
16+
# download the HF model into /app/models/ner/metaextractor
17+
RUN mkdir -p ./models/ner/ \
18+
&& cd ./models/ner/ \
19+
&& git lfs install \
20+
&& git clone https://huggingface.co/finding-fossils/metaextractor
1321

1422
# Copy the entire repository folder into the container
1523
COPY src ./src
1624

17-
# Build args
18-
ARG HF_NER_MODEL_NAME
19-
ARG SPACY_NER_MODEL_NAME
20-
21-
# Set env variables for when running the container
22-
ENV HF_NER_MODEL_NAME=${HF_NER_MODEL_NAME}
23-
ENV SPACY_NER_MODEL_NAME=${SPACY_NER_MODEL_NAME}
25+
# Set default env variables for when running the container
2426
ENV USE_NER_MODEL_TYPE=huggingface
2527
ENV MAX_ARTICLES=-1
2628
ENV MAX_SENTENCES=-1
2729

28-
# Copy in the model defined by the env variable NER_MODEL_NAME from models folder
29-
COPY models/ner/${HF_NER_MODEL_NAME} ./models/ner/${HF_NER_MODEL_NAME}
30-
COPY models/ner/${SPACY_NER_MODEL_NAME} ./models/ner/${SPACY_NER_MODEL_NAME}
31-
3230
# non-root user control inspired from here: https://stackoverflow.com/questions/66349101/docker-non-root-user-does-not-have-writing-permissions-when-using-volumes
3331
# Create a non-root user that owns the input/outputs directory by default
3432
RUN useradd -r extraction-user # no specific user ID

docker/entity-extraction-pipeline/README.md

Lines changed: 51 additions & 11 deletions
@@ -6,41 +6,81 @@ This docker image contains the models and code required to run entity extraction
 2. The raw input data is mounted as a volume to the docker folder `/app/inputs/`
 3. The expected output location is mounted as a volume to the docker folder `/app/outputs/`
 4. A single JSON file per article is exported into the output folder along with a `.log` file for the processing run.
-5. An environment variable `LOG_OUTPUT_DIR` is set to the path of the output folder. This is used to write the log file. Default is the directory from which the docker container is run.
 
 ## Additional Options Enabled by Environment Variables
 
 The following environment variables can be set to change the behavior of the pipeline:
 - `USE_NER_MODEL_TYPE`: This variable can be set to `spacy` or `huggingface` to change the NER model used. The default is `huggingface`. This will be used to run batches with each model to evaluate final performance.
+- `HF_NER_MODEL_NAME`: the name of the `huggingface-hub` repository hosting the HuggingFace model artifacts.
+- `SPACY_NER_MODEL_NAME`: the name of the `huggingface-hub` repository hosting the spaCy model artifacts.
 - `MAX_SENTENCES`: This variable can be set to a number to limit the number of sentences processed per article. This is useful for testing and debugging. The default is `-1` which means no limit.
 - `MAX_ARTICLES`: This variable can be set to a number to limit the number of articles processed. This is useful for testing and debugging. The default is `-1` which means no limit.
+- `LOG_OUTPUT_DIR`: the path of the output folder the log file is written to. The default is the directory from which the docker container is run.
 
-## Sample Docker Run & Compose Setup
+## Testing the Docker Image to Run on xDD
+
+The docker image must be able to run without root permissions. To test that this is set up correctly, run the following command and ensure it completes without error.
+
+```bash
+docker run -u $(id -u) -p 5000:5000 -v /${PWD}/data/entity-extraction/raw/original_files/:/inputs/ -v /${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ --env LOG_OUTPUT_DIR="../outputs/" metaextractor-entity-extraction-pipeline:v0.0.3
+```
+
+**Details**:
+- `$(id -u)` runs the docker container as the current user so that the output files are not owned by root
+- `LOG_OUTPUT_DIR="../outputs/"` differs from the docker compose setup because it is relative to the current directory, which for `docker run` starts in the `app` folder
+- for Git Bash on Windows, `/${PWD}` is used to get the current directory; the leading forward slash is important to get the correct path
+
+## Sample Docker Compose Setup
+
+Update the environment variables defined under the `entity-extraction-pipeline` service in the `docker-compose.yml` file under the root directory. The volume paths are:
+- `INPUT_FOLDER`: the folder containing the raw text `nlp352` TSV file, e.g. `./data/entity-extraction/raw/original_files/` (recommended)
+- `OUTPUT_FOLDER`: the folder to dump the final JSON files into, e.g. `./data/entity-extraction/processed/processed_articles/` (recommended)
+
+Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
 
-Below is a sample docker run command for running the image:
-- the `$(id -u)` is used to run the docker container as the current user so that the output files are not owned by root
-- the `LOG_OUTPUT_DIR="../outputs/"` is different from the docker compose as it is relative to the current directory which from Docker run starts in `app` folder
-- for git bash on windows the `/${PWD}` is used to get the current directory and the forward slash is important to get the correct path
 ```bash
-docker run -u $(id -u) -p 5000:5000 -v /${PWD}/data/entity-extraction/raw/original_files/:/inputs/ -v /${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ --env LOG_OUTPUT_DIR="../outputs/" metaextractor-entity-extraction-pipeline:v0.0.2
+docker-compose build
+docker-compose up entity-extraction-pipeline
 ```
 
 Below is a sample docker compose configuration for running the image:
 ```yaml
 version: "0.0.1"
 services:
   entity-extraction-pipeline:
-    image: metaextractor-entity-extraction-pipeline:v0.0.1
+    image: metaextractor-entity-extraction-pipeline:v0.0.3
     build:
       ...
     ports:
       - "5000:5000"
     volumes:
-      - ./data/raw/:/app/inputs/
-      - ./data/processed/:/app/outputs/
+      - ./data/entity-extraction/raw/<INPUT_FOLDER>:/inputs/
+      - ./data/entity-extraction/processed/<OUTPUT_FOLDER>:/outputs/
     environment:
       - USE_NER_MODEL_TYPE=huggingface
-      - LOG_OUTPUT_DIR=/app/outputs/
+      - LOG_OUTPUT_DIR=/outputs/
      - MAX_SENTENCES=20
      - MAX_ARTICLES=1
+```
+## Pushing the Docker Image to Docker Hub
+
+To push the docker image to Docker Hub, first log in using the following command:
+
+```bash
+docker login
+```
+
+Then tag the docker image with the following two commands:
+
+```bash
+# to update the "latest" tag image
+docker tag metaextractor-entity-extraction-pipeline:v<VERSION NUMBER> <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline
+# to upload a specific version tagged image
+docker tag metaextractor-entity-extraction-pipeline:v<VERSION NUMBER> <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline:v<VERSION NUMBER>
+```
+
+Finally, push the docker image to Docker Hub using the following command:
+
+```bash
+docker push <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline
 ```

src/entity_extraction/spacy_entity_extraction.py

Lines changed: 7 additions & 2 deletions
@@ -32,8 +32,13 @@ def spacy_extract_all(
     """
 
     if ner_model == None:
-        logger.info("Empty model passed, return 0 labels.")
-        return []
+        try:
+            import en_metaextractor_spacy
+            ner_model = en_metaextractor_spacy.load()
+        except ImportError:
+            logger.error("Spacy model en_metaextractor_spacy not found.")
+            logger.info("Empty model passed, return 0 labels.")
+            return []
 
     entities = []
     doc = ner_model(text)
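The fallback introduced here can be sketched in isolation. Below is a minimal, self-contained version of the pattern: if no model is passed in, try a fallback loader, and return zero labels on failure. `load_default_model` is a hypothetical stand-in for the packaged `en_metaextractor_spacy.load` used in the actual commit.

```python
import logging

logger = logging.getLogger(__name__)


def extract_all(text, ner_model=None, load_default_model=None):
    """Run `ner_model` over `text`, falling back to a default loader.

    `load_default_model` is an illustrative stand-in for a packaged
    pipeline loader such as en_metaextractor_spacy.load.
    """
    if ner_model is None:
        try:
            # may raise if the fallback package/loader is unavailable
            ner_model = load_default_model()
        except Exception:  # the commit uses a bare except; narrowed here
            logger.error("Default spaCy model not found.")
            logger.info("Empty model passed, return 0 labels.")
            return []
    return ner_model(text)


# A stub callable standing in for a spaCy pipeline:
labels = extract_all("Holocene pollen", ner_model=lambda t: [t.split()[0]])
```

Catching `ImportError` (as in the edited diff above) rather than a bare `except` keeps genuine model-loading bugs from being silently swallowed.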

src/pipeline/entity_extraction_pipeline.py

Lines changed: 31 additions & 11 deletions
@@ -1,7 +1,7 @@
 # Author: Ty Andrews
 # Date: 2023-06-05
 """
-Usage: entity_extraction.py --article_text_path=<article_text_path> --output_path=<output_path> [--max_sentences=<max_sentences>] [--max_articles=<max_articles>]
+Usage: entity_extraction_pipeline.py --article_text_path=<article_text_path> --output_path=<output_path> [--max_sentences=<max_sentences>] [--max_articles=<max_articles>]
 
 Options:
     --article_text_path=<article_text_path> The path to the article text data file.
@@ -34,9 +34,11 @@
 load_dotenv(find_dotenv())
 
 # get the MODEL_NAME from environment variables
-HF_NER_MODEL_NAME = os.getenv("HF_NER_MODEL_NAME", "roberta-finetuned-v3")
-SPACY_NER_MODEL_NAME = os.getenv("SPACY_NER_MODEL_NAME", "spacy-transformer-v3")
+HF_NER_MODEL_PATH = os.getenv("HF_NER_MODEL_PATH", "./models/ner/metaextractor")
+SPACY_NER_MODEL_NAME = os.getenv("SPACY_NER_MODEL_NAME", "en_metaextractor_spacy")
 USE_NER_MODEL_TYPE = os.getenv("USE_NER_MODEL_TYPE", "huggingface")
+MAX_SENTENCES = os.getenv("MAX_SENTENCES", "-1")
+MAX_ARTICLES = os.getenv("MAX_ARTICLES", "-1")
 
 logger = get_logger(__name__)
 
@@ -286,7 +288,7 @@ def recreate_original_sentences_with_labels(row):
 def extract_entities(
     article_text_data: pd.DataFrame,
     model_type: str = "huggingface",
-    model_path: str = os.path.join("models", "ner", "roberta-finetuned-v3"),
+    model_path: str = "metaextractor",
 ) -> pd.DataFrame:
     """
     Extracts the entities from the article text data.
@@ -553,31 +555,42 @@ def main():
 
     article_text_data = load_article_text_data(file_path)
 
-    if opt["--max_articles"] is not None and int(opt["--max_articles"]) != -1:
+    if MAX_ARTICLES is not None and int(MAX_ARTICLES) != -1:
         article_text_data = article_text_data[
             # 7 index used for testing with entities in first couple sentences of article 7
             article_text_data["gddid"].isin(
                 article_text_data["gddid"].unique()[
-                    0 : 0 + int(opt["--max_articles"])
+                    0 : 0 + int(MAX_ARTICLES)
                 ]
             )
         ]
+        logger.info(
+            f"Using just a subsample of the data with {int(MAX_ARTICLES)} articles"
+        )
 
     # if max_sentences is not -1 then only use the first max_sentences sentences
-    if opt["--max_sentences"] is not None and int(opt["--max_sentences"]) != -1:
-        article_text_data = article_text_data.head(int(opt["--max_sentences"]))
+    if MAX_SENTENCES is not None and int(MAX_SENTENCES) != -1:
+        # get just sentence id's for each gdd up to max_sentences
+        article_text_data = article_text_data[
+            article_text_data["sentid"].isin(
+                article_text_data["sentid"].unique()[0 : int(MAX_SENTENCES)]
+            )
+        ]
+        logger.info(
+            f"Using just a subsample of the data with {int(MAX_SENTENCES)} sentences"
+        )
 
     for article_gdd in article_text_data["gddid"].unique():
         logger.info(f"Processing GDD ID: {article_gdd}")
 
         article_text = article_text_data[article_text_data["gddid"] == article_gdd]
 
         if USE_NER_MODEL_TYPE == "huggingface":
-            logger.info(f"Using HuggingFace model {HF_NER_MODEL_NAME}")
-            model_path = os.path.join("models", "ner", HF_NER_MODEL_NAME)
+            logger.info(f"Using HuggingFace model {HF_NER_MODEL_PATH}")
+            model_path = HF_NER_MODEL_PATH
         elif USE_NER_MODEL_TYPE == "spacy":
             logger.info(f"Using Spacy model {SPACY_NER_MODEL_NAME}")
-            model_path = os.path.join("models", "ner", SPACY_NER_MODEL_NAME)
+            model_path = SPACY_NER_MODEL_NAME
         else:
             raise ValueError(
                 f"Model type {USE_NER_MODEL_TYPE} not supported. Please set MODEL_TYPE to either 'huggingface' or 'spacy'."
@@ -611,6 +624,13 @@ def main():
             )
             continue
 
+        # delete the file if it already exists with the article_gdd name
+        if os.path.exists(os.path.join(opt["--output_path"], f"{article_gdd}.json")):
+            os.remove(os.path.join(opt["--output_path"], f"{article_gdd}.json"))
+            logger.warning(
+                f"Deleted existing file {article_gdd}.json in output directory."
+            )
+
         export_extracted_entities(
             extracted_entities=pprocessed_entities,
             output_path=opt["--output_path"],
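The `MAX_ARTICLES`/`MAX_SENTENCES` subsampling above can be sketched without pandas. This is a simplified, illustrative version using plain tuples in place of the pipeline's DataFrame; `first_n_unique` is a hypothetical helper written for this sketch, not part of the repository:

```python
import os


def first_n_unique(values, limit):
    """Return the set of the first `limit` distinct values, in first-seen order.

    A limit of -1 mirrors the pipeline's sentinel default meaning "no limit".
    """
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return set(seen if limit == -1 else seen[:limit])


# Env vars arrive as strings, defaulting to "-1", mirroring
# os.getenv("MAX_ARTICLES", "-1") in the pipeline.
max_articles = int(os.getenv("MAX_ARTICLES", "-1"))

# (gddid, sentid) rows standing in for the article text DataFrame:
rows = [("gdd1", "s1"), ("gdd1", "s2"), ("gdd2", "s1"), ("gdd3", "s1")]

# Keep only rows belonging to the first two distinct article ids,
# analogous to filtering on gddid.unique()[0 : int(MAX_ARTICLES)].
keep_ids = first_n_unique((gdd for gdd, _ in rows), 2)
subset = [r for r in rows if r[0] in keep_ids]
```

Note the commit also changes the `MAX_SENTENCES` behavior: instead of `DataFrame.head(n)`, it keeps the first `n` distinct `sentid` values, so the limit applies per article rather than to the file as a whole.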
