
Commit 44fb392
Remove args and add docs
1 parent a18a06c commit 44fb392
4 files changed: 32 additions & 28 deletions

docker-compose.yml

Lines changed: 5 additions & 6 deletions
```diff
@@ -14,16 +14,15 @@ services:
     build:
       dockerfile: ./docker/entity-extraction-pipeline/Dockerfile
       context: .
-      args:
-        MAX_SENTENCES: 20
-        MAX_ARTICLES: 1
     ports:
       - "5000:5000"
     volumes:
-      - ./data/entity-extraction/raw/original_files/:/inputs/
-      - ./data/entity-extraction/processed/processed_articles/:/outputs/
+      - ./data/entity-extraction/raw/original_files/:/app/inputs/
+      - ./data/entity-extraction/processed/processed_articles/:/app/outputs/
     environment:
       - HF_NER_MODEL_NAME=finding-fossils/metaextractor
       - SPACY_NER_MODEL_NAME=en_metaextractor_spacy
       - USE_NER_MODEL_TYPE=huggingface
-      - LOG_OUTPUT_DIR=/outputs/
+      - LOG_OUTPUT_DIR=/app/outputs/
+      - MAX_SENTENCES=20
+      - MAX_ARTICLES=1
```
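The build-time `args` removed above become runtime environment variables, so the limits can be changed without rebuilding the image. A minimal sketch of how the pipeline might read them (the helper names here are illustrative, not the project's actual code; the `-1` "no limit" default matches the pipeline README):

```python
import os

def get_limit(name: str, default: int = -1) -> int:
    """Read an integer limit from the environment; -1 means no limit."""
    return int(os.environ.get(name, default))

def apply_limit(items: list, limit: int) -> list:
    """Truncate items unless the limit is -1 (unlimited)."""
    return list(items) if limit == -1 else list(items)[:limit]

# With the compose file above these would resolve to 20 and 1.
max_sentences = get_limit("MAX_SENTENCES")
max_articles = get_limit("MAX_ARTICLES")
```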

docker/data-review-tool/README.md

Lines changed: 14 additions & 11 deletions
````diff
@@ -2,9 +2,20 @@
 
 This docker image contains `Finding Fossils`, a data review tool built using Dash, Python. It is used to visualize the outputs of the models and verify the extracted entities for inclusion in the Neotoma Database.
 
-## Docker Compose Setup
+The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The tool assumes the following:
+1. A parquet file containing the outputs from the article relevance prediction component.
+2. A zipped file containing the outputs from the named entity extraction component.
+3. Once the articles have been verified, the same parquet file, referenced via the environment variable `ARTICLE_RELEVANCE_BATCH`, is updated with the entities verified by the steward and the review status of the article.
 
-We first build the docker image to install the required dependencies that can be run using `docker-compose` as follows:
+## Additional Options Enabled by Environment Variables
+
+The following environment variables can be set to change the behavior of the pipeline:
+- `ARTICLE_RELEVANCE_BATCH`: the name of the article relevance output parquet file.
+- `ENTITY_EXTRACTION_BATCH`: the name of the entity extraction compressed output file.
+
+## Sample Docker Compose Setup
+
+Update the environment variables defined under the `data-review-tool` service in the `docker-compose.yml` file in the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
 ```bash
 docker-compose build
 docker-compose up data-review-tool
@@ -21,13 +32,5 @@ services:
     ports:
       - "8050:8050"
     volumes:
-      - ./data/data-review-tool:/MetaExtractor/data/data-review-tool
+      - ./data/data-review-tool:/MetaExtractor/inputs
 ```
-
-### Input
-The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The artifacts required by the data review tool to verify a batch of processed articles are:
-- A parquet file containing the outputs from the article relevance prediction component.
-- A zipped file containing the outputs from the named entity extraction component.
-
-### Output
-Once the articles have been verified and the container has been destroyed, we update the same parquet file referenced in the `Input` with the extracted (predicted by the model) and verified (corrected by the data steward) entities.
````
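The new `ARTICLE_RELEVANCE_BATCH` and `ENTITY_EXTRACTION_BATCH` variables name the batch files the review tool looks for inside the mounted `/MetaExtractor/inputs` folder. A sketch of how the tool might resolve them (the function name and the fallback file names are hypothetical, not the project's actual defaults):

```python
import os

def resolve_batch_paths(input_dir: str = "/MetaExtractor/inputs", env=os.environ):
    """Resolve the two batch artifacts the review tool expects.

    File names come from the environment variables documented above;
    the fallback names here are illustrative placeholders only.
    """
    article = env.get("ARTICLE_RELEVANCE_BATCH", "article_relevance.parquet")
    entities = env.get("ENTITY_EXTRACTION_BATCH", "entity_extraction.zip")
    return os.path.join(input_dir, article), os.path.join(input_dir, entities)
```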

docker/entity-extraction-pipeline/Dockerfile

Lines changed: 4 additions & 4 deletions
```diff
@@ -18,12 +18,12 @@ COPY src ./src
 # non-root user control inspired from here: https://stackoverflow.com/questions/66349101/docker-non-root-user-does-not-have-writing-permissions-when-using-volumes
 # Create a non-root user that owns the input/outputs directory by default
 RUN useradd -r extraction-user # no specific user ID
-RUN mkdir /inputs && chown extraction-user /inputs
-RUN mkdir /outputs && chown extraction-user /outputs
+RUN mkdir ./inputs && chown extraction-user ./inputs
+RUN mkdir ./outputs && chown extraction-user ./outputs
 # Mount the "inputs" and "outputs" folders as volumes
-VOLUME ["/inputs", "/outputs"]
+VOLUME ["./inputs", "./outputs"]
 
 # Set the entry point and command to run the script
 USER extraction-user
 RUN ls -alp /app
-ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py --article_text_path /inputs/ --output_path /outputs/
+ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py --article_text_path ./inputs/ --output_path ./outputs/
```

docker/entity-extraction-pipeline/README.md

Lines changed: 9 additions & 7 deletions
````diff
@@ -6,23 +6,23 @@ This docker image contains the models and code required to run entity extraction
 2. The raw input data is mounted as a volume to the docker folder `/app/inputs/`
 3. The expected output location is mounted as a volume to the docker folder `/app/outputs/`
 4. A single JSON file per article is exported into the output folder along with a `.log` file for the processing run.
-5. An environment variable `LOG_OUTPUT_DIR` is set to the path of the output folder. This is used to write the log file. The default is the directory from which the docker container is run.
 
 ## Additional Options Enabled by Environment Variables
 
 The following environment variables can be set to change the behavior of the pipeline:
 - `USE_NER_MODEL_TYPE`: This variable can be set to `spacy` or `huggingface` to change the NER model used. The default is `huggingface`. This will be used to run batches with each model to evaluate final performance.
+- `HF_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the huggingface model artifacts.
+- `SPACY_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the spacy model artifacts.
 - `MAX_SENTENCES`: This variable can be set to a number to limit the number of sentences processed per article. This is useful for testing and debugging. The default is `-1`, which means no limit.
 - `MAX_ARTICLES`: This variable can be set to a number to limit the number of articles processed. This is useful for testing and debugging. The default is `-1`, which means no limit.
+- `LOG_OUTPUT_DIR`: This variable is set to the path of the output folder where the log file is written. The default is the directory from which the docker container is run.
 
-## Sample Docker Run & Compose Setup
+## Sample Docker Compose Setup
 
-Below is a sample docker run command for running the image:
-- the `$(id -u)` is used to run the docker container as the current user so that the output files are not owned by root
-- the `LOG_OUTPUT_DIR="../outputs/"` is different from the docker compose setup as it is relative to the current directory, which for Docker run starts in the `app` folder
-- for git bash on windows the `/${PWD}` is used to get the current directory and the forward slash is important to get the correct path
+Update the environment variables defined under the `entity-extraction-pipeline` service in the `docker-compose.yml` file in the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
 ```bash
-docker run -u $(id -u) -p 5000:5000 -v /${PWD}/data/entity-extraction/raw/original_files/:/inputs/ -v /${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ --env LOG_OUTPUT_DIR="../outputs/" metaextractor-entity-extraction-pipeline:v0.0.2
+docker-compose build
+docker-compose up entity-extraction-pipeline
 ```
 
 Below is a sample docker compose configuration for running the image:
@@ -39,6 +39,8 @@ services:
       - ./data/raw/:/app/inputs/
       - ./data/processed/:/app/outputs/
     environment:
+      - HF_NER_MODEL_NAME=finding-fossils/metaextractor
+      - SPACY_NER_MODEL_NAME=en_metaextractor_spacy
       - USE_NER_MODEL_TYPE=huggingface
       - LOG_OUTPUT_DIR=/app/outputs/
       - MAX_SENTENCES=20
````
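`USE_NER_MODEL_TYPE` switches between the two model repositories that this commit now configures explicitly. A minimal sketch of the selection logic under these assumptions (the function name is illustrative, not the project's actual code; the fallback values are the defaults documented in the README above):

```python
import os

def select_ner_model(env=os.environ) -> str:
    """Pick the NER model repository based on USE_NER_MODEL_TYPE.

    Returns the configured huggingface-hub repo name; actual model
    loading is out of scope for this sketch.
    """
    model_type = env.get("USE_NER_MODEL_TYPE", "huggingface")
    if model_type == "spacy":
        return env.get("SPACY_NER_MODEL_NAME", "en_metaextractor_spacy")
    return env.get("HF_NER_MODEL_NAME", "finding-fossils/metaextractor")
```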
