`docker/data-review-tool/README.md` (14 additions, 11 deletions)
```diff
@@ -2,9 +2,20 @@
 This docker image contains `Finding Fossils`, a data review tool built using Dash and Python. It is used to visualize the outputs of the models and verify the extracted entities for inclusion in the Neotoma Database.
 
-## Docker Compose Setup
+The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The tool assumes the following:
+
+1. A parquet file containing the outputs from the article relevance prediction component.
+2. A zipped file containing the outputs from the named entity extraction component.
+3. Once the articles have been verified, the same parquet file referenced by the environment variable `ARTICLE_RELEVANCE_BATCH` is updated with the entities verified by the steward and the review status of the article.
 
-We first build the docker image to install the required dependencies that can be run using `docker-compose` as follows:
+## Additional Options Enabled by Environment Variables
+
+The following environment variables can be set to change the behavior of the pipeline:
+
+- `ARTICLE_RELEVANCE_BATCH`: the name of the article relevance output parquet file.
+- `ENTITY_EXTRACTION_BATCH`: the name of the entity extraction compressed output file.
+
+## Sample Docker Compose Setup
+
+Update the environment variables defined under the `data-review-tool` service in the `docker-compose.yml` file under the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
@@ … @@
-The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The artifacts required by the data review tool to verify a batch of processed articles are:
-
-- A parquet file containing the outputs from the article relevance prediction component.
-- A zipped file containing the outputs from the named entity extraction component.
-
-### Output
-
-Once the articles have been verified and the container has been destroyed, we update the same parquet file referenced in the `Input` with the extracted (predicted by the model) and verified (corrected by the data steward) entities.
```
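The root `docker-compose.yml` is not part of this diff. Purely as an illustration, a service entry consistent with the README above might look like the following sketch; the service name, the environment variable names, and the `data/data-review-tool` folder come from the README, while the build context, mount target, port, and file names are assumptions:

```yaml
# Hypothetical fragment of the root docker-compose.yml.
# Only the service name, environment variable names, and the
# data/data-review-tool folder are taken from the README; every
# other value here is an illustrative assumption.
services:
  data-review-tool:
    build: ./docker/data-review-tool
    environment:
      ARTICLE_RELEVANCE_BATCH: article_relevance_output.parquet  # assumed file name
      ENTITY_EXTRACTION_BATCH: entity_extraction_output.zip      # assumed file name
    volumes:
      - ./data/data-review-tool:/app/data   # assumed mount target
    ports:
      - "8050:8050"   # Dash's default development port; adjust as needed
```

With such an entry, `docker-compose up --build data-review-tool` would build the image and start the tool.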
`docker/entity-extraction-pipeline/Dockerfile` (4 additions, 4 deletions)
```diff
@@ -18,12 +18,12 @@ COPY src ./src
 # non-root user control inspired from here: https://stackoverflow.com/questions/66349101/docker-non-root-user-does-not-have-writing-permissions-when-using-volumes
 # Create a non-root user that owns the input/outputs directory by default
 RUN useradd -r extraction-user # no specific user ID
-RUN mkdir /inputs && chown extraction-user /inputs
-RUN mkdir /outputs && chown extraction-user /outputs
+RUN mkdir ./inputs && chown extraction-user ./inputs
+RUN mkdir ./outputs && chown extraction-user ./outputs
 
 # Mount the "inputs" and "outputs" folders as volumes
-VOLUME ["/inputs", "/outputs"]
+VOLUME ["./inputs", "./outputs"]
 
 # Set the entry point and command to run the script
```
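The sample `docker run` invocation that the entity-extraction README refers to is not reproduced on this page. Purely as a sketch, a command consistent with the mounted `inputs`/`outputs` folders and the README's notes might look like this; the image tag `entity-extraction-pipeline` is an assumption:

```sh
# Sketch only: the image tag is assumed; the mount targets follow the
# README's /app/inputs and /app/outputs convention.
#   -u "$(id -u)"               runs the container as the current user so
#                               output files are not owned by root
#   LOG_OUTPUT_DIR="../outputs/" is relative to the run directory
#                               (per the README's docker run notes)
#   the leading "/" in /${PWD}  matters for Git Bash on Windows
docker run --rm \
  -u "$(id -u)" \
  -e LOG_OUTPUT_DIR="../outputs/" \
  -v "/${PWD}/inputs:/app/inputs" \
  -v "/${PWD}/outputs:/app/outputs" \
  entity-extraction-pipeline
```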
`docker/entity-extraction-pipeline/README.md` (9 additions, 7 deletions)
```diff
@@ -6,23 +6,23 @@ This docker image contains the models and code required to run entity extraction
 2. The raw input data is mounted as a volume to the docker folder `/app/inputs/`
 3. The expected output location is mounted as a volume to the docker folder `/app/outputs/`
 4. A single JSON file per article is exported into the output folder along with a `.log` file for the processing run.
-5. An environment variable `LOG_OUTPUT_DIR` is set to the path of the output folder and is used to write the log file. The default is the directory from which the docker container is run.
 
 ## Additional Options Enabled by Environment Variables
 
 The following environment variables can be set to change the behavior of the pipeline:
 
 - `USE_NER_MODEL_TYPE`: can be set to `spacy` or `huggingface` to change the NER model used. The default is `huggingface`. This will be used to run batches with each model to evaluate final performance.
+- `HF_NER_MODEL_NAME`: the name of the `huggingface-hub` repository hosting the huggingface model artifacts.
+- `SPACY_NER_MODEL_NAME`: the name of the `huggingface-hub` repository hosting the spacy model artifacts.
 - `MAX_SENTENCES`: can be set to a number to limit the number of sentences processed per article; useful for testing and debugging. The default is `-1`, which means no limit.
 - `MAX_ARTICLES`: can be set to a number to limit the number of articles processed; useful for testing and debugging. The default is `-1`, which means no limit.
+- `LOG_OUTPUT_DIR`: the path of the output folder to write the log file. The default is the directory from which the docker container is run.
 
-## Sample Docker Run & Compose Setup
+## Sample Docker Compose Setup
 
-Below is a sample docker run command for running the image:
-
-- `$(id -u)` is used to run the docker container as the current user so that the output files are not owned by root
-- `LOG_OUTPUT_DIR="../outputs/"` differs from the docker compose value because it is relative to the current directory, which for `docker run` starts in the `app` folder
-- for Git Bash on Windows, `/${PWD}` is used to get the current directory; the leading forward slash is important to get the correct path
+Update the environment variables defined under the `entity-extraction-pipeline` service in the `docker-compose.yml` file under the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
```
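The compose file itself is not included in this diff. A hypothetical `entity-extraction-pipeline` service entry consistent with the variables documented above might look like the following; the service name, variable names, and `/app/inputs`/`/app/outputs` mount points come from the README, while the build context, model repository names, and concrete values are assumptions:

```yaml
# Hypothetical fragment of the root docker-compose.yml -- the service
# name and environment variable names come from the README above; the
# build context, repository names, and values are illustrative.
services:
  entity-extraction-pipeline:
    build: ./docker/entity-extraction-pipeline
    environment:
      USE_NER_MODEL_TYPE: huggingface                 # or "spacy"
      HF_NER_MODEL_NAME: some-org/hf-ner-model        # hypothetical repo name
      SPACY_NER_MODEL_NAME: some-org/spacy-ner-model  # hypothetical repo name
      MAX_SENTENCES: -1   # no limit
      MAX_ARTICLES: -1    # no limit
      LOG_OUTPUT_DIR: /app/outputs/
    volumes:
      - ./inputs:/app/inputs
      - ./outputs:/app/outputs
```

With such an entry, `docker-compose up --build entity-extraction-pipeline` would build the image and process the mounted articles.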