
Commit 44fb392
Remove args and add docs
1 parent a18a06c commit 44fb392
4 files changed: 32 additions & 28 deletions

docker-compose.yml

Lines changed: 5 additions & 6 deletions
```diff
@@ -14,16 +14,15 @@ services:
     build:
       dockerfile: ./docker/entity-extraction-pipeline/Dockerfile
       context: .
-      args:
-        MAX_SENTENCES: 20
-        MAX_ARTICLES: 1
     ports:
       - "5000:5000"
     volumes:
-      - ./data/entity-extraction/raw/original_files/:/inputs/
-      - ./data/entity-extraction/processed/processed_articles/:/outputs/
+      - ./data/entity-extraction/raw/original_files/:/app/inputs/
+      - ./data/entity-extraction/processed/processed_articles/:/app/outputs/
     environment:
       - HF_NER_MODEL_NAME=finding-fossils/metaextractor
       - SPACY_NER_MODEL_NAME=en_metaextractor_spacy
       - USE_NER_MODEL_TYPE=huggingface
-      - LOG_OUTPUT_DIR=/outputs/
+      - LOG_OUTPUT_DIR=/app/outputs/
+      - MAX_SENTENCES=20
+      - MAX_ARTICLES=1
```
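The build-time `args` removed above become runtime environment variables, so the limits can be changed without rebuilding the image. A minimal sketch of how the pipeline might read them (the helper names here are illustrative, not the project's actual code; the `-1` "no limit" default matches the pipeline README):

```python
import os

def get_limit(name: str, default: int = -1) -> int:
    """Read an integer limit from the environment; -1 means no limit."""
    return int(os.environ.get(name, default))

def apply_limit(items: list, limit: int) -> list:
    """Truncate items unless the limit is -1 (unlimited)."""
    return list(items) if limit == -1 else list(items)[:limit]

# With the compose file above these would resolve to 20 and 1.
max_sentences = get_limit("MAX_SENTENCES")
max_articles = get_limit("MAX_ARTICLES")
```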

docker/data-review-tool/README.md

Lines changed: 14 additions & 11 deletions
````diff
@@ -2,9 +2,20 @@
 
 This docker image contains `Finding Fossils`, a data review tool built using Dash, Python. It is used to visualize the outputs of the models and verify the extracted entities for inclusion in the Neotoma Database.
 
-## Docker Compose Setup
+The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The tool assumes the following:
+1. A parquet file containing the outputs from the article relevance prediction component.
+2. A zipped file containing the outputs from the named entity extraction component.
+3. Once the articles have been verified, the same parquet file, referenced via the environment variable `ARTICLE_RELEVANCE_BATCH`, is updated with the entities verified by the steward and the review status of the article.
 
-We first build the docker image to install the required dependencies that can be run using `docker-compose` as follows:
+## Additional Options Enabled by Environment Variables
+
+The following environment variables can be set to change the behavior of the pipeline:
+- `ARTICLE_RELEVANCE_BATCH`: the name of the article relevance output parquet file.
+- `ENTITY_EXTRACTION_BATCH`: the name of the entity extraction compressed output file.
+
+## Sample Docker Compose Setup
+
+Update the environment variables defined under the `data-review-tool` service in the `docker-compose.yml` file in the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
 ```bash
 docker-compose build
 docker-compose up data-review-tool
@@ -21,13 +32,5 @@ services:
     ports:
       - "8050:8050"
     volumes:
-      - ./data/data-review-tool:/MetaExtractor/data/data-review-tool
+      - ./data/data-review-tool:/MetaExtractor/inputs
 ```
-
-### Input
-The expected inputs are mounted onto the newly created container as volumes and can be dumped in the `data/data-review-tool` folder. The artifacts required by the data review tool to verify a batch of processed articles are:
-- A parquet file containing the outputs from the article relevance prediction component.
-- A zipped file containing the outputs from the named entity extraction component.
-
-### Output
-Once the articles have been verified and the container has been destroyed, we update the same parquet file referenced in the `Input` with the extracted (predicted by the model) and verified (corrected by the data steward) entities.
````
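The new `ARTICLE_RELEVANCE_BATCH` and `ENTITY_EXTRACTION_BATCH` variables name the batch files the review tool looks for inside the mounted `/MetaExtractor/inputs` folder. A sketch of how the tool might resolve them (the function name and the fallback file names are hypothetical, not the project's actual defaults):

```python
import os

def resolve_batch_paths(input_dir: str = "/MetaExtractor/inputs", env=os.environ):
    """Resolve the two batch artifacts the review tool expects.

    File names come from the environment variables documented above;
    the fallback names here are illustrative placeholders only.
    """
    article = env.get("ARTICLE_RELEVANCE_BATCH", "article_relevance.parquet")
    entities = env.get("ENTITY_EXTRACTION_BATCH", "entity_extraction.zip")
    return os.path.join(input_dir, article), os.path.join(input_dir, entities)
```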

docker/entity-extraction-pipeline/Dockerfile

Lines changed: 4 additions & 4 deletions
```diff
@@ -18,12 +18,12 @@ COPY src ./src
 # non-root user control inspired from here: https://stackoverflow.com/questions/66349101/docker-non-root-user-does-not-have-writing-permissions-when-using-volumes
 # Create a non-root user that owns the input/outputs directory by default
 RUN useradd -r extraction-user # no specific user ID
-RUN mkdir /inputs && chown extraction-user /inputs
-RUN mkdir /outputs && chown extraction-user /outputs
+RUN mkdir ./inputs && chown extraction-user ./inputs
+RUN mkdir ./outputs && chown extraction-user ./outputs
 # Mount the "inputs" and "outputs" folders as volumes
-VOLUME ["/inputs", "/outputs"]
+VOLUME ["./inputs", "./outputs"]
 
 # Set the entry point and command to run the script
 USER extraction-user
 RUN ls -alp /app
-ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py --article_text_path /inputs/ --output_path /outputs/
+ENTRYPOINT python src/pipeline/entity_extraction_pipeline.py --article_text_path ./inputs/ --output_path ./outputs/
```

docker/entity-extraction-pipeline/README.md

Lines changed: 9 additions & 7 deletions
````diff
@@ -6,23 +6,23 @@ This docker image contains the models and code required to run entity extraction
 2. The raw input data is mounted as a volume to the docker folder `/app/inputs/`
 3. The expected output location is mounted as a volume to the docker folder `/app/outputs/`
 4. A single JSON file per article is exported into the output folder along with a `.log` file for the processing run.
-5. An environment variable `LOG_OUTPUT_DIR` is set to the path of the output folder. This is used to write the log file. The default is the directory from which the docker container is run.
 
 ## Additional Options Enabled by Environment Variables
 
 The following environment variables can be set to change the behavior of the pipeline:
 - `USE_NER_MODEL_TYPE`: This variable can be set to `spacy` or `huggingface` to change the NER model used. The default is `huggingface`. This will be used to run batches with each model to evaluate final performance.
+- `HF_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the huggingface model artifacts.
+- `SPACY_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the spacy model artifacts.
 - `MAX_SENTENCES`: This variable can be set to a number to limit the number of sentences processed per article. This is useful for testing and debugging. The default is `-1`, which means no limit.
 - `MAX_ARTICLES`: This variable can be set to a number to limit the number of articles processed. This is useful for testing and debugging. The default is `-1`, which means no limit.
+- `LOG_OUTPUT_DIR`: This variable is set to the path of the output folder where the log file is written. The default is the directory from which the docker container is run.
 
-## Sample Docker Run & Compose Setup
+## Sample Docker Compose Setup
 
-Below is a sample docker run command for running the image:
-- the `$(id -u)` is used to run the docker container as the current user so that the output files are not owned by root
-- the `LOG_OUTPUT_DIR="../outputs/"` is different from the docker compose setup as it is relative to the current directory, which for Docker run starts in the `app` folder
-- for git bash on windows the `/${PWD}` is used to get the current directory and the forward slash is important to get the correct path
+Update the environment variables defined under the `entity-extraction-pipeline` service in the `docker-compose.yml` file in the root directory. Then build and run the docker image to install the required dependencies using `docker-compose` as follows:
 ```bash
-docker run -u $(id -u) -p 5000:5000 -v /${PWD}/data/entity-extraction/raw/original_files/:/inputs/ -v /${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ --env LOG_OUTPUT_DIR="../outputs/" metaextractor-entity-extraction-pipeline:v0.0.2
+docker-compose build
+docker-compose up entity-extraction-pipeline
 ```
 
 Below is a sample docker compose configuration for running the image:
@@ -39,6 +39,8 @@ services:
       - ./data/raw/:/app/inputs/
       - ./data/processed/:/app/outputs/
     environment:
+      - HF_NER_MODEL_NAME=finding-fossils/metaextractor
+      - SPACY_NER_MODEL_NAME=en_metaextractor_spacy
       - USE_NER_MODEL_TYPE=huggingface
       - LOG_OUTPUT_DIR=/app/outputs/
       - MAX_SENTENCES=20
````
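`USE_NER_MODEL_TYPE` switches between the two model repositories that this commit now configures explicitly. A minimal sketch of the selection logic under these assumptions (the function name is illustrative, not the project's actual code; the fallback values are the defaults documented in the README above):

```python
import os

def select_ner_model(env=os.environ) -> str:
    """Pick the NER model repository based on USE_NER_MODEL_TYPE.

    Returns the configured huggingface-hub repo name; actual model
    loading is out of scope for this sketch.
    """
    model_type = env.get("USE_NER_MODEL_TYPE", "huggingface")
    if model_type == "spacy":
        return env.get("SPACY_NER_MODEL_NAME", "en_metaextractor_spacy")
    return env.get("HF_NER_MODEL_NAME", "finding-fossils/metaextractor")
```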
