This Docker image contains the models and code required to run entity extraction from research articles on the xDD system. It assumes the following:

- The raw text is input in the `nlp352` TSV format, with either a single article per file or multiple articles denoted by GDD ID, like this sample data from xDD: Link to Sample Data
- The raw input data is mounted as a volume to the Docker folder `/app/inputs/`
- The expected output location is mounted as a volume to the Docker folder `/app/outputs/`
- A single JSON file per article is exported into the output folder, along with a `.log` file for the processing run.
The following environment variables can be set to change the behavior of the pipeline:
- `USE_NER_MODEL_TYPE`: Can be set to `spacy` or `huggingface` to change the NER model used. The default is `huggingface`. This will be used to run batches with each model to evaluate final performance.
- `HF_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the HuggingFace model artifacts.
- `SPACY_NER_MODEL_NAME`: The name of the `huggingface-hub` repository hosting the spaCy model artifacts.
- `MAX_SENTENCES`: Limits the number of sentences processed per article; useful for testing and debugging. The default is `-1`, which means no limit.
- `MAX_ARTICLES`: Limits the number of articles processed; useful for testing and debugging. The default is `-1`, which means no limit.
- `LOG_OUTPUT_DIR`: The path of the output folder to write the log file. The default is the directory from which the Docker container is run.
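As a sketch, the variables above can be combined into a single `docker run` invocation for a quick smoke test; the volume paths and image tag mirror the ones used elsewhere in this README, and the `spacy` model choice is just an example:

```shell
# Build the smoke-test command into a variable so it can be inspected before
# running. MAX_SENTENCES/MAX_ARTICLES keep the run small; LOG_OUTPUT_DIR sends
# the log file to the mounted output volume.
CMD="docker run -u $(id -u) \
  -v ${PWD}/data/entity-extraction/raw/original_files/:/inputs/ \
  -v ${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ \
  --env USE_NER_MODEL_TYPE=spacy \
  --env MAX_SENTENCES=20 \
  --env MAX_ARTICLES=1 \
  --env LOG_OUTPUT_DIR=/outputs/ \
  metaextractor-entity-extraction-pipeline:v0.0.3"
echo "$CMD"
```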
The Docker image must be able to run without root permissions. To test that this is correctly set up, run the following command and ensure it completes without error:
```
docker run -u $(id -u) -p 5000:5000 -v /${PWD}/data/entity-extraction/raw/original_files/:/inputs/ -v /${PWD}/data/entity-extraction/processed/processed_articles/:/outputs/ --env LOG_OUTPUT_DIR="../outputs/" metaextractor-entity-extraction-pipeline:v0.0.3
```

Details:
- `$(id -u)` runs the container as the current user so that the output files are not owned by root.
- `LOG_OUTPUT_DIR="../outputs/"` differs from the Docker Compose setting because it is relative to the current directory, which for `docker run` starts in the `app` folder.
- For Git Bash on Windows, `/${PWD}` is used to get the current directory; the leading forward slash is important to get the correct path.
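A quick way to confirm the non-root setup worked is to check that nothing in the output folder is owned by root after a run; this is a sketch assuming the recommended output path:

```shell
# Count root-owned files in the mounted output folder; a correctly configured
# run (with -u $(id -u)) should leave this at zero. The 2>/dev/null keeps the
# check quiet if the folder does not exist yet.
OUT_DIR=./data/entity-extraction/processed/processed_articles/
ROOT_OWNED=$(find "$OUT_DIR" -user root 2>/dev/null | wc -l)
echo "root-owned files: $ROOT_OWNED"
```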
Update the environment variables defined under the entity-extraction-pipeline service in the docker-compose.yml file under the root directory. The volume paths are:
- `INPUT_FOLDER`: The folder containing the raw text `nlp352` TSV file, e.g. `./data/entity-extraction/raw/original_files/` (recommended)
- `OUTPUT_FOLDER`: The folder to dump the final JSON files, e.g. `./data/entity-extraction/processed/processed_articles/` (recommended)
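Rather than hard-coding the paths, Docker Compose can also substitute them from a `.env` file placed next to `docker-compose.yml` (using `${INPUT_FOLDER}`-style references in the volume entries); a minimal sketch with the recommended paths:

```shell
# .env — read automatically by docker-compose for variable substitution
INPUT_FOLDER=./data/entity-extraction/raw/original_files/
OUTPUT_FOLDER=./data/entity-extraction/processed/processed_articles/
```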
Then build and run the docker image to install the required dependencies using docker-compose as follows:
```
docker-compose build
docker-compose up entity-extraction-pipeline
```

Below is a sample Docker Compose configuration for running the image:
```yaml
version: "3.8"
services:
  entity-extraction-pipeline:
    image: metaextractor-entity-extraction-pipeline:v0.0.3
    build:
      ...
    ports:
      - "5000:5000"
    volumes:
      - ./data/entity-extraction/raw/<INPUT_FOLDER>:/inputs/
      - ./data/entity-extraction/processed/<OUTPUT_FOLDER>:/outputs/
    environment:
      - USE_NER_MODEL_TYPE=huggingface
      - LOG_OUTPUT_DIR=/outputs/
      - MAX_SENTENCES=20
      - MAX_ARTICLES=1
```

To push the Docker image to Docker Hub, first log in using the following command:
```
docker login
```

Then tag the Docker image with the following two commands:
```
# to update the "latest" tag image
docker tag metaextractor-entity-extraction-pipeline:v<VERSION NUMBER> <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline
# to upload a specific version tagged image
docker tag metaextractor-entity-extraction-pipeline:v<VERSION NUMBER> <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline:v<VERSION NUMBER>
```

Finally, push the Docker image to Docker Hub using the following command:
```
docker push <DOCKER HUB USER ID>/metaextractor-entity-extraction-pipeline
```
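The tag-and-push sequence above can be wrapped in a small helper so the version number is typed only once; `VERSION` and `DOCKER_ID` below are placeholder values for illustration, and the commands are echoed rather than executed:

```shell
# Derive both tag targets from a single VERSION value. Replace DOCKER_ID with
# your Docker Hub user ID before running the echoed commands.
VERSION=0.0.3
DOCKER_ID=your-docker-hub-id
LOCAL_IMAGE="metaextractor-entity-extraction-pipeline:v${VERSION}"
REMOTE_IMAGE="${DOCKER_ID}/metaextractor-entity-extraction-pipeline"
echo "docker tag ${LOCAL_IMAGE} ${REMOTE_IMAGE}"
echo "docker tag ${LOCAL_IMAGE} ${REMOTE_IMAGE}:v${VERSION}"
echo "docker push ${REMOTE_IMAGE}"
```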