Commit d6cb366

Merge pull request #52 from NeotomaDB/14-fine-tune-spacy-ner-model
Retraining script for spacy
2 parents 3accaa9 + e8001df commit d6cb366

10 files changed

Lines changed: 316 additions & 493 deletions
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# SpaCy Transformer Training & Evaluation

This folder contains the training and evaluation scripts for the SpaCy Transformer based NER model. The scripts are based on the SpaCy CLI scripts and have been modified to work with the Label Studio output format. More information can be found [here](https://spacy.io/usage/training).

**Table of Contents**

- [SpaCy Transformer Training \& Evaluation](#spacy-transformer-training--evaluation)
  - [Training Workflow](#training-workflow)
  - [Evaluation Workflow](#evaluation-workflow)
  - [Overall Process Diagram](#overall-process-diagram)
  - [How to Run Training on Free Google Colab with GPU](#how-to-run-training-on-free-google-colab-with-gpu)

## Training Workflow

A bash script is used to initialize a training job. Model training is fully customizable, and users are encouraged to update the parameters in the `run_spacy_training.sh` and `spacy_transformer_train.cfg` files prior to training. The training workflow is as follows:

1. Create a new data directory and add all the TXT files (containing annotations in JSONLines format) exported from Label Studio.
2. Most parameters can be left at their default values. Open the `run_spacy_training.sh` bash script and update the following fields with absolute paths or paths relative to the root of the repository:
   - `DATA_PATH`: path to the directory with the Label Studio labelled data.
   - `DATA_OUTPUT_PATH`: path to the directory to store the split dataset (train/val/test) as well as other data artifacts required for training.
   - `MODEL_PATH`: if retraining, the path to the model artifacts; if training a model from scratch, pass an empty string `""`.
   - `MODEL_OUTPUT_PATH`: path to store the new model artifacts.
   - `VERSION`: can be updated to keep track of different training runs.
   - `--gpu-id`: while executing the `spacy train` command, a GPU can be used, if available, by setting this flag to **0**.
3. Make the training script executable:
   ```bash
   chmod +x src/entity_extraction/training/spacy_ner/run_spacy_training.sh
   ```
4. Execute the training script from the root of the repository:
   ```bash
   ./src/entity_extraction/training/spacy_ner/run_spacy_training.sh
   ```
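
The annotation files from step 1 are JSONLines: one JSON object per line. A minimal sketch of reading such a file with only the standard library (the `text`/`label` field names here are illustrative, not necessarily the exact keys Label Studio emits):

```python
import json

def read_jsonlines(path):
    """Parse a JSONLines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```

Each record can then be inspected or converted into the binary `.spacy` format by the preprocessing scripts.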

## Evaluation Workflow

To run a full evaluation of the trained model and get detailed metrics and plots, follow these steps:

1. Open the `run_evaluation.sh` bash script and update the following fields:
   1. `MODEL_NAME` - the name assigned to the run and used in the results file names.
   2. `MODEL_PATH` - the location of the trained model files.
   3. `OUTPUT_DIR` - the location to save the evaluation results.
   4. `DATA_DIR` - the location of the JSON file containing the split train/test/val Label Studio output data.
   5. `GPU` - whether to use a GPU or not.
2. Make the evaluation script executable:
   ```bash
   chmod +x src/entity_extraction/training/spacy_ner/run_evaluation.sh
   ```
3. Run the evaluation script; the results will be generated in the `OUTPUT_DIR` folder. **This may take a while, even on GPU.**
   ```bash
   ./src/entity_extraction/training/spacy_ner/run_evaluation.sh
   ```
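
The detailed metrics produced are standard entity-level scores. As a rough illustration of what precision/recall/F1 mean for NER (a sketch of the idea, not the repository's actual evaluation code; the entity labels shown are hypothetical):

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1.

    gold, pred: sets of (start, end, label) tuples for one document.
    An entity only counts as correct on an exact span + label match.
    """
    tp = len(gold & pred)   # exact matches
    fp = len(pred - gold)   # predicted but not in gold
    fn = len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Exact-match scoring is strict: an entity with the right label but off-by-one span boundaries counts as both a false positive and a false negative.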

## Overall Process Diagram

```mermaid
%%| label: training_pipeline
%%| fig-cap: "This is how the Entity Extraction model training process runs with intermediate files and processes."
%%| fig-height: 6
%%{init: {'theme':'base','themeVariables': {'fontFamily': 'arial','primaryColor': '#BFDFFF','primaryTextColor': '#000','primaryBorderColor': '#4C75A3','lineColor': '#000','secondaryColor': '#006100','tertiaryColor': '#fff'}, 'flowchart' : {'curve':'monotoneY'}}}%%
flowchart TD
    F8(Labelled JSON files<br>from LabelStudio) --> C2(spacy_preprocess.py<br>Split into Train/Val/Test<br>Sets by xDD ID)
    C2 --> F1(test set)
    C2 --> F3(val set)
    C2 --> F2(train set)
    F3 --> C4
    F2 --> C4(run_spacy_training.sh<br>Run Spacy Model Training)
    F1 ----> C3(run_evaluation.sh<br>Run Model Evaluation)
    F3 --> C3
    C4 --> F7(Log Metrics &\nCheckpoints)
    C4 --> F4(Log Final\nTrained Model)
    F4 --> C3
    C3 --> F6(Evaluation<br>Plots)
    C3 --> F5(Evaluation results\nJSON)

    subgraph Legend
        computation[Computation]
        fileOutput[File Input/Output]
        computation ~~~ fileOutput
        style computation fill:#BFDFFF, stroke:#4C75A3
        style fileOutput fill:#d3d3d3, stroke:#808080
    end

    %% create a class for styling the nodes
    classDef compute_nodes fill:#BFDFFF, stroke:#4C75A3,stroke-width:2px;
    classDef file_nodes fill:#d3d3d3, stroke:#808080,stroke-width:2px;

    class F1,F2,F3,F4,F5,F6,F7,F8 file_nodes;
    class C2,C3,C4,C5 compute_nodes;
```
86+
87+
## How to Run Training on Free Google Colab with GPU
88+
89+
This notebook sets up the NER model training on Google Colab with GPU. Use the following steps to create the setup/folder structure and run the training. The free level of Colab does not allow CLI so a notebook is used to start the training.
90+
91+
1. Create a folder in your Google Drive and name it the name of your training run (e.g. `spacy-transformer-v1`)
92+
2. Upload the entire `src` folder from the repo into the folder you just created
93+
3. Create a `data` folder inside the folder you just created and upload the `train.spacy` and `val.spacy` files into it
94+
4. Create a `models` folder, this is where checkpoints will be saved during training
95+
5. Create an `evaluation-results` folder, this is where the evaluation results will be saved
96+
6. Create a copy of the `run_spacy_training.sh` and `run_evaluation.sh` files from `src/entity_extraction/training/spacy_ner` and place it in training run folder
97+
7. Your folder structure should now look like:
98+
```
99+
spacy-transformer-v1
100+
├── data
101+
│ ├── train.spacy
102+
│ └── val.spacy
103+
├── models
104+
├── evaluation-results
105+
├── src
106+
├── colab_start_training.ipynb
107+
├── run_evaluation.sh
108+
└── run_spacy_training.sh
109+
```
110+
8. Open the `run_spacy_training.sh` and `run_evaluation.sh` files and change each of the variables/paths to match your current setup. (Note: Google Colab expects absolute paths in the both the files)
111+
9. Open the `colab_start_training.ipynb` file and run the cells to start training.
112+
10. Model files and checkpoints will be saved in the `models` folder and evaluation results will be saved in the `evaluation-results` folder.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
`colab_start_training.ipynb` — a Colab notebook (GPU runtime, T4) with the following code cells:

```python
# Cell 1: mount Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

# Cell 2: install a CUDA 11.3 build of PyTorch
!pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Cell 3: install spaCy with CUDA and transformers extras
!pip install -U spacy[cuda113,transformers]

# Cell 4: point CUDA_PATH at the Colab CUDA install
!export CUDA_PATH="/opt/nvidia/cuda"

# Cell 5: set to where the fine-tuning job folder is setup
import os
os.chdir("/content/gdrive/MyDrive/path")

# Cell 6: start training
!bash ./run_spacy_training.sh

# Cell 7: start evaluation
!bash ./run_evaluation.sh
```
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Author: Jenit Jain
# Date: 2023-06-21
# Inspired from https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_update/scripts/create_config.py
"""This script creates a config file when resuming training of a spacy model from a past checkpoint

Usage: create_config.py --model_path=<model_path> --output_path=<output_path>

Options:
--model_path=<model_path>       The path to the model artifacts.
--output_path=<output_path>     The path to the output config file.
"""

from docopt import docopt
import spacy


def create_config(model_path: str, output_path: str):
    """
    Loads a model's config and updates the component sources to resume training

    Parameters
    ----------
    model_path: str
        Path to the model artifacts to resume training
    output_path: str
        Output path to store the updated configuration file.
    """
    spacy.require_cpu()
    nlp = spacy.load(model_path)

    # create a new config as a copy of the loaded pipeline's config
    config = nlp.config.copy()

    # source the ner and transformer components from the loaded pipeline
    # so that training resumes from the checkpoint weights
    config["components"]["ner"] = {
        "source": model_path,
    }
    config["components"]["transformer"] = {
        "source": model_path,
    }
    # save the config
    config.to_disk(output_path)


if __name__ == "__main__":
    opt = docopt(__doc__)
    create_config(opt["--model_path"], opt["--output_path"])
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/env sh

# if running with conda envs, comment out if not
conda activate fossil_lit

echo python.__version__ = $(python -c 'import sys; print(sys.version)')
# ensure we're in the MetaExtractor root directory
echo "Current working directory: $(pwd)"

DATA_PATH="/path/to/sample input folder"
DATA_OUTPUT_PATH="/path/to/sample output folder"
MODEL_PATH="/path/to/model artifacts"
MODEL_OUTPUT_PATH="/path/to/new model artifacts"
VERSION="v1"
TRAIN_SPLIT=0.7
VAL_SPLIT=0.15
TEST_SPLIT=0.15

# remove any stale config left over from a previous run with the same version
rm -f src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

# path variables are quoted so directories with spaces are handled correctly
python3 src/preprocessing/labelling_data_split.py \
        --raw_label_path "$DATA_PATH" \
        --output_path "$DATA_OUTPUT_PATH" \
        --train_split $TRAIN_SPLIT \
        --val_split $VAL_SPLIT \
        --test_split $TEST_SPLIT

python3 src/preprocessing/spacy_preprocess.py \
        --data_path "$DATA_OUTPUT_PATH" \
        --train_split $TRAIN_SPLIT \
        --val_split $VAL_SPLIT \
        --test_split $TEST_SPLIT

if [ -z "$MODEL_PATH" ]; then
    # If the model path is empty, start training from scratch

    # Fill configuration with required fields
    python -m spacy init fill-config \
        src/entity_extraction/training/spacy_ner/spacy_transformer_train.cfg \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

    # Execute the training job by pointing to the new config file
    python -m spacy train \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
        --paths.train "$DATA_OUTPUT_PATH"/train.spacy \
        --paths.dev "$DATA_OUTPUT_PATH"/val.spacy \
        --output "$MODEL_OUTPUT_PATH" \
        --gpu-id -1

else
    # Otherwise create a new config file to resume training from the checkpoint
    python src/entity_extraction/training/spacy_ner/create_config.py \
        --model_path "$MODEL_PATH" \
        --output_path src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

    python -m spacy train \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
        --paths.train "$DATA_OUTPUT_PATH"/train.spacy \
        --paths.dev "$DATA_OUTPUT_PATH"/val.spacy \
        --components.ner.source "$MODEL_PATH" \
        --components.transformer.source "$MODEL_PATH" \
        --output "$MODEL_OUTPUT_PATH" \
        --gpu-id -1
fi
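
The `TRAIN_SPLIT`/`VAL_SPLIT`/`TEST_SPLIT` values give a 70/15/15 split, and the process diagram notes the split is done by xDD document ID so that chunks from one paper never straddle train and test. A minimal sketch of that idea (a hypothetical helper, not the repository's actual `labelling_data_split.py`):

```python
import random

def split_by_doc_id(doc_ids, train=0.7, val=0.15, seed=42):
    """Split document IDs into train/val/test sets so that all chunks
    from a given document land in the same split."""
    ids = sorted(set(doc_ids))          # dedupe; sort for reproducibility
    random.Random(seed).shuffle(ids)    # seeded shuffle before slicing
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (
        set(ids[:n_train]),                  # train
        set(ids[n_train:n_train + n_val]),   # val
        set(ids[n_train + n_val:]),          # test (the remainder)
    )
```

Splitting at the document level, rather than per annotation chunk, avoids leakage of near-duplicate text between the training and evaluation sets.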
