Commit b76378d

enhancement: train spacy model from scratch
1 parent d761fcc commit b76378d

2 files changed

+15
-34
lines changed

src/entity_extraction/training/spacy_ner/README.md

Lines changed: 1 addition & 2 deletions
@@ -12,11 +12,10 @@ This folder contains the training and evaluation scripts for the SpaCy Transform
 ## Training Workflow
 
 A bash script is used to initialize a training job. Model training is fully customizable and users are encouraged to update the parameters in the `run_spacy_training.sh` and `spacy_transfomer_train.cfg` files prior to training. The training workflow is as follows:
-1. Create a new data directory and dump all the TXT files (contains annotations in the JSONLines format) from Label Studio.
+1. Create a new data directory and dump all the JSON files containing annotations from Label Studio and any reviewed parquet files.
 2. Most parameters can be used with the default value, open the `run_spacy_training.sh` bash script and update the following fields with absolute paths or relative paths from the root of the repository:
 - `DATA_PATH`: path to directory with Label Studio labelled data
 - `DATA_OUTPUT_PATH`: path to directory to store the split dataset (train/val/test) as well as other data artifacts required for training.
-- `MODEL_PATH`: If retraining, specify path to model artifacts. If training a model from scratch, pass empty string `""`
 - `MODEL_OUTPUT_PATH`: path to store new model artifacts
 - `VERSION`: Version can be updated to keep track of different training runs.
 - `--gpu-id`: While executing the `spacy train` command, GPU can be used, if available, by setting this flag to **0**.
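
The fields listed above could be filled in roughly as follows. This is a minimal sketch, not the repository's actual settings: every path and the `v2` version string are hypothetical placeholders, and the `${VAR:?message}` guards are an added suggestion rather than part of `run_spacy_training.sh`.

```shell
#!/bin/sh
# Sketch of the settings block; all values below are hypothetical placeholders.
DATA_PATH="data/label_studio_export"     # Label Studio JSON and reviewed parquet files
DATA_OUTPUT_PATH="data/ner_split"        # train/val/test split and other artifacts
MODEL_OUTPUT_PATH="models/spacy_ner"     # where new model artifacts are stored
VERSION="v2"                             # label for this training run

# Stop early with a message if a required setting is empty or unset.
: "${DATA_PATH:?set DATA_PATH to the labelled data directory}"
: "${DATA_OUTPUT_PATH:?set DATA_OUTPUT_PATH to the split dataset directory}"
: "${MODEL_OUTPUT_PATH:?set MODEL_OUTPUT_PATH to the model artifact directory}"
: "${VERSION:?set VERSION to a label for this run}"

# The training script names the filled config after VERSION.
CONFIG_OUT="src/entity_extraction/training/spacy_ner/spacy_transformer_${VERSION}.cfg"
echo "Filled config will be written to: $CONFIG_OUT"
```

Relative paths such as these are resolved from the root of the repository, as the README notes.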

src/entity_extraction/training/spacy_ner/run_spacy_training.sh

Lines changed: 14 additions & 32 deletions
@@ -9,7 +9,6 @@ echo "Current working directory: $(pwd)"
 
 DATA_PATH="/path/to/sample input folder"
 DATA_OUTPUT_PATH="/path/to/sample output folder"
-MODEL_PATH="/path/to/model artifacts"
 MODEL_OUTPUT_PATH="/path/to/new model artifacts"
 VERSION="v1"
 TRAIN_SPLIT=0.7
@@ -28,34 +27,17 @@ python3 src/preprocessing/labelling_data_split.py \
 
 python3 src/preprocessing/spacy_preprocess.py --data_path $DATA_OUTPUT_PATH
 
-if [ -z "$MODEL_PATH" ]; then
-# If the model path is null, then start training from scratch
-
-# Fill configuration with required fields
-python -m spacy init fill-config \
-src/entity_extraction/training/spacy_ner/spacy_transformer_train.cfg \
-src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg
-
-# Execute the training job by pointing to the new config file
-python -m spacy train \
-src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
---paths.train $DATA_OUTPUT_PATH/train.spacy \
---paths.dev $DATA_OUTPUT_PATH/val.spacy \
---output $MODEL_OUTPUT_PATH \
---gpu-id -1
-
-else
-# Else create a new config file to resume training
-python src/entity_extraction/training/spacy_ner/create_config.py \
---model_path $MODEL_PATH \
---output_path src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg
-
-python -m spacy train \
-src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
---paths.train $DATA_OUTPUT_PATH/train.spacy \
---paths.dev $DATA_OUTPUT_PATH/val.spacy \
---components.ner.source $MODEL_PATH \
---components.transformer.source $MODEL_PATH \
---output $MODEL_OUTPUT_PATH \
---gpu-id -1
-fi
+# Start training from scratch
+
+# Fill configuration with required fields
+python -m spacy init fill-config \
+src/entity_extraction/training/spacy_ner/spacy_transformer_train.cfg \
+src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg
+
+# Execute spacy CLI training
+python -m spacy train \
+src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
+--paths.train $DATA_OUTPUT_PATH/train.spacy \
+--paths.dev $DATA_OUTPUT_PATH/val.spacy \
+--output $MODEL_OUTPUT_PATH \
+--gpu-id -1
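
Review note: the placeholder paths in this script contain spaces, and the training call expands `$DATA_OUTPUT_PATH` and `$MODEL_OUTPUT_PATH` unquoted, so the shell would split such paths into several arguments. Below is a minimal sketch of a quoted variant of the same `spacy train` call, offered as a suggestion rather than as part of the commit. The `DRY_RUN` switch and the `CONFIG`, `TRAIN_ARGC`, and `TRAIN_DATA_ARG` names are hypothetical additions for illustration. With `DRY_RUN` left at its default, the command is only printed.

```shell
#!/bin/sh
# Sketch only: a quoted variant of the training call. DRY_RUN is a hypothetical switch.
set -eu

DATA_OUTPUT_PATH="/path/to/sample output folder"   # placeholder from the script
MODEL_OUTPUT_PATH="/path/to/new model artifacts"   # placeholder from the script
VERSION="v1"
CONFIG="src/entity_extraction/training/spacy_ner/spacy_transformer_${VERSION}.cfg"

# Build the argument list in the positional parameters; each quoted
# expansion stays a single argument even when the path contains spaces.
set -- python -m spacy train "$CONFIG" \
    --paths.train "$DATA_OUTPUT_PATH/train.spacy" \
    --paths.dev "$DATA_OUTPUT_PATH/val.spacy" \
    --output "$MODEL_OUTPUT_PATH" \
    --gpu-id -1

TRAIN_ARGC=$#        # number of arguments after quoting
TRAIN_DATA_ARG=$7    # the value passed to --paths.train

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print one bracketed argument per line instead of running spaCy.
    printf '[%s]\n' "$@"
else
    "$@"
fi
```

Running with `DRY_RUN=0` executes the command for real. Per the README, changing `--gpu-id -1` to `--gpu-id 0` uses the GPU if one is available.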
