File: src/entity_extraction/preprocessing/README.md (+2 -2: 2 additions & 2 deletions)
@@ -94,7 +94,7 @@ This script takes labelled dataset in JSONLines format as input and splits it in
 The resulting train, validation, and test sets can be used for training and evaluating machine learning models.

 #### **Options**

-- `--raw_label_path=<raw_label_path>`: Specify the path to the directory where the raw label files are located.
+- `--raw_label_path=<raw_label_path>`: Specify the path to the directory where the raw label files exported from LabelStudio and the parquet files containing the reviewed entities are located.

 - `--output_path=<output_path>`: Specify the path to the directory where the output files will be written.

@@ -126,4 +126,4 @@ This script manages the creation of custom data artifacts required for training
 4. Creates the custom data artifacts that can be used for training or fine-tuning spaCy models.

 #### **Options**

-- `--data_path=<data_path>`: Specify the path to the folder containing files in JSONLines format.
+- `--data_path=<data_path>`: Specify the path to the folder containing JSON files in txt/json format.
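
The options above describe a directory that mixes Label Studio exports (txt/json) with reviewed parquet files, and a script that splits the labelled data into train, validation, and test sets. The following is a minimal sketch of that idea, not the repository's actual implementation: the function names `find_label_files` and `split_items`, the accepted extensions, the split fractions, and the seed are all assumptions made for illustration.

```python
import random
from pathlib import Path


def find_label_files(raw_label_path, extensions=(".json", ".txt", ".parquet")):
    """Group the files directly under raw_label_path by extension.

    The extension list is an assumption based on the option descriptions;
    files with any other extension are ignored.
    """
    found = {ext: [] for ext in extensions}
    for path in sorted(Path(raw_label_path).iterdir()):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in found:
            found[suffix].append(path)
    return found


def split_items(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    Whatever is left after the train and validation slices becomes the
    test set, so every item lands in exactly one split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

A caller might pass the directory given as `--raw_label_path` to `find_label_files` and then split the resulting `.json` and `.txt` paths with `split_items`; how the real script combines the parquet files with the raw labels is not shown in this diff.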
File: src/entity_extraction/training/spacy/README.md (+1 -2: 1 addition & 2 deletions)
@@ -12,11 +12,10 @@ This folder contains the training and evaluation scripts for the SpaCy Transform
 ## Training Workflow

 A bash script is used to initialize a training job. Model training is fully customizable and users are encouraged to update the parameters in the `run_spacy_training.sh` and `spacy_transfomer_train.cfg` files prior to training. The training workflow is as follows:
-1. Create a new data directory and dump all the TXT files (contains annotations in the JSONLines format) from Label Studio.
+1. Create a new data directory and dump all the JSON files containing annotations from Label Studio and any reviewed parquet files.
 2. Most parameters can be used with the default value, open the `run_spacy_training.sh` bash script and update the following fields with absolute paths or relative paths from the root of the repository:
 - `DATA_PATH`: path to directory with Label Studio labelled data
 - `DATA_OUTPUT_PATH`: path to directory to store the split dataset (train/val/test) as well as other data artifacts required for training.
-- `MODEL_PATH`: If retraining, specify path to model artifacts. If training a model from scratch, pass empty string `""`
 - `MODEL_OUTPUT_PATH`: path to store new model artifacts
 - `VERSION`: Version can be updated to keep track of different training runs.
 - `--gpu-id`: While executing the `spacy train` command, GPU can be used, if available, by setting this flag to **0**.
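
To make the role of `--gpu-id` and the output path concrete, here is a hedged sketch of how a `spacy train` invocation could be assembled from such settings. It is not the contents of `run_spacy_training.sh`, which this diff does not show. The helper name `build_spacy_train_command` and its parameters are hypothetical, and the sketch assumes the config file declares `[paths] train` and `[paths] dev` so that the `--paths.train` and `--paths.dev` overrides apply.

```python
def build_spacy_train_command(config_path, train_path, dev_path,
                              model_output_path, use_gpu=False):
    """Return the argv list for a `spacy train` run.

    `--gpu-id 0` selects the first GPU; `-1` is spaCy's value for CPU.
    All arguments are caller-supplied paths (hypothetical in this sketch).
    """
    gpu_id = 0 if use_gpu else -1
    return [
        "python", "-m", "spacy", "train", str(config_path),
        "--output", str(model_output_path),
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
        "--gpu-id", str(gpu_id),
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.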