| 1 | +# SpaCy Transformer Training & Evaluation |
| 2 | + |
| 3 | +This folder contains the training and evaluation scripts for the SpaCy Transformer-based NER model. The scripts are based on the SpaCy CLI scripts and have been modified to work with the Label Studio output format. More information can be found in the [spaCy training documentation](https://spacy.io/usage/training). |
| 4 | + |
| 5 | +**Table of Contents** |
| 6 | +- [SpaCy Transformer Training \& Evaluation](#spacy-transformer-training--evaluation) |
| 7 | + - [Training Workflow](#training-workflow) |
| 8 | + - [Evaluation Workflow](#evaluation-workflow) |
| 9 | + - [Overall Process Diagram](#overall-process-diagram) |
| 10 | + - [How to Run Training on Free Google Colab with GPU](#how-to-run-training-on-free-google-colab-with-gpu) |
| 11 | + |
| 12 | +## Training Workflow |
| 13 | + |
| 14 | +A bash script is used to initialize a training job. Model training is fully customizable and users are encouraged to update the parameters in the `run_spacy_training.sh` and `spacy_transfomer_train.cfg` files prior to training. The training workflow is as follows: |
| 15 | +1. Create a new data directory and copy into it all the TXT files exported from Label Studio (they contain the annotations in JSON Lines format). |
| 16 | +2. Most parameters can be left at their default values. Open the `run_spacy_training.sh` bash script and update the following fields with absolute paths or paths relative to the root of the repository: |
| 17 | + - `DATA_PATH`: path to directory with Label Studio labelled data |
| 18 | + - `DATA_OUTPUT_PATH`: path to directory to store the split dataset (train/val/test) as well as other data artifacts required for training. |
| 19 | + - `MODEL_PATH`: if retraining, the path to the existing model artifacts; if training a model from scratch, an empty string `""` |
| 20 | + - `MODEL_OUTPUT_PATH`: path to store new model artifacts |
| 21 | + - `VERSION`: Version can be updated to keep track of different training runs. |
| 22 | + - `--gpu-id`: flag of the `spacy train` command. Set it to **0** to train on the first GPU, if one is available; spaCy's default of `-1` trains on CPU. |
| 23 | +3. Make the training script executable: |
| 24 | +```bash |
| 25 | +chmod +x src/entity_extraction/training/spacy_ner/run_spacy_training.sh |
| 26 | +``` |
| 27 | +4. Execute the training script from the root of the repository: |
| 28 | +```bash |
| 29 | +./src/entity_extraction/training/spacy_ner/run_spacy_training.sh |
| 30 | +``` |
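
Before launching a long training run, it can help to confirm that the paths from step 2 are usable. The following is a minimal sketch, not part of `run_spacy_training.sh`: the function name `check_training_paths` and the example default values are assumptions made for illustration, and only the variable names come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; the real run_spacy_training.sh may differ.

# Example values only (assumptions); replace them with your own paths.
DATA_PATH="${DATA_PATH:-data/labelled}"
DATA_OUTPUT_PATH="${DATA_OUTPUT_PATH:-data/processed}"
MODEL_PATH="${MODEL_PATH:-}"                 # empty string = train from scratch
MODEL_OUTPUT_PATH="${MODEL_OUTPUT_PATH:-models}"
VERSION="${VERSION:-v1}"

# Return 0 if the settings look usable; otherwise print the problem and return 1.
check_training_paths() {
    if [ ! -d "$DATA_PATH" ]; then
        echo "DATA_PATH is not a directory: $DATA_PATH" >&2
        return 1
    fi
    # MODEL_PATH is only checked when retraining (non-empty value).
    if [ -n "$MODEL_PATH" ] && [ ! -e "$MODEL_PATH" ]; then
        echo "MODEL_PATH is set but does not exist: $MODEL_PATH" >&2
        return 1
    fi
    # Output locations are created on demand.
    mkdir -p "$DATA_OUTPUT_PATH" "$MODEL_OUTPUT_PATH"
}
```

Calling `check_training_paths` with the same values you put into the training script reports a missing input directory immediately rather than partway through preprocessing.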
| 31 | + |
| 32 | +## Evaluation Workflow |
| 33 | + |
| 34 | +To run full evaluation of the trained model to get detailed metrics and plots, follow these steps: |
| 35 | +1. Open the `run_evaluation.sh` bash script and update the following fields: |
| 36 | + 1. `MODEL_NAME` - the name assigned to the model, used in the results file names and similar outputs. |
| 37 | + 2. `MODEL_PATH` - the location of the trained model files. |
| 38 | + 3. `OUTPUT_DIR` - the location to save the evaluation results. |
| 39 | + 4. `DATA_DIR` - the location of the JSON file containing split train/test/val Label Studio output data. |
| 40 | + 5. `GPU` - whether to use GPU or not. |
| 41 | +2. Make the evaluation script executable: |
| 42 | +```bash |
| 43 | +chmod +x src/entity_extraction/training/spacy_ner/run_evaluation.sh |
| 44 | +``` |
| 45 | +3. Run the evaluation script. Results will be generated in the `OUTPUT_DIR` folder. **This may take a while, even on GPU.** |
| 46 | +```bash |
| 47 | +./src/entity_extraction/training/spacy_ner/run_evaluation.sh |
| 48 | +``` |
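
Whether a GPU is present can be checked before choosing the GPU setting. The sketch below uses a hypothetical helper, `detect_gpu_id`, that prints `0` when `nvidia-smi` runs successfully and `-1` otherwise, following spaCy's convention that `--gpu-id -1` means CPU. How that value maps onto the `GPU` field of `run_evaluation.sh` is not specified here, and a working `nvidia-smi` does not guarantee that spaCy's GPU dependencies are installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print 0 if an NVIDIA GPU appears usable, else -1.
detect_gpu_id() {
    # nvidia-smi must both exist and exit successfully.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo 0
    else
        echo -1
    fi
}
```

For example, the printed value could be passed to `spacy train` as `--gpu-id "$(detect_gpu_id)"`.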
| 49 | + |
| 50 | +## Overall Process Diagram |
| 51 | + |
| 52 | +```mermaid |
| 53 | +%%| label: training_pipeline |
| 54 | +%%| fig-cap: "This is how the Entity Extraction model training process runs with intermediate files and processes." |
| 55 | +%%| fig-height: 6 |
| 56 | +%%{init: {'theme':'base','themeVariables': {'fontFamily': 'arial','primaryColor': '#BFDFFF','primaryTextColor': '#000','primaryBorderColor': '#4C75A3','lineColor': '#000','secondaryColor': '#006100','tertiaryColor': '#fff'}, 'flowchart' : {'curve':'monotoneY'}}}%% |
| 57 | +flowchart TD |
| 58 | +F8(Labelled JSON files<br>from LabelStudio) --> C2(spacy_preprocess.py<br>Split into Train/Val/Test<br>Sets by xDD ID) |
| 59 | +C2 --> F1(test set) |
| 60 | +C2 --> F3(val set) |
| 61 | +C2 --> F2(train set) |
| 62 | +F3 --> C4 |
| 63 | +F2 --> C4(run_spacy_training.sh<br>Run Spacy Model Training) |
| 64 | +F1 ----> C3(run_evaluation.sh<br>Run Model Evaluation) |
| 65 | +F3 --> C3 |
| 66 | +C4 --> F7(Log Metrics &<br>Checkpoints) |
| 67 | +C4 --> F4(Log Final<br>Trained Model) |
| 68 | +F4 --> C3 |
| 69 | +C3 --> F6(Evaluation<br>Plots) |
| 70 | +C3 --> F5(Evaluation results<br>JSON) |
| 71 | +subgraph Legend |
| 72 | + computation[Computation] |
| 73 | + fileOutput[File Input/Output] |
| 74 | + computation ~~~ fileOutput |
| 75 | + style computation fill:#BFDFFF, stroke:#4C75A3 |
| 76 | + style fileOutput fill:#d3d3d3, stroke:#808080 |
| 77 | +end |
| 78 | + |
| 79 | +%% create a class for styling the nodes |
| 80 | +classDef compute_nodes fill:#BFDFFF, stroke:#4C75A3,stroke-width:2px; |
| 81 | +classDef file_nodes fill:#d3d3d3, stroke:#808080,stroke-width:2px; |
| 82 | + |
| 83 | +class F1,F2,F3,F4,F5,F6,F7,F8 file_nodes; |
| 84 | +class C2,C3,C4,C5 compute_nodes; |
| 85 | +``` |
| 86 | + |
| 87 | +## How to Run Training on Free Google Colab with GPU |
| 88 | + |
| 89 | +The `colab_start_training.ipynb` notebook sets up the NER model training on Google Colab with GPU. Use the following steps to create the folder structure and run the training. The free tier of Colab does not provide command-line access, so a notebook is used to start the training. |
| 90 | + |
| 91 | +1. Create a folder in your Google Drive and name it after your training run (e.g. `spacy-transformer-v1`) |
| 92 | +2. Upload the entire `src` folder from the repo into the folder you just created |
| 93 | +3. Create a `data` folder inside the folder you just created and upload the `train.spacy` and `val.spacy` files into it |
| 94 | +4. Create a `models` folder; this is where checkpoints will be saved during training |
| 95 | +5. Create an `evaluation-results` folder; this is where the evaluation results will be saved |
| 96 | +6. Create copies of the `run_spacy_training.sh` and `run_evaluation.sh` files from `src/entity_extraction/training/spacy_ner` and place them in the training run folder |
| 97 | +7. Your folder structure should now look like: |
| 98 | + ``` |
| 99 | + spacy-transformer-v1 |
| 100 | + ├── data |
| 101 | + │ ├── train.spacy |
| 102 | + │ └── val.spacy |
| 103 | + ├── models |
| 104 | + ├── evaluation-results |
| 105 | + ├── src |
| 106 | + ├── colab_start_training.ipynb |
| 107 | + ├── run_evaluation.sh |
| 108 | + └── run_spacy_training.sh |
| 109 | + ``` |
| 110 | +8. Open the `run_spacy_training.sh` and `run_evaluation.sh` files and change each of the variables/paths to match your current setup. (Note: Google Colab expects absolute paths in both files) |
| 111 | +9. Open the `colab_start_training.ipynb` file and run the cells to start training. |
| 112 | +10. Model files and checkpoints will be saved in the `models` folder and evaluation results will be saved in the `evaluation-results` folder. |
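
If you prefer to prepare the run folder locally before uploading it to Google Drive, the empty directories from steps 3 to 5 can be created in one go. This is a minimal sketch under that assumption; `create_run_layout` is a hypothetical helper, and copying `src`, the `.spacy` files, the notebook, and the two scripts is still done as described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the empty folders of a training run directory.
create_run_layout() {
    local run_dir="${1:-}"
    if [ -z "$run_dir" ]; then
        echo "usage: create_run_layout <run-folder>" >&2
        return 1
    fi
    # data: train.spacy / val.spacy, models: checkpoints,
    # evaluation-results: evaluation output
    mkdir -p "$run_dir/data" "$run_dir/models" "$run_dir/evaluation-results"
}
```

For example, `create_run_layout spacy-transformer-v1` produces the three subfolders shown in the tree in step 7.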