Commit d6cb366

Merge pull request #52 from NeotomaDB/14-fine-tune-spacy-ner-model
Retraining script for spacy
2 parents 3accaa9 + e8001df commit d6cb366

10 files changed

Lines changed: 316 additions & 493 deletions
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# SpaCy Transformer Training & Evaluation

This folder contains the training and evaluation scripts for the SpaCy Transformer based NER model. The scripts are based on the SpaCy CLI scripts and have been modified to work with the Label Studio output format. More information can be found [here](https://spacy.io/usage/training).

**Table of Contents**

- [SpaCy Transformer Training \& Evaluation](#spacy-transformer-training--evaluation)
  - [Training Workflow](#training-workflow)
  - [Evaluation Workflow](#evaluation-workflow)
  - [Overall Process Diagram](#overall-process-diagram)
  - [How to Run Training on Free Google Colab with GPU](#how-to-run-training-on-free-google-colab-with-gpu)

## Training Workflow

A bash script is used to initialize a training job. Model training is fully customizable, and users are encouraged to update the parameters in the `run_spacy_training.sh` and `spacy_transformer_train.cfg` files prior to training. The training workflow is as follows:

1. Create a new data directory and add all the TXT files (containing annotations in JSONLines format) exported from Label Studio.
2. Most parameters can be left at their default values. Open the `run_spacy_training.sh` bash script and update the following fields with absolute paths or paths relative to the root of the repository:
   - `DATA_PATH`: path to the directory with the Label Studio labelled data.
   - `DATA_OUTPUT_PATH`: path to the directory to store the split dataset (train/val/test) as well as other data artifacts required for training.
   - `MODEL_PATH`: if retraining, the path to the model artifacts; if training a model from scratch, pass an empty string `""`.
   - `MODEL_OUTPUT_PATH`: path to store the new model artifacts.
   - `VERSION`: can be updated to keep track of different training runs.
   - `--gpu-id`: while executing the `spacy train` command, a GPU can be used, if available, by setting this flag to **0**.
3. Make the training script executable:
   ```bash
   chmod +x src/entity_extraction/training/spacy_ner/run_spacy_training.sh
   ```
4. Execute the training script from the root of the repository:
   ```bash
   ./src/entity_extraction/training/spacy_ner/run_spacy_training.sh
   ```
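
The annotation files from step 1 are JSONLines: one JSON object per line. A minimal sketch of reading such a file with only the standard library (the `text`/`label` field names here are illustrative, not necessarily the exact keys Label Studio emits):

```python
import json

def read_jsonlines(path):
    """Parse a JSONLines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```

Each record can then be inspected or converted into the binary `.spacy` format by the preprocessing scripts.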

## Evaluation Workflow

To run a full evaluation of the trained model and get detailed metrics and plots, follow these steps:

1. Open the `run_evaluation.sh` bash script and update the following fields:
   1. `MODEL_NAME` - the name assigned to the run and used in the results file names.
   2. `MODEL_PATH` - the location of the trained model files.
   3. `OUTPUT_DIR` - the location to save the evaluation results.
   4. `DATA_DIR` - the location of the JSON file containing the split train/test/val Label Studio output data.
   5. `GPU` - whether to use a GPU or not.
2. Make the evaluation script executable:
   ```bash
   chmod +x src/entity_extraction/training/spacy_ner/run_evaluation.sh
   ```
3. Run the evaluation script; the results will be generated in the `OUTPUT_DIR` folder. **This may take a while, even on GPU.**
   ```bash
   ./src/entity_extraction/training/spacy_ner/run_evaluation.sh
   ```
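
The detailed metrics produced are standard entity-level scores. As a rough illustration of what precision/recall/F1 mean for NER (a sketch of the idea, not the repository's actual evaluation code; the entity labels shown are hypothetical):

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1.

    gold, pred: sets of (start, end, label) tuples for one document.
    An entity only counts as correct on an exact span + label match.
    """
    tp = len(gold & pred)   # exact matches
    fp = len(pred - gold)   # predicted but not in gold
    fn = len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Exact-match scoring is strict: an entity with the right label but off-by-one span boundaries counts as both a false positive and a false negative.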

## Overall Process Diagram

```mermaid
%%| label: training_pipeline
%%| fig-cap: "This is how the Entity Extraction model training process runs with intermediate files and processes."
%%| fig-height: 6
%%{init: {'theme':'base','themeVariables': {'fontFamily': 'arial','primaryColor': '#BFDFFF','primaryTextColor': '#000','primaryBorderColor': '#4C75A3','lineColor': '#000','secondaryColor': '#006100','tertiaryColor': '#fff'}, 'flowchart' : {'curve':'monotoneY'}}}%%
flowchart TD
    F8(Labelled JSON files<br>from LabelStudio) --> C2(spacy_preprocess.py<br>Split into Train/Val/Test<br>Sets by xDD ID)
    C2 --> F1(test set)
    C2 --> F3(val set)
    C2 --> F2(train set)
    F3 --> C4
    F2 --> C4(run_spacy_training.sh<br>Run Spacy Model Training)
    F1 ----> C3(run_evaluation.sh<br>Run Model Evaluation)
    F3 --> C3
    C4 --> F7(Log Metrics &\nCheckpoints)
    C4 --> F4(Log Final\nTrained Model)
    F4 --> C3
    C3 --> F6(Evaluation<br>Plots)
    C3 --> F5(Evaluation results\nJSON)

    subgraph Legend
        computation[Computation]
        fileOutput[File Input/Output]
        computation ~~~ fileOutput
        style computation fill:#BFDFFF, stroke:#4C75A3
        style fileOutput fill:#d3d3d3, stroke:#808080
    end

    %% create a class for styling the nodes
    classDef compute_nodes fill:#BFDFFF, stroke:#4C75A3,stroke-width:2px;
    classDef file_nodes fill:#d3d3d3, stroke:#808080,stroke-width:2px;

    class F1,F2,F3,F4,F5,F6,F7,F8 file_nodes;
    class C2,C3,C4,C5 compute_nodes;
```
86+
87+
## How to Run Training on Free Google Colab with GPU
88+
89+
This notebook sets up the NER model training on Google Colab with GPU. Use the following steps to create the setup/folder structure and run the training. The free level of Colab does not allow CLI so a notebook is used to start the training.
90+
91+
1. Create a folder in your Google Drive and name it the name of your training run (e.g. `spacy-transformer-v1`)
92+
2. Upload the entire `src` folder from the repo into the folder you just created
93+
3. Create a `data` folder inside the folder you just created and upload the `train.spacy` and `val.spacy` files into it
94+
4. Create a `models` folder, this is where checkpoints will be saved during training
95+
5. Create an `evaluation-results` folder, this is where the evaluation results will be saved
96+
6. Create a copy of the `run_spacy_training.sh` and `run_evaluation.sh` files from `src/entity_extraction/training/spacy_ner` and place it in training run folder
97+
7. Your folder structure should now look like:
98+
```
99+
spacy-transformer-v1
100+
├── data
101+
│ ├── train.spacy
102+
│ └── val.spacy
103+
├── models
104+
├── evaluation-results
105+
├── src
106+
├── colab_start_training.ipynb
107+
├── run_evaluation.sh
108+
└── run_spacy_training.sh
109+
```
110+
8. Open the `run_spacy_training.sh` and `run_evaluation.sh` files and change each of the variables/paths to match your current setup. (Note: Google Colab expects absolute paths in the both the files)
111+
9. Open the `colab_start_training.ipynb` file and run the cells to start training.
112+
10. Model files and checkpoints will be saved in the `models` folder and evaluation results will be saved in the `evaluation-results` folder.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
`colab_start_training.ipynb` — a Colab notebook (GPU runtime, T4) with the following code cells:

```python
# Cell 1: mount Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

# Cell 2: install a CUDA 11.3 build of PyTorch
!pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Cell 3: install spaCy with CUDA and transformers extras
!pip install -U spacy[cuda113,transformers]

# Cell 4: point CUDA_PATH at the Colab CUDA install
!export CUDA_PATH="/opt/nvidia/cuda"

# Cell 5: set to where the fine-tuning job folder is setup
import os
os.chdir("/content/gdrive/MyDrive/path")

# Cell 6: start training
!bash ./run_spacy_training.sh

# Cell 7: start evaluation
!bash ./run_evaluation.sh
```
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Author: Jenit Jain
# Date: 2023-06-21
# Inspired from https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_update/scripts/create_config.py
"""This script creates a config file when resuming training of a spacy model from a past checkpoint

Usage: create_config.py --model_path=<model_path> --output_path=<output_path>

Options:
--model_path=<model_path>       The path to the model artifacts.
--output_path=<output_path>     The path to the output config file.
"""

from docopt import docopt
import spacy


def create_config(model_path: str, output_path: str):
    """
    Loads a model's config and updates the component sources to resume training

    Parameters
    ----------
    model_path: str
        Path to the model artifacts to resume training
    output_path: str
        Output path to store the updated configuration file.
    """
    spacy.require_cpu()
    nlp = spacy.load(model_path)

    # create a new config as a copy of the loaded pipeline's config
    config = nlp.config.copy()

    # source the ner and transformer components from the loaded pipeline
    # so that training resumes from the checkpoint weights
    config["components"]["ner"] = {
        "source": model_path,
    }
    config["components"]["transformer"] = {
        "source": model_path,
    }
    # save the config
    config.to_disk(output_path)


if __name__ == "__main__":
    opt = docopt(__doc__)
    create_config(opt["--model_path"], opt["--output_path"])
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/env sh

# if running with conda envs, comment out if not
conda activate fossil_lit

echo python.__version__ = $(python -c 'import sys; print(sys.version)')
# ensure we're in the MetaExtractor root directory
echo "Current working directory: $(pwd)"

DATA_PATH="/path/to/sample input folder"
DATA_OUTPUT_PATH="/path/to/sample output folder"
MODEL_PATH="/path/to/model artifacts"
MODEL_OUTPUT_PATH="/path/to/new model artifacts"
VERSION="v1"
TRAIN_SPLIT=0.7
VAL_SPLIT=0.15
TEST_SPLIT=0.15

# remove any stale config left over from a previous run with the same version
rm -f src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

# path variables are quoted so directories with spaces are handled correctly
python3 src/preprocessing/labelling_data_split.py \
        --raw_label_path "$DATA_PATH" \
        --output_path "$DATA_OUTPUT_PATH" \
        --train_split $TRAIN_SPLIT \
        --val_split $VAL_SPLIT \
        --test_split $TEST_SPLIT

python3 src/preprocessing/spacy_preprocess.py \
        --data_path "$DATA_OUTPUT_PATH" \
        --train_split $TRAIN_SPLIT \
        --val_split $VAL_SPLIT \
        --test_split $TEST_SPLIT

if [ -z "$MODEL_PATH" ]; then
    # If the model path is empty, start training from scratch

    # Fill configuration with required fields
    python -m spacy init fill-config \
        src/entity_extraction/training/spacy_ner/spacy_transformer_train.cfg \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

    # Execute the training job by pointing to the new config file
    python -m spacy train \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
        --paths.train "$DATA_OUTPUT_PATH"/train.spacy \
        --paths.dev "$DATA_OUTPUT_PATH"/val.spacy \
        --output "$MODEL_OUTPUT_PATH" \
        --gpu-id -1

else
    # Otherwise create a new config file to resume training from the checkpoint
    python src/entity_extraction/training/spacy_ner/create_config.py \
        --model_path "$MODEL_PATH" \
        --output_path src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg

    python -m spacy train \
        src/entity_extraction/training/spacy_ner/spacy_transformer_$VERSION.cfg \
        --paths.train "$DATA_OUTPUT_PATH"/train.spacy \
        --paths.dev "$DATA_OUTPUT_PATH"/val.spacy \
        --components.ner.source "$MODEL_PATH" \
        --components.transformer.source "$MODEL_PATH" \
        --output "$MODEL_OUTPUT_PATH" \
        --gpu-id -1
fi
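
The `TRAIN_SPLIT`/`VAL_SPLIT`/`TEST_SPLIT` values give a 70/15/15 split, and the process diagram notes the split is done by xDD document ID so that chunks from one paper never straddle train and test. A minimal sketch of that idea (a hypothetical helper, not the repository's actual `labelling_data_split.py`):

```python
import random

def split_by_doc_id(doc_ids, train=0.7, val=0.15, seed=42):
    """Split document IDs into train/val/test sets so that all chunks
    from a given document land in the same split."""
    ids = sorted(set(doc_ids))          # dedupe; sort for reproducibility
    random.Random(seed).shuffle(ids)    # seeded shuffle before slicing
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (
        set(ids[:n_train]),                  # train
        set(ids[n_train:n_train + n_val]),   # val
        set(ids[n_train + n_val:]),          # test (the remainder)
    )
```

Splitting at the document level, rather than per annotation chunk, avoids leakage of near-duplicate text between the training and evaluation sets.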
