| Rule Based Models | This served as the baseline, using regex to extract known entities, but was not developed further due to known text-quality issues caused by OCR and its infeasibility for entities like SITE. |
| <br/> | <br/> |
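To make the rule-based baseline concrete, below is a minimal sketch of regex entity extraction; the pattern and the AGE example are hypothetical illustrations, not the project's actual rules.

```python
import re

# Hypothetical pattern for radiocarbon AGE mentions such as
# "10,240 ± 80 14C yr BP"; real rules would need many variants,
# which is part of why this approach was not pursued further.
AGE_PATTERN = re.compile(r"\d[\d,]*\s*(?:±|\+/-)\s*\d+\s*(?:14C\s*)?yr\s*BP")

def extract_ages(text: str) -> list[str]:
    return [m.group(0) for m in AGE_PATTERN.finditer(text)]

print(extract_ages("Basal sediments dated to 10,240 ± 80 14C yr BP."))
```

OCR noise (e.g. "±" read as "+") and entities with no fixed surface form, like SITE names, quickly defeat patterns of this kind.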
For the transformer-based models, two approaches were used for training: the spaCy command line interface (CLI) [@spacy] and Hugging Face's Training application programming interface (API) [@huggingface]. Each has advantages and disadvantages, which are outlined in @tbl-spacy-pros-cons.
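As an illustration of the spaCy CLI route, the commands below show a typical NER training workflow; the file names and paths are assumptions, not the project's actual configuration.

```bash
# Generate a baseline training config for an English NER pipeline
# (the --gpu flag selects a transformer-based config).
python -m spacy init config config.cfg --lang en --pipeline ner --gpu

# Train against pre-converted .spacy corpora; paths are illustrative.
python -m spacy train config.cfg --output ./models \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```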
Using the Hugging Face training API, multiple models were trained and evaluated. Each base model, along with the hypothesis behind its selection, is outlined in @tbl-hf-model-hypoth.
| RoBERTa-base | Typically one of the best performing models for NER [@roberta-ner-wang] |
| <br/> | <br/> |
| RoBERTa-large | A larger model than the base version, with the potential to learn more complex relationships at the cost of longer compute times. |
| <br/> | <br/> |
| BERT-multilanguage | Given the known OCR issues and the scientific nature of the text, the larger vocabulary of this multi-language model may handle these issues better. |
| <br/> | <br/> |
| XLM-RoBERTa-base | Another cross language model (XLM) but using the RoBERTa base architecture and pre-training. |
| <br/> | <br/> |
| Specter2 | This model is BERT-based and fine-tuned on 6M+ scientific articles with its own scientific vocabulary, making it well suited to analyzing research articles. |
: Hugging Face Model Hypotheses {#tbl-hf-model-hypoth tbl-colwidths="[25,75]"}
Final hyperparameters used to train the models are outlined in @tbl-hf-train-hyperparams.
| Gradient Accumulation | - Used to mimic larger batch sizes; this value was set at 4 to achieve batch sizes of ~12k tokens based on best practices [@transformer-train-tips] |
| <br/> | <br/> |
| Epochs | - Initial runs used 10-20 epochs; evaluation loss minima were observed in the first 2-8, so 10 was chosen |
| <br/> | <br/> |
| Learning Rate | - Initially 5e-5 was used; rapid overfitting was observed, with eval loss reaching a minimum around 2-4 epochs then increasing for the next 5-10 |
| | - Moved to 2e-5 and introduced gradient accumulation (3 epochs) to increase the effective batch size |
| Warmup Ratio | - The fraction of training steps over which the learning rate increases from 0 to the set LR; shown to improve training with the Adam optimizer [@borealisai2023tutorial]. Set to 10% initially |
| Optimizer | - Adam (beta1 = 0.9, beta2=0.999) |
| <br/> | <br/> |
| Early stopping | - 1600 steps |
: spaCy CLI Final Hyperparameters {#tbl-spacy-hyperparams tbl-colwidths="[30,70]"}
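The batch-sizing logic behind the gradient accumulation setting above can be made concrete with a small calculation. A minimal sketch, assuming a per-device batch of 8 sequences of ~384 tokens each (both assumptions; only the accumulation factor of 4 and the ~12k-token target come from the text):

```python
# Effective token count per optimizer step when mimicking a larger
# batch via gradient accumulation: per-device batch x sequence length
# x accumulation steps.
def effective_tokens(per_device_batch: int, seq_len: int, accum_steps: int) -> int:
    return per_device_batch * seq_len * accum_steps

# 8 sequences x 384 tokens x 4 accumulation steps = 12288 tokens (~12k).
print(effective_tokens(8, 384, 4))  # 12288
```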
## Data Review Tool
The final data review tool that was created is a multi-page Plotly Dash [@dash] application. The tool can be replicated by launching Docker [@docker] containers, enabling anyone within the Neotoma community to easily utilize the tool for reviewing outputs from the pipeline.
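A minimal sketch of how such a containerized Dash app might be launched; the image tag, mounted output path, and port mapping are assumptions (8050 is Dash's default port), not the project's published configuration:

```bash
# Build the review tool image from the repository root (hypothetical tag).
docker build -t metaextractor-review-tool .

# Run it, mounting pipeline outputs and exposing Dash's default port.
docker run -p 8050:8050 -v "$(pwd)/outputs:/app/outputs" metaextractor-review-tool
```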
## Product Deployment
The end goal of this project is to have each data product running unsupervised. The article relevance prediction pipeline was containerized using Docker [@docker]. Neotoma is expected to run it on a daily or weekly basis to predict article relevance and submit relevant articles to xDD [@xdd] to have their full text processed.
The Article Data Extraction pipeline is containerized using Docker and contains the entity extraction model within it. It will be run on the xDD servers, as xDD is not legally allowed to send full-text articles off their servers. The container accepts full-text articles, extracts the entities, and outputs a single JSON object for each article. The JSON objects are combined with the article relevance prediction results and loaded into the Data Review Tool. @fig-deployment_pipeline depicts the workflow.
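A minimal sketch of the combination step described above, merging one article's extraction output with its relevance score; all field names (`gddid`, `entities`, `relevance_score`) are hypothetical, as the actual JSON schema is not shown here:

```python
import json

# Hypothetical per-article extraction output from the xDD-side container.
entity_output = json.loads('{"gddid": "abc123", "entities": {"SITE": ["Lake X"]}}')

# Hypothetical relevance predictions keyed by article ID.
relevance = {"abc123": 0.93}

# Combine the two records for loading into the Data Review Tool.
merged = {**entity_output, "relevance_score": relevance[entity_output["gddid"]]}
print(merged)
```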
```{mermaid}
%%| label: fig-deployment_pipeline
%%| fig-cap: "How the MetaExtractor pipeline flows between the different components."