    test_results_df, "Roberta Finetuned V6", title="Entity Distribution for Test Set"
)
```
## Data Review Tool
To facilitate review and improve the efficiency of Neotoma data stewards, an interactive dashboard was developed as the final data product. The dashboard enables manual review of the Article Relevance Prediction and Article Data Extraction results: users can compare extracted entities against their source sentences or open the full-text articles to make corrections, delete incorrect entities, and add any that were missed. The Data Review Tool writes its output as a parquet object, which can be used to retrain the Article Entity Extraction model and update the Neotoma database. This promotes information sharing and improved results while reducing the time data stewards spend reviewing extracted entities.
### Approach
An important observation is that the top models had lower precision for SITE and REGION names. The models became confused when deciding whether an entity should be classified as a SITE or a REGION, partly because of the quality of the entity labels and partly because both types refer to the name of a place or a wider area. The confusion matrix in @fig-confusion_matrix, generated on the test set, highlights the issue.
{#fig-confusion_matrix}
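A minimal sketch of how such a confusion matrix can be computed with scikit-learn; the label set, true labels, and predictions below are illustrative only, not the project's actual test-set results:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical subset of the entity label set.
labels = ["SITE", "REGION", "GEOG", "TAXA"]
y_true = ["SITE", "SITE", "REGION", "REGION", "GEOG", "TAXA"]
y_pred = ["SITE", "REGION", "REGION", "SITE", "GEOG", "TAXA"]

# Rows are true labels, columns are predictions; the off-diagonal
# SITE/REGION cells expose the swap errors described above.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Reading down the SITE row and REGION row of the printed matrix makes the mutual confusion between the two place-like types immediately visible.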
## Data Review Tool
: Data Review Tool Metric Results {#tbl-review_results}
The end goal of this project is to have each data product running unsupervised. The Article Relevance Prediction pipeline was containerized using Docker; Neotoma is expected to run it on a daily or weekly basis to predict article relevance and submit relevant articles to xDD for full-text processing.
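For example, the scheduled run could be a single crontab entry invoking the container; the image name and schedule below are assumptions for illustration, not the project's actual deployment configuration:

```shell
# Hypothetical crontab entry: run the relevance-prediction container
# every Monday at 02:00 (image name assumed).
0 2 * * 1 docker run --rm neotoma/article-relevance:latest
```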
The Article Data Extraction pipeline is also containerized using Docker and bundles the entity extraction model within it. It will run on the xDD servers, as xDD is not legally allowed to send full-text articles off its servers. The container accepts full-text articles, extracts the entities, and outputs a single JSON object for each article. These JSON objects are combined with the article relevance prediction results and loaded into the Data Review Tool. @fig-deployment_pipeline depicts the workflow.
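To make the container's per-article output concrete, here is a sketch of what one such JSON object might look like; the field names and values are hypothetical, not the pipeline's confirmed schema:

```python
import json

# Hypothetical per-article extraction result (field names and values assumed,
# not the actual pipeline schema).
extraction = {
    "article_id": "example-article-id",
    "entities": {
        "SITE": ["Lake Mendota"],
        "REGION": ["Wisconsin"],
    },
}

# One JSON object is emitted per article; downstream, these are merged with
# the relevance-prediction results before loading into the Data Review Tool.
payload = json.dumps(extraction, indent=2)
parsed = json.loads(payload)
```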