Commit 8194129: final edits on mds report
1 parent 31028f8

2 files changed: 9 additions & 9 deletions

Binary file (-666 KB) not shown.

reports/final_mds/finding-fossils-final-mds.qmd (9 additions & 9 deletions)
````diff
@@ -20,7 +20,7 @@ format:
   colorlinks: true
 params:
   output_file: "reports"
-fig-cap-location: top
+fig-cap-location: bottom
 ---
 
 **Executive Summary**
````
````diff
@@ -325,7 +325,7 @@ train_results_df = load_model_evaluation_results(
 )
 plot_distribution_of_entities(
     train_results_df,
-    "Roberta Finetuned V3",
+    "Roberta Finetuned V6",
     title="Entity Distribution for Train Set",
 )
````
````diff
@@ -335,7 +335,7 @@ val_results_df = load_model_evaluation_results(
 )
 plot_distribution_of_entities(
     val_results_df,
-    "Roberta Finetuned V3",
+    "Roberta Finetuned V6",
     title="Entity Distribution for Validation Set",
 )
````
````diff
@@ -344,7 +344,7 @@ test_results_df = load_model_evaluation_results(
     results_type="test",
 )
 plot_distribution_of_entities(
-    test_results_df, "Roberta Finetuned V3", title="Entity Distribution for Test Set"
+    test_results_df, "Roberta Finetuned V6", title="Entity Distribution for Test Set"
 )
 ```
 
````
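The hunks above only swap the model tag passed to `plot_distribution_of_entities` from V3 to V6; the helper itself lives elsewhere in the project's codebase. As a rough, dependency-free illustration of what such a distribution helper does, here is a sketch in which the function body, the text bar chart, and the sample labels are all assumptions, not the project's implementation:

```python
from collections import Counter

def plot_distribution_of_entities(labels, model_name, title=""):
    """Count entity labels and print a simple text bar chart.

    Hypothetical stand-in for the project's plotting helper: `labels` is
    assumed to be an iterable of entity-type strings.
    """
    counts = Counter(labels)
    print(f"{title} ({model_name})")
    for label, n in counts.most_common():
        print(f"{label:>8} | {'#' * n} ({n})")
    return counts

counts = plot_distribution_of_entities(
    ["SITE", "REGION", "SITE", "TAXA"],
    "Roberta Finetuned V6",
    title="Entity Distribution for Train Set",
)
assert counts["SITE"] == 2
```

The real helper presumably reads the label column out of the evaluation DataFrame and renders a proper figure; the counting step is the same either way.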
````diff
@@ -491,7 +491,7 @@ The target for the data extraction is to maximize recall. This was chosen as it
 
 ## Data Review Tool
 
-To facilitate the review and improve the efficiency of Neotoma data stewards, an interactive dashboard was developed as the final data product. This dashboard enables manual review of the Article Relevance Prediction and Article Data Extraction results. Users can compare extracted entities to sentences or access the full-text articles to make corrections. They also have the ability to delete incorrect entities and add any missed entities. The Data Review Tool generates a JSON object as output, which can be used to retrain the Article Entity Extraction model and update the Neotoma database. This process promotes enhanced information sharing and improved results, reducing the time required for data stewards to review extracted entities in articles.
+To facilitate the review and improve the efficiency of Neotoma data stewards, an interactive dashboard was developed as the final data product. This dashboard enables manual review of the Article Relevance Prediction and Article Data Extraction results. Users can compare extracted entities to sentences or access the full-text articles to make corrections. They also have the ability to delete incorrect entities and add any missed entities. The Data Review Tool generates a parquet object as output, which can be used to retrain the Article Entity Extraction model and update the Neotoma database. This process promotes enhanced information sharing and improved results, reducing the time required for data stewards to review extracted entities in articles.
 
 ### Approach
````
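The hunk above changes the Data Review Tool's output format from JSON to parquet. A minimal sketch of round-tripping reviewed entities through parquet with pandas follows; all column names and sample values are assumptions, not the project's actual schema, and the parquet engine requirement is noted in the comments:

```python
import io

import pandas as pd

# Hypothetical schema for the review output: one row per reviewed entity,
# with the steward's correction flags. All column names are assumed.
reviewed = pd.DataFrame(
    {
        "article_id": ["a1", "a1"],
        "entity_type": ["SITE", "REGION"],
        "extracted_text": ["Crystal Lake", "northern Ontario"],
        "deleted": [False, False],
    }
)

# Unlike JSON, parquet stores column dtypes alongside the data, which keeps
# the retraining input schema stable across review sessions.
buf = io.BytesIO()
try:
    reviewed.to_parquet(buf)  # requires pyarrow or fastparquet
    roundtrip = pd.read_parquet(io.BytesIO(buf.getvalue()))
except ImportError:
    roundtrip = reviewed  # no parquet engine installed; skip the round trip

assert roundtrip.equals(reviewed)
```

Keeping dtypes with the data is one plausible reason for the JSON-to-parquet switch, alongside smaller files for large entity tables.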
````diff
@@ -515,7 +515,7 @@ In order to create an interactive tool that would be appropriate and efficient f
 | ----------------------------------------------------- | ------------------------------------------------ |
 | Reviewing workflow | Able to save/resume progress. |
 | ----------------------------------------------------- | ------------------------------------------------ |
-| Output file format | JSON |
+| Output file format | Parquet |
 
 : Data Review Tool Target Metrics {#tbl-review_metrics}
````
````diff
@@ -692,7 +692,7 @@ Markdown(render_df.to_markdown())
 
 An important observation here is that the top models had lower precision scores for SITE and REGION names. The models got confused when deciding whether an entity should be classified as a SITE or a REGION. This was partially due to the quality of the entity labels, as well as the fact that both types refer to the name of a place or a wider area. The confusion matrix in @fig-confusion_matrix, generated from the test set, highlights the issue.
 
-![Confusion Matrix for RoBERTa Model](../../results/ner/roberta-finetuned-v3/roberta-finetuned-v3_test_confusion_matrix.png){#fig-confusion_matrix}
+![Confusion Matrix for RoBERTa Model](../../results/ner/roberta-finetuned-v6/roberta-finetuned-v6_test_confusion_matrix.png){#fig-confusion_matrix}
 
 ## Data Review Tool
````
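The SITE/REGION mix-up discussed around the confusion matrix can be quantified with a standard label-pair count. A plain-Python sketch follows; the tag sequences are illustrative assumptions, not the project's evaluation data, and the project itself presumably uses a library routine rather than this hand-rolled version:

```python
from collections import Counter

def confusion_matrix(true_labels, pred_labels, labels):
    """Rows = true label, columns = predicted label (plain-Python sketch)."""
    pairs = Counter(zip(true_labels, pred_labels))
    return [[pairs[(t, p)] for p in labels] for t in labels]

# Illustrative tags showing the SITE/REGION confusion described above.
y_true = ["SITE", "SITE", "REGION", "REGION", "TAXA"]
y_pred = ["SITE", "REGION", "REGION", "SITE", "TAXA"]
labels = ["SITE", "REGION", "TAXA"]

cm = confusion_matrix(y_true, y_pred, labels)
# Non-zero off-diagonal cells in the SITE/REGION rows mean each class is
# sometimes predicted as the other.
print(cm)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Reading the off-diagonal SITE/REGION cells directly is exactly the check the figure referenced above makes visually.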
````diff
@@ -718,7 +718,7 @@ The output of this data review tool is a parquet file that stores the originally
 | ----------------------------------------------------- | ------------------------------------------------ |
 | Reviewing workflow | Able to save/resume progress. |
 | ----------------------------------------------------- | ------------------------------------------------ |
-| Output file format | JSON |
+| Output file format | Parquet |
 
 : Data Review Tool Metric Results {#tbl-review_results}
````
````diff
@@ -727,7 +727,7 @@ The output of this data review tool is a parquet file that stores the originally
 
 The end goal of this project is to have each data product running unsupervised. The article relevance prediction pipeline was containerized using Docker. Neotoma is expected to run it on a daily or weekly basis to generate article relevance predictions and submit relevant articles to xDD to have their full text processed.
 
-The Article Data Extraction pipeline was containerized using Docker and contains the entity extraction model within it. It will be run on the xDD servers as xDD is not legally allowed to send full text articles off their servers. The container accepts full text articles, extracts the entities, and outputs a single JSON object for each article. The JSON objects are combined with the article relevance prediction results and loaded into the Data Review Tool. @fig-deployment_pipeline depicts the workflow.
+The Article Data Extraction pipeline is containerized using Docker and contains the entity extraction model within it. It will be run on the xDD servers as xDD is not legally allowed to send full text articles off their servers. The container accepts full text articles, extracts the entities, and outputs a single JSON object for each article. The JSON objects are combined with the article relevance prediction results and loaded into the Data Review Tool. @fig-deployment_pipeline depicts the workflow.
 
 ```{mermaid}
 %%| label: fig-deployment_pipeline
````
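The deployment step described above, combining per-article JSON objects from the extraction container with the relevance predictions, can be sketched in plain Python. The field names, article ids, and scores below are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical per-article JSON objects emitted by the extraction container.
extraction_outputs = [
    json.dumps({"article_id": "a1",
                "entities": [{"type": "SITE", "text": "Crystal Lake"}]}),
    json.dumps({"article_id": "a2", "entities": []}),
]

# Hypothetical relevance-prediction scores keyed by the same article id.
relevance = {"a1": 0.93, "a2": 0.41}

# Join the two sources on article_id before loading into the Data Review Tool.
combined = []
for raw in extraction_outputs:
    record = json.loads(raw)
    record["relevance_score"] = relevance.get(record["article_id"])
    combined.append(record)

print(combined[0]["relevance_score"])  # 0.93
```

Joining on a shared article identifier is the natural design here, since the two pipelines run on different servers and only meet at review time.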
