feat: add max_rel_per_label to support recall for labeled data by guenthermi · Pull Request #826 · docarray/docarray

guenthermi · 2022-11-22T08:07:06Z

Signed-off-by: Michael Guenther [email protected]

Goals:

Support recall@k and F1 measure@k for labeled datasets.
Check and update documentation, if required. See guide

For labeled datasets, calculating metrics such as recall and F1 measure is not trivial. Both require the number of relevant documents in the collection for each query, but the evaluate function receives neither the full set of relevant documents nor the number of documents with a given label.

To enable the calculation of recall and F1 measure, this PR

adds a parameter max_rel_per_label: Dict to the evaluate function, which provides the number of relevant documents for each label, i.e., the number of documents in the collection with this label.
calculates those label counts for max_rel_per_label in the embed_and_evaluate function.
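To illustrate why the per-label count is needed, here is a minimal standalone sketch of recall@k and F1@k for one labeled query. It is not docarray's implementation: the helper names (`recall_at_k`, `precision_at_k`, `f1_at_k`) and the choice to divide precision by the number of returned matches are assumptions made for this example only.

```python
from typing import Dict, Hashable, Sequence


def recall_at_k(match_labels: Sequence[Hashable], query_label: Hashable,
                max_rel: int, k: int) -> float:
    """Share of the max_rel relevant documents that appear in the top-k matches."""
    if max_rel <= 0:
        raise ValueError('max_rel must be positive')
    hits = sum(1 for label in match_labels[:k] if label == query_label)
    return hits / max_rel


def precision_at_k(match_labels: Sequence[Hashable], query_label: Hashable,
                   k: int) -> float:
    """Share of the top-k matches that carry the query's label."""
    top_k = match_labels[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for label in top_k if label == query_label)
    return hits / len(top_k)


def f1_at_k(match_labels: Sequence[Hashable], query_label: Hashable,
            max_rel: int, k: int) -> float:
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(match_labels, query_label, k)
    r = recall_at_k(match_labels, query_label, max_rel, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Hypothetical data: label 'a' occurs three times in the collection, 'b' once.
max_rel_per_label: Dict[str, int] = {'a': 3, 'b': 1}
# Labels of the ranked matches returned for a query whose label is 'a'.
matches = ['a', 'b', 'a', 'a']

recall = recall_at_k(matches, 'a', max_rel_per_label['a'], k=2)
f1 = f1_at_k(matches, 'a', max_rel_per_label['a'], k=2)
```

Without `max_rel_per_label['a']`, the denominator of recall would be unknown, because the matches alone do not reveal how many documents with label 'a' exist in the collection.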

Code Example:

from docarray import Document, DocumentArray

# example DocumentArray with matches and labels for evaluation
da = DocumentArray([Document(text=str(i), tags={'label': i}) for i in range(3)])
for d in da:
    # every document gets the whole collection as its matches
    d.matches = da
# each label occurs once in the data collection (not otherwise known to the evaluate function)
max_rel_per_label = {i: 1 for i in range(3)}
# evaluate matches
metrics = da.evaluate(['recall_at_k'], max_rel_per_label=max_rel_per_label)
print(metrics)

{'recall_at_k': 1.0}

Signed-off-by: Michael Guenther <[email protected]>

codecov-commenter · 2022-11-22T08:22:13Z

Codecov Report

Base: 82.91% // Head: 72.73% // Decreases project coverage by -10.17% ⚠️

Coverage data is based on head (66f6ee5) compared to base (7a5b0bf).
Patch coverage: 10.52% of modified lines in pull request are covered.

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #826       +/-   ##
===========================================
- Coverage   82.91%   72.73%   -10.18%     
===========================================
  Files         138      138               
  Lines        7122     7137       +15     
===========================================
- Hits         5905     5191      -714     
- Misses       1217     1946      +729

Flag	Coverage Δ
docarray	`72.73% <10.52%> (-10.18%)`	⬇️

Impacted Files	Coverage Δ
docarray/math/evaluation.py	`0.00% <0.00%> (-94.83%)`	⬇️
docarray/array/mixins/evaluation.py	`8.33% <11.76%> (-0.76%)`	⬇️
docarray/math/distance/torch.py	`0.00% <0.00%> (-100.00%)`	⬇️
docarray/math/distance/paddle.py	`0.00% <0.00%> (-100.00%)`	⬇️
docarray/document/strawberry_type.py	`0.00% <0.00%> (-100.00%)`	⬇️
docarray/math/distance/tensorflow.py	`0.00% <0.00%> (-100.00%)`	⬇️
docarray/document/mixins/rich_embedding.py	`0.00% <0.00%> (-100.00%)`	⬇️
docarray/document/mixins/strawberry.py	`16.27% <0.00%> (-79.07%)`	⬇️
docarray/array/mixins/io/csv.py	`23.68% <0.00%> (-65.79%)`	⬇️
... and 40 more


gmastrapas · 2022-11-23T15:07:36Z

docarray/array/mixins/evaluation.py

        caller_max_rel = kwargs.pop('max_rel', None)
        for d, gd in zip(self, ground_truth):
-            max_rel = caller_max_rel or len(gd.matches)
+            if caller_max_rel:


I think you need to refactor the if else logic here a bit

for d, gd in zip(self, ground_truth):
    if caller_max_rel:
        max_rel = caller_max_rel
    if ground_truth_type == 'labels':
        if max_rel_per_label:
            max_rel = max_rel_per_label.get(d.tags[label_tag], None)
            if max_rel is None:
                raise ValueError(
                    '`max_rel_per_label` misses the label ' + str(d.tags[label_tag])
                )
        else:
            raise ValueError('max_rel is required or something')
    else:
        max_rel = len(gd.matches)

I think it is correct that caller_max_rel is used if the user provides a max_rel attribute explicitly.

The exception raised when max_rel is not set should also not be there, because most of the metrics do not require max_rel. Setting it to None might be better than setting it to len(gd.matches). I will change this.
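The precedence discussed in this thread can be sketched as a small standalone helper. This is only an illustration under assumptions, not the code that was merged: `resolve_max_rel` and its parameters are hypothetical names, and it checks `is not None` where the diff uses a truthiness test.

```python
from typing import Dict, Hashable, Optional


def resolve_max_rel(
    caller_max_rel: Optional[int],
    max_rel_per_label: Optional[Dict[Hashable, int]],
    label: Optional[Hashable] = None,
    num_ground_truth_matches: Optional[int] = None,
) -> Optional[int]:
    """Pick the number of relevant documents for a single query."""
    # 1. An explicit max_rel from the caller always wins.
    if caller_max_rel is not None:
        return caller_max_rel
    # 2. Label-based evaluation: look up the count for this query's label.
    if label is not None:
        if max_rel_per_label is None:
            # No counts available: return None so that metrics which do
            # not need max_rel can still be computed.
            return None
        if label not in max_rel_per_label:
            raise ValueError(f'`max_rel_per_label` misses the label {label}')
        return max_rel_per_label[label]
    # 3. Ground-truth matches: every ground-truth match counts as relevant.
    return num_ground_truth_matches
```

Returning None in the label case, instead of raising or falling back to len(gd.matches), leaves it to each metric to decide whether a missing max_rel is an error.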

Signed-off-by: Michael Guenther <[email protected]>

bwanglzu

Left some comments.

bwanglzu · 2022-11-28T10:24:56Z

docarray/array/mixins/evaluation.py

+        if ground_truth and label_tag in ground_truth[0].tags:
+            max_rel_per_label = dict(Counter([d.tags[label_tag] for d in ground_truth]))
+        elif not ground_truth and label_tag in query_data[0].tags:
+            max_rel_per_label = dict(Counter([d.tags[label_tag] for d in query_data]))
+        else:
+            max_rel_per_label = None


I don't understand: max_rel_per_label is a variable you passed into the function, so what is this max_rel_per_label?

Okay, I see, these are two functions.

Can you elaborate a bit on the naming of max_rel_per_label? Why not num_relevant_documents_per_label or something like that?
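The label counting in the diff above can be tried out in isolation. In this sketch, plain dicts with a 'tags' key stand in for docarray Documents, and `count_labels` is a hypothetical helper name; it mirrors the branch structure of the diff rather than reproducing the library code.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def count_labels(
    query_data: List[dict],
    ground_truth: Optional[List[dict]] = None,
    label_tag: str = 'label',
) -> Optional[Dict[Hashable, int]]:
    """Count how often each label occurs in the collection being searched."""
    # Prefer the ground truth when it is given, otherwise fall back to the
    # query data itself.
    source = ground_truth if ground_truth else query_data
    # Without labels on the documents there is nothing to count.
    if not source or label_tag not in source[0]['tags']:
        return None
    return dict(Counter(d['tags'][label_tag] for d in source))


# Hypothetical documents: two with label 'a', one with label 'b'.
docs = [
    {'tags': {'label': 'a'}},
    {'tags': {'label': 'a'}},
    {'tags': {'label': 'b'}},
]
counts = count_labels(docs)
```

The resulting dict is what the caller of evaluate would otherwise have to supply by hand as max_rel_per_label.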

Signed-off-by: Michael Guenther <[email protected]>

JoanFM

I would like to see some changes in Documentation

Signed-off-by: Michael Guenther <[email protected]>

alexcg1 · 2022-11-29T13:43:43Z