
feat: add max_rel_per_label to support recall for labeled data#826

Merged
JoanFM merged 11 commits into docarray:main from guenthermi:feat-add-max_rel_per_label
Nov 30, 2022

Conversation

@guenthermi (Contributor) commented Nov 22, 2022

Signed-off-by: Michael Guenther <[email protected]>

Goals:

  • Support recall@k and F1 measure@k for labeled datasets.
  • Check and update the documentation, if required. See guide.

For labeled datasets it is not trivial to calculate metrics like recall and F1 measure, which require knowing the number of relevant documents in the document collection for each query: neither the whole set of relevant documents nor the number of documents with a specific label is provided to the evaluate function.

To enable the calculation of recall and F1 measure, this PR

  • adds a parameter `max_rel_per_label: Dict` to the `evaluate` function, which provides the number of relevant documents for each label, i.e., the number of documents in the collection with that label.
  • calculates those label counts for `max_rel_per_label` in the `embed_and_evaluate` function.
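To see why these counts are needed, here is a minimal sketch of recall@k (illustrative only, not docarray's actual implementation): the denominator is the total number of relevant Documents in the collection, which the matches alone cannot reveal.

```python
# Minimal sketch of recall@k (illustrative, not docarray's implementation).
# `binary_relevance` marks each returned match as relevant (1) or not (0);
# `max_rel` is the total number of relevant Documents in the collection.
def recall_at_k(binary_relevance, max_rel, k):
    return sum(binary_relevance[:k]) / max_rel

# 2 of the 3 relevant Documents show up in the top-5 matches
print(recall_at_k([1, 0, 1, 0, 0], max_rel=3, k=5))  # ≈ 0.667
```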

Code Example:

from docarray import Document, DocumentArray

# example DocumentArray with matches and labels for evaluation
da = DocumentArray([Document(text=str(i), tags={'label': i}) for i in range(3)])
for d in da:
    d.matches = da
# each label occurs once in the data collection
# (a count that is not otherwise provided to the evaluate function)
max_rel_per_label = {i: 1 for i in range(3)}
# evaluate matches
metrics = da.evaluate(['recall_at_k'], max_rel_per_label=max_rel_per_label)
print(metrics)
{'recall_at_k': 1.0}
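The same per-label counts also enable `f1_score_at_k`. A minimal sketch of that metric (illustrative only, not docarray's actual implementation; the helper name and signature are assumptions):

```python
# Minimal sketch of F1@k (illustrative, not docarray's implementation).
def f1_score_at_k(binary_relevance, max_rel, k):
    hits = sum(binary_relevance[:k])  # relevant Documents among the top-k matches
    precision = hits / k              # fraction of returned matches that are relevant
    recall = hits / max_rel           # fraction of all relevant Documents retrieved
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 2 relevant hits in the top 3, out of 2 relevant Documents overall:
# precision = 2/3, recall = 1.0, F1 = 0.8
print(f1_score_at_k([1, 1, 0], max_rel=2, k=3))
```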

@codecov-commenter commented Nov 22, 2022

Codecov Report

Base: 82.91% // Head: 72.73% // Project coverage decreases by 10.17% ⚠️

Coverage data is based on head (66f6ee5) compared to base (7a5b0bf).
Patch coverage: 10.52% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #826       +/-   ##
===========================================
- Coverage   82.91%   72.73%   -10.18%     
===========================================
  Files         138      138               
  Lines        7122     7137       +15     
===========================================
- Hits         5905     5191      -714     
- Misses       1217     1946      +729     
Flag       Coverage Δ
docarray   72.73% <10.52%> (-10.18%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
docarray/math/evaluation.py 0.00% <0.00%> (-94.83%) ⬇️
docarray/array/mixins/evaluation.py 8.33% <11.76%> (-0.76%) ⬇️
docarray/math/distance/torch.py 0.00% <0.00%> (-100.00%) ⬇️
docarray/math/distance/paddle.py 0.00% <0.00%> (-100.00%) ⬇️
docarray/document/strawberry_type.py 0.00% <0.00%> (-100.00%) ⬇️
docarray/math/distance/tensorflow.py 0.00% <0.00%> (-100.00%) ⬇️
docarray/document/mixins/rich_embedding.py 0.00% <0.00%> (-100.00%) ⬇️
docarray/document/mixins/strawberry.py 16.27% <0.00%> (-79.07%) ⬇️
docarray/array/mixins/io/csv.py 23.68% <0.00%> (-65.79%) ⬇️
... and 40 more

☔ View full report at Codecov.

caller_max_rel = kwargs.pop('max_rel', None)
for d, gd in zip(self, ground_truth):
    max_rel = caller_max_rel or len(gd.matches)
    if caller_max_rel:
@gmastrapas (Contributor) commented Nov 23, 2022

I think you need to refactor the if/else logic here a bit:

for d, gd in zip(self, ground_truth):
    if caller_max_rel:
        max_rel = caller_max_rel

    if ground_truth_type == 'labels':
        if max_rel_per_label:
            max_rel = max_rel_per_label.get(d.tags[label_tag], None)
            if max_rel is None:
                raise ValueError(
                    '`max_rel_per_label` misses the label ' + str(d.tags[label_tag])
                )
        else:
            raise ValueError('max_rel is required or something')
    else:
        max_rel = len(gd.matches)

@guenthermi (Contributor, Author) replied:

I think it is correct that `caller_max_rel` is used if the user explicitly provides a `max_rel` attribute.

@guenthermi (Contributor, Author) replied:

This exception when `max_rel` is not set should also not be there, because most of the metrics do not require `max_rel`. However, setting it to `None` might be better than setting it to `len(gd.matches)`. I will change this.

@guenthermi guenthermi force-pushed the feat-add-max_rel_per_label branch from d65eb05 to 29777b7 on November 24, 2022 08:27
Signed-off-by: Michael Guenther <[email protected]>
@bwanglzu bwanglzu left a comment

Left some comments.

Comment on lines +459 to +464
if ground_truth and label_tag in ground_truth[0].tags:
max_rel_per_label = dict(Counter([d.tags[label_tag] for d in ground_truth]))
elif not ground_truth and label_tag in query_data[0].tags:
max_rel_per_label = dict(Counter([d.tags[label_tag] for d in query_data]))
else:
max_rel_per_label = None
I don't understand: `max_rel_per_label` is a variable you pass into the function, so what is this `max_rel_per_label`?

Okay, I see; these are two different functions.
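For reference, the `Counter`-based label counting shown in the diff above can be illustrated standalone (the labels below are made-up example data, not from the PR):

```python
from collections import Counter

# made-up label tags, one per Document in the collection
labels = ['cat', 'dog', 'cat', 'bird', 'cat']

# documents per label == number of relevant Documents
# for any query carrying that label
max_rel_per_label = dict(Counter(labels))
print(max_rel_per_label)  # {'cat': 3, 'dog': 1, 'bird': 1}
```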

Can you elaborate a bit on the naming `max_rel_per_label`? Why not `num_relevant_documents_per_label` or something like that?

@guenthermi (Contributor, Author) replied:

Ok, done.

@guenthermi guenthermi requested review from bwanglzu and removed request for gmastrapas November 28, 2022 10:35
@guenthermi guenthermi force-pushed the feat-add-max_rel_per_label branch 3 times, most recently from 802d8da to 62e70f8 on November 28, 2022 11:04
@guenthermi guenthermi closed this Nov 28, 2022
@guenthermi guenthermi reopened this Nov 28, 2022
Signed-off-by: Michael Guenther <[email protected]>
@guenthermi guenthermi force-pushed the feat-add-max_rel_per_label branch 2 times, most recently from 9dbabed to 22f0dc9 on November 28, 2022 12:12
Signed-off-by: Michael Guenther <[email protected]>
@guenthermi guenthermi force-pushed the feat-add-max_rel_per_label branch from 68852b9 to 5739b4f on November 28, 2022 13:57
@JoanFM JoanFM (Member) left a comment

I would like to see some changes in the documentation.

Signed-off-by: Michael Guenther <[email protected]>
@guenthermi guenthermi requested review from JoanFM and removed request for JoanFM and bwanglzu November 29, 2022 13:25
@guenthermi guenthermi requested review from JoanFM and bwanglzu and removed request for JoanFM and bwanglzu November 29, 2022 13:25
@JoanFM JoanFM requested a review from alexcg1 November 29, 2022 13:32

In this case, the keyword argument `k` is passed to all metric functions, even though it does not fulfill any specific function for the calculation of the reciprocal rank.

### The max-rel parameter

Suggested change
- ### The max-rel parameter
+ ### The max_rel parameter

### The max-rel parameter

Some metric functions shown in the table above require a `max_rel` parameter.
This parameter should be set to the number of relevant documents in the document collection.

Suggested change
- This parameter should be set to the number of relevant documents in the document collection.
+ This parameter should be set to the number of relevant Documents in the Document collection.


Some metric functions shown in the table above require a `max_rel` parameter.
This parameter should be set to the number of relevant documents in the document collection.
Without the knowledge of this number, metrics like `recall_at_k` and `f1_score_at_k` can be calculated.

Can or cannot?

This parameter should be set to the number of relevant documents in the document collection.
Without the knowledge of this number, metrics like `recall_at_k` and `f1_score_at_k` can be calculated.

In the `evaluate` function, one can provide a keyword argument `max_rel`, which is then used for all queries.

Change all "one" to "you"

{'recall_at_k': 1.0}
```

Since all relevant documents are in the matches, the recall is one.

Capitalize Documents throughout text

Signed-off-by: Michael Guenther <[email protected]>
@guenthermi guenthermi requested review from alexcg1 and removed request for JoanFM November 29, 2022 14:35
@alexcg1 alexcg1 (Contributor) left a comment

LGTM 👍

@JoanFM JoanFM merged commit 8a4224d into docarray:main Nov 30, 2022