feat: Implementation of embed_and_evaluate #702

Merged

hanxiao merged 13 commits into main from feat-add-embed-and-evaluate-function
Nov 2, 2022
Conversation

@guenthermi
Contributor

@guenthermi guenthermi commented Oct 28, 2022

Goals:

  • Add a function that does embed, match, and evaluate at once
  • Do embed and match in batches to reduce the memory footprint of the method
  • Enable sampling of query vectors to avoid a quadratic increase in time complexity
  • Check and update documentation, if required. See guide

Example Code:

import numpy as np
from docarray import Document, DocumentArray


def emb_func(da):
    # toy embedding function: seed with the document's text so each
    # document gets a deterministic 5-dimensional embedding
    for d in da:
        np.random.seed(int(d.text))
        d.embedding = np.random.random(5)

# 1,000 documents labelled 0-9 in round-robin fashion
da = DocumentArray(
    [Document(text=str(i), tags={'label': i % 10}) for i in range(1_000)]
)

# embed, match, and evaluate in one call, sampling 100 query vectors
da.embed_and_evaluate(
    metrics=['precision_at_k'], embed_funcs=emb_func, query_sample_size=100
)
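
As background for the metric used in the example, precision@k for a single query is the fraction of its top-k matches that share the query's label. A minimal sketch, independent of the docarray API (the function name here is illustrative, not docarray's implementation):

```python
def precision_at_k(query_label, retrieved_labels, k):
    """Fraction of the top-k retrieved labels equal to the query's label."""
    top_k = retrieved_labels[:k]
    return sum(1 for lbl in top_k if lbl == query_label) / k


# query labelled 3; labels of its top-5 retrieved matches
score = precision_at_k(3, [3, 3, 7, 3, 1], k=5)  # 3 of 5 match -> 0.6
```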

Reduction of memory usage when evaluating 100 query vectors against 500,000 index vectors with 500 dimensions:

Manual Evaluation:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    28   1130.7 MiB   1130.7 MiB           1   @profile
    29                                         def run_evaluation_old_style(queries, index, model):
    30   1133.1 MiB      2.5 MiB           1       queries.embed(model)
    31   2345.6 MiB   1212.4 MiB           1       index.embed(model)
    32   2360.4 MiB     14.8 MiB           1       queries.match(index)
    33   2360.4 MiB      0.0 MiB           1       return queries.evaluate(metrics=['reciprocal_rank'])

Evaluation with `embed_and_evaluate` (batch_size 100,000):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    23   1130.6 MiB   1130.6 MiB           1   @profile
    24                                         def run_evaluation(queries, index, model, batch_size=None):
    25   1130.6 MiB      0.0 MiB           1       kwargs = {'match_batch_size':batch_size} if batch_size else {}
    26   1439.9 MiB    309.3 MiB           1       return queries.embed_and_evaluate(metrics=['reciprocal_rank'], index_data=index, embed_models=model, **kwargs)
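
The memory saving comes from processing the index in batches rather than embedding it all at once. A plain-NumPy sketch of the batching idea (not the docarray implementation; a running top-k is merged batch by batch so only one batch of index vectors is held at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.random((100, 16))   # 100 query vectors
index = rng.random((1000, 16))    # stand-in for 500,000 index vectors
batch_size, k = 250, 5

# running top-k distances/indices per query, initialised to "worse than anything"
best_dist = np.full((len(queries), k), np.inf)
best_idx = np.full((len(queries), k), -1)

for start in range(0, len(index), batch_size):
    batch = index[start : start + batch_size]
    # squared Euclidean distances, shape (n_queries, len(batch))
    d = ((queries[:, None, :] - batch[None, :, :]) ** 2).sum(-1)
    # merge this batch's candidates with the running top-k and re-select
    cand_dist = np.concatenate([best_dist, d], axis=1)
    cand_idx = np.concatenate(
        [best_idx, np.tile(np.arange(start, start + len(batch)), (len(queries), 1))],
        axis=1,
    )
    order = np.argsort(cand_dist, axis=1)[:, :k]
    best_dist = np.take_along_axis(cand_dist, order, axis=1)
    best_idx = np.take_along_axis(cand_idx, order, axis=1)
```

Peak memory is bounded by the batch size instead of the full index, at the cost of one merge per batch.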

@guenthermi
Contributor Author

This is only a draft atm. I have to add more tests, benchmark the memory usage, and add documentation if everything seems to work well.

@codecov

codecov bot commented Oct 28, 2022

Codecov Report

Merging #702 (ebbd9dd) into main (3f07f52) will increase coverage by 2.20%.
The diff coverage is 96.82%.

@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
+ Coverage   85.95%   88.16%   +2.20%     
==========================================
  Files         133      133              
  Lines        6538     6600      +62     
==========================================
+ Hits         5620     5819     +199     
+ Misses        918      781     -137     
Flag Coverage Δ
docarray 88.16% <96.82%> (+2.20%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
docarray/array/mixins/embed.py 92.30% <ø> (ø)
docarray/array/mixins/match.py 91.66% <ø> (ø)
docarray/array/mixins/evaluation.py 88.11% <96.82%> (+6.63%) ⬆️
docarray/array/storage/redis/find.py 62.26% <0.00%> (-26.42%) ⬇️
docarray/array/storage/sqlite/getsetdel.py 100.00% <0.00%> (+2.38%) ⬆️
docarray/array/storage/annlite/getsetdel.py 100.00% <0.00%> (+2.63%) ⬆️
docarray/array/storage/elastic/getsetdel.py 100.00% <0.00%> (+3.38%) ⬆️
docarray/array/storage/weaviate/getsetdel.py 100.00% <0.00%> (+5.26%) ⬆️
docarray/array/storage/base/getsetdel.py 91.21% <0.00%> (+5.40%) ⬆️
docarray/array/mixins/getitem.py 90.47% <0.00%> (+11.11%) ⬆️
... and 6 more


@guenthermi guenthermi marked this pull request as ready for review October 31, 2022 10:44
Member

@samsja samsja left a comment


lgtm, could we change the docstring of embed to mention that this new method exists and should be used when the user needs to embed and evaluate directly?

**kwargs,
) -> Optional[Union[float, List[float]]]: # average for each metric
"""
Computes ranking evaluation metrics for a given `DocumentArray`. Moreover, this
Member


no need for moreover here IMO

is possible with the :func:``evaluate`` function.

:param metrics: List of metric names or metric functions to be computed
:param index_data: The other DocumentArray to match against, if not given,
Member


can we describe what it means to match against itself? Will all of the documents inside the da be matched against all the others? Or does it split the da into query and index randomly and perform the search from query to index?

Contributor Author

@guenthermi guenthermi Oct 31, 2022


Ok, yes, I can explain this. Atm it will match everything against everything. I might do another small PR afterwards to enable sampling.
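
The sampling idea being discussed can be sketched in a few lines of plain Python (stand-in lists, not the docarray API): instead of matching every document against every other one, draw a random subset of queries and match only those against the full set.

```python
import random

docs = list(range(10_000))        # stand-in for a DocumentArray
query_sample_size = 100           # as in the example's query_sample_size

random.seed(42)                   # fixed seed for reproducibility
queries = random.sample(docs, query_sample_size)

# cost drops from quadratic in len(docs) to linear per sampled query
comparisons_full = len(docs) ** 2
comparisons_sampled = len(queries) * len(docs)
```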

Member


matching everything against everything is not tractable tbh, why not just do the sampling right now?

Contributor Author


Note: We discussed that it might make sense, but it is better to implement it in this PR to avoid a breaking change.

'Either a ground_truth `DocumentArray` or labels are '
'required for the evaluation.'
)
if not label_tag in index_data[0].tags:
Member


technically we need to do this for all of the documents in index_data ...

Contributor Author


Yes, but I don't think it is necessary to do a full data validation. If a user provides only partial labels, I think it is ok if it crashes with a key error.
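
For comparison, the full up-front validation being debated could look roughly like this (a sketch with plain dicts standing in for Documents with a `tags` mapping; the error message and variable names are illustrative):

```python
label_tag = 'label'
# plain dicts standing in for Documents; every entry here carries the label
index_data = [{'tags': {'label': i % 10}} for i in range(100)]

# collect the positions of all documents missing the label tag
missing = [i for i, d in enumerate(index_data) if label_tag not in d['tags']]
if missing:
    raise ValueError(
        f'{len(missing)} documents are missing the label tag {label_tag!r}'
    )
```

This trades one linear pass over `index_data` for an early, descriptive failure instead of a later KeyError.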

Member


@JohannesMessner @alaeddine-13 what is your opinion here? For me it is either we do it for every doc or we don't do validation, but I might be wrong

@samsja samsja marked this pull request as draft November 1, 2022 13:12
@github-actions

github-actions bot commented Nov 2, 2022

📝 Docs are deployed on https://ft-feat-add-embed-and-evaluate-function--jina-docs.netlify.app 🎉

@guenthermi guenthermi marked this pull request as ready for review November 2, 2022 08:07
@guenthermi guenthermi changed the title feat: add first implemenation of embed_and_evaluate feat: Implementation of embed_and_evaluate Nov 2, 2022
@hanxiao hanxiao merged commit c38d82d into main Nov 2, 2022
@hanxiao hanxiao deleted the feat-add-embed-and-evaluate-function branch November 2, 2022 11:49
