feat: support root_id for storage backends #808

Merged
JoanFM merged 45 commits into main from feat-root-id
Nov 28, 2022

@AnneYang720
Contributor

@AnneYang720 AnneYang720 commented Nov 17, 2022

This PR implements issue #775

Goals:

  • code to support root_id
  • add related tests
  • add docs if needed
  • pass CI

@codecov-commenter

codecov-commenter commented Nov 17, 2022

Codecov Report

Base: 81.08% // Head: 81.67% // Increases project coverage by +0.59% 🎉

Coverage data is based on head (89222dc) compared to base (5fce6b6).
Patch coverage: 98.30% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #808      +/-   ##
==========================================
+ Coverage   81.08%   81.67%   +0.59%     
==========================================
  Files         138      138              
  Lines        7067     7112      +45     
==========================================
+ Hits         5730     5809      +79     
+ Misses       1337     1303      -34     
Flag Coverage Δ
docarray 81.67% <98.30%> (+0.59%) ⬆️

Impacted Files Coverage Δ
docarray/array/mixins/find.py 87.25% <92.30%> (-1.88%) ⬇️
docarray/array/mixins/setitem.py 74.79% <100.00%> (+1.46%) ⬆️
docarray/array/storage/annlite/backend.py 84.61% <100.00%> (+4.09%) ⬆️
docarray/array/storage/annlite/seqlike.py 92.00% <100.00%> (+8.00%) ⬆️
docarray/array/storage/base/backend.py 89.58% <100.00%> (+0.22%) ⬆️
docarray/array/storage/base/getsetdel.py 91.39% <100.00%> (+5.08%) ⬆️
docarray/array/storage/base/seqlike.py 91.48% <100.00%> (+2.01%) ⬆️
docarray/array/storage/elastic/backend.py 93.38% <100.00%> (+0.04%) ⬆️
docarray/array/storage/memory/find.py 54.68% <100.00%> (+8.39%) ⬆️
docarray/array/storage/milvus/backend.py 93.25% <100.00%> (+0.04%) ⬆️
... and 33 more


Member

@JohannesMessner JohannesMessner left a comment

Looking good so far!
I think there are a few more things we have to consider.

  • The users might also be interested in the score of a retrieved Document. I think if return_root=True we should just copy the score from the chunk Document to the root Document before returning it.

  • What if the user modifies the subindex directly, like da['@c'].extend(...)? Then we can't know the parent id. How do we want to handle this case?

  • What about operations like da[x] = doc? Then we should also set the root_id, not only for extend and append, so we will have to handle this in _update_subindex_set().
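
To make the first point concrete, here is a minimal sketch of copying the chunk score onto the returned root. It does not use docarray: `Doc`, `resolve_roots` and `roots_by_id` are hypothetical names chosen for illustration, and the real implementation may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the docarray class."""
    id: str
    score: Optional[float] = None
    root_id: Optional[str] = None
    chunks: List["Doc"] = field(default_factory=list)


def resolve_roots(chunk_hits: List[Doc], roots_by_id: Dict[str, Doc]) -> List[Doc]:
    """Map chunk-level hits to their roots, carrying each chunk's score over."""
    results = []
    for hit in chunk_hits:
        root = roots_by_id[hit.root_id]
        # Build a new object so a query never mutates the stored root.
        results.append(Doc(id=root.id, score=hit.score, chunks=root.chunks))
    return results
```

Returning a copy rather than writing the score onto the stored root is a design choice of this sketch only; it keeps a score from one query from leaking into the next.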

@github-actions github-actions bot added size/m and removed size/s labels Nov 22, 2022
@AnneYang720
Contributor Author

We now face some limitations:

  1. A boolean flag root_id is added to each storage backend config, with default value True. For the in-memory backend it is always True, because that backend has no config or flag.
  2. With root_id=True by default, users may receive many warnings when they add or change Documents at a nested level directly without setting _root_id_ or parent_id manually, e.g. da['@c'].extend(Document()).
  3. If we remove the root_id flag and always store this information by default, and only raise a warning or error when the user calls da.find(query, on='@c', return_root=True), users may need to re-index the database.
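
For reference, option 3 (checking only at find time) could look roughly like the sketch below. It is plain Python, not the docarray code: `Doc` and `root_ids_for` are hypothetical, and the choice of `ValueError` is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the docarray class."""
    id: str
    root_id: Optional[str] = None


def root_ids_for(hits: List[Doc]) -> List[str]:
    """Find-time check: complain only when roots are actually requested."""
    missing = [d.id for d in hits if d.root_id is None]
    if missing:
        raise ValueError(
            f"{len(missing)} hit(s) have no root id (e.g. {missing[0]!r}); "
            "re-index from the root level to use return_root=True"
        )
    return [d.root_id for d in hits]
```

The downside is visible here: the error only surfaces at query time, possibly long after the untagged Documents were written, which is where the re-indexing cost comes from.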

@JohannesMessner
Member

To add a little bit more context to 2. and 3.:

The intended use of this feature is for someone to add Documents to the root-level DocumentArray, so that Documents in a subindex can automatically be 'tagged' with a root_id: da.extend(...).
But if a user modifies a subindex directly, we can't know the roots of the Documents that they add (there might not even be a root): da['@c'].extend(...). So we cannot support this pattern.

The question becomes, when do we warn about this behaviour? At insert time, or at find() time when the missing root_id actually starts to matter? That is what 2. and 3. above refer to.
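
A rough sketch of the intended root-level path, assuming nothing about the actual docarray internals: `Doc`, `tag_with_root` and `RootArray` are hypothetical names, and the point is only that every root-level write path (extend and item assignment) stamps nested chunks with the root's id.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the docarray class."""
    id: str
    root_id: Optional[str] = None
    chunks: List["Doc"] = field(default_factory=list)


def tag_with_root(doc: Doc, root_id: Optional[str] = None) -> None:
    """Recursively stamp every nested chunk with the id of its top-level root."""
    if root_id is None:
        root_id = doc.id
    for chunk in doc.chunks:
        chunk.root_id = root_id
        tag_with_root(chunk, root_id)


class RootArray:
    """Hypothetical root-level container; both write paths tag chunks."""

    def __init__(self) -> None:
        self._docs: List[Doc] = []

    def extend(self, docs: Iterable[Doc]) -> None:
        for doc in docs:
            tag_with_root(doc)
            self._docs.append(doc)

    def __setitem__(self, index: int, doc: Doc) -> None:
        tag_with_root(doc)
        self._docs[index] = doc
```

A chunk handed straight to a subindex never passes through `tag_with_root`, so its `root_id` stays unset; that is the unsupported pattern described above.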

@JoanFM
Member

JoanFM commented Nov 28, 2022

If you have root_id=True and you are extending in the wrong way, add the warning; to me this makes sense. And the earlier we warn, the better.
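
As an illustration of warning at insert time, a minimal sketch using the standard `warnings` module. `SubIndex` and `Doc` are hypothetical stand-ins rather than the docarray classes, and the warning text is made up.

```python
import warnings
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the docarray class."""
    id: str
    root_id: Optional[str] = None


class SubIndex:
    """Hypothetical subindex that warns at insert time when root_id is enabled."""

    def __init__(self, root_id: bool = True) -> None:
        self.root_id = root_id
        self._docs: List[Doc] = []

    def extend(self, docs: Iterable[Doc]) -> None:
        docs = list(docs)
        if self.root_id and any(d.root_id is None for d in docs):
            warnings.warn(
                "Documents added directly to a subindex have no root id; "
                "find(..., return_root=True) will not work for them.",
                UserWarning,
            )
        # The Documents are still stored; the warning does not block the write.
        self._docs.extend(docs)
```

With the flag off, the same call stays silent, so users who never need roots are not flooded with warnings.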

@github-actions

📝 Docs are deployed on https://ft-feat-root-id--jina-docs.netlify.app 🎉

@JoanFM JoanFM merged commit ecf5d23 into main Nov 28, 2022
@JoanFM JoanFM deleted the feat-root-id branch November 28, 2022 16:27