Merged
45 commits
5f7acf0
feat: support root_id for subindex with redis
AnneYang720 Nov 17, 2022
e031227
feat: add root_id to config of backends
AnneYang720 Nov 17, 2022
365052d
feat: store find scores
AnneYang720 Nov 21, 2022
0f5f105
feat: add root_id in _update_subindices_set
AnneYang720 Nov 21, 2022
e290cff
refactor: remove comments
AnneYang720 Nov 21, 2022
aef6b32
feat: set default root_id to True
AnneYang720 Nov 21, 2022
dc3da50
feat: keep user pre-defined root_id
AnneYang720 Nov 21, 2022
2e5a485
feat: add warning for redis _extend about root_id
AnneYang720 Nov 21, 2022
7081f7c
feat: raise error when not all res have root_id
AnneYang720 Nov 21, 2022
865e02f
test: add test_find_return_root
AnneYang720 Nov 21, 2022
1b650b0
Merge branch 'main' into feat-root-id
AnneYang720 Nov 21, 2022
0a0c517
fix: convert value to docarray first
AnneYang720 Nov 21, 2022
601c881
feat: add no root_id warning
AnneYang720 Nov 22, 2022
6d1fcbe
feat: add root_id for root level set
AnneYang720 Nov 22, 2022
dffcb2b
fix: do not assign root_id automatically
AnneYang720 Nov 22, 2022
bcbfd4b
fix: keep scores when return_root is True
AnneYang720 Nov 22, 2022
6eb6eba
test: add tests for root_id
AnneYang720 Nov 22, 2022
fb62164
Merge branch 'main' into feat-root-id
AnneYang720 Nov 22, 2022
65e9715
feat: add root-id support for DAInMemory
AnneYang720 Nov 22, 2022
7f70402
Merge branch 'main' into feat-root-id
AnneYang720 Nov 22, 2022
3997b7e
docs: add doc for find return_root
AnneYang720 Nov 23, 2022
733d24d
Merge branch 'main' into feat-root-id
AnneYang720 Nov 23, 2022
a00bdbd
fix: annlite _append
AnneYang720 Nov 23, 2022
faa09ec
feat: add root_id support for milvus
AnneYang720 Nov 23, 2022
bd4194f
docs: add root_id to config for backends
AnneYang720 Nov 23, 2022
da5e972
refactor: change warning message
AnneYang720 Nov 24, 2022
b3df9f3
feat: init _is_subindex in base _init_storage
AnneYang720 Nov 24, 2022
6e33f98
feat: change root_id to private _root_id_
AnneYang720 Nov 24, 2022
56cb9d0
refactor: simplify code for find return root
AnneYang720 Nov 24, 2022
81824e7
refactor: move get_root_docs to common helper
AnneYang720 Nov 24, 2022
455e9c2
feat: add warning for subindex __setitem__
AnneYang720 Nov 24, 2022
8bf4256
docs: add note for find return_root
AnneYang720 Nov 24, 2022
f5dc437
refactor: use _is_subindex directly
AnneYang720 Nov 25, 2022
3b703b6
refactor: move _get_root_docs to mixins
AnneYang720 Nov 25, 2022
f3d54bd
Merge branch 'main' into feat-root-id
AnneYang720 Nov 25, 2022
736a4e6
fix: check_root_id only when value is Doc
AnneYang720 Nov 25, 2022
8431aac
fix: fix logic in check_root_id
AnneYang720 Nov 25, 2022
cc9160e
Merge branch 'main' into feat-root-id
AnneYang720 Nov 28, 2022
5f8ca22
refactor: add comments
AnneYang720 Nov 28, 2022
b1fbb27
docs: doc modification
AnneYang720 Nov 28, 2022
1c119a4
test: add test cases for test_find_return_root
AnneYang720 Nov 28, 2022
e050025
Merge branch 'main' into feat-root-id
AnneYang720 Nov 28, 2022
34d43ce
docs: add root_id to redis doc
AnneYang720 Nov 28, 2022
98a1270
refactor: add _is_subindex to constructor
AnneYang720 Nov 28, 2022
89222dc
Merge branch 'main' into feat-root-id
AnneYang720 Nov 28, 2022
34 changes: 28 additions & 6 deletions docarray/array/mixins/find.py
@@ -1,15 +1,13 @@
import abc
- from typing import overload, Optional, Union, Dict, List, Tuple, Callable, TYPE_CHECKING
+ from typing import TYPE_CHECKING, Callable, Dict, List, Optional, Tuple, Union, overload

import numpy as np

from docarray.math import ndarray
from docarray.score import NamedScore

if TYPE_CHECKING: # pragma: no cover
from docarray.typing import T, ArrayType

from docarray import Document, DocumentArray
from docarray.typing import ArrayType, T


class FindMixin:
@@ -99,6 +97,7 @@ def find(
filter: Union[Dict, str, None] = None,
only_id: bool = False,
index: str = 'text',
return_root: Optional[bool] = False,
on: Optional[str] = None,
**kwargs,
) -> Union['DocumentArray', List['DocumentArray']]:
@@ -126,14 +125,17 @@ def find(
parameter is ignored. By default, the Document `text` attribute will be used for search,
otherwise the tag field specified by `index` will be used. You can only use this parameter if the
storage backend supports searching by text.
:param return_root: if set, the root-level Documents that the subindex matches belong to are returned instead of the matches themselves. Only takes effect together with `on`.
:param on: specifies a subindex to search on. If set, the returned DocumentArray will be retrieved from the given subindex.
:param kwargs: other kwargs.

:return: a list of DocumentArrays containing the closest Document objects for each of the queries in `query`.
"""
from docarray import Document, DocumentArray

index_da = self._get_index(subindex_name=on)
if index_da is not self:
- return index_da.find(
+ results = index_da.find(
query,
metric,
limit,
@@ -144,7 +146,15 @@ def find(
index,
on=None,
)
- from docarray import Document, DocumentArray

if return_root:
da = self._get_root_docs(results)
for d, s in zip(da, results[:, 'scores']):
d.scores = s

return da

return results

if isinstance(query, dict):
if filter is None:
@@ -301,3 +311,15 @@ def _find_by_text(self, *args, **kwargs):
raise NotImplementedError(
f'Search by text is not supported with this backend {self.__class__.__name__}'
)

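def _get_root_docs(self, docs: 'DocumentArray') -> 'DocumentArray':
"""Get the root-level Documents that the given subindex Documents belong to.

:param docs: Documents returned from a subindex, each expected to carry `_root_id_` in its `tags`.
:return: a `DocumentArray` containing the root documents.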
"""

if not all(docs[:, 'tags___root_id_']):
raise ValueError(
'Not all Documents in this subindex have "_root_id_" set in their `tags`.'
)
return self[docs[:, 'tags___root_id_']]
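
Note on the `return_root` path: a minimal sketch of the lookup above, written with plain dicts instead of real `Document` objects so that it runs without docarray installed. The names `find_return_root`, `root_docs` and `matches` are made up for illustration; only the `_root_id_` tag key, the "all matches must have it" check, and the step that copies each match's scores onto its root come from the diff.

```python
from typing import Dict, List

# Hypothetical stand-ins: plain dicts play the role of Documents.
root_docs: Dict[str, dict] = {
    'r1': {'id': 'r1', 'scores': {}},
    'r2': {'id': 'r2', 'scores': {}},
}

# Subindex matches, each carrying the ID of its root in tags['_root_id_'].
matches: List[dict] = [
    {'id': 'c7', 'tags': {'_root_id_': 'r2'}, 'scores': {'cosine': 0.12}},
    {'id': 'c3', 'tags': {'_root_id_': 'r1'}, 'scores': {'cosine': 0.30}},
]


def get_root_docs(roots: Dict[str, dict], results: List[dict]) -> List[dict]:
    """Resolve each match to its root; fail if any match lacks a root ID."""
    root_ids = [m['tags'].get('_root_id_') for m in results]
    if not all(root_ids):
        raise ValueError('Not all matches have "_root_id_" set in `tags`.')
    return [roots[rid] for rid in root_ids]


def find_return_root(roots: Dict[str, dict], results: List[dict]) -> List[dict]:
    """Return the roots in match order, carrying over each match's scores."""
    out = get_root_docs(roots, results)
    for root, match in zip(out, results):
        root['scores'] = match['scores']
    return out


roots_found = find_return_root(root_docs, matches)
print([(d['id'], d['scores']) for d in roots_found])
```

In this sketch two matches that share a root would resolve to the same dict, so the last score written wins. Whether the real backends behave the same way is not something the diff shows.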
4 changes: 4 additions & 0 deletions docarray/array/mixins/setitem.py
@@ -63,6 +63,10 @@ def __setitem__(
index: 'DocumentArrayIndexType',
value: Union['Document', Sequence['Document']],
):
from docarray.helper import check_root_id

if self._is_subindex:
check_root_id(self, value)

self._update_subindices_set(index, value)
# set by offset
3 changes: 2 additions & 1 deletion docarray/array/storage/annlite/backend.py
@@ -31,6 +31,7 @@ class AnnliteConfig:
max_connection: Optional[int] = None
n_components: Optional[int] = None
columns: Optional[Union[List[Tuple[str, str]], Dict[str, str]]] = None
root_id: bool = True


class BackendMixin(BaseBackendMixin):
@@ -104,7 +105,7 @@ def _init_storage(

self._annlite = AnnLite(self.n_dim, lock=False, **filter_dict(config))

- super()._init_storage()
+ super()._init_storage(**kwargs)

if _docs is None:
return
2 changes: 1 addition & 1 deletion docarray/array/storage/annlite/seqlike.py
@@ -20,7 +20,7 @@ def _extend(self, values: Iterable['Document']) -> None:
self._offset2ids.extend([doc.id for doc in docs])

def _append(self, value: 'Document'):
- self.extend([value])
+ self._extend([value])

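Note on the `_append` change: the diff does not state the motivation. One plausible reading, assuming the public `append` wrapper mirrors the public `extend` (storage write followed by a subindex update), is that routing `_append` through the public `extend` would update subindices twice. The sketch below uses a hypothetical `SketchArray` class to show that effect; it is not the AnnLite implementation.

```python
from typing import List


class SketchArray:
    """Hypothetical stand-in for a DocumentArray with a single subindex."""

    def __init__(self, append_via_public_extend: bool):
        self.docs: List[str] = []
        self.subindex: List[str] = []
        self._append_via_public_extend = append_via_public_extend

    def _extend(self, values):
        # Storage-level write only.
        self.docs.extend(values)

    def _update_subindices(self, values):
        self.subindex.extend(values)

    def extend(self, values):
        # Public wrapper: storage write plus subindex sync.
        self._extend(values)
        self._update_subindices(values)

    def _append(self, value):
        if self._append_via_public_extend:
            self.extend([value])  # old shape: already syncs the subindex
        else:
            self._extend([value])  # new shape: storage write only

    def append(self, value):
        # Assumption: the public append mirrors the public extend.
        self._append(value)
        self._update_subindices([value])


old = SketchArray(append_via_public_extend=True)
old.append('d1')
new = SketchArray(append_via_public_extend=False)
new.append('d1')
print(old.subindex, new.subindex)
```

Under that assumption the old shape indexes the appended item twice while the new shape indexes it once.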
def __eq__(self, other):
"""In annlite backend, data are considered as identical if configs point to the same database source"""
6 changes: 5 additions & 1 deletion docarray/array/storage/base/backend.py
@@ -17,9 +17,11 @@ def _init_storage(
self,
_docs: Optional['DocumentArraySourceType'] = None,
copy: bool = False,
_is_subindex: bool = False,
*args,
**kwargs,
):
self._is_subindex = _is_subindex
self._load_offset2ids()
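
Note on `_is_subindex`: the flag only reaches this base method if each backend forwards its keyword arguments, which is what the `super()._init_storage(**kwargs)` changes in the backend files do. A minimal sketch with hypothetical class names (`BaseBackendSketch`, `ForwardingBackend`, `DroppingBackend`); only the `_is_subindex` parameter and the two `super()` call shapes are taken from the diff.

```python
from typing import Any, Optional


class BaseBackendSketch:
    def _init_storage(
        self,
        _docs: Optional[Any] = None,
        copy: bool = False,
        _is_subindex: bool = False,
        *args,
        **kwargs,
    ):
        self._is_subindex = _is_subindex


class ForwardingBackend(BaseBackendSketch):
    def _init_storage(self, _docs=None, config=None, **kwargs):
        self._config = config
        # Shape after the change: unknown keywords travel up to the base class.
        super()._init_storage(**kwargs)


class DroppingBackend(BaseBackendSketch):
    def _init_storage(self, _docs=None, config=None, **kwargs):
        self._config = config
        # Shape before the change: the flag is silently dropped.
        super()._init_storage()


fwd = ForwardingBackend()
fwd._init_storage(config={}, _is_subindex=True)
drop = DroppingBackend()
drop._init_storage(config={}, _is_subindex=True)
print(fwd._is_subindex, drop._is_subindex)
```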

def _init_subindices(
@@ -40,7 +42,9 @@ def _init_subindices(
config_joined = self._ensure_unique_config(
config, config_subindex, config_joined, name
)
- self._subindices[name] = self.__class__(config=config_joined)
+ self._subindices[name] = self.__class__(
+ config=config_joined, _is_subindex=True
+ )
if _docs:
from docarray import DocumentArray

14 changes: 13 additions & 1 deletion docarray/array/storage/base/getsetdel.py
@@ -200,13 +200,25 @@ def _update_subindices_set(self, set_index, docs):
_check_valid_values_nested_set(self[set_index], docs)
if set_index in subindices:
subindex_da = subindices[set_index]

subindex_da.clear()
subindex_da.extend(docs)
else: # root level set, update subindices iteratively
for subindex_selector, subindex_da in subindices.items():
old_ids = DocumentArray(self[set_index])[subindex_selector, 'id']
del subindex_da[old_ids]
- subindex_da.extend(DocumentArray(docs)[subindex_selector])

value = DocumentArray(docs)

if (
getattr(subindex_da, '_config', None) # checks if in-memory da
and subindex_da._config.root_id
):
for v in value:
for doc in DocumentArray(v)[subindex_selector]:
doc.tags['_root_id_'] = v.id

subindex_da.extend(value[subindex_selector])
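
Note on the root-level set branch: the order of operations is "drop the old chunks, tag the new chunks, index the new chunks". The sketch below reproduces that order with plain dicts and a dict-based subindex. The helper `set_root_doc` and the data layout are hypothetical, and storing the new root inside the same helper is a simplification, since the diff does that elsewhere.

```python
from typing import Dict, List


def set_root_doc(
    roots: List[dict], subindex: Dict[str, dict], offset: int, new_root: dict
) -> None:
    """Replace roots[offset] and keep a chunk subindex in sync (sketch only)."""
    # 1. Remove the chunks of the Document being overwritten.
    for old_chunk in roots[offset]['chunks']:
        del subindex[old_chunk['id']]
    # 2. Tag each incoming chunk with the ID of its new root, then index it.
    for chunk in new_root['chunks']:
        chunk.setdefault('tags', {})['_root_id_'] = new_root['id']
        subindex[chunk['id']] = chunk
    # 3. Store the new root itself (simplification of the real set path).
    roots[offset] = new_root


old_chunks = [{'id': 'c1'}, {'id': 'c2'}]
roots = [{'id': 'r1', 'chunks': old_chunks}]
subindex = {c['id']: c for c in old_chunks}

set_root_doc(roots, subindex, 0, {'id': 'r9', 'chunks': [{'id': 'c5'}]})
print(sorted(subindex), subindex['c5']['tags'])
```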

def _set_docs(self, ids, docs: Iterable['Document']):
docs = list(docs)
19 changes: 17 additions & 2 deletions docarray/array/storage/base/seqlike.py
@@ -1,5 +1,6 @@
import warnings
from abc import abstractmethod
- from typing import Iterator, Iterable, MutableSequence
+ from typing import Iterable, Iterator, MutableSequence

from docarray import Document, DocumentArray

@@ -10,7 +11,15 @@ class BaseSequenceLikeMixin(MutableSequence[Document]):
def _update_subindices_append_extend(self, value):
if getattr(self, '_subindices', None):
for selector, da in self._subindices.items():
- docs_selector = DocumentArray(value)[selector]

value = DocumentArray(value)

if getattr(da, '_config', None) and da._config.root_id:
for v in value:
for doc in DocumentArray(v)[selector]:
doc.tags['_root_id_'] = v.id

docs_selector = value[selector]
if len(docs_selector) > 0:
da.extend(docs_selector)
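
Note on the `getattr(da, '_config', None) and da._config.root_id` gate: tagging happens only when the subindex has a config object and that config has `root_id` enabled, so a subindex without `_config` (the diff's comment points at the in-memory array) is left untagged. A minimal sketch of that gate, with hypothetical `SketchConfig` and `SketchSubindex` classes standing in for the real ones:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchConfig:
    root_id: bool = True


class SketchSubindex:
    def __init__(self, config: Optional[SketchConfig]):
        # An in-memory array is modelled here as having no _config attribute.
        if config is not None:
            self._config = config
        self.docs: List[dict] = []


def extend_with_root_id(subindex: SketchSubindex, roots: List[dict]) -> None:
    """Tag chunks with their root ID when the gate allows it, then index them."""
    config = getattr(subindex, '_config', None)
    if config and config.root_id:
        for root in roots:
            for chunk in root['chunks']:
                chunk.setdefault('tags', {})['_root_id_'] = root['id']
    for root in roots:
        subindex.docs.extend(root['chunks'])


def make_roots() -> List[dict]:
    return [{'id': 'r1', 'chunks': [{'id': 'c1'}, {'id': 'c2'}]}]


tagged = SketchSubindex(SketchConfig())
extend_with_root_id(tagged, make_roots())
untagged = SketchSubindex(SketchConfig(root_id=False))
extend_with_root_id(untagged, make_roots())
in_memory = SketchSubindex(None)
extend_with_root_id(in_memory, make_roots())
print(tagged.docs[0], untagged.docs[0], in_memory.docs[0])
```

In this sketch the tag is written onto the caller's own chunk objects, so the caller sees the new `tags` entry after the call.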

@@ -63,6 +72,12 @@ def __bool__(self):
return len(self) > 0

def extend(self, values: Iterable['Document'], **kwargs) -> None:

from docarray.helper import check_root_id

if self._is_subindex:
check_root_id(self, values)

self._extend(values, **kwargs)
self._update_subindices_append_extend(values)

3 changes: 2 additions & 1 deletion docarray/array/storage/elastic/backend.py
@@ -46,6 +46,7 @@ class ElasticConfig:
ef_construction: Optional[int] = None
m: Optional[int] = None
columns: Optional[Union[List[Tuple[str, str]], Dict[str, str]]] = None
root_id: bool = True


_banned_indexname_chars = ['[', ' ', '"', '*', '\\', '<', '|', ',', '>', '/', '?', ']']
@@ -100,7 +101,7 @@ def _init_storage(
self._build_offset2id_index()

# Note super()._init_storage() calls _load_offset2ids which calls _get_offset2ids_meta
- super()._init_storage()
+ super()._init_storage(**kwargs)

if _docs is None:
return
16 changes: 16 additions & 0 deletions docarray/array/storage/memory/find.py
@@ -180,3 +180,19 @@ def _get_dist(da: 'DocumentArray'):
idx = np.take_along_axis(top_inds, permutation, axis=1)

return dist, idx

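def _get_root_docs(self, docs: 'DocumentArray') -> 'DocumentArray':
"""Get the root-level Documents that the given Documents belong to, by following `parent_id`.

:param docs: Documents at any nesting level of the current DocumentArray.
:return: a `DocumentArray` containing the root documents.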
"""
from docarray import DocumentArray

root_da_flat = self[...]
da = DocumentArray()
for doc in docs:
result = doc
while getattr(result, 'parent_id', None):
result = root_da_flat[result.parent_id]
da.append(result)
return da
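
Note on the in-memory variant: instead of reading a `_root_id_` tag, it climbs `parent_id` links until it reaches a Document without a parent. A minimal sketch of that walk over a flattened ID lookup, using plain dicts; the `flat` mapping and its contents are hypothetical.

```python
from typing import Dict, List

# Hypothetical flattened view: every Document at every nesting level, by ID.
flat: Dict[str, dict] = {
    'r1': {'id': 'r1', 'parent_id': None},
    'c1': {'id': 'c1', 'parent_id': 'r1'},
    'cc1': {'id': 'cc1', 'parent_id': 'c1'},
}


def get_root_docs(flat_index: Dict[str, dict], docs: List[dict]) -> List[dict]:
    """Follow parent_id upwards until a Document without a parent is reached."""
    roots = []
    for doc in docs:
        result = doc
        while result.get('parent_id'):
            result = flat_index[result['parent_id']]
        roots.append(result)
    return roots


found = get_root_docs(flat, [flat['cc1'], flat['c1'], flat['r1']])
print([d['id'] for d in found])
```

A Document that has no `parent_id` resolves to itself, so passing a root-level Document is harmless in this sketch.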
3 changes: 2 additions & 1 deletion docarray/array/storage/milvus/backend.py
@@ -93,6 +93,7 @@ class MilvusConfig:
batch_size: int = -1
columns: Optional[Union[List[Tuple[str, str]], Dict[str, str]]] = None
list_like: bool = True
root_id: bool = True


class BackendMixin(BaseBackendMixin):
@@ -134,7 +135,7 @@ def _init_storage(
self._collection = self._create_or_reuse_collection()
self._offset2id_collection = self._create_or_reuse_offset2id_collection()
self._build_index()
- super()._init_storage()
+ super()._init_storage(**kwargs)

# To align with Sqlite behavior; if `docs` is not `None` and table name
# is provided, :class:`DocumentArraySqlite` will clear the existing
3 changes: 2 additions & 1 deletion docarray/array/storage/qdrant/backend.py
@@ -51,6 +51,7 @@ class QdrantConfig:
full_scan_threshold: Optional[int] = None
m: Optional[int] = None
columns: Optional[Union[List[Tuple[str, str]], Dict[str, str]]] = None
root_id: bool = True


class BackendMixin(BaseBackendMixin):
@@ -128,7 +129,7 @@ def _init_storage(

self._initialize_qdrant_schema()

- super()._init_storage()
+ super()._init_storage(**kwargs)

if docs is None and config.collection_name:
return
3 changes: 2 additions & 1 deletion docarray/array/storage/redis/backend.py
@@ -35,6 +35,7 @@ class RedisConfig:
block_size: Optional[int] = None
initial_cap: Optional[int] = None
columns: Optional[Union[List[Tuple[str, str]], Dict[str, str]]] = None
root_id: bool = True


class BackendMixin(BaseBackendMixin):
@@ -87,7 +88,7 @@ def _init_storage(
self._client = self._build_client()
self._build_index()

- super()._init_storage()
+ super()._init_storage(**kwargs)

if _docs is None:
return
3 changes: 2 additions & 1 deletion docarray/array/storage/sqlite/backend.py
@@ -29,6 +29,7 @@ class SqliteConfig:
conn_config: Dict = field(default_factory=dict)
journal_mode: str = 'WAL'
synchronous: str = 'OFF'
root_id: bool = True


class BackendMixin(BaseBackendMixin):
@@ -101,7 +102,7 @@ def _init_storage(
self._connection.commit()
self._config = config
self._list_like = config.list_like
- super()._init_storage()
+ super()._init_storage(**kwargs)

if _docs is None:
return
1 change: 1 addition & 0 deletions docarray/array/storage/weaviate/backend.py
@@ -52,6 +52,7 @@ class WeaviateConfig:
# weaviate python client parameters
batch_size: Optional[int] = field(default=50)
dynamic_batching: Optional[bool] = field(default=False)
root_id: bool = True

def __post_init__(self):
if isinstance(self.timeout_config, list):
31 changes: 30 additions & 1 deletion docarray/helper.py
@@ -12,7 +12,7 @@
import hubble

if TYPE_CHECKING: # pragma: no cover
- from docarray import DocumentArray
+ from docarray import Document, DocumentArray

__resources_path__ = os.path.join(
os.path.dirname(
@@ -491,6 +491,35 @@ def _get_array_info(da: 'DocumentArray'):
return is_homo, _nested_in, _nested_items, attr_counter, all_attrs_names


def check_root_id(da: 'DocumentArray', value: Union['Document', Sequence['Document']]):

from docarray import Document
from docarray.array.memory import DocumentArrayInMemory

if not (
isinstance(value, Document)
or (isinstance(value, Sequence) and isinstance(value[0], Document))
):
return

if isinstance(value, Document):
value = [value]

if isinstance(da, DocumentArrayInMemory):
if not all([getattr(doc, 'parent_id', None) for doc in value]):
warnings.warn(
"Not all documents have parent_id set. This may cause unexpected behavior.",
UserWarning,
)
elif da._config.root_id and not all(
[doc.tags.get('_root_id_', None) for doc in value]
):
warnings.warn(
"root_id is enabled but not all documents have _root_id_ set. This may cause unexpected behavior.",
UserWarning,
)
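
Note on `check_root_id`: a minimal sketch of the "warn if any Document lacks `_root_id_`" check for the non-in-memory case, using plain dicts in place of Documents. The function name `check_root_id_sketch`, the boolean return value, and the explicit empty-input guard are additions made for this sketch and are not in the diff.

```python
import warnings
from collections.abc import Sequence
from typing import Union


def check_root_id_sketch(
    value: Union[dict, Sequence], root_id_enabled: bool = True
) -> bool:
    """Return False (and warn) if any doc lacks tags['_root_id_']."""
    if isinstance(value, dict):
        value = [value]
    # Sketch-only guard: skip anything that is not a non-empty sequence of docs.
    if (
        not isinstance(value, Sequence)
        or not value
        or not isinstance(value[0], dict)
    ):
        return True
    if root_id_enabled and not all(
        d.get('tags', {}).get('_root_id_') for d in value
    ):
        warnings.warn(
            'root_id is enabled but not all documents have _root_id_ set.',
            UserWarning,
        )
        return False
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    tagged_ok = check_root_id_sketch([{'id': 'c1', 'tags': {'_root_id_': 'r1'}}])
    untagged_ok = check_root_id_sketch({'id': 'c2', 'tags': {}})
    empty_ok = check_root_id_sketch([])

print(tagged_ok, untagged_ok, empty_ok, len(caught))
```

Two points in the diff's version may be worth a second look. As written, it evaluates `value[0]` for any `Sequence`, which would raise `IndexError` on an empty list. It also returns early for inputs that are not a `Sequence` (a generator passed to `extend`, for example), so those are never checked.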


def login(interactive: Optional[bool] = None, force: bool = False, **kwargs):
"""Login to Jina AI Cloud account.
:param interactive: If set to true, login will support notebook environments, otherwise the environment will be inferred.
1 change: 1 addition & 0 deletions docs/advanced/document-store/annlite.md
@@ -48,6 +48,7 @@ The following configs can be set:
| `max_connection` | The number of bi-directional links created for every new element during construction. | `None`, defaults to the default value in the AnnLite package* |
| `n_components` | The output dimension of PCA model. Should be a positive number and less than `n_dim` if it's not `None` | `None`, defaults to the default value in the AnnLite package* |
| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| `root_id` | Boolean flag indicating whether to store the root Document's ID (as `_root_id_`) in the tags of chunk-level Documents | True |

*You can check the default values in [the AnnLite source code](https://github.com/jina-ai/annlite/blob/main/annlite/core/index/hnsw/index.py)

1 change: 1 addition & 0 deletions docs/advanced/document-store/elasticsearch.md
@@ -404,6 +404,7 @@ The following configs can be set:
| `tag_indices` | List of tags to index | False |
| `batch_size` | Batch size used to handle storage refreshes/updates | 64 |
| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| `root_id` | Boolean flag indicating whether to store the root Document's ID (as `_root_id_`) in the tags of chunk-level Documents | True |

```{tip}
You can read more about HNSW parameters and their default values [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-params)
3 changes: 2 additions & 1 deletion docs/advanced/document-store/milvus.md
@@ -128,10 +128,11 @@ The following configs can be set:
| `index_params` | A dictionary of parameters used for index building. The [allowed parameters](https://milvus.io/docs/v2.1.x/index.md) depend on the index type. | {'M': 4, 'efConstruction': 200} (assumes HNSW index) |
| `collection_config` | Configuration for the Milvus collection. Passed as **kwargs during collection creation (`Collection(...)`). | {} |
| `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | {} |
- | `consistency_level` | [Consistency level](https://milvus.io/docs/v2.1.x/consistency.md#Consistency-levels) for Milvus database operations. Can be 'Session', 'Strong', 'Bounded' or 'Eventually'. | 'Session' |
+ | `consistency_level` | [Consistency level](https://milvus.io/docs/v2.1.x/consistency.md#Consistency-levels) for Milvus database operations. Can be 'Session', 'Strong', 'Bounded' or 'Eventually'. | 'Session' |
| `batch_size` | Default batch size for CRUD operations. | -1 (no batching) |
| `columns` | Additional columns to be stored in the database, taken from Document `tags`. | None |
| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| `root_id` | Boolean flag indicating whether to store the root Document's ID (as `_root_id_`) in the tags of chunk-level Documents | True |

## Minimal example

2 changes: 2 additions & 0 deletions docs/advanced/document-store/qdrant.md
@@ -94,6 +94,8 @@ The following configs can be set:
| `m` | Number of edges per node in the index graph. Larger = more accurate search, more space required | `None`, defaults to the default value in Qdrant* |
| `columns` | Other fields to store in Document | `None` |
| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
| `root_id` | Boolean flag indicating whether to store the root Document's ID (as `_root_id_`) in the tags of chunk-level Documents | True |


*You can read more about the HNSW parameters and their default values [here](https://qdrant.tech/documentation/indexing/#vector-index)
