From df760ef968d260d90df8f454132fdbcdb9e9ac39 Mon Sep 17 00:00:00 2001 From: winstonww Date: Tue, 15 Mar 2022 13:00:21 +0800 Subject: [PATCH 1/4] docs: rebase and add details to docs --- docs/fundamentals/documentarray/find.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 docs/fundamentals/documentarray/find.md diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md new file mode 100644 index 00000000000..262acafbf20 --- /dev/null +++ b/docs/fundamentals/documentarray/find.md @@ -0,0 +1,21 @@ +(find-documentarray)= +# Finding Documents + +In previous chapter, we saw how `.match` can be used to match nearest neighbors of query documents using the `embeddings` computed. An alternative way to accomplish the same task is to use the `.find` function. In addition to matching nearest neighbors using `embeddings`, the `.find` function can also be used to filter documents based on the attributes specified in a query dictionary. + +```{important} + +{meth}`~docarray.array.mixins.find.FindMixin.find` supports both **embedding-based nearest-neighbour search** & **document attributes filtering**. +``` + + +```{seealso} +- {meth}`~docarray.array.mixins.match.MatchMixin.match`: find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings`. +``` + +## Searching Nearest-neighbour Documents + +Like `.match`, the `.find` method also finds the nearest neighbors of a given collection of documents. You can use `.find` like you would in `.match`. The `.find` method accepts the same options of `.match`. For instance, you can also specify the device (CPU/GPU) and the `batch_size`. It also supports matching all the different types of embeddings as in `.match`. 
+ + +## Filtering Documents based on Attributes From 98c897b856e89db8cb2b4e8db6ccc41d6c404d34 Mon Sep 17 00:00:00 2001 From: winstonww Date: Tue, 15 Mar 2022 14:51:35 +0800 Subject: [PATCH 2/4] docs: find documentation --- docs/fundamentals/documentarray/find.md | 118 ++++++++++++++++++++++- docs/fundamentals/documentarray/index.md | 3 +- 2 files changed, 117 insertions(+), 4 deletions(-) diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md index 262acafbf20..a1fba52ccb8 100644 --- a/docs/fundamentals/documentarray/find.md +++ b/docs/fundamentals/documentarray/find.md @@ -1,13 +1,12 @@ (find-documentarray)= # Finding Documents -In previous chapter, we saw how `.match` can be used to match nearest neighbors of query documents using the `embeddings` computed. An alternative way to accomplish the same task is to use the `.find` function. In addition to matching nearest neighbors using `embeddings`, the `.find` function can also be used to filter documents based on the attributes specified in a query dictionary. - ```{important} {meth}`~docarray.array.mixins.find.FindMixin.find` supports both **embedding-based nearest-neighbour search** & **document attributes filtering**. ``` +In previous chapter, we saw how {meth}`~docarray.array.mixins.match.MatchMixin.match` can be used to match nearest neighbors of query documents using the `.embeddings` computed. Another way to accomplish the same task is to use the {meth}`~docarray.array.mixins.find.FindMixin.find` function. In addition to **matching nearest neighbors** using `embeddings`, the `find` function can also be used to **filter documents based on attributes** by conditions specified in a query dictionary. ```{seealso} - {meth}`~docarray.array.mixins.match.MatchMixin.match`: find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings`. 
@@ -15,7 +14,120 @@ In previous chapter, we saw how `.match` can be used to match nearest neighbors ## Searching Nearest-neighbour Documents -Like `.match`, the `.find` method also finds the nearest neighbors of a given collection of documents. You can use `.find` like you would in `.match`. The `.find` method accepts the same options of `.match`. For instance, you can also specify the device (CPU/GPU) and the `batch_size`. It also supports matching all the different types of embeddings as in `.match`. +Like {meth}`~docarray.array.mixins.match.MatchMixin.match`, the {meth}`~docarray.array.mixins.find.FindMixin.find` method also finds the nearest neighbors of a given collection of query documents. The `find` method works almost the same as `match` and accepts the same options as `match`. For instance, like in the case of `match()`, you can specify the `device` (CPU/GPU) and the `batch_size`. It also supports matching every types of embeddings supported by `match`. + + +The **only** difference is that `match()` is invoked with query `DocumentArray`, and takes the index documents as input. On the other hand, `find()` is invoked with the index `DocumentArray`, and takes query documents as input. + +That is, the following two invocations are equivalent. + +````{tab} .find +```{code-block} python +--- +emphasize-lines: 1, 2 +--- +index_docs.find( + query_docs, + device='gpu', + batch_size=10, + limit=50, + metric_name='cosine', + exclude_self=True, + only_id=False, + **kwargs, +) +``` +```` + +````{tab} .match +```{code-block} python +--- +emphasize-lines: 1, 2 +--- + +query_docs.match( + index_docs, + device='gpu', + batch_size=10, + limit=50, + metric_name='cosine', + exclude_self=True, + only_id=False, + **kwargs, +) +``` +```` ## Filtering Documents based on Attributes + +We can also use {meth}`~docarray.array.mixins.find.FindMixin.find` to filter documents based on attributes with the conditions specified in a `query` dictionary. 
+ +The `query` dictionary defines the filtering conditions using the [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/) query language. Let's take a look at how filtering by attributes can be done. + +### Filtering by Attributes + +As a simple example, let's consider the case when we want to filter documents with `text` equals `'hello'`. This can be done by: + +```python +docs.find({'text': {'$eq': 'hello'}}) +``` + +The above will return a `DocumentArray` in which each document has `doc.text == 'hello'`. We can compose multiple conditions using boolean logic operators. For instance, to filter by one or the other condition, we can: + +```python +docs.find({'$or': [{'text': {'$eq': 'hello'}}, {'text': {'$eq': 'world'}}]}) +``` + +The above returns a `DocumentArray` in which each `doc` in `docs` satisfies `doc.text == 'hello' or doc.text == 'world'`. + + +### Filtering by Tags + +To filter by data in the `tags` attribute, we can: + +```python +docs.find({'tags__number': {'$gt': 3}}) +``` + +The above will return a `DocumentArray` in which each document has `doc.tags['number'] > 3`. + + +### Filtering Using Tags as Placeholder + +We also use `tags` keys as placeholder by: + +```python +docs.find({'text': {'$eq': '{tags__name}'}}) +``` + +The above will return a `DocumentArray` in which each document has `doc.text == doc.tags['name']`. 
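To make the three filtering styles above concrete (plain attribute, `tags__` dunder key, and `{tags__name}` placeholder), here is a minimal sketch in pure Python. It is not DocArray's implementation: plain dicts stand in for Documents, only a subset of the operators is covered, and the names `OPS`, `resolve`, `substitute`, `matches` and `find` are hypothetical helpers introduced for illustration.

```python
import operator

# Hypothetical operator table; covers only the comparison operators.
OPS = {
    '$eq': operator.eq,
    '$ne': operator.ne,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
}


def resolve(doc, key):
    """Follow a dunder path such as 'tags__name' through nested dicts."""
    value = doc
    for part in key.split('__'):
        value = value[part]
    return value


def substitute(doc, val):
    """Replace a '{field}' placeholder with that field's value from the same doc."""
    if isinstance(val, str) and val.startswith('{') and val.endswith('}'):
        return resolve(doc, val[1:-1])
    return val


def matches(doc, query):
    """True if doc satisfies every {field: {op: value}} clause or $or/$and group."""
    for key, cond in query.items():
        if key == '$or':
            if not any(matches(doc, q) for q in cond):
                return False
        elif key == '$and':
            if not all(matches(doc, q) for q in cond):
                return False
        else:
            for op, val in cond.items():
                if not OPS[op](resolve(doc, key), substitute(doc, val)):
                    return False
    return True


def find(docs, query):
    """Keep the docs that satisfy the query, preserving order."""
    return [d for d in docs if matches(d, query)]


# Plain dicts standing in for Documents (an assumption of this sketch).
docs = [
    {'text': 'hello', 'tags': {'number': 5, 'name': 'hello'}},
    {'text': 'world', 'tags': {'number': 2, 'name': 'earth'}},
    {'text': 'other', 'tags': {'number': 4, 'name': 'other'}},
]

by_text = find(docs, {'$or': [{'text': {'$eq': 'hello'}}, {'text': {'$eq': 'world'}}]})
by_tag = find(docs, {'tags__number': {'$gt': 3}})
by_placeholder = find(docs, {'text': {'$eq': '{tags__name}'}})
```

The placeholder case differs from the other two in that the right-hand value is looked up per document rather than fixed in the query.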
+ + +### Supported Operators + +Note, that only the following MongoDB's query operators are supported: + +| Query Operator | Description | +|----------------|------------------------------------------------------------------------------------------------------------| +| `$eq` | Equal to (number, string) | +| `$ne` | Not equal to (number, string) | +| `$gt` | Greater than (number) | +| `$gte` | Greater than or equal to (number) | +| `$lt` | Less than (number) | +| `$lte` | Less than or equal to (number) | +| `$in` | Is in an array | +| `$nin` | Not in an array | +| `$regex` | Match the specified regular expression | +| `$size` | Match array/dict field that have the specified size. `$size` does not accept ranges of values. | +| `$exists` | Matches documents that have the specified field. And empty string content is also cosidered as not exists. | + +For boolean logic operators, only the following are supported: + + +| Boolean Operator | Description | +|------------------|----------------------------------------------------| +| `$and` | Join query clauses with a logical AND | +| `$or` | Join query clauses with a logical OR | +| `$not` | Inverts the effect of a query expression | + diff --git a/docs/fundamentals/documentarray/index.md b/docs/fundamentals/documentarray/index.md index 6fff3dfc252..9385f575e40 100644 --- a/docs/fundamentals/documentarray/index.md +++ b/docs/fundamentals/documentarray/index.md @@ -37,8 +37,9 @@ access-elements access-attributes embedding matching +find evaluation parallelization visualization post-external -``` \ No newline at end of file +``` From 3cde7de044efabf43e790cc1d6ccad3258f20882 Mon Sep 17 00:00:00 2001 From: Han Xiao Date: Tue, 15 Mar 2022 12:43:31 +0100 Subject: [PATCH 3/4] docs(find): improve find docs --- docarray/array/queryset/lookup.py | 21 +- docarray/array/queryset/parser.py | 12 +- docs/fundamentals/document/index.md | 1 + .../documentarray/access-elements.md | 2 +- docs/fundamentals/documentarray/find.md | 205 
++++++++++-------- 5 files changed, 141 insertions(+), 100 deletions(-) diff --git a/docarray/array/queryset/lookup.py b/docarray/array/queryset/lookup.py index 05cce78dbd2..c896556f565 100644 --- a/docarray/array/queryset/lookup.py +++ b/docarray/array/queryset/lookup.py @@ -107,13 +107,22 @@ def lookup(key, val, doc: 'Document') -> bool: elif last == 'size': return iff_not_none(value, lambda y: len(y) == val) elif last == 'exists': - if value is None: - return True != val - elif isinstance(value, (str, bytes)): - return (value == '' or value == b'') != val + if not isinstance(val, bool): + raise ValueError( + '$exists operator can only accept True/False as value for comparison' + ) + + if '__' in get_key: + is_empty = False + try: + is_empty = not value + except: + # ndarray-like will end up here + pass + + return is_empty != val else: - return True == val - # return (value is None or value == '' or value == b'') != val + return (get_key in doc.non_empty_fields) == val else: # return value == val raise ValueError( diff --git a/docarray/array/queryset/parser.py b/docarray/array/queryset/parser.py index 0c0957fb623..83773944550 100644 --- a/docarray/array/queryset/parser.py +++ b/docarray/array/queryset/parser.py @@ -51,11 +51,15 @@ def _parse_lookups(data: Dict = {}, root_node: Optional[LookupNode] = None): f'The operator {key} is not supported yet, please double check the given filters!' ) else: - items = list(value.items()) - if len(items) == 0: - raise ValueError(f'The query is illegal: {data}') + if not value or not isinstance(value, dict): + raise ValueError( + '''Not a valid query. It should follow the format: + { : { : }, ... 
} + ''' + ) - elif len(items) == 1: + items = list(value.items()) + if len(items) == 1: op, val = items[0] if op in LOGICAL_OPERATORS: if op == '$not': diff --git a/docs/fundamentals/document/index.md b/docs/fundamentals/document/index.md index b602d621f8f..adeee3752e3 100644 --- a/docs/fundamentals/document/index.md +++ b/docs/fundamentals/document/index.md @@ -4,6 +4,7 @@ A Document object has a predefined data schema as below, each of the attributes can be set/get with the dot expression as you would do with any Python object. +(doc-fields)= | Attribute | Type | Description | |-------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------| | id | string | A hexdigest that represents a unique document ID | diff --git a/docs/fundamentals/documentarray/access-elements.md b/docs/fundamentals/documentarray/access-elements.md index 2b3c54446bf..4f0390925e6 100644 --- a/docs/fundamentals/documentarray/access-elements.md +++ b/docs/fundamentals/documentarray/access-elements.md @@ -1,5 +1,5 @@ (access-elements)= -# Access Elements +# Access Documents This is probably my favorite chapter so far. Readers come to this far may ask: okay you re-implement Python List coin it as DocumentArray, what's the big deal? diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md index a1fba52ccb8..f7d17ebec20 100644 --- a/docs/fundamentals/documentarray/find.md +++ b/docs/fundamentals/documentarray/find.md @@ -1,129 +1,130 @@ (find-documentarray)= -# Finding Documents +# Query by Conditions -```{important} +We can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based the conditions specified in a `query` object. 
One can use `da.find(query)` to filter Documents and get nearest neighbours from `da`: -{meth}`~docarray.array.mixins.find.FindMixin.find` supports both **embedding-based nearest-neighbour search** & **document attributes filtering**. -``` - -In previous chapter, we saw how {meth}`~docarray.array.mixins.match.MatchMixin.match` can be used to match nearest neighbors of query documents using the `.embeddings` computed. Another way to accomplish the same task is to use the {meth}`~docarray.array.mixins.find.FindMixin.find` function. In addition to **matching nearest neighbors** using `embeddings`, the `find` function can also be used to **filter documents based on attributes** by conditions specified in a query dictionary. +- To filter Documents, the `query` object is a Python dictionary object that defines the filtering conditions using a [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/)-like query language. +- To find nearest neighbours, the `query` object is a NdArray-like object that defines embedding. One can also use `.match()` function for this purpose, and there is a minor interface difference between these two functions. -```{seealso} -- {meth}`~docarray.array.mixins.match.MatchMixin.match`: find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings`. -``` +Let's see some examples in action. First, let's prepare a DocumentArray we will use. -## Searching Nearest-neighbour Documents +```python +from jina import Document, DocumentArray -Like {meth}`~docarray.array.mixins.match.MatchMixin.match`, the {meth}`~docarray.array.mixins.find.FindMixin.find` method also finds the nearest neighbors of a given collection of query documents. The `find` method works almost the same as `match` and accepts the same options as `match`. For instance, like in the case of `match()`, you can specify the `device` (CPU/GPU) and the `batch_size`. It also supports matching every types of embeddings supported by `match`. 
+da = DocumentArray([Document(text='journal', weight=25, tags={'h': 14, 'w': 21, 'uom': 'cm'}, modality='A'), + Document(text='notebook', weight=50, tags={'h': 8.5, 'w': 11, 'uom': 'in'}, modality='A'), + Document(text='paper', weight=100, tags={'h': 8.5, 'w': 11, 'uom': 'in'}, modality='D'), + Document(text='planner', weight=75, tags={'h': 22.85, 'w': 30, 'uom': 'cm'}, modality='D'), + Document(text='postcard', weight=45, tags={'h': 10, 'w': 15.25, 'uom': 'cm'}, modality='A')]) +da.summary() +``` -The **only** difference is that `match()` is invoked with query `DocumentArray`, and takes the index documents as input. On the other hand, `find()` is invoked with the index `DocumentArray`, and takes query documents as input. +```text + Documents Summary + + Length 5 + Homogenous Documents True + Common Attributes ('id', 'text', 'tags', 'weight', 'modality') + + Attributes Summary + + Attribute Data type #Unique values Has empty value + ────────────────────────────────────────────────────────── + id ('str',) 5 False + weight ('int',) 5 False + modality ('str',) 2 False + tags ('dict',) 5 False + text ('str',) 5 False +``` -That is, the following two invocations are equivalent. +## Filter with query operators -````{tab} .find -```{code-block} python ---- -emphasize-lines: 1, 2 ---- +A query filter document can use the query operators to specify conditions in the following form: -index_docs.find( - query_docs, - device='gpu', - batch_size=10, - limit=50, - metric_name='cosine', - exclude_self=True, - only_id=False, - **kwargs, -) -``` -```` - -````{tab} .match -```{code-block} python ---- -emphasize-lines: 1, 2 ---- - -query_docs.match( - index_docs, - device='gpu', - batch_size=10, - limit=50, - metric_name='cosine', - exclude_self=True, - only_id=False, - **kwargs, -) +```text +{ : { : }, ... } ``` -```` -## Filtering Documents based on Attributes +Here `field1` is {ref}`any field name` of a Document object. To access nested fields, one can use the dunder expression. 
For example, `tags__timestamp` is to access `doc.tags['timestamp']` field. -We can also use {meth}`~docarray.array.mixins.find.FindMixin.find` to filter documents based on attributes with the conditions specified in a `query` dictionary. +`operator1` can be one of the following: -The `query` dictionary defines the filtering conditions using the [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/) query language. Let's take a look at how filtering by attributes can be done. +| Query Operator | Description | +|----------------|------------------------------------------------------------------------------------------------------------| +| `$eq` | Equal to (number, string) | +| `$ne` | Not equal to (number, string) | +| `$gt` | Greater than (number) | +| `$gte` | Greater than or equal to (number) | +| `$lt` | Less than (number) | +| `$lte` | Less than or equal to (number) | +| `$in` | Is in an array | +| `$nin` | Not in an array | +| `$regex` | Match the specified regular expression | +| `$size` | Match array/dict field that have the specified size. `$size` does not accept ranges of values. | +| `$exists` | Matches documents that have the specified field. And empty string content is also considered as not exists. | -### Filtering by Attributes -As a simple example, let's consider the case when we want to filter documents with `text` equals `'hello'`. This can be done by: +For example, to select all `modality='D'` Documents, ```python -docs.find({'text': {'$eq': 'hello'}}) -``` - -The above will return a `DocumentArray` in which each document has `doc.text == 'hello'`. We can compose multiple conditions using boolean logic operators. 
For instance, to filter by one or the other condition, we can: +r = da.find({'modality': {'$eq': 'D'}}) -```python -docs.find({'$or': [{'text': {'$eq': 'hello'}}, {'text': {'$eq': 'world'}}]}) +pprint(r.to_dict(exclude_none=True)) # just for pretty print ``` -The above returns a `DocumentArray` in which each `doc` in `docs` satisfies `doc.text == 'hello' or doc.text == 'world'`. - - -### Filtering by Tags +```text +[{'id': '92aee5d665d0c4dd34db10d83642aded', + 'modality': 'D', + 'tags': {'h': 8.5, 'uom': 'in', 'w': 11.0}, + 'text': 'paper', + 'weight': 100.0}, + {'id': '1a9d2139b02bc1c7842ecda94b347889', + 'modality': 'D', + 'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0}, + 'text': 'planner', + 'weight': 75.0}] +``` -To filter by data in the `tags` attribute, we can: +To select all Documents whose `.tags['h']>10`, ```python -docs.find({'tags__number': {'$gt': 3}}) +r = da.find({'tags__h': {'$gt': 10}}) ``` -The above will return a `DocumentArray` in which each document has `doc.tags['number'] > 3`. - - -### Filtering Using Tags as Placeholder +```text +[{'id': '4045a9659875fd1299e482d710753de3', + 'modality': 'A', + 'tags': {'h': 14.0, 'uom': 'cm', 'w': 21.0}, + 'text': 'journal', + 'weight': 25.0}, + {'id': 'cf7691c445220b94b88ff116911bad24', + 'modality': 'D', + 'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0}, + 'text': 'planner', + 'weight': 75.0}] +``` -We also use `tags` keys as placeholder by: +Beside using a predefined value, one can also use a substitution with `{field}`, notice the curly brackets there. For example, ```python -docs.find({'text': {'$eq': '{tags__name}'}}) +r = da.find({'tags__h': {'$gt': '{tags__w}'}}) ``` -The above will return a `DocumentArray` in which each document has `doc.text == doc.tags['name']`. 
+```text +[{'id': '44c6a4b18eaa005c6dbe15a28a32ebce', + 'modality': 'A', + 'tags': {'h': 14.0, 'uom': 'cm', 'w': 10.0}, + 'text': 'journal', + 'weight': 25.0}] +``` -### Supported Operators -Note, that only the following MongoDB's query operators are supported: +## Combine multiple conditions -| Query Operator | Description | -|----------------|------------------------------------------------------------------------------------------------------------| -| `$eq` | Equal to (number, string) | -| `$ne` | Not equal to (number, string) | -| `$gt` | Greater than (number) | -| `$gte` | Greater than or equal to (number) | -| `$lt` | Less than (number) | -| `$lte` | Less than or equal to (number) | -| `$in` | Is in an array | -| `$nin` | Not in an array | -| `$regex` | Match the specified regular expression | -| `$size` | Match array/dict field that have the specified size. `$size` does not accept ranges of values. | -| `$exists` | Matches documents that have the specified field. And empty string content is also cosidered as not exists. 
| - -For boolean logic operators, only the following are supported: +You can combine multiple conditions using the following operators | Boolean Operator | Description | |------------------|----------------------------------------------------| @@ -131,3 +132,29 @@ For boolean logic operators, only the following are supported: | `$or` | Join query clauses with a logical OR | | `$not` | Inverts the effect of a query expression | + + +```python +r = da.find({'$or': [{'weight': {'$eq': 45}}, {'modality': {'$eq': 'D'}}]}) +``` + +```text +[{'id': '22985b71b6d483c31cbe507ed4d02bd1', + 'modality': 'D', + 'tags': {'h': 8.5, 'uom': 'in', 'w': 11.0}, + 'text': 'paper', + 'weight': 100.0}, + {'id': 'a071faf19feac5809642e3afcd3a5878', + 'modality': 'D', + 'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0}, + 'text': 'planner', + 'weight': 75.0}, + {'id': '411ecc70a71a3f00fc3259bf08c239d1', + 'modality': 'A', + 'tags': {'h': 10.0, 'uom': 'cm', 'w': 15.25}, + 'text': 'postcard', + 'weight': 45.0}] +``` + + +## Query nearest neighbours From 02df98e6f3b013b4bb5969713688a0a09ca44aa7 Mon Sep 17 00:00:00 2001 From: Han Xiao Date: Tue, 15 Mar 2022 13:31:41 +0100 Subject: [PATCH 4/4] docs(find): improve find docs --- docs/fundamentals/documentarray/find.md | 9 ++--- docs/fundamentals/documentarray/index.md | 2 +- docs/fundamentals/documentarray/matching.md | 41 ++++++++++++++++++++- tests/unit/array/test_lookup.py | 2 +- 4 files changed, 45 insertions(+), 9 deletions(-) diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md index f7d17ebec20..a64bf10a917 100644 --- a/docs/fundamentals/documentarray/find.md +++ b/docs/fundamentals/documentarray/find.md @@ -4,7 +4,7 @@ We can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based the conditions specified in a `query` object. 
One can use `da.find(query)` to filter Documents and get nearest neighbours from `da`: - To filter Documents, the `query` object is a Python dictionary object that defines the filtering conditions using a [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/)-like query language. -- To find nearest neighbours, the `query` object is a NdArray-like object that defines embedding. One can also use `.match()` function for this purpose, and there is a minor interface difference between these two functions. +- To find nearest neighbours, the `query` object needs to be an NdArray-like object, a Document, or a DocumentArray that defines the embedding. One can also use the `.match()` function for this purpose; there is a minor interface difference between the two functions, which is described {ref}`in the next chapter`. Let's see some examples in action. First, let's prepare a DocumentArray we will use. @@ -48,7 +48,9 @@ A query filter document can use the query operators to specify conditions in the Here `field1` is {ref}`any field name` of a Document object. To access nested fields, one can use the dunder expression. For example, `tags__timestamp` is to access `doc.tags['timestamp']` field. 
-`operator1` can be one of the following: +`value1` can be either a user given Python object, or a substitution field with curly bracket `{field}` + +Finally, `operator1` can be one of the following: | Query Operator | Description | |----------------|------------------------------------------------------------------------------------------------------------| @@ -155,6 +157,3 @@ r = da.find({'$or': [{'weight': {'$eq': 45}}, {'modality': {'$eq': 'D'}}]}) 'text': 'postcard', 'weight': 45.0}] ``` - - -## Query nearest neighbours diff --git a/docs/fundamentals/documentarray/index.md b/docs/fundamentals/documentarray/index.md index 9385f575e40..62f05a870da 100644 --- a/docs/fundamentals/documentarray/index.md +++ b/docs/fundamentals/documentarray/index.md @@ -36,8 +36,8 @@ serialization access-elements access-attributes embedding -matching find +matching evaluation parallelization visualization diff --git a/docs/fundamentals/documentarray/matching.md b/docs/fundamentals/documentarray/matching.md index 408ecf44d11..21ab3b1a3aa 100644 --- a/docs/fundamentals/documentarray/matching.md +++ b/docs/fundamentals/documentarray/matching.md @@ -3,10 +3,31 @@ ```{important} -{meth}`~docarray.array.mixins.match.MatchMixin.match` supports both CPU & GPU. +{meth}`~docarray.array.mixins.match.MatchMixin.match` and {meth}`~docarray.array.mixins.find.FindMixin.find` support both CPU & GPU. ``` -Once `.embeddings` is set, one can use {func}`~docarray.array.mixins.match.MatchMixin.match` function to find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings`. +Once `.embeddings` is set, one can use {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` function to find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings` and distance metrics. 
+ + +## Difference between find and match + +Though both `.find()` and `.match()` find the nearest neighbours of a given "query" and accept similar arguments, there are some differences between them: + +##### Which side is the query on? +- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents: to find one query's nearest neighbours, you should write `one_million_docs.find(query)`; +- `.match()` assumes the query is on the left-hand side. `A.match(B)` semantically means "A matches against B and saves the results to A". So with `.match()` you should write `query.match(one_million_docs)`. + +##### What is the type of the query? + - The query (RHS) in `.find()` can be a plain NdArray-like object, a single Document, or a DocumentArray. + - The query (LHS) in `.match()` can be either a Document or a DocumentArray. + +##### What is returned? + - `.find()` returns a list of DocumentArrays, each of which corresponds to one element/row in the query. + - `.match()` does not return anything. Match results are stored in the left-hand side's `.matches`. + +In the example below, we will use `.match()` to describe the feature. But keep in mind that `.find()` should always work by simply switching the right and left-hand sides. + +## Example The following example finds for each element in `da1` the three closest Documents from the elements in `da2` according to Euclidean distance. 
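The calling convention described above (query on the right for `.find()`, on the left for `.match()`, with results returned versus stored on the query) can be sketched without the library. This is a toy model only: dicts with an `embedding` key stand in for Documents, and `nearest`, `find` and `match` are hypothetical functions, not DocArray's API.

```python
import math


def nearest(query_emb, index, limit):
    """Return the `limit` index docs closest to query_emb (Euclidean distance)."""
    return sorted(index, key=lambda d: math.dist(query_emb, d['embedding']))[:limit]


def find(index, queries, limit=2):
    """find-style: called on the index, returns one result list per query."""
    return [nearest(q['embedding'], index, limit) for q in queries]


def match(queries, index, limit=2):
    """match-style: called on the queries, returns None, stores results on each query."""
    for q in queries:
        q['matches'] = nearest(q['embedding'], index, limit)


# Toy data; dicts stand in for Documents (an assumption of this sketch).
index = [
    {'id': 'a', 'embedding': (0.0, 0.0)},
    {'id': 'b', 'embedding': (1.0, 0.0)},
    {'id': 'c', 'embedding': (5.0, 5.0)},
]
queries = [{'id': 'q1', 'embedding': (0.9, 0.1)}]

found = find(index, queries)      # results come back as a return value
returned = match(queries, index)  # results land on queries[0]['matches']
```

Both calls compute the same neighbours; only the side the query sits on and the place the results end up differ.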
@@ -104,6 +125,22 @@ match emb = (0, 0) 1.0 ```` +The above example, written with `.find()`: + +```python +da2.find(da1, metric='euclidean', limit=3) +``` + +or simply: + +```python +da2.find(np.array( + [[0, 0, 0, 0, 1], + [1, 0, 0, 0, 0], + [1, 1, 1, 1, 0], + [1, 2, 2, 1, 0]]), metric='euclidean', limit=3) +``` + The following metrics are supported: | Metric | Frameworks | diff --git a/tests/unit/array/test_lookup.py b/tests/unit/array/test_lookup.py index b229a599eda..4b45d365954 100644 --- a/tests/unit/array/test_lookup.py +++ b/tests/unit/array/test_lookup.py @@ -11,7 +11,7 @@ def doc(): tags={ 'x': 0.1, 'y': 1.5, - 'z': 0, + 'z': 1, 'name': 'test', 'bar': '', 'labels': ['a', 'b', 'test'],
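The reworked `$exists` branch in `lookup.py` (patch 3/4) treats a nested, dunder-addressed value as "not existing" when it is falsy. The sketch below mirrors that logic on plain dicts so the behaviour is easy to check. It is an approximation, not the library code: `exists` is a hypothetical helper, and the real branch delegates top-level fields to `doc.non_empty_fields`, which is not modelled here.

```python
def exists(doc, key, expected):
    """Sketch of the `$exists` check for a dunder key such as 'tags__z'."""
    if not isinstance(expected, bool):
        raise ValueError('$exists operator can only accept True/False')
    # Walk the dunder path; a missing key resolves to None.
    value = doc
    for part in key.split('__'):
        value = value.get(part) if isinstance(value, dict) else None
    try:
        is_empty = not value
    except ValueError:
        # Array-like values may refuse truth testing; treat them as present.
        is_empty = False
    return is_empty != expected


# Plain dict standing in for a Document (an assumption of this sketch).
doc = {'tags': {'x': 0.1, 'z': 0, 'name': 'test', 'bar': ''}}

x_exists = exists(doc, 'tags__x', True)
bar_exists = exists(doc, 'tags__bar', True)
zero_exists = exists(doc, 'tags__z', True)
missing_absent = exists(doc, 'tags__nope', False)
```

One consequence visible in the sketch: a nested value of `0` is falsy, so under this logic it counts as empty, just like `''`.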