feat: qdrant document index by kacperlukawski · Pull Request #1321 · docarray/docarray

kacperlukawski · 2023-03-30T16:25:09Z

Goals:

This PR implements Qdrant as a document index.

Signed-off-by: Kacper Łukawski <[email protected]>

samsja · 2023-03-31T09:12:54Z

excited about this PR 🚀 🚀 🚀 🚀 😄

kacperlukawski · 2023-03-31T11:09:43Z

@JohannesMessner I have some doubts regarding the batched methods. Things are obvious for _find_batched, as any two documents will always be similar. But should we return scores as a single np.array in the case of _text_search_batched or _filter_batched? Each query may have a different number of results. Wouldn't it be better to return a list of np.array instances instead? So each of them might be of a different shape. WDYT?

JohannesMessner · 2023-03-31T11:14:09Z

> @JohannesMessner I have some doubts regarding the batched methods. Things are obvious for _find_batched, as any two documents will always be similar. But should we return scores as a single np.array in the case of _text_search_batched or _filter_batched? Each query may have a different number of results. Wouldn't it be better to return a list of np.array instances instead? So each of them might be of a different shape. WDYT?

This is a good point, you are right. Feel free to do it this way and change the type hint as needed! If this change requires some cleanup or adjustment in one of the other backends, we can take care of that afterwards.
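To make the shape problem concrete, here is a minimal sketch (not the code from this PR) of why a list of arrays fits batched text search and filtering better than one stacked array. The helper name `collect_batched_scores` is hypothetical and only illustrates the return type discussed above.

```python
from typing import List

import numpy as np


def collect_batched_scores(per_query_scores: List[List[float]]) -> List[np.ndarray]:
    """Keep one 1-D score array per query instead of stacking into a 2-D array."""
    return [np.asarray(scores, dtype=np.float32) for scores in per_query_scores]


# Three queries that returned 2, 0 and 3 hits respectively.
batched = collect_batched_scores([[0.9, 0.4], [], [1.0, 0.7, 0.2]])
shapes = [scores.shape for scores in batched]
```

Each entry keeps its own length, so a query with no hits simply yields an empty array rather than forcing padding.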

JohannesMessner

Looking good!
I would just like to see a few more tests:

* Testing the configuration, especially the users changing a configuration for a specific column. For example, they should be able to change the column type in the DB, by passing a col_type to Field(). A possible scenario would be that someone wants a vector to not be indexed, so they want to store it as payload instead of vector. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_column_config.py
* Something with tensorflow (sorry if I missed it), this should work out of the box, but let's make sure that there aren't any issues. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_find.py#L140
* Something that uses our built-in Documents, they have a lot of unions and optionals in their schema, so let's just double check that everything is good there. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_index_get_del.py#L270

JohannesMessner · 2023-04-05T06:53:05Z

docarray/index/backends/qdrant.py

+        url: Optional[str] = None
+        port: Optional[int] = 6333
+        grpc_port: int = 6334
+        prefer_grpc: bool = True
+        https: Optional[bool] = None
+        api_key: Optional[str] = None
+        prefix: Optional[str] = None
+        timeout: Optional[float] = None
+        host: Optional[str] = None
+        collection_name: str = 'documents'
+        shard_number: Optional[int] = None
+        replication_factor: Optional[int] = None
+        write_consistency_factor: Optional[int] = None
+        on_disk_payload: Optional[bool] = None
+        hnsw_config: Optional[types.HnswConfigDiff] = None
+        optimizers_config: Optional[types.OptimizersConfigDiff] = None
+        wal_config: Optional[types.WalConfigDiff] = None
+        quantization_config: Optional[types.QuantizationConfig] = None


Are any of these changeable on the fly on a running instance? If so it would be nice to move them to the RuntimeConfig; otherwise all good!
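For illustration only, a sketch of what such a split could look like with plain dataclasses. Which Qdrant options can really be changed on a running instance is the open question here, so placing `timeout` in the runtime part is an assumption, and both classes are simplified stand-ins rather than the actual DocArray config classes.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DBConfig:
    """Settings fixed when the index is created (hypothetical split)."""

    host: Optional[str] = None
    port: Optional[int] = 6333
    collection_name: str = 'documents'


@dataclass
class RuntimeConfig:
    """Settings that could be changed on a live index (assumed, not confirmed)."""

    timeout: Optional[float] = None


db_config = DBConfig(host='localhost')
runtime_config = RuntimeConfig()
runtime_config.timeout = 5.0  # runtime values may be adjusted in place

# A frozen DBConfig cannot be mutated; a changed value means a new config object.
other_db_config = replace(db_config, collection_name='articles')
```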

JohannesMessner · 2023-04-05T06:56:17Z

docarray/index/backends/qdrant.py

+        # Qdrant does not return any scores if we just filter the objects, without using
+        # semantic search over vectors. Thus, each document is scored with a value of 1


We will start with documentation this week. Once that is on its way, let's make sure we include stuff like this in the Qdrant section.
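For the docs, the behaviour described in that code comment could be shown with something like the sketch below. The function name `score_filter_results` and the dict-based documents are hypothetical; only the constant score of 1 comes from the comment in the diff.

```python
from typing import List, Tuple

import numpy as np


def score_filter_results(docs: List[dict]) -> Tuple[List[dict], np.ndarray]:
    # A pure filter computes no similarity, so every match gets a score of 1.
    return docs, np.ones(len(docs), dtype=np.float32)


docs, scores = score_filter_results([{'id': 'a'}, {'id': 'b'}, {'id': 'c'}])
```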

JohannesMessner · 2023-04-05T07:09:22Z

docarray/index/backends/qdrant.py

+                'id': {},
+                'vector': {},
+                'payload': {},
+                np.ndarray: {},


Is np.ndarray really needed here?

I reused the following test from hnswlib:

def test_schema_with_user_defined_mapping(tmp_path):
    class MyDoc(BaseDoc):
        tens: NdArray[10] = Field(dim=1000, col_type=np.ndarray)

    store = HnswDocumentIndex[MyDoc](work_dir=str(tmp_path))
    assert store._column_infos['tens'].db_type == np.ndarray

When the col_type is defined for a field, it's passed directly without calling python_type_to_db_type. I wondered how to overcome it but couldn't find a better way. How should I handle that?

I believe passing ndarray as a column type only really makes sense for hnswlib, not for Qdrant. The user should pass a column type that is a valid type in Qdrant, i.e. one of the ones that you define in your mapping ('vector', 'payload', etc.). No special handling should be necessary.
Edit: just saw your comment below, if defining a column type makes no difference anyways, then this mechanism can be omitted altogether I would say

Done, the np.array is not here anymore. The three types are left though, as we need to differentiate the behaviour for ids, vectors and metadata attributes (those are stored differently).
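A rough sketch of how the three remaining column kinds might be told apart. This is not the mapping from the PR: the function `python_type_to_column_kind` and its rules (the `id` field name, any `np.ndarray` subclass counting as a vector) are assumptions made for the example.

```python
import numpy as np


def python_type_to_column_kind(field_name: str, python_type: type) -> str:
    """Hypothetical mapping that returns 'id', 'vector' or 'payload'."""
    if field_name == 'id':
        return 'id'
    if isinstance(python_type, type) and issubclass(python_type, np.ndarray):
        return 'vector'
    # Everything else is stored as a metadata attribute.
    return 'payload'


schema = {'id': str, 'embedding': np.ndarray, 'title': str}
kinds = {name: python_type_to_column_kind(name, tp) for name, tp in schema.items()}
```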

kacperlukawski · 2023-04-06T09:38:31Z

@JohannesMessner Thanks for the comments - I'll implement the changes soon! I just added the QueryBuilder, so functional-wise, we're done.

kacperlukawski · 2023-04-06T14:16:45Z

> * Testing the configuration, especially the users changing a configuration for a specific column. For example, they should be able to change the column type in the DB, by passing a `col_type` to `Field()`. A possible scenario would be that someone wants a vector to not be indexed, so they want to store it as `payload` instead of `vector`. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_column_config.py

Qdrant is schemaless, so it doesn't really matter what col_type will be defined. We store the data as JSON, so that makes no difference. I'll review the ES7 approach, but I'm not sure if that also applies to Qdrant.

> * Something with tensorflow (sorry if I missed it), this should work out of the box, but let's make sure that there aren't any issues. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_find.py#L140

No, it was not tested yet, but I'll extend the suite. I just didn't find Tensorflow in the dependencies. Shouldn't it be defined in pyproject.toml as well?

> * Something that uses our built-in Documents, they have a lot of unions and optionals in their schema, so let's just double check that everything is good there. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_index_get_del.py#L270

Sure, I'll cover those scenarios as well. Thanks for the references!

JohannesMessner · 2023-04-11T08:36:42Z

> No, it was not tested yet, but I'll extend the suite. I just didn't find Tensorflow in the dependencies. Shouldn't it be defined in pyproject.toml as well?

There are some issues with tf and the protobuf version that it is pinned to, so we don't have it in our toml file, and there is no need to add it. You can just add the tests and mark them as tensorflow using the pytest marker as shown here; our CI should then install tf and run the tests appropriately.
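A minimal sketch of such a marked test, assuming the marker is called `tensorflow` and that the tests are selected with something like `pytest -m tensorflow`. The test name and body are made up for the example and do not touch the Qdrant index.

```python
import pytest


@pytest.mark.tensorflow  # assumed marker name
def test_find_with_tensorflow_embeddings():
    # Skip cleanly on machines where TensorFlow is not installed.
    tf = pytest.importorskip('tensorflow')
    query = tf.zeros(10)
    assert tuple(query.shape) == (10,)
```

With this in place, a plain `pytest` run without TensorFlow reports the test as skipped instead of failing at import time.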

JohannesMessner · 2023-04-11T08:50:48Z

docarray/index/backends/qdrant.py

+        )
+        return [self._convert_to_doc(point) for point in response]
+
+    def execute_query(self, query: Query, *args, **kwargs) -> DocList:


here the users should also be able to pass in a raw query that does not come from our query builder, i.e. a python dict or string like this (copied from your docs):

{
  "filter": {
    "must": [
      { "key": "city", "match": { "value": "London" } }
    ]
  },
  "params": { "hnsw_ef": 128, "exact": false },
  "vector": [0.2, 0.1, 0.9, 0.7],
  "limit": 3
}

This is meant as a fallback option for any functionality that may not be covered by the DocArray API, we don't want to lock away Qdrant functionality from our users. Is that possible without too much hassle?

I'm still working on that one - I'd like to avoid sending raw HTTP requests, and we need to do some conversion.

@JohannesMessner This is already implemented, so raw queries might be passed to .execute_query
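As a sketch of the kind of conversion mentioned above, the snippet below splits a raw query dict into typed parts that a client call could consume, instead of sending raw HTTP. `RawSearch` and `parse_raw_query` are hypothetical names and the default limit of 10 is an assumption; the actual implementation in `.execute_query` may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RawSearch:
    """Hypothetical container for the parts of a raw Qdrant search body."""

    vector: Optional[List[float]] = None
    filter: Optional[Dict[str, Any]] = None
    params: Dict[str, Any] = field(default_factory=dict)
    limit: int = 10  # assumed default for this sketch


def parse_raw_query(raw: Dict[str, Any]) -> RawSearch:
    # Reject keys this sketch does not know how to forward.
    unknown = set(raw) - {'vector', 'filter', 'params', 'limit'}
    if unknown:
        raise ValueError(f'Unsupported keys in raw query: {sorted(unknown)}')
    return RawSearch(
        vector=raw.get('vector'),
        filter=raw.get('filter'),
        params=raw.get('params', {}),
        limit=raw.get('limit', 10),
    )


raw_query = {
    'filter': {'must': [{'key': 'city', 'match': {'value': 'London'}}]},
    'params': {'hnsw_ef': 128, 'exact': False},
    'vector': [0.2, 0.1, 0.9, 0.7],
    'limit': 3,
}
parsed = parse_raw_query(raw_query)
```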

kacperlukawski · 2023-04-13T14:29:25Z

Should I reformat the files with black and commit? Those don't seem related, as none of the problematic files was changed in that PR.

samsja · 2023-04-13T14:47:46Z

> Those don't seem related

You can run black -S (skip string normalization).

kacperlukawski · 2023-04-14T07:38:02Z

@samsja @JohannesMessner Already done - I'd be grateful for a review ;)

JohannesMessner · 2023-04-14T08:08:07Z

> @samsja @JohannesMessner Already done - I'd be grateful for a review ;)

Awesome! I think we were in a good place already, but making another pass now

samsja · 2023-04-14T08:18:07Z

lgtm. I will let @JohannesMessner do the final approve as I was less involved on this PR

samsja · 2023-04-14T08:18:18Z

But good job ! 🚀

JohannesMessner

Nice job, thanks a ton for your contribution!! ❤️

kacperlukawski changed the base branch from main to feat-rewrite-v2 March 30, 2023 16:25

kacperlukawski mentioned this pull request Mar 30, 2023

DocumentIndex: support for Qdrant #1211

Closed

Initial implementation of Qdrant document index

54cd27b

Signed-off-by: Kacper Łukawski <[email protected]>

kacperlukawski force-pushed the feat-rewrite-qdrant branch from 9ff1e5c to 54cd27b Compare March 31, 2023 08:32

kacperlukawski added 2 commits March 31, 2023 10:34

Initial implementation of Qdrant document index

ab421d7

Signed-off-by: Kacper Łukawski <[email protected]>

Merge remote-tracking branch 'origin/feat-rewrite-qdrant' into feat-r…

d594e67

…ewrite-qdrant # Conflicts: # docarray/index/__init__.py

kacperlukawski added 6 commits March 31, 2023 13:29

Update poetry.lock

4e5cc88

Signed-off-by: Kacper Łukawski <[email protected]>

Initial implementation of _filter and _text_search, also with batched…

712dccf

… versions Signed-off-by: Kacper Łukawski <[email protected]>

Return separate scores from batched text search requests

0a20122

Signed-off-by: Kacper Łukawski <[email protected]>

Return separate scores from batched find requests

2bd8c5e

Signed-off-by: Kacper Łukawski <[email protected]>

Add empty test_query_builder.py for Qdrant

7dfd201

Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

c34c086

# Conflicts: # docarray/index/abstract.py # poetry.lock

JohannesMessner requested changes Apr 5, 2023

kacperlukawski added 2 commits April 6, 2023 11:33

Upgrade Qdrant to 1.1.1

bb6bc01

Signed-off-by: Kacper Łukawski <[email protected]>

Implement QueryBuilder

0aaaf70

Signed-off-by: Kacper Łukawski <[email protected]>

kacperlukawski added 2 commits April 6, 2023 16:07

Add tensorflow tests

dce78c7

Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

dc0eb98

kacperlukawski added 2 commits April 6, 2023 18:25

Supported optional vectors

30399e5

Signed-off-by: Kacper Łukawski <[email protected]>

Fix mypy and formatting

20f8fc9

Signed-off-by: Kacper Łukawski <[email protected]>

JohannesMessner requested changes Apr 11, 2023

kacperlukawski and others added 4 commits April 12, 2023 21:39

Update docarray/index/backends/qdrant.py

1ec2123

Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

d8823af

# Conflicts: # poetry.lock

Remove the test with custom type (np.array)

5595f2e

Signed-off-by: Kacper Łukawski <[email protected]>

Update Qdrant to 1.1.4

346165e

Signed-off-by: Kacper Łukawski <[email protected]>

kacperlukawski and others added 7 commits April 13, 2023 10:48

Refactor tests

5591e2d

Signed-off-by: Kacper Łukawski <[email protected]>

Merge remote-tracking branch 'origin/feat-rewrite-qdrant' into feat-r…

b7f3be5

…ewrite-qdrant

WIP: Raw query execution

4038e57

Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

1a3f592

# Conflicts: # docarray/index/__init__.py # poetry.lock # pyproject.toml

Add raw Qdrant query support in .execute_query

ca1e3a8

Signed-off-by: Kacper Łukawski <[email protected]>

Switch to local mode in Qdrant tests

939503f

Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

7117110

kacperlukawski marked this pull request as ready for review April 13, 2023 12:51

kacperlukawski requested a review from JohannesMessner April 13, 2023 12:51

kacperlukawski added 4 commits April 14, 2023 09:17

Code formatting with black

9f11d34

Signed-off-by: Kacper Łukawski <[email protected]>

Merge remote-tracking branch 'origin/feat-rewrite-qdrant' into feat-r…

8c099dd

…ewrite-qdrant

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

2b7c364

# Conflicts: # poetry.lock

Update poetry.lock

a658b55

Signed-off-by: Kacper Łukawski <[email protected]>

Merge branch 'feat-rewrite-v2' into feat-rewrite-qdrant

d859fd2

JohannesMessner approved these changes Apr 14, 2023

kacperlukawski merged commit 2ea0acd into docarray:feat-rewrite-v2 Apr 14, 2023
