Skip to content

feat: qdrant document index#1321

Merged
kacperlukawski merged 31 commits intodocarray:feat-rewrite-v2from
kacperlukawski:feat-rewrite-qdrant
Apr 14, 2023
Merged

feat: qdrant document index#1321
kacperlukawski merged 31 commits intodocarray:feat-rewrite-v2from
kacperlukawski:feat-rewrite-qdrant

Conversation

@kacperlukawski
Copy link
Copy Markdown
Contributor

@kacperlukawski kacperlukawski commented Mar 30, 2023

Goals:

This PR implements Qdrant as a document index.

  • init and build index
  • find
  • find batched
  • filter
  • filter batched
  • text search
  • text search batched
  • num docs
  • get items
  • delete items
  • query builder
  • execute query

@kacperlukawski kacperlukawski changed the base branch from main to feat-rewrite-v2 March 30, 2023 16:25
@samsja
Copy link
Copy Markdown
Member

samsja commented Mar 31, 2023

excited about this PR 🚀 🚀 🚀 🚀 😄

@kacperlukawski
Copy link
Copy Markdown
Contributor Author

@JohannesMessner I have some doubts regarding the batched methods. Things are obvious for _find_batched, as any two documents will always be similar. But should we return scores as a single np.array in the case of _text_search_batched or _filter_batched? Each query may have a different number of results. Wouldn't it be better to return a list of np.array instances instead? So each of them might be of a different shape. WDYT?

@JohannesMessner
Copy link
Copy Markdown
Member

@JohannesMessner I have some doubts regarding the batched methods. Things are obvious for _find_batched, as any two documents will always be similar. But should we return scores as a single np.array in the case of _text_search_batched or _filter_batched? Each query may have a different number of results. Wouldn't it be better to return a list of np.array instances instead? So each of them might be of a different shape. WDYT?

This is a good point, you are right. Feel free to do it this way and change the type hint as needed! If this change requires some cleanup/adjustment in one of the other backends we can take care of that afterwards

Copy link
Copy Markdown
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!
I would just like to see a few more tests:

Comment on lines +71 to +88
url: Optional[str] = None
port: Optional[int] = 6333
grpc_port: int = 6334
prefer_grpc: bool = True
https: Optional[bool] = None
api_key: Optional[str] = None
prefix: Optional[str] = None
timeout: Optional[float] = None
host: Optional[str] = None
collection_name: str = 'documents'
shard_number: Optional[int] = None
replication_factor: Optional[int] = None
write_consistency_factor: Optional[int] = None
on_disk_payload: Optional[bool] = None
hnsw_config: Optional[types.HnswConfigDiff] = None
optimizers_config: Optional[types.OptimizersConfigDiff] = None
wal_config: Optional[types.WalConfigDiff] = None
quantization_config: Optional[types.QuantizationConfig] = None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these changeable on the fly on a running instance? If so it would be nice to move them to the RuntimeConfig; otherwise all good!

Comment on lines +275 to +276
# Qdrant does not return any scores if we just filter the objects, without using
# semantic search over vectors. Thus, each document is scored with a value of 1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will start with documentation this week, once that is on its way, let's sure we include stuff like this in the Qdrant section

'id': {},
'vector': {},
'payload': {},
np.ndarray: {},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is np.ndarray really needed here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reused the following test from hnswlib:

def test_schema_with_user_defined_mapping(tmp_path):
    class MyDoc(BaseDoc):
        tens: NdArray[10] = Field(dim=1000, col_type=np.ndarray)

    store = HnswDocumentIndex[MyDoc](work_dir=str(tmp_path))
    assert store._column_infos['tens'].db_type == np.ndarray

When the col_type is defined for a field, it's passed directly without calling python_type_to_db_type. I wondered how to overcome it but couldn't find a better way. How should I handle that?

Copy link
Copy Markdown
Member

@JohannesMessner JohannesMessner Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe passing ndarray as a column type only really makes sense for hnswlib, not for Qdrant. The user should pass a column type that is a valid type in Qdrant, i.e. one of the ones that you define in your mapping ('vector', 'payload', etc.). No special handling should be necessary.
Edit: just saw your comment below, if defining a column type makes no difference anyways, then this mechanism can be omitted altogether I would say

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, the np.array is not here anymore. The three types are left though, as we need to differentiate the behaviour for ids, vectors and metadata attributes (those are stored differently).

Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
@kacperlukawski
Copy link
Copy Markdown
Contributor Author

@JohannesMessner Thanks for the comments - I'll implement the changes soon! I just added the QueryBuilder, so functional-wise, we're done.

@kacperlukawski
Copy link
Copy Markdown
Contributor Author

kacperlukawski commented Apr 6, 2023

* Testing the configuration, especially the users changing a configuration for a specific column. For example, they should be able to change the column type in the DB, by passing a `col_type` to `Field()`. A possible scenario would be that someone wants a vector to not be indexed, so they want to store it as `payload` instead of `vector`. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_column_config.py

Qdrant is schemaless, so it doesn't really matter what col_type will be defined. We store the data as JSON, so that makes no difference. I'll review the ES7 approach, but I'm not sure if that also applies to Qdrant.

* Something with tensorflow (sorry if I missed it), this should work out of the box, but let's make sure that there arent' any issues. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_find.py#L140

No, it was not tested yet, but I'll extend the suite. I just didn't find Tensorflow in the dependencies. Shouldn't it be defined in pyproject.toml as well?

* Something that uses our built-in Documents, they have a lot of unions and optionals in their schema, so let's just double check that everything is good there. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_index_get_del.py#L270

Sure, I'll cover those scenarios as well. Thanks for the references!

Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
@JohannesMessner
Copy link
Copy Markdown
Member

JohannesMessner commented Apr 11, 2023

No, it was not tested yet, but I'll extend the suite. I just didn't find Tensorflow in the dependencies. Shouldn't it be defined in pyproject.toml as well?

There are some issues with tf and the protobuf version that it is pinned to, so we don't have it in our toml file, and no need to add it. you can just add the tests and mark them as tensorflow using the pytest marker as shown here, then our CI should install tf and run the test appropriately.

)
return [self._convert_to_doc(point) for point in response]

def execute_query(self, query: Query, *args, **kwargs) -> DocList:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the users should also be able to pass in a raw query that does not come from our query builder, i.e. a python dict or string like this (copied from your docs):

{
    "filter": {
        "must": [
            {
                "key": "city",
                "match": {
                    "value": "London"
                }
            }
        ]
    },
    "params": {
        "hnsw_ef": 128,
        "exact": false
    },
    "vector": [0.2, 0.1, 0.9, 0.7],
    "limit": 3
}

This is meant as a fallback option for any functionality that may not be covered by the DocArray API, we don't want to lock away Qdrant functionality from our users. Is that possible without too much hassle?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working on that one - I'd like to avoid sending raw HTTP requests, and we need to do some conversion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JohannesMessner This is already implemented, so raw queries might be passed to .execute_query

kacperlukawski and others added 4 commits April 12, 2023 21:39
Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
@kacperlukawski kacperlukawski marked this pull request as ready for review April 13, 2023 12:51
@kacperlukawski
Copy link
Copy Markdown
Contributor Author

Should I reformat the files with black and commit? Those don't seem related, as none of the problematic files was changed in that PR.

@samsja
Copy link
Copy Markdown
Member

samsja commented Apr 13, 2023

Those don't seem related

you can do black -s

@kacperlukawski
Copy link
Copy Markdown
Contributor Author

kacperlukawski commented Apr 14, 2023

@samsja @JohannesMessner Already done - I'd be grateful for a review ;)

@JohannesMessner
Copy link
Copy Markdown
Member

@samsja @JohannesMessner Already done - I'd be grateful for a review ;)

Awesome! I think we were in a good place already, but making another pass now

@samsja
Copy link
Copy Markdown
Member

samsja commented Apr 14, 2023

lgtm. I will let @JohannesMessner do the final approve as I was less involved on this PR

@samsja
Copy link
Copy Markdown
Member

samsja commented Apr 14, 2023

But good job ! 🚀

Copy link
Copy Markdown
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, thanks a ton for your contribution!! ❤️

@kacperlukawski kacperlukawski merged commit 2ea0acd into docarray:feat-rewrite-v2 Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants