feat: qdrant document index#1321
Conversation
Signed-off-by: Kacper Łukawski <[email protected]>
9ff1e5c to
54cd27b
Compare
Signed-off-by: Kacper Łukawski <[email protected]>
…ewrite-qdrant # Conflicts: # docarray/index/__init__.py
|
excited about this PR 🚀 🚀 🚀 🚀 😄 |
|
@JohannesMessner I have some doubts regarding the batched methods. Things are obvious for |
This is a good point, you are right. Feel free to do it this way and change the type hint as needed! If this change requires some cleanup/adjustment in one of the other backends we can take care of that afterwards |
Signed-off-by: Kacper Łukawski <[email protected]>
… versions Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
# Conflicts: # docarray/index/abstract.py # poetry.lock
JohannesMessner
left a comment
There was a problem hiding this comment.
Looking good!
I would just like to see a few more tests:
- Testing the configuration, especially the users changing a configuration for a specific column. For example, they should be able to change the column type in the DB, by passing a
col_typetoField(). A possible scenario would be that someone wants a vector to not be indexed, so they want to store it aspayloadinstead ofvector. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_column_config.py - Something with tensorflow (sorry if I missed it), this should work out of the box, but let's make sure that there arent' any issues. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_find.py#L140
- Something that uses our built-in Documents, they have a lot of unions and optionals in their schema, so let's just double check that everything is good there. E.g.: https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/doc_index/elastic/v7/test_index_get_del.py#L270
| url: Optional[str] = None | ||
| port: Optional[int] = 6333 | ||
| grpc_port: int = 6334 | ||
| prefer_grpc: bool = True | ||
| https: Optional[bool] = None | ||
| api_key: Optional[str] = None | ||
| prefix: Optional[str] = None | ||
| timeout: Optional[float] = None | ||
| host: Optional[str] = None | ||
| collection_name: str = 'documents' | ||
| shard_number: Optional[int] = None | ||
| replication_factor: Optional[int] = None | ||
| write_consistency_factor: Optional[int] = None | ||
| on_disk_payload: Optional[bool] = None | ||
| hnsw_config: Optional[types.HnswConfigDiff] = None | ||
| optimizers_config: Optional[types.OptimizersConfigDiff] = None | ||
| wal_config: Optional[types.WalConfigDiff] = None | ||
| quantization_config: Optional[types.QuantizationConfig] = None |
There was a problem hiding this comment.
Are any of these changeable on the fly on a running instance? If so it would be nice to move them to the RuntimeConfig; otherwise all good!
| # Qdrant does not return any scores if we just filter the objects, without using | ||
| # semantic search over vectors. Thus, each document is scored with a value of 1 |
There was a problem hiding this comment.
We will start with documentation this week, once that is on its way, let's sure we include stuff like this in the Qdrant section
docarray/index/backends/qdrant.py
Outdated
| 'id': {}, | ||
| 'vector': {}, | ||
| 'payload': {}, | ||
| np.ndarray: {}, |
There was a problem hiding this comment.
Is np.ndarray really needed here?
There was a problem hiding this comment.
I reused the following test from hnswlib:
def test_schema_with_user_defined_mapping(tmp_path):
class MyDoc(BaseDoc):
tens: NdArray[10] = Field(dim=1000, col_type=np.ndarray)
store = HnswDocumentIndex[MyDoc](work_dir=str(tmp_path))
assert store._column_infos['tens'].db_type == np.ndarray
When the col_type is defined for a field, it's passed directly without calling python_type_to_db_type. I wondered how to overcome it but couldn't find a better way. How should I handle that?
There was a problem hiding this comment.
I believe passing ndarray as a column type only really makes sense for hnswlib, not for Qdrant. The user should pass a column type that is a valid type in Qdrant, i.e. one of the ones that you define in your mapping ('vector', 'payload', etc.). No special handling should be necessary.
Edit: just saw your comment below, if defining a column type makes no difference anyways, then this mechanism can be omitted altogether I would say
There was a problem hiding this comment.
Done, the np.array is not here anymore. The three types are left though, as we need to differentiate the behaviour for ids, vectors and metadata attributes (those are stored differently).
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
|
@JohannesMessner Thanks for the comments - I'll implement the changes soon! I just added the |
Signed-off-by: Kacper Łukawski <[email protected]>
Qdrant is schemaless, so it doesn't really matter what
No, it was not tested yet, but I'll extend the suite. I just didn't find Tensorflow in the dependencies. Shouldn't it be defined in
Sure, I'll cover those scenarios as well. Thanks for the references! |
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
There are some issues with tf and the protobuf version that it is pinned to, so we don't have it in our toml file, and no need to add it. you can just add the tests and mark them as tensorflow using the pytest marker as shown here, then our CI should install tf and run the test appropriately. |
docarray/index/backends/qdrant.py
Outdated
| ) | ||
| return [self._convert_to_doc(point) for point in response] | ||
|
|
||
| def execute_query(self, query: Query, *args, **kwargs) -> DocList: |
There was a problem hiding this comment.
here the users should also be able to pass in a raw query that does not come from our query builder, i.e. a python dict or string like this (copied from your docs):
{
"filter": {
"must": [
{
"key": "city",
"match": {
"value": "London"
}
}
]
},
"params": {
"hnsw_ef": 128,
"exact": false
},
"vector": [0.2, 0.1, 0.9, 0.7],
"limit": 3
}This is meant as a fallback option for any functionality that may not be covered by the DocArray API, we don't want to lock away Qdrant functionality from our users. Is that possible without too much hassle?
There was a problem hiding this comment.
I'm still working on that one - I'd like to avoid sending raw HTTP requests, and we need to do some conversion.
There was a problem hiding this comment.
@JohannesMessner This is already implemented, so raw queries might be passed to .execute_query
Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Kacper Łukawski <[email protected]>
# Conflicts: # poetry.lock
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
# Conflicts: # docarray/index/__init__.py # poetry.lock # pyproject.toml
Signed-off-by: Kacper Łukawski <[email protected]>
Signed-off-by: Kacper Łukawski <[email protected]>
|
Should I reformat the files with black and commit? Those don't seem related, as none of the problematic files was changed in that PR. |
you can do |
Signed-off-by: Kacper Łukawski <[email protected]>
# Conflicts: # poetry.lock
Signed-off-by: Kacper Łukawski <[email protected]>
|
@samsja @JohannesMessner Already done - I'd be grateful for a review ;) |
Awesome! I think we were in a good place already, but making another pass now |
|
lgtm. I will let @JohannesMessner do the final approve as I was less involved on this PR |
|
But good job ! 🚀 |
JohannesMessner
left a comment
There was a problem hiding this comment.
Nice job, thanks a ton for your contribution!! ❤️
Goals:
This PR implements Qdrant as a document index.