feat: support milvus#1666
Conversation
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
JohannesMessner
left a comment
There was a problem hiding this comment.
I think we are on a good path here! Just a few comments
docarray/index/backends/milvus.py
Outdated
|
|
||
| def _init_index(self) -> Collection: | ||
| if not utility.has_collection(self._db_config.collection_name): | ||
| print(self._db_config.collection_name) |
There was a problem hiding this comment.
Don't forget to remove the print
docarray/index/backends/milvus.py
Outdated
| name="doc_id" if column_name == "id" else column_name, | ||
| dtype=DataType.VARCHAR if column_name == "id" else info.db_type, | ||
| is_primary=False, |
There was a problem hiding this comment.
shouldn't id``be completely skipped here since it is alrady added to the schema above? Won't this add the id twice, once as idand once asdoc_id`?
There was a problem hiding this comment.
Thanks for the feedback! I'll only keep doc_id
docarray/index/backends/milvus.py
Outdated
| name=self._db_config.collection_name, | ||
| schema=CollectionSchema( | ||
| fields=fields, | ||
| description="Collection Schema", |
There was a problem hiding this comment.
Perhaps the description could also be passes as a config?
| f"Index '{self._db_config.index_name}' has been successfully created" | ||
| ) | ||
|
|
||
| def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs): |
There was a problem hiding this comment.
What is the reason for implementing index() instead of implementing _index() and utilizing index() from the base class?
There was a problem hiding this comment.
In my new implementation, the index() function would also do the serialization (there's a column for serialized data), so the index function need to be utilized
docarray/index/backends/milvus.py
Outdated
| The database can only store float vectors, so this method is used to convert | ||
| TensorFlow or PyTorch tensors to a format compatible with the database. |
There was a problem hiding this comment.
The base class should already take care of torch, tf etc conversions. Are you sure you need to do this again?
There was a problem hiding this comment.
I'm not very sure how the base class handles conversions between different formats. The vector embedding is still stored in its raw format if we use the __getattr__ method.
There was a problem hiding this comment.
I just found base class' implementation. I'll remove my implementation here. Thanks for your suggestion!
docarray/index/backends/milvus.py
Outdated
| elif torch.is_tensor(column_value): | ||
| return column_value.float().numpy().tolist() | ||
| elif tf.is_tensor(column_value): |
There was a problem hiding this comment.
This is not safe, because not every user has torch and/or tf installed. Therefore we can't directly use torch/tf. Here you can see how that is handled in other places.
docarray/index/backends/milvus.py
Outdated
| ], | ||
| ) | ||
|
|
||
| return DocList[self._schema]([self._schema(**ret[i]) for i in range(len(ret))]) |
There was a problem hiding this comment.
No need to cast the output to DocList, the base class does it for you
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
docarray/index/__init__.py
Outdated
| elif name == 'WeaviateDocumentIndex': | ||
| import_library('weaviate', raise_error=True) | ||
| import docarray.index.backends.weaviate as lib | ||
| elif name == "MilvusDocumentIndex": |
docarray/index/backends/milvus.py
Outdated
| utility, | ||
| ) | ||
| else: | ||
| hnswlib = import_library('hnswlib', raise_error=True) |
docarray/index/backends/milvus.py
Outdated
| self._create_collection_name() | ||
| self._collection = self._init_index() | ||
| self._build_index() | ||
| self._logger.info(f"{self.__class__.__name__} has been initialized") |
docarray/index/backends/milvus.py
Outdated
|
|
||
| torch_available, tf_available = is_torch_available(), is_tf_available() | ||
|
|
||
| if torch_available: |
There was a problem hiding this comment.
That's because when we perform a vector search, it requires the input in Python list format. If the user's input tensor is in torch/tensorflow format, we need to perform a conversion.
JoanFM
left a comment
There was a problem hiding this comment.
Also make sure subindex is covered
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
|
Great work @maxwelljin, I'll take over from here! |
This PR aims to support Milvus as a storage backend