feat: support milvus by maxwelljin · Pull Request #1666 · docarray/docarray

maxwelljin · 2023-06-21T04:00:40Z

This PR aims to support Milvus as a storage backend

Signed-off-by: maxwelljin2 <[email protected]>

JohannesMessner

I think we are on a good path here! Just a few comments

JohannesMessner · 2023-06-26T15:45:44Z

docarray/index/backends/milvus.py

+
+    def _init_index(self) -> Collection:
+        if not utility.has_collection(self._db_config.collection_name):
+            print(self._db_config.collection_name)


Don't forget to remove the print

JohannesMessner · 2023-06-26T15:47:31Z

docarray/index/backends/milvus.py

+                        name="doc_id" if column_name == "id" else column_name,
+                        dtype=DataType.VARCHAR if column_name == "id" else info.db_type,
+                        is_primary=False,


Shouldn't `id` be completely skipped here, since it is already added to the schema above? Won't this add the id twice, once as `id` and once as `doc_id`?

Thanks for the feedback! I'll only keep doc_id
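A minimal sketch of the point above, assuming the key field is declared once up front and the per-column loop skips it. `Field` and `build_fields` are hypothetical stand-ins for illustration, not the pymilvus or docarray API, and the key name used here is only an example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Field:
    """Hypothetical stand-in for a field definition (not the pymilvus class)."""

    name: str
    dtype: str
    is_primary: bool = False


def build_fields(columns: Dict[str, str]) -> List[Field]:
    # The key field is declared exactly once, before the loop.
    fields = [Field(name='id', dtype='VARCHAR', is_primary=True)]
    for column_name, db_type in columns.items():
        if column_name == 'id':
            # Already declared above; skipping avoids a duplicate field.
            continue
        fields.append(Field(name=column_name, dtype=db_type))
    return fields


fields = build_fields(
    {'id': 'VARCHAR', 'text': 'VARCHAR', 'embedding': 'FLOAT_VECTOR'}
)
names = [f.name for f in fields]
```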

JohannesMessner · 2023-06-26T15:48:26Z

docarray/index/backends/milvus.py

+                name=self._db_config.collection_name,
+                schema=CollectionSchema(
+                    fields=fields,
+                    description="Collection Schema",


Perhaps the description could also be passed as a config?
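One way this could look, as a sketch only: the description becomes a config field with the current string as its default. `DBConfig`, `collection_description`, and `schema_kwargs` are hypothetical names chosen for illustration, not the names used in this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DBConfig:
    """Hypothetical config object; field names are illustrative only."""

    collection_name: str = 'my_collection'
    collection_description: str = 'Collection Schema'


def schema_kwargs(config: DBConfig, fields: List[Any]) -> Dict[str, Any]:
    # Read the description from the config instead of hard-coding it.
    return {'fields': fields, 'description': config.collection_description}


default_kwargs = schema_kwargs(DBConfig(), fields=[])
custom_kwargs = schema_kwargs(
    DBConfig(collection_description='Product embeddings'), fields=[]
)
```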

JohannesMessner · 2023-06-26T15:52:27Z

docarray/index/backends/milvus.py

+            f"Index '{self._db_config.index_name}' has been successfully created"
+        )
+
+    def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs):


What is the reason for implementing index() instead of implementing _index() and utilizing index() from the base class?

In my new implementation, the index() function also does the serialization (there's a column for serialized data), so index() itself needs to be overridden.
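For context, the pattern the question refers to can be sketched as follows: a public `index()` on the base class handles the shared steps and delegates to a private `_index()` hook that each backend implements. This is a simplified illustration with hypothetical classes (`BaseIndex`, `InMemoryIndex`), not docarray's actual base class.

```python
from typing import Any, Dict, List, Sequence, Union


class BaseIndex:
    """Hypothetical base class: the public method normalizes, then delegates."""

    def index(self, docs: Union[dict, Sequence[dict]]) -> None:
        # Shared step: accept a single document or a sequence of documents.
        docs_seq = [docs] if isinstance(docs, dict) else list(docs)
        # Shared step: transpose the documents into per-column lists.
        columns: Dict[str, List[Any]] = {}
        for doc in docs_seq:
            for key, value in doc.items():
                columns.setdefault(key, []).append(value)
        self._index(columns)

    def _index(self, column_to_data: Dict[str, List[Any]]) -> None:
        raise NotImplementedError


class InMemoryIndex(BaseIndex):
    """Backend that only implements the private hook."""

    def __init__(self) -> None:
        self.stored: Dict[str, List[Any]] = {}

    def _index(self, column_to_data: Dict[str, List[Any]]) -> None:
        for key, values in column_to_data.items():
            self.stored.setdefault(key, []).extend(values)


idx = InMemoryIndex()
idx.index({'id': 'a', 'x': 1})
idx.index([{'id': 'b', 'x': 2}])
```

In this sketch the hook only sees column data, so a backend that needs the whole document (for example, to fill a serialized-data column) would have to override the public method instead, which is the trade-off discussed in this thread.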

JohannesMessner · 2023-06-26T15:53:39Z

docarray/index/backends/milvus.py

+        The database can only store float vectors, so this method is used to convert
+        TensorFlow or PyTorch tensors to a format compatible with the database.


The base class should already take care of torch, tf etc conversions. Are you sure you need to do this again?

I'm not very sure how the base class handles conversions between different formats. The vector embedding is still stored in its raw format if we use the __getattr__ method.

I just found the base class's implementation. I'll remove my implementation here. Thanks for your suggestion!

JohannesMessner · 2023-06-26T15:56:06Z

docarray/index/backends/milvus.py

+        elif torch.is_tensor(column_value):
+            return column_value.float().numpy().tolist()
+        elif tf.is_tensor(column_value):


This is not safe, because not every user has torch and/or tf installed. Therefore we can't directly use torch/tf. Here you can see how that is handled in other places.
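A rough sketch of a guarded import, assuming availability is checked before the optional library is touched. `is_available` and `to_list` are hypothetical helpers written for this example; they stand in for, and are not, docarray's own `is_torch_available()` / `is_tf_available()` utilities. The TensorFlow branch assumes eager tensors.

```python
import importlib.util
from typing import Any, List


def is_available(module_name: str) -> bool:
    # find_spec looks the module up without importing it.
    return importlib.util.find_spec(module_name) is not None


torch_available = is_available('torch')
tf_available = is_available('tensorflow')

# Import the optional libraries only when they are installed.
if torch_available:
    import torch
if tf_available:
    import tensorflow as tf


def to_list(value: Any) -> List[Any]:
    # Each framework check is guarded, so a missing library is never referenced.
    if torch_available and torch.is_tensor(value):
        return value.detach().cpu().float().numpy().tolist()
    if tf_available and tf.is_tensor(value):
        # Assumption: eager tensor, which exposes .numpy().
        return value.numpy().tolist()
    if hasattr(value, 'tolist'):
        # Covers NumPy arrays and similar array types.
        return value.tolist()
    return list(value)


plain = to_list((1.0, 2.0, 3.0))
```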

JohannesMessner · 2023-06-26T15:56:55Z

docarray/index/backends/milvus.py

+            ],
+        )
+
+        return DocList[self._schema]([self._schema(**ret[i]) for i in range(len(ret))])


No need to cast the output to DocList, the base class does it for you

Signed-off-by: maxwelljin2 <[email protected]>

JoanFM · 2023-06-28T07:56:30Z

docarray/index/__init__.py

    elif name == 'WeaviateDocumentIndex':
        import_library('weaviate', raise_error=True)
        import docarray.index.backends.weaviate as lib
+    elif name == "MilvusDocumentIndex":


single quotes please

JoanFM · 2023-06-28T07:56:47Z

docarray/index/backends/milvus.py

+        utility,
+    )
+else:
+    hnswlib = import_library('hnswlib', raise_error=True)


why do we need this library?

JoanFM · 2023-06-28T07:57:12Z

docarray/index/backends/milvus.py

+        self._create_collection_name()
+        self._collection = self._init_index()
+        self._build_index()
+        self._logger.info(f"{self.__class__.__name__} has been initialized")


use single quotes

JoanFM · 2023-06-28T07:57:39Z

docarray/index/backends/milvus.py

+
+torch_available, tf_available = is_torch_available(), is_tf_available()
+
+if torch_available:


why do we need to import these?

That's because when we perform a vector search, it requires the input in Python list format. If the user's input tensor is in torch/tensorflow format, we need to perform a conversion.

JoanFM

Also make sure subindex is covered

Signed-off-by: maxwelljin2 <[email protected]>

jupyterjazz · 2023-07-02T19:17:19Z

Great work @maxwelljin, I'll take over from here!

feat: init commit for milvus storage backend

ad4b7d9

Signed-off-by: maxwelljin2 <[email protected]>

maxwelljin linked an issue Jun 21, 2023 that may be closed by this pull request

Feature: please add milvus as storage backend to docarray v2 #1549

Closed

maxwelljin added area/core size/m labels Jun 21, 2023

maxwelljin added 2 commits June 26, 2023 18:32

feat: milvus index & get items

888cb73

Signed-off-by: maxwelljin2 <[email protected]>

feat: milvus index & get items

a4cfe03

Signed-off-by: maxwelljin2 <[email protected]>

JohannesMessner requested changes Jun 26, 2023


maxwelljin added 3 commits June 27, 2023 18:11

feat: add vector search

b47924c

Signed-off-by: maxwelljin2 <[email protected]>

fix: release resources after loading

c444626

Signed-off-by: maxwelljin2 <[email protected]>

feat: add filters

87a75d4

Signed-off-by: maxwelljin2 <[email protected]>

JoanFM requested changes Jun 28, 2023


maxwelljin added 18 commits June 29, 2023 14:42

feat: batch find for mivlus backend

79ae266

Signed-off-by: maxwelljin2 <[email protected]>

test: find/batch find

7376b90

Signed-off-by: maxwelljin2 <[email protected]>

chore: single quote

8dde5a1

Signed-off-by: maxwelljin2 <[email protected]>

chore: single quote

e7f28fd

Signed-off-by: maxwelljin2 <[email protected]>

test: filter

f3c64a3

Signed-off-by: maxwelljin2 <[email protected]>

refractor: remove to_vector function

030dc0d

Signed-off-by: maxwelljin2 <[email protected]>

chore: add comments

0695a3b

Signed-off-by: maxwelljin2 <[email protected]>

chore: update lock

f23b181

Signed-off-by: maxwelljin2 <[email protected]>

fix: num of docs not precise

1b2e08c

Signed-off-by: maxwelljin2 <[email protected]>

test: add flat document test

06af471

Signed-off-by: maxwelljin2 <[email protected]>

feat: support multiple vectors

9b96009

Signed-off-by: maxwelljin2 <[email protected]>

feat: support multi-dimensional vector for milvus

7ab088b

Signed-off-by: maxwelljin2 <[email protected]>

feat: support nested doc

87f35ac

Signed-off-by: maxwelljin2 <[email protected]>

test: test index get/del

fc72ab0

Signed-off-by: maxwelljin2 <[email protected]>

chore: update lock

6e1fe26

Signed-off-by: maxwelljin2 <[email protected]>

chore: add yaml

b261801

Signed-off-by: maxwelljin2 <[email protected]>

chore: update dependency

7dc72a9

Signed-off-by: maxwelljin2 <[email protected]>

ci: change protobuf version to accomodate milvus

79152a3

Signed-off-by: maxwelljin2 <[email protected]>

jupyterjazz marked this pull request as ready for review July 2, 2023 19:16

Merge branch 'main' into feat-milvus

c512c59

jupyterjazz changed the base branch from main to feat-support-milvus July 2, 2023 19:17

Merge branch 'feat-support-milvus' into feat-milvus

c6ade33

jupyterjazz merged commit 0e24ce9 into docarray:feat-support-milvus Jul 2, 2023

jupyterjazz mentioned this pull request Jul 2, 2023

feat: support milvus #1681

Merged

6 tasks



4 participants
