From 62da10db158499fd4df346e8abc33ada634d9183 Mon Sep 17 00:00:00 2001 From: Johannes Messner Date: Mon, 24 Apr 2023 11:47:56 +0200 Subject: [PATCH 1/3] docs: index predefined documents Signed-off-by: Johannes Messner --- docs/user_guide/storing/docindex.md | 75 +++++++++++++++++++++++++++-- 1 file changed, 70 insertions(+), 5 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 91c29197e5f..5082a6ac1c6 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -77,7 +77,7 @@ class MyDoc(BaseDoc): db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db') ``` -**Schema definition:** +### Schema definition In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`. The Document Index then _creates a column for each field in `MyDoc`_. @@ -93,6 +93,69 @@ the database will store vectors with 128 dimensions. Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! + +### Using a predefined Document as schema + +DocArray offers a number of predefined Documents, like [ImageDoce][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: +Depending on the backend, and exception will be raised, or no vector index for ANN lookup will be built. + +The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` +field. But this is crucial information for any vector database to work properly! 
+ +You can work around this problem by subclassing the predefined Document and adding the dimensionality information: + +=== "Using type hint" + ```python + from docarray.documents import TextDoc + from docarray.typing import NdArray + from docarray.index import HnswDocumentIndex + + + class MyDoc(TextDoc): + embedding: NdArray[128] + + + db = HnswDocumentIndex[MyDoc](work_dir='test_db') + ``` + +=== "Using Field()" + ```python + from docarray.documents import TextDoc + from docarray.typing import AnyTensor + from docarray.index import HnswDocumentIndex + from pydantic import Field + + + class MyDoc(TextDoc): + embedding: AnyTensor = Field(n_dim=128) + + + db = HnswDocumentIndex[MyDoc](work_dir='test_db3') + ``` + +Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the +predefined Document type, or of your custom Document type. + +The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`: + +```python +from docarray import DocList + +# data of type TextDoc +data = DocList[TextDoc]( + [ + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + TextDoc(text='hello world', embedding=np.random.rand(128)), + ] +) + +# you can index this into Document Index of type MyDoc +db.index(data) +``` + + **Database location:** For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you @@ -136,6 +199,8 @@ need to have compatible schemas. - A and B have the same field names and field types - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A + In particular this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. 
+ ## Vector similarity search Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. @@ -183,7 +248,7 @@ matching documents and their associated similarity scores. How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations). -**Batched search:** +### Batched search You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method. @@ -319,7 +384,7 @@ The specific configurations that you can tweak depend on the backend, but the in Document Indexes differentiate between three different kind of configurations: -**Database configurations** +### Database configurations _Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column), and that you _don't_ dynamically change at runtime. @@ -371,7 +436,7 @@ You can customize every field in this configuration: # > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db') ``` -**Runtime configurations** +### Runtime configurations _Runtime configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column), and that you can dynamically change at runtime. @@ -459,7 +524,7 @@ You can customize every field in this configuration using the [configure()][doca After this change, the new setting will be applied to _every_ column that corresponds to a `np.ndarray` type. -**Column configurations** +### Column configurations For many vector databases, individual columns can have different configurations. 
From 487b5ec1d055d93e809ee60a105ede28a44f0f06 Mon Sep 17 00:00:00 2001 From: Johannes Messner Date: Mon, 24 Apr 2023 12:15:49 +0200 Subject: [PATCH 2/3] docs: fix typos Signed-off-by: Johannes Messner --- docs/user_guide/storing/docindex.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 5082a6ac1c6..f6ffab33a19 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -96,9 +96,9 @@ the database will store vectors with 128 dimensions. ### Using a predefined Document as schema -DocArray offers a number of predefined Documents, like [ImageDoce][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. +DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc]. If you try to use these directly as a schema for a Document Index, you will get unexpected behavior: -Depending on the backend, and exception will be raised, or no vector index for ANN lookup will be built. +Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built. The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding` field. But this is crucial information for any vector database to work properly! @@ -135,7 +135,7 @@ You can work around this problem by subclassing the predefined Document and addi ``` Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the -predefined Document type, or of your custom Document type. +predefined Document type, or your custom Document type. The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. 
that you want to index, you _don't_ need to cast them to `MyDoc`: From af3902bd7c08038128a8810846f6851bc9cf8495 Mon Sep 17 00:00:00 2001 From: Johannes Messner Date: Mon, 24 Apr 2023 12:17:48 +0200 Subject: [PATCH 3/3] docs: add comma Signed-off-by: Johannes Messner --- docs/user_guide/storing/docindex.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index f6ffab33a19..2240b06cc86 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -199,7 +199,7 @@ need to have compatible schemas. - A and B have the same field names and field types - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A - In particular this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. + In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index. ## Vector similarity search
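Note on the schema-compatibility rule touched by the last hunk: the two bullets (same field names and types, or same field names where each field type of B is a subclass of the corresponding type of A) can be illustrated with a small, library-independent sketch. Everything below is a hypothetical stand-in for illustration only: `schemas_compatible`, `Vector`, `Vector128` and the schema classes are not DocArray names, and this is not how DocArray implements its check.

```python
from typing import get_type_hints


def schemas_compatible(index_schema: type, data_schema: type) -> bool:
    """Hypothetical helper (not DocArray API): apply the two bullet rules.

    A = index_schema, B = data_schema. They are treated as compatible when
    they have the same field names and every field type of B is the same as,
    or a subclass of, the corresponding field type of A.
    """
    index_fields = get_type_hints(index_schema)
    data_fields = get_type_hints(data_schema)
    # Rule shared by both bullets: identical set of field names.
    if index_fields.keys() != data_fields.keys():
        return False
    # issubclass(X, X) is True, so identical types pass as well.
    return all(
        issubclass(data_fields[name], index_fields[name]) for name in index_fields
    )


# Stand-in tensor types, assumed purely for this example.
class Vector:
    pass


class Vector128(Vector):
    pass


class IndexSchema:
    text: str
    embedding: Vector


class SameSchema:
    text: str
    embedding: Vector


class NarrowerSchema:
    text: str
    embedding: Vector128


class OtherSchema:
    title: str
    embedding: Vector


print(schemas_compatible(IndexSchema, SameSchema))      # same names, same types
print(schemas_compatible(IndexSchema, NarrowerSchema))  # field type is a subclass
print(schemas_compatible(IndexSchema, OtherSchema))     # field names differ
```

The sketch only covers plain class types; real tensor annotations such as `NdArray[128]` are parametrized and would need the library's own comparison logic.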