docs/user_guide/storing/docindex.md
```python
from docarray import BaseDoc
from docarray.index import HnswDocumentIndex


class MyDoc(BaseDoc):
    ...  # fields elided in this excerpt


db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')
```

### Schema definition

In this code snippet, `HnswDocumentIndex` takes a schema in the form of `MyDoc`.
The Document Index then _creates a column for each field in `MyDoc`_.
the database will store vectors with 128 dimensions.
Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that
for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually!


### Using a predefined Document as schema

DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.

The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding`
field. But this is crucial information for any vector database to work properly!

You can work around this problem by subclassing the predefined Document and adding the dimensionality information:

=== "Using type hint"
```python
from docarray.documents import TextDoc
from docarray.typing import NdArray
from docarray.index import HnswDocumentIndex


class MyDoc(TextDoc):
embedding: NdArray[128]


db = HnswDocumentIndex[MyDoc](work_dir='test_db')
```

=== "Using Field()"
```python
from docarray.documents import TextDoc
from docarray.typing import AnyTensor
from docarray.index import HnswDocumentIndex
from pydantic import Field


class MyDoc(TextDoc):
embedding: AnyTensor = Field(n_dim=128)


db = HnswDocumentIndex[MyDoc](work_dir='test_db3')
```

Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the
predefined Document types, or your custom Document type.

The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`:

```python
import numpy as np

from docarray import DocList
from docarray.documents import TextDoc

# data of type TextDoc
data = DocList[TextDoc](
[
TextDoc(text='hello world', embedding=np.random.rand(128)),
TextDoc(text='hello world', embedding=np.random.rand(128)),
TextDoc(text='hello world', embedding=np.random.rand(128)),
]
)

# you can index this into Document Index of type MyDoc
db.index(data)
```


### Database location

For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you
need to have compatible schemas.
- A and B have the same field names and field types
- A and B have the same field names, and, for every field, the type of B is a subclass of the type of A

In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.

## Vector similarity search

Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
matching documents and their associated similarity scores.

How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations).

### Batched search

You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

The specific configurations that you can tweak depend on the backend, but the in

Document Indexes differentiate between three different kinds of configurations:

### Database configurations

_Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column),
and that you _don't_ dynamically change at runtime.
You can customize every field in this configuration:
```python
# ...
# > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db')
```

### Runtime configurations

_Runtime configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column),
and that you can dynamically change at runtime.
You can customize every field in this configuration using the [configure()][docarray.index.abstract.BaseDocIndex.configure] method:

After this change, the new setting will be applied to _every_ column that corresponds to a `np.ndarray` type.

### Column configurations

For many vector databases, individual columns can have different configurations.
