118 changes: 110 additions & 8 deletions docs/advanced/document-store/index.md
@@ -15,11 +15,11 @@ benchmark
```

Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory, e.g. in SQLite or Redis.
Comparing to the in-memory storage, the benefit of using an external store is often about longer persistence and faster retrieval.
The benefit of using an external store over an in-memory store is often about longer persistence and faster retrieval.

The look-and-feel of a DocumentArray with an external store is **almost the same** as a regular in-memory DocumentArray. This allows users to easily switch between backends under the same DocArray idiom.

Take SQLite as an example, using it as the store backend of a DocumentArray is as simple as follows:
Take SQLite as an example. Using it as the storage backend of a DocumentArray is as simple as follows:

```python
from docarray import DocumentArray, Document
Expand Down Expand Up @@ -58,19 +58,19 @@ da.summary()
│ │
╰────────────────────────────────────────────────────────────────────────────╯
```
Note that `da` was modified inside a `with` statement. This context manager ensures that the `DocumentArray` indices,
which allow users to access the `DocumentArray` by position (allowing statements such as `da[1]`),
are properly mapped and saved to the storage backend.
This is the recommended way to modify a DocumentArray that lives in a document store, because it avoids
unexpected behaviors such as elements that are inaccessible by position.


Creating, retrieving, updating, deleting Documents are identical to the regular {ref}`DocumentArray<documentarray>`. All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should work out of the box.
The procedures for creating, retrieving, updating, and deleting Documents are identical to those for a regular {ref}`DocumentArray<documentarray>`. All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should also work out of the box.


## Construct

There are two ways for initializing a DocumentArray with a store backend.
There are two ways to initialize a DocumentArray with an external storage backend.

````{tab} Specify storage

@@ -100,7 +100,7 @@ da = DocumentArray()

````

Depending on the context, you can choose the style that fits better. For example, if one wants to use class method such as `DocumentArray.empty(10)`, then explicit importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit.
Depending on the context, you can choose the style that fits better. For example, if you want to use a class method such as `DocumentArray.empty(10)`, then explicitly importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit.

```{admonition} Subindices
:class: seealso
@@ -116,7 +116,7 @@ To learn how to do that, see {ref}`here <subindex>`.

The config of a store backend is either a store-specific dataclass object or a `dict` that can be parsed into the former.
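
As a rough illustration of how a `dict` could be parsed into a dataclass config, here is a minimal sketch that uses only the standard library. `ToyStoreConfig` and `normalize_config` are hypothetical names and the fields are made up for the example; the real per-backend config classes and their fields are defined by DocArray and may behave differently.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class ToyStoreConfig:
    """Hypothetical config class; the field names are illustrative only."""

    connection: Optional[str] = None
    table_name: Optional[str] = None


def normalize_config(config: Union[ToyStoreConfig, dict, None]) -> ToyStoreConfig:
    """Accept a dataclass instance, a dict, or None and return a dataclass."""
    if config is None:
        return ToyStoreConfig()
    if isinstance(config, ToyStoreConfig):
        return config
    # Reject keys the dataclass does not declare instead of silently dropping them.
    known = {f.name for f in fields(ToyStoreConfig)}
    unknown = set(config) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ToyStoreConfig(**config)


cfg = normalize_config({"connection": "example.db"})
print(cfg)
```

Accepting both forms lets callers write a quick `dict` inline while the rest of the code works with a single typed object.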

One can pass the config in the constructor via `config`:
You can pass the config in the constructor via `config`:

````{tab} Use dataclass

@@ -346,6 +346,108 @@ array([[7., 7., 7.],
[4., 4., 4.]])
```

## Persistence, mutations and context manager

Having DocumentArrays that are backed by a document store introduces an extra consideration into the way you think about DocumentArrays.
The DocumentArray object created in your Python program is now a view of the underlying implementation in the external store.
This means that your DocumentArray object in Python can be out of sync with what is persisted to the external store.

**For example**
```python
from docarray import DocumentArray, Document

da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
da1.append(Document())
print(f"Length of da1 is {len(da1)}")

da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
print(f"Length of da2 is {len(da2)}")
```
**Output**
```console
Length of da1 is 1
Length of da2 is 0
```

Executing this script multiple times yields the same result.

When you run the line `da1.append(Document())`, you expect the DocumentArray with `index_name='my_index'` to now have a length of `1`.
However, when you try to create another view of the DocumentArray in `da2`, you get a fresh DocumentArray.

You also expect the script to increment the length of the DocumentArrays every time you run it.
This is because the previous run should have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run will append a new document, incrementing the length by `+1` each time.

However, it seems like your append operation is also not being persisted.

````{dropdown} What actually happened here?
The DocumentArray actually did persist, but not in the way you might expect.
Since you did not use the `with` context manager or scope your mutation, the persistence logic is evaluated only when the program exits.
`da1` is destroyed first, persisting the DocumentArray of length `1`.
But when `da2` is destroyed, it persists a DocumentArray of length `0` to the same index in Redis as `da1`, overwriting its value.

This means that if you had not created `da2`, the overwrite would not have occurred and the script would actually increment the length of the DocumentArray correctly.
You can prove this to yourself by commenting out the last 2 lines of the script and running the script repeatedly.

**Script**
```python
from docarray import DocumentArray, Document

da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
da1.append(Document())
print(f"Length of da1 is {len(da1)}")

# da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
# print(f"Length of da2 is {len(da2)}")
```

**First run output**
```console
Length of da1 is 1
```
**Second run output**
```console
Length of da1 is 2
```
**Third run output**
```console
Length of da1 is 3
```
````
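
The overwrite described in the dropdown can be reproduced with a toy model that uses only the standard library. This is a sketch of the last-writer-wins effect, not DocArray's implementation: `FAKE_STORE`, `ToyView`, and `flush` are hypothetical names, and a plain dict stands in for Redis.

```python
# Toy model only: a dict stands in for the external store (e.g. Redis).
FAKE_STORE = {}


class ToyView:
    """Hypothetical in-process view that writes back only when flush() is called."""

    def __init__(self, index_name):
        self.index_name = index_name
        # Take a snapshot of whatever is persisted at construction time.
        self.items = list(FAKE_STORE.get(index_name, []))

    def append(self, item):
        # Mutates the local snapshot only; the store is untouched.
        self.items.append(item)

    def flush(self):
        # Overwrites the stored value with this view's snapshot.
        FAKE_STORE[self.index_name] = list(self.items)


view1 = ToyView("my_index")
view1.append("doc")
view2 = ToyView("my_index")  # created before view1 has flushed, so it is empty

# Simulate teardown at program exit: both views write back in turn.
view1.flush()  # the store now holds one item
view2.flush()  # the stale, empty snapshot replaces it
stored_len = len(FAKE_STORE["my_index"])
print(stored_len)
```

In this model the second flush wins, so the store ends up empty even though the first view held one item.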

Now that you know the issue, let's explore how to work with DocumentArrays backed by a document store in a more predictable manner.

### Using Context Manager

The recommended way is to use the DocumentArray as a context manager like so:

```python
from docarray import DocumentArray, Document

da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
with da1:  # Use the context manager to make sure you persist the mutation
    da1.append(Document())
print(f"Length of da1 is {len(da1)}")

da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index"))
print(f"Length of da2 is {len(da2)}")
```
**First run output**
```console
Length of da1 is 1
Length of da2 is 1
```
**Second run output**
```console
Length of da1 is 2
Length of da2 is 2
```
**Third run output**
```console
Length of da1 is 3
Length of da2 is 3
```

The append you made to the DocumentArray is now persisted properly. Hurray!
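
To see why the `with` block changes the outcome, consider the following toy model, which uses only the standard library. It is a sketch under stated assumptions rather than DocArray's code: `FAKE_STORE` and `ToySyncedView` are hypothetical names, a dict stands in for Redis, and the write-back is assumed to happen when the block exits.

```python
# Toy model only: a dict stands in for the external store (e.g. Redis).
FAKE_STORE = {}


class ToySyncedView:
    """Hypothetical view that writes its snapshot back when a `with` block exits."""

    def __init__(self, index_name):
        self.index_name = index_name
        # Take a snapshot of whatever is persisted at construction time.
        self.items = list(FAKE_STORE.get(index_name, []))

    def append(self, item):
        self.items.append(item)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Persist at the end of the block instead of at interpreter exit.
        FAKE_STORE[self.index_name] = list(self.items)
        return False  # do not swallow exceptions raised inside the block


first = ToySyncedView("my_index")
with first:
    first.append("doc")

second = ToySyncedView("my_index")  # created after the write-back
lengths = (len(first.items), len(second.items))
print(lengths)
```

Because the write-back happens before the second view is constructed, both views report the same length in this model.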


## Known limitations


@@ -413,7 +515,7 @@ Take home message is, use the context manager and put your write operations into

### Out-of-array modification

One can not take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray.
You cannot take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray.

Specifically, the pattern below is not supported by any external store backend:

Expand Down