diff --git a/docs/advanced/document-store/elasticsearch.md b/docs/advanced/document-store/elasticsearch.md index 67f069cd122..a6f83a22679 100644 --- a/docs/advanced/document-store/elasticsearch.md +++ b/docs/advanced/document-store/elasticsearch.md @@ -82,7 +82,8 @@ da = DocumentArray( config={'index_name': 'old_stuff', 'n_dim': 128}, ) -da.extend([Document() for _ in range(1000)]) +with da: + da.extend([Document() for _ in range(1000)]) da2 = DocumentArray( storage='elasticsearch', @@ -304,13 +305,14 @@ for those that have `pizza` in their text description. from docarray import DocumentArray, Document da = DocumentArray(storage='elasticsearch', config={'n_dim': 2, 'index_text': True}) -da.extend( - [ - Document(text='Person eating'), - Document(text='Person eating pizza'), - Document(text='Pizza restaurant'), - ] -) +with da: + da.extend( + [ + Document(text='Person eating'), + Document(text='Person eating pizza'), + Document(text='Pizza restaurant'), + ] + ) pizza_docs = da.find('pizza') pizza_docs[:, 'text'] @@ -336,28 +338,29 @@ from docarray import DocumentArray, Document da = DocumentArray( storage='elasticsearch', config={'n_dim': 32, 'tag_indices': ['food_type', 'price']} ) -da.extend( - [ - Document( - tags={ - 'food_type': 'Italian and Spanish food', - 'price': 'cheap but not that cheap', - }, - ), - Document( - tags={ - 'food_type': 'French and Italian food', - 'price': 'on the expensive side', - }, - ), - Document( - tags={ - 'food_type': 'chinese noddles', - 'price': 'quite cheap for what you get!', - }, - ), - ] -) +with da: + da.extend( + [ + Document( + tags={ + 'food_type': 'Italian and Spanish food', + 'price': 'cheap but not that cheap', + }, + ), + Document( + tags={ + 'food_type': 'French and Italian food', + 'price': 'on the expensive side', + }, + ), + Document( + tags={ + 'food_type': 'chinese noddles', + 'price': 'quite cheap for what you get!', + }, + ), + ] + ) results_cheap = da.find('cheap', index='price') print('searching "cheap" in 
:\n\t', results_cheap[:, 'tags__price']) diff --git a/docs/advanced/document-store/index.md b/docs/advanced/document-store/index.md index 3c5ed5e1dce..5db58353f22 100644 --- a/docs/advanced/document-store/index.md +++ b/docs/advanced/document-store/index.md @@ -15,11 +15,11 @@ benchmark ``` Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory, e.g. in SQLite, Redis. -Comparing to the in-memory storage, the benefit of using an external store is often about longer persistence and faster retrieval. +The benefit of using an external store over an in-memory store is often about longer persistence and faster retrieval. The look-and-feel of a DocumentArray with external store is **almost the same** as a regular in-memory DocumentArray. This allows users to easily switch between backends under the same DocArray idiom. -Take SQLite as an example, using it as the store backend of a DocumentArray is as simple as follows: +Take SQLite as an example. Using it as the storage backend of a DocumentArray is as simple as follows: ```python from docarray import DocumentArray, Document @@ -58,19 +58,19 @@ da.summary() │ │ ╰────────────────────────────────────────────────────────────────────────────╯ ``` -Note that `da` was modified inside a `with` statement. This context manager ensures that the the `DocumentArray` indices, +Note that `da` was modified inside a `with` statement. This context manager ensures that the `DocumentArray` indices, which allow users to access the `DocumentArray` by position (allowing statements such as `da[1]`), are properly mapped and saved to the storage backend. This is the recommended default usage to modify a DocumentArray that lives on a document store to avoid unexpected behaviors that can yield to, for example, inaccessible elements by position. -Creating, retrieving, updating, deleting Documents are identical to the regular {ref}`DocumentArray`. 
All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should work out of the box. +The procedures for creating, retrieving, updating, and deleting Documents are identical to those for a regular {ref}`DocumentArray`. All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should also work out of the box. ## Construct -There are two ways for initializing a DocumentArray with a store backend. +There are two ways to initialize a DocumentArray with an external storage backend. ````{tab} Specify storage @@ -100,7 +100,7 @@ da = DocumentArray() ```` -Depending on the context, you can choose the style that fits better. For example, if one wants to use class method such as `DocumentArray.empty(10)`, then explicit importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit. +Depending on the context, you can choose the style that fits better. For example, if you want to use a class method such as `DocumentArray.empty(10)`, then explicitly importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit. ```{admonition} Subindices :class: seealso @@ -116,7 +116,7 @@ To learn how to do that, see {ref}`here `. The config of a store backend is either store-specific dataclass object or a `dict` that can be parsed into the former. -One can pass the config in the constructor via `config`: +You can pass the config in the constructor via `config`: ````{tab} Use dataclass @@ -346,6 +346,108 @@ array([[7., 7., 7.], [4., 4., 4.]]) ``` +## Persistence, mutations and context manager + +Having DocumentArrays that are backed by a document store introduces an extra consideration into the way you think about DocumentArrays. +The DocumentArray object created in your Python program is now a view of the underlying implementation in the external store. 
+This means that your DocumentArray object in Python can be out of sync with what is persisted to the external store. + +**For example** +```python +from docarray import DocumentArray, Document + +da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +da1.append(Document()) +print(f"Length of da1 is {len(da1)}") + +da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +print(f"Length of da2 is {len(da2)}") +``` +**Output** +```console +Length of da1 is 1 +Length of da2 is 0 +``` + +Executing this script multiple times yields the same result. + +When you run the line `da1.append(Document())`, you expect the DocumentArray with `index_name='my_index'` to now have a length of `1`. +However, when you try to create another view of the DocumentArray in `da2`, you get a fresh DocumentArray. + +You also expect the script to increment the length of the DocumentArrays every time you run it. +This is because the previous run should have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run will append a new document, incrementing the length by `+1` each time. + +However, it seems like your append operation is also not being persisted. + +````{dropdown} What actually happened here? +The DocumentArray actually did persist, but not in the way you might expect. +Since you did not use the `with` context manager or scope your mutation, the persistence logic is being evaluated when the program exits. +`da1` is destroyed first, persisting the DocumentArray of length `1`. +But when `da2` is destroyed, it persists a DocumentArray of length `0` to the same index in Redis as `da1`, overwriting its value. + +This means that if you had not created `da2`, the overwrite would not have occurred and the script would actually increment the length of the DocumentArray correctly. +You can prove this to yourself by commenting out the last 2 lines of the script and running the script repeatedly. 
+ +**Script** +```python +from docarray import DocumentArray, Document + +da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +da1.append(Document()) +print(f"Length of da1 is {len(da1)}") + +# da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +# print(f"Length of da2 is {len(da2)}") +``` + +**First run output** +```console +Length of da1 is 1 +``` +**Second run output** +```console +Length of da1 is 2 +``` +**Third run output** +```console +Length of da1 is 3 +``` +```` + +Now that you know the issue, let's explore what you should do to work with DocumentArrays backed by a document store in a more predictable manner. +### Using Context Manager +The recommended way is to use the DocumentArray as a context manager like so: + +```python +from docarray import DocumentArray, Document + +da1 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +with da1: # Use the context manager to make sure you persist the mutation + da1.append(Document()) +print(f"Length of da1 is {len(da1)}") + +da2 = DocumentArray(storage='redis', config=dict(n_dim=3, index_name="my_index")) +print(f"Length of da2 is {len(da2)}") +``` +**First run output** +```console +Length of da1 is 1 +Length of da2 is 1 +``` +**Second run output** +```console +Length of da1 is 2 +Length of da2 is 2 +``` +**Third run output** +```console +Length of da1 is 3 +Length of da2 is 3 +``` + +The append you made to the DocumentArray is now persisted properly. Hurray! + + ## Known limitations @@ -413,7 +515,7 @@ Take home message is, use the context manager and put your write operations into ### Out-of-array modification -One can not take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray. +You cannot take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray. 
Specifically, the pattern below is not supported by any external store backend: diff --git a/docs/advanced/document-store/redis.md b/docs/advanced/document-store/redis.md index e17cf84fab6..2f131922fda 100644 --- a/docs/advanced/document-store/redis.md +++ b/docs/advanced/document-store/redis.md @@ -144,26 +144,27 @@ da = DocumentArray( }, ) -da.extend( - [ - Document( - id=f'{i}', - embedding=i * np.ones(n_dim), - tags={'price': i, 'color': 'blue', 'stock': i % 2 == 0}, - ) - for i in range(10) - ] -) -da.extend( - [ - Document( - id=f'{i+10}', - embedding=i * np.ones(n_dim), - tags={'price': i, 'color': 'red', 'stock': i % 2 == 0}, - ) - for i in range(10) - ] -) +with da: + da.extend( + [ + Document( + id=f'{i}', + embedding=i * np.ones(n_dim), + tags={'price': i, 'color': 'blue', 'stock': i % 2 == 0}, + ) + for i in range(10) + ] + ) + da.extend( + [ + Document( + id=f'{i+10}', + embedding=i * np.ones(n_dim), + tags={'price': i, 'color': 'red', 'stock': i % 2 == 0}, + ) + for i in range(10) + ] + ) print('\nIndexed price, color and stock:\n') for doc in da: @@ -301,7 +302,8 @@ da = DocumentArray( }, ) -da.extend([Document(id=f'{i}', embedding=i * np.ones(n_dim)) for i in range(10)]) +with da: + da.extend([Document(id=f'{i}', embedding=i * np.ones(n_dim)) for i in range(10)]) np_query = np.ones(n_dim) * 8 n_limit = 5 @@ -367,13 +369,14 @@ The following example builds a `DocumentArray` with several documents containing from docarray import Document, DocumentArray da = DocumentArray(storage='redis', config={'n_dim': 2, 'index_text': True}) -da.extend( - [ - Document(id='1', text='token1 token2 token3'), - Document(id='2', text='token1 token2'), - Document(id='3', text='token2 token3 token4'), - ] -) +with da: + da.extend( + [ + Document(id='1', text='token1 token2 token3'), + Document(id='2', text='token1 token2'), + Document(id='3', text='token2 token3 token4'), + ] + ) results = da.find('token1') print(results[:, 'text']) @@ -420,28 +423,29 @@ da = 
DocumentArray( storage='redis', config={'n_dim': 32, 'tag_indices': ['food_type', 'price']}, ) -da.extend( - [ - Document( - tags={ - 'food_type': 'Italian and Spanish food', - 'price': 'cheap but not that cheap', - }, - ), - Document( - tags={ - 'food_type': 'French and Italian food', - 'price': 'on the expensive side', - }, - ), - Document( - tags={ - 'food_type': 'chinese noddles', - 'price': 'quite cheap for what you get!', - }, - ), - ] -) +with da: + da.extend( + [ + Document( + tags={ + 'food_type': 'Italian and Spanish food', + 'price': 'cheap but not that cheap', + }, + ), + Document( + tags={ + 'food_type': 'French and Italian food', + 'price': 'on the expensive side', + }, + ), + Document( + tags={ + 'food_type': 'chinese noddles', + 'price': 'quite cheap for what you get!', + }, + ), + ] + ) results_cheap = da.find('cheap', index='price') print('searching "cheap" in :\n\t', results_cheap[:, 'tags__price']) diff --git a/docs/advanced/document-store/weaviate.md b/docs/advanced/document-store/weaviate.md index f94b341d685..87a75a12e54 100644 --- a/docs/advanced/document-store/weaviate.md +++ b/docs/advanced/document-store/weaviate.md @@ -127,13 +127,14 @@ Then, we can index some Documents: ```python from docarray import Document -da.extend( - [ - Document(text='Persist Documents with Weaviate.'), - Document(text='And enjoy fast nearest neighbor search.'), - Document(text='All while using DocArray API.'), - ] -) +with da: + da.extend( + [ + Document(text='Persist Documents with Weaviate.'), + Document(text='And enjoy fast nearest neighbor search.'), + Document(text='All while using DocArray API.'), + ] + ) ``` Now, we can generate embeddings inside the database using BERT model: @@ -426,13 +427,14 @@ da = DocumentArray( ) # load the dummy data -da.extend( - [ - Document(text='Persist Documents with Weaviate.'), - Document(text='And enjoy fast nearest neighbor search.'), - Document(text='All while using DocArray API.'), - ] -) +with da: + da.extend( + [ + 
Document(text='Persist Documents with Weaviate.'), + Document(text='And enjoy fast nearest neighbor search.'), + Document(text='All while using DocArray API.'), + ] + ) tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') model = AutoModel.from_pretrained('bert-base-uncased') @@ -493,13 +495,14 @@ da = DocumentArray( ) # load some dummy data -da.extend( - [ - Document(text='Persist Documents with Weaviate.'), - Document(text='And enjoy fast nearest neighbor search.'), - Document(text='All while using DocArray API.'), - ] -) +with da: + da.extend( + [ + Document(text='Persist Documents with Weaviate.'), + Document(text='And enjoy fast nearest neighbor search.'), + Document(text='All while using DocArray API.'), + ] + ) tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') model = AutoModel.from_pretrained('bert-base-uncased')
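Note on the persistence section that this diff adds to `index.md`: the overwrite it describes can be reproduced without a Redis server. The following is a minimal sketch under stated assumptions, not DocArray's implementation. `ToyStore`, `ToyArray`, `sync()`, and `run_script()` are hypothetical names invented for illustration. The only behavior taken from the documentation text is that a view loads its state when it is created and writes it back either when its `with` block ends or when the program exits.

```python
class ToyStore:
    """Hypothetical stand-in for the external store: index name -> saved ids."""

    def __init__(self):
        self.saved = {}


class ToyArray:
    """Hypothetical local view that loads on creation and writes back on sync()."""

    def __init__(self, store, index_name):
        self.store = store
        self.index_name = index_name
        # Load whatever was last saved under this index name.
        self.ids = list(store.saved.get(index_name, []))

    def append(self, doc_id):
        # Changes only the local view, not the store.
        self.ids.append(doc_id)

    def sync(self):
        # Write the local view back, replacing whatever the store holds.
        self.store.saved[self.index_name] = list(self.ids)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Persist as soon as the `with` block ends.
        self.sync()
        return False

    def __len__(self):
        return len(self.ids)


def run_script(store, use_context_manager):
    """Mimic one run of the documented script and return (len(da1), len(da2))."""
    da1 = ToyArray(store, "my_index")
    if use_context_manager:
        with da1:
            da1.append("doc")
    else:
        da1.append("doc")
    da2 = ToyArray(store, "my_index")
    lengths = (len(da1), len(da2))
    # Simulate program exit: da1 is written first, then da2 (assumed order,
    # matching the order described in the dropdown).
    da1.sync()
    da2.sync()
    return lengths
```

In this toy model, repeated calls with `use_context_manager=False` on the same `ToyStore` return `(1, 0)` every time, because the stale `da2` view is written last. With `use_context_manager=True` the lengths grow as `(1, 1)`, `(2, 2)`, `(3, 3)`, which matches the outputs shown in the documentation.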