diff --git a/docs/advanced/document-store/annlite.md b/docs/advanced/document-store/annlite.md index c764a157b93..8a827255927 100644 --- a/docs/advanced/document-store/annlite.md +++ b/docs/advanced/document-store/annlite.md @@ -1,7 +1,7 @@ (annlite)= # Annlite -One can use [Annlite](https://github.com/jina-ai/annlite) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Annlite](https://github.com/jina-ai/annlite) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `annlite`. You can install it via `pip install "docarray[annlite]".` @@ -10,7 +10,7 @@ This feature requires `annlite`. You can install it via `pip install "docarray[a ## Usage -One can instantiate a DocumentArray with Annlite storage like so: +You can instantiate a DocumentArray with Annlite storage like so: ```python from docarray import DocumentArray @@ -20,7 +20,7 @@ da = DocumentArray(storage='annlite', config={'n_dim': 10}) The usage would be the same as the ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the `data_path` in `config`. +To access a DocumentArray formerly persisted, you can specify the `data_path` in `config`. ```python from docarray import DocumentArray diff --git a/docs/advanced/document-store/elasticsearch.md b/docs/advanced/document-store/elasticsearch.md index cc713ca4b3a..998d196c997 100644 --- a/docs/advanced/document-store/elasticsearch.md +++ b/docs/advanced/document-store/elasticsearch.md @@ -2,7 +2,7 @@ # Elasticsearch -One can use [Elasticsearch](https://www.elastic.co) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Elasticsearch](https://www.elastic.co) as a document store for DocumentArray. 
It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `elasticsearch`. You can install it via `pip install "docarray[elasticsearch]".` @@ -41,7 +41,7 @@ docker-compose up ### Create DocumentArray with Elasticsearch backend -Assuming service is started using the default configuration (i.e. server address is `http://localhost:9200`), one can instantiate a DocumentArray with Elasticsearch storage as such: +Assuming the service is started using the default configuration (i.e. the server address is `http://localhost:9200`), you can instantiate a DocumentArray with Elasticsearch storage like so: ```python from docarray import DocumentArray @@ -70,7 +70,7 @@ da = DocumentArray( Here is [the official Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#elasticsearch-security-certificates) for you to get certificate, password etc. -To access a DocumentArray formerly persisted, one can specify `index_name` and the hosts. +To access a formerly persisted DocumentArray, you can specify `index_name` and the hosts. The following example will build a DocumentArray with previously stored data from `old_stuff` on `http://localhost:9200`: @@ -160,7 +160,7 @@ You can read more about parallel bulk config and their default values [here](htt ### Vector search with filter query -One can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows [ElasticSearch's DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html). +You can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows [ElasticSearch's DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html). Consider Documents with embeddings `[0,0,0]` up to `[9,9,9]` where the document with embedding `[i,i,i]` has as tag `price` with value `i`.
We can create such example with the following code: @@ -238,7 +238,7 @@ You can read more about approximate kNN tuning [here](https://www.elastic.co/gui ### Search by filter query -One can search with user-defined query filters using the `.find` method. Such queries can be constructed following the +You can search with user-defined query filters using the `.find` method. Such queries can be constructed following the guidelines in [ElasticSearch's Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html). Consider you store Documents with a certain tag `price` into ElasticSearch and you want to retrieve all Documents diff --git a/docs/advanced/document-store/extend.md b/docs/advanced/document-store/extend.md index 591d2ce8832..d7664d4ff62 100644 --- a/docs/advanced/document-store/extend.md +++ b/docs/advanced/document-store/extend.md @@ -25,7 +25,7 @@ Let's get started! ## Step 1: create the folder -Go to `docarray/array/storage` folder, create a sub-folder for your document store. Let's call it `mydocstore`. You will need to create four empty files in that folder: +Go to `docarray/array/storage` folder, create a sub-folder for your document store. Let's call it `mydocstore`. You need to create four empty files in that folder: ```{code-block} --- @@ -80,7 +80,7 @@ class GetSetDelMixin(BaseGetSetDelMixin): ... ``` -You will need to implement the above five functions, which correspond to the logics of get/set/delete items via a string `.id`. They are essential to ensure DocumentArray works. +You need to implement the above five functions, which correspond to the logic of getting/setting/deleting items via a string `.id`. They are essential to ensure DocumentArray works. Note that DocumentArray maintains an `offset2ids` mapping to allow a list-like behaviour. This mapping is inherited from the `BaseGetSetDelMixin`. Therefore, you need to implement methods to persist this mapping, in case you @@ -111,9 +111,9 @@ upper level. Also, make sure that `_set_doc_by_id` performs an **upsert operation**.
Also, make sure that `_set_doc_by_id` performs an **upsert operatio ```{tip} Let's call the above five functions as **the essentials**. -If you aim for high performance, it is recommeneded to implement other methods *without* leveraging your essentials. They are: `_get_docs_by_ids`, `_del_docs_by_ids`, `_clear_storage`, `_set_doc_value_pairs`, `_set_doc_value_pairs_nested`, `_set_docs_by_ids`. One can get their full signatures from {class}`~docarray.array.storage.base.getsetdel.BaseGetSetDelMixin`. These functions define more fine-grained get/set/delete logics that are frequently used in DocumentArray. +If you aim for high performance, it is recommeneded to implement other methods *without* leveraging your essentials. They are: `_get_docs_by_ids`, `_del_docs_by_ids`, `_clear_storage`, `_set_doc_value_pairs`, `_set_doc_value_pairs_nested`, `_set_docs_by_ids`. You can get their full signatures from {class}`~docarray.array.storage.base.getsetdel.BaseGetSetDelMixin`. These functions define more fine-grained get/set/delete logics that are frequently used in DocumentArray. -Implementing them is fully optional, and you can only implement some of them not all of them. If you are not implementing them, those methods will use a generic-but-slow version that is based on your five essentials. +Implementing them is fully optional, and you can only implement some of them not all of them. If you are not implementing them, those methods use a generic-but-slow version based on your five essentials. ``` ```{seealso} @@ -149,7 +149,7 @@ class SequenceLikeMixin(BaseSequenceLikeMixin): ... def insert(self, index: int, value: 'Document'): - # Optional. By default, this will add a new item and update offset2id + # Optional. By default, this adds a new item and update offset2id # if you want to customize this, make sure to handle offset2id ... @@ -162,7 +162,7 @@ class SequenceLikeMixin(BaseSequenceLikeMixin): ... def __iter__(self) -> Iterator['Document']: - # Optional. 
By default, this will rely on offset2id to iterate + # Optional. By default, this relies on offset2id to iterate ... ``` @@ -244,7 +244,7 @@ By default, this should be set to `True`. Further, you have to store the value of this flag in `self._list_like`. Some methods that are handled outside of your control will take the value form there and use it appropriately. `_init_storage` is a very important function to be called during the DocumentArray construction. -You will need to handle different construction & copy behaviors in this function. +You need to handle different construction and copy behaviors in this function. `_ensure_unique_config` is needed to support DocArray's subindex feature. A subindex inherits its configuration from the root index, unless a field of the configuration is explicitly provided to the subindex. @@ -308,7 +308,7 @@ class StorageMixins(BackendMixin, GetSetDelMixin, SequenceLikeMixin, ABC): ... ``` -Just copy-paste it will do the work. +Just copying and pasting it should work. If you have implemented a `find.py` module, make sure to also inherit the `FindMixin`: ```python @@ -391,7 +391,7 @@ Done! Now you should be able to use it like `DocumentArrayMyDocStore`! ## On pull request: add tests and type-hint -Welcome to contribute your extension back to DocArray. You will need to include `DocumentArrayMyDocStore` in at least the following tests: +You are welcome to contribute your extension back to DocArray. You need to include `DocumentArrayMyDocStore` in at least the following tests: ```text tests/unit/array/test_advance_indexing.py diff --git a/docs/advanced/document-store/index.md b/docs/advanced/document-store/index.md index c3d49770e0a..4784eeca576 100644 --- a/docs/advanced/document-store/index.md +++ b/docs/advanced/document-store/index.md @@ -15,12 +15,11 @@ extend benchmark ``` -Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory, e.g. 
in SQLite, Redis. -The benefit of using an external store over an in-memory store is often about longer persistence and faster retrieval. +Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory (e.g. in SQLite or Redis). Compared to an in-memory store, document stores offer longer persistence and faster retrieval. -The look-and-feel of a DocumentArray with external store is **almost the same** as a regular in-memory DocumentArray. This allows users to easily switch between backends under the same DocArray idiom. +DocumentArrays with a document store look and feel **almost the same** as a regular in-memory DocumentArray. This lets you easily switch backends under the same DocArray idiom. -Take SQLite as an example. Using it as the storage backend of a DocumentArray is as simple as follows: +Let's take SQLite as an example. Using it as the storage backend of a DocumentArray is simple: ```python from docarray import DocumentArray, Document @@ -59,19 +58,18 @@ da.summary() │ │ ╰────────────────────────────────────────────────────────────────────────────╯ ``` -Note that `da` was modified inside a `with` statement. This context manager ensures that the the `DocumentArray` indices, -which allow users to access the `DocumentArray` by position (allowing statements such as `da[1]`), +Note that `da` was modified inside a `with` statement. This context manager ensures that `DocumentArray` indices, +which let you access the `DocumentArray` by position (allowing statements such as `da[1]`), are properly mapped and saved to the storage backend. -This is the recommended default usage to modify a DocumentArray that lives on a document store to avoid -unexpected behaviors that can yield to, for example, inaccessible elements by position. 
+This is the recommended way to modify a DocumentArray that lives in a document store to avoid +unexpected behaviors that can lead to, for example, inaccessible elements by position. - -The procedures for creating, retrieving, updating, and deleting Documents are identical to those for a regular {ref}`DocumentArray`. All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should also work out of the box. +The procedures for creating, retrieving, updating, and deleting Documents are just the same as for a regular {ref}`DocumentArray`. All DocumentArray methods like `.summary()`, `.embed()`, `.plot_embeddings()` also work out of the box. ## Construct -There are two ways to initialize a DocumentArray with an external storage backend. +You can initialize a DocumentArray with an external storage backend in one of two ways: ````{tab} Specify storage @@ -87,7 +85,7 @@ da = DocumentArray(storage='sqlite') ``` ```` -````{tab} Import the class and alias it +````{tab} Import and alias the class ```python from docarray.array.sqlite import DocumentArraySqlite as DocumentArray @@ -101,7 +99,7 @@ da = DocumentArray() ```` -Depending on the context, you can choose the style that fits better. For example, if you want to use a class method such as `DocumentArray.empty(10)`, then explicitly importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit. +Depending on the context, you can choose the style that fits best. If you want to use a class method like `DocumentArray.empty(10)`, you should explicitly import `DocumentArraySqlite`. Alternatively, you can choose not to alias the imported class to make the code even more explicit. ```{admonition} Subindices :class: seealso @@ -113,11 +111,11 @@ To learn how to do that, see {ref}`here `.
``` -### Construct with config +### Construct with configuration -The config of a store backend is either store-specific dataclass object or a `dict` that can be parsed into the former. +The document store's configuration is either a store-specific dataclass object or a `dict` that can be parsed into that object. -You can pass the config in the constructor via `config`: +You can pass the configuration in the constructor via `config`: ````{tab} Use dataclass @@ -144,21 +142,31 @@ da = DocumentArray( ```` -Using dataclass gives you better type-checking in IDE but requires an extra import; using dict is more flexible but can be error-prone. You can choose the style that fits best to your context. +Dataclasses give you better type-checking in your IDE but require an extra import; a dict is more flexible but can be error-prone. You can choose the style that best fits your context. ```{admonition} Creating DocumentArrays without specifying index :class: warning -When you specify an index (table name for SQL stores) in the config, the index will be used to persist the DocumentArray in the document store. -If you create a DocumentArray but do not specify an index, a randomized placeholder index will be created to persist the data. +When you specify an index (table name for SQL stores) in the configuration, the index will be used to persist the DocumentArray in the document store. +If you create a DocumentArray but do not specify an index, a random placeholder index will be created to persist the data. -Creating DocumentArrays without indexes is useful during prototyping but should not be used in a production setting as randomized placeholder data will be persisted in the document store unnecessarily. +Creating DocumentArrays without indexes is useful during prototyping but shouldn't be used in production, as random placeholder data will be persisted in the document store unnecessarily.
``` - ## Feature summary -DocArray supports multiple storage backends with different search features. The following table showcases relevant functionalities that are supported (✅) or not supported (❌) in DocArray depending on the backend: +Each document store supports different functionalities. The three key ones are: + +- **vector search**: perform approximate nearest neighbor search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding. + +- **vector search + filter**: perform approximate nearest neighbor search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding and a filter. + +- **filter**: perform a filter step over the data. The search function's input is a filter. + +You can use **vector search** and **vector search + filter** via the DocumentArray's {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` methods. **Filter** functionality, on the other hand, is only available via the `.find()` method. + +A detailed explanation of the differences between `.find` and `.match` can be found [here](./../../../fundamentals/documentarray/matching). + +This table shows which of these functionalities each document store supports (✅) or doesn't support (❌): | Name | Construction | Vector search | Vector search + Filter | Filter | |---------------------------------------|------------------------------------------|---------------|------------------------|--------| | [`Redis`](./redis.md) | `DocumentArray(storage='redis')` | ✅ | ✅ | ✅ | | [`Milvus`](./milvus.md) | `DocumentArray(storage='milvus')` | ✅ | ✅ | ✅ | -The right backend choice depends on the scale of your data, the required performance and the desired ease of setup.
For most use cases we recommend starting with [`AnnLite`](./annlite.md). +The right backend choice for you depends on the scale of your data, the required performance and the desired ease of setup. For most use cases we recommend starting with [`AnnLite`](./annlite.md). [**Check our One Million Scale Benchmark for more details**](./benchmark#conclusion). - -Here we understand by - -- **vector search**: perform approximate nearest neighbour search (or exact full scan search). The input of the search function is a numpy array or a DocumentArray containing an embedding. - -- **vector search + filter**: perform approximate nearest neighbour search (or exact full scan search). The input of the search function is a numpy array or a DocumentArray containing an embedding and a filter. - -- **filter**: perform a filter step over the data. The input of the search function is a filter. - -The capabilities of **vector search**, **vector search + filter** can be used using the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` methods through a `DocumentArray`. -The **filter** functionality is available using the `.find` method in a `DocumentArray`. 
-A detailed explanation of the differences between `.find` and `.match` can be found [here](./../../../fundamentals/documentarray/matching) - ### Vector search example -Example of **vector search** - -````{tab} .find +````{tab} .find() ```python from docarray import Document, DocumentArray @@ -212,7 +205,7 @@ result[:, 'embedding'] ``` ```` -````{tab} .match +````{tab} .match() ```python from docarray import Document, DocumentArray @@ -244,9 +237,7 @@ array([[2., 2., 2.], ### Vector search with filter example -Example of **vector search + filter** - -````{tab} .find +````{tab} .find() ```python from docarray import Document, DocumentArray @@ -278,7 +269,7 @@ results[:, 'embedding'] ``` ```` -````{tab} .match +````{tab} .match() ```python from docarray import Document, DocumentArray @@ -319,8 +310,6 @@ array([[2., 2., 2.], ### Filter example -Example of **filter** - ```python from docarray import Document, DocumentArray import numpy as np @@ -359,7 +348,7 @@ array([[7., 7., 7.], (backend-context-mngr)= ## Persistence, mutations and context manager -Having DocumentArrays that are backed by a document store introduces an extra consideration into the way you think about DocumentArrays. +Using DocumentArrays backed by a document store introduces an extra consideration into the way you think about DocumentArrays. The DocumentArray object created in your Python program is now a view of the underlying implementation in the document store. This means that your DocumentArray object in Python can be out of sync with what is persisted to the document store. @@ -385,18 +374,18 @@ Executing this script multiple times yields the same result. When you run the line `da1.append(Document())`, you expect the DocumentArray with `index_name='my_index'` to now have a length of `1`. However, when you try to create another view of the DocumentArray in `da2`, you get a fresh DocumentArray. -You also expect the script to increment the length of the DocumentArrays every time you run it. 
-This is because the previous run should have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run will append a new document, incrementing the length by `+1` each time. +You would also expect the script to increment the length of the DocumentArrays every time you run it. +This is because the previous run _should_ have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run appends a new Document, incrementing the length by `1` each time. However, it seems like your append operation is also not being persisted. ````{dropdown} What actually happened here? The DocumentArray actually did persist, but not in the way you might expect. -Since you did not use the `with` context manager or scope your mutation, the persistence logic is being evaluated when the program exits. +Since you didn't use the `with` context manager or scope your mutation, the persistence logic is evaluated when the program exits. `da1` is destroyed first, persisting the DocumentArray of length `1`. But when `da2` is destroyed, it persists a DocumentArray of length `0` to the same index in Redis as `da1`, overriding its value. -This means that if you had not created `da2`, the overriding would not have occured and the script would actually increment the length of the DocumentArray correctly. +This means that if you had not created `da2`, the override wouldn't have occurred and the script would actually increment the length of the DocumentArray correctly. You can prove this to yourself by commenting out the last 2 lines of the script and running the script repeatedly. **Script** @@ -425,7 +414,7 @@ Length of da1 is 3 ``` ```` -Now that you know the issue, let's explore what you should do to work with DocumentArrays backed by document store in a more predictable manner. +Now that you know the issue, let's explore how to work more predictably with DocumentArrays backed by a document store.
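The last-writer-wins failure described above can be sketched with a toy model. The code below is hypothetical illustration only, not DocArray's internals: each view keeps its own in-memory copy and flushes to the shared store only when it is destroyed.

```python
# Toy model of deferred persistence -- hypothetical, NOT DocArray internals.
# Each view keeps a private copy and flushes to the shared store only on
# close(), mirroring persistence that happens at interpreter shutdown.

shared_store = {}  # stands in for the document store index


class ToyView:
    def __init__(self, index_name):
        self.index_name = index_name
        # load whatever was persisted so far
        self.items = list(shared_store.get(index_name, []))

    def append(self, item):
        self.items.append(item)  # mutates only the private copy

    def close(self):
        # persistence happens only here, like at program exit
        shared_store[self.index_name] = list(self.items)


da1 = ToyView('my_index')
da1.append('doc')
da2 = ToyView('my_index')  # created before da1 flushed: sees an empty index

da1.close()  # persists length 1
da2.close()  # persists length 0 afterwards, overriding da1's flush

print(len(shared_store['my_index']))  # 0 -- the append was lost
```

In the toy model, flushing eagerly right after each mutation would remove the stale overwrite, because no view would hold an outdated copy for long; scoping your writes achieves exactly that.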
````{tab} Use with ```python from docarray import DocumentArray, Document @@ -446,7 +435,7 @@ print(f"Length of da2 is {len(da2)}") ```` ````{tab} Use sync -Explicitly calling the `sync` method of the DocumentArray will save the data to the document store. +Explicitly calling the `sync` method of the DocumentArray saves the data to the document store. ```python from docarray import DocumentArray, Document @@ -476,16 +465,16 @@ Length of da1 is 3 Length of da2 is 3 ``` -The append you made to the DocumentArray is now persisted properly. Hurray! +The `append()` you made to the DocumentArray is now persisted properly. Hooray! -The recommended way to sync data to the document store is to use the DocumentArray inside the `with` context manager. +We recommend syncing data to the document store by using the DocumentArray inside the `with` context manager. ## Known limitations -### Multiple references to the same storage backend +### Multiple references to the same document store -Let's see an example with ANNLite storage backend, other storage backends would also have the same problem. Let's create two DocumentArrays `da` and `db` that point the same storage backend: +Let's see an example with the AnnLite document store (other document stores would also have the same problem). Let's create two DocumentArrays `da` and `db` that point to the same document store: ```python from docarray import DocumentArray, Document @@ -504,10 +493,10 @@ The output is: 0 ``` -Looks like `db` is not really up-to-date with `da`. This is true and false. True in the sense that `1` is not `0`, number speaks by itself. -False in the sense that, the Document is already written to the storage backend. You just can't see it. +It looks like `db` is not really up-to-date with `da`. This is both true and false. True because `1` is clearly not `0`. +False because the Document is already written to the storage backend -- you just can't see it.
-To prove it does persist, run the following code snippet multiple times and you will see the length is increasing one at a time: +To prove it persists, run the following code snippet multiple times and you'll see the length increase by one each time: ```python from docarray import DocumentArray, Document @@ -517,10 +506,10 @@ da.append(Document(text='hello')) print(len(da)) ``` -Simply put, the reason of this behavior is that certain meta information **not synced immediately** to the backend on *every* operation; it would be very costly to do so. -As a consequence, your multiple references to the same backend would look different if they are written in one code block as the example above. +Simply put, the reason for this behavior is that certain meta information is **not synced immediately** to the document store on *every* operation -- it would be very costly to do so. +As a consequence, multiple references to the same document store can look different when they are written in one code block, as in the example above. -To solve this problem, simply use `with` statement and use DocumentArray as a context manager. The last example can be refactored into the following: +To solve this problem, simply use the DocumentArray as a context manager in a `with` statement. The prior example can be refactored as follows: ```{code-block} python --- @@ -543,13 +532,13 @@ Now you get the correct output: 1 ``` -Take home message is, use the context manager and put your write operations into the `with` block, when you work with multiple references in a row. +In short, use the context manager and put your write operations into the `with` block when you work with multiple references in a row. ### Out-of-array modification -You can not take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray.
+You can't take a Document *out* of a DocumentArray, modify it, and then expect its modification to be committed back to the DocumentArray. -Specifically, the pattern below is not supported by any external store backend: +Specifically, no document store supports the pattern below: ```python from docarray import DocumentArray @@ -567,21 +556,21 @@ The solution is simple: use {ref}`column-selector`: da[0, 'text'] = 'hello' ``` -### Performance issue caused by list-like structure +### Performance issues caused by list-like structure + +DocArray allows list-like behavior by adding an offset-to-id mapping structure to document stores. This feature stores meta information about Document order along with the Documents themselves in the document store. -DocArray allows list-like behavior by adding an offset-to-id mapping structure to storage backends. Such feature (adding this structure) means the database stores, along with documents, meta information about document order. -However, list_like behavior is not useful in indexers where concurrent usage is possible and users do not need information about document location. -Besides, updating list-like operation comes with a cost. -You can disable list-like behavior in the config as follows +However, list-like behavior is of little use in indexers, where access may be concurrent and a Document's position doesn't matter, and updating the mapping on list-like operations can be costly. + +You can disable list-like behavior as follows: ```python from docarray import DocumentArray da = DocumentArray(storage='annlite', config={'n_dim': 2, 'list_like': False}) ``` -When `list_like` is disabled, all the list-like operations will not be allowed and raise errors.
like this: +When `list_like` is disabled, list-like operations are not allowed and raise errors: + ```python from docarray import DocumentArray, Document import numpy as np @@ -595,17 +584,17 @@ def docs(): da = DocumentArray(docs, storage='annlite', config={'n_dim': 2, 'list_like': False}) -da[0] # This will raise an error. +da[0] # This raises an error. ``` ```{admonition} Hint -By default, `list_like` will be true. +By default, `list_like` is `True`. ``` -### Elements access is slower +### Slower element access -Obviously, a DocumentArray with on-disk storage is slower than in-memory DocumentArray. However, if you choose to use on-disk storage, then often your concern of persistence overwhelms the concern of efficiency. +Obviously, a DocumentArray with on-disk storage is slower than an in-memory DocumentArray. However, if you choose on-disk storage, your concern for persistence often outweighs your concern for efficiency. -Slowness can affect all functions of DocumentArray. On the bright side, they may not be that severe as you would expect. Modern database are highly optimized. Moreover, some database provides faster method for resolving certain queries, e.g. nearest-neighbour queries. We are actively and continuously improving DocArray to better leverage those features. +Slowness can affect all functions of DocumentArray. On the bright side, it may not be as severe as you'd expect -- modern databases are highly optimized. Moreover, some databases provide faster methods for resolving certain queries, e.g. nearest-neighbor queries. We are actively and continuously improving DocArray to better leverage those features. diff --git a/docs/advanced/document-store/qdrant.md b/docs/advanced/document-store/qdrant.md index 10993db1ab5..0daae8dfa46 100644 --- a/docs/advanced/document-store/qdrant.md +++ b/docs/advanced/document-store/qdrant.md @@ -1,18 +1,17 @@ (qdrant)= # Qdrant -One can use [Qdrant](https://qdrant.tech) as the document store for DocumentArray.
It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Qdrant](https://qdrant.tech) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} -This feature requires `qdrant-client`. You can install it via `pip install "docarray[qdrant]".` +This feature requires `qdrant-client`. You can install it with `pip install "docarray[qdrant]".` ```` ## Usage ### Start Qdrant service -To use Qdrant as the storage backend, you need a running Qdrant server. You can use the Qdrant Docker image to run a -server. Create `docker-compose.yml` as follows: +To use Qdrant as the storage backend, you need a running Qdrant server. You can create `docker-compose.yml` to use the Qdrant Docker image: ```yaml --- @@ -38,7 +37,7 @@ docker-compose up ### Create DocumentArray with Qdrant backend -Assuming service is started using the default configuration (i.e. server address is `http://localhost:6333`), one can +Assuming you start the service with the default configuration (i.e. server address is `http://localhost:6333`), you can instantiate a DocumentArray with Qdrant storage like so: ```python @@ -47,9 +46,9 @@ from docarray import DocumentArray da = DocumentArray(storage='qdrant', config={'n_dim': 10}) ``` -The usage would be the same as the ordinary DocumentArray. +The usage is the same as an ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the `collection_name`, the `host` and the `port`. +To access a formerly-persisted DocumentArray, you can specify the `collection_name`, `host` and `port`: ```python @@ -68,13 +67,11 @@ da = DocumentArray( da.summary() ``` -Note that specifying the `n_dim` is mandatory before using Qdrant as a backend for DocumentArray. +Note that you must specify `n_dim` before using Qdrant as a backend for DocumentArray. -Other functions behave the same as in-memory DocumentArray. 
+Other functions behave the same as an in-memory DocumentArray. -## Config - -The following configs can be set: +## Configuration | Name | Description | Default | |-----------------------|----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------| @@ -152,11 +149,11 @@ print(da.find(np.random.random(D), limit=10)) Search with `.find` can be restricted by user-defined filters. The supported tag types for filter are `'int'`, `'float'`, `'bool'`, `'str'`, `'text'` and `'geo'` as in [Qdrant](https://qdrant.tech/documentation/payload/). Such filters can be constructed following the guidelines in [Qdrant's Documentation](https://qdrant.tech/documentation/filtering/) -### Example of `.find` with a filter +### Example of `.find` with filter -Consider Documents with embeddings `[0,0,0]` up to ` [9,9,9]` where the document with embedding `[i,i,i]` -has as tag `price` with value `i`. We can create such example with the following code: +Let's create Documents with embeddings `[0,0,0]` up to `[9,9,9]`, where each Document (which has an embedding `[i,i,i]`) +has a tag `price` with value `i`: ```python from docarray import Document, DocumentArray @@ -185,9 +182,9 @@ for embedding, price in zip(da.embeddings, da[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -Consider we want the nearest vectors to the embedding `[8. 8. 8.]`, with the restriction that prices must follow a filter. As an example, retrieved Documents must have `price` value lower than or equal to `max_price`. We can encode this information in Qdrant using `filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}`. You can also pass additional `search_params` following [Qdrant's Search API](https://qdrant.tech/documentation/search/#search-api). +We want the nearest vectors to the embedding `[8. 8. 
8.]`, with the restriction that prices must follow a filter. For example, retrieved Documents must have `price` value lower than or equal to `max_price`. You can encode this information in Qdrant using `filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}`. You can also pass additional `search_params` following [Qdrant's Search API](https://qdrant.tech/documentation/search/#search-api). -Then you can implement and use the search with the proposed filter: +You can then implement and search with the proposed filter: ```python max_price = 7 @@ -204,7 +201,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -This would print: +This prints: ``` Query vector: [8. 8. 8.] @@ -223,8 +220,9 @@ For Qdrant, the distance scores can be accessed in the Document's `.scores` dict ```` ### Example of `.filter` with a filter -The following example shows how to use DocArray with Qdrant Document Store in order to filter text documents. -Consider Documents have the tag `price` with a value of `i`. We can create these with the following code: + +The following example shows how to use DocArray with Qdrant document store to filter text documents. +Let's create Documents with the tag `price` with a value of `i`: ```python from docarray import Document, DocumentArray import numpy as np @@ -248,11 +246,13 @@ print('\nIndexed Prices:\n') for embedding, price in zip(da.embeddings, da[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -For example, suppose we want to filter results such that -retrieved documents must have a `price` value less than or equal to `max_price`. We can encode -this information in Qdrant using `filter = {'price': {'$lte': max_price}}`. 
-Then you can implement and use the search with the proposed filter: +If you want to filter only for results +with a `price` less than or equal to `max_price`, you can encode +this information using `filter = {'price': {'$lte': max_price}}`. + +You can then implement and search with the proposed filter: + ```python max_price = 7 n_limit = 4 diff --git a/docs/advanced/document-store/redis.md b/docs/advanced/document-store/redis.md index 94d23413b01..5d5e4d20dbd 100644 --- a/docs/advanced/document-store/redis.md +++ b/docs/advanced/document-store/redis.md @@ -1,7 +1,7 @@ (redis)= # Redis -You can use [Redis](https://redis.io) as the document store for DocumentArray. It is useful when you want to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Redis](https://redis.io) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `redis`. You can install it via `pip install "docarray[redis]".` @@ -232,7 +232,7 @@ for doc in results: ) ``` -This will print: +This prints: ```console Embeddings Approximate Nearest Neighbours with "price" at most 7, "color" blue and "stock" True: @@ -252,7 +252,7 @@ integer in `columns` configuration (`'field': 'int'`) and use a filter query tha ### Search by filter query -One can search with user-defined query filters using the `.find` method. Such queries follow the [Redis Search Query Syntax](https://redis.io/docs/stack/search/reference/query_syntax/). +You can search with user-defined query filters using the `.find` method. Such queries follow the [Redis Search Query Syntax](https://redis.io/docs/stack/search/reference/query_syntax/). Consider a case where you store Documents with a tag of `price` into Redis and you want to retrieve all Documents with `price` less than or equal to some `max_price` value. 
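Under the hood, such a filter is just a query string in Redis's search syntax. As a minimal stdlib-only sketch of composing one (the `price_at_most` helper and the value `7` are illustrative, not part of DocArray's API):

```python
# Sketch: composing a Redis Search numeric-range filter string.
# The syntax is @field:[min max]; "-inf"/"+inf" leave a bound open.
# `price_at_most` is a hypothetical helper for illustration only.


def price_at_most(max_price: float) -> str:
    """Return a Redis Search filter matching price <= max_price."""
    return f"@price:[-inf {max_price}]"


print(price_at_most(7))  # @price:[-inf 7]
```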
@@ -377,7 +377,7 @@ for doc in results: print(f" embedding={doc.embedding},\t score={doc.scores['score'].value}") ``` -This will print: +This prints: ```console Embeddings Approximate Nearest Neighbours: @@ -408,7 +408,7 @@ for doc in results: print(f" embedding={doc.embedding},\t score={doc.scores['score'].value}") ``` -This will print: +This prints: ```console Embeddings Approximate Nearest Neighbours: @@ -444,7 +444,7 @@ results = da.find('token1') print(results[:, 'text']) ``` -This will print: +This prints: ```console ['token1 token2 token3', 'token1 token2'] @@ -462,7 +462,7 @@ print('scorer=BM25:') print(results[:, 'text']) ``` -This will print: +This prints: ```console scorer=TFIDF.DOCNORM: @@ -516,7 +516,7 @@ results_italian = da.find('italian', index='food_type') print('searching "italian" in :\n\t', results_italian[:, 'tags__food_type']) ``` -This will print: +This prints: ```console searching "cheap" in : diff --git a/docs/advanced/document-store/sqlite.md b/docs/advanced/document-store/sqlite.md index 980af482527..23b1cc9947b 100644 --- a/docs/advanced/document-store/sqlite.md +++ b/docs/advanced/document-store/sqlite.md @@ -1,7 +1,7 @@ (sqlite)= # SQLite -One can use SQLite as the document store for DocumentArray. It is useful when you want to access a large number Document which can not fit into memory. +You can use SQLite as a document store for DocumentArray. It's suitable for accessing a large number of Documents which can't fit in memory. 
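As a rough stdlib-only illustration of the idea (a toy key-value schema, not DocArray's actual table layout), on-disk storage amounts to rows in a SQLite table that are deserialized only when accessed:

```python
import json
import sqlite3

# Illustration only: a toy key-value table, NOT DocArray's real schema.
conn = sqlite3.connect(':memory:')  # a real setup would use a file path
conn.execute('CREATE TABLE docs (doc_id TEXT PRIMARY KEY, body TEXT)')
conn.execute(
    'INSERT INTO docs VALUES (?, ?)',
    ('d1', json.dumps({'text': 'hello'})),
)

# Only the requested row is loaded into memory and deserialized.
row = conn.execute('SELECT body FROM docs WHERE doc_id = ?', ('d1',)).fetchone()
print(json.loads(row[0]))  # {'text': 'hello'}
```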
## Usage

@@ -15,7 +15,7 @@ da1 = DocumentArray(
 )  # with customize config
 ```
 
-To reconnect a formerly persisted database, one can need to specify *both* `connection` and `table_name` in `config`:
+To reconnect a formerly persisted database, you need to specify *both* `connection` and `table_name` in `config`:
 
 ```python
 from docarray import DocumentArray
diff --git a/docs/advanced/document-store/weaviate.md b/docs/advanced/document-store/weaviate.md
index faf9faef689..0b46b5e8907 100644
--- a/docs/advanced/document-store/weaviate.md
+++ b/docs/advanced/document-store/weaviate.md
@@ -1,13 +1,13 @@
 (weaviate)=
 # Weaviate
 
-One can use [Weaviate](https://weaviate.io) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`.
+You can use [Weaviate](https://weaviate.io) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`.
 
 ````{tip}
 This feature requires `weaviate-client`. You can install it via `pip install "docarray[weaviate]".`
 ````
 
-Here is a video tutorial that guides you to build a simple image search using Weaviate and Docarray.
+Here's a video tutorial on building a simple image search using Weaviate and DocArray:
@@ -17,7 +17,7 @@ Here is a video tutorial that guides you to build a simple image search using We ### Start Weaviate service -To use Weaviate as the storage backend, it is required to have the Weaviate service started. Create `docker-compose.yml` as follows: +To use Weaviate as the storage backend, you need to start the Weaviate service. Create `docker-compose.yml` as follows: ```yaml --- @@ -54,7 +54,7 @@ docker-compose up ### Create DocumentArray with Weaviate backend -Assuming service is started using the default configuration (i.e. server address is `http://localhost:8080`), one can instantiate a DocumentArray with Weaviate storage as such: +Assuming you've started the service with the default configuration (i.e. server address is `http://localhost:8080`), you can instantiate a DocumentArray with Weaviate storage: ```python from docarray import DocumentArray @@ -62,11 +62,11 @@ from docarray import DocumentArray da = DocumentArray(storage='weaviate') ``` -The usage would be the same as the ordinary DocumentArray. +You can use it just the same as an ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the name, the host, the port and the protocol to connect to the server. `name` is required in this case but other connection parameters are optional. If they are not provided, then it will connect to the Weaviate service bound to `http://localhost:8080`. +To access a formerly-persisted DocumentArray, you can specify the name, host, port and protocol to connect to the server. `name` is required in this case but other connection parameters are optional. If you don't provide them, it will connect to the Weaviate service bound to `http://localhost:8080`. -Note, that the `name` parameter in `config` needs to be capitalized. +Note that the `name` parameter in `config` needs to be capitalized. 
```python from docarray import DocumentArray @@ -78,18 +78,16 @@ da = DocumentArray( da.summary() ``` -Other functions behave the same as in-memory DocumentArray. +Other functions behave the same as an in-memory DocumentArray. -## Config - -The following configs can be set: +## Configuration | Name | Description | Default | |----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------| | `host` | Hostname of the Weaviate server | 'localhost' | -| `port` | port of the Weaviate server | 8080 | -| `protocol` | protocol to be used. Can be 'http' or 'https' | 'http' | -| `name` | Weaviate class name; the class name of Weaviate object to presesent this DocumentArray | None | +| `port` | Port of the Weaviate server | 8080 | +| `protocol` | Protocol to use. Can be 'http' or 'https' | 'http' | +| `name` | Weaviate class name; the class name of Weaviate object to present this DocumentArray | None | | `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | None | | `distance` | The distance metric used to compute the distance between vectors. Must be either `cosine` or `l2-squared`. 
| `None`, defaults to the default value in Weaviate* |
| `ef` | The size of the dynamic list for the nearest neighbors (used during the search). The higher ef is chosen, the more accurate, but also slower a search becomes. | `None`, defaults to the default value in Weaviate* |
@@ -111,10 +109,10 @@ The following configs can be set:
 
 ## Minimum example
 
-The following example shows how to use DocArray with Weaviate Document Store in order to index and search text
+The following example shows how to use DocArray with Weaviate Document Store to index and search text
 Documents.
 
-First, let's run the create the `DocumentArray` instance (make sure a Weaviate server is up and running):
+First, let's create the `DocumentArray` instance (ensure a Weaviate server is up and running):
 
 ```python
 from docarray import DocumentArray
@@ -139,7 +137,7 @@ with da:
     )
 ```
 
-Now, we can generate embeddings inside the database using BERT model:
+Now, we can generate embeddings inside the database using the BERT model:
 
 ```python
 from transformers import AutoModel, AutoTokenizer
@@ -175,13 +173,13 @@ Persist Documents with Weaviate.
 
 ## Filtering
 
-Search with `.find` can be restricted by user-defined filters. Such filters can be constructed following the guidelines
-in [Weaviate's Documentation](https://weaviate.io/developers/weaviate/current/graphql-references/filters.html).
+You can restrict search with `.find` using user-defined filters. You can construct these filters by following the guidelines
+in [Weaviate's documentation](https://weaviate.io/developers/weaviate/current/graphql-references/filters.html).
 
 ### Example of `.find` with a filter only
 
-Consider you store Documents with a certain tag `price` into weaviate and you want to retrieve all Documents
-with `price` lower or equal to some `max_price` value.
+Consider you store Documents with a certain tag `price` into Weaviate and want to retrieve all Documents
+with `price` lower than or equal to a `max_price` value.
You can index such Documents as follows:

@@ -206,7 +204,7 @@ for price in da[:, 'tags__price']:
     print(f'\t price={price}')
 ```
 
-Then you can retrieve all documents whose price is lower than or equal to `max_price` by applying the following
+Then you can retrieve all Documents whose price is lower than or equal to `max_price` by applying the following
 filter:
 
```python
@@ -221,7 +219,7 @@ for price in results[:, 'tags__price']:
     print(f'\t price={price}')
 ```
 
-This would print
+This prints:
 
 ```
 Returned examples that satisfy condition "price at most 3":
@@ -234,8 +232,8 @@ This would print
 
 ### Example of `.find` with query vector and filter
 
-Consider Documents with embeddings `[0,0,0]` up to ` [9,9,9]` where the document with embedding `[i,i,i]`
-has as tag `price` with value `i`. We can create such example with the following code:
+Consider Documents with embeddings `[0,0,0]` up to `[9,9,9]` where the Document with embedding `[i,i,i]`
+has a tag `price` with value `i`. We can create such an example with the following code:
 
 
```python
@@ -283,7 +281,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']):
     print(f'\tembedding={embedding},\t price={price}')
 ```
 
-This would print:
+This prints:
 
 ```bash
 Embeddings Nearest Neighbours with "price" at most 7:
@@ -300,13 +298,13 @@ Embeddings Nearest Neighbours with "price" at most 7:
 `pip install --upgrade weaviate-client`***
 
 You can sort results by any primitive property, typically a text, string, number, or int property. When a query has a
-natural order (e.g. because of a near vector search), adding a sort operator will override the order.
+natural order (e.g. because of a near vector search), adding a sort operator overrides the order.
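In plain-Python terms, applying a sort operator means re-ordering the similarity-ranked results by a field instead (an illustration of the idea only, not Weaviate's API; the `results` list is made up):

```python
# Illustration only: `results` mimics Documents already ordered by
# vector similarity (highest certainty first).
results = [
    {'price': 3, 'certainty': 0.99},
    {'price': 9, 'certainty': 0.95},
    {'price': 1, 'certainty': 0.90},
]

# Sorting by price (descending) overrides the similarity order above.
by_price = sorted(results, key=lambda d: d['price'], reverse=True)
print([d['price'] for d in by_price])  # [9, 3, 1]
```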
[Further documentation here.](https://weaviate.io/developers/weaviate/current/graphql-references/get.html#sorting) ### Example of `.find` with vector and sort -Consider Documents with the column 'price' and on the return you want to sort these documents by highest price to lowest +Consider Documents with the column 'price' and on the return you want to sort these Documents by highest price to lowest price. You can create an example with the following code: ```python @@ -353,7 +351,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -This would print: +This prints: ```bash Returned examples that verify results are in order from highest price to lowest: @@ -370,7 +368,7 @@ Returned examples that verify results are in order from highest price to lowest: embedding=[0. 0. 0.], price=0 ``` -For ascending the results would be as expected: +In ascending order the results would be as expected: ```bash embedding=[0. 0. 0.], price=0 @@ -387,14 +385,14 @@ For ascending the results would be as expected: ## Set minimum certainty on query results -The DocArray/Weaviate find class uses the NearVector search argument since Weaviate is only being used in this combination to store -vectors generated by DocArray. Sometimes you want to set the certainty at a certain level to limit the return results. +The DocArray/Weaviate find class uses the NearVector search argument since Weaviate is only used in this combination to store +vectors generated by DocArray. Sometimes you want to set the certainty at a certain level to limit the returned results. You can do this with the `query_params` argument in the `find()` method. -`query_params` is a Dictionary element that combines itself with the request body. To set you must pass the value as a +`query_params` is a Dictionary element that combines itself with the request body. 
To set this you must pass the value as a Dict (`query_params={"key": "value}`) within the `find()` function -If you are familiar with Weaviates GraphQL structure then you can see where the `query_params` goes: +If you are familiar with Weaviate's GraphQL structure then you can see where the `query_params` goes: ```grapql { Get{ @@ -462,7 +460,7 @@ for res in results: print(f"\t scores={res[:, 'scores']}") ``` -This should return something similar to: +This returns something similar to: ```bash Only results that have a 'weaviate_certainty' of higher than 0.9 should show: @@ -477,19 +475,19 @@ For Weaviate, the distance scores can be accessed in the Document's `.scores` di ## Include additional properties in the return -GraphQL additional properties can be used on data objects in Get{} Queries to get additional information about the +GraphQL additional properties can be used on data objects in `Get{}` queries to get additional information about the returned data objects. Which additional properties are available depends on the modules that are attached to Weaviate. -The fields id, certainty, featureProjection and classification are available from Weaviate Core. On nested GraphQL -fields (references to other data classes), only the id can be returned. Explanation on specific additional properties +The fields `id`, `certainty`, `featureProjection` and `classification` are available from Weaviate Core. On nested GraphQL +fields (references to other data classes), only the `id` can be returned. An explanation on specific additional properties can be found on the module pages, see for example [text2vec-contextionary](https://weaviate.io/developers/weaviate/current/modules/text2vec-contextionary.html#additional-graphql-api-properties). 
[Further documentation here](https://weaviate.io/developers/weaviate/current/graphql-references/additional-properties.html)
 
-In order to include additional properties on the request you can use the `additional` parameter of the `find()` function.
+To include additional properties on the request you can use the `additional` parameter of the `find()` function.
 These will be included as Tags on the response.
 
-Assume you want to know when the document was inserted and last updated in the DB.
+Assume you want to know when the Document was inserted and last updated in the database.
 You can run the following:
 
```python
@@ -538,7 +536,7 @@ for res in results:
     print(f"\t lastUpdateTimeUnix={res[:, 'tags__lastUpdateTimeUnix']}")
 ```
 
-This should return:
+This returns:
 
 ```bash
 See when the Document was created and updated:
diff --git a/docs/advanced/graphql-support/index.md b/docs/advanced/graphql-support/index.md
index 642fb1fa8e6..45a1ce5f888 100644
--- a/docs/advanced/graphql-support/index.md
+++ b/docs/advanced/graphql-support/index.md
@@ -83,13 +83,13 @@ Finally, save all code snippets above into `toy.py` and run it from the console
 strawberry server toy
 ```
 
-You will see
+You'll see this output:
 
 ```text
 Running strawberry on http://0.0.0.0:8000/graphql 🍓
 ```
 
-Now open `http://0.0.0.0:8000/graphql` in your browser. You should be able to see a GraphiQL playground at this url.
+Now open `http://0.0.0.0:8000/graphql` in your browser. You should be able to see a GraphiQL playground at this URL.
 
 Try the following query
 
```gql
diff --git a/docs/datatypes/image/index.md b/docs/datatypes/image/index.md
index 2284b51c3c5..41120183851 100644
--- a/docs/datatypes/image/index.md
+++ b/docs/datatypes/image/index.md
@@ -143,7 +143,7 @@ print(d.tensor.shape)
 (180, 64, 64, 3)
 ```
 
-As one can see, it converts the single image tensor into 180 image tensors, each with the size of (64, 64, 3). 
You can also add all 180 image tensors into the chunks of this `Document`, simply do: +As you can see, it converts the single image tensor into 180 image tensors, each with the size of (64, 64, 3). You can also add all 180 image tensors into the chunks of this `Document`, simply do: ```python d.convert_image_tensor_to_sliding_windows(window_shape=(64, 64), as_chunks=True) diff --git a/docs/datatypes/index.md b/docs/datatypes/index.md index 6d0688793ca..8197fc31ca0 100644 --- a/docs/datatypes/index.md +++ b/docs/datatypes/index.md @@ -1,6 +1,6 @@ # Multimodal Data -Whether you’re working with text, image, video, audio, 3D meshes or the nested or the combined of them, you can always represent them as Documents and process them as DocumentArray. Here are some motivate examples: +DocArray lets you represent text, image, video, audio, and 3D meshes as Documents, whether separate, nested or combined, and process them as a DocumentArray. Here are some motivating examples: ```{toctree} @@ -11,4 +11,4 @@ audio/index mesh/index tabular/index multimodal/index -``` \ No newline at end of file +``` diff --git a/docs/datatypes/multimodal/index.md b/docs/datatypes/multimodal/index.md index 2865c3942d8..75dc019268e 100644 --- a/docs/datatypes/multimodal/index.md +++ b/docs/datatypes/multimodal/index.md @@ -1,8 +1,8 @@ (multimodal-example)= # {octicon}`stack` Multi-modal -This example will walk you through how to use DocArray to process multiple data modalities, jointly. -To do this comfortably and cleanly, you will use DocArray's {ref}`dataclass ` feature. +This example walks you through how to use DocArray to process multiple data modalities in tandem. +To do this comfortably and cleanly, you can use DocArray's {ref}`dataclass ` feature. ```{seealso} This example works with image and text data. 
@@ -648,4 +648,4 @@ OVERALL CLOSEST PAGE: ╰─────────────┴───────────────────────────────────────────────────────── ``` -```` \ No newline at end of file +```` diff --git a/docs/datatypes/tabular/index.md b/docs/datatypes/tabular/index.md index b97598426f1..714fc530789 100644 --- a/docs/datatypes/tabular/index.md +++ b/docs/datatypes/tabular/index.md @@ -1,11 +1,11 @@ (table-type)= # {octicon}`table` Table -One can freely convert between DocumentArray and `pandas.Dataframe`, read more details in {ref}`docarray-serialization`. Besides, one can load and write CSV file with DocumentArray. +You can freely convert between DocumentArray and `pandas.Dataframe`, read more details in {ref}`docarray-serialization`. Besides, you can load and write CSV file with DocumentArray. ## Load CSV table -One can easily load tabular data from `csv` file into a DocumentArray. For example, +You can easily load tabular data from `csv` file into a DocumentArray. For example, ```text Username;Identifier;First name;Last name @@ -37,10 +37,10 @@ da = DocumentArray.from_csv('toy.csv') tags ('dict',) 5 False ``` -One can observe that each row is loaded as a Document and the columns are loaded into `Document.tags`. +You can observe that each row is loaded as a Document and the columns are loaded into `Document.tags`. -In general, `from_csv` will try its best to resolve the column names of the table and map them into the corresponding Document attributes. If such attempt fails, one can always resolve the field manually via: +In general, `from_csv` will try its best to resolve the column names of the table and map them into the corresponding Document attributes. 
If such an attempt fails, you can always resolve the field manually:
 
 ```python
 from docarray import DocumentArray
diff --git a/docs/datatypes/text/index.md b/docs/datatypes/text/index.md
index 378b213ef19..e6a510911ea 100644
--- a/docs/datatypes/text/index.md
+++ b/docs/datatypes/text/index.md
@@ -1,14 +1,14 @@
 (text-type)=
 # {octicon}`typography` Text
 
-Representing text in DocArray is easy. Simply do:
+Representing text in DocArray is as easy as:
 
 ```python
 from docarray import Document
 
 Document(text='hello, world.')
 ```
 
-If your text data is big and can not be written inline, or it comes from a URI, then you can also define `uri` first and load the text into Document later.
+If your text data is larger and can't be written inline, or comes from a URI, then you can also define `uri` first and load the text into a Document later:
 
 ```python
 from docarray import Document
@@ -23,7 +23,7 @@ d.summary()
 ```
 
-And of course, you can have characters from different languages.
+And of course, you can use characters from different languages:
 
 ```python
 from docarray import Document
@@ -32,9 +32,9 @@ d = Document(text='👋 नमस्ते दुनिया!	你好世界!
 ```
 
-## Segment long documents
+## Segment long Documents
 
-Often times when you index/search textual document, you don't want to consider thousands of words as one document, some finer granularity would be nice. You can do these by leveraging `chunks` of Document. For example, let's segment this simple document by `!` mark:
+Oftentimes when you index/search textual Documents, you don't want to consider thousands of words as one huge Document; some finer granularity would be nice. You can do this by leveraging Document `chunks`. For example, let's split this simple Document at each `!` mark:
 
 ```python
 from docarray import Document
@@ -56,11 +56,11 @@ d.summary()
 └─
 ```
 
-Which creates five sub-documents under the original documents and stores them under `.chunks`. 
+This creates five sub-Documents under the original Document and stores them in its `.chunks`.
 
-## Convert text into `ndarray`
+## Convert text to `ndarray`
 
-Sometimes you may need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to convert easily.
+Sometimes you need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to do that easily.
 For example, we have a DocumentArray with three Documents:
 
```python
@@ -85,9 +85,9 @@ vocab = da.get_vocabulary()
 {'hello': 2, 'world': 3, 'goodbye': 4}
 ```
 
-The vocabulary is 2-indexed as `0` is reserved for padding symbol and `1` is reserved for unknown symbol.
+The vocabulary is 2-indexed as `0` is reserved for the padding symbol and `1` for the unknown symbol.
 
-One can further use this vocabulary to convert `.text` field into `.tensor` via:
+You can further use this vocabulary to convert the `.text` field into `.tensor`:
 
```python
for d in da:
@@ -101,7 +101,7 @@ for d in da:
 [2 4]
 ```
 
-When you have text in different length and you want the output `.tensor` to have the same length, you can define `max_length` during converting:
+When you have text of different lengths and want output `.tensor`s to have the same length, you can define `max_length` during conversion:
 
```python
from docarray import Document, DocumentArray
@@ -126,7 +126,7 @@ for d in da:
 [ 0  0  0  0  6  7  2  8  9 10]
 ```
 
-You can get also use `.tensors` of DocumentArray to get all tensors in one `ndarray`.
+You can also use a DocumentArray's `.tensors` to get all tensors in one `ndarray`.
 
```python
print(da.tensors)
@@ -140,7 +140,7 @@ print(da.tensors)
 
 ## Convert `ndarray` back to text
 
-As a bonus, you can also easily convert an integer `ndarray` back to text based on some given vocabulary. This procedure is often termed as "decoding". 
+As a bonus, you can also easily convert an integer `ndarray` back to text based on a given vocabulary. This is often termed "decoding". ```python from docarray import Document, DocumentArray @@ -171,7 +171,7 @@ this is a much longer sentence ``` -## Simple text matching via feature hashing +## Simple text matching with feature hashing Let's search for `"she entered the room"` in *Pride and Prejudice*: @@ -208,9 +208,9 @@ print(q.matches[:, ('text', 'scores__jaccard')]) ## Searching at chunk level with subindex You can create applications that search at chunk level using a subindex. -Imagine you want an application that searches at a sentences granularity and returns the document title of the document -containing the sentence closest to the query. For example, you can have a database of lyrics of songs and you want to -search the song title of a song from which you might remember a small part of it (likely the chorus). +Imagine you want an application that searches at a sentence granularity and returns the title of the Document +containing the closest sentence to the query. For example, you have a database of song lyrics and want to +search a title from which you remember a small part of the lyrics (like the chorus). ```{admonition} Multi-modal Documents :class: seealso @@ -222,10 +222,10 @@ You can find the corresponding example {ref}`here `. ``` ```python -song1_title = 'Old Macdougal Had a Farm' +song1_title = 'Old MacDonald Had a Farm' song1 = """ -Old Macdougal had a farm, E-I-E-I-O +Old MacDonald had a farm, E-I-E-I-O And on that farm he had some dogs, E-I-E-I-O With a bow-wow here, and a bow-wow there, Here a bow, there a bow, everywhere a bow-wow. @@ -245,7 +245,7 @@ wo dein sanfter Flügel weilt. """ ``` -We can now create one document for each of the songs, containing as chunks the song sentences. 
+We can create one Document for each song, containing the song's lines as chunks: ```python from docarray import Document, DocumentArray @@ -261,7 +261,7 @@ da.extend([doc1, doc2]) ``` Now we can build a feature vector for each line of each song. Here we use a very simple Bag of Words descriptor as -feature vector. +the feature vector. ```python import re @@ -288,7 +288,7 @@ for d in da['@c']: d.embedding = bow_feature_vector(d, vocab, tokenizer) ``` -Once we have the data prepared, we can store it into a DocumentArray that supports a subindex. +Once we've prepared the data, we can store it in a DocumentArray that supports a subindex: ```buildoutcfg n_features = len(vocab)+2 @@ -305,7 +305,7 @@ with da_backend: da_backend.extend(da) ``` -Given a query such as `into death` we want to search which song contained a similar sentence. +Given a query like `into death` we want to search songs that contain a similar sentence. ```python def find_song_name_from_song_snippet(query: Document, da_backend) -> str: @@ -320,7 +320,7 @@ query.embedding = bow_feature_vector(query, vocab, tokenizer) similar_items = find_song_name_from_song_snippet(query, da_backend) print(similar_items) ``` -Will print +This prints: ```text -{'song_title': 'Old Macdougal Had a Farm'} +{'song_title': 'Old MacDonald Had a Farm'} ``` diff --git a/docs/datatypes/video/index.md b/docs/datatypes/video/index.md index bb511665ac9..7d7c163955c 100644 --- a/docs/datatypes/video/index.md +++ b/docs/datatypes/video/index.md @@ -48,7 +48,7 @@ d.chunks.plot_image_sprites('mov.png') ## Key frame extraction -From the sprite image one can observe our example video is quite redundant. Let's extract the key frames from this video and see: +From the sprite image you can observe our example video is quite redundant. 
Let's extract the key frames from this video and see:

```python
from docarray import Document
@@ -98,7 +98,7 @@ print(first_scene.shape)
 
 ## Save as video file
 
-One can also save a Document `.tensor` as a video file. In this example, we load our `.mp4` video and store it into a 60fps video.
+You can also save a Document `.tensor` as a video file. In this example, we load our `.mp4` video and save it as a 60fps video.
 
```python
from docarray import Document
@@ -116,7 +116,7 @@ d = (
 
 ## Create Document from webcam
 
-One can generate a stream of Documents from a webcam via {meth}`~docarray.document.mixins.video.VideoDataMixin.generator_from_webcam`:
+You can generate a stream of Documents from a webcam via {meth}`~docarray.document.mixins.video.VideoDataMixin.generator_from_webcam`:
 
```python
from docarray import Document

@@ -125,7 +125,7 @@ for d in Document.generator_from_webcam():
     pass
 ```
 
-This will create a generator that yields a Document for each frame. One can control the framerate via `fps` parameter. Note that the upper bound of the framerate is determined by the hardware of webcam, not the software. Press `Esc` to exit.
+This creates a generator that yields a Document for each frame. You can control the framerate via the `fps` parameter. Note that the upper bound of the framerate is determined by the webcam hardware, not the software. Press `Esc` to exit.
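Conceptually, an fps-capped generator just paces its yields. Here's a rough stdlib-only sketch of that pacing (not DocArray's actual implementation; `read_frame` is a stand-in for a real camera read):

```python
import itertools
import time


def generator_from_frames(read_frame, fps: float = 30):
    """Yield frames from `read_frame`, at most `fps` frames per second."""
    interval = 1.0 / fps
    while True:
        start = time.monotonic()
        yield read_frame()
        # Sleep away whatever remains of this frame's time slot.
        time.sleep(max(0.0, interval - (time.monotonic() - start)))


# A fake "camera" that always returns the same frame, capped at 1000 fps.
frames = list(itertools.islice(generator_from_frames(lambda: 'frame', fps=1000), 3))
print(frames)  # ['frame', 'frame', 'frame']
```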