From 2dac540dbe50d3bd2e820074f96356c127c5ef7a Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Tue, 6 Dec 2022 15:41:31 +0100 Subject: [PATCH 01/10] docs(datatypes/text): fix wording Signed-off-by: Alex C-G --- docs/datatypes/index.md | 4 +-- docs/datatypes/text/index.md | 50 ++++++++++++++++++------------------ 2 files changed, 27 insertions(+), 27 deletions(-) diff --git a/docs/datatypes/index.md b/docs/datatypes/index.md index 6d0688793ca..8197fc31ca0 100644 --- a/docs/datatypes/index.md +++ b/docs/datatypes/index.md @@ -1,6 +1,6 @@ # Multimodal Data -Whether youโ€™re working with text, image, video, audio, 3D meshes or the nested or the combined of them, you can always represent them as Documents and process them as DocumentArray. Here are some motivate examples: +DocArray lets you represent text, image, video, audio, and 3D meshes as Documents, whether separate, nested or combined, and process them as a DocumentArray. Here are some motivating examples: ```{toctree} @@ -11,4 +11,4 @@ audio/index mesh/index tabular/index multimodal/index -``` \ No newline at end of file +``` diff --git a/docs/datatypes/text/index.md b/docs/datatypes/text/index.md index 378b213ef19..e6a510911ea 100644 --- a/docs/datatypes/text/index.md +++ b/docs/datatypes/text/index.md @@ -1,14 +1,14 @@ (text-type)= # {octicon}`typography` Text -Representing text in DocArray is easy. Simply do: +Representing text in DocArray is as easy as: ```python from docarray import Document Document(text='hello, world.') ``` -If your text data is big and can not be written inline, or it comes from a URI, then you can also define `uri` first and load the text into Document later. +If your text data is larger and can't be written inline, or comes from a URI, then you can also define `uri` first and load the text into a Document later: ```python from docarray import Document @@ -23,7 +23,7 @@ d.summary() ``` -And of course, you can have characters from different languages. 
+And of course, you can use characters from different languages:

```python
from docarray import Document

d = Document(text='👋 नमस्ते दुनिया! 你好世界！')
```

-## Segment long documents
+## Segment long Documents

-Often times when you index/search textual document, you don't want to consider thousands of words as one document, some finer granularity would be nice. You can do these by leveraging `chunks` of Document. For example, let's segment this simple document by `!` mark:
+Oftentimes when you index/search textual Documents, you don't want to consider thousands of words as one huge Document -- some finer granularity would be nice. You can do this by leveraging Document `chunks`. For example, let's split this simple Document at each `!` mark:

```python
from docarray import Document
@@ -56,11 +56,11 @@ d.summary()
 └─
```

-Which creates five sub-documents under the original documents and stores them under `.chunks`.
+This creates five sub-Documents under the original Document and stores them under its `.chunks`.

-## Convert text into `ndarray`
+## Convert text to `ndarray`

-Sometimes you may need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to convert easily.
+Sometimes you need to encode the text into a `numpy.ndarray` before further computation. We provide some helper functions in Document and DocumentArray that allow you to do that easily.

For example, we have a DocumentArray with three Documents:

```python
@@ -85,9 +85,9 @@ vocab = da.get_vocabulary()
{'hello': 2, 'world': 3, 'goodbye': 4}
```

-The vocabulary is 2-indexed as `0` is reserved for padding symbol and `1` is reserved for unknown symbol.
+The vocabulary is 2-indexed as `0` is reserved for the padding symbol and `1` for the unknown symbol. 
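To make the 2-indexed convention concrete, here is a rough plain-Python sketch of how such a vocabulary can be built and applied. This is an illustration, not DocArray's actual implementation; the helper names `build_vocab` and `encode` are made up:

```python
def build_vocab(texts):
    # ids 0 and 1 are reserved for the padding and unknown symbols,
    # so real tokens are numbered from 2 onwards ("2-indexed")
    vocab = {}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text, vocab, max_length=None):
    # tokens missing from the vocabulary map to the unknown id 1
    ids = [vocab.get(token, 1) for token in text.split()]
    if max_length is not None:
        # left-pad with the padding id 0 (and truncate) to a fixed length
        ids = [0] * (max_length - len(ids)) + ids[:max_length]
    return ids


vocab = build_vocab(['hello world', 'goodbye world'])
print(vocab)  # {'hello': 2, 'world': 3, 'goodbye': 4}
print(encode('hello moon', vocab, max_length=4))  # [0, 0, 2, 1]
```

Left-padding with `0` mirrors the `max_length` behavior in the surrounding examples.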
-One can further use this vocabulary to convert `.text` field into `.tensor` via:
+You can further use this vocabulary to convert the `.text` field into `.tensor`:

```python
for d in da:
@@ -101,7 +101,7 @@ for d in da:
[2 4]
```

-When you have text in different length and you want the output `.tensor` to have the same length, you can define `max_length` during converting:
+When you have text of different lengths and want output `.tensor`s to have the same length, you can define `max_length` during conversion:

```python
from docarray import Document, DocumentArray
@@ -126,7 +126,7 @@ for d in da:
[ 0 0 0 0 6 7 2 8 9 10]
```

-You can get also use `.tensors` of DocumentArray to get all tensors in one `ndarray`.
+You can also use a DocumentArray's `.tensors` to get all tensors in one `ndarray`.

```python
print(da.tensors)
@@ -140,7 +140,7 @@ print(da.tensors)

## Convert `ndarray` back to text

-As a bonus, you can also easily convert an integer `ndarray` back to text based on some given vocabulary. This procedure is often termed as "decoding".
+As a bonus, you can also easily convert an integer `ndarray` back to text based on a given vocabulary. This is often termed "decoding".

```python
from docarray import Document, DocumentArray
@@ -171,7 +171,7 @@ this is a much longer sentence
```

-## Simple text matching via feature hashing
+## Simple text matching with feature hashing

Let's search for `"she entered the room"` in *Pride and Prejudice*:

@@ -208,9 +208,9 @@ print(q.matches[:, ('text', 'scores__jaccard')])

## Searching at chunk level with subindex

You can create applications that search at chunk level using a subindex.
-Imagine you want an application that searches at a sentences granularity and returns the document title of the document
-containing the sentence closest to the query. For example, you can have a database of lyrics of songs and you want to
-search the song title of a song from which you might remember a small part of it (likely the chorus). 
+Imagine you want an application that searches at a sentence granularity and returns the title of the Document
+containing the closest sentence to the query. For example, you have a database of song lyrics and want to
+search a title from which you remember a small part of the lyrics (like the chorus).

```{admonition} Multi-modal Documents
:class: seealso

You can find the corresponding example {ref}`here `.
```

```python
-song1_title = 'Old Macdougal Had a Farm'
+song1_title = 'Old MacDonald Had a Farm'

song1 = """
-Old Macdougal had a farm, E-I-E-I-O
+Old MacDonald had a farm, E-I-E-I-O
And on that farm he had some dogs, E-I-E-I-O
With a bow-wow here, and a bow-wow there,
Here a bow, there a bow, everywhere a bow-wow.
"""

wo dein sanfter Flügel weilt.
"""
```

-We can now create one document for each of the songs, containing as chunks the song sentences.
+We can create one Document for each song, containing the song's lines as chunks:

```python
from docarray import Document, DocumentArray

da.extend([doc1, doc2])
```

Now we can build a feature vector for each line of each song. Here we use a very simple Bag of Words descriptor as
-feature vector.
+the feature vector.

```python
import re

for d in da['@c']:
    d.embedding = bow_feature_vector(d, vocab, tokenizer)
```

-Once we have the data prepared, we can store it into a DocumentArray that supports a subindex.
+Once we've prepared the data, we can store it in a DocumentArray that supports a subindex:

```python
n_features = len(vocab)+2

with da_backend:
    da_backend.extend(da)
```

-Given a query such as `into death` we want to search which song contained a similar sentence.
+Given a query like `into death` we want to search songs that contain a similar sentence. 
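The parent-lookup idea behind this subindex search can also be sketched in plain NumPy. The toy titles, lines, and the `find_song_title` helper below are made up for illustration and do not use DocArray:

```python
import numpy as np


def bow(text, vocab):
    # bag-of-words count vector; index 1 counts unknown tokens, index 0 stays unused (padding)
    v = np.zeros(len(vocab) + 2)
    for token in text.lower().split():
        v[vocab.get(token, 1)] += 1
    return v


# toy corpus: each song owns a list of lines (its "chunks")
songs = {
    'Song A': ['old macdonald had a farm', 'and on that farm he had some dogs'],
    'Song B': ['joy beautiful spark of divinity', 'drunk with fire we enter'],
}

vocab = {}
for lines in songs.values():
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab) + 2)


def find_song_title(query):
    # compare the query against every line and return the owning song's title,
    # mimicking a chunk-level (subindex) match followed by a parent lookup
    qv = bow(query, vocab)
    best_title, best_score = None, -1.0
    for title, lines in songs.items():
        for line in lines:
            lv = bow(line, vocab)
            score = float(qv @ lv) / (np.linalg.norm(qv) * np.linalg.norm(lv) + 1e-9)
            if score > best_score:
                best_title, best_score = title, score
    return best_title


print(find_song_title('drunk with fire'))  # Song B
```

A real subindex additionally persists the chunk embeddings in the configured document store; this sketch only mirrors the chunk-level match plus the parent-title lookup.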
```python def find_song_name_from_song_snippet(query: Document, da_backend) -> str: @@ -320,7 +320,7 @@ query.embedding = bow_feature_vector(query, vocab, tokenizer) similar_items = find_song_name_from_song_snippet(query, da_backend) print(similar_items) ``` -Will print +This prints: ```text -{'song_title': 'Old Macdougal Had a Farm'} +{'song_title': 'Old MacDonald Had a Farm'} ``` From d3421e5f1e38ef6702c24e4cdf5f9b96dcabbfc6 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Tue, 6 Dec 2022 16:53:56 +0100 Subject: [PATCH 02/10] docs(dataclass): polish wording Signed-off-by: Alex C-G --- docs/fundamentals/dataclass/index.md | 50 +++++++++++++--------------- 1 file changed, 24 insertions(+), 26 deletions(-) diff --git a/docs/fundamentals/dataclass/index.md b/docs/fundamentals/dataclass/index.md index fff8e7fe704..b44743e9fce 100644 --- a/docs/fundamentals/dataclass/index.md +++ b/docs/fundamentals/dataclass/index.md @@ -26,7 +26,7 @@ DocArray's dataclass is a high-level API for representing a multimodal document It follows the design and idiom of the standard [Python dataclass](https://docs.python.org/3/library/dataclasses.html), allowing users to represent a complicated multimodal document intuitively and process it easily via DocArray Document/DocumentArray API. -In a nutshell, DocArray provides a decorator `@dataclass` and a set of multimodal types in `docarray.typing`, +In a nutshell, DocArray provides a `@dataclass` decorator and a set of multimodal types in `docarray.typing`, which allows the multimodal document on the left to be represented as the code snippet on the right: ::::{grid} 2 @@ -74,13 +74,13 @@ doc = Document(a) :::: -Under the hood, `doc` is represented as a {class}`~docarray.document.Document` containing a {attr}`~docarray.document.Document.chunks` -each, for `banner`, `headline` and `meta`. 
+Under the hood, `doc` is represented as a {class}`~docarray.document.Document` containing {attr}`~docarray.document.Document.chunks` +for each of `banner`, `headline` and `meta`. But the beauty of DocArray's dataclass is that as a user you don't have to reason about `chunks` at all. -Instead, you define your data structure using your own words, and reason in the domain you are most familiar with. +Instead, you define your data structure in your own words, and reason in the domain you are most familiar with. -Before we continue, let's first spend some time to understand the problem and the rationale behind this feature. +Before we continue, let's spend some time understanding the problem and the rationale behind this feature. ## What is multi-modality? @@ -90,7 +90,7 @@ Before we continue, let's first spend some time to understand the problem and th It is highly recommended that you first read through the last two chapters on Document and DocumentArray before moving on, as they help you understand the problem we are solving here. ``` -A multimodal document is a document that consists of a mixture of data modalities, such as image, text, audio, etc. Let's see some examples in real-world. Considering an article card (left) from The Washington Post and a sound effect card (right) from BBC: +A multimodal document is a document that consists of a mixture of data modalities, such as image, text, audio, etc. Let's see some examples in real-world. Consider an article card (left) from The Washington Post and a sound effect card (right) from the BBC: ::::{grid} 2 @@ -113,19 +113,18 @@ A multimodal document is a document that consists of a mixture of data modalitie :::: -The left card can be seen as a multimodal document: it consists of a sentence, an image, and some tags (i.e. author, column section). The right one can be seen as a collection of multimodal documents, each of which consists of an audio clip and a sentence description. 
+The left card can be seen as a multimodal document: it consists of a sentence, an image, and some tags (i.e. author, column name). The right card can be seen as a collection of multimodal documents, each of which consists of an audio clip and a description.
-
-In practice, we want to express such multimodal documents via Document and DocumentArray, so that we can process each modality and leverage all DocArray's API, e.g. to embed, search, store and transfer them. That's the purpose of DocArray dataclass.
+In practice, we want to express such multimodal documents with Document and DocumentArray, so that we can process each modality and leverage DocArray's full API, e.g. to embed, search, store and transfer the documents. That's the purpose of DocArray's dataclass.

## Understanding the problem

-Given a multimodal document, we want to represent it via our [Document](../document/index.md) object. What we have learned so far is:
-- A Document object is the basic IO unit for almost all [DocArray API](../document/fluent-interface.md).
-- Each Document {ref}`can only contain one type ` of data modality.
-- A Document can be {ref}`nested` under `.chunks` or `.matches`.
+Given a multimodal document, we want to represent it with our [Document](../document/index.md) object. What we've learned so far is:
+- A Document object is the basic IO unit for almost all of [DocArray's API](../document/fluent-interface.md).
+- Each Document {ref}`can only contain one ` data modality.
+- A Document can be {ref}`nested` under `.chunks` or `.matches`.

-Having those in mind, to represent a multimodal document it seems that we need to put each modality as a separated Document and then nested them under a parent Document. For example, the article card from The Washington Post would be represented as follows:
+With those in mind, to represent a multimodal document it seems that we need to put each modality in a separate Document and then nest them under a parent Document. 
For example, the article card from The Washington Post would look like: ::::{grid} 2 @@ -146,20 +145,20 @@ Having those in mind, to represent a multimodal document it seems that we need t :::: -- `Doc1` the image Document, containing `.uri` of the image and `.tensor` representation of that banner image. -- `Doc2` the text Document, containing `.text` field of the card -- `Doc0` the container Document of `Doc1` and `Doc2`, also contains some meta information such as author name, column name in `.tags`. +- `Doc1`, the image Document, containing the image's `.uri`, and `.tensor`. +- `Doc2`, the text Document, containing the card's `.text` field. +- `Doc0`, the container Document of `Doc1` and `Doc2`, also containing meta information like author name, column name in `.tags`. -Having this representation has many benefits, to name a few: -- One can process and apply deep learning methods on each Document (aka modality) separately. +This representation has many benefits: +- You can process and apply deep learning methods on each Document (aka modality) separately. - Or _jointly_, by leveraging the nested relationship at the parent level. -- One can enjoy all DocArray API, [Jina API](https://github.com/jina-ai/jina), [Hub Executors](https://cloud.jina.ai), [CLIP-as-service](https://clip-as-service.jina.ai/) and [Finetuner](https://github.com/jina-ai/finetuner) out of the box, without redesigning the data structure. +- You can enjoy the full DocArray API, [Jina API](https://github.com/jina-ai/jina), [Hub Executors](https://cloud.jina.ai), [CLIP-as-service](https://clip-as-service.jina.ai/) and [Finetuner](https://github.com/jina-ai/finetuner) out of the box, without redesigning the data structure. ## Understanding the challenges -But why do we need a dataclass module, what are the challenges here? +But why do we need a dataclass module? What are the challenges we're trying to solve? -The first challenge is that such mapping is **arbitrary and implicit**. 
Given a real-world multimodal document, it is not straightforward to construct such nested structure for new users of DocArray. The example above is simple, so the answer seems trivial. But what if I want to represent the following newspaper article as one Document? +The first challenge is that such mapping is **arbitrary and implicit**. Given a real-world multimodal document, it's not straightforward to construct such a nested structure for new users of DocArray. The example above is simple, so the answer seems trivial. But what if you want to represent the following newspaper article as one Document? ::::{grid} 2 @@ -180,11 +179,10 @@ The first challenge is that such mapping is **arbitrary and implicit**. Given a :::: -The second challenge is accessing the nested sub-Document. We want to provide users an easy way to access the nested sub-Document. It should be as easy and consistent as how they construct such Document in the first place. - -The final challenge is how to play well with DocArray and Jina Ecosystem, allowing users to leverage existing API, algorithms and models to handle such multimodal documents. To be specific, the user can use multimodal document as the I/O without changing their algorithms and models. +The second challenge is accessing the nested sub-Documents. It should be as easy and consistent as constructing the Document in the first place. -## What's next +The final challenge is playing well with the Jina ecosystem, letting users leverage existing APIs, algorithms and models to handle such multimodal documents. To be specific, users can use multimodal documents as I/O without changing their algorithms and models. -DocArray's dataclass is designed to tackle these challenges by providing an elegant solution based on Python dataclass. It shares the same idiom as Python dataclass, allowing the user to define a multimodal document by adding type annotations. In the next sections, we shall see how it works. +## What's next? 
+DocArray's dataclass is designed to tackle these challenges by providing an elegant solution based on Python dataclass. It shares the same idiom as Python dataclass, allowing the user to define a multimodal document by adding type annotations. In the next sections, we'll see how it works. From a210fea29a9d7c6a539b0b41007d66d51c005a1e Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Tue, 6 Dec 2022 16:53:56 +0100 Subject: [PATCH 03/10] docs(dataclass): polish wording Signed-off-by: Alex C-G --- docs/fundamentals/dataclass/access.md | 38 +++++++------- docs/fundamentals/dataclass/construct.md | 64 +++++++++++++----------- docs/fundamentals/dataclass/example.md | 25 ++++----- docs/fundamentals/dataclass/new-type.md | 16 +++--- 4 files changed, 72 insertions(+), 71 deletions(-) diff --git a/docs/fundamentals/dataclass/access.md b/docs/fundamentals/dataclass/access.md index e896be0bdc9..9bee17811f7 100644 --- a/docs/fundamentals/dataclass/access.md +++ b/docs/fundamentals/dataclass/access.md @@ -7,16 +7,16 @@ modalities by their names. :class: seealso Accessing a modality always returns a Document or a DocumentArray, instead of directly returning the data stored in them. -This ensures maximum flexibility for the use. +This ensures maximum flexibility. -If you want to learn more about the rationale behind this design, you can read our [blog post](https://medium.com/jina-ai/the-next-level-of-multi-modality-in-docarray-and-jina-a97b38280ab0). +To learn more about the rationale behind this design, read our [blog post](https://medium.com/jina-ai/the-next-level-of-multi-modality-in-docarray-and-jina-a97b38280ab0). ``` (mm-access-doc)= ## Document level access -Even after conversion to {class}`~docarray.document.Document`, custom-defines modalities can be accessed by their names, returning a -{class}`~docarray.document.Document` or, for list-types, a {class}`~docarray.array.document.DocumentArray`. 
+Even after conversion to {class}`~docarray.document.Document`, custom-defined modalities can be accessed by their names, returning a +{class}`~docarray.document.Document` or, for list-types, a {class}`~docarray.array.document.DocumentArray`: ```python from docarray import Document, dataclass @@ -38,9 +38,9 @@ doc = Document( ) print(doc.banner) # returns a Document with the test.jpg image tensor -print(doc.banner.tensor) # returns the image tensor +print(doc.banner.tensor) # returns the image tensor directly print(doc.paragraphs) # returns a DocumentArray with one Document per paragraph -print(doc.paragraphs.texts) # returns the paragraph texts +print(doc.paragraphs.texts) # returns the paragraph texts directly ``` @@ -65,7 +65,7 @@ doc.banner.embedding = model(banner_tensor) ### Select nested fields -Nested field, coming from {ref}`nested dataclasses `, can be accessed by selecting the outer field, +Nested fields, coming from {ref}`nested dataclasses `, can be accessed by selecting the outer field, and then selecting the inner field: ```python @@ -98,7 +98,7 @@ this is a description ``` (mm-access-da)= -## DocumentArray level access +## DocumentArray-level access Custom modalities can be accessed through the familiar {ref}`selector syntax `. 
@@ -111,14 +111,14 @@ The fact that a custom modality is accessed is denoted through the addition of a || | | || |-------| || | -|| | --- indicate the field of dataclass (modality name) +|| | --- dataclass field (modality name) || -|| ------ indicate the start of modality selector +|| ------ start of modality selector | -| ---- indicate the start of selector +| ---- start of selector ``` -Selecting a modality form a DocumentArray always results in another DocumentArray: +Selecting a modality from a DocumentArray always results in another DocumentArray: ```python from docarray import Document, dataclass, DocumentArray @@ -218,7 +218,7 @@ da['@.[description]'] ### Select multiple fields -You can select multiple fields by including them in the square brackets, separated by a comma `,`: +You can select multiple fields by including them in the square brackets, separated by commas `,`: ```python da['@.[description, banner]'] @@ -258,7 +258,7 @@ da['@.[description, banner]'] ### Slice dataclass objects -Remember each dataclass object corresponds to one Document object, you can first slice the DocumentArray before selecting the field. Specifically, you can do: +Remember each dataclass object corresponds to one Document object. You can first slice the DocumentArray before selecting the field. Specifically, you can do: ```text @r[slice].[field1, field2, ...] @@ -306,7 +306,7 @@ da['@r[:1].[banner]'] ### Slice `List[Type]` fields -If a field is annotated as a List of DocArray types, it will create a DocumentArray, one can add slicing after the field selector to further restrict the size of the sub-Documents. +If a field is annotated as a List of DocArray types, it creates a DocumentArray. You can add slicing after the field selector to further restrict the number of sub-Documents. 
```{code-block} python --- @@ -351,17 +351,17 @@ test-1.jpeg test-1.jpeg ``` -To summarize, slicing can be put in front of the field selector to restrict the number of dataclass objects; or can be put after the field selector to restrict the number of sub-Documents. +To summarize, slicing can be put in front of the field selector to restrict the number of dataclass objects, or after the field selector to restrict the number of sub-Documents. ### Select nested fields -A field can be annotated as a DocArray dataclass. In this case, the nested structure from the latter dataclass is copied to the former's `.chunks`. To select the deeply nested field, one can simply follow: +A field can be annotated as a DocArray dataclass. In this case, the nested structure from the latter dataclass is copied to the former's `.chunks`. To select the deeply nested field, you can simply follow: ```text @.[field1, field2, ...].[nested_field1, nested_field1, ...] ``` -For example, +For example: ```{code-block} python --- @@ -396,4 +396,4 @@ for d in da['@.[featured].[banner]']: ```text test-1.jpeg test-2.jpeg -``` \ No newline at end of file +``` diff --git a/docs/fundamentals/dataclass/construct.md b/docs/fundamentals/dataclass/construct.md index 8fcdf162c34..ea2563e4724 100644 --- a/docs/fundamentals/dataclass/construct.md +++ b/docs/fundamentals/dataclass/construct.md @@ -2,11 +2,11 @@ # Construct ```{tip} -In DocArray, a Document object can contain sub-Document in `.chunks`. If you are still unaware of this design, make sure to read {ref}`this chapter` before continuing. +In DocArray, a Document object can contain sub-Documents in `.chunks`. If you're unaware of this design, read {ref}`this chapter` before continuing. 
``` -Just like the Python dataclasses module, DocArray provides a decorator {meth}`~docarray.dataclasses.types.dataclass` and a set of type annotations in {mod}`docarray.typing` such as `Image`, `Text`, `Audio`, that allow you to construct multimodal Document in the following way: +Just like the Python dataclasses module, DocArray provides a {meth}`~docarray.dataclasses.types.dataclass` decorator and a set of type annotations in {mod}`docarray.typing` like `Image`, `Text`, `Audio`, that let you construct a multimodal Document: ```python from docarray import dataclass @@ -28,16 +28,16 @@ m = MyMultiModalDoc(avatar='test-1.jpeg', description='hello, world') Be careful when assigning names to your modalities. -Do not use names that are properties of {class}`~docarray.document.Document`, such as +Don't use names that are properties of {class}`~docarray.document.Document`, like `text`, `tensor`, `embedding`, etc. -Instead, use more specific names that fit your domain, such as `avatar` and `description` in the example above. +Instead, use more specific names that fit your domain, like `avatar` and `description` in the example above. -If there is a conflict between the name of a modality and a property of {class}`~docarray.document.Document`, -no guarantees about the behavior while {ref}`accessing ` such a name can be made. +If there's a conflict between a modality name and a {class}`~docarray.document.Document` property, +there may be unexpected behavior when {ref}`accessing ` such a name. 
```

-To convert it into a `Document` object, simply:
+To convert a `MyMultiModalDoc` to a `Document` object, simply:

```python
from docarray import Document

This creates a Document object with two chunks:

````

-To convert a Document object back to a `MyMultiModalDoc` object, do:
+To convert a Document object back to a `MyMultiModalDoc` object:

```python
m = MyMultiModalDoc(d)
```

## Dataclass decorator

-First, you need to import `dataclass` decorator from DocArray package:
+First, import the `dataclass` decorator from the DocArray package:

```python
from docarray import dataclass
```

True
True
```

-That means, [arguments accepted by standard `dataclass`](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) are also accepted here. Methods that can be applied to Python `dataclass` can be also be applied to DocArray `dataclass`.
+That means [arguments accepted by standard `dataclass`](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) are also accepted here. Methods that can be applied to Python's `dataclass` can also be applied to DocArray's `dataclass`.

-To tell if a class or object is DocArray's dataclass, you can use {meth}`~docarray.dataclasses.types.is_multimodal`:
+To tell if a class or object is a DocArray dataclass, you can use {meth}`~docarray.dataclasses.types.is_multimodal`:

```python
from docarray.typing import Image

False
```

-In the sequel, unless otherwise specified `dataclass` always refers to `docarray.dataclass`, not the Python built-in `dataclass`. 
## Annotate class fields -DocArray provides {mod}`docarray.typing` that allows one to annotate a class field as `Image`, `Text`, `JSON`, `Audio`, `Video`, `Mesh`, `Tabular`, `Blob`; or as primitive Python types; or as other `docarray.dataclass`. +DocArray provides {mod}`docarray.typing` that allows you to annotate class fields as + +- `Image`, `Text`, `JSON`, `Audio`, `Video`, `Mesh`, `Tabular`, `Blob` +- primitive Python types +- other `docarray.dataclass` ```python from docarray import dataclass @@ -170,7 +174,7 @@ class MMDoc2: soundfx: Audio = 'white-noise.wav' ``` -Convert `MMDoc2` object into a `Document` object is easy, simply via +Converting `MMDoc2` object into a `Document` object is easy: ```python from docarray import Document @@ -178,7 +182,7 @@ m = MMDoc2() d = Document(m) ``` -One can look at the structure of `d` via `d.summary()`: +You can look at the structure of `d` via `d.summary()`: ````{dropdown} Nested structure (chunks) @@ -221,7 +225,7 @@ One can look at the structure of `d` via `d.summary()`: (mm-annotation)= ## Behavior of field annotation -This section explains the behavior of field annotations in details. +This section explains the behavior of field annotations in detail. - A `dataclass` corresponds to a `Document` object, let's call it `root`. - Unannotated fields are ignored. @@ -329,14 +333,14 @@ This section explains the behavior of field annotations in details. 
| Type annotation | Accepted value types | Behavior | |-----------------|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `Image` | `str`, `numpy.ndarray` | Creates a sub-Document, fills in `doc.tensor` by reading the image and sets `.modality='image'` | -| `Text` | `str` | Creates a sub-Document, fills in `doc.text` by the given value and sets `.modality='text'` | -| `URI` | `str` | Creates a sub-Document, fills in `doc.uri` by the given value | +| `Text` | `str` | Creates a sub-Document, fills in `doc.text` from the given value and sets `.modality='text'` | +| `URI` | `str` | Creates a sub-Document, fills in `doc.uri` from the given value | | `Audio` | `str`, `numpy.ndarray` | Creates a sub-Document, fills in `doc.tensor` by reading the audio and sets `.modality='audio'` | -| `JSON` | `Dict` | Creates a sub-Document, fills in `doc.tags` by the given value and sets `.modality='json'` | +| `JSON` | `Dict` | Creates a sub-Document, fills in `doc.tags` from the given value and sets `.modality='json'` | | `Video` | `str`, `numpy.ndarray` | Creates a sub-Document, fills in `doc.tensor` by reading the video and sets `.modality='video'` | | `Mesh` | `str`, `numpy.ndarray` | Creates a sub-Document, fills in `doc.tensor` by sub-sampling the mesh as point-cloud and sets `.modality='mesh'` | -| `Blob` | `str`, `bytes` | Creates a sub-Document, fills in `doc.blob` by the given value or reading from the path | -| `Tabular` | `str` (file name) | Reads a CSV file, creates a sub-Document for each line and fills in `doc.tags` by considering the first row as the column names and mapping the following lines into the corresponding values. 
| +| `Blob` | `str`, `bytes` | Creates a sub-Document, fills in `doc.blob` from the given value or reading from the path | +| `Tabular` | `str` (file name) | Reads a CSV file, creates a sub-Document for each line and fills in `doc.tags` by considering the first row as column names and mapping subsequent rows into corresponding values. | - A class field labeled with `List[Type]` will create sub-Documents under `root.chunks[0].chunks`. For example, @@ -395,7 +399,7 @@ This section explains the behavior of field annotations in details. โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ ``` ```` -- A field annotated with another `dataclass` will create the full nested structure under the corresponding chunk. +- A field annotated with another `dataclass` will create the full nested structure under the corresponding chunk: ````{tab} Field in another dataclass @@ -471,11 +475,11 @@ This section explains the behavior of field annotations in details. ``` ```` -- A dataclass that has only one field annotated with `docarray.typing` will still create a nested structure under `root.chunks`. In this case, `len(root.chunks)=1` and your multimodal Document has basically a single modality, which may encourage you to think if this is really necessary to use a `dataclass`. After all, each Document represents single modality, and you can just use `Document`. +- A dataclass that has only one field annotated with `docarray.typing` will still create a nested structure under `root.chunks`. In this case, `len(root.chunks)=1` and your multimodal Document has basically a single modality, which may encourage you to consider if you really need to use a `dataclass`. After all, each Document represents a single modality, so in this case you could just use `Document`. 
## Construct from/to Document

-It is easy to convert a `dataclass` object from/to a `Document` object:
+It's easy to convert a `dataclass` object from/to a `Document` object:

```python
from docarray import dataclass, Document
@@ -496,9 +500,9 @@ assert m == m_r

## Use `field()` for advanced configs

-For common and simple use cases, no other functionality is required. There are, however, some dataclass features that require additional per-field information. To satisfy this need for additional information, you can replace the default field value with a call to the provided {meth}`~docarray.dataclasses.types.field` function.
+For common and simple use cases, no other functionality is required. There are, however, some dataclass features that require additional per-field information. For this, you can replace the default field value with a call to the provided {meth}`~docarray.dataclasses.types.field` function.

-For example, mutable object is not allowed as the default value of any dataclass field. One can solve it via:
+For example, a mutable object is not allowed as the default value of any dataclass field. You can solve it via:

```python
from typing import List

@@ -512,13 +516,13 @@ class MMDoc:
    banner: List[Image] = field(default_factory=lambda: ['test-1.jpeg', 'test-2.jpeg'])
```

-Other parameters from the standard the Python field such as `init`, `compare`, `hash`, `repr` are also supported. More details can be [found here](https://docs.python.org/3/library/dataclasses.html#dataclasses.field).
+Other parameters from the standard Python `field()`, like `init`, `compare`, `hash`, and `repr`, are also supported. More details can be [found here](https://docs.python.org/3/library/dataclasses.html#dataclasses.field).

## What's next?

-In this chapter, we have learned to use `@dataclass` decorator and type annotation to build multimodal documents. The look and feel is exactly the same as Python builtin dataclass. 
+In this chapter, we've learned to use the `@dataclass` decorator and type annotations to build multimodal Documents. The look and feel is exactly the same as Python's builtin dataclass.

-Leveraging {ref}`the nested Document structure`, DocArray's dataclass offers great expressiveness for data scientists and machine learning engineers who work with multimodal data, allowing them to represent image, text, video, mesh, tabular data in a very intuitive way. Converting a multimodal dataclass object from/to a Document is very straightforward.
+Leveraging {ref}`the nested Document structure`, DocArray's dataclass offers great expressiveness for data scientists and machine learning engineers who work with multimodal data, allowing them to represent image, text, video, mesh, and tabular data in an intuitive way. Converting a multimodal dataclass object from/to a Document is straightforward.

-In the next chapter, we shall see how to select modality (aka sub-document) via the selector syntax.
\ No newline at end of file
+In the next chapter, we'll see how to select a modality (aka sub-Document) via selector syntax.
diff --git a/docs/fundamentals/dataclass/example.md b/docs/fundamentals/dataclass/example.md
index bf7d246eceb..5943c2f5c94 100644
--- a/docs/fundamentals/dataclass/example.md
+++ b/docs/fundamentals/dataclass/example.md
@@ -1,19 +1,19 @@
# Process Modality

-So far we have learned how to construct and select multimodal Document, we are now ready to leverage DocArray API/Jina/Hub Executor to process the modalities.
+So far we've learned to construct and select multimodal Documents. Now we're ready to leverage DocArray API/Jina/Hub Executor to process the modalities.

-In a nutshell, you need to convert a multimodal dataclass to a Document object (or DocumentArray) before processing it. This is because DocArray API/Jina/Hub Executor always take Document/DocumentArray as the basic IO unit. The following figure illustrates the idea. 
+In a nutshell, you need to convert a multimodal dataclass to a Document object (or DocumentArray) before processing it. This is because DocArray API/Jina/Hub Executor always takes Document/DocumentArray as the basic IO unit:

```{figure} img/process-mmdoc.svg
```

-## Embed image and text via CLIP
+## Embed image and text with CLIP

Developed by OpenAI, CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It is also a perfect model to showcase multimodal dataclass processing.

-Take the code snippet from [the original CLIP repository](https://github.com/openai/CLIP) as an example,
+Take the code snippet from [the original CLIP repository](https://github.com/openai/CLIP) as an example:

```python
import torch
@@ -40,7 +40,7 @@ tensor([[ 0.0547, -0.0061,  0.0495,  ..., -0.6638, -0.1281, -0.4950],
        [ 0.1981, -0.2040, -0.1533,  ..., -0.4514, -0.5664,  0.0596]])
```

-Let's refactor it via dataclass.
+Let's refactor it with a dataclass:

```{code-block} python
---
@@ -85,13 +85,13 @@ tensor([[ 0.0547, -0.0061,  0.0495,  ..., -0.6638, -0.1281, -0.4950],
        [ 0.1981, -0.2040, -0.1533,  ..., -0.4514, -0.5664,  0.0596]])
```

-## Embed via CLIP-as-service
+## Embed with CLIP-as-service

-[CLIP-as-service](https://github.com/jina-ai/clip-as-service) is a low-latency high-scalability service for embedding images and text. It can be easily integrated as a microservice into neural search solutions.
+[CLIP-as-service](https://github.com/jina-ai/clip-as-service) is a low-latency, high-scalability service for embedding images and text. You can easily integrate it into neural search solutions as a microservice.

-To use CLIP-as-service to process a dataclass object is extremely simple, which should also show you the idea to use existing Executors or services without touching their codebase. 
+Using CLIP-as-service to process a dataclass object is simple, and also demonstrates how to use existing Executors or services without touching their codebase.

-1. Construct the dataclass.
+1. Construct the dataclass:
    ```python
    from docarray import dataclass, field, Document, DocumentArray
    from docarray.typing import Text, Image
@@ -111,13 +111,13 @@ To use CLIP-as-service to process a dataclass object is extremely simple, which
    m3 = MMDoc(banner='CLIP.png', title='a cat')
    ```

-3. Convert them into a DocumentArray.
+3. Convert them into a DocumentArray:

    ```python
    da = DocumentArray([Document(m1), Document(m2), Document(m3)])
    ```

-4. Select the modality via the selector syntax and send via client
+4. Select the modality via the selector syntax and send it with the client:

    ```python
    from clip_client import Client
@@ -136,6 +136,3 @@ To use CLIP-as-service to process a dataclass object is extremely simple, which
    [ 0.1442    0.02275  -0.291    ... -0.4468   -0.3416    0.1798  ]
    [ 0.1985   -0.204    -0.1534   ... -0.4507   -0.5664    0.0598  ]]
    ```
-
-
-
diff --git a/docs/fundamentals/dataclass/new-type.md b/docs/fundamentals/dataclass/new-type.md
index 45274b7ba83..78f4d09cb1d 100644
--- a/docs/fundamentals/dataclass/new-type.md
+++ b/docs/fundamentals/dataclass/new-type.md
@@ -1,12 +1,12 @@
# Support New Modality

-Each type in `docarray.typing` corresponds to one modality. Supporting a new modality means adding a new type, and specifying how it is translated from/to Document.
+Each type in `docarray.typing` corresponds to one modality. Supporting a new modality means adding a new type, and specifying how it is translated to/from Document.

-Whether it is about adding a new type, or changing the behavior of an existing type, you can leverage the {meth}`~docarray.dataclasses.types.field` function.
+Whether you're adding a new type or changing the behavior of an existing type, you can leverage the {meth}`~docarray.dataclasses.types.field` function. 
-## Create new types +## Create a new type -Say you want to define a new type `MyImage`, where image is accepted as a URI, but instead of loading it to `.tensor` of the sub-document, you want to load it to `.blob`. This is different from the built-in `Image` type {ref}`behavior`. +Say you want to define a new type `MyImage`, where image is accepted as a URI. However, instead of loading it to `.tensor` of the sub-document, you want to load it to `.blob`. This is different from the built-in `Image` type {ref}`behavior`. All you need to do is: @@ -63,13 +63,13 @@ Document(MMDoc()).summary() ```` -Specifically, `setter` defines how you want to store the value in the sub-document. Usually you need to process it and fill the value into one of the attributes {ref}`defined by the Document schema`. You may also want to keep the original value so that you can recover it in `getter` later. `setter` will be invoked when calling `Document()` on this dataclass. +Specifically, `setter` defines how you want to store the value in the sub-Document. Usually you need to process it and store the value in one of the attributes {ref}`defined by the Document schema`. You may also want to keep the original value so that you can recover it in `getter` later. `setter` is invoked when calling `Document()` on this dataclass. -`getter` defines how you want to recover the original value from the sub-Document. `getter` will be invoked when calling dataclass constructor given a Document object. +`getter` defines how you want to recover the original value from the sub-Document. `getter` is invoked when calling the dataclass constructor given a Document object. ## Override existing types -To override `getter`, `setter` behavior of the existing types, you can define a map and pass it to the argument of `type_var_map` in the {meth}`~docarray.dataclasses.types.dataclass` function. 
+To override the `getter` and `setter` behaviors of existing types, define a map and pass it to the argument of `type_var_map` in the {meth}`~docarray.dataclasses.types.dataclass` function: ```python from docarray import dataclass, field, Document @@ -104,4 +104,4 @@ assert m1 == m2 ```text im setting .uri only not loading it! im returning .uri! -``` \ No newline at end of file +``` From e158407db688e165c89dd92606a96bd22ef637cf Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Thu, 8 Dec 2022 11:48:22 +0100 Subject: [PATCH 04/10] docs(document store): polish wording Signed-off-by: Alex C-G --- docs/advanced/document-store/annlite.md | 2 +- docs/advanced/document-store/elasticsearch.md | 2 +- docs/advanced/document-store/index.md | 141 ++++++++---------- docs/advanced/document-store/qdrant.md | 63 ++++---- docs/advanced/document-store/redis.md | 2 +- docs/advanced/document-store/sqlite.md | 2 +- docs/advanced/document-store/weaviate.md | 110 +++++++------- 7 files changed, 154 insertions(+), 168 deletions(-) diff --git a/docs/advanced/document-store/annlite.md b/docs/advanced/document-store/annlite.md index 922b4938306..ead843b4ee6 100644 --- a/docs/advanced/document-store/annlite.md +++ b/docs/advanced/document-store/annlite.md @@ -1,7 +1,7 @@ (annlite)= # Annlite -One can use [Annlite](https://github.com/jina-ai/annlite) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Annlite](https://github.com/jina-ai/annlite) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `annlite`. 
You can install it via `pip install "docarray[annlite]".` diff --git a/docs/advanced/document-store/elasticsearch.md b/docs/advanced/document-store/elasticsearch.md index b55e3ba3172..391d5da8a00 100644 --- a/docs/advanced/document-store/elasticsearch.md +++ b/docs/advanced/document-store/elasticsearch.md @@ -2,7 +2,7 @@ # Elasticsearch -One can use [Elasticsearch](https://www.elastic.co) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Elasticsearch](https://www.elastic.co) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `elasticsearch`. You can install it via `pip install "docarray[elasticsearch]".` diff --git a/docs/advanced/document-store/index.md b/docs/advanced/document-store/index.md index ee71868fbab..665fc80b2ef 100644 --- a/docs/advanced/document-store/index.md +++ b/docs/advanced/document-store/index.md @@ -14,12 +14,11 @@ extend benchmark ``` -Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory, e.g. in SQLite, Redis. -The benefit of using an external store over an in-memory store is often about longer persistence and faster retrieval. +Documents inside a DocumentArray can live in a [document store](https://en.wikipedia.org/wiki/Document-oriented_database) instead of in memory (e.g. in SQLite or Redis). Compared to an in-memory store, document stores offer longer persistence and faster retrieval. -The look-and-feel of a DocumentArray with external store is **almost the same** as a regular in-memory DocumentArray. This allows users to easily switch between backends under the same DocArray idiom. +DocumentArrays with a document store look and feel **almost the same** as a regular in-memory DocumentArray. 
This lets you easily switch backends under the same DocArray idiom. -Take SQLite as an example. Using it as the storage backend of a DocumentArray is as simple as follows: +Let's take SQLite as an example. Using it as the storage backend of a DocumentArray is simple: ```python from docarray import DocumentArray, Document @@ -58,19 +57,18 @@ da.summary() โ”‚ โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ ``` -Note that `da` was modified inside a `with` statement. This context manager ensures that the the `DocumentArray` indices, -which allow users to access the `DocumentArray` by position (allowing statements such as `da[1]`), +Note that `da` was modified inside a `with` statement. This context manager ensures that `DocumentArray` indices, +which let you access the `DocumentArray` by position (allowing statements such as `da[1]`), are properly mapped and saved to the storage backend. -This is the recommended default usage to modify a DocumentArray that lives on a document store to avoid -unexpected behaviors that can yield to, for example, inaccessible elements by position. +This is the recommended way to modify a DocumentArray that lives in a document store to avoid +unexpected behaviors that can lead to, for example, inaccessible elements by position. - -The procedures for creating, retrieving, updating, and deleting Documents are identical to those for a regular {ref}`DocumentArray`. All DocumentArray methods such as `.summary()`, `.embed()`, `.plot_embeddings()` should also work out of the box. +The procedures for creating, retrieving, updating, and deleting Documents are just the same as for a regular {ref}`DocumentArray`. All DocumentArray methods like `.summary()`, `.embed()`, `.plot_embeddings()` also work out of the box. 
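To make the role of the context manager concrete, here is a toy `sqlite3` sketch of the same pattern. It is not DocArray's actual implementation -- just an illustration of why exiting the `with` block is what persists the positional index:

```python
import os
import sqlite3
import tempfile

class SqliteDocs:
    """Toy store-backed, list-like container (illustrative only, not DocArray)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute('CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT)')
        self.conn.execute('CREATE TABLE IF NOT EXISTS offsets (pos INTEGER PRIMARY KEY, id TEXT)')
        # rebuild the position -> id index persisted by a previous session
        self._offsets = [r[0] for r in self.conn.execute('SELECT id FROM offsets ORDER BY pos')]

    def append(self, doc_id, text):
        self.conn.execute('INSERT OR REPLACE INTO docs VALUES (?, ?)', (doc_id, text))
        self._offsets.append(doc_id)  # positional index kept in memory until flushed

    def __getitem__(self, pos):
        doc_id = self._offsets[pos]
        return self.conn.execute('SELECT text FROM docs WHERE id = ?', (doc_id,)).fetchone()[0]

    def __len__(self):
        return len(self._offsets)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # flush the positional index so the next session can access items by position
        self.conn.execute('DELETE FROM offsets')
        self.conn.executemany('INSERT INTO offsets VALUES (?, ?)', list(enumerate(self._offsets)))
        self.conn.commit()

path = os.path.join(tempfile.mkdtemp(), 'docs.db')

with SqliteDocs(path) as da:  # mutate inside the `with` block...
    da.append('a', 'hello')
    da.append('b', 'world')

da2 = SqliteDocs(path)  # ...so a fresh session can resolve positions like da2[1]
```

Without the flush in `__exit__`, the documents would still land in the store, but a later session couldn't resolve `da2[1]` to the right Document -- exactly the kind of surprise the `with` pattern avoids.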
## Construct

-There are two ways to initialize a DocumentArray with an external storage backend.
+You can initialize a DocumentArray with an external storage backend in one of two ways:

````{tab} Specify storage
@@ -86,7 +84,7 @@ da = DocumentArray(storage='sqlite')
```
````

-````{tab} Import the class and alias it
+````{tab} Import and alias the class

```python
from docarray.array.sqlite import DocumentArraySqlite as DocumentArray
@@ -100,7 +98,7 @@ da = DocumentArray()
````

-Depending on the context, you can choose the style that fits better. For example, if you want to use a class method such as `DocumentArray.empty(10)`, then explicitly importing `DocumentArraySqlite` is the way to go. Of course, you can choose not to alias the imported class to make the code even more explicit.
+Depending on the context, you can choose the style that fits best. If you want to use a class method like `DocumentArray.empty(10)`, you should explicitly import `DocumentArraySqlite`. Alternatively, you can choose not to alias the imported class to make the code even more explicit.

```{admonition} Subindices
:class: seealso

If you combine a document store with subindices,
you can squeeze even more performance out of your document store.

To learn how to do that, see {ref}`here `.
```

-### Construct with config
+### Construct with configuration

-The config of a store backend is either store-specific dataclass object or a `dict` that can be parsed into the former.
+The document store's configuration is either a store-specific dataclass object or a `dict` that can be parsed into that object.

-You can pass the config in the constructor via `config`:
+You can pass the configuration in the constructor via `config`:

````{tab} Use dataclass

```python
@@ -143,21 +141,31 @@ da = DocumentArray(
```

````

-Using dataclass gives you better type-checking in IDE but requires an extra import; using dict is more flexible but can be error-prone. You can choose the style that fits best to your context. 
+Dataclasses give you better type-checking in your IDE but require an extra import; dicts are more flexible but can be error-prone. You can choose the style that best fits your context.

```{admonition} Creating DocumentArrays without specifying index
:class: warning

-When you specify an index (table name for SQL stores) in the config, the index will be used to persist the DocumentArray in the document store.
-If you create a DocumentArray but do not specify an index, a randomized placeholder index will be created to persist the data.
+When you specify an index (table name for SQL stores) in the configuration, the index will be used to persist the DocumentArray in the document store.
+If you create a DocumentArray but do not specify an index, a random placeholder index will be created to persist the data.

-Creating DocumentArrays without indexes is useful during prototyping but should not be used in a production setting as randomized placeholder data will be persisted in the document store unnecessarily.
+Creating DocumentArrays without indexes is useful during prototyping but shouldn't be used in production, as random placeholder data will be persisted in the document store unnecessarily.
```

-
## Feature summary

-DocArray supports multiple storage backends with different search features. The following table showcases relevant functionalities that are supported (✅) or not supported (❌) in DocArray depending on the backend:
+Each document store supports different functionalities. The three key ones are:
+
+- **vector search**: perform approximate nearest neighbour search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding.
+- **vector search + filter**: perform approximate nearest neighbour search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding and a filter.
+
+- **filter**: perform a filter step over the data. 
The search function's input is a filter.
+
+You can use **vector search** and **vector search + filter** via the DocumentArray's {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` methods. **Filter** functionality, on the other hand, is only available via the `.find()` method.
+
+A detailed explanation of the differences between `.find` and `.match` can be found [here](./../../../fundamentals/documentarray/matching).
+
+This table shows which of these functionalities each document store supports (✅) or doesn't support (❌):

| Name | Construction | Vector search | Vector search + Filter | Filter |
|---------------------------------------|------------------------------------------|---------------|------------------------|--------|
| [`ElasticSearch`](./elasticsearch.md) | `DocumentArray(storage='elasticsearch')` | ✅ | ✅ | ✅ |
| [`Redis`](./redis.md) | `DocumentArray(storage='redis')` | ✅ | ✅ | ✅ |

-The right backend choice depends on the scale of your data, the required performance and the desired ease of setup. For most use cases we recommend starting with [`AnnLite`](./annlite.md).
+The right backend choice for you depends on the scale of your data, the required performance and the desired ease of setup. For most use cases we recommend starting with [`AnnLite`](./annlite.md).

[**Check our One Million Scale Benchmark for more details**](./benchmark#conclusion).
-
-Here we understand by
-
-- **vector search**: perform approximate nearest neighbour search (or exact full scan search). The input of the search function is a numpy array or a DocumentArray containing an embedding.
-
-- **vector search + filter**: perform approximate nearest neighbour search (or exact full scan search). The input of the search function is a numpy array or a DocumentArray containing an embedding and a filter. 
- -- **filter**: perform a filter step over the data. The input of the search function is a filter. - -The capabilities of **vector search**, **vector search + filter** can be used using the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` methods through a `DocumentArray`. -The **filter** functionality is available using the `.find` method in a `DocumentArray`. -A detailed explanation of the differences between `.find` and `.match` can be found [here](./../../../fundamentals/documentarray/matching) - ### Vector search example -Example of **vector search** - -````{tab} .find +````{tab} .find() ```python from docarray import Document, DocumentArray @@ -210,7 +203,7 @@ result[:, 'embedding'] ``` ```` -````{tab} .match +````{tab} .match() ```python from docarray import Document, DocumentArray @@ -242,9 +235,7 @@ array([[2., 2., 2.], ### Vector search with filter example -Example of **vector search + filter** - -````{tab} .find +````{tab} .find() ```python from docarray import Document, DocumentArray @@ -276,7 +267,7 @@ results[:, 'embedding'] ``` ```` -````{tab} .match +````{tab} .match() ```python from docarray import Document, DocumentArray @@ -317,8 +308,6 @@ array([[2., 2., 2.], ### Filter example -Example of **filter** - ```python from docarray import Document, DocumentArray import numpy as np @@ -356,7 +345,7 @@ array([[7., 7., 7.], ## Persistence, mutations and context manager -Having DocumentArrays that are backed by a document store introduces an extra consideration into the way you think about DocumentArrays. +Using DocumentArrays backed by a document store introduces an extra consideration into the way you think about DocumentArrays. The DocumentArray object created in your Python program is now a view of the underlying implementation in the document store. This means that your DocumentArray object in Python can be out of sync with what is persisted to the document store. 
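The view/store split can be seen in miniature with plain `sqlite3` -- a toy sketch, not how DocArray is implemented, but it shows how a Python-side object that snapshots state at construction time goes stale as soon as something else mutates the store:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'store.db')

writer = sqlite3.connect(path)
writer.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY)')
writer.commit()

class View:
    """A Python-side view that caches the store's length when it is created."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.cached_len = self.conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]

v1 = View(path)                                    # snapshots a length of 0
writer.execute('INSERT INTO docs DEFAULT VALUES')  # the store is mutated elsewhere
writer.commit()

v2 = View(path)  # a fresh view sees the new row, while v1.cached_len is stale
```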
@@ -382,18 +371,18 @@ Executing this script multiple times yields the same result.

When you run the line `da1.append(Document())`, you expect the DocumentArray with `index_name='my_index'` to now have a length of `1`.
However, when you try to create another view of the DocumentArray in `da2`, you get a fresh DocumentArray.

-You also expect the script to increment the length of the DocumentArrays every time you run it.
-This is because the previous run should have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run will append a new document, incrementing the length by `+1` each time.
+You would also expect the script to increment the length of the DocumentArrays every time you run it.
+This is because the previous run _should_ have saved the length of the DocumentArray with `index_name="my_index"` and your most recent run will append a new Document, incrementing the length by `1` each time.
However, it seems like your append operation is also not being persisted.

````{dropdown} What actually happened here?

The DocumentArray actually did persist, but not in the way you might expect.

-Since you did not use the `with` context manager or scope your mutation, the persistence logic is being evaluated when the program exits.
+Since you didn't use the `with` context manager or scope your mutation, the persistence logic is being evaluated when the program exits.

`da1` is destroyed first, persisting the DocumentArray of length `1`. But when `da2` is destroyed, it persists a DocumentArray of length `0` to the same index in Redis as `da1`, overriding its value.

-This means that if you had not created `da2`, the overriding would not have occured and the script would actually increment the length of the DocumentArray correctly.
+This means that if you had not created `da2`, the override wouldn't have occurred and the script would actually increment the length of the DocumentArray correctly. 
You can prove this to yourself by commenting out the last 2 lines of the script and running the script repeatedly.

**Script**
@@ -422,7 +411,7 @@ Length of da1 is 3
```

````

-Now that you know the issue, let's explore what you should do to work with DocumentArrays backed by document store in a more predictable manner.
+Now that you know the issue, let's explore how to work more predictably with DocumentArrays backed by a document store.

````{tab} Use with
@@ -473,16 +462,16 @@ Length of da1 is 3
Length of da2 is 3
```

-The append you made to the DocumentArray is now persisted properly. Hurray!
+The `append()` you made to the DocumentArray is now persisted properly. Hooray!

-The recommended way to sync data to the document store is to use the DocumentArray inside the `with` context manager.
+We recommend syncing data to the document store by using the DocumentArray inside the `with` context manager.


## Known limitations

-### Multiple references to the same storage backend
+### Multiple references to the same document store

-Let's see an example with ANNLite storage backend, other storage backends would also have the same problem. Let's create two DocumentArrays `da` and `db` that point the same storage backend:
+Let's see an example with the AnnLite document store (other document stores would also have the same problem). Let's create two DocumentArrays `da` and `db` that point to the same document store:

```python
from docarray import DocumentArray, Document
@@ -501,10 +490,10 @@ The output is:
0
```

-Looks like `db` is not really up-to-date with `da`. This is true and false. True in the sense that `1` is not `0`, number speaks by itself.
-False in the sense that, the Document is already written to the storage backend. You just can't see it.
+It looks like `db` is not really up-to-date with `da`. This is both true and false. True because `1` is clearly not `0`.
+False because the Document is already written to the document store -- you just can't see it. 
-To prove it does persist, run the following code snippet multiple times and you will see the length is increasing one at a time:
+To prove it persists, run the following code snippet multiple times and you'll see the length increment by one each time:

```python
from docarray import DocumentArray, Document
@@ -514,10 +503,10 @@ da.append(Document(text='hello'))
print(len(da))
```

-Simply put, the reason of this behavior is that certain meta information **not synced immediately** to the backend on *every* operation; it would be very costly to do so.
-As a consequence, your multiple references to the same backend would look different if they are written in one code block as the example above.
+Simply put, the reason for this behavior is that certain meta information is **not synced immediately** to the document store on *every* operation -- it would be very costly to do so.
+As a consequence, your multiple references to the same document store would look different if they were written in one code block, as in the example above.

-To solve this problem, simply use `with` statement and use DocumentArray as a context manager. The last example can be refactored into the following:
+To solve this problem, simply use a `with` statement and use the DocumentArray as a context manager. The prior example can be refactored as follows:

```{code-block} python
---
@@ -540,13 +529,13 @@ Now you get the correct output:
1
```

-Take home message is, use the context manager and put your write operations into the `with` block, when you work with multiple references in a row.
+In short, use the context manager and put your write operations into the `with` block when you work with multiple references in a row.

### Out-of-array modification

-You can not take a Document *out* from a DocumentArray and modify it, then expect its modification to be committed back to the DocumentArray. 
You can't take a Document *out* of a DocumentArray, modify it, and then expect the modification to be committed back to the DocumentArray.

-Specifically, the pattern below is not supported by any external store backend:
+Specifically, no document store supports the pattern below:

```python
from docarray import DocumentArray
@@ -564,21 +553,21 @@ The solution is simple: use {ref}`column-selector`:
da[0, 'text'] = 'hello'
```

-### Performance issue caused by list-like structure
+### Performance issues caused by list-like structure
+
+DocArray allows list-like behavior by adding an offset-to-id mapping structure to document stores. This feature stores meta information about Document order along with the Documents themselves in the document store.

-DocArray allows list-like behavior by adding an offset-to-id mapping structure to storage backends. Such feature (adding this structure) means the database stores,
-along with documents, meta information about document order.
-However, list_like behavior is not useful in indexers where concurrent usage is possible and users do not need information about document location.
-Besides, updating list-like operation comes with a cost.
-You can disable list-like behavior in the config as follows
+However, list-like behavior isn't useful in indexers where concurrent usage is possible and you don't need information about a Document's position, and maintaining the list-like structure comes at a cost.
+
+You can disable list-like behavior as follows:

```python
from docarray import DocumentArray

da = DocumentArray(storage='annlite', config={'n_dim': 2, 'list_like': False})
```

-When list_like is disabled, all the list-like operations will not be allowed and raise errors.
-like this:
+When `list_like` is disabled, list-like operations are not allowed and raise errors:
+
```python
from docarray import DocumentArray, Document
import numpy as np
@@ -601,8 +590,8 @@ By default, `list_like` will be true. 
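The offset-to-id bookkeeping can be sketched in plain Python (illustrative only -- not DocArray's actual data structure). Positional access goes through an ordered list of ids, and operations like `insert` have to shift that list, which is the extra cost that `list_like=False` avoids:

```python
class OffsetToIdStore:
    """Sketch: documents keyed by id, plus an ordered id list for positional access."""

    def __init__(self):
        self.docs = {}       # id -> document payload (the actual storage)
        self.offset2id = []  # position -> id (the list-like bookkeeping)

    def append(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.offset2id.append(doc_id)

    def insert(self, pos, doc_id, doc):
        self.docs[doc_id] = doc
        self.offset2id.insert(pos, doc_id)  # shifts every later offset: O(n)

    def __getitem__(self, pos):
        return self.docs[self.offset2id[pos]]

    def __len__(self):
        return len(self.offset2id)

store = OffsetToIdStore()
store.append('a', 'first')
store.append('b', 'second')
store.insert(0, 'c', 'zeroth')  # positional order changes, id-based lookups don't
```

A pure key-value indexer only ever needs `self.docs`, so dropping `offset2id` (`list_like=False`) removes both the storage and the update cost.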
-### Elements access is slower
+### Slower element access

-Obviously, a DocumentArray with on-disk storage is slower than in-memory DocumentArray. However, if you choose to use on-disk storage, then often your concern of persistence overwhelms the concern of efficiency.
+Obviously, a DocumentArray with on-disk storage is slower than an in-memory DocumentArray. However, if you choose on-disk storage, your concern for persistence usually outweighs your concern for efficiency.

-Slowness can affect all functions of DocumentArray. On the bright side, they may not be that severe as you would expect. Modern database are highly optimized. Moreover, some database provides faster method for resolving certain queries, e.g. nearest-neighbour queries. We are actively and continuously improving DocArray to better leverage those features.
+Slowness can affect all functions of DocumentArray. On the bright side, the slowdowns may not be as severe as you would expect -- modern databases are highly optimized. Moreover, some databases provide faster methods for resolving certain queries, e.g. nearest-neighbour queries. We are actively and continuously improving DocArray to better leverage those features.
diff --git a/docs/advanced/document-store/qdrant.md b/docs/advanced/document-store/qdrant.md
index b7438f5499d..77df8a80596 100644
--- a/docs/advanced/document-store/qdrant.md
+++ b/docs/advanced/document-store/qdrant.md
@@ -1,18 +1,17 @@
(qdrant)=
# Qdrant

-One can use [Qdrant](https://qdrant.tech) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`.
+You can use [Qdrant](https://qdrant.tech) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`.

````{tip}
This feature requires `qdrant-client`. 
You can install it with `pip install "docarray[qdrant]".` ```` ## Usage ### Start Qdrant service -To use Qdrant as the storage backend, you need a running Qdrant server. You can use the Qdrant Docker image to run a -server. Create `docker-compose.yml` as follows: +To use Qdrant as the storage backend, you need a running Qdrant server. You can create `docker-compose.yml` to use the Qdrant Docker image: ```yaml --- @@ -38,7 +37,7 @@ docker-compose up ### Create DocumentArray with Qdrant backend -Assuming service is started using the default configuration (i.e. server address is `http://localhost:6333`), one can +Assuming you start the service with the default configuration (i.e. server address is `http://localhost:6333`), you can instantiate a DocumentArray with Qdrant storage like so: ```python @@ -47,9 +46,9 @@ from docarray import DocumentArray da = DocumentArray(storage='qdrant', config={'n_dim': 10}) ``` -The usage would be the same as the ordinary DocumentArray. +The usage is the same as an ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the `collection_name`, the `host` and the `port`. +To access a formerly-persisted DocumentArray, you can specify the `collection_name`, `host` and `port`: ```python @@ -68,13 +67,11 @@ da = DocumentArray( da.summary() ``` -Note that specifying the `n_dim` is mandatory before using Qdrant as a backend for DocumentArray. +Note that you must specify `n_dim` before using Qdrant as a backend for DocumentArray. -Other functions behave the same as in-memory DocumentArray. +Other functions behave the same as an in-memory DocumentArray. 
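For intuition, here's a standard-library sketch of what a nearest-neighbor query like `.find` computes conceptually. A real backend such as Qdrant answers this with a vector index (e.g. HNSW) rather than the brute-force scan shown here, and the vectors below are made up for illustration.

```python
import math

# Conceptual sketch only: what a nearest-neighbor query like `.find`
# computes. A real vector store uses an index instead of this full scan.


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm


def find(query, embeddings, limit=3):
    # score every stored vector against the query, return the `limit` closest
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine_distance(query, embeddings[i]))
    return ranked[:limit]


embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(find([1.0, 0.1], embeddings, limit=2))  # [0, 1]
```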
-## Config - -The following configs can be set: +## Configuration | Name | Description | Default | |-----------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------| @@ -86,12 +83,12 @@ The following configs can be set: | `grpc_port` | Port of the Qdrant gRPC interface | `6334` | | `prefer_grpc` | Set `True` to use gPRC interface whenever possible in custom methods | `False` | | `api_key` | API key for authentication in Qdrant Cloud | `None` | -| `https` | Set `True` to use HTTPS(SSL) protocol | `None` | -| `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | `None` | +| `https` | Set `True` to use HTTPS (SSL) protocol | `None` | +| `serialize_config` | [Serialization configuration of each Document](../../../fundamentals/document/serialization.md) | `None` | | `scroll_batch_size` | Batch size used when scrolling over the storage | `64` | | `ef_construct` | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | `None`, defaults to the default value in Qdrant* | -| `full_scan_threshold` | Minimal size (in KiloBytes) of vectors for additional payload-based indexing | `None`, defaults to the default value in Qdrant* | -| `m` | Number of edges per node in the index graph. Larger = more accurate search, more space required | `None`, defaults to the default value in Qdrant* | +| `full_scan_threshold` | Minimum size (in kilobytes) of vectors for additional payload-based indexing | `None`, defaults to the default value in Qdrant* | +| `m` | Number of edges per node in the index graph. 
Higher = more accurate search, more space required | `None`, defaults to the default value in Qdrant* | | `columns` | Other fields to store in Document | `None` | | `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True | @@ -147,15 +144,14 @@ print(da.find(np.random.random(D), limit=10)) (qdrant-filter)= ## Vector search with filter -Search with `.find` can be restricted by user-defined filters. Such filters can be constructed following the guidelines -in [Qdrant's Documentation](https://qdrant.tech/documentation/filtering/) - +You can restrict search with `.find` by using user-defined filters. These can be constructed following the guidelines +in [Qdrant's documentation](https://qdrant.tech/documentation/filtering/) -### Example of `.find` with a filter +### Example of `.find` with filter -Consider Documents with embeddings `[0,0,0]` up to ` [9,9,9]` where the document with embedding `[i,i,i]` -has as tag `price` with value `i`. We can create such example with the following code: +Let's create Documents with embeddings `[0,0,0]` up to `[9,9,9]`, where each Document (which has an embedding `[i,i,i]`) +has a tag `price` with value `i`: ```python from docarray import Document, DocumentArray @@ -184,9 +180,9 @@ for embedding, price in zip(da.embeddings, da[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -Consider we want the nearest vectors to the embedding `[8. 8. 8.]`, with the restriction that prices must follow a filter. As an example, retrieved Documents must have `price` value lower than or equal to `max_price`. We can encode this information in Qdrant using `filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}`. You can also pass additional `search_params` following [Qdrant's Search API](https://qdrant.tech/documentation/search/#search-api). +We want the nearest vectors to the embedding `[8. 8. 
8.]`, with the restriction that prices must follow a filter. For example, retrieved Documents must have `price` value lower than or equal to `max_price`. You can encode this information in Qdrant using `filter = {'must': [{'key': 'price', 'range': {'lte': max_price}}]}`. You can also pass additional `search_params` following [Qdrant's Search API](https://qdrant.tech/documentation/search/#search-api). -Then you can implement and use the search with the proposed filter: +You can then implement and search with the proposed filter: ```python max_price = 7 @@ -203,7 +199,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -This would print: +This prints: ``` Query vector: [8. 8. 8.] @@ -216,8 +212,9 @@ Embeddings Nearest Neighbours with "price" at most 7: embedding=[4. 4. 4.], price=4 ``` ### Example of `.filter` with a filter -The following example shows how to use DocArray with Qdrant Document Store in order to filter text documents. -Consider Documents have the tag `price` with a value of `i`. We can create these with the following code: + +The following example shows how to use DocArray with Qdrant document store to filter text documents. +Let's create Documents with the tag `price` with a value of `i`: ```python from docarray import Document, DocumentArray import numpy as np @@ -241,11 +238,13 @@ print('\nIndexed Prices:\n') for embedding, price in zip(da.embeddings, da[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -For example, suppose we want to filter results such that -retrieved documents must have a `price` value less than or equal to `max_price`. We can encode -this information in Qdrant using `filter = {'price': {'$lte': max_price}}`. 
-Then you can implement and use the search with the proposed filter: +If you want to filter only for results +with a `price` less than or equal to `max_price`, you can encode +this information using `filter = {'price': {'$lte': max_price}}`. + +You can then implement and search with the proposed filter: + ```python max_price = 7 n_limit = 4 @@ -267,4 +266,4 @@ Points with "price" at most 7: embedding=[7. 7. 7.], price=7 embedding=[1. 1. 1.], price=1 embedding=[2. 2. 2.], price=2 -``` \ No newline at end of file +``` diff --git a/docs/advanced/document-store/redis.md b/docs/advanced/document-store/redis.md index aee5d43560b..c0d289b9292 100644 --- a/docs/advanced/document-store/redis.md +++ b/docs/advanced/document-store/redis.md @@ -1,7 +1,7 @@ (redis)= # Redis -You can use [Redis](https://redis.io) as the document store for DocumentArray. It is useful when you want to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Redis](https://redis.io) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `redis`. You can install it via `pip install "docarray[redis]".` diff --git a/docs/advanced/document-store/sqlite.md b/docs/advanced/document-store/sqlite.md index 6184588d1de..aac3fdc56ed 100644 --- a/docs/advanced/document-store/sqlite.md +++ b/docs/advanced/document-store/sqlite.md @@ -1,7 +1,7 @@ (sqlite)= # SQLite -One can use SQLite as the document store for DocumentArray. It is useful when you want to access a large number Document which can not fit into memory. +You can use SQLite as a document store for DocumentArray. It's suitable for accessing a large number of Documents which can't fit in memory. 
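The core idea can be sketched with Python's built-in `sqlite3` module: Documents live on disk in a table, so the collection can exceed memory and individual Documents are fetched by id. The table and column names below are illustrative, not DocArray's actual schema.

```python
import json
import sqlite3

# Minimal sketch of the idea behind an SQLite-backed document store.
# Table/column names are illustrative -- not DocArray's real schema.

conn = sqlite3.connect(':memory:')  # use a file path for real persistence
conn.execute('CREATE TABLE docs (doc_id TEXT PRIMARY KEY, body TEXT)')

# insert documents as serialized JSON
docs = [{'id': f'doc{i}', 'text': f'hello {i}'} for i in range(3)]
conn.executemany(
    'INSERT INTO docs VALUES (?, ?)',
    [(d['id'], json.dumps(d)) for d in docs],
)

# retrieve a single document by id without loading the whole table into memory
row = conn.execute("SELECT body FROM docs WHERE doc_id = 'doc1'").fetchone()
print(json.loads(row[0])['text'])  # hello 1
```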
## Usage diff --git a/docs/advanced/document-store/weaviate.md b/docs/advanced/document-store/weaviate.md index 3563a9232cd..47407042385 100644 --- a/docs/advanced/document-store/weaviate.md +++ b/docs/advanced/document-store/weaviate.md @@ -1,13 +1,13 @@ (weaviate)= # Weaviate -One can use [Weaviate](https://weaviate.io) as the document store for DocumentArray. It is useful when one wants to have faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. +You can use [Weaviate](https://weaviate.io) as a document store for DocumentArray. It's suitable for faster Document retrieval on embeddings, i.e. `.match()`, `.find()`. ````{tip} This feature requires `weaviate-client`. You can install it via `pip install "docarray[weaviate]".` ```` -Here is a video tutorial that guides you to build a simple image search using Weaviate and Docarray. +Here's a video tutorial on building a simple image search using Weaviate and DocArray:
@@ -17,7 +17,7 @@ Here is a video tutorial that guides you to build a simple image search using We ### Start Weaviate service -To use Weaviate as the storage backend, it is required to have the Weaviate service started. Create `docker-compose.yml` as follows: +To use Weaviate as the storage backend, you need to start the Weaviate service. Create `docker-compose.yml` as follows: ```yaml --- @@ -54,7 +54,7 @@ docker-compose up ### Create DocumentArray with Weaviate backend -Assuming service is started using the default configuration (i.e. server address is `http://localhost:8080`), one can instantiate a DocumentArray with Weaviate storage as such: +Assuming you've started the service with the default configuration (i.e. server address is `http://localhost:8080`), you can instantiate a DocumentArray with Weaviate storage: ```python from docarray import DocumentArray @@ -62,11 +62,11 @@ from docarray import DocumentArray da = DocumentArray(storage='weaviate') ``` -The usage would be the same as the ordinary DocumentArray. +You can use it just the same as an ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the name, the host, the port and the protocol to connect to the server. `name` is required in this case but other connection parameters are optional. If they are not provided, then it will connect to the Weaviate service bound to `http://localhost:8080`. +To access a formerly-persisted DocumentArray, you can specify the name, host, port and protocol to connect to the server. `name` is required in this case but other connection parameters are optional. If you don't provide them, it will connect to the Weaviate service bound to `http://localhost:8080`. -Note, that the `name` parameter in `config` needs to be capitalized. +Note that the `name` parameter in `config` needs to be capitalized. 
```python from docarray import DocumentArray @@ -78,41 +78,39 @@ da = DocumentArray( da.summary() ``` -Other functions behave the same as in-memory DocumentArray. +Other functions behave the same as an in-memory DocumentArray. -## Config - -The following configs can be set: +## Configuration | Name | Description | Default | |----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------| | `host` | Hostname of the Weaviate server | 'localhost' | -| `port` | port of the Weaviate server | 8080 | -| `protocol` | protocol to be used. Can be 'http' or 'https' | 'http' | -| `name` | Weaviate class name; the class name of Weaviate object to presesent this DocumentArray | None | +| `port` | Port of the Weaviate server | 8080 | +| `protocol` | Protocol to use. Can be 'http' or 'https' | 'http' | +| `name` | Weaviate class name; the class name of Weaviate object to present this DocumentArray | None | | `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | None | -| `distance` | The distance metric used to compute the distance between vectors. Must be either `cosine` or `l2-squared`. 
| `None`, defaults to the default value in Weaviate* | -| `ef` | The size of the dynamic list for the nearest neighbors (used during the search). The higher ef is chosen, the more accurate, but also slower a search becomes. | `None`, defaults to the default value in Weaviate* | -| `ef_construction` | The size of the dynamic list for the nearest neighbors (used during the construction). Controls index search speed/build speed tradeoff. | `None`, defaults to the default value in Weaviate* | -| `timeout_config` | Set the timeout configuration for all requests to the Weaviate server. | `None`, defaults to the default value in Weaviate* | -| `max_connections` | The maximum number of connections per element in all layers. | `None`, defaults to the default value in Weaviate* | -| `dynamic_ef_min` | If using dynamic ef (set to -1), this value acts as a lower boundary. Even if the limit is small enough to suggest a lower value, ef will never drop below this value. This helps in keeping search accuracy high even when setting very low limits, such as 1, 2, or 3. | `None`, defaults to the default value in Weaviate* | -| `dynamic_ef_max` | If using dynamic ef (set to -1), this value acts as an upper boundary. Even if the limit is large enough to suggest a lower value, ef will be capped at this value. This helps to keep search speed reasonable when retrieving massive search result sets, e.g. 500+. | `None`, defaults to the default value in Weaviate* | -| `dynamic_ef_factor` | If using dynamic ef (set to -1), this value controls how ef is determined based on the given limit. E.g. with a factor of 8, ef will be set to 8*limit as long as this value is between the lower and upper boundary. It will be capped on either end, otherwise. | `None`, defaults to the default value in Weaviate* | -| `vector_cache_max_objects` | For optimal search and import performance all previously imported vectors need to be held in memory. 
However, Weaviate also allows for limiting the number of vectors in memory. By default, when creating a new class, this limit is set to 2M objects. A disk lookup for a vector is orders of magnitudes slower than memory lookup, so the cache should be used sparingly. | `None`, defaults to the default value in Weaviate* | -| `flat_search_cutoff` | Absolute number of objects configured as the threshold for a flat-search cutoff. If a filter on a filtered vector search matches fewer than the specified elements, the HNSW index is bypassed entirely and a flat (brute-force) search is performed instead. This can speed up queries with very restrictive filters considerably. Optional, defaults to 40000. Set to 0 to turn off flat-search cutoff entirely. | `None`, defaults to the default value in Weaviate* | -| `cleanup_interval_seconds` | How often the async process runs that โ€œrepairsโ€ the HNSW graph after deletes and updates. (Prior to the repair/cleanup process, deleted objects are simply marked as deleted, but still a fully connected member of the HNSW graph. After the repair has run, the edges are reassigned and the datapoints deleted for good). Typically this value does not need to be adjusted, but if deletes or updates are very frequent it might make sense to adjust the value up or down. (Higher value means it runs less frequently, but cleans up more in a single batch. Lower value means it runs more frequently, but might not be as efficient with each run). | `None`, defaults to the default value in Weaviate* | -| `skip` | There are situations where it doesnโ€™t make sense to vectorize a class. For example if the class is just meant as glue between two other class (consisting only of references) or if the class contains mostly duplicate elements (Note that importing duplicate vectors into HNSW is very expensive as the algorithm uses a check whether a candidateโ€™s distance is higher than the worst candidateโ€™s distance for an early exit condition. 
With (mostly) identical vectors, this early exit condition is never met leading to an exhaustive search on each import or query). In this case, you can skip indexing a vector all-together. To do so, set "skip" to "true". skip defaults to false; if not set to true, classes will be indexed normally. This setting is immutable after class initialization. | `None`, defaults to the default value in Weaviate* | -| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True | - -*You can read more about the HNSW parameters and their default values [here](https://weaviate.io/developers/weaviate/current/vector-index-plugins/hnsw.html#how-to-use-hnsw-and-parameters) +| `distance` | Distance metric to compute distance between vectors. Must be either `cosine` or `l2-squared`. | `None`, defaults to the default value in Weaviate* | +| `ef` | Size of dynamic list for nearest neighbors (used during search). Higher ef is more accurate, but also slower for searching. | `None`, defaults to the default value in Weaviate* | +| `ef_construction` | Size of dynamic list for nearest neighbors (used during construction). Controls index search speed/build speed tradeoff. | `None`, defaults to the default value in Weaviate* | +| `timeout_config` | Set timeout configuration for all requests to Weaviate server. | `None`, defaults to the default value in Weaviate* | +| `max_connections` | Maximum connections per element in all layers. | `None`, defaults to the default value in Weaviate* | +| `dynamic_ef_min` | If using dynamic ef (set to -1), this value acts as a lower boundary. Even if limit is small enough to suggest a lower value, ef will never drop below this value. This helps in keeping search accuracy high even when setting very low limits, such as 1, 2, or 3. 
| `None`, defaults to the default value in Weaviate* | +| `dynamic_ef_max` | If using dynamic ef (set to -1), this value acts as an upper boundary. Even if limit is large enough to suggest a lower value, ef will be capped at this value. This helps to keep search speed reasonable when retrieving massive search result sets, e.g. 500+. | `None`, defaults to the default value in Weaviate* | +| `dynamic_ef_factor` | If using dynamic ef (set to -1), this value controls how ef is determined based on given limit. E.g. with a factor of 8, ef will be set to 8*limit as long as this value is between lower and upper boundary. Otherwise it will be capped on either end. | `None`, defaults to the default value in Weaviate* | +| `vector_cache_max_objects` | For optimal search and import performance all previously imported vectors need to be held in memory. However, Weaviate also allows for limiting number of vectors in memory. By default, when creating a new class, this limit is set to 2M objects. A disk lookup for a vector is orders of magnitudes slower than memory lookup, so cache should be used sparingly. | `None`, defaults to the default value in Weaviate* | +| `flat_search_cutoff` | Absolute number of objects configured as threshold for a flat-search cutoff. If a filter on a filtered vector search matches fewer than specified elements, the HNSW index is bypassed entirely and a flat (brute-force) search is performed instead. This can speed up queries with very restrictive filters considerably. Optional, defaults to 40000. Set to 0 to turn off flat-search cutoff entirely. | `None`, defaults to the default value in Weaviate* | +| `cleanup_interval_seconds` | How often async process runs that โ€œrepairsโ€ the HNSW graph after deletes and updates. (Prior to repair/cleanup process, deleted objects are simply marked as deleted, but still a fully connected member of the HNSW graph. After repair has run, edges are reassigned and datapoints deleted for good). 
Typically this value does not need to be adjusted, but if deletes or updates are very frequent it might make sense to adjust value up or down. (Higher value means it runs less frequently, but cleans up more in a single batch. Lower value means it runs more frequently, but might not be as efficient with each run). | `None`, defaults to the default value in Weaviate* | +| `skip` | There are situations where it doesnโ€™t make sense to vectorize a class. For example if the class is just meant as glue between two other class (consisting only of references) or if the class contains mostly duplicate elements (Note that importing duplicate vectors into HNSW is very expensive as the algorithm uses a check whether a candidateโ€™s distance is higher than worst candidateโ€™s distance for an early exit condition. With (mostly) identical vectors, this early exit condition is never met leading to an exhaustive search on each import or query). In this case, you can skip indexing a vector all-together. To do so, set "skip" to "true". skip defaults to false; if not set to true, classes will be indexed normally. This setting is immutable after class initialization. | `None`, defaults to the default value in Weaviate* | +| `list_like` | Controls if ordering of Documents is persisted in database. Disabling this breaks list-like features, but can improve performance. | True | + +*You can read more about HNSW parameters and their default values [here](https://weaviate.io/developers/weaviate/current/vector-index-plugins/hnsw.html#how-to-use-hnsw-and-parameters) ## Minimum example -The following example shows how to use DocArray with Weaviate Document Store in order to index and search text +The following example shows how to use DocArray with Weaviate Document Store to index and search text Documents. 
-First, let's run the create the `DocumentArray` instance (make sure a Weaviate server is up and running):
+First, let's create the `DocumentArray` instance (ensure a Weaviate server is up and running):

```python
from docarray import DocumentArray
@@ -137,7 +135,7 @@ with da:
    )
```

-Now, we can generate embeddings inside the database using BERT model:
+Now, we can generate embeddings inside the database using the BERT model:

```python
from transformers import AutoModel, AutoTokenizer
@@ -173,13 +171,13 @@ Persist Documents with Weaviate.

## Filtering

-Search with `.find` can be restricted by user-defined filters. Such filters can be constructed following the guidelines
-in [Weaviate's Documentation](https://weaviate.io/developers/weaviate/current/graphql-references/filters.html).
+You can restrict search with `.find` using user-defined filters. You can construct these filters by following the guidelines
+in [Weaviate's documentation](https://weaviate.io/developers/weaviate/current/graphql-references/filters.html).

### Example of `.find` with a filter only

-Consider you store Documents with a certain tag `price` into weaviate and you want to retrieve all Documents
-with `price` lower or equal to some `max_price` value.
+Consider you store Documents with a certain tag `price` into Weaviate and want to retrieve all Documents
+with `price` lower than or equal to a `max_price` value.
You can index such Documents as follows: @@ -204,7 +202,7 @@ for price in da[:, 'tags__price']: print(f'\t price={price}') ``` -Then you can retrieve all documents whose price is lower than or equal to `max_price` by applying the following +Then you can retrieve all Documents whose price is lower than or equal to `max_price` by applying the following filter: ```python @@ -219,7 +217,7 @@ for price in results[:, 'tags__price']: print(f'\t price={price}') ``` -This would print +This prints: ``` Returned examples that satisfy condition "price at most 3": @@ -232,8 +230,8 @@ This would print ### Example of `.find` with query vector and filter -Consider Documents with embeddings `[0,0,0]` up to ` [9,9,9]` where the document with embedding `[i,i,i]` -has as tag `price` with value `i`. We can create such example with the following code: +Consider Documents with embeddings `[0,0,0]` up to ` [9,9,9]` where the Document with embedding `[i,i,i]` +has a tag `price` with value `i`. We can create such an example with the following code: ```python @@ -281,7 +279,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -This would print: +This prints: ```bash Embeddings Nearest Neighbours with "price" at most 7: @@ -298,13 +296,13 @@ Embeddings Nearest Neighbours with "price" at most 7: `pip install --upgrade weaviate-client`*** You can sort results by any primitive property, typically a text, string, number, or int property. When a query has a -natural order (e.g. because of a near vector search), adding a sort operator will override the order. +natural order (e.g. because of a near vector search), adding a sort operator overrides the order. 
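Conceptually, the sort operator re-orders the result set by the chosen property instead of the vector-search order. A plain-Python sketch of that re-ordering (the result dicts are illustrative, not Weaviate's response format):

```python
# Conceptual sketch: results come back in vector-search order, and a sort
# operator re-orders them by a chosen property (here the 'price' tag).

results = [
    {'id': 'a', 'tags': {'price': 3}},
    {'id': 'b', 'tags': {'price': 9}},
    {'id': 'c', 'tags': {'price': 5}},
]

# descending sort by price, i.e. from highest price to lowest
by_price_desc = sorted(results, key=lambda d: d['tags']['price'], reverse=True)
print([d['tags']['price'] for d in by_price_desc])  # [9, 5, 3]
```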
[Further documentation here.](https://weaviate.io/developers/weaviate/current/graphql-references/get.html#sorting) ### Example of `.find` with vector and sort -Consider Documents with the column 'price' and on the return you want to sort these documents by highest price to lowest +Consider Documents with the column 'price' and on the return you want to sort these Documents by highest price to lowest price. You can create an example with the following code: ```python @@ -351,7 +349,7 @@ for embedding, price in zip(results.embeddings, results[:, 'tags__price']): print(f'\tembedding={embedding},\t price={price}') ``` -This would print: +This prints: ```bash Returned examples that verify results are in order from highest price to lowest: @@ -368,7 +366,7 @@ Returned examples that verify results are in order from highest price to lowest: embedding=[0. 0. 0.], price=0 ``` -For ascending the results would be as expected: +In ascending order the results would be as expected: ```bash embedding=[0. 0. 0.], price=0 @@ -385,14 +383,14 @@ For ascending the results would be as expected: ## Set minimum certainty on query results -The DocArray/Weaviate find class uses the NearVector search argument since Weaviate is only being used in this combination to store -vectors generated by DocArray. Sometimes you want to set the certainty at a certain level to limit the return results. +The DocArray/Weaviate find class uses the NearVector search argument since Weaviate is only used in this combination to store +vectors generated by DocArray. Sometimes you want to set the certainty at a certain level to limit the returned results. You can do this with the `query_params` argument in the `find()` method. -`query_params` is a Dictionary element that combines itself with the request body. To set you must pass the value as a +`query_params` is a Dictionary element that combines itself with the request body. 
To set this you must pass the value as a Dict (`query_params={"key": "value}`) within the `find()` function -If you are familiar with Weaviates GraphQL structure then you can see where the `query_params` goes: +If you are familiar with Weaviate's GraphQL structure then you can see where the `query_params` goes: ```grapql { Get{ @@ -460,7 +458,7 @@ for res in results: print(f"\t scores={res[:, 'scores']}") ``` -This should return something similar to: +This returns something similar to: ```bash Only results that have a 'weaviate_certainty' of higher than 0.9 should show: @@ -470,19 +468,19 @@ Only results that have a 'weaviate_certainty' of higher than 0.9 should show: ## Include additional properties in the return -GraphQL additional properties can be used on data objects in Get{} Queries to get additional information about the +GraphQL additional properties can be used on data objects in `Get{}` queries to get additional information about the returned data objects. Which additional properties are available depends on the modules that are attached to Weaviate. -The fields id, certainty, featureProjection and classification are available from Weaviate Core. On nested GraphQL -fields (references to other data classes), only the id can be returned. Explanation on specific additional properties +The fields `id`, `certainty`, `featureProjection` and `classification` are available from Weaviate Core. On nested GraphQL +fields (references to other data classes), only the `id` can be returned. An explanation on specific additional properties can be found on the module pages, see for example [text2vec-contextionary](https://weaviate.io/developers/weaviate/current/modules/text2vec-contextionary.html#additional-graphql-api-properties). 
[Further documentation here](https://weaviate.io/developers/weaviate/current/graphql-references/additional-properties.html) -In order to include additional properties on the request you can use the `additional` parameter of the `find()` function. +To include additional properties on the request you can use the `additional` parameter of the `find()` function. These will be included as Tags on the response. -Assume you want to know when the document was inserted and last updated in the DB. +Assume you want to know when the Document was inserted and last updated in the database. You can run the following: ```python @@ -531,7 +529,7 @@ for res in results: print(f"\t lastUpdateTimeUnix={res[:, 'tags__lastUpdateTimeUnix']}") ``` -This should return: +This returns: ```bash See when the Document was created and updated: From 6952f8879de06c7f246eb33474495ab27030a66d Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Thu, 8 Dec 2022 12:41:34 +0100 Subject: [PATCH 05/10] docs(document): polish wording Signed-off-by: Alex C-G --- docs/fundamentals/document/attribute.md | 28 ++++----- docs/fundamentals/document/construct.md | 42 ++++++------- docs/fundamentals/document/embedding.md | 21 +++---- .../fundamentals/document/fluent-interface.md | 11 ++-- docs/fundamentals/document/index.md | 7 +-- docs/fundamentals/document/nested.md | 8 +-- docs/fundamentals/document/serialization.md | 61 +++++++++---------- docs/fundamentals/document/visualization.md | 6 +- 8 files changed, 89 insertions(+), 95 deletions(-) diff --git a/docs/fundamentals/document/attribute.md b/docs/fundamentals/document/attribute.md index 7e6776bc329..9f2c8b7c6d9 100644 --- a/docs/fundamentals/document/attribute.md +++ b/docs/fundamentals/document/attribute.md @@ -15,19 +15,19 @@ print(d.text) hello world ``` -To unset attribute, simply assign it to `None`: +To unset an attribute, assign it to `None`: ```python d.text = None ``` -or use {meth}`~docarray.base.BaseDCType.pop`: +Or use 
{meth}`~docarray.base.BaseDCType.pop`: ```python d.pop('text') ``` -One can unset multiple attributes with `.pop()`: +You can unset multiple attributes with `.pop()`: ```python d.pop('text', 'id', 'mime_type') @@ -38,16 +38,16 @@ You can check which attributes are set by `.non_empty_fields`. ## Content attributes -Among all attributes, content attributes, namely `.text`, `.tensor`, and `.blob`, are super important. They contain the actual content. +Among all attributes, the most important are content attributes, namely `.text`, `.tensor`, and `.blob`. They contain the actual content. ```{seealso} -If you are working with a Document that was created through DocArray's {ref}`dataclass ` API, -you can not only access the attributes that are described here, but also the attributes that you defined yourself. +If you're working with a Document that was created through DocArray's {ref}`dataclass ` API, +you can not only access the attributes described here, but also attributes that you defined yourself. To see how to do that, see {ref}`here `. ``` -They correspond to string-like data (e.g. for natural language), `ndarray`-like data (e.g. for image/audio/video data), and binary data for general purpose, respectively. +They correspond to string-like data (e.g. for natural language), `ndarray`-like data (e.g. for image/audio/video data), and binary data (for general purpose), respectively. | Attribute | Accept type | Use case | @@ -76,7 +76,7 @@ print(d) As you can see, the `text` field is reset to empty. -But what if you want to represent more than one kind of information? Say, to fully represent a PDF page you need to store both image and text. In this case, you can use {ref}`nested Document`s and store image in one sub-Document, and text in another sub-Document. +But what if you want to represent more than one kind of information? Say, to fully represent a PDF page you need to store both image and text. 
In this case, you can use {ref}`nested Document`s and store the image in one sub-Document, and text in another sub-Document. ```python from docarray import Document @@ -84,9 +84,9 @@ from docarray import Document d = Document(chunks=[Document(tensor=...), Document(text=...)]) ``` -The principle is: each Document contains only one modality of information. In practice, this principle makes your full solution more clear and easier to maintain. +The principle is: Each Document contains only one modality of information. In practice, this makes your full solution clearer and easier to maintain. -There is also a `.content` getter/setter of the content fields. The content will be automatically grabbed or assigned to either `text`, `blob`, or `tensor` field, based on the given type. +There's also a `.content` getter/setter for the content fields. Content is automatically grabbed or assigned to either the `text`, `blob`, or `tensor` field, based on the given type. ```python from docarray import Document @@ -111,11 +111,11 @@ print(d) You can also check which content field is set by `.content_type`. (content-uri)= -## Load content from URI +## Load content from a URI A common pattern is loading content from a URI instead of assigning it directly in the code. -This can easily be done with `.uri` attribute. The value of `.uri` can point to either a local URI, remote URI or [data URI](https://en.wikipedia.org/wiki/Data_URI_scheme). +You can do this with the `.uri` attribute. The value of `.uri` can point to either a local URI, remote URI or [data URI](https://en.wikipedia.org/wiki/Data_URI_scheme). ````{tab} Local image URI @@ -188,7 +188,7 @@ There are more `.load_uri_to_*` functions that allow you to read {ref}`text`. +Initializing a Document object is easy. This chapter introduces the ways of constructing both empty and filled Documents. You can also construct Documents from bytes, JSON, or Protobuf message as introduced {ref}`in the next chapter`. 
## Construct an empty Document
@@ -15,10 +15,10 @@
 d = Document()
 ```
 
-Every Document will have a unique random `id` that helps you identify this Document. It can be used to {ref}`access this Document inside a DocumentArray`.
+Each Document has a unique random `id` to identify it. It can be used to {ref}`access the Document inside a DocumentArray`.
 
 ````{tip}
-The random `id` is the hex value of [UUID1](https://docs.python.org/3/library/uuid.html#uuid.uuid1). To convert it into the string of UUID:
+The random `id` is the hex value of [UUID1](https://docs.python.org/3/library/uuid.html#uuid.uuid1). To convert it into a UUID string:
 
 ```python
 import uuid
@@ -27,12 +27,12 @@ str(uuid.UUID(d.id))
 ```
 ````
 
-Though possible, it is not recommended modifying `.id` of a Document frequently, as this will lead to unexpected behavior.
+Though possible, we don't recommend modifying the `.id` of a Document frequently, as this leads to unexpected behavior.
 
 (construct-from-dict)=
 ## Construct with attributes
 
-This is the most common usage of the constructor: initializing a Document object with given attributes.
+This is the constructor's most common use: initializing a Document object with the given attributes:
 
 ```python
 from docarray import Document
@@ -64,7 +64,7 @@ Don't forget to leverage autocomplete in your IDE.
 ```
 
 ````{tip}
-When you `print()` a Document, you get a string representation such as ``. It shows the non-empty attributes of that Document as well as its `id`, which helps you understand the content of that Document.
+When you `print()` a Document, you get a string representation like ``. This shows the Document's non-empty attributes as well as its `id`. All of this helps you understand the content of that Document.
 
 ```text
@@ -78,7 +78,7 @@ When you `print()` a Document, you get a string representation such as `` API.
+To construct multimodal Documents in a more comfortable, readable, and idiomatic way, you should use DocArray's {ref}`dataclass ` API.
-To learn more about nested Document, please read {ref}`recursive-nested-document`. +To learn more about nested Documents, please read {ref}`recursive-nested-document`. ``` -Document can be nested inside `.chunks` and `.matches`. The nested structure can be specified directly during construction: +Documents can be nested inside `.chunks` and `.matches`. You can specify this nested structure directly during construction: ```python from docarray import Document @@ -132,7 +132,7 @@ print(d) ``` -For a nested Document, print its root does not give you much information. You can use {meth}`~docarray.document.mixins.plot.PlotMixin.summary`. For example, `d.summary()` gives you a more intuitive overview of the structure. +For a nested Document, printing its root doesn't give much information. Instead, you can use {meth}`~docarray.document.mixins.plot.PlotMixin.summary` -- for example, `d.summary()` gives a more intuitive overview of the Document's structure. ```text @@ -144,15 +144,16 @@ For a nested Document, print its root does not give you much information. You ca โ””โ”€ ``` -When using in Jupyter notebook/Google Colab, Document is automatically prettified. +When using in Jupyter notebook/Google Colab, Documents are automatically prettified. ```{figure} images/doc-in-jupyter.png ``` (unk-attribute)= -### Unknown attributes handling -If you give an unknown attribute (i.e. not one of the built-in Document attributes), they will be automatically "caught" into `.tags` attributes. For example, +### Unknown attribute handling + +If you give an unknown attribute (i.e. not one of the built-in Document attributes), it is automatically "caught" into the `.tags` attribute. For example: ```python from docarray import Document @@ -167,11 +168,11 @@ print(d, d.tags) {'hello': 'world'} ``` -You can change this "`catch`" behavior to `drop` (silently drop unknown attributes) or `raise` (raise a `AttributeError`) by specifying `unknown_fields_handler`. 
+You can change this `catch` behavior to `drop` (silently drop unknown attributes) or `raise` (raise an `AttributeError`) by specifying `unknown_fields_handler`. ### Resolve unknown attributes with rules -One can resolve external fields into built-in attributes by specifying a mapping in `field_resolver`. For example, to resolve the field `hello` as the `id` attribute: +You can resolve external fields into built-in attributes by specifying a mapping in `field_resolver`. For example, to resolve the field `hello` as the `id` attribute: ```python from docarray import Document @@ -185,7 +186,7 @@ print(d) ``` -One can see `id` of the Document object is set to `world`. +You can see `id` of the Document object is set to `world`. ## Copy from another Document @@ -205,8 +206,7 @@ print(d == d1, id(d) == id(d1)) True False ``` -That indicates `d` and `d1` have identical content, but they are different objects in memory. - +This indicates `d` and `d1` have identical content, but they are different objects in memory. If you want to keep the memory address of a Document object while only copying the content from another Document, you can use {meth}`~docarray.base.BaseDCType.copy_from`. @@ -230,4 +230,4 @@ world ## What's next? -One can also construct Document from bytes, JSON, Protobuf message. These methods are introduced {ref}`in the next chapter`. +You can also construct Documents from bytes, JSON, and Protobuf message. These methods are introduced {ref}`in the next chapter`. diff --git a/docs/fundamentals/document/embedding.md b/docs/fundamentals/document/embedding.md index 88e7154511d..d74d7d24bdc 100644 --- a/docs/fundamentals/document/embedding.md +++ b/docs/fundamentals/document/embedding.md @@ -1,6 +1,6 @@ # Embedding -Embedding is a multi-dimensional representation of a Document (often a `[1, D]` vector). It serves as a very important piece in machine learning. 
The attribute {attr}`~docarray.Document.embedding` is designed to contain the embedding information of a Document.
+An embedding is a multi-dimensional representation of a Document (often a `[1, D]` vector). It plays a very important role in machine learning. The attribute {attr}`~docarray.Document.embedding` is designed to contain a Document's embedding information.
 
 Like `.tensor`, you can assign it with a Python (nested) List/Tuple, Numpy `ndarray`, SciPy sparse matrix (`spmatrix`), TensorFlow dense and sparse tensor, PyTorch dense and sparse tensor, or PaddlePaddle dense tensor.
 
@@ -20,9 +20,9 @@ d4 = Document(embedding=torch.tensor([1, 2, 3]))
 d5 = Document(embedding=tf.sparse.from_dense(np.array([[1, 2, 3], [4, 5, 6]])))
 ```
 
-Unlike some other packages, DocArray will not actively cast `dtype` into `float32`. If the right-hand assigment `dtype` is `float64` in PyTorch, it will stay as a PyTorch `float64` tensor.
+Unlike some other packages, DocArray doesn't actively cast `dtype` into `float32`. If the right-hand assignment `dtype` is `float64` in PyTorch, it will stay as a PyTorch `float64` tensor.
 
-To assign multiple Documents `.tensor` and `.embedding` in bulk, you {ref}`should use DocumentArray`. It is much faster and smarter than using for-loop.
+To assign `.tensor`s and `.embedding`s to multiple Documents in bulk, {ref}`use DocumentArray`. It's much faster and smarter than using a for-loop.
 
 ## Fill embedding via neural network
 
@@ -30,10 +30,10 @@ To assign multiple Documents `.tensor` and `.embedding` in bulk, you {ref}`shoul
 
 ```{admonition} On multiple Documents use DocumentArray
 :class: tip
 
-To embed multiple Documents, do not use this feature in a for-loop. Instead, put all Documents in a DocumentArray and call `.embed()`. You can find out more in {ref}`embed-via-model`.
+To embed multiple Documents, don't use this feature in a for-loop. Instead, put all Documents in a DocumentArray and call `.embed()`.
You can find out more in {ref}`embed-via-model`.
 ```
 
-Usually you don't want to assign embedding manually, but instead doing something like:
+Usually you don't want to assign an embedding manually, but instead do something like:
 
 ```text
 d.tensor \
 d.text ---> some DNN model ---> d.embedding
 d.blob /
 ```
 
-Once a Document has content field set, you can use a deep neural network to {meth}`~docarray.document.mixins.sugar.SingletonSugarMixin.embed` it, which means filling `.embedding`. For example, our Document looks like the following:
+Once a Document has its content field set, you can use a deep neural network to {meth}`~docarray.document.mixins.sugar.SingletonSugarMixin.embed` it, which means filling its `.embedding`. For example, take this Document:
 
 ```python
 q = (Document(uri='/Users/hanxiao/Downloads/left/00003.jpg')
      .load_uri_to_image_tensor()
      .set_image_tensor_channel_axis(-1, 0))
 ```
 
-Let's embed it into vector via ResNet50:
+Let's embed it into a vector with ResNet50:
 
 ```python
 import torchvision
@@ -63,10 +63,10 @@ q.embed(model)
 ```
 
 ```{admonition} On multiple Documents use DocumentArray
 :class: tip
 
-To match multiple Documents, do not use this feature in a for-loop. Instead, find out more in {ref}`match-documentarray`.
+To match multiple Documents, don't use this feature in a for-loop. Instead, find out more in {ref}`match-documentarray`.
 ```
 
-Documents have `.embedding` set can be "matched" against each other. In this example, we build ten Documents and put them into a {ref}`DocumentArray`, and then use another Document to search against them.
+Documents with an `.embedding` can be "matched" against each other. In this example, we create ten Documents and put them into a {ref}`DocumentArray`, and then use another Document to search against them.
```python
from docarray import DocumentArray, Document
@@ -95,6 +95,3 @@ q.summary()
 โ”œโ”€
 โ””โ”€
 ```
-
-
-
diff --git a/docs/fundamentals/document/fluent-interface.md b/docs/fundamentals/document/fluent-interface.md
index 7ad18e804b0..f6d5f279faf 100644
--- a/docs/fundamentals/document/fluent-interface.md
+++ b/docs/fundamentals/document/fluent-interface.md
@@ -1,6 +1,6 @@
 # Fluent Interface
 
-Document provides a simple fluent interface that allows one to process (often preprocess) a Document object by chaining methods. For example to read an image file as `numpy.ndarray`, resize it, normalize it and then store it to another file; one can simply do:
+Documents provide a simple fluent interface that lets you process (often preprocess) a Document object by chaining methods. For example, to read an image file as `numpy.ndarray`, resize it, normalize it and then store it to another file:
 
 ```python
 from docarray import Document
@@ -27,7 +27,7 @@ Processed `apple1.png`
 
 ```
 
-Note that, chaining methods always modify the original Document in-place. That means the above example is equivalent to:
+Note that chaining methods always modifies the original Document in-place. That means the above example is equivalent to:
 
 ```python
 from docarray import Document
@@ -92,7 +92,7 @@ Provide helper functions for {class}`Document` to support text data.
 
 ### SingletonSugar
 
-Provide sugary syntax for {class}`Document` by inheriting methods from {class}`DocumentArray`
+Provide sugary syntax for {class}`Document` by inheriting methods from {class}`DocumentArray`.
 
 - {meth}`~docarray.document.mixins.sugar.SingletonSugarMixin.embed`
 - {meth}`~docarray.document.mixins.sugar.SingletonSugarMixin.match`
@@ -103,6 +103,7 @@ Provide helper functions for feature hashing.
 
 ### Porting
 
+Provide helper functions for {class}`Document` to convert between serialization types.
- {meth}`~docarray.document.mixins.porting.PortingMixin.from_base64` - {meth}`~docarray.document.mixins.porting.PortingMixin.from_bytes` @@ -116,12 +117,12 @@ Provide helper functions for feature hashing. ### Pydantic -Provide helper functions to convert to/from a Pydantic model +Provide helper functions to convert to/from a Pydantic model. - {meth}`~docarray.document.mixins.pydantic.PydanticMixin.from_pydantic_model` ### Strawberry -Provide helper functions to convert to/from a Strawberry model +Provide helper functions to convert to/from a Strawberry model. - {meth}`~docarray.document.mixins.strawberry.StrawberryMixin.from_strawberry_type` diff --git a/docs/fundamentals/document/index.md b/docs/fundamentals/document/index.md index 46ea3b87f47..ac5a5042b58 100644 --- a/docs/fundamentals/document/index.md +++ b/docs/fundamentals/document/index.md @@ -32,7 +32,7 @@ A Document object has a predefined data schema as below, each of the attributes An `ndarray`-like object can be a Python (nested) List/Tuple, Numpy ndarray, SciPy sparse matrix (spmatrix), TensorFlow dense and sparse tensor, PyTorch dense and sparse tensor, or PaddlePaddle dense tensor. ``` -The data schema of the Document is comprehensive and well-organized. One can categorize those attributes into the following groups: +The Document's data schema is comprehensive and well-organized. You can categorize its attributes into several groups: - Content related: `uri`, `text`, `tensor`, `blob`; - Nest structure related: `chunks`, `matches`, `granularity`, `adjacency`, `parent_id`; @@ -45,8 +45,7 @@ This picture depicts how you may want to construct or comprehend a Document obje ```{figure} images/document-attributes.svg ``` - -Document also provides a set of functions frequently used in data science and machine learning community. +Documents also provide a set of functions frequently used in the data science and machine learning community. ## What's next? 
@@ -64,4 +63,4 @@ embedding
 nested
 visualization
 fluent-interface
-```
\ No newline at end of file
+```
diff --git a/docs/fundamentals/document/nested.md b/docs/fundamentals/document/nested.md
index 62b8ecaccbe..41010f20dbc 100644
--- a/docs/fundamentals/document/nested.md
+++ b/docs/fundamentals/document/nested.md
@@ -1,7 +1,7 @@
 (recursive-nested-document)=
 # Nested Structure
 
-Document can be nested both horizontally and vertically via `.matches` and `.chunks`. The picture below illustrates the recursive Document structure.
+Documents can be nested both horizontally and vertically via `.matches` and `.chunks`. The picture below illustrates the recursive Document structure.
 
 ```{figure} images/nested-structure.svg
 ```
@@ -57,8 +57,6 @@ d.summary()
 
 ## What's next?
 
-When you have multiple Documents with nested structures, traversing over certain chunks and matches can be crucial. Fortunately, this is extremely simple thanks to DocumentArray as shown in {ref}`access-elements`.
-
-Note that some methods rely on these two attributes, some methods require these two attributes to be filled in advance. For example, {meth}`~docarray.array.mixins.match.MatchMixin.match` will fill `.matches`, whereas {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate` requires `.matches` to be filled.
-
+When you have multiple Documents with nested structures, traversing over certain chunks and matches can be crucial. This is simple thanks to DocumentArray as shown in {ref}`access-elements`.
+Note that some methods fill these two attributes, while others require them to be filled in advance. For example, {meth}`~docarray.array.mixins.match.MatchMixin.match` will fill `.matches`, whereas {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate` requires `.matches` to be filled.
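The recursive layout described above -- a Document holding further Documents under `.chunks` (vertical) and `.matches` (horizontal) -- can be sketched with a plain-Python toy model. This is an illustration only, not DocArray's actual `Document` class:

```python
from dataclasses import dataclass, field
from typing import List
import uuid


@dataclass
class ToyDoc:
    """A minimal stand-in for a Document, showing the recursive layout."""

    text: str = ''
    id: str = field(default_factory=lambda: uuid.uuid1().hex)
    chunks: List['ToyDoc'] = field(default_factory=list)  # vertical nesting
    matches: List['ToyDoc'] = field(default_factory=list)  # horizontal nesting

    def summary(self, indent: int = 0) -> str:
        # walk the tree depth-first, roughly like Document.summary()
        lines = [' ' * indent + f'<ToyDoc text={self.text!r}>']
        for sub in self.chunks + self.matches:
            lines.append(sub.summary(indent + 2))
        return '\n'.join(lines)


d = ToyDoc(text='hello world!', chunks=[ToyDoc(text='hello'), ToyDoc(text='world')])
print(d.summary())
```

Because each sub-Document is itself a full Document, methods like `summary()` (and, in DocArray, traversal over chunks and matches) fall out of a simple recursion.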
diff --git a/docs/fundamentals/document/serialization.md b/docs/fundamentals/document/serialization.md
index 1edabee4987..3ec5e14b478 100644
--- a/docs/fundamentals/document/serialization.md
+++ b/docs/fundamentals/document/serialization.md
@@ -1,21 +1,21 @@
 (serialize)=
 # Serialization
 
-DocArray is designed to be "ready-to-wire": it assumes you always want to send/receive Document over network across microservices. Hence, serialization of Document is important. This chapter introduces multiple serialization methods of a single Document.
+DocArray is designed to be "ready-to-wire": it assumes you always want to send/receive Documents over the network across microservices. This chapter introduces a Document's multiple serialization methods.
 
 ```{tip}
-One should use {ref}`DocumentArray for serializing multiple Documents`, instead of looping over Documents one by one. The former is much faster and yield more compact serialization.
+You should use {ref}`DocumentArray for serializing multiple Documents`, instead of looping over Documents one by one. The former is much faster and yields more compact serialization.
 ```
 
 ```{hint}
-Some readers may wonder: why aren't serialization a part of constructor? They do have similarity. Nonetheless, serialization often contains elements that do not really fit into constructor: input & output model, data schema, compression, extra-dependencies. DocArray made a decision to separate the constructor and serialization for the sake of clarity and maintainability.
+You may wonder: why isn't serialization part of the constructor? The two are similar. Nonetheless, serialization often contains elements that don't really fit into a constructor: input & output model, data schema, compression, extra-dependencies. We decided to separate the constructor and serialization in DocArray for the sake of clarity and maintainability.
``` (doc-json)= ## From/to JSON ```{tip} -If you are building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, it is highly recommended to check out {ref}`fastapi-support` and follow the methods there. +If you're building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, we highly recommend checking out {ref}`fastapi-support` and following the methods there. ``` ```{important} @@ -24,7 +24,7 @@ Depending on which protocol you use, this feature requires `pydantic` or `protob -You can serialize a Document as a JSON string via {meth}`~docarray.document.mixins.porting.PortingMixin.to_json`, and then read from it via {meth}`~docarray.document.mixins.porting.PortingMixin.from_json`. +You can serialize a Document as a JSON string with {meth}`~docarray.document.mixins.porting.PortingMixin.to_json`, and then read from it with {meth}`~docarray.document.mixins.porting.PortingMixin.from_json`. ```python from docarray import Document @@ -43,7 +43,7 @@ print(d_as_json, d) ``` -By default, it uses {ref}`JSON Schema and pydantic model` for serialization, i.e. `protocol='jsonschema'`. You can switch the method to `protocol='protobuf'`, which leverages Protobuf as the JSON serialization backend. +By default, Documents use {ref}`JSON Schema and pydantic model` for serialization, i.e. `protocol='jsonschema'`. To use Protobuf as the JSON serialization backend, pass `protocol='protobuf'` to the method: ```python @@ -71,9 +71,9 @@ d.to_json(protocol='protobuf') } ``` -When using it for REST API, it is recommended to use `protocol='jsonschema'` as the resulted JSON will follow a pre-defined schema. This is highly appreciated for modern webservice engineering. +When using a RESTful API, you should use `protocol='jsonschema'` as the resulting JSON will follow a pre-defined schema. 
This is highly appreciated for modern webservice engineering.
 
-Note that you can pass extra arguments to control the include/exclude fields, lower/uppercase of the resulted JSON. For example, we can remove those fields that are empty or `none` from JSON via:
+Note that you can pass extra arguments to control field inclusion/exclusion and the casing of the resulting JSON. For example, you can remove fields that are empty or `none` with:
 
 ```python
 from docarray import Document
@@ -86,10 +86,10 @@ d.to_json(exclude_none=True)
 {"id": "cdbc4f7a77b411ec96ad1e008a366d49", "mime_type": "text/plain", "text": "hello, world", "embedding": [1, 2, 3]}
 ```
 
-It is easier to eyes. But when building REST API, you do not need to explicitly do this, pydantic model handle everything for you. More information can be found in {ref}`fastapi-support`.
+This is easier on the eyes. But when building a RESTful API, you don't need to explicitly do this -- the pydantic model handles everything for you. More information can be found in {ref}`fastapi-support`.
 
 ```{seealso}
-To find out what extra parameters you can pass to `to_json()`/`to_dict()`, please check out:
+To find out what extra parameters you can pass to `to_json()`/`to_dict()`, check out:
 - [`protocol='jsonschema', **kwargs`](https://pydantic-docs.helpmanual.io/usage/exporting_models/#modeljson)
 - [`protocol='protobuf', **kwargs`](https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html#google.protobuf.json_format.MessageToJson)
 ```
@@ -99,13 +99,13 @@ To find out what extra parameters you can pass to `to_json()`/`to_dict()`, pleas
 
 ### From/to arbitrary JSON
 
-Arbitrary JSON is unschema-ed JSON. It often comes from a handcrafted JSON, or an export file from other libraries. Its schema is unknown to DocArray, so by principle we can not load it.
+Arbitrary JSON is unschema-ed JSON. It often comes from handcrafted JSON, or an export file from other libraries. Its schema is unknown to DocArray, so by principle you can't load it.
-But load it, we do. To load an arbitrary JSON file set `protocol=None`.
+But principles be damned. To load an arbitrary JSON file, set `protocol=None`.
 
-As an _arbitrary_ JSON, you should not expect it always works smoothly. DocArray will try its best reasonable effort to parse its fields: by first loading the JSON into a `dict` object; and then building a Document via `Document(dict)`; when encountering unknown attributes it follows the behavior {ref}`described here`.
+Since the JSON is _arbitrary_, don't expect it to always load smoothly. DocArray will try its best to parse the fields: by first loading the JSON into a `dict` object; and then building a Document with `Document(dict)`; when encountering unknown attributes it follows the behavior {ref}`described here`.
 
-Rule of thumb, if you only work inside DocArray's ecosystem, please always prefer schema-ed JSON (`.to_json(protocol='jsonschema')`, or `.to_json(protocol='protobuf')`) over unschema-ed JSON. If you are exporting DocArray's JSON to other ecosystems, also prefer schema-ed JSON. Your engineer friends will appreciate it as it is easier for integration. In fact, DocArray does **not** unschema-ed JSON export, and your engineer friends will never be upset.
+As a rule of thumb, if you only work inside DocArray's ecosystem, always use schema-ed JSON (`.to_json(protocol='jsonschema')`, or `.to_json(protocol='protobuf')`) over unschema-ed JSON. If you're exporting DocArray's JSON to other ecosystems, also use schema-ed JSON. Your engineer friends will appreciate it as it is easier to integrate. In fact, DocArray does **not** offer unschema-ed JSON export, so your engineer friends will never be upset.
 
 Read more about {ref}`schema-gen` support in DocArray.
 
@@ -114,11 +114,10 @@ Read more about {ref}`schema-gen` support in DocArray.
 ## From/to bytes
 
 ```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies.
You can do `pip install "docarray[full]"` to install it. +Depending on your `protocol` and `compress` argument values, this feature may require `protobuf` and `lz4` dependencies. You can run `pip install "docarray[full]"` to install it. ``` - -Bytes or binary or buffer, how ever you want to call it, it probably the most common & compact wire format. DocArray provides {meth}`~docarray.document.mixins.porting.PortingMixin.to_bytes` and {meth}`~docarray.document.mixins.porting.PortingMixin.from_bytes` to serialize Document object into bytes. +Bytes or binary or buffer, however you want to call it, is probably the most common and compact wire format. DocArray provides {meth}`~docarray.document.mixins.porting.PortingMixin.to_bytes` and {meth}`~docarray.document.mixins.porting.PortingMixin.from_bytes` to serialize Document objects into bytes. ```python from docarray import Document @@ -138,7 +137,7 @@ b'\x80\x03cdocarray.document\nDocument\nq\x00)\x81q\x01}q\x02X\x05\x00\x00\x00_d ``` -Default serialization protocol is `pickle`, you can change it to `protobuf` by specifying `.to_bytes(protocol='protobuf')`. You can also add compression to it and make the result bytes smaller. For example, +The default serialization protocol is `pickle` -- you can change it to `protobuf` by specifying `.to_bytes(protocol='protobuf')`. You can also add compression to make the resulting bytes smaller: ```python d = Document(text='hello, world', embedding=np.array([1, 2, 3])) @@ -151,9 +150,9 @@ gives: 110 ``` -whereas the default `.to_bytes()` gives `666` (spooky~). +whereas the default `.to_bytes()` gives `666`. 
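The two knobs form a pipeline: `protocol` turns the object into bytes, and `compress` shrinks those bytes. Here is a minimal stdlib sketch of the same idea, with `pickle` and `gzip` standing in for the serializers and compressors DocArray actually supports:

```python
import gzip
import pickle

doc = {'text': 'hello, world', 'embedding': [1, 2, 3]}

raw = pickle.dumps(doc)      # the "protocol" step: object -> bytes
packed = gzip.compress(raw)  # the "compress" step: bytes -> (usually) fewer bytes

# deserialization must mirror both choices, in reverse order
restored = pickle.loads(gzip.decompress(packed))
print(len(raw), len(packed), restored == doc)
```

For tiny payloads the compressed form can even be larger than the raw bytes; compression pays off as the content grows.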
Note that when deserializing from a non-default binary serialization, you need to specify the correct `protocol` and `compress` arguments used at serialization time:
 
 ```python
 d = Document.from_bytes(d_bytes, protocol='protobuf', compress='gzip')
@@ -167,10 +166,10 @@ If you go with default `protocol` and `compress` settings, you can simply use `by
 ## From/to base64
 
 ```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install it.
+Depending on your `protocol` and `compress` argument values, this feature may require `protobuf` and `lz4` dependencies. You can run `pip install "docarray[full]"` to install it.
 ```
 
-In some cases such as in REST API, you are allowed only to send/receive string not bytes. You can serialize Document into base64 string via {meth}`~docarray.document.mixins.porting.PortingMixin.to_base64` and load it via {meth}`~docarray.document.mixins.porting.PortingMixin.from_base64`.
+Sometimes, such as with RESTful APIs, you can only send/receive strings, not bytes. You can serialize a Document into a base64 string with {meth}`~docarray.document.mixins.porting.PortingMixin.to_base64` and load it with {meth}`~docarray.document.mixins.porting.PortingMixin.from_base64`.
 
 ```python
 from docarray import Document
@@ -198,16 +197,16 @@ print(len(d.to_base64(protocol='protobuf', compress='lz4')))
 156
 ```
 
-Note that the same `protocol` and `compress` must be followed when using `.from_base64`.
+Note that you must use the same `protocol` and `compress` when using `.from_base64`.
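Base64 only makes bytes string-safe; sender and receiver still have to agree on the serialization and compression choices underneath. A minimal stdlib sketch of the round trip (again with `pickle`/`gzip` as stand-ins, not DocArray's actual implementation):

```python
import base64
import gzip
import pickle

doc = {'text': 'hello, world'}

# bytes -> ASCII-safe string, e.g. to embed inside a JSON payload
b64_str = base64.b64encode(gzip.compress(pickle.dumps(doc))).decode('ascii')

# the receiver undoes each step with the same choices, in reverse
restored = pickle.loads(gzip.decompress(base64.b64decode(b64_str)))
print(restored == doc)
```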
+This feature requires the `protobuf` or `pydantic` dependency. You can run `pip install "docarray[full]"` to install it. ``` -You can serialize a Document as a Python `dict` via {meth}`~docarray.document.mixins.porting.PortingMixin.to_dict`, and then read from it via {meth}`~docarray.document.mixins.porting.PortingMixin.from_dict`. +You can serialize a Document as a Python `dict` with {meth}`~docarray.document.mixins.porting.PortingMixin.to_dict`, and then read from it with {meth}`~docarray.document.mixins.porting.PortingMixin.from_dict`. ```python from docarray import Document @@ -226,15 +225,15 @@ print(d_as_dict, d) ``` -As the intermediate step of `to_json()`/`from_json()` it is unlikely to use dict IO directly. Nonetheless, you can pass the same `protocol` and `kwargs` as described in {ref}`doc-json` to control the serialization behavior. +As the intermediate step of `to_json()`/`from_json()` it's unlikely you'll use dict IO directly. Nonetheless, you can pass the same `protocol` and `kwargs` as described in {ref}`doc-json` to control serialization behavior. ## From/to Protobuf ```{important} -This feature requires `protobuf` dependency. You can do `pip install "docarray[full]"` to install it. +This feature requires `protobuf` dependency. You can run `pip install "docarray[full]"` to install it. ``` -You can also serialize a Document object into a Protobuf Message object. This is less frequently used as it is often an intermediate step when serializing into bytes, as in `to_dict()`. However, if you work with Python Protobuf API, having a Python Protobuf Message object at hand can be useful. +You can also serialize a Document object into a Protobuf Message object. This is used less frequently as it's often an intermediate step when serializing into bytes, as in `to_dict()`. However, if you work with Python's Protobuf API, having a Python Protobuf Message object at hand can be useful. 
```python @@ -255,10 +254,10 @@ mime_type: "image/jpeg" ``` -One can refer to the [Protobuf specification of `Document`](../../proto/index.md) for details. +Refer to the [Protobuf specification of `Document`](../../proto/index.md) for details. -When `.tensor` or `.embedding` contains frameworks-specific ndarray-like object, you can use `.to_protobuf(..., ndarray_type='numpy')` or `.to_protobuf(..., ndarray_type='list')` to cast them into `list` or `numpy.ndarray` automatically. This will help to ensure the maximum compatability between different microservices. +When `.tensor` or `.embedding` contains a framework-specific ndarray-like object, you can use `.to_protobuf(..., ndarray_type='numpy')` or `.to_protobuf(..., ndarray_type='list')` to cast them into `list` or `numpy.ndarray` automatically. This helps ensure maximum compatibility between different microservices. ## What's next? -Serializing single Document can be useful but often we want to do things in bulk, say hundreds or one million Documents at once. In that case, looping over each Document and serializing one by one is inefficient. In DocumentArray, we will introduce the similar interfaces {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, {meth}`~docarray.array.mixins.io.json.JsonIOMixin.to_json`, and {meth}`~docarray.array.mixins.io.json.JsonIOMixin.to_list` that allows one to [serialize multiple Documents much faster and more compact](../documentarray/serialization.md). \ No newline at end of file +Serializing a single Document can be useful, but often we want to do things in bulk, say one hundred or one million Documents at once. In that case, looping over each Document and serializing one by one is inefficient. 
In DocumentArray, we introduce similar interfaces {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, {meth}`~docarray.array.mixins.io.json.JsonIOMixin.to_json`, and {meth}`~docarray.array.mixins.io.json.JsonIOMixin.to_list` that let you [serialize multiple Documents more quickly and compactly](../documentarray/serialization.md).
diff --git a/docs/fundamentals/document/visualization.md b/docs/fundamentals/document/visualization.md
index 2a1f86980b0..ef4aedebc4e 100644
--- a/docs/fundamentals/document/visualization.md
+++ b/docs/fundamentals/document/visualization.md
@@ -1,12 +1,12 @@
 # Visualization
 
-If you have an image Document (with possible image data in `.uri`/`.tensor`), you can directly visualize it via {meth}`~docarray.document.mixins.plot.PlotMixin.display`.
+If you have an image Document (with image data in `.uri`/`.tensor`), you can visualize it with {meth}`~docarray.document.mixins.plot.PlotMixin.display`.
 
 ```{figure} images/doc-plot-in-jupyter.jpg
 ```
 
-To better see the Document's nested structure, you can use {meth}`~docarray.document.mixins.plot.PlotMixin.summary`.
+To better see a Document's nested structure, you can use {meth}`~docarray.document.mixins.plot.PlotMixin.summary`.
 
 ```{code-block} python
 ---
@@ -37,7 +37,7 @@ d0.summary()
 └─ 
 ```
 
-When using Notebook/Colab, this is auto-rendered.
+When using Notebook/Colab, this is auto-rendered: ```{figure} images/doc-auto-summary.png ``` From e7c4153c2b6f4279ba1831a97d0b913ec93fe095 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Thu, 8 Dec 2022 16:40:36 +0100 Subject: [PATCH 06/10] docs(documentarray): polish wording Signed-off-by: Alex C-G --- .../documentarray/access-attributes.md | 42 ++++---- .../documentarray/access-elements.md | 71 ++++++------ docs/fundamentals/documentarray/construct.md | 26 ++--- docs/fundamentals/documentarray/embedding.md | 18 ++-- docs/fundamentals/documentarray/evaluation.md | 102 +++++++++--------- docs/fundamentals/documentarray/find.md | 24 ++--- docs/fundamentals/documentarray/index.md | 14 +-- docs/fundamentals/documentarray/matching.md | 54 +++++----- .../documentarray/parallelization.md | 50 ++++----- .../documentarray/post-external.md | 40 +++---- .../documentarray/serialization.md | 80 +++++++------- docs/fundamentals/documentarray/subindex.md | 26 +++-- .../documentarray/visualization.md | 14 +-- 13 files changed, 270 insertions(+), 291 deletions(-) diff --git a/docs/fundamentals/documentarray/access-attributes.md b/docs/fundamentals/documentarray/access-attributes.md index aae35736a87..59da051a3d7 100644 --- a/docs/fundamentals/documentarray/access-attributes.md +++ b/docs/fundamentals/documentarray/access-attributes.md @@ -1,9 +1,9 @@ (bulk-access)= # Access Attributes -DocumentArray itself has no attribute. Accessing attributes in this context means access attributes of the contained Documents in bulk. +A DocumentArray itself has no attributes. Accessing attributes in this context means accessing attributes of the contained Documents in bulk. -In the last chapter, we get a taste of the powerful element selector of the DocumentArray. This chapter will continue talking about the attribute selector. +In the last chapter, we got a taste of DocumentArray's powerful element selector. This chapter continues talking about the attribute selector. 
## Attribute selector

@@ -12,9 +12,9 @@ In the last chapter, we get a taste of the powerful element selector of the Docu
 da[element_selector, attribute_selector]
 ```
 
-Here `element_selector` are the ones introduced {ref}`in the last chapter`. The attribute selector can be a string, or a list/tuple of string that represents the names of the attributes.
+Here the `element_selector`s can be any element selector introduced {ref}`in the last chapter`. The attribute selector can be a string, or a list/tuple of strings representing attribute names.
 
-As in element selector, one can use attribute selector to **get/set/delete** attributes in a DocumentArray.
+As with element selectors, you can use attribute selectors to **get/set/delete** attributes in a DocumentArray.
 
 | Example | Return |
 |----------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
@@ -27,7 +27,7 @@ As in element selector, one can use attribute selector to **get/set/delete** att
 | `da[:, 'tensor']`, `da.tensors` | a NdArray-like object of the all top-level Documents tensors |
 
 
-Let's see an example.
+Let's see an example:
 
 ```python
 from docarray import DocumentArray
@@ -44,7 +44,7 @@ print(da[:, 'id'])
 ['8d41ce5c6f0d11eca2181e008a366d49', '8d41cfa66f0d11eca2181e008a366d49', '8d41cff66f0d11eca2181e008a366d49']
 ```
 
-Of course you can use it with {ref}`the path-string selector`.
+Of course you can use it with {ref}`the path-string selector`:
 
 ```python
 print(da['@c', 'id'])
@@ -88,7 +88,7 @@ da.summary()
   mime_type ('str',) 2 False
 ```
 
-We can see `mime_type` are set.
If you want to set an attribute of all Documents to the same value without looping: @@ -96,7 +96,7 @@ If you want to set an attribute of all Documents to the same value without loopi da[:, 'mime_type'] = 'hello' ``` -One can also select multiple attributes in one-shot: +You can also select multiple attributes in one shot: ```python da[:, ['mime_type', 'id']] @@ -106,7 +106,7 @@ da[:, ['mime_type', 'id']] [['image/jpg', 'image/png', 'image/jpg'], ['095cd76a6f0f11ec82211e008a366d49', '095cd8d26f0f11ec82211e008a366d49', '095cd92c6f0f11ec82211e008a366d49']] ``` -Now let's remove them. +Now let's remove them: ```python del da[:, 'mime_type'] @@ -134,11 +134,11 @@ da.summary() ## Auto-ravel on NdArray -Attribute selectors `tensor` and `embedding` behave a bit differently. Instead of relying on Python List for input & return when get/set, they automatically ravel/unravel the NdArray-like object [^1] for you. +The `tensor` and `embedding` attribute selectors behave a little differently. Instead of relying on Python List for input and return when get/set, they automatically ravel/unravel the ndarray-like object [^1] for you. -[^1]: NdArray-like can be Numpy/TensorFlow/PyTorch/SciPy/PaddlePaddle sparse & dense array. +[^1]: ndarray-like can be NumPy/TensorFlow/PyTorch/SciPy/PaddlePaddle sparse and dense array. -Here is an example, where one may expect that `da[:, 'embedding']` gives you a list of three `(1, 10)` COO matrices. But it auto ravels the results and returns as a `(3, 10)` COO matrix: +Here's an example, where you may expect that `da[:, 'embedding']` gives you a list of three `(1, 10)` COO matrices. But it auto-ravels the results and returns them as a `(3, 10)` COO matrix: ```python import numpy as np @@ -166,15 +166,13 @@ for d in da: (1, 10) ``` -Auto unravel works in a similar way, we just assign a `(3, 10)` COO matrix as `.embeddings` and it auto breaks into three and assign them into the three Documents. 
+Auto-unravel works in a similar way: We just assign a `(3, 10)` COO matrix as `.embeddings` and it auto-breaks into three and assigns them into the three Documents. -Of course, this is not limited to scipy sparse matrix. Any NdArray-like[^1] object would work. The same logic applies also to `.tensors` attribute. +Of course, this isn't limited to SciPy sparse matrices. Any ndarray-like[^1] object will work. The same logic also applies to the `.tensors` attribute. ## Dunder syntax for nested attributes -Some attributes are nested by nature, e.g. `.tags` and `.scores`. Accessing the deep nested value is easy thanks to the dunder (double under) expression. You can access `.tags['key1']` via `d[:, 'tags__key1']`. - -Let's see an example, +Some attributes are nested by nature, like `.tags` and `.scores`. Accessing the deep nested value is easy thanks to the dunder (double under) expression. You can access `.tags['key1']` via `d[:, 'tags__key1']`: ```python import numpy as np @@ -186,7 +184,7 @@ da.embeddings = np.random.random([3, 2]) da.match(da) ``` -Now to print `id` and matched score, one can simply do: +Now to print `id` and match score: ```python print(da['@m', ('id', 'scores__cosine__value')]) @@ -201,7 +199,7 @@ print(da['@m', ('id', 'scores__cosine__value')]) (da-content-embedding)= ## Content and embedding sugary attributes -DocumentArray provides `.texts`, `.blobs`, `.tensors`, `.contents` and `.embeddings` sugary attributes for quickly accessing the content and embedding of Documents. You can use them to get/set/delete attributes of all Documents at the top-level. +DocumentArray provides `.texts`, `.blobs`, `.tensors`, `.contents` and `.embeddings` sugary attributes for quickly accessing the content and embeddings of Documents. You can use them to get/set/delete attributes of all top-level Documents. 
```python
from docarray import DocumentArray

da = DocumentArray.empty(2)
da.texts = ['hello', 'world']

print(da.texts)
```

```text
['hello', 'world']
```

-This is same as `da[:, 'text'] = ['hello', 'world']` and then `print(da[:, 'text'])` but more compact and probably more Pythonic.
+This is the same as `da[:, 'text'] = ['hello', 'world']` followed by `print(da[:, 'text'])`, but more compact and probably more Pythonic.

-Same for `.tensors` and `.embeddings`:
+It's the same for `.tensors` and `.embeddings`:

```python
import numpy as np
@@ -241,4 +239,4 @@ for d in da:
 (10,)
 (10,)
 (10,)
-```
\ No newline at end of file
+```
diff --git a/docs/fundamentals/documentarray/access-elements.md b/docs/fundamentals/documentarray/access-elements.md
index a4a4e181cc8..f1f7c1d1196 100644
--- a/docs/fundamentals/documentarray/access-elements.md
+++ b/docs/fundamentals/documentarray/access-elements.md
@@ -1,11 +1,11 @@
 (access-elements)=
 # Access Documents
 
-This is probably my favorite chapter so far. Readers come to this far may ask: okay you re-implement Python List coin it as DocumentArray, what's the big deal?
+This is probably my favorite chapter so far. If you've come this far, you may be thinking: Okay, so you've re-implemented the Python List and called it DocumentArray. What's the big deal?
 
-If it is just a `list` and you can only access elements via `[1]`, `[-1]`, `[1:3]`, then it is no big deal. However, DocumentArray offers much more than simple indexing. It allows you to fully exploit the rich & nested data structure of Document in an easy and efficient way.
+If it really were just a `list` and you could only access elements via `[1]`, `[-1]`, `[1:3]`, then you'd be right. However, DocumentArray offers _much_ more than simple indexing. It lets you fully exploit the rich and nested data structure of Documents in an easy and efficient way.
 
-The table below summarizes all indexing routines that DocumentArray supports. You can use them to **get, set, and delete** items in DocumentArray. 
+The table below summarizes all the indexing routines that DocumentArray supports. You can use them to **get, set, and delete** items in a DocumentArray. | Indexing routine | Example | Return | |-----------------------------------------|------------------------------------------------------------------------------|---------------| @@ -22,7 +22,7 @@ The table below summarizes all indexing routines that DocumentArray supports. Yo Sounds exciting? Let's continue then. ````{tip} -Most of the examples below only show getting Documents for the sake of clarity. Note that you can always use the same syntax for get/set/delete Documents. For example, +Most of the examples below only show getting Documents for the sake of clarity. Note that you can always use the same syntax to get/set/delete Documents. For example: ```python da = DocumentArray(...) @@ -37,7 +37,7 @@ del da[index] ## Basic indexing -Basic indexing such as by the integer offset, the slices are so common that I don't think we need more words. You can just use it as in Python List. +Basic indexing such as by integer offset or slices are so common that we think they can go without saying. You can just use them like you would in a Python List: ```python from docarray import DocumentArray @@ -59,7 +59,7 @@ da[1:100:10] ## Index by Document `id` -A more interesting one is selecting Documents by their `id`. +A more interesting use case is selecting Documents by their `id`s: ```python from docarray import DocumentArray @@ -84,9 +84,9 @@ print(da['7e27fa246e6611ec9a441e008a366d49', '7e27fb826e6611ec9a441e008a366d49'] ``` -No need to worry about efficiency here, it is `O(1)`. +No need to worry about efficiency here: It's `O(1)`. 
-Based on the same technique, one can check if a Document is inside a DocumentArray via Python `in` syntax: +Based on the same technique, you can check if a Document is inside a DocumentArray using Python's `in` syntax: ```python from docarray import DocumentArray, Document @@ -105,7 +105,7 @@ False ## Index by boolean mask -You can use a boolean mask to select Documents. This becomes useful when you want to update or filter our certain Documents: +Using a boolean mask to select Documents is useful for updating or filtering certain Documents: ```python from docarray import DocumentArray @@ -122,31 +122,31 @@ print(da) ``` -Note that if the length of the boolean mask is smaller than the length of a DocumentArray, then the remaining part is padded to `False`. +Note that if the boolean mask's length is smaller than the DocumentArray's length, the remaining part is padded to `False`. (path-string)= ## Index by nested structure -From early chapter, we already know {ref}`Document can be nested`. DocumentArray provides very easy way to traverse over the nested structure and select Documents. All you need to do is following the syntax below: +From an earlier chapter, we already know {ref}`Documents can be nested`. DocumentArray provides makes it easy to traverse over the nested structure and select Documents: ```python da['@path1,path2,path3'] ``` -- The path-string must starts with `@`. -- Multiple paths are separated by comma `,`. -- A path represents the route from the top-level Documents to the destination. You can use `c` to select chunks, `cc` to select chunks of the chunks, `m` to select matches, `mc` to select chunks of the matches, `r` to select the top-level Documents. -- A path can only go deep, not go back. You can use comma `,` to start a new path from the very top-level. -- Optionally, you can specify a slice or offset at each level, for example, `r[-1]m[:3]` will select the first 3 matches of the last root document. +- The path-string must start with `@`. 
+- Multiple paths are separated by commas `,`.
+- A path represents the route from the top-level Documents to the destination. Use `c` to select chunks, `cc` to select chunks of chunks, `m` to select matches, `mc` to select chunks of matches, `r` to select top-level Documents.
+- A path can only go deeper, not shallower. You can use commas `,` to start a new path from the very top-level.
+- Optionally, specifying a slice or offset at each level (for example, `r[-1]m[:3]`) selects the first 3 matches of the last root document.

```{seealso}
-If you are working with a DocumentArray that was created through DocArray's {ref}`dataclass ` API,
-you can also directly access sub-documents by specifying the modality name that you have assigend to them.
+If you're working with a DocumentArray that was created through DocArray's {ref}`dataclass ` API,
+you can also directly access sub-documents by specifying the modality name that you assigned to them.
 To see how to do that, see {ref}`here `.
```

-Let's practice a bit. First construct a DocumentArray with nested Documents:
+Let's practice. First construct a DocumentArray with nested Documents:

```python
from docarray import DocumentArray
@@ -176,7 +176,7 @@ da.summary()
 matches ('MatchArray',) 3 False
```

-This simple DocumentArray contains 3 Documents, each of which contains 2 matches and 2 chunks. Let's plot one of them.
+This simple DocumentArray contains three Documents, each of which contains two matches and two chunks. Let's plot one of them.

```text
 
@@ -188,13 +188,13 @@ This simple DocumentArray contains 3 Documents, each of which contains 2 matches
 └─ 
```

-That's still too much information, let's minimize it.
+That's still too much information, let's minimize it:

```{figure} images/docarray-index-example.svg
:width: 10%
```

-Now let's use the red circle to depict our intended selection. Here is what you can with the path-syntax:
+Now let's use the red dot to depict our intended selection.
Here's where we use the path-syntax: ```{figure} images/docarray-index-example-full1.svg ``` @@ -213,29 +213,29 @@ print(da['@c,m,r']) ``` -Let's now consider a deeper nested structure and use the path syntax to select Documents. +Let's now consider a deeper nested structure and use the path syntax to select Documents: ```{figure} images/docarray-index-example-full2.svg ``` -Last but not the least, you can use integer, or integer slice to restrict the selection. +Last but not the least, you can use integer, or integer slice to restrict the selection: ```{figure} images/docarray-index-example-full3.svg :width: 60% ``` -This can be useful when you want to get top matches of all matches from all Documents, e.g.: +This is useful to get the top matches of all matches from all Documents: ```python da['@m[:5]'] ``` -You can add space in the path-string for a better readability. +You can add spaces in the path-string for better readability. ## Index by flatten -What if I just want a flat DocumentArray without all nested structure, can I select all Documents regardless their nested structure? +What if I just want a flat DocumentArray without all nested structure? Can I select all Documents regardless of their nested structure? -Yes! Simply use ellipsis literal as the selector `da[...]`: +Yes! Simply use the ellipsis literal as the selector `da[...]`: ```python from docarray import DocumentArray @@ -267,21 +267,21 @@ da[...].summary() parent_id ('str',) 4 False ``` -Note that there is no `chunks` and `matches` in any of the Document from `da[...]` anymore. They are all flattened. +Note that there are no `chunks` or `matches` in any of the Documents from `da[...]` anymore. They have all been flattened. Documents in `da[...]` are in the chunks-and-depth-first order, i.e depth-first traversing to all chunks and then to all matches. 
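The chunks-and-depth-first order described above can be sketched in plain Python (a conceptual illustration of the traversal rule using stand-in dicts, not DocArray's actual implementation):

```python
# Conceptual sketch: visit a Document, then recurse into its chunks,
# then into its matches -- the order in which da[...] flattens.
def flatten(doc):
    yield doc["id"]
    for c in doc.get("chunks", []):
        yield from flatten(c)
    for m in doc.get("matches", []):
        yield from flatten(m)


root = {
    "id": "r",
    "chunks": [{"id": "c1", "chunks": [{"id": "cc1"}]}, {"id": "c2"}],
    "matches": [{"id": "m1"}],
}

print(list(flatten(root)))  # ['r', 'c1', 'cc1', 'c2', 'm1']
```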
-## Other Handy Helpers +## Other handy helpers ### Batching ```{tip} -To batch and process DocumentArray in parallel in a non-blocking way, please use {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` and refer to {ref}`map-batch`. +To batch and process a DocumentArray in parallel in a non-blocking way, use {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` and refer to {ref}`map-batch`. ``` -One can batch a large DocumentArray into small ones via {meth}`~docarray.array.mixins.group.GroupMixin.batch`. This is useful when a DocumentArray is too big to process at once. +You can batch a large DocumentArray into smaller ones with {meth}`~docarray.array.mixins.group.GroupMixin.batch`. This is useful when a DocumentArray is too big to process at once. ```python from docarray import DocumentArray @@ -314,7 +314,7 @@ da = DocumentArray.empty(1000).sample(10) ### Shuffling -Shuffling a DocumentArray inplace: +To shuffle a DocumentArray in-place: ```python from docarray import DocumentArray @@ -325,7 +325,7 @@ da.shuffle() ### Splitting by `.tags` -One can split a DocumentArray into multiple DocumentArrays according to the tag value (stored in `tags`) of each Document. +You can split a DocumentArray into multiple DocumentArrays according to a tag value (stored in `tags`) of each Document. It returns a Python `dict` where Documents with the same `tag` value are grouped together in a new DocumentArray, with their orders preserved from the original DocumentArray. ```python @@ -350,7 +350,6 @@ rv = da.split_by_tag(tag='category') 'a': } ``` - ## What's next? -Now we know how to select Documents from DocumentArray, next we learn how to {ref}`select attributes from DocumentArray`. Spoiler alert, it follows the same syntax. \ No newline at end of file +Now you know how to select Documents from DocumentArray, next you'll learn how to {ref}`select attributes from DocumentArray`. Spoiler alert: it follows the same syntax. 
diff --git a/docs/fundamentals/documentarray/construct.md b/docs/fundamentals/documentarray/construct.md index 3455dc44473..742da5b1219 100644 --- a/docs/fundamentals/documentarray/construct.md +++ b/docs/fundamentals/documentarray/construct.md @@ -1,7 +1,7 @@ (construct-array)= # Construct -## Construct an empty array +## Construct an empty DocumentArray ```python from docarray import DocumentArray @@ -13,7 +13,7 @@ da = DocumentArray() ``` -Now you can use list-like interfaces such as `.append()` and `.extend()` as you would add elements to a Python List. +Now you can use methods like `.append()` and `.extend()` just like you would in a Python List. ```python da.append(Document(text='hello world!')) @@ -24,7 +24,7 @@ da.extend([Document(text='hello'), Document(text='world!')]) ``` -Directly printing a DocumentArray does not show you too much useful information, you can use {meth}`~docarray.array.mixins.plot.PlotMixin.summary`. +Directly printing a DocumentArray doesn't show much useful information. For that you can use {meth}`~docarray.array.mixins.plot.PlotMixin.summary`. ```python @@ -63,7 +63,7 @@ da = DocumentArray.empty(10) ## Construct from list-like objects -You can construct DocumentArray from a `Sequence`, `List`, `Tuple` or `Iterator` that yields `Document` object. +You can construct DocumentArray from a `Sequence`, `List`, `Tuple` or `Iterator` that yields `Document` objects. ````{tab} From list of Documents ```python @@ -89,8 +89,7 @@ da = DocumentArray((Document() for _ in range(10))) ``` ```` - -As DocumentArray itself is also a "list-like object that yields `Document`", you can also construct DocumentArray from another DocumentArray: +As DocumentArray itself is also a "list-like object that yields `Document`s", you can also construct a DocumentArray from another DocumentArray: ```python da = DocumentArray(...) 
@@ -135,7 +134,7 @@

## Deep copy on elements

-Note that, as in Python list, adding Document object into DocumentArray only adds its memory reference. The original Document is *not* copied. If you change the original Document afterwards, then the one inside DocumentArray will also change. Here is an example,
+Note that, as in Python lists, adding a Document object into a DocumentArray only adds its memory reference. The original Document is *not* copied. If you change the original Document later, then the Document inside the DocumentArray will also change:

```python
from docarray import DocumentArray, Document
@@ -153,7 +152,7 @@ hello world
```

-This may surprise some users, but considering the following Python code, you will find this behavior is very natural and authentic.
+This may be surprising, but considering the following Python code, you'll see this behavior is very natural and authentic:

```python
d = {'hello': None}
@@ -169,7 +168,7 @@ None world
```

-To make a deep copy, set `DocumentArray(..., copy=True)`. Now all Documents in this DocumentArray are completely new objects with identical contents as the original ones.
+To make a deep copy, set `DocumentArray(..., copy=True)`. Now all Documents in this DocumentArray are completely new objects with contents identical to the original Documents.

```python
from docarray import DocumentArray, Document
@@ -189,7 +188,7 @@ hello
```

## Construct from local files

-You may recall the common pattern that {ref}`I mentioned here`. With {meth}`~docarray.document.generators.from_files` One can easily construct a DocumentArray object with all file paths defined by a glob expression.
+You may recall the common pattern that {ref}`we mentioned here`. With {meth}`~docarray.document.generators.from_files` you can easily construct a DocumentArray object with all file paths defined by a glob expression.
```python
from docarray import DocumentArray

da_jpg = DocumentArray.from_files('images/*.jpg')
da_png = DocumentArray.from_files('images/*.png')
da_all = DocumentArray.from_files(['images/**/*.png', 'images/**/*.jpg', 'images/**/*.jpeg'])
```

-This will scan all filenames that match the expression and construct Documents with filled `.uri` attribute. You can control if to read each as text or binary with `read_mode` argument.
-
-
-
+This scans all filenames that match the expression and constructs Documents with filled `.uri` attributes. You can control whether to read each file as text or binary using the `read_mode` argument.

## What's next?

-In the next chapter, we will see how to construct DocumentArray from binary bytes, JSON, CSV, dataframe, Protobuf message.
\ No newline at end of file
+In the next chapter, we'll see how to construct a DocumentArray from binary bytes, JSON, CSV, DataFrame, or Protobuf message.
diff --git a/docs/fundamentals/documentarray/embedding.md b/docs/fundamentals/documentarray/embedding.md
index e4946da4df8..bbe4036a4b8 100644
--- a/docs/fundamentals/documentarray/embedding.md
+++ b/docs/fundamentals/documentarray/embedding.md
@@ -6,7 +6,7 @@
 {meth}`~docarray.array.mixins.embed.EmbedMixin.embed` supports both CPU & GPU.
 ```
 
-When DocumentArray has `.tensors` set, you can use a neural network to {meth}`~docarray.array.mixins.embed.EmbedMixin.embed` it into vector representations, i.e. filling `.embeddings`. For example, our DocumentArray looks like the following:
+When DocumentArray has `.tensors` set, you can use a neural network to {meth}`~docarray.array.mixins.embed.EmbedMixin.embed` it into vector representations, i.e. filling `.embeddings`.
For example, let's assume we have the following DocumentArray: ```python from docarray import DocumentArray @@ -16,7 +16,7 @@ docs = DocumentArray.empty(10) docs.tensors = np.random.random([10, 128]).astype(np.float32) ``` -Let's use a simple MLP in Pytorch/Keras/ONNX/Paddle as our embedding model: +Let's use a simple MLP in PyTorch/Keras/ONNX/Paddle as our embedding model: ````{tab} PyTorch @@ -90,7 +90,7 @@ model = paddle.nn.Sequential( ``` ```` -Now, you can simply do +Now, you can create the embeddings: ```python docs.embed(model) @@ -104,13 +104,13 @@ tensor([[-0.1234, 0.0506, -0.0015, 0.1154, -0.1630, -0.2376, 0.0576, -0.4109, -0.2312, -0.0068, -0.0991, 0.0767, -0.0501, -0.1393, 0.0965, -0.2062, ``` -By default, the filled `.embeddings` is in the given model framework's format. If you want it always be `numpy.ndarray`, use `.embed(..., to_numpy=True)`. +By default, the filled `.embeddings` are in the given model framework's format. If you want them to always be `numpy.ndarray`, use `.embed(..., to_numpy=True)`. -You can specify `.embed(..., device='cuda')` when working with GPU. The device name identifier depends on the model framework that you are using. +You can specify `.embed(..., device='cuda')` when working with a GPU. The device name identifier depends on the model framework that you're using. -On large DocumentArray that does not fit into GPU memory, you can set `batch_size` via `.embed(..., batch_size=128)`. +On large DocumentArrays that don't fit into GPU memory, you can set `batch_size` with `.embed(..., batch_size=128)`. 
-You can use pretrained model from Keras/PyTorch/PaddlePaddle/ONNX for embedding:
+You can use a pretrained model from Keras/PyTorch/PaddlePaddle/ONNX for embedding:

```python
import torchvision

model = torchvision.models.resnet50(pretrained=True)
docs.embed(model)
```

-After getting `.embeddings`, you can visualize it using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_embeddings`, {ref}`find more details here`.
+After getting `.embeddings`, you can visualize them using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_embeddings`, {ref}`find more details here`.

-Note that `.embed()` only works when you have `.tensors` set, if you have `.texts` set and your model function supports string as the input, then you can always do the following to get embeddings:
+Note that `.embed()` only works when you have `.tensors` set. If you have `.texts` set and your model function supports strings as input, then you can do the following to generate embeddings:

```python
from docarray import DocumentArray
diff --git a/docs/fundamentals/documentarray/evaluation.md b/docs/fundamentals/documentarray/evaluation.md
index 7ad289ca2ff..1b63f952d31 100644
--- a/docs/fundamentals/documentarray/evaluation.md
+++ b/docs/fundamentals/documentarray/evaluation.md
@@ -1,8 +1,8 @@
 # Evaluate Matches
 
-After the execution of {meth}`~docarray.array.mixins.match.MatchMixin.match`, your `DocumentArray` receives a `.matches` attribute.
-You can evaluate those matches against the ground truth via {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate`.
-The ground truth describes which matches are relevant and non-relevant and can be provided in two formats: (1) a ground truth array or (2) in the form of labels.
+After executing {meth}`~docarray.array.mixins.match.MatchMixin.match`, your `DocumentArray` receives a `.matches` attribute.
+You can evaluate these matches against the ground truth via {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate`.
+The ground truth describes which matches are relevant and non-relevant, and can be provided in two formats: as a ground truth array or as labels. To demonstrate this, let's create a DocumentArray with random embeddings and match it to itself: @@ -33,8 +33,8 @@ da_original.summary() id ('str',) 10 False matches ('MatchArray',) 10 False ``` -Now `da.matches` contains the nearest neighbours. -To make our scenario more interesting, we mix in ten "noise Documents" to every `d.matches`: +Now `da.matches` contains the nearest neighbors. +To make this more interesting, let's mix in ten "noise Documents" in every `d.matches`: ```python da_prediction = DocumentArray(da_original, copy=True) @@ -66,16 +66,17 @@ da_prediction['@m'].summary() ## Evaluation against a ground truth array -To evaluate the matches against a ground truth array, you simply provide a DocumentArray to the evaluate function like `da_groundtruth` in the call below: +To evaluate the matches against a ground-truth array, you pass a DocumentArray (like `da_groundtruth`) to the `evaluate()` method: ```python da_predict.evaluate(ground_truth=da_groundtruth, metrics=['...'], **kwargs) ``` -Thereby, `da_groundtruth` should contain the same documents as in `da_prediction` where each `matches` attribute contains exactly those documents which are relevant to the respective root document. -The `metrics` argument determines the metric you want to use for your evaluation, e.g., `precision_at_k`. +Thereby, `da_groundtruth` should contain the same Documents as in `da_prediction`. Each `matches` attribute contains exactly those Documents which are relevant to the respective root document. -In the code cell below, we evaluate the array `da_prediction` with the noisy matches against the original one `da_original`: +You define the metrics you want to use for your evaluation (e.g. `precision_at_k`) with the `metrics` parameter. 
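What a metric like `precision_at_k` actually measures can be sketched in plain Python (a conceptual illustration of the metric itself, not DocArray's implementation; the Document ids are made up):

```python
def precision_at_k(match_ids, relevant_ids, k):
    """Fraction of the top-k matches that appear in the relevant set."""
    top_k = match_ids[:k]
    return sum(1 for m in top_k if m in relevant_ids) / k


# Ten matches, of which five are relevant:
matches = ['d0', 'd1', 'd2', 'd3', 'n0', 'n1', 'd4', 'n2', 'n3', 'n4']
relevant = {'d0', 'd1', 'd2', 'd3', 'd4'}

print(precision_at_k(matches, relevant, k=10))  # 0.5
print(precision_at_k(matches, relevant, k=4))  # 1.0
```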
+ +Let's evaluate the `da_prediction` DocumentArray (with the noisy matches) against `da_original`: ```python da_prediction.evaluate(ground_truth=da_original, metrics=['precision_at_k'], k=10) @@ -84,9 +85,8 @@ da_prediction.evaluate(ground_truth=da_original, metrics=['precision_at_k'], k=1 ```text {'precision_at_k': 0.45} ``` -It returns the average value for the `precision_at_k` metric. -The average is calculated over all Documents of `da_prediction`. -If you want to look at the individual evaluation values, you can check the {attr}`~docarray.Document.evaluations` attribute, e.g.: +This returns the average value for the `precision_at_k` metric, calculated over all Documents of `da_prediction`. +To see the individual evaluation values, check the {attr}`~docarray.Document.evaluations` attribute: ```python for d in da_prediction: @@ -108,12 +108,12 @@ for d in da_prediction: ### Document identifier -Note that the evaluation against a ground truth DocumentArray only works if both DocumentArrays have the same length and their nested structure is the same. +Note that evaluating a DocumentArray against a ground truth DocumentArray only works if both have the same length and nested structure. It makes no sense to evaluate with a completely different DocumentArray. -While evaluating, Document pairs are recognized as correct if they share the same identifier. By default, it simply uses {attr}`~docarray.Document.id`. One can customize this behavior by specifying `hash_fn`. +While evaluating, Document pairs are recognized as correct if they share the same identifier. By default, this is just {attr}`~docarray.Document.id`. You can customize this by specifying `hash_fn`. -Let's see an example by creating two DocumentArrays with some matches with identical texts. +Let's see an example by creating two DocumentArrays. 
Each DocumentArray has matches that are identical to each other, but differ from the matches of the other DocumentArray: ```python from docarray import DocumentArray, Document @@ -128,7 +128,7 @@ for d in g_da: d.matches.append(Document(text='my ground truth')) ``` -Now when you do evaluate, you will receive an error: +Now when you evaluate, you'll receive an error: ```python p_da.evaluate('average_precision', ground_truth=g_da) @@ -138,10 +138,10 @@ p_da.evaluate('average_precision', ground_truth=g_da) ValueError: Document from the left-hand side and from the right-hand are not hashed to the same value. This means your left and right DocumentArray may not be aligned; or it means your `hash_fn` is badly designed. ``` -This says that based on `.id` (default identifier), the given two DocumentArrays are so different that they can't be evaluated. -It is a valid point because our two DocumentArrays have completely random `.id`. +This says that based on `.id` (the default identifier), the two DocumentArrays are so different that they can't be evaluated. +It is a valid point because our two DocumentArrays have completely random `.id`s. -If we override the hash function as follows, the evaluation can be conducted: +If we override the hash function, the evaluation can proceed: ```python p_da.evaluate('average_precision', ground_truth=g_da, hash_fn=lambda d: d.text[:2]) @@ -151,14 +151,12 @@ p_da.evaluate('average_precision', ground_truth=g_da, hash_fn=lambda d: d.text[: {'average_precision': 1.0} ``` -It is correct as we define the evaluation as checking if the first two characters in `.text` are the same. - - +This is correct, as we define evaluation as checking if the first two characters in `.text` (in this case, `my`) are the same. ## Evaluation via labels -Alternatively, you can add labels to your documents to evaluate them. 
-In this case, a match is considered relevant to its root document if it has the same label: +Alternatively, you can evaluate your Documents by adding labels. +A match is considered relevant to its root Document if it has the same label: ```python import numpy as np @@ -176,12 +174,12 @@ example_da.evaluate(metrics=['precision_at_k']) {'precision_at_k': 0.5} ``` -Also here, the results are stored in the `.evaluations` field of each Document. +Also here, results are stored in the `.evaluations` attribute of each Document. ## Metric functions -DocArray provides common metrics used in the information retrieval community for evaluating the nearest-neighbour matches. -Some of those metrics accept additional arguments as `kwargs` which you can simply add to the call of the evaluate function: +DocArray provides common metrics used in the information retrieval community to evaluate nearest-neighbor matches. +Some of those metrics accept additional arguments as `kwargs` which you can add to the call of the `evaluate()` method: | Metric | Accept `kwargs` | |-----------------------------------------------------|------------------| @@ -196,12 +194,12 @@ Some of those metrics accept additional arguments as `kwargs` which you can simp | {meth}`~docarray.math.evaluation.ndcg_at_k` | `method`, `k` | ```{danger} -These metric scores might change if the `limit` argument of the match function is set differently. +These metric scores might change if you set the `limit` argument of the match method differently. -**Note:** Not all of these metrics can be applied to a Top-K result, i.e., `ndcg_at_k` and `r_precision` are calculated correctly only if the limit is set equal or higher than the number of documents in the `DocumentArray` provided to the match function. 
+**Note:** Not all of these metrics can be applied to a top-K result, i.e., `ndcg_at_k` and `r_precision` are calculated correctly only if the limit is set equal to or higher than the number of Documents in the DocumentArray provided to the match method.
```

-You can evaluate multiple metric functions at once, as you can see below:
+You can evaluate multiple metric functions at once:

```python
da_prediction.evaluate(
@@ -213,15 +211,14 @@ da_prediction.evaluate(
{'precision_at_k': 0.45, 'reciprocal_rank': 0.8166666666666667}
```

-In this case, the keyword argument `k` is passed to all metric functions, even though it does not fulfill any specific function for the calculation of the reciprocal rank.
+In this case, the keyword argument `k` is passed to all metric functions, even though it plays no role in calculating the reciprocal rank.

### Custom metrics

-If the pre-defined metrics do not fit your use-case, you can define a custom metric function.
-It should take as input a list of binary relevance judgements of a query (`1` and `0` values).
+If pre-defined metrics don't fit your use case, you can define a custom metric function, taking as input a list of binary relevance judgements of a query (`1` and `0` values).
The evaluate function already calculates this binary list from the `matches` attribute so that each number represents the relevancy of a match.

-Let's write a custom metric function, which counts the number of relevant documents per query:
+Let's write a custom metric function, which counts the number of relevant Documents per query:

```python
def count_relevant(binary_relevance):
@@ -235,12 +232,12 @@ da_prediction.evaluate(ground_truth=da_original, metrics=[count_relevant])
{'count_relevant': 9.0}
```

-For an inspiration for writing your own metric function, you can take a look at DocArray's {mod}`~docarray.math.evaluation` module, which contains the implementations of the custom metric functions. 
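As a plain-Python illustration of the binary-relevance interface described above, here is roughly what two of the predefined metrics compute. The function names mirror DocArray's metric names, but the bodies are simplified sketches, not DocArray's actual implementations:

```python
# Illustrative sketches (not DocArray's actual code) of two metrics that
# operate on a binary relevance list, as the custom-metric interface expects.


def precision_at_k(binary_relevance, k=10):
    """Fraction of the first k matches that are relevant."""
    top_k = binary_relevance[:k]
    return sum(top_k) / max(len(top_k), 1)


def reciprocal_rank(binary_relevance):
    """1 / rank of the first relevant match, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(binary_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


print(precision_at_k([1, 0, 1, 1, 0], k=5))  # 0.6
print(reciprocal_rank([0, 0, 1, 0]))  # 0.333...
```

Averaging such per-query values over all Documents gives the numbers that `evaluate()` reports.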
+As inspiration for writing your own metric function, see DocArray's {mod}`~docarray.math.evaluation` module, which contains the implementations of the predefined metric functions.

### Custom names

-By default, the metrics are stored with the name of the metric function.
-Alternatively, you can customize those names via the `metric_names` argument of the `evaluate` function:
+By default, metrics are stored with the name of the metric function.
+Alternatively, you can customize those names with the `metric_names` argument of the `evaluate` method:

```python
da_prediction.evaluate(
@@ -254,11 +251,11 @@ da_prediction.evaluate(
{'#Relevant': 9.0, 'Precision@K': 0.47368421052631576}
```

-## Embed, match & evaluate at once
+## Embed, match and evaluate at once

-Instead of executing the functions {meth}`~docarray.array.mixins.embed.EmbedMixin.embed`, {meth}`~docarray.array.mixins.match.MatchMixin.match`, and {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate` separately from each other, you can also execute them all at once by using {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.embed_and_evaluate`.
-To demonstrate this, we constuct two labeled DocumentArrays `example_queries` and `example_index`.
-The second one `example_index` should be matched with `example_queries` and afterwards, we want to evaluate the reciprocal rank based on the labels of the matches in `example_queries`.
+Instead of executing the methods {meth}`~docarray.array.mixins.embed.EmbedMixin.embed`, {meth}`~docarray.array.mixins.match.MatchMixin.match`, and {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.evaluate` separately, you can execute them all at once with {meth}`~docarray.array.mixins.evaluation.EvaluationMixin.embed_and_evaluate`.
+
+To demonstrate this, let's construct two labeled DocumentArrays `example_queries` and `example_index`. 
`example_index` should be matched with `example_queries` and then we want to evaluate the reciprocal rank based on the matches' labels in `example_queries`. ```python import numpy as np @@ -284,20 +281,17 @@ print(result) ### Batch-wise matching -The ``embed_and_evaluate`` function is especially useful, when you need to evaluate the queries on a very large document collection (`example_index` in the code snippet above), which is too large to store the embeddings of all documents in main-memory. -In this case, ``embed_and_evaluate`` matches the queries to batches of the document collection. -After the batch is processed all embeddings are deleted. -By default, the batch size for the matching (`match_batch_size`) is set to `100_000`. -If you want to reduce the memory footprint, you can set it to a lower value. +``embed_and_evaluate`` is especially useful to evaluate queries on a Document collection (like `example_index`) which is too large to fit the embeddings of all Documents in main memory. In this case, the method matches queries to batches of the Document collection, then deletes embeddings after processing each batch. + +By default, the batch size for the matching (`match_batch_size`) is set to `100_000`. To reduce the memory footprint, you can set it to a lower value. ### Sampling Queries -If you want to evaluate a large dataset, it might be useful to sample query documents. -Since the metric values returned by the `embed_and_evaluate` are mean values, sampling should not change the result significantly if the sample is large enough. -By default, sampling is applied for `DocumentArray` objects with more than 1,000 documents. -However, it is only applied on the `DocumentArray` itself and not on the document provided in `index_data`. -If you want to change the number of samples, you can ajust the `query_sample_size` argument. 
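To see why a sampled mean metric stays close to the full mean, here is a small plain-Python experiment. The per-query scores are hypothetical stand-ins for metric values; nothing DocArray-specific is involved:

```python
import random

random.seed(42)  # fixed seed so the sketch is deterministic

# Hypothetical per-query metric values (e.g. precision@k) for 10,000 queries.
scores = [random.random() for _ in range(10_000)]

full_mean = sum(scores) / len(scores)

# Evaluate only a random sample of 1,000 queries, as sampling does by default.
sample = random.sample(scores, 1_000)
sample_mean = sum(sample) / len(sample)

print(abs(full_mean - sample_mean) < 0.05)  # True: the sampled mean is close
```

The standard error of a mean shrinks with the square root of the sample size, which is why 1,000 sampled queries usually suffice.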
-In the following code block an evaluation is done with 100 samples: +To evaluate a large dataset, it might be useful to sample query Documents. +Since the metric values returned by `embed_and_evaluate` are mean values, sampling shouldn't significantly change the result if the sample is large enough. +By default, sampling is applied for DocumentArrays with over 1,000 Documents. However, it's only applied on the `DocumentArray` itself and not on the Document provided in `index_data`. + +To change the number of samples, you can adjust the `query_sample_size` argument. In the following code block an evaluation is performed with 100 samples: ```python import numpy as np @@ -323,9 +317,9 @@ da.embed_and_evaluate( {'precision_at_k': 0.13649999999999998} ``` -Please note that in this way only documents which are actually evaluated obtain an `.evaluations` attribute. +Note that in this way, only Documents which are actually evaluated obtain an `.evaluations` attribute. -To test how close it is to the exact result, we execute the function again with `query_sample_size` set to 1,000: +To test how close it is to the exact result, you can execute the function again with `query_sample_size` set to `1_000`: ```python da.embed_and_evaluate( diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md index 6cba8ea373e..6a5030a5c34 100644 --- a/docs/fundamentals/documentarray/find.md +++ b/docs/fundamentals/documentarray/find.md @@ -1,19 +1,18 @@ (find-documentarray)= # Query by Conditions -We can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based the conditions specified in a `query` object. One can use `da.find(query)` to filter Documents and get nearest neighbours from `da`: +You can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based on conditions specified in a `query` object. 
- To filter Documents, the `query` object is a Python dictionary object that defines the filtering conditions using a [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/)-like query language. -- To find nearest neighbours, the `query` object needs to be a NdArray-like, a Document, or a DocumentArray object that defines embedding. One can also use `.match()` function for this purpose, and there is a minor interface difference between these two functions, which will be described {ref}`in the next chapter`. +- To find nearest neighbours, the `query` object needs to be an ndarray-like, Document, or DocumentArray that defines embedding(s). You can also use the `.match()` function for this purpose, and there's a minor interface difference between these two functions which is covered {ref}`in the next chapter`. ```{admonition} filter query syntax :class: note -The syntax to define filter queries is dependant of the {ref}`Document store ` used. Some will have their own query language -depending on the supporting backend. +The filter query syntax depends on which {ref}`document store ` you use. Some may have their own query language. ``` -Let's see some examples in action. First, let's prepare a DocumentArray we will use. +Let's see some examples in action. First, let's prepare a DocumentArray: ```python from jina import Document, DocumentArray @@ -76,13 +75,13 @@ da.summary() ## Filter with query operators -A query filter document can use the query operators to specify conditions in the following form: +A query filter document uses query operators to specify conditions: ```text { : { : }, ... } ``` -Here `field1` is {ref}`any field name` of a Document object. To access nested fields, one can use the dunder expression. For example, `tags__timestamp` is to access `doc.tags['timestamp']` field. +Here `field1` is {ref}`any field name` of a Document object. To access nested fields, you can use the dunder expression. 
For example, `tags__timestamp` accesses the `doc.tags['timestamp']` field. `value1` can be either a user given Python object, or a substitution field with curly bracket `{field}` @@ -103,7 +102,7 @@ Finally, `operator1` can be one of the following: | `$exists` | Matches documents that have the specified field. And empty string content is also considered as not exists. | -For example, to select all `modality='D'` Documents, +To select all `modality='D'` Documents: ```python r = da.find({'modality': {'$eq': 'D'}}) @@ -143,7 +142,7 @@ r = da.find({'tags__h': {'$gt': 10}}) "weight": 75.0}] ``` -Beside using a predefined value, one can also use a substitution with `{field}`, notice the curly brackets there. For example, +Beside using a predefined value, you can also use a substitution with `{field}`. Notice those curly braces. For example: ```python r = da.find({'tags__h': {'$gt': '{tags__w}'}}) @@ -157,12 +156,9 @@ r = da.find({'tags__h': {'$gt': '{tags__w}'}}) "weight": 25.0}] ``` - - ## Combine multiple conditions - -You can combine multiple conditions using the following operators +You can combine multiple conditions using the following operators: | Boolean Operator | Description | |------------------|----------------------------------------------------| @@ -170,8 +166,6 @@ You can combine multiple conditions using the following operators | `$or` | Join query clauses with a logical OR | | `$not` | Inverts the effect of a query expression | - - ```python r = da.find({'$or': [{'weight': {'$eq': 45}}, {'modality': {'$eq': 'D'}}]}) ``` diff --git a/docs/fundamentals/documentarray/index.md b/docs/fundamentals/documentarray/index.md index 2e01be4c281..33223fb4065 100644 --- a/docs/fundamentals/documentarray/index.md +++ b/docs/fundamentals/documentarray/index.md @@ -1,28 +1,28 @@ (documentarray)= # DocumentArray -This is a Document, we already know it can be a mix in data types and nested in structure: +This is a Document, we already know it can have different data types 
and a nested structure: ```{figure} images/docarray-single.svg :width: 30% ``` -Then this is a DocumentArray: +This is a DocumentArray: ```{figure} images/docarray-array.svg :width: 80% ``` -{class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** when working with multiple Documents. +A {class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** to work with multiple Documents. -In a nutshell, you can simply consider it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use Python `list`, you already know how to use DocumentArray. +In a nutshell, you can simply consider it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use Python's `list`, you already know how to use DocumentArray. -It is also powerful as Numpy `ndarray` and Pandas `DataFrame`, allowing you to efficiently [access elements](access-elements.md) and [attributes](access-attributes.md) of contained Documents. +It is also as powerful as Numpy `ndarray` and Pandas `DataFrame`, letting you efficiently [access elements](access-elements.md) and [attributes](access-attributes.md) of contained Documents. -What makes it more exciting is those advanced features of DocumentArray. These features greatly accelerate data scientists work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching etc. +DocumentArray's advanced features make it even more exciting. These features greatly speed up accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching etc. -Finally, if your data is too big to fit into memory, you can simply switch to an {ref}`on-disk/remote document store`. All API and user experiences remain the same. No need to learn anything else. 
+Finally, if your data is too big to fit in memory, you can simply switch to an {ref}`on-disk/remote document store`. The full API and user experience remain the same. There's no need to learn anything else. ## What's next? diff --git a/docs/fundamentals/documentarray/matching.md b/docs/fundamentals/documentarray/matching.md index 353a2a151e5..19660e6e6c0 100644 --- a/docs/fundamentals/documentarray/matching.md +++ b/docs/fundamentals/documentarray/matching.md @@ -6,30 +6,32 @@ {meth}`~docarray.array.mixins.match.MatchMixin.match` and {meth}`~docarray.array.mixins.find.FindMixin.find` support both CPU & GPU. ``` -Once `.embeddings` is set, one can use {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` function to find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings` and distance metrics. - +Once `.embeddings` is set, you can use the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` method to find the nearest-neighbour Documents from another DocumentArray (or the current DocumentArray itself) based on their `.embeddings` and distance metrics. ## Difference between find and match -Though both `.find()` and `.match()` is about finding nearest neighbours of a given "query" and both accpet similar arguments, there are some differences between them: +Though both `.find()` and `.match()` are about finding nearest neighbours of a given "query" and both accept similar arguments, there are some differences: + +##### Which side is the query on? -##### Which side is the query at? -- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents, to find one query's nearest neighbours you should write `one_million_docs.find(query)`; -- `.match()` assumes the query is on left-hand side. 
`A.match(B)` semantically means "A matches against B and save the results to A". So with `.match()` you should write `query.match(one_million_docs)`.
+- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents, to find a query's nearest neighbours you should use `one_million_docs.find(query)`;
+- `.match()` assumes the query is on the left-hand side. `A.match(B)` semantically means "A matches against B and saves the results to A". So with `.match()` you should use `query.match(one_million_docs)`.

-##### What is the type of the query?
- - query (RHS) in `.find()` can be plain NdArray-like object or a single Document or a DocumentArray.
- - query (LHS) in `.match()` can be either a Document or a DocumentArray.
+##### What's the query type?
+
+- The query (on the right) in `.find()` can be a plain ndarray-like object, single Document, or DocumentArray.
+- The query (on the left) in `.match()` can be either a Document or DocumentArray.

##### What is the return?
- - `.find()` returns a List of DocumentArray, each of which corresponds to one element/row in the query.
- - `.match()` do not return anything. Match results are stored inside left-hand side's `.matches`.
-In the sequel, we will use `.match()` to describe the features. But keep in mind that `.find()` should also work by simply switching the right and left-hand sides.
+- `.find()` returns a List of DocumentArrays, each of which corresponds to one element/row in the query.
+- `.match()` doesn't return anything. Matched results are stored inside the left-hand side's `.matches`.
+
+Moving forwards, we'll use `.match()`. But bear in mind you could also use `.find()` by switching the right and left-hand sides.

### Example

-The following example finds for each element in `da1` the three closest Documents from the elements in `da2` according to Euclidean distance. 
+In the following example, for each element in `da1`, we'll find the three closest Documents from the elements in `da2` based on Euclidean distance. ````{tab} Dense embedding ```{code-block} python @@ -125,13 +127,13 @@ match emb = (0, 0) 1.0 ```` -The above example when writing with `.find()`: +The above example when using `.find()`: ```python da2.find(da1, metric='euclidean', limit=3) ``` -or simply: +Or simply: ```python da2.find( @@ -145,21 +147,19 @@ The following metrics are supported: | Metric | Frameworks | |----------------------------------------------------------------------------------------------------------------------|-------------------------------------------| -| `cosine` | Scipy, Numpy, Tensorflow, Pytorch, Paddle | -| `sqeuclidean` | Scipy, Numpy, Tensorflow, Pytorch, Paddle | -| `euclidean` | Scipy, Numpy, Tensorflow, Pytorch, Paddle | -| [Metrics supported by Scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) | Scipy | +| `cosine` | SciPy, NumPy, TensorFlow, PyTorch, Paddle | +| `sqeuclidean` | SciPy, NumPy, TensorFlow, PyTorch, Paddle | +| `euclidean` | SciPy, NumPy, TensorFlow, PyTorch, Paddle | +| [Metrics supported by SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) | SciPy | | User defined callable | Depending on the callable | -Note that framework is auto-chosen based on the type of `.embeddings`. For example, if `.embeddings` is a Tensorflow Tensor, then Tensorflow will be used for computing. One exception is when `.embeddings` is a Numpy `ndarray`, you can choose to use Numpy or Scipy (by specify `.match(..., use_scipy=True)`) for computing. - -By default `A.match(B)` will copy the top-K matched Documents from B to `A.matches`. When these matches are big, copying them can be time-consuming. In this case, one can leverage `.match(..., only_id=True)` to keep only {attr}`~docarray.Document.id`. 
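To make the top-K mechanics concrete, here is a rough NumPy-only sketch of what a Euclidean top-K match computes. It is illustrative only, not DocArray's actual implementation, and `top_k_euclidean` is a hypothetical helper name:

```python
import numpy as np


def top_k_euclidean(query_emb, index_emb, k=3):
    """Return (indices, distances) of the k nearest rows of index_emb
    to each row of query_emb, by Euclidean distance."""
    # Pairwise distances via broadcasting: (Q, 1, D) - (1, M, D) -> (Q, M)
    dists = np.linalg.norm(query_emb[:, None, :] - index_emb[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)


q = np.array([[0.0, 0.0]])  # one query embedding
m = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])  # three index embeddings
idx, d = top_k_euclidean(q, m, k=2)
print(idx)  # [[1 2]] -> nearest is [1, 0] (distance 1), then [0, 2] (distance 2)
```

The broadcast in the sketch materializes a `(Q, M)` distance matrix, which is exactly why matching against a huge collection is done batch by batch.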
- +Note that the framework is chosen automatically based on the type of `.embeddings`. For example, if `.embeddings` is a TensorFlow Tensor, then TensorFlow is used for computing. One exception is when `.embeddings` is a NumPy `ndarray`, you can choose to compute with either NumPy or SciPy (by specifying `.match(..., use_scipy=True)`). +By default `A.match(B)` copies the top-K matched Documents from B to `A.matches`. When these matches are big, copying can be time-consuming. In this case, you can leverage `.match(..., only_id=True)` to keep only {attr}`~docarray.Document.id`. ### GPU support -If `.embeddings` is a Tensorflow tensor, PyTorch tensor or Paddle tensor, `.match()` function can work directly on GPU. To do that, simply set `device=cuda`. For example, +If `.embeddings` is a TensorFlow, PyTorch, or Paddle tensor, `.match()` can work directly on the GPU. To do this, set `device=cuda`: ```python from docarray import DocumentArray @@ -174,7 +174,7 @@ da2.embeddings = torch.tensor(np.random.random([10, 256])) da1.match(da2, device='cuda') ``` -Similar as in {meth}`~docarray.array.mixins.embed.EmbedMixin.embed`, if a DocumentArray is too large to fit into GPU memory, one can set `batch_size` to alleviate the problem of OOM on GPU. 
+Like {meth}`~docarray.array.mixins.embed.EmbedMixin.embed`, if a DocumentArray is too large to fit into GPU memory, you can set `batch_size` to alleviate the problem of OOM on GPU: ```python da1.match(da2, device='cuda', batch_size=256) @@ -193,7 +193,7 @@ da1 = DocumentArray.empty(Q) da2 = DocumentArray.empty(M) ``` -````{tab} on CPU via Numpy +````{tab} on CPU via NumPy ```python import numpy as np @@ -230,5 +230,3 @@ da1.match(da2, device='cuda', batch_size=1_000, only_id=True) ``` ```` - - diff --git a/docs/fundamentals/documentarray/parallelization.md b/docs/fundamentals/documentarray/parallelization.md index 3a4e22f30a3..dbde191f648 100644 --- a/docs/fundamentals/documentarray/parallelization.md +++ b/docs/fundamentals/documentarray/parallelization.md @@ -3,14 +3,14 @@ ```{seealso} - {meth}`~docarray.array.mixins.parallel.ParallelMixin.map`: to parallel process Document by Document, return an interator of elements; - {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch`: to parallel process minibatch DocumentArray, return an iterator of DocumentArray; -- {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply`: like `.map()`, modify DocumentArray inplace; -- {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply_batch`: like `.map_batch()`, modify DocumentArray inplace. +- {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply`: like `.map()`, modify a DocumentArray inplace; +- {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply_batch`: like `.map_batch()`, modify a DocumentArray inplace. ``` -Working with large DocumentArray in element-wise can be time-consuming. The naive way is to run a for-loop and enumerate all Document one by one. DocArray provides {meth}`~docarray.array.mixins.parallel.ParallelMixin.map` to speed up things quite a lot. It is like Python -built-in `map()` function but mapping the function to every element of the DocumentArray in parallel. 
There is also {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` that works on the minibatch level. +Working with large DocumentArrays in an element-wise manner can be time-consuming. The naive way is to run a for-loop and enumerate Documents one by one. DocArray provides {meth}`~docarray.array.mixins.parallel.ParallelMixin.map` to speed up things a lot. It's like Python's +built-in `map()` function but maps the function to every element of the DocumentArray in parallel. There is also {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` that works on the minibatch level. -`map()` returns an iterator of processed Documents. If you only modify elements in-place, and do not need the return values, you can use {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply` instead: +`map()` returns an iterator of processed Documents. If you only modify elements in-place, and don't need the return values, you can use {meth}`~docarray.array.mixins.parallel.ParallelMixin.apply` instead: ```python from docarray import DocumentArray @@ -23,7 +23,7 @@ da.apply(func) This is often more popular than `map()` in practice. However, `map()` has its own charm as we shall see in the next section. -Let's see an example, where we want to preprocess ~6000 image Documents. First we fill the URI to each Document. +Let's see an example, where we want to preprocess about 6,000 image Documents. First we fill the URI of each Document: ```python from docarray import DocumentArray @@ -31,7 +31,7 @@ from docarray import DocumentArray docs = DocumentArray.from_files('*.jpg') # 6016 image Documents with .uri set ``` -To load and preprocess `docs`, we have: +Now let's load and preprocess the Documents: ```python def foo(d): @@ -42,7 +42,7 @@ def foo(d): ) ``` -This load the image from file into `.tensor` do some normalization and set the channel axis. 
Now, let's compare the time difference when we do things sequentially and use `.apply()`:
+This loads the image from the file into `.tensor`, does some normalization, and sets the channel axis. Now, let's see the time difference when we do things sequentially compared to using `.apply()`:


````{tab} For-loop

@@ -64,27 +64,27 @@ foo-loop ... foo-loop takes 11.5 seconds
apply ... apply takes 3.5 seconds
```

-One can see a significant speedup with `.apply()`.
+You can see a significant speedup with `.apply()`.

By default, parallelization is conducted with `thread` backend, i.e. multi-threading. It also supports `process` backend by setting `.apply(..., backend='process')`.

```{admonition} When to choose process or thread backend?
:class: important

-It depends on how your `func` in `.apply(func)` look like, here are some tips:
-- First, if you want `func` to modify elements inplace, the you can only use `thread` backend. With `process` backend you can only rely on the return values of `.map()`, the modification happens inside `func` is lost.
-- Second, follow what people often suggests: IO-bound `func` uses `thread`, CPU-bound `func` uses `process`.
-- Last, ignore the second rule and what people told you. Test it by yourself and use whatever faster.
+It depends on your `func` in `.apply(func)`. Here are some tips:
+- First, if you want `func` to modify elements inplace, then only use the `thread` backend. With the `process` backend you can only rely on the return values of `.map()` -- the modification that happens inside `func` is lost.
+- Second, follow what people often suggest: IO-bound `func` uses `thread`, CPU-bound `func` uses `process`.
+- Last, ignore everything above. Test it for yourself and use whatever's faster. 
```

(map-batch)=
## Use `map_batch()` to overlap CPU & GPU computation

-As I said, {meth}`~docarray.array.mixins.parallel.ParallelMixin.map` / {meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` has its own charm: it returns an iterator (of batch) where the partial result is immediately available, *regardless* if your function is still running. One can leverage this feature to speedup computation, especially when working with a CPU-GPU pipeline.
+As I said, {meth}`~docarray.array.mixins.parallel.ParallelMixin.map`/{meth}`~docarray.array.mixins.parallel.ParallelMixin.map_batch` has its own charm: it returns an iterator (of batches) where each partial result is immediately available, *regardless* of whether your function is still running. You can leverage this to speed up computation, especially when working with a CPU-GPU pipeline.

-Let's see an example, say we have a DocumentArray with 1024 Documents, assuming we can run a CPU job for a 16-Document batch in 1 second/core; and we can run a GPU job for a 16-Document batch in 2 second/core. Say we have 4 CPU core and 1 GPU core as the total resources.
+Let's see an example: Say we have a DocumentArray with 1,024 Documents. Assuming we can run a CPU job for a 16-Document batch in 1 second per core; and we can run a GPU job for a 16-Document batch in 2 seconds per core. Say we have four CPU cores and one GPU core as the total resources.

-Question: **how long will it take to process 1024 Documents?**
+Question: **how long will it take to process 1,024 Documents?**


```python
@@ -107,7 +107,7 @@ def gpu_job(da):
```

-Before jump to the code, lets first whiteboard it, do a simple math:
+Before jumping to the code, let's first whiteboard it with simple math:

```text
CPU time: 1024/16/4 * 1s = 16s
GPU time: 1024/16/1 * 2s = 128s
Total time: 16s + 128s = 144s
```

So 144s, right? Yes, if we implement with `apply()`, it is around 144s.

-However, we can do better. What if we overlap the computation of CPU and GPU? 
The whole procedure is anyway GPU bounded. If we can make sure GPU works on every batch **right away** when it is ready from CPU, rather than waits until all batches are ready from CPU, then we can save a lot of time. To be precise, we could do it in _129s_.
+However, we can do better. What if we overlap the computation of CPU and GPU? The whole procedure is GPU-bound anyway. If we can ensure the GPU works on every batch **right away** when it's ready from the CPU, rather than wait until all batches are ready from the CPU, then we can save a lot of time. To be precise, we could do it in _129s_.

-```{admonition} Why 129s? Why not 128s
+```{admonition} Why 129s? Why not 128s?
:class: tip

-Btw, if you immedidately know the answer you should [send your CV to us](https://jobs.jina.ai/).
+If you immediately know the answer, [send your CV to us](https://jobs.jina.ai/)!

-Because the very first batch must be done by the CPU first, this is inevitible, which makes the 1 second non-overlapping. The rest of the time will be overlapped and dominated by GPU's 128s. Hence, 1s + 128s = 129s.
+Because the very first batch must be done by the CPU first. This is inevitable, which makes the first second non-overlapping. The rest of the time will be overlapped and dominated by the GPU's 128s. Hence, 1s + 128s = 129s.
```

Okay, let's program these two ways and validate our guess:
@@ -144,18 +144,18 @@ for b in da.map_batch(cpu_job, batch_size=16, num_worker=4):
```
````

-Which gives you,
+Which gives you:

```text
apply: 144.476s
map: 129.326s
```

-Hope this sheds the light on solving the data-draining/blocking problem when you use DocArray in a CPU-GPU pipeline.
+Hopefully this sheds light on solving the data-draining/blocking problem when you use DocArray in a CPU-GPU pipeline.

## Use `map_batch()` to overlap CPU and network time

-Such technique and mindset can be extended to other pipeline that has potential data-blocking issue.
For example, in the implementation of {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push`, you will find code similar to below:
+Such a technique and mindset can be extended to other pipelines that have potential data-blocking issues. For example, in the implementation of {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push`, you'll find code similar to the following:

```{code-block} python
---
@@ -184,4 +184,4 @@ response = requests.post(
)
```

-This overlaps the time of sending network request (IO-bounded) with the time of serializing DocumentArray (CPU-bounded) and hence improve the performance a lot.
\ No newline at end of file
+This overlaps the time of sending network requests (IO-bound) with the time of serializing DocumentArrays (CPU-bound) and hence improves performance a lot.
diff --git a/docs/fundamentals/documentarray/post-external.md b/docs/fundamentals/documentarray/post-external.md
index 091846c7559..7f3014a9dd7 100644
--- a/docs/fundamentals/documentarray/post-external.md
+++ b/docs/fundamentals/documentarray/post-external.md
@@ -2,12 +2,12 @@
# Process via External Flow or Executor

```{tip}
-This feature requires `jina` dependency. Please install Jina via `pip install -U jina`.
+This feature requires the `jina` dependency. Please install it by running `pip install -U jina`.
```

-You can call an external Flow/Sandbox/Executor to "process" a DocumentArray via {meth}`~docarray.array.mixins.post.PostMixin.post`. The external Flow/Executor can be either local, remote, or inside Docker container.
+You can call an external Flow/Sandbox/Executor to "process" a DocumentArray via {meth}`~docarray.array.mixins.post.PostMixin.post`. The external Flow/Executor can be local, remote, or inside a Docker container.
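The address you pass to `.post` is an ordinary URI, so you can reason about its parts with Python's standard library. A minimal sketch (the address below is a hypothetical example; nothing is actually contacted):

```python
from urllib.parse import urlparse

# Hypothetical Flow address; this only parses the string, no server is reached.
addr = urlparse('grpc://192.168.2.3:12345/foo')

print(addr.scheme)    # protocol to use, e.g. 'grpc'
print(addr.hostname)  # host of the Flow
print(addr.port)      # port of the Flow
print(addr.path)      # endpoint to call, e.g. '/foo'
```

The same decomposition (scheme, host, port, path) is what the examples below rely on.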
-For example, to use an existing Flow on `192.168.2.3` on port `12345` to process a DocumentArray: +For example, to process a DocumentArray with an existing Flow at `192.168.2.3` on port `12345`: ```python from docarray import DocumentArray @@ -18,7 +18,8 @@ r = da.post('grpc://192.168.2.3:12345') r.summary() ``` -One can also use any [Executor from Jina Hub](https://cloud.jina.ai), e.g. +You can also use any Executor from [Executor Hub](https://cloud.jina.ai): + ```python from docarray import DocumentArray, Document @@ -45,7 +46,7 @@ r.summary() uri ('str',) 1 False ``` -Single Document has a sugar syntax that leverages this feature. Hence the above example can be also written as follows: +Single Documents have syntactic sugar that leverages this processing, meaning you can also write the above example as follows: ```python from docarray import Document @@ -56,7 +57,7 @@ r = d.post('jinahub+sandbox://CoquiTTS7') ## Accept schemes -{meth}`~docarray.array.mixins.post.PostMixin.post` accepts a URI-like scheme that supports a wide range of Flow/Hub Executor. It is described as below: +{meth}`~docarray.array.mixins.post.PostMixin.post` accepts a URI-like scheme, supporting a wide range of Flows/Hub Executors: ```text scheme://netloc[:port][/path] @@ -64,22 +65,23 @@ scheme://netloc[:port][/path] | Attribute | Supported Values | Meaning | |-----------|---------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------| -| `scheme` | 1. One of `grpc`, `websocket`, `http` | `protocol` of the connected Flow | -| | 2. One of `jinahub`, `jinahub+docker`, `jinhub+sandbox` | Jina hub executor in source code, Docker container, sandbox | -| `netloc` | 1. Host address | `host` of the connected Flow | -| | 2. Hub Executor name | Any Executor [listed here](https://cloud.jina.ai) | -| | 3. 
Executor version(optional) | Such as v0.1.1, v0.1.1-gpu, by default latest | -| `:port` | e.g. `:55566` | `port` of the connected Flow. This is required when using `scheme` type (1) ; it is ignored when using hub-related `scheme` type (2) | -| `/path` | e.g. `/foo` | The endpoint of the Executor you want to call. | +| `scheme` | 1. One of `grpc`, `websocket`, `http` | `protocol` of connected Flow | +| | 2. One of `jinaai`, `jinaai+docker`, `jinaai+sandbox` | Executor Hub Executor in source code/Docker container/sandbox | +| `netloc` | 1. Host address | `host` of connected Flow | +| | 2. Hub Executor name | Any [Hub Executor](https://cloud.jina.ai) | +| | 3. Executor version (optional) | e.g. `v0.1.1`, `v0.1.1-gpu`. `latest` by default | +| `:port` | e.g. `:55566` | `port` of connected Flow. Required when using `scheme` type (1); ignored when using Hub-related `scheme` type (2) | +| `/path` | e.g. `/foo` | Endpoint of Executor you want to call. | Some examples: -- `.post('websocket://localhost:8081/foo')`: call the `/foo` endpoint of the Flow on `localhost` port `8081` with `websocket` protocol to process the DocumentArray; processing is on local. -- `.post('grpc://192.168.12.2:12345/foo')`: call the `/foo` endpoint of the Flow on `192.168.12.2` port `12345` with `grpc` protocol to process the DocumentArray; processing is on remote. -- `.post('jinahub://Hello/foo')`: call the `/foo` endpoint of the Hub Executor `Hello` to process the DocumentArray; porcessing is on local. -- `.post('jinahub+sandbox://Hello/foo')`: call the `/foo` endpoint of the Hub Sandbox `Hello` to process the DocumentArray; porcessing is on remote. -- `.post('jinahub+docker://Hello/v0.5.0/foo')`: call the `/foo` endpoint of the Hub Sandbox `Hello` of version `v0.5.0` to process the DocumentArray; porcessing in container. 
+
+- `.post('websocket://localhost:8081/foo')`: call the `/foo` endpoint of the Flow at `localhost` port `8081` with `websocket` protocol to process the DocumentArray; processing is local.
+- `.post('grpc://192.168.12.2:12345/foo')`: call the `/foo` endpoint of the Flow at `192.168.12.2` port `12345` with `grpc` protocol to process the DocumentArray; processing is remote.
+- `.post('jinahub://Hello/foo')`: call the `/foo` endpoint of the Hub Executor `Hello` to process the DocumentArray; processing is local.
+- `.post('jinahub+sandbox://Hello/foo')`: call the `/foo` endpoint of the Hub Sandbox `Hello` to process the DocumentArray; processing is remote.
+- `.post('jinahub+docker://Hello/v0.5.0/foo')`: call the `/foo` endpoint of the Hub Executor `Hello` at version `v0.5.0` to process the DocumentArray; processing is in a Docker container.

## Read more

-For more explanation of Flow, Hub Executor and Sandbox, please refer to [Jina docs](https://docs.jina.ai).
+For a deeper explanation of Flow, Hub Executor and Sandbox, refer to [Jina's docs](https://docs.jina.ai).
diff --git a/docs/fundamentals/documentarray/serialization.md b/docs/fundamentals/documentarray/serialization.md
index c6d2c568fc8..6fb37789dd5 100644
--- a/docs/fundamentals/documentarray/serialization.md
+++ b/docs/fundamentals/documentarray/serialization.md
@@ -2,8 +2,9 @@
# Serialization

DocArray is designed to be "ready-to-wire" at anytime. Serialization is important.
-DocumentArray provides multiple serialization methods that allows one transfer DocumentArray object over network and across different microservices.
-Moreover, there is the ability to store/load `DocumentArray` objects to/from disk.
+
+DocumentArray provides multiple serialization methods that let you transfer DocumentArray objects over the network and across different microservices.
+Moreover, you can store/load `DocumentArray` objects to/from disk.
- JSON string: `.from_json()`/`.to_json()`
- Pydantic model: `.from_pydantic_model()`/`.to_pydantic_model()`
@@ -15,18 +16,14 @@ Moreover, there is the ability to store/load `DocumentArray` objects to/from dis
- Pandas Dataframe: `.from_dataframe()`/`.to_dataframe()`
- Cloud: `.push()`/`.pull()`

-
-
-
## From/to JSON

-
```{tip}
-If you are building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, it is highly recommended to check out {ref}`fastapi-support` and follow the methods there.
+If you're building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, you should check {ref}`fastapi-support` and follow the methods there.
```

```{important}
-Depending on which protocol you use, this feature requires `pydantic` or `protobuf` dependency. You can do `pip install "docarray[common]"` to install it.
+Depending on which protocol you use, this feature requires the `pydantic` or `protobuf` dependency. You can run `pip install "docarray[common]"` to install it.
```


@@ -68,7 +65,7 @@ da_r.summary()

```{seealso}
-To load an arbitrary JSON file, please set `protocol=None` {ref}`as described here`.
+To load an arbitrary JSON file, set `protocol=None` {ref}`as described here`.

More parameters and usages can be found in the Document-level {ref}`doc-json`.
```

@@ -77,10 +74,10 @@
## From/to bytes

```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install it.
+Depending on the values of your `protocol` and `compress` arguments, this feature may require the `protobuf` and `lz4` dependencies. You can run `pip install "docarray[full]"` to install it.
```

-Serialization into bytes often yield more compact representation than in JSON. Similar to {ref}`the Document serialization`, DocumentArray can be serialized with different `protocol` and `compress` combinations. In its most simple form,
+Serialization into bytes often yields a more compact representation than JSON. Similar to {ref}`the Document serialization`, DocumentArray can be serialized with different `protocol` and `compress` combinations. In its most simple form:

```python
from docarray import DocumentArray, Document
@@ -119,7 +116,7 @@ da_r.summary()

If you go with default `protcol` and `compress` settings, you can simply use `bytes(da)`, which is more Pythonic.
```

-The table below summarize the supported serialization protocols and compressions:
+The table below summarizes supported serialization protocols and compressions:

| `protocol=...` | Description | Remarks |
|--------------------------|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| `pickle` | Serialize elements one-by-one using Python `pickle`. | Allow streaming. Not portable to other languages. Insecure in production. |
| `protobuf` | Serialize elements one-by-one using [`DocumentProto`](../../../proto/#docarray.DocumentProto). | Allow streaming. Portable to other languages if they implement `DocumentProto`. No max-size restriction |

-For compressions, the following algorithms are supported: `lz4`, `bz2`, `lzma`, `zlib`, `gzip`. The most frequently used ones are `lz4` (fastest) and `gzip` (most widely used).
+The following algorithms are supported for compression: `lz4`, `bz2`, `lzma`, `zlib`, `gzip`. The most frequently used are `lz4` (fastest) and `gzip` (most widely used).
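To get a feel for the size trade-off, you can compare the compressors that ship with Python's standard library on an arbitrary byte payload (`lz4` is a third-party package, so it's omitted here; the payload below is only a stand-in for what `da.to_bytes(...)` would produce):

```python
import bz2
import gzip
import lzma
import zlib

# Stand-in payload; in practice this would be the output of `da.to_bytes(...)`.
payload = b'{"id": "d1", "text": "hello world"}' * 1000

sizes = {
    'raw': len(payload),
    'zlib': len(zlib.compress(payload)),
    'gzip': len(gzip.compress(payload)),
    'bz2': len(bz2.compress(payload)),
    'lzma': len(lzma.compress(payload)),
}
for name, size in sizes.items():
    print(f'{name:>4}: {size:>7} bytes')
```

On a repetitive payload like this one every compressor shrinks the data dramatically; on already-dense embeddings the gain can be negligible, which is why the best choice is workload-dependent.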
-If you specified non-default `protocol` and `compress` in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, you will need to specify the same in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_bytes`.
+If you specified non-default `protocol` and `compress` in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, you need to specify the same in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_bytes`.

-Depending on the use cases, you can choose the one works best for you. Here is a benchmark on serializing a DocumentArray with one million near-empty Documents (i.e. init with `DocumentArray.empty(...)` where each Document has only `id`).
+Choose the one that works best for your use case. Below is a benchmark on serializing a DocumentArray with one million near-empty Documents (i.e. init with `DocumentArray.empty(...)` where each Document has only `id`).

```{figure} images/benchmark-size.svg
```

@@ -142,12 +139,12 @@ Depending on the use cases, you can choose the one works best for you. Here is a

The benchmark was conducted [on the codebase of Jan. 5, 2022](https://github.com/jina-ai/docarray/tree/a56067e486d2318e05bcf6088bd1436040107ad2).

-Depending on how you want to interpret the results, the figures above can be an over-estimation/under-estimation of the serialization latency: one may argue that near-empty Documents are not realistic, but serializing a DocumentArray with one million Documents is also unreal. In practice, DocumentArray passing across microservices are relatively small, say at thousands, for better overlapping the network latency and computational overhead.
+Depending on how you want to interpret the results, the figures above can be an over-estimation/under-estimation of the serialization latency: you may argue that near-empty Documents are not realistic, but serializing a DocumentArray with one million Documents is also unrealistic.
In practice, DocumentArrays passing across microservices are relatively small, say at thousands, for better overlapping network latency and computational overhead. (wire-format)= ### Wire format of `pickle` and `protobuf` -When set `protocol=pickle` or `protobuf`, the resulting bytes look like the following: +When `protocol` is set to `pickle` or `protobuf`, the resulting bytes look as follows: ```text -------------------------------------------------------------------------------------------------------- @@ -160,17 +157,21 @@ When set `protocol=pickle` or `protobuf`, the resulting bytes look like the foll ``` -Here `version` is a `uint8` that specifies the serialization version of the `DocumentArray` serialization format, followed by `len(docs)` which is a `uint64` that specifies the amount of serialized documents. -Afterwards, `doc1_bytes` describes how many bytes are used to serialize `doc1`, followed by `doc1.to_bytes()` which is the bytes data of the document itself. -The pattern `dock_bytes` and `dock.to_bytes` is repeated `len(docs)` times. +- `version` is a `uint8` specifying the serialization version of the `DocumentArray` serialization format. +- `len(docs)` is a `uint64` specifying the number of serialized Documents. +- `doc1_bytes` shows how many bytes are used to serialize `doc1`. +- `doc1.to_bytes()` shows the bytes data of the Document itself. + +The patterns `doc_bytes` and `doc.to_bytes` are repeated `len(docs)` times. ### From/to disk If you want to store a `DocumentArray` to disk you can use `.save_binary(filename, protocol, compress)` where `protocol` and `compress` refer to the protocol and compression methods used to serialize the data. + If you want to load a `DocumentArray` from disk you can use `.load_binary(filename, protocol, compress)`. -For example, the following snippet shows how to save/load a `DocumentArray` in `my_docarray.bin`. 
+For example, let's save/load a `DocumentArray` in `my_docarray.bin`:

```python
from docarray import DocumentArray, Document
@@ -202,7 +203,7 @@ da_rec.summary()
```

-User do not need to remember the protocol and compression methods on loading. You can simply specify `protocol` and `compress` in the file extension via:
+You don't need to remember the protocol and compression methods when loading. You can simply specify `protocol` and `compress` in the file extension:

```text
filename.protobuf.gzip
@@ -214,10 +215,9 @@ filename.protobuf.gzip
```

-When a filename is given as the above format in `.save_binary`, you can simply load it back with `.load_binary` without specifying the protocol and compress method again.
-
+When a filename is given in the above format in `.save_binary`, you can load it back with `.load_binary` without specifying the protocol and compression method again.

-The previous code snippet can be simplified to
+The previous code snippet can be simplified to:

```python
da.save_binary('my_docarray.protobuf.lz4')
@@ -245,10 +245,10 @@ for d in da_generator:

## From/to base64

```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install it.
+Depending on the values of your `protocol` and `compress` arguments, this feature may require the `protobuf` and `lz4` dependencies. You can run `pip install "docarray[full]"` to install them.
```

-Serialize into base64 can be useful when binary string is not allowed, e.g. in REST API. This can be easily done via {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_base64` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_base64`. Like in binary serialization, one can specify `protocol` and `compress`:
+Serializing into base64 can be useful when binary strings are not allowed, e.g. in RESTful APIs.
You can do this with {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_base64` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_base64`. Like in binary serialization, you can specify `protocol` and `compress`:

```python
from docarray import DocumentArray
@@ -286,7 +286,7 @@ da.summary()

## From/to Protobuf

-Serializing to Protobuf Message is less frequently used, unless you are using Python Protobuf API. Nonetheless, you can use {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_protobuf` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_protobuf` to get a Protobuf Message object in Python.
+Serializing to a Protobuf Message is less common, unless you're using Python's Protobuf API. Nonetheless, you can use {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_protobuf` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_protobuf` to get a Protobuf Message object in Python.

```python
from docarray import DocumentArray, Document
@@ -311,10 +311,10 @@ docs {

## From/to list

```{important}
-This feature requires `protobuf` or `pydantic` dependency. You can do `pip install "docarray[full]"` to install it.
+This feature requires the `protobuf` or `pydantic` dependency. You can run `pip install "docarray[full]"` to install it.
```

-Serializing to/from Python list is less frequently used for the same reason as `Document.to_dict()`: it is often an intermediate step of serializing to JSON. You can do:
+Serializing to/from Python lists is less common for the same reason as `Document.to_dict()`: it's often an intermediate step of serializing to JSON. You can do:

```python
from docarray import DocumentArray, Document
@@ -334,10 +334,10 @@ More parameters and usages can be found in the Document-level {ref}`doc-dict`.

## From/to dataframe

```{important}
-This feature requires `pandas` dependency. You can do `pip install "docarray[full]"` to install it.
+This feature requires the `pandas` dependency.
You can run `pip install "docarray[full]"` to install it.
```

-One can convert between a DocumentArray object and a `pandas.dataframe` object.
+You can convert between a DocumentArray object and a `pandas.DataFrame` object:

```python
from docarray import DocumentArray, Document
@@ -352,7 +352,7 @@ da.to_dataframe()
 1  43cb95746e4e11ec8b731e008a366d49  world  text/plain
```

-To build a DocumentArray from dataframe,
+To build a DocumentArray from a DataFrame:

```python
df = ...
@@ -362,12 +362,12 @@ da = DocumentArray.from_dataframe(df)

## From/to cloud

```{important}
-This feature requires `rich` and `requests` dependency. You can do `pip install "docarray[full]"` to install it.
+This feature requires the `rich` and `requests` dependencies. You can run `pip install "docarray[full]"` to install them.
```

-{meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push` and {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.pull` allows you to serialize a DocumentArray object to Jina Cloud and share it across machines.
+{meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push` and {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.pull` let you serialize a DocumentArray object to Jina Cloud and share it across machines.

-Considering you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you got everything you need in a DocumentArray. You can easily store it to the cloud via:
+Let's say you're working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you've got everything you need in a DocumentArray. You can easily store it to the cloud with:

```python
from docarray import DocumentArray
@@ -387,14 +387,14 @@ from docarray import DocumentArray

da = DocumentArray.pull('myda123', show_progress=True)
```

-Now you can continue the work at local, analyzing `da` or visualizing it. Your friends & colleagues who know the token `myda123` can also pull that DocumentArray.
It's useful when you want to quickly share the results with your colleagues & friends. +Now you can continue the work locally, analyzing or visualizing `da`. Your friends and colleagues who know the token `myda123` can also pull that DocumentArray. It's useful when you want to quickly share results with colleagues and friends. -The maximum size of an upload is 4GB under the `protocol='protobuf'` and `compress='gzip'` setting. The lifetime of an upload is one week after its creation. +The maximum upload size is 4GB under the `protocol='protobuf'` and `compress='gzip'` setting. The upload expires one week after its creation. -To avoid unnecessary download when upstream DocumentArray is unchanged, you can add `DocumentArray.pull(..., local_cache=True)`. +To avoid unnecessary downloading when the upstream DocumentArray is unchanged, you can use `DocumentArray.pull(..., local_cache=True)`. ```{seealso} DocArray allows pushing, pulling, and managing your DocumentArrays in Jina AI Cloud. -Read more about how to manage your data in Jina AI Cloud, using either the console or the DocArray Python API, in the +Read more about managing your data in Jina AI Cloud, using either the console or the DocArray Python API, in the {ref}`Data Management section `. ``` diff --git a/docs/fundamentals/documentarray/subindex.md b/docs/fundamentals/documentarray/subindex.md index 405f7b0f19b..e138348e8f4 100644 --- a/docs/fundamentals/documentarray/subindex.md +++ b/docs/fundamentals/documentarray/subindex.md @@ -1,19 +1,18 @@ (subindex)= # Search over Nested Structure -To use {meth}`~docarray.array.mixins.find.FindMixin.find` on multimodal or nested Documents (a multimodal Document is intrinsic nested Documents), you will need "subindices". The word "subindcies" represents that you are adding a new sublevel of indexing to the DocumentArray and make it searchable. 
+To use {meth}`~docarray.array.mixins.find.FindMixin.find` on multimodal or nested Documents (a multimodal Document is intrinsically a nested Document), you will need "subindices". The word "subindices" represents that you are adding a new sub-level of indexing to the DocumentArray and making it searchable. - -Each subindex indexes and stores one nesting level, such as `'@c'` or a {ref}`custom modality ` like `'@.[image]'`, and makes it directly searchable. Under the hood, subindices are fully fledged DocumentArrays with their own {ref}`Document Store`. +Each subindex indexes and stores one nesting level (like `'@c'` or a {ref}`custom modality ` like `'@.[image]'`) and makes it directly searchable. Under the hood, subindices are fully fledged DocumentArrays with their own {ref}`document store`. ```{seealso} -To see an example of subindices in action, see {ref}`here `. +To see subindices in action, check {ref}`here `. ``` ## Construct subindices -Subindices are specified when creating a DocumentArray, -by passing a configuration for each desired subindex to the `subindex_configs` parameter: +You can specify subindices when you create a DocumentArray +by passing configuration for each desired subindex to the `subindex_configs` parameter: ````{tab} Subindex with dataclass modalities ```python @@ -135,19 +134,18 @@ da = DocumentArray( ``` ```` -The `subindex_configs` dictionary is structured in the following way: - -- **Keys:** Each key in `subindex_configs` is the *name* of a subindex. It has to be a valid DocumentArray access path (such as `'@.[image]'`, `'@.[image, paragraph]'`, `'@c'`, or `'@cc'`). +The `subindex_configs` dictionary is structured as follows: -- **Values:** Each value in `subindex_configs` is the *configuration* of a subindex. It can be any configuration that is valid for the given DocumentArray type. -Fields that are not given in the subindex configuration will be inherited from the parent configuration. 
+
+- **Keys:** Each key in `subindex_configs` is the *name* of a subindex. It must be a valid DocumentArray access path (like `'@.[image]'`, `'@.[image, paragraph]'`, `'@c'`, or `'@cc'`).
+- **Values:** Each value in `subindex_configs` is the *configuration* of a subindex. It can be any valid configuration for the given DocumentArray type.
+Fields that are not given in the subindex configuration are inherited from the parent configuration.

## Modify subindices

-Once a DocumentArray with subindices has been constructed, any modifications to the parent DocumentArray will automatically update the subindices.
+Once you've constructed a DocumentArray with subindices, modifying the parent DocumentArray automatically updates the subindices.

-This means that you can insert, extend, delete etc. it like any other DocumentArray. For example:
+This means you can insert, extend, delete, etc., like any other DocumentArray:

````{tab} Subindex with dataclass modalities
```python
@@ -217,7 +215,7 @@ Document(embedding=np.random.rand(512)).match(da, on='@c')
```
````

-Such a search will return Documents from the subindex. If you are interested in the top-level Documents associated with
+This kind of search returns Documents from the subindex.
If you want the top-level Documents associated with
a match, you can retrieve them using `parent_id`:

````{tab} Subindex with dataclass modalities
diff --git a/docs/fundamentals/documentarray/visualization.md b/docs/fundamentals/documentarray/visualization.md
index c941c77e56c..85672d00a0f 100644
--- a/docs/fundamentals/documentarray/visualization.md
+++ b/docs/fundamentals/documentarray/visualization.md
@@ -2,7 +2,7 @@

## Summary in table

-We are already pretty familiar with {meth}`~docarray.array.mixins.plot.PlotMixin.summary`, which prints a table of summary for DocumentArray and its attributes:
+You are already familiar with {meth}`~docarray.array.mixins.plot.PlotMixin.summary`, which prints a summary table for a DocumentArray and its attributes:

```python
from docarray import DocumentArray
@@ -28,7 +28,7 @@ da.summary()

## Image sprites

-If a DocumentArray contains all image Documents, you can plot all images in one sprite image using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_image_sprites`.
+If a DocumentArray contains only image Documents, you can plot them all in one sprite image using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_image_sprites`.

```python
from docarray import DocumentArray
@@ -43,7 +43,7 @@ docs.plot_image_sprites()

(plot-matches)=
### Plot Matches

-If an image Document contains the matching images in its `.matches` attribute, you can visualise the matching results using {meth}`~docarray.document.mixins.plot.PlotMixin.plot_matches_sprites`.
+If an image Document contains images in its `.matches` attribute, you can visualize the matching results using {meth}`~docarray.document.mixins.plot.PlotMixin.plot_matches_sprites`.

```python
import numpy as np
@@ -63,10 +63,10 @@ da[0].plot_matches_sprites(top_k=5, channel_axis=-1, inv_normalize=False)

## Embedding projector

```{important}
-This feature requires `fastapi` dependency. You can do `pip install "docarray[full]"` to install it.
+This feature requires the `fastapi` dependency.
You can run `pip install "docarray[full]"` to install it. ``` -If a DocumentArray has `.embeddings`, you can visualize the embeddings interactively using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_embeddings`. +If a DocumentArray has `.embeddings`, you can visualize them interactively using {meth}`~docarray.array.mixins.plot.PlotMixin.plot_embeddings`. ```python import numpy as np @@ -82,7 +82,7 @@ docs.plot_embeddings() :align: center ``` -For image DocumentArray, you can do one step more to attach the image sprite on to the visualization points. +For an image DocumentArray, you can pass the `image_sprites` parameter to set the visualization points to images. ```python da.plot_embeddings(image_sprites=True) @@ -90,4 +90,4 @@ da.plot_embeddings(image_sprites=True) ```{figure} images/embedding-projector.gif :align: center -``` \ No newline at end of file +``` From 256b5569f3006e6f123e1e85d0f738ac62d75d1c Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Fri, 9 Dec 2022 08:00:02 +0100 Subject: [PATCH 07/10] docs: us english spelling of neighbor Signed-off-by: Alex C-G --- docs/advanced/document-store/benchmark.md | 4 ++-- docs/advanced/document-store/index.md | 6 +++--- docs/advanced/document-store/qdrant.md | 2 +- docs/fundamentals/document/embedding.md | 2 +- docs/fundamentals/document/nested.md | 2 +- docs/fundamentals/documentarray/find.md | 2 +- docs/fundamentals/documentarray/matching.md | 8 ++++---- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/advanced/document-store/benchmark.md b/docs/advanced/document-store/benchmark.md index 61ab0c9a572..ad5a4917268 100644 --- a/docs/advanced/document-store/benchmark.md +++ b/docs/advanced/document-store/benchmark.md @@ -402,7 +402,7 @@ We now elaborate the setup of our benchmark. 
First the following parameters are
| The dimension of `.embedding` | 128 |
| Number of results for the task "Find by vector" | 10,000 |

-We choose sift1m dataset, which has been commonly used for evaluating the approximate nearest neighbour search methods.
+We choose the sift1m dataset, which is commonly used for evaluating approximate nearest neighbor search methods.

Each Document follows the structure:

@@ -423,7 +423,7 @@ As Weaviate, Qdrant, ElasticSearch, and Redis follow a client/server pattern, we

Results might include overhead coming from DocArray side which applies equally for all backends, unless a specific backend provides a more efficient implementation.

-### Settings of the nearest neighbour search
+### Settings of the nearest neighbor search

Most of these document stores use their own implementation of HNSW (an approximate nearest neighbor search algorithm) but with different parameters:
1. `ef_construct` - the HNSW build parameter that controls the index time/index accuracy. Bigger `ef_construct` leads to longer construction, but better index quality.
diff --git a/docs/advanced/document-store/index.md b/docs/advanced/document-store/index.md
index 665fc80b2ef..295c5647c42 100644
--- a/docs/advanced/document-store/index.md
+++ b/docs/advanced/document-store/index.md
@@ -155,9 +155,9 @@ Creating DocumentArrays without indexes is useful during prototyping but shouldn

Each document store supports different functionalities. The three key ones are:

-- **vector search**: perform approximate nearest neighbour search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding.
+- **vector search**: perform approximate nearest neighbor search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding.

-- **vector search + filter**: perform approximate nearest neighbour search (or exact full scan search).
The search function's input is a numpy array or a DocumentArray containing an embedding and a filter.
+- **vector search + filter**: perform approximate nearest neighbor search (or exact full scan search). The search function's input is a numpy array or a DocumentArray containing an embedding and a filter.

- **filter**: perform a filter step over the data. The search function's input is a filter.

@@ -594,4 +594,4 @@ By default, `list_like` will be true.
Obviously, a DocumentArray with on-disk storage is slower than an in-memory DocumentArray. However, if you choose on-disk storage, then often your concern of persistence overwhelms the concern of efficiency.

-Slowness can affect all functions of DocumentArray. On the bright side, they may not be as severe as you would expect -- modern databases are highly optimized. Moreover, some databases provide faster methods for resolving certain queries, e.g. nearest-neighbour queries. We are actively and continuously improving DocArray to better leverage those features.
+Slowness can affect all functions of DocumentArray. On the bright side, the slowdowns may not be as severe as you would expect -- modern databases are highly optimized. Moreover, some databases provide faster methods for resolving certain queries, e.g. nearest-neighbor queries. We are actively and continuously improving DocArray to better leverage those features.
diff --git a/docs/advanced/document-store/qdrant.md b/docs/advanced/document-store/qdrant.md
index 77df8a80596..d6062e31435 100644
--- a/docs/advanced/document-store/qdrant.md
+++ b/docs/advanced/document-store/qdrant.md
@@ -86,7 +86,7 @@ Other functions behave the same as an in-memory DocumentArray.
| `https` | Set `True` to use HTTPS (SSL) protocol | `None` | | `serialize_config` | [Serialization configuration of each Document](../../../fundamentals/document/serialization.md) | `None` | | `scroll_batch_size` | Batch size used when scrolling over the storage | `64` | -| `ef_construct` | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | `None`, defaults to the default value in Qdrant* | +| `ef_construct` | Number of neighbors to consider during the index building. Larger = more accurate search, more time to build index | `None`, defaults to the default value in Qdrant* | | `full_scan_threshold` | Minimum size (in kilobytes) of vectors for additional payload-based indexing | `None`, defaults to the default value in Qdrant* | | `m` | Number of edges per node in the index graph. Higher = more accurate search, more space required | `None`, defaults to the default value in Qdrant* | | `columns` | Other fields to store in Document | `None` | diff --git a/docs/fundamentals/document/embedding.md b/docs/fundamentals/document/embedding.md index d74d7d24bdc..7282f3f206d 100644 --- a/docs/fundamentals/document/embedding.md +++ b/docs/fundamentals/document/embedding.md @@ -58,7 +58,7 @@ model = torchvision.models.resnet50(pretrained=True) q.embed(model) ``` -## Find nearest-neighbours +## Find nearest-neighbors ```{admonition} On multiple Documents use DocumentArray :class: tip diff --git a/docs/fundamentals/document/nested.md b/docs/fundamentals/document/nested.md index 41010f20dbc..a0f230497ce 100644 --- a/docs/fundamentals/document/nested.md +++ b/docs/fundamentals/document/nested.md @@ -13,7 +13,7 @@ Documents can be nested both horizontally and vertically via `.matches` and `.ch | `doc.granularity` | The "depth" of the nested chunks structure | | `doc.adjacency` | The "width" of the nested match structure | -You can add **chunks** (sub-Document) and **matches** (neighbour-Document) to a Document: +You 
can add **chunks** (sub-Document) and **matches** (neighbor-Document) to a Document: - Add in constructor: diff --git a/docs/fundamentals/documentarray/find.md b/docs/fundamentals/documentarray/find.md index 6a5030a5c34..32ad37e93e8 100644 --- a/docs/fundamentals/documentarray/find.md +++ b/docs/fundamentals/documentarray/find.md @@ -4,7 +4,7 @@ You can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based on conditions specified in a `query` object. - To filter Documents, the `query` object is a Python dictionary object that defines the filtering conditions using a [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/)-like query language. -- To find nearest neighbours, the `query` object needs to be an ndarray-like, Document, or DocumentArray that defines embedding(s). You can also use the `.match()` function for this purpose, and there's a minor interface difference between these two functions which is covered {ref}`in the next chapter`. +- To find nearest neighbors, the `query` object needs to be an ndarray-like, Document, or DocumentArray that defines embedding(s). You can also use the `.match()` function for this purpose, and there's a minor interface difference between these two functions which is covered {ref}`in the next chapter`. ```{admonition} filter query syntax :class: note diff --git a/docs/fundamentals/documentarray/matching.md b/docs/fundamentals/documentarray/matching.md index 19660e6e6c0..cb5f062a78c 100644 --- a/docs/fundamentals/documentarray/matching.md +++ b/docs/fundamentals/documentarray/matching.md @@ -1,20 +1,20 @@ (match-documentarray)= -# Find Nearest Neighbours +# Find Nearest Neighbors ```{important} {meth}`~docarray.array.mixins.match.MatchMixin.match` and {meth}`~docarray.array.mixins.find.FindMixin.find` support both CPU & GPU. 
``` -Once `.embeddings` is set, you can use the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` method to find the nearest-neighbour Documents from another DocumentArray (or the current DocumentArray itself) based on their `.embeddings` and distance metrics. +Once `.embeddings` is set, you can use the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` method to find the nearest-neighbor Documents from another DocumentArray (or the current DocumentArray itself) based on their `.embeddings` and distance metrics. ## Difference between find and match -Though both `.find()` and `.match()` are about finding nearest neighbours of a given "query" and both accept similar arguments, there are some differences: +Though both `.find()` and `.match()` are about finding nearest neighbors of a given "query" and both accept similar arguments, there are some differences: ##### Which side is the query on? -- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents, to find a query's nearest neighbours you should use `one_million_docs.find(query)`; +- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents, to find a query's nearest neighbors you should use `one_million_docs.find(query)`; - `.match()` assumes the query is on left-hand side. `A.match(B)` semantically means "A matches against B and saves the results to A". So with `.match()` you should use `query.match(one_million_docs)`. ##### What's the query type? 
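Stripped of DocArray specifics, the left-hand-side/right-hand-side distinction between `.find()` and `.match()` described in this patch can be sketched with plain numpy. The `find` function and `Query` class below are illustrative stand-ins, not DocArray's real API:

```python
import numpy as np


def find(store: np.ndarray, query: np.ndarray, limit: int = 3) -> np.ndarray:
    """store.find(query): the indexed store is on the left, the query is the argument."""
    dists = np.linalg.norm(store - query, axis=1)  # distance to every stored embedding
    return np.argsort(dists)[:limit]  # indices of the `limit` nearest neighbors


class Query:
    """query.match(store): the query is on the left and keeps its own results."""

    def __init__(self, embedding):
        self.embedding = np.asarray(embedding, dtype=float)
        self.matches = None

    def match(self, store: np.ndarray, limit: int = 3):
        # "A matches against B and saves the results to A"
        self.matches = find(store, self.embedding, limit)


store = np.array([[float(i)] * 3 for i in range(10)])  # embeddings [0,0,0] ... [9,9,9]
q = Query([4.2, 4.2, 4.2])

print(find(store, q.embedding))  # query on the right-hand side → [4 5 3]
q.match(store)                   # query on the left-hand side
print(q.matches)                 # → [4 5 3]
```

Both calls compute the same neighbors; the difference is purely which object owns the call and where the results end up.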
From d718c561f88926ac532ede11d061b8dd51807405 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Fri, 9 Dec 2022 08:04:01 +0100 Subject: [PATCH 08/10] docs: consistent use of _you_ can, not _one_ can Signed-off-by: Alex C-G --- docs/advanced/document-store/annlite.md | 4 ++-- docs/advanced/document-store/elasticsearch.md | 8 ++++---- docs/advanced/document-store/extend.md | 2 +- docs/advanced/document-store/redis.md | 2 +- docs/advanced/document-store/sqlite.md | 2 +- docs/datatypes/image/index.md | 2 +- docs/datatypes/tabular/index.md | 8 ++++---- docs/datatypes/video/index.md | 8 ++++---- docs/fundamentals/fastapi-support/index.md | 2 +- 9 files changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/advanced/document-store/annlite.md b/docs/advanced/document-store/annlite.md index ead843b4ee6..143821f2d4b 100644 --- a/docs/advanced/document-store/annlite.md +++ b/docs/advanced/document-store/annlite.md @@ -10,7 +10,7 @@ This feature requires `annlite`. You can install it via `pip install "docarray[a ## Usage -One can instantiate a DocumentArray with Annlite storage like so: +You can instantiate a DocumentArray with Annlite storage like so: ```python from docarray import DocumentArray @@ -20,7 +20,7 @@ da = DocumentArray(storage='annlite', config={'n_dim': 10}) The usage would be the same as the ordinary DocumentArray. -To access a DocumentArray formerly persisted, one can specify the `data_path` in `config`. +To access a DocumentArray formerly persisted, you can specify the `data_path` in `config`. ```python from docarray import DocumentArray diff --git a/docs/advanced/document-store/elasticsearch.md b/docs/advanced/document-store/elasticsearch.md index 391d5da8a00..2c2b4f29b5f 100644 --- a/docs/advanced/document-store/elasticsearch.md +++ b/docs/advanced/document-store/elasticsearch.md @@ -41,7 +41,7 @@ docker-compose up ### Create DocumentArray with Elasticsearch backend -Assuming service is started using the default configuration (i.e. 
server address is `http://localhost:9200`), one can instantiate a DocumentArray with Elasticsearch storage as such:
+Assuming the service is started using the default configuration (i.e. server address is `http://localhost:9200`), you can instantiate a DocumentArray with Elasticsearch storage as such:

```python
from docarray import DocumentArray
@@ -70,7 +70,7 @@ da = DocumentArray(

Here is [the official Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#elasticsearch-security-certificates) for you to get certificate, password etc.

-To access a DocumentArray formerly persisted, one can specify `index_name` and the hosts.
+To access a DocumentArray formerly persisted, you can specify `index_name` and the hosts.

The following example will build a DocumentArray with previously stored data from `old_stuff` on `http://localhost:9200`:

@@ -160,7 +160,7 @@ You can read more about parallel bulk config and their default values [here](htt

### Vector search with filter query

-One can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows [ElasticSearch's DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
+You can perform Approximate Nearest Neighbor Search and pre-filter results using a filter query that follows [ElasticSearch's DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

Consider Documents with embeddings `[0,0,0]` up to `[9,9,9]` where the document with embedding `[i,i,i]` has as tag `price` with value `i`. We can create such example with the following code:

@@ -233,7 +233,7 @@ You can read more about approximate kNN tuning [here](https://www.elastic.co/gui

### Search by filter query

-One can search with user-defined query filters using the `.find` method. Such queries can be constructed following the
+You can search with user-defined query filters using the `.find` method.
Such queries can be constructed following the
guidelines in [ElasticSearch's Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

Consider you store Documents with a certain tag `price` into ElasticSearch and you want to retrieve all Documents
diff --git a/docs/advanced/document-store/extend.md b/docs/advanced/document-store/extend.md
index 42d94ac2fe6..80b48cb1226 100644
--- a/docs/advanced/document-store/extend.md
+++ b/docs/advanced/document-store/extend.md
@@ -94,7 +94,7 @@ upper level. Also, make sure that `_set_doc_by_id` performs an **upsert operatio

```{tip}
Let's call the above five functions as **the essentials**.

-If you aim for high performance, it is recommeneded to implement other methods *without* leveraging your essentials. They are: `_get_docs_by_ids`, `_del_docs_by_ids`, `_clear_storage`, `_set_doc_value_pairs`, `_set_doc_value_pairs_nested`, `_set_docs_by_ids`. One can get their full signatures from {class}`~docarray.array.storage.base.getsetdel.BaseGetSetDelMixin`. These functions define more fine-grained get/set/delete logics that are frequently used in DocumentArray.
+If you aim for high performance, it is recommended to implement other methods *without* leveraging your essentials. They are: `_get_docs_by_ids`, `_del_docs_by_ids`, `_clear_storage`, `_set_doc_value_pairs`, `_set_doc_value_pairs_nested`, `_set_docs_by_ids`. You can get their full signatures from {class}`~docarray.array.storage.base.getsetdel.BaseGetSetDelMixin`. These functions define more fine-grained get/set/delete logics that are frequently used in DocumentArray.

Implementing them is fully optional, and you can only implement some of them not all of them. If you are not implementing them, those methods will use a generic-but-slow version that is based on your five essentials.
```

diff --git a/docs/advanced/document-store/redis.md b/docs/advanced/document-store/redis.md
index c0d289b9292..92333dcf8d4 100644
--- a/docs/advanced/document-store/redis.md
+++ b/docs/advanced/document-store/redis.md
@@ -245,7 +245,7 @@ integer in `columns` configuration (`'field': 'int'`) and use a filter query tha

### Search by filter query

-One can search with user-defined query filters using the `.find` method. Such queries follow the [Redis Search Query Syntax](https://redis.io/docs/stack/search/reference/query_syntax/).
+You can search with user-defined query filters using the `.find` method. Such queries follow the [Redis Search Query Syntax](https://redis.io/docs/stack/search/reference/query_syntax/).

Consider a case where you store Documents with a tag of `price` into Redis and you want to retrieve
all Documents with `price` less than or equal to some `max_price` value.
diff --git a/docs/advanced/document-store/sqlite.md b/docs/advanced/document-store/sqlite.md
index aac3fdc56ed..817cfbae00f 100644
--- a/docs/advanced/document-store/sqlite.md
+++ b/docs/advanced/document-store/sqlite.md
@@ -15,7 +15,7 @@ da1 = DocumentArray(
) # with customize config
```

-To reconnect a formerly persisted database, one can need to specify *both* `connection` and `table_name` in `config`:
+To reconnect a formerly persisted database, you need to specify *both* `connection` and `table_name` in `config`:

```python
from docarray import DocumentArray
diff --git a/docs/datatypes/image/index.md b/docs/datatypes/image/index.md
index e92067a6a5e..fee6d09dbfc 100644
--- a/docs/datatypes/image/index.md
+++ b/docs/datatypes/image/index.md
@@ -123,7 +123,7 @@ print(d.tensor.shape)
(180, 64, 64, 3)
```

-As one can see, it converts the single image tensor into 180 image tensors, each with the size of (64, 64, 3).
You can also add all 180 image tensors into the chunks of this `Document`, simply do:
+As you can see, it converts the single image tensor into 180 image tensors, each with the size of (64, 64, 3). You can also add all 180 image tensors into the chunks of this `Document` simply by doing:

```python
d.convert_image_tensor_to_sliding_windows(window_shape=(64, 64), as_chunks=True)
diff --git a/docs/datatypes/tabular/index.md b/docs/datatypes/tabular/index.md
index b97598426f1..714fc530789 100644
--- a/docs/datatypes/tabular/index.md
+++ b/docs/datatypes/tabular/index.md
@@ -1,11 +1,11 @@
(table-type)=
# {octicon}`table` Table

-One can freely convert between DocumentArray and `pandas.Dataframe`, read more details in {ref}`docarray-serialization`. Besides, one can load and write CSV file with DocumentArray.
+You can freely convert between DocumentArray and `pandas.DataFrame`; read more details in {ref}`docarray-serialization`. Besides, you can load and write CSV files with DocumentArray.

## Load CSV table

-One can easily load tabular data from `csv` file into a DocumentArray. For example,
+You can easily load tabular data from a `csv` file into a DocumentArray. For example,

```text
Username;Identifier;First name;Last name
@@ -37,10 +37,10 @@ da = DocumentArray.from_csv('toy.csv')
tags ('dict',) 5 False
```

-One can observe that each row is loaded as a Document and the columns are loaded into `Document.tags`.
+You can observe that each row is loaded as a Document and the columns are loaded into `Document.tags`.

-In general, `from_csv` will try its best to resolve the column names of the table and map them into the corresponding Document attributes.
If such attempt fails, you can always resolve the field manually via:

```python
from docarray import DocumentArray
diff --git a/docs/datatypes/video/index.md b/docs/datatypes/video/index.md
index caabd7040a5..ccf3c2de014 100644
--- a/docs/datatypes/video/index.md
+++ b/docs/datatypes/video/index.md
@@ -48,7 +48,7 @@ d.chunks.plot_image_sprites('mov.png')

## Key frame extraction

-From the sprite image one can observe our example video is quite redundant. Let's extract the key frames from this video and see:
+From the sprite image you can observe our example video is quite redundant. Let's extract the key frames from this video and see:

```python
from docarray import Document
@@ -83,7 +83,7 @@ Makes sense, right?

## Save as video file

-One can also save a Document `.tensor` as a video file. In this example, we load our `.mp4` video and store it into a 60fps video.
+You can also save a Document `.tensor` as a video file. In this example, we load our `.mp4` video and store it into a 60fps video.

```python
from docarray import Document
@@ -101,7 +101,7 @@ d = (

## Create Document from webcam

-One can generate a stream of Documents from a webcam via {meth}`~docarray.document.mixins.video.VideoDataMixin.generator_from_webcam`:
+You can generate a stream of Documents from a webcam via {meth}`~docarray.document.mixins.video.VideoDataMixin.generator_from_webcam`:

```python
from docarray import Document

for d in Document.generator_from_webcam():
    pass
```

-This will create a generator that yields a Document for each frame. One can control the framerate via `fps` parameter. Note that the upper bound of the framerate is determined by the hardware of webcam, not the software. Press `Esc` to exit.
+This will create a generator that yields a Document for each frame. You can control the framerate via the `fps` parameter. Note that the upper bound of the framerate is determined by the webcam's hardware, not the software. Press `Esc` to exit.
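The fps-capping behavior described in that last hunk can be sketched in plain Python. The `frame_generator` below is a hypothetical, webcam-free stand-in for `generator_from_webcam`, not its real implementation; it yields frame indices instead of Documents:

```python
import itertools
import time


def frame_generator(fps=30, max_frames=None):
    """Yield frame indices, rate-limited in software to at most `fps` per second.

    This mirrors the docs' point: software can only cap the rate from above;
    the true upper bound comes from the capture hardware itself.
    """
    interval = 1.0 / fps  # minimum time between yielded frames
    last = 0.0
    for i in itertools.count():
        if max_frames is not None and i >= max_frames:
            return
        wait = interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)  # throttle to stay at or below the requested fps
        last = time.monotonic()
        yield i


frames = list(frame_generator(fps=100, max_frames=5))
print(frames)  # → [0, 1, 2, 3, 4]
```

With `fps=100` the five frames take roughly 50 ms to produce; raising `fps` beyond what the hardware delivers would simply make the sleep a no-op, which is why the real generator's upper bound is hardware-determined.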