From 58bca52fe5d3e7ca8333ea87023a7a4f0e346d40 Mon Sep 17 00:00:00 2001 From: samsja Date: Wed, 5 Apr 2023 16:08:46 +0200 Subject: [PATCH 01/22] docs : wip add AnyDocArray docs Signed-off-by: samsja --- docs/user_guide/representing/array.md | 231 +++++++++++++++++++++ docs/user_guide/representing/first_step.md | 2 +- mkdocs.yml | 4 +- 3 files changed, 235 insertions(+), 2 deletions(-) create mode 100644 docs/user_guide/representing/array.md diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md new file mode 100644 index 00000000000..9e6eb11d9ae --- /dev/null +++ b/docs/user_guide/representing/array.md @@ -0,0 +1,231 @@ +# Collection of documents + +DocArray allow users to represent and manipulate multi-modal data to build AI application (Generative AI, neural search ...). +DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learning use case`. + +!!! warning + DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal python libraries. + But it is usefully to see it that way to fully understand the representing ability that DocArray offer. + +As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. + +In Machine Learning though we often need to work with a *collection* of documents, a *collection* of datapoints. + +This section introduce the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library +name: `DocArray` is actually derive from this concept, and it stands for `DocumentArray`. + + +## AnyDocArray + +`AnyDocArray` is an abstract class that represent a collection of `BaseDoc` which is not meant to be used directly, but to be subclassed. 
+ +We provide two concrete implementation of `AnyDocArray` : + +- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a python list of `BaseDoc` +- [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc` + +We will go into the difference between `DocList` and `DocVec` in the next section but let's first focus on what they have in common. + + +`AnyDocArray` spirit is to extend the `BaseDoc` and `BaseModel` concept to the Array level in a *seamless* way + +!!! important + `AnyDocArray` is the Array equivalent of a Pydantic `BaseModel`or a DocArray [`BaseDoc`][docarray.base_doc.doc.BaseDoc]. + It extends the `BaseDoc` API at the Array level. + +### Example + +before going into detail lets look at a code example. After all it all a question of API and code +example is the best way to visualize an API. + +!!! Note + + `DocList` and `DocVec` are both `AnyDocArray`. The following section will use `DocList` as an example, but the same + apply to `DocVec`. + +First we need to create a Doc class, our data schema. Let's say we want to represent a banner with an image, a title and a description. 
+ +```python +from docarray import BaseDoc, DocList +from docarray.typing import ImageUrl + + +class BannerDoc(BaseDoc): + image: ImageUrl + title: str + description: str +``` + +let's instantiate several `BannerDoc` + +```python +banner1 = BannerDoc( + image='https://example.com/image1.png', + title='Hello World', + description='This is a banner', +) + +banner2 = BannerDoc( + image='https://example.com/image2.png', + title='Bye Bye World', + description='This is (distopic) banner', +) +``` + +we can now collect them into a `DocList` of `BannerDoc` + +```python +docs = DocList[BannerDoc]([banner1, banner2]) + +docs.summary() +``` + +```cmd +╭──────── DocList Summary ────────╮ +│ │ +│ Type DocList[BannerDoc] │ +│ Length 2 │ +│ │ +╰─────────────────────────────────╯ +╭──── Document Schema ─────╮ +│ │ +│ BannerDoc │ +│ ├── image: ImageUrl │ +│ ├── title: str │ +│ └── description: str │ +│ │ +╰──────────────────────────╯ +``` + +`docs` here is a collection of `BannerDoc`. + +!!! note + The syntax `DocList[BannerDoc]` should surprise you in this context, + it is actually at the heart of DocArray but let's come back to it later LINK TO LATER and continue with the example. + +As we said earlier `DocList` or more generaly `AnyDocArray` extend the `BaseDoc` API at the Array level. + +What it means concretely is that the same way you can access with Pydantic at the attribute of your data at the +document level, you can do access it at the Array level. + +Let's see how it looks: + + +at the document level: +```python +print(banner.url) +``` + +```cmd +https://example.com/image1.png' +``` + +at the Array level: +```python +print(docs.url) +``` + +```cmd +['https://example.com/image1.png', 'https://example.com/image2.png'] +``` + +!!! Important + All the attribute of `BannerDoc` are accessible at the Array level. + +!!! Warning + Whereas this is true at runtime, static type analyser like Mypy or IDE like PyCharm will not be able to know it. 
+ This limitation is know and will be fixed in the future by the introduction of a Mypy, PyCharm, VSCode plugin. + +This even work when you have a nested `BaseDoc`: + +```python +from docarray import BaseDoc, DocList +from docarray.typing import ImageUrl + + +class BannerDoc(BaseDoc): + image: ImageUrl + title: str + description: str + + +class PageDoc(BaseDoc): + banner: BannerDoc + content: str + + +page1 = PageDoc( + banner=BannerDoc( + image='https://example.com/image1.png', + title='Hello World', + description='This is a banner', + ), + content='Hello wolrd is the most used example in programming, but do you know that ? ...', +) + +page2 = PageDoc( + banner=BannerDoc( + image='https://example.com/image2.png', + title='Bye Bye World', + description='This is (distopic) banner', + ), + content='What if the most used example in programming was Bye Bye World, would programming be that much fun ? ...', +) + +docs = DocList[PageDoc]([page1, page2]) + +docs.summary() +``` + +```cmd +╭─────── DocList Summary ───────╮ +│ │ +│ Type DocList[PageDoc] │ +│ Length 2 │ +│ │ +╰───────────────────────────────╯ +╭────── Document Schema ───────╮ +│ │ +│ PageDoc │ +│ ├── banner: BannerDoc │ +│ │ ├── image: ImageUrl │ +│ │ ├── title: str │ +│ │ └── description: str │ +│ └── content: str │ +│ │ +╰──────────────────────────────╯ +``` + +```python +print(docs.banner) +``` + +```cmd + +``` + +Yes, `docs.banner` return a nested `DocList` of `BannerDoc` ! 
+ +You can even access the attribute of the nested `BaseDoc` at the Array level: + +```python +print(docs.banner.url) +``` + +```cmd +['https://example.com/image1.png', 'https://example.com/image2.png'] +``` + +the same way that with Pydantic and DocArray [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: + +```python +print(page1.banner.image) +``` + +```cmd +'https://example.com/image1.png' +``` + +### Custom syntax and in depth understanding of `AnyDocArray` + + diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index c20b0dc553f..f8b647805cf 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -1,4 +1,4 @@ -# Representing +# Document At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc]. diff --git a/mkdocs.yml b/mkdocs.yml index ca72a966197..e54126d3370 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -74,7 +74,9 @@ nav: - Home: README.md - Tutorial - User Guide: - user_guide/intro.md - - user_guide/representing/first_step.md + - Representing: + - user_guide/representing/first_step.md + - user_guide/representing/array.md - user_guide/sending/first_step.md - user_guide/storing/first_step.md From a9c705107304ba51d018fb037582ff1d9f98dbfb Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 12:18:34 +0200 Subject: [PATCH 02/22] docs : add array section Signed-off-by: samsja --- docs/user_guide/representing/array.md | 173 ++++++++++++++++++++- docs/user_guide/representing/first_step.md | 5 + 2 files changed, 177 insertions(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 9e6eb11d9ae..fd7d6ec92dc 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -226,6 +226,177 @@ print(page1.banner.image) 'https://example.com/image1.png' ``` -### Custom syntax and in depth understanding of `AnyDocArray` +### 
`DocList[DocType]` syntax + +As you have seen in the previous section, `AnyDocArray` will expose the same attribute as the `BaseDoc` it contains. + +But this concept only work if and only if all of the `BaseDoc` in the `AnyDocArray` have the same schema. + +Indeed, if one of your `BaseDoc` have an attribute that the other don't, you will get an error if you try to acces it at +the array level. + + +!!! note + To be able to extend Pydantic API to the Array level, `AnyDocArray` need to contain homogenous Document. + +This is where the custom syntax `DocList[DocType]` come into play. + +!!! + `DocList[DocType]` create a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Document. + +This syntax is inspired by more statically typed language, and even though it might offend python purist and go against +python list principle we believe that it is actually a good user experience to think of Array of `BaseDoc` rather than +just a collection of non-homogenous `BaseDoc`. + + +That being said `AnyDocArray` can be used to create a non-homogenous `AnyDocArray`: + +!!! note + The default `DocList` can be used to create a non-homogenous list of `BaseDoc`. + +!!! warning + `DocVec` cannot store non-homogenous `BaseDoc` and always need the `DocVec[DocType]` syntax. + +The usage if non-homogenous `DocList` is really similar to a normal Python list but still offer DocArray functionality +like serialization and send over the wire (LINK). But it won't be able to extend the Pydantic API to the Array level. 
+ + +Here is how you can instantiate a non-homogenous `DocList`: + +```python +from docarray import BaseDoc, DocList +from docarray.typing import ImageUrl, AudioUrl + + +class ImageDoc(BaseDoc): + url: ImageUrl + + +class AudioDoc(BaseDoc): + url: AudioUrl + + +docs = DocList( + [ + ImageDoc(url='https://example.com/image1.png'), + AudioDoc(url='https://example.com/audio1.mp3'), + ] +) +``` + +But you will not have been able to do + +```python +try: + docs = DocList[ImageDoc]( + [ + ImageDoc(url='https://example.com/image1.png'), + AudioDoc(url='https://example.com/audio1.mp3'), + ] + ) +except ValueError as e: + print(e) +``` + +```cmd +ValueError: AudioDoc( + id='e286b10f58533f48a0928460f0206441', + url=AudioUrl('https://example.com/audio1.mp3', host_type='domain') +) is not a +``` + +### `DocList` vs `DocVec` + +[`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] are both +[`AnyDocArray`][docarray.array.doc_array.doc_array.AnyDocArray] but they have different use case, and they differ in how +they store the data in memory. + +They share almost everything that as been said in the previous section, but they have some conceptual differences. + +[`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python List. +You can, append, extend, insert, pop , ... on it. In DocList, the data is individually owned by each `BaseDoc` collect just +different Document reference. You want to use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able +to rearrange or rerank you data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot +leverage function that require contiguous data like without first copying the data in a continuous array. + +[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. DocVec is always a collection +of homogeneous Documents. 
The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column. + +This mean that when you access the attribute of a `BaseDoc` at the Array level, we don't collect under the hood the data +from all the documents (like `DocList`) before giving it back to you. We just return the column that is store in memory. + +This really matter when you need to handle multi-modal data that you will feed into algorithm that require contiguous data, like matrix multiplication +which is at the heart of Machine Learning especially in Deep Learning. + +let's take an example to illustrate the difference + + +let's say you want to work with Image: +```python +from docarray import BaseDoc +from docarray.typing import NdArray + + +class ImageDoc(BaseDoc): + image: NdArray[ + 3, 224, 224 + ] = None # [3, 224, 224] this just mean we know in advance the shape of the tensor +``` + +and that you have a function that take a contiguous array of image as input (like a deep learning model) + +```python +def predict(image: NdArray['batch_size', 3, 224, 224]): + ... +``` + +let's create a `DocList` of `ImageDoc` and pass it to the function + +```python hl_lines="5 7" +from docarray import DocList +import numpy as np + +docs = DocList[ImageDoc]([ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]) + +predict(np.stack(docs.image)) +... +predict(np.stack(docs.image)) + +``` + +When you call `docs.image` DocList loop over the 10 documents and collect the image attribute of each document in a list + +it is similar to do + +```python +images = [] +for doc in docs: + images.append(doc.image) +``` + +this means that if you need to call `docs.image` multiple time, you will have to stack in the array in a contiguous batch array +multiple time. This is not optimal. 
+ +Let's see how it will work with `DocVec` + +```python hl_lines="5 7" +from docarray import DocList +import numpy as np + +docs = DocList[ImageDoc]([ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]) + +predict(docs.image) +... +predict(docs.image) +``` + +First difference is that you don't need to call `np.stack` on `docs.image` because `docs.image` is already a contiguous array. +Second difference is that you just get the column and don't need to create it at each call. + + + +!!! Note + You should use `DocVec` when you need to work with contiguous data and you should use `DocList` when you need to rearrange + or extend your data. diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index f8b647805cf..93ff0cdfc46 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -6,6 +6,11 @@ A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https [`BaseModel`](https://docs.Pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in the Pydantic world) to represent your data. + +!!! note + Naming convention. When we refer to a BaseDoc we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc]. + When we refer to a `Document` we refer to an instance of a BaseDoc class. + ## Basic `Doc` usage. 
Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's From f30593ba5edc8368e47118ef1351e993db8d495d Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 13:05:50 +0200 Subject: [PATCH 03/22] docs : add array section Signed-off-by: samsja --- docs/api_references/array/any_da.md | 3 + docs/api_references/array/da_stack.md | 2 +- docs/user_guide/representing/array.md | 79 ++++++++++++++++++++++++--- 3 files changed, 74 insertions(+), 10 deletions(-) create mode 100644 docs/api_references/array/any_da.md diff --git a/docs/api_references/array/any_da.md b/docs/api_references/array/any_da.md new file mode 100644 index 00000000000..e71d1999cf5 --- /dev/null +++ b/docs/api_references/array/any_da.md @@ -0,0 +1,3 @@ +# AnyDocArray + +::: docarray.array.doc_vec.doc_vec.DocVec diff --git a/docs/api_references/array/da_stack.md b/docs/api_references/array/da_stack.md index c0709f2e084..74f9ff637a0 100644 --- a/docs/api_references/array/da_stack.md +++ b/docs/api_references/array/da_stack.md @@ -1,3 +1,3 @@ # DocVec -::: docarray.array.doc_vec.doc_vec.DocVec +::: docarray.array.any_array.AnyDocArray diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index fd7d6ec92dc..a970c97b544 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -9,7 +9,7 @@ DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learn As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. -In Machine Learning though we often need to work with a *collection* of documents, a *collection* of datapoints. +In Machine Learning though we often need to work with an *array* of documents, an *array* of datapoints. 
This section introduce the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library name: `DocArray` is actually derive from this concept, and it stands for `DocumentArray`. @@ -17,7 +17,7 @@ name: `DocArray` is actually derive from this concept, and it stands for `Docume ## AnyDocArray -`AnyDocArray` is an abstract class that represent a collection of `BaseDoc` which is not meant to be used directly, but to be subclassed. +`AnyDocArray` is an abstract class that represent an array of `BaseDoc` which is not meant to be used directly, but to be subclassed. We provide two concrete implementation of `AnyDocArray` : @@ -97,7 +97,31 @@ docs.summary() ╰──────────────────────────╯ ``` -`docs` here is a collection of `BannerDoc`. +`docs` here is an array-like collection of `BannerDoc`. + +You can access the documents inside it with the usual Python array API: + +```python +print(docs[0]) +``` + +```cmd +BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner') +``` + +Or iterate over it: + +```python +for doc in docs: + print(doc) +``` + +```cmd +BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner') +BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', description='This is (distopic) banner') +``` + + !!! note The syntax `DocList[BannerDoc]` should surprise you in this context, @@ -246,7 +270,7 @@ This is where the custom syntax `DocList[DocType]` come into play. This syntax is inspired by more statically typed language, and even though it might offend python purist and go against python list principle we believe that it is actually a good user experience to think of Array of `BaseDoc` rather than -just a collection of non-homogenous `BaseDoc`. +just an array of non-homogenous `BaseDoc`. 
That being said `AnyDocArray` can be used to create a non-homogenous `AnyDocArray`: @@ -308,7 +332,7 @@ ValueError: AudioDoc( ### `DocList` vs `DocVec` [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] are both -[`AnyDocArray`][docarray.array.doc_array.doc_array.AnyDocArray] but they have different use case, and they differ in how +[`AnyDocArray`][docarray.array.any_array.AnyDocArray] but they have different use case, and they differ in how they store the data in memory. They share almost everything that as been said in the previous section, but they have some conceptual differences. @@ -319,7 +343,7 @@ different Document reference. You want to use [`DocList`][docarray.array.doc_lis to rearrange or rerank you data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot leverage function that require contiguous data like without first copying the data in a continuous array. -[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. DocVec is always a collection +[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. DocVec is always an array of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column. This mean that when you access the attribute of a `BaseDoc` at the Array level, we don't collect under the hood the data @@ -352,7 +376,7 @@ def predict(image: NdArray['batch_size', 3, 224, 224]): let's create a `DocList` of `ImageDoc` and pass it to the function -```python hl_lines="5 7" +```python hl_lines="6 8" from docarray import DocList import numpy as np @@ -379,7 +403,7 @@ multiple time. This is not optimal. 
Let's see how it will work with `DocVec` -```python hl_lines="5 7" +```python hl_lines="6 8" from docarray import DocList import numpy as np @@ -393,10 +417,47 @@ predict(docs.image) First difference is that you don't need to call `np.stack` on `docs.image` because `docs.image` is already a contiguous array. Second difference is that you just get the column and don't need to create it at each call. +Another main difference between the two is how you access the documents inside them. + +If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e., a document. + +If you access a document inside a `DocVec` you will get a document view. A document view is a view of the columnar data structure that +looks and behaves like a `BaseDoc` instance. It is actually a `BaseDoc` instance, but with a different way of accessing the data. + +When you make a change at the view level, it is reflected at the DocVec level. + +```python +docs = DocVec[ImageDoc]( + [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)] +) + +my_doc = docs[0] + +assert my_doc.is_view() # is_view() returns True +``` + +Whereas with DocList: + +```python +docs = DocList[ImageDoc]( + [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)] +) + +my_doc = docs[0] + +assert not my_doc.is_view() # is_view() returns False +``` !!! Note - You should use `DocVec` when you need to work with contiguous data and you should use `DocList` when you need to rearrange + To summarize: you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange or extend your data. +See also: + +* [`DocList`][docarray.array.doc_list.doc_list.DocList] +* [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] +* REPRESENTING REF +* STORING REF +* ... 
From 0375443b3e8f8f574e4c9d0a0c649a6b411a9476 Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 14:41:54 +0200 Subject: [PATCH 04/22] docs : aadd glossary Signed-off-by: samsja --- docs/glossary.md | 70 ++++++++++++++++++++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 71 insertions(+) create mode 100644 docs/glossary.md diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 00000000000..fe405e20c13 --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,70 @@ +# Glossary + +DocArray's scope sits at the intersection of several fields, from AI to web apps. To make the documentation easier to understand, we have created a glossary of the terms it uses. + + +## Concepts + +### `Multi Modal Data` +Multi-modal data is data composed of different modalities: image, text, video, audio, etc. +For example, a YouTube video is composed of a video, a title, a description, a thumbnail, etc. + +In fact, most of the data we have in the world is multi-modal. + +### `Multi Modal AI` + +Multi-modal AI is the field of AI that focuses on multi-modal data. + +Most of the recent breakthroughs in AI are actually in multi-modal AI. + +* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*. +* [Whisper](https://openai.com/research/whisper) can generate *text* from *speech*. +* [GPT4](https://openai.com/product/gpt-4), [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLMs (Multi Modal Large Language Models) that can understand both *text* and *images*. +* ... 
+ +One of the reasons that AI labs are focusing on multi-modal AI is that it can solve a lot of practical problems and that it might actually be +a requirement to build strong AI systems, as argued by Yann LeCun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he said that `A system trained on language alone will never approximate human intelligence`. + +### `Generative AI` + +Generative AI is also at the epicenter of the latest AI revolution. These tools allow you to *generate* data. + +* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*. + + +### `Neural Search` + +Neural search is search powered by neural networks. Unlike traditional keyword-based search methods, neural search can understand the context and semantic meaning of the query, allowing it to find relevant results even when the exact keywords are not present. + + +### `Vector Database` + +A vector database is a specialized storage system designed to handle high-dimensional vectors, which are common representations of data in machine learning and AI applications. It enables efficient storage, indexing, and querying of these vectors, and typically supports operations like nearest neighbor search, similarity search, and clustering. + + +## Tools + +### `Jina` + +[Jina](https://jina.ai) is a framework to build multi-modal applications. It relies heavily on DocArray to represent and send data. + +Originally DocArray was part of Jina but it became a standalone project that is now independent of Jina. + +### `Pydantic` + +[Pydantic](https://github.com/pydantic/pydantic/) is a Python library for data validation using Python type hints. +DocArray relies on Pydantic. + +### `FastAPI` + +[FastAPI](https://fastapi.tiangolo.com/) is a Python library for building APIs using Python type hints. 
+ +It is built on top of Pydantic and extends nicely to DocArray. + +### `Weaviate` + +[Weaviate](https://weaviate.io/) is an open-source vector database that is supported in DocArray. + +### `Qdrant` + +[Qdrant](https://qdrant.tech/) is an open-source vector database that is supported in DocArray. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index e54126d3370..f4441995378 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -86,4 +86,5 @@ nav: - how_to/optimize_performance_with_id_generation.md - how_to/audio2text.md - ... + - Glossary: glossary.md - Contributing: CONTRIBUTING.md From c303cdb5a29aa5882360b06be62c774e26db9fa9 Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Thu, 6 Apr 2023 14:55:10 +0200 Subject: [PATCH 05/22] feat: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/representing/array.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index a970c97b544..29a6c61a0af 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -1,10 +1,10 @@ # Collection of documents -DocArray allow users to represent and manipulate multi-modal data to build AI application (Generative AI, neural search ...). -DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learning use case`. +DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search ...). +DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learning use cases`. !!! warning - DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal python libraries. 
+ DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal Python library. But it is usefully to see it that way to fully understand the representing ability that DocArray offer. As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. From bc34b2c94235239385162608f5f73f37f812140e Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Thu, 6 Apr 2023 14:55:35 +0200 Subject: [PATCH 06/22] feat: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/representing/array.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 29a6c61a0af..1fd713f5573 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -5,13 +5,13 @@ DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learn !!! warning DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal Python library. - But it is usefully to see it that way to fully understand the representing ability that DocArray offer. + But it can be useful to see it that way to fully understand the representing ability that DocArray offers. As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. In Machine Learning though we often need to work with an *array* of documents, an *array* of datapoints. -This section introduce the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. 
This library +This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library name: `DocArray` is actually derive from this concept, and it stands for `DocumentArray`. From 67c6275e9eccd483478e3199129602c18a777cf5 Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Thu, 6 Apr 2023 14:56:07 +0200 Subject: [PATCH 07/22] feat: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/representing/array.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 1fd713f5573..f4899c944fa 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -17,17 +17,17 @@ name: `DocArray` is actually derive from this concept, and it stands for `Docume ## AnyDocArray -`AnyDocArray` is an abstract class that represent an array of `BaseDoc` which is not meant to be used directly, but to be subclassed. +`AnyDocArray` is an abstract class that represents an array of `BaseDoc` which is not meant to be used directly, but to be subclassed. We provide two concrete implementation of `AnyDocArray` : -- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a python list of `BaseDoc` +- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a Python list of `BaseDoc` - [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc` We will go into the difference between `DocList` and `DocVec` in the next section but let's first focus on what they have in common. 
-`AnyDocArray` spirit is to extend the `BaseDoc` and `BaseModel` concept to the Array level in a *seamless* way +`AnyDocArray`'s spirit is to extend the `BaseDoc` and `BaseModel` concept to the Array level in a *seamless* way. !!! important `AnyDocArray` is the Array equivalent of a Pydantic `BaseModel`or a DocArray [`BaseDoc`][docarray.base_doc.doc.BaseDoc]. @@ -35,8 +35,7 @@ We will go into the difference between `DocList` and `DocVec` in the next sectio ### Example -before going into detail lets look at a code example. After all it all a question of API and code -example is the best way to visualize an API. +Before going into detail, let's look at a code example. !!! Note `DocList` and `DocVec` are both `AnyDocArray`. The following section will use `DocList` as an example, but the same From c3468e37b994db88a8483a2d79e37fb769cdeaff Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Thu, 6 Apr 2023 14:57:52 +0200 Subject: [PATCH 08/22] feat: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/representing/array.md | 73 +++++++++++++-------------- 1 file changed, 36 insertions(+), 37 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index f4899c944fa..0b1962d9bd1 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -40,9 +40,9 @@ Before going into detail, let's look at a code example. !!! Note `DocList` and `DocVec` are both `AnyDocArray`. The following section will use `DocList` as an example, but the same - apply to `DocVec`. + applies to `DocVec`. -First we need to create a Doc class, our data schema. Let's say we want to represent a banner with an image, a title and a description. +First you need to create a Doc class, our data schema. Let's say you want to represent a banner with an image, a title and a description. 
```python from docarray import BaseDoc, DocList @@ -55,7 +55,7 @@ class BannerDoc(BaseDoc): description: str ``` -let's instantiate several `BannerDoc` +Let's instantiate several `BannerDoc` ```python banner1 = BannerDoc( @@ -71,7 +71,7 @@ banner2 = BannerDoc( ) ``` -we can now collect them into a `DocList` of `BannerDoc` +You can now collect them into a `DocList` of `BannerDoc`: ```python docs = DocList[BannerDoc]([banner1, banner2]) @@ -123,18 +123,18 @@ BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', descrip !!! note - The syntax `DocList[BannerDoc]` should surprise you in this context, + The syntax `DocList[BannerDoc]` might surprise you in this context, it is actually at the heart of DocArray but let's come back to it later LINK TO LATER and continue with the example. -As we said earlier `DocList` or more generaly `AnyDocArray` extend the `BaseDoc` API at the Array level. +As we said earlier, `DocList` or more generally `AnyDocArray`, extends the `BaseDoc` API at the Array level. -What it means concretely is that the same way you can access with Pydantic at the attribute of your data at the +What this means concretely is that the same way you can access your data at the document level, you can do access it at the Array level. -Let's see how it looks: +Let's see what that looks like: -at the document level: +At the document level: ```python print(banner.url) ``` @@ -143,7 +143,7 @@ print(banner.url) https://example.com/image1.png' ``` -at the Array level: +At the Array level: ```python print(docs.url) ``` @@ -153,13 +153,13 @@ print(docs.url) ``` !!! Important - All the attribute of `BannerDoc` are accessible at the Array level. + All the attributes of `BannerDoc` are accessible at the Array level. !!! Warning - Whereas this is true at runtime, static type analyser like Mypy or IDE like PyCharm will not be able to know it. - This limitation is know and will be fixed in the future by the introduction of a Mypy, PyCharm, VSCode plugin. 
+ Whereas this is true at runtime, static type analysers like Mypy or IDE like PyCharm will not be able to know it. + This limitation is known and will be fixed in the future by the introduction of a Mypy, PyCharm, VSCode plugin. -This even work when you have a nested `BaseDoc`: +This even works when you have a nested `BaseDoc`: ```python from docarray import BaseDoc, DocList @@ -227,7 +227,7 @@ print(docs.banner) ``` -Yes, `docs.banner` return a nested `DocList` of `BannerDoc` ! +Yes, `docs.banner` returns a nested `DocList` of `BannerDoc` ! You can even access the attribute of the nested `BaseDoc` at the Array level: @@ -239,7 +239,7 @@ print(docs.banner.url) ['https://example.com/image1.png', 'https://example.com/image2.png'] ``` -the same way that with Pydantic and DocArray [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: +The same way that with Pydantic and DocArray [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: ```python print(page1.banner.image) @@ -251,37 +251,36 @@ print(page1.banner.image) ### `DocList[DocType]` syntax -As you have seen in the previous section, `AnyDocArray` will expose the same attribute as the `BaseDoc` it contains. +As you have seen in the previous section, `AnyDocArray` will expose the same attributes as the `BaseDoc` it contains. -But this concept only work if and only if all of the `BaseDoc` in the `AnyDocArray` have the same schema. +But this concept only works if and only if all of the `BaseDoc`s in the `AnyDocArray` have the same schema. -Indeed, if one of your `BaseDoc` have an attribute that the other don't, you will get an error if you try to acces it at +Indeed, if one of your `BaseDoc`s has an attribute that the others don't, you will get an error if you try to access it at the array level. !!! note - To be able to extend Pydantic API to the Array level, `AnyDocArray` need to contain homogenous Document. 
+ To be able to extend your schema to the Array level, `AnyDocArray` needs to contain homogenous Document. -This is where the custom syntax `DocList[DocType]` come into play. +This is where the custom syntax `DocList[DocType]` comes into play. !!! - `DocList[DocType]` create a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Document. + `DocList[DocType]` creates a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Documents. -This syntax is inspired by more statically typed language, and even though it might offend python purist and go against -python list principle we believe that it is actually a good user experience to think of Array of `BaseDoc` rather than +This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is actually a good user experience to think of Array of `BaseDoc` rather than just an array of non-homogenous `BaseDoc`. -That being said `AnyDocArray` can be used to create a non-homogenous `AnyDocArray`: +That being said, `AnyDocArray` can also be used to create a non-homogenous `AnyDocArray`: !!! note The default `DocList` can be used to create a non-homogenous list of `BaseDoc`. !!! warning - `DocVec` cannot store non-homogenous `BaseDoc` and always need the `DocVec[DocType]` syntax. + `DocVec` cannot store non-homogenous `BaseDoc` and always needs the `DocVec[DocType]` syntax. -The usage if non-homogenous `DocList` is really similar to a normal Python list but still offer DocArray functionality -like serialization and send over the wire (LINK). But it won't be able to extend the Pydantic API to the Array level. +The usage of non-homogenous `DocList` is really similar to a normal Python list but still offers DocArray functionality +like serialization and sending over the wire (LINK). But it won't be able to extend the API of your custom schema to the Array level. 
Here is how you can instantiate a non-homogenous `DocList`: @@ -331,16 +330,16 @@ ValueError: AudioDoc( ### `DocList` vs `DocVec` [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] are both -[`AnyDocArray`][docarray.array.any_array.AnyDocArray] but they have different use case, and they differ in how +[`AnyDocArray`][docarray.array.any_array.AnyDocArray] but they have different use cases, and they differ in how they store the data in memory. -They share almost everything that as been said in the previous section, but they have some conceptual differences. +They share almost everything that has been said in the previous sections, but they have some conceptual differences. [`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python List. -You can, append, extend, insert, pop , ... on it. In DocList, the data is individually owned by each `BaseDoc` collect just +You can append, extend, insert, pop , ... on it. In DocList, the data is individually owned by each `BaseDoc` collect just different Document reference. You want to use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able -to rearrange or rerank you data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot -leverage function that require contiguous data like without first copying the data in a continuous array. +to rearrange or re-rank your data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot +leverage functions that require contiguous data like without first copying the data in a continuous array. [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. DocVec is always an array of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column. @@ -351,10 +350,10 @@ from all the documents (like `DocList`) before giving it back to you. 
We just re This really matter when you need to handle multi-modal data that you will feed into algorithm that require contiguous data, like matrix multiplication which is at the heart of Machine Learning especially in Deep Learning. -let's take an example to illustrate the difference +Let's take an example to illustrate the difference -let's say you want to work with Image: +Let's say you want to work with Image: ```python from docarray import BaseDoc from docarray.typing import NdArray @@ -366,14 +365,14 @@ class ImageDoc(BaseDoc): ] = None # [3, 224, 224] this just mean we know in advance the shape of the tensor ``` -and that you have a function that take a contiguous array of image as input (like a deep learning model) +And that you have a function that take a contiguous array of image as input (like a deep learning model) ```python def predict(image: NdArray['batch_size', 3, 224, 224]): ... ``` -let's create a `DocList` of `ImageDoc` and pass it to the function +Let's create a `DocList` of `ImageDoc` and pass it to the function ```python hl_lines="6 8" from docarray import DocList From 25cf1637aebfbf7f101eff21cafb1771e5077334 Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 15:00:35 +0200 Subject: [PATCH 09/22] docs : fix johannes suggestion Signed-off-by: samsja --- docs/user_guide/representing/array.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 0b1962d9bd1..334ea525a0b 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -3,7 +3,8 @@ DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search ...). DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learning use cases`. -!!! warning + +!!! note DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal Python library. 
But it can be usefl usefully to see it that way to fully understand the representing ability that DocArray offers. @@ -29,9 +30,6 @@ We will go into the difference between `DocList` and `DocVec` in the next sectio `AnyDocArray`s spirit is to extend the `BaseDoc` and `BaseModel` concept to the Array level in a *seamless* way. -!!! important - `AnyDocArray` is the Array equivalent of a Pydantic `BaseModel`or a DocArray [`BaseDoc`][docarray.base_doc.doc.BaseDoc]. - It extends the `BaseDoc` API at the Array level. ### Example From eee36d41f00b47637fa509389e791f5723f9b9ba Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 15:02:09 +0200 Subject: [PATCH 10/22] docs : fix typo Signed-off-by: samsja --- docs/user_guide/representing/array.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 334ea525a0b..da43e91e922 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -6,7 +6,7 @@ DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learn !!! note DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal Python library. - But it can be usefl usefully to see it that way to fully understand the representing ability that DocArray offers. + But it can be useful to see it that way to fully understand the representing ability that DocArray offers. As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. 
From 53f0f4c3a08b7d46db0a14524e7ee04e9e15126c Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 15:02:23 +0200 Subject: [PATCH 11/22] docs : fix typo Signed-off-by: samsja --- docs/user_guide/representing/array.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index da43e91e922..f2b747b3de7 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -12,7 +12,7 @@ As you have seen in the last section (LINK), the fundamental building block of D In Machine Learning though we often need to work with an *array* of documents, an *array* of datapoints. -This section introducew the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library +This section introduce the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library name: `DocArray` is actually derive from this concept, and it stands for `DocumentArray`. From 4440257ac00dd1ec71c0ad2235e337ff37a97b20 Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 15:03:18 +0200 Subject: [PATCH 12/22] docs : reove pydantic stuff Signed-off-by: samsja --- docs/user_guide/representing/array.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index f2b747b3de7..339137dd134 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -1,12 +1,6 @@ # Collection of documents DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search ...). -DocArray could be seen as a `multi-modal extension of Pydantic for Machine Learning use cases`. - - -!!! note - DocArray is actually more than just a Pydantic extension, it is a general purpose multi-modal Python library. 
- But it can be useful to see it that way to fully understand the representing ability that DocArray offers. As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. @@ -237,7 +231,7 @@ print(docs.banner.url) ['https://example.com/image1.png', 'https://example.com/image2.png'] ``` -The same way that with Pydantic and DocArray [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: +The same way that [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: ```python print(page1.banner.image) From 4f60d028339961f54f593ffdc40c5bd315a7500e Mon Sep 17 00:00:00 2001 From: samsja Date: Thu, 6 Apr 2023 15:16:12 +0200 Subject: [PATCH 13/22] docs : fix title Signed-off-by: samsja --- docs/user_guide/representing/array.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 339137dd134..7b073b3d810 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -1,4 +1,4 @@ -# Collection of documents +# Array of documents DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search ...). 
From 5ba6d4305899dc8e363fedb3fc94dd59328a0502 Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 10:24:56 +0200 Subject: [PATCH 14/22] feat: apply gammarly Signed-off-by: samsja --- docs/user_guide/representing/array.md | 49 +++++++++++++-------------- 1 file changed, 23 insertions(+), 26 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 7b073b3d810..6904c16a35d 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -4,11 +4,10 @@ DocArray allows users to represent and manipulate multi-modal data to build AI a As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. -In Machine Learning though we often need to work with an *array* of documents, an *array* of datapoints. - -This section introduce the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library -name: `DocArray` is actually derive from this concept, and it stands for `DocumentArray`. +In Machine Learning though we often need to work with an *array* of documents, and an *array* of data points. +This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library +name: `DocArray` is derived from this concept, and it stands for `DocumentArray`. ## AnyDocArray @@ -88,9 +87,9 @@ docs.summary() ╰──────────────────────────╯ ``` -`docs` here is a array-like collection of `BannerDoc`. +`docs` here is an array-like collection of `BannerDoc`. 
-You can access document inside it with the usual python array API: +You can access documents inside it with the usual Python array API: ```python print(docs[0]) @@ -121,7 +120,7 @@ BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', descrip As we said earlier, `DocList` or more generally `AnyDocArray`, extends the `BaseDoc` API at the Array level. What this means concretely is that the same way you can access your data at the -document level, you can do access it at the Array level. +document level, you can access it at the Array level. Let's see what that looks like: @@ -148,7 +147,7 @@ print(docs.url) All the attributes of `BannerDoc` are accessible at the Array level. !!! Warning - Whereas this is true at runtime, static type analysers like Mypy or IDE like PyCharm will not be able to know it. + Whereas this is true at runtime, static type analyzers like Mypy or IDE like PyCharm will not be able to know it. This limitation is known and will be fixed in the future by the introduction of a Mypy, PyCharm, VSCode plugin. This even works when you have a nested `BaseDoc`: @@ -252,14 +251,14 @@ the array level. !!! note - To be able to extend your schema to the Array level, `AnyDocArray` needs to contain homogenous Document. + To be able to extend your schema to the Array level, `AnyDocArray` needs to contain a homogenous Document. This is where the custom syntax `DocList[DocType]` comes into play. !!! `DocList[DocType]` creates a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Documents. 
-This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is actually a good user experience to think of Array of `BaseDoc` rather than +This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of `BaseDoc` rather than just an array of non-homogenous `BaseDoc`. @@ -271,12 +270,11 @@ That being said, `AnyDocArray` can also be used to create a non-homogenous `AnyD !!! warning `DocVec` cannot store non-homogenous `BaseDoc` and always needs the `DocVec[DocType]` syntax. -The usage of non-homogenous `DocList` is really similar to a normal Python list but still offers DocArray functionality +The usage of non-homogenous `DocList` is similar to a normal Python list but still offers DocArray functionality like serialization and sending over the wire (LINK). But it won't be able to extend the API of your custom schema to the Array level. Here is how you can instantiate a non-homogenous `DocList`: - ```python from docarray import BaseDoc, DocList from docarray.typing import ImageUrl, AudioUrl @@ -329,18 +327,18 @@ They share almost everything that has been said in the previous sections, but th [`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python List. You can append, extend, insert, pop , ... on it. In DocList, the data is individually owned by each `BaseDoc` collect just -different Document reference. You want to use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able +different Document references. You want to use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able to rearrange or re-rank your data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot -leverage functions that require contiguous data like without first copying the data in a continuous array. 
+leverage functions that require contiguous data without first copying the data in a continuous array. [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. DocVec is always an array of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column. -This mean that when you access the attribute of a `BaseDoc` at the Array level, we don't collect under the hood the data -from all the documents (like `DocList`) before giving it back to you. We just return the column that is store in memory. +This means that when you access the attribute of a `BaseDoc` at the Array level, we don't collect under the hood the data +from all the documents (like `DocList`) before giving it back to you. We just return the column that is stored in memory. -This really matter when you need to handle multi-modal data that you will feed into algorithm that require contiguous data, like matrix multiplication -which is at the heart of Machine Learning especially in Deep Learning. +This really matters when you need to handle multi-modal data that you will feed into an algorithm that require contiguous data, like matrix multiplication +which is at the heart of Machine Learning, especially in Deep Learning. Let's take an example to illustrate the difference @@ -404,15 +402,15 @@ predict(docs.image) predict(docs.image) ``` -First difference is that you don't need to call `np.stack` on `docs.image` because `docs.image` is already a contiguous array. -Second difference is that you just get the column and don't need to create it at each call. +The first difference is that you don't need to call `np.stack` on `docs.image` because `docs.image` is already a contiguous array. +The second difference is that you just get the column and don't need to create it at each call. -One of the other main difference between both of them is how you can access document inside them. 
+One of the other main differences between both of them is how you can access documents inside them. -If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e, a document. +If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e., a document. -If you access a document inside a `DocVec` you will get a document view. A document view is a view of the columnar data structure but which -looks and behave like a `BaseDoc` instance. It is actually a `BaseDoc` instance but with a different way access the data. +If you access a document inside a `DocVec` you will get a document view. A document view is a view of the columnar data structure which +looks and behaves like a `BaseDoc` instance. It is a `BaseDoc` instance but with a different way to access the data. When you do a change at the view level it will be reflected at the DocVec level. @@ -440,10 +438,9 @@ assert not my_doc.is_view() # False !!! Note - to summarize : you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange + to summarize: you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange or extend your data. - See also: * [`DocList`][docarray.array.doc_list.doc_list.DocList] From f01e9e1735786252a9d7e2235e874dc38f3e9e13 Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 12:09:56 +0200 Subject: [PATCH 15/22] fix: fix apply grammarly on glossary Signed-off-by: samsja --- docs/glossary.md | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index fe405e20c13..b562de51da3 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,7 +1,6 @@ # Glossary -DocArray scope is a edge of different field, from AI to web app. To make it easier to understand, we have created a glossary of terms used in the documentation. 
- +DocArray scope is at the edge of different fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation. ## Concept @@ -9,32 +8,32 @@ DocArray scope is a edge of different field, from AI to web app. To make it easi Multi Modal data is data that is composed of different modalities, Image, Text, Video, Audio, etc. For example, a YouTube video is composed of a video, a title, a description, a thumbnail, etc. -Actually most of the data we have in the world is multi-modal. +Actually, most of the data we have in the world is multi-modal. ### `Multi Modal AI` -Multi Modal AI is the field of AI that focus on multi-modal data. +Multi Modal AI is the field of AI that focuses on multi-modal data. -Most of the recent breakthrough in AI are actually multi-modal AI. +Most of the recent breakthroughs in AI are multi-modal AI. * [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *image* from *text*. * [Whisper](https://openai.com/research/whisper) can generate *text* from *speech* -* [GPT4](https://openai.com/product/gpt-4), [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLM (Multi Modal Large language Model) that can undersrtand both *text* and *image*. +* [GPT4](https://openai.com/product/gpt-4), [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLM (Multi Modal Large Language Model) that can understand both *text* and *image*. * ... 
-One of the reason that AI lab are focusing on multi-modal AI is that is can solve a lot of practical problem and that is actually might be -a requirement to build strong AI system as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he said that `A system trained on language alone will never approximate human intelligence`. +One of the reasons that AI labs are focusing on multi-modal AI is that it can solve a lot of practical problems and that it actually might be +a requirement to build a strong AI system as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he said that `A system trained on language alone will never approximate human intelligence`. ### `Generative AI` -Generative AI is as well in the epicenter of the latest AI revolution. These tool allow to *generate* data. +Generative AI is as well in the epicenter of the latest AI revolution. These tools allow us to *generate* data. * [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *image* from *text*. ### `Neural Search` -Neural search is search powered by neural network. Unlike traditional keyword-based search methods, neural search can understand the context and semantic meaning of the query, allowing it to find relevant results even when the exact keywords are not present +Neural search is a search powered by neural networks. 
Unlike traditional keyword-based search methods, neural search can understand the context and semantic meaning of the query, allowing it to find relevant results even when the exact keywords are not present ### `Vector Database` @@ -52,14 +51,14 @@ Originally DocArray was part of Jina but it became a standalone project that is ### `Pydantic` -[Pydantic](https://github.com/pydantic/pydantic/) is a python library that allow to data validation using Python type hints. +[Pydantic](https://github.com/pydantic/pydantic/) is a Python library that allows data validation using Python type hints. DocArray relies on Pydantic. ### `FastAPI` -[FastAPI](https://fastapi.tiangolo.com/) is a python library that allow to build API using Python type hints. +[FastAPI](https://fastapi.tiangolo.com/) is a Python library that allows building API using Python type hints. -It is build on top of Pydantic and nicely extend to DocArray +It is built on top of Pydantic and nicely extends to DocArray ### `Weaviate` From 8fc808b8a0115287be147121c69b09ab37c585c2 Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 13:31:30 +0200 Subject: [PATCH 16/22] fix: fix doc test Signed-off-by: samsja --- docs/user_guide/representing/array.md | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 6904c16a35d..35a0a7696b9 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -127,7 +127,7 @@ Let's see what that looks like: At the document level: ```python -print(banner.url) +print(banner1.image) ``` ```cmd @@ -136,7 +136,7 @@ https://example.com/image1.png' At the Array level: ```python -print(docs.url) +print(docs.image) ``` ```cmd @@ -223,7 +223,7 @@ Yes, `docs.banner` returns a nested `DocList` of `BannerDoc` ! 
You can even access the attribute of the nested `BaseDoc` at the Array level: ```python -print(docs.banner.url) +print(docs.banner.image) ``` ```cmd @@ -364,16 +364,17 @@ def predict(image: NdArray['batch_size', 3, 224, 224]): Let's create a `DocList` of `ImageDoc` and pass it to the function -```python hl_lines="6 8" +```python from docarray import DocList import numpy as np -docs = DocList[ImageDoc]([ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]) +docs = DocList[ImageDoc]( + [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)] +) predict(np.stack(docs.image)) ... predict(np.stack(docs.image)) - ``` When you call `docs.image` DocList loop over the 10 documents and collect the image attribute of each document in a list @@ -391,11 +392,13 @@ multiple time. This is not optimal. Let's see how it will work with `DocVec` -```python hl_lines="6 8" +```python from docarray import DocList import numpy as np -docs = DocList[ImageDoc]([ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]) +docs = DocList[ImageDoc]( + [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)] +) predict(docs.image) ... @@ -415,6 +418,8 @@ looks and behaves like a `BaseDoc` instance. It is a `BaseDoc` instance but with When you do a change at the view level it will be reflected at the DocVec level. 
```python +from docarray import DocVec + docs = DocVec[ImageDoc]( [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)] ) From b6dc7d9f09063b6fa098663a053cd8afb0f05997 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Tue, 11 Apr 2023 13:36:18 +0200 Subject: [PATCH 17/22] docs: fix english Signed-off-by: Alex C-G --- docs/glossary.md | 46 ++++---- docs/user_guide/representing/array.md | 124 ++++++++++----------- docs/user_guide/representing/first_step.md | 4 +- 3 files changed, 82 insertions(+), 92 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index b562de51da3..389e0254134 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,53 +1,49 @@ # Glossary -DocArray scope is at the edge of different fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation. +DocArray's scope is at the edge of different fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation. ## Concept -### `Multi Modal Data` -Multi Modal data is data that is composed of different modalities, Image, Text, Video, Audio, etc. +### `Multimodal Data` +Multimodal data is data that is composed of different modalities, like Image, Text, Video, Audio, etc. For example, a YouTube video is composed of a video, a title, a description, a thumbnail, etc. -Actually, most of the data we have in the world is multi-modal. +Actually, most of the data we have in the world is multimodal. -### `Multi Modal AI` +### `Multimodal AI` -Multi Modal AI is the field of AI that focuses on multi-modal data. +Multimodal AI is the field of AI that focuses on multimodal data. -Most of the recent breakthroughs in AI are multi-modal AI. +Most of the recent breakthroughs in AI are multimodal AI. 
-* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *image* from *text*. -* [Whisper](https://openai.com/research/whisper) can generate *text* from *speech* -* [GPT4](https://openai.com/product/gpt-4), [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLM (Multi Modal Large Language Model) that can understand both *text* and *image*. -* ... +* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [DALL-E 2](https://openai.com/product/dall-e-2) generate *images* from *text*. +* [Whisper](https://openai.com/research/whisper) generates *text* from *speech*. +* [GPT-4](https://openai.com/product/gpt-4) and [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLMs (Multimodal Large Language Models) that understand both *text* and *images*. -One of the reasons that AI labs are focusing on multi-modal AI is that it can solve a lot of practical problems and that it actually might be -a requirement to build a strong AI system as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he said that `A system trained on language alone will never approximate human intelligence`. +One of the reasons that AI labs are focusing on multimodal AI is that it can solve a lot of practical problems and that it actually might be +a requirement to build a strong AI system as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he stated that "a system trained on language alone will never approximate human intelligence." ### `Generative AI` -Generative AI is as well in the epicenter of the latest AI revolution. 
These tools allow us to *generate* data. - -* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *image* from *text*. +Generative AI is also in the epicenter of the latest AI revolution. These tools allow us to *generate* data. +* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*. ### `Neural Search` -Neural search is a search powered by neural networks. Unlike traditional keyword-based search methods, neural search can understand the context and semantic meaning of the query, allowing it to find relevant results even when the exact keywords are not present - +Neural search is search powered by neural networks. Unlike traditional keyword-based search methods, neural search understands the context and semantic meaning of a user's query, allowing it to find relevant results even when the exact keywords are not present. ### `Vector Database` -A vector database is a specialized storage system designed to handle high-dimensional vectors, which are common representations of data in machine learning and AI applications. It enables efficient storage, indexing, and querying of these vectors, and typically supports operations like nearest neighbor search, similarity search, and clustering - +A vector database is a specialized storage system designed to handle high-dimensional vectors, which are common representations of data in machine learning and AI applications. It enables efficient storage, indexing, and querying of these vectors, and typically supports operations like nearest neighbor search, similarity search, and clustering. ## Tools ### `Jina` -[Jina](https://jina.ai) is a framework to build Multi Modal application. 
It heavily relies on DocArray to represent and send data. +[Jina](https://jina.ai) is a framework to build multimodal applications. It relies heavily on DocArray to represent and send data. -Originally DocArray was part of Jina but it became a standalone project that is now independent of Jina. +DocArray was originally part of Jina but it became a standalone project that is now independent of Jina. ### `Pydantic` @@ -58,12 +54,12 @@ DocArray relies on Pydantic. [FastAPI](https://fastapi.tiangolo.com/) is a Python library that allows building API using Python type hints. -It is built on top of Pydantic and nicely extends to DocArray +It is built on top of Pydantic and nicely extends to DocArray. ### `Weaviate` -[Weaviate](https://weaviate.io/) is an open-source vector database that is supported in DocArray +[Weaviate](https://weaviate.io/) is an open-source vector database that is supported in DocArray. ### `Weaviate` -[Qdrant](https://qdrant.tech/) is an open-source vector database that is supported in DocArray \ No newline at end of file +[Qdrant](https://qdrant.tech/) is an open-source vector database that is supported in DocArray. diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 35a0a7696b9..f57dd3740c9 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -1,39 +1,37 @@ # Array of documents -DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search ...). +DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search, etc). -As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which allows to represent a *single* document, a *single* datapoint. 
+As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which represents a *single* document, a *single* datapoint. -In Machine Learning though we often need to work with an *array* of documents, and an *array* of data points. +However, in machine learning we often need to work with an *array* of documents, and an *array* of data points. -This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This library -name: `DocArray` is derived from this concept, and it stands for `DocumentArray`. +This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This name of this library -- +`DocArray` -- is derived from this concept and it is short for `DocumentArray`. ## AnyDocArray -`AnyDocArray` is an abstract class that represents an array of `BaseDoc` which is not meant to be used directly, but to be subclassed. +`AnyDocArray` is an abstract class that represents an array of `BaseDoc`s which is not meant to be used directly, but to be subclassed. -We provide two concrete implementation of `AnyDocArray` : +We provide two concrete implementations of `AnyDocArray` : -- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a Python list of `BaseDoc` -- [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc` +- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a Python list of `BaseDoc`s +- [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc`s -We will go into the difference between `DocList` and `DocVec` in the next section but let's first focus on what they have in common. - - -`AnyDocArray`s spirit is to extend the `BaseDoc` and `BaseModel` concept to the Array level in a *seamless* way. 
+We will go into the difference between `DocList` and `DocVec` in the next section, but let's first focus on what they have in common. +The spirit of `AnyDocArray`s is to extend the `BaseDoc` and `BaseModel` concepts to the array level in a *seamless* way. ### Example -before going into detail lets look at a code example. +Before going into detail lets look at a code example. !!! Note `DocList` and `DocVec` are both `AnyDocArray`. The following section will use `DocList` as an example, but the same applies to `DocVec`. -First you need to create a Doc class, our data schema. Let's say you want to represent a banner with an image, a title and a description. +First you need to create a `Doc` class, our data schema. Let's say you want to represent a banner with an image, a title and a description: ```python from docarray import BaseDoc, DocList @@ -46,7 +44,7 @@ class BannerDoc(BaseDoc): description: str ``` -Let's instantiate several `BannerDoc` +Let's instantiate several `BannerDoc`s: ```python banner1 = BannerDoc( @@ -62,7 +60,7 @@ banner2 = BannerDoc( ) ``` -You can now collect them into a `DocList` of `BannerDoc`: +You can now collect them into a `DocList` of `BannerDoc`s: ```python docs = DocList[BannerDoc]([banner1, banner2]) @@ -111,21 +109,20 @@ BannerDoc(image='https://example.com/image1.png', title='Hello World', descripti BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', description='This is (distopic) banner') ``` - - !!! note - The syntax `DocList[BannerDoc]` might surprise you in this context, - it is actually at the heart of DocArray but let's come back to it later LINK TO LATER and continue with the example. + The syntax `DocList[BannerDoc]` might surprise you in this context. + It is actually at the heart of DocArray but we'll come back to it later LINK TO LATER and continue with this example for now. -As we said earlier, `DocList` or more generally `AnyDocArray`, extends the `BaseDoc` API at the Array level. 
+As we said earlier, `DocList` (or more generally `AnyDocArray`) extends the `BaseDoc` API at the array level. -What this means concretely is that the same way you can access your data at the -document level, you can access it at the Array level. +What this means concretely is you can access your data at the Array level in just the same way you would access your data at the +document level. Let's see what that looks like: At the document level: + ```python print(banner1.image) ``` @@ -135,6 +132,7 @@ https://example.com/image1.png' ``` At the Array level: + ```python print(docs.image) ``` @@ -147,8 +145,8 @@ print(docs.image) All the attributes of `BannerDoc` are accessible at the Array level. !!! Warning - Whereas this is true at runtime, static type analyzers like Mypy or IDE like PyCharm will not be able to know it. - This limitation is known and will be fixed in the future by the introduction of a Mypy, PyCharm, VSCode plugin. + Whereas this is true at runtime, static type analyzers like Mypy or IDEs like PyCharm will not be be aware of it. + This limitation is known and will be fixed in the future by the introduction of plugins for Mypy, PyCharm and VSCode. This even works when you have a nested `BaseDoc`: @@ -174,7 +172,7 @@ page1 = PageDoc( title='Hello World', description='This is a banner', ), - content='Hello wolrd is the most used example in programming, but do you know that ? ...', + content='Hello world is the most used example in programming, but do you know that? ...', ) page2 = PageDoc( @@ -183,7 +181,7 @@ page2 = PageDoc( title='Bye Bye World', description='This is (distopic) banner', ), - content='What if the most used example in programming was Bye Bye World, would programming be that much fun ? ...', + content='What if the most used example in programming was Bye Bye World, would programming be that much fun? 
...', ) docs = DocList[PageDoc]([page1, page2]) @@ -218,9 +216,9 @@ print(docs.banner) ``` -Yes, `docs.banner` returns a nested `DocList` of `BannerDoc` ! +Yes, `docs.banner` returns a nested `DocList` of `BannerDoc`s! -You can even access the attribute of the nested `BaseDoc` at the Array level: +You can even access the attributes of the nested `BaseDoc` at the Array level: ```python print(docs.banner.image) @@ -230,7 +228,7 @@ print(docs.banner.image) ['https://example.com/image1.png', 'https://example.com/image2.png'] ``` -The same way that [BaseDoc][docarray.base_doc.doc.BaseDoc] you would have done: +This is just the same way that you would do it with [BaseDoc][docarray.base_doc.doc.BaseDoc]: ```python print(page1.banner.image) @@ -242,27 +240,25 @@ print(page1.banner.image) ### `DocList[DocType]` syntax -As you have seen in the previous section, `AnyDocArray` will expose the same attributes as the `BaseDoc` it contains. +As you have seen in the previous section, `AnyDocArray` will expose the same attributes as the `BaseDoc`s it contains. -But this concept only works if and only if all of the `BaseDoc`s in the `AnyDocArray` have the same schema. +But this concept only works if (and only if) all of the `BaseDoc`s in the `AnyDocArray` have the same schema. -Indeed, if one of your `BaseDoc`s has an attribute that the others don't, you will get an error if you try to access it at -the array level. +If one of your `BaseDoc`s has an attribute that the others don't, you will get an error if you try to access it at +the Array level. !!! note - To be able to extend your schema to the Array level, `AnyDocArray` needs to contain a homogenous Document. + To extend your schema to the Array level, `AnyDocArray` needs to contain a homogenous Document. This is where the custom syntax `DocList[DocType]` comes into play. !!! `DocList[DocType]` creates a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Documents. 
-This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of `BaseDoc` rather than -just an array of non-homogenous `BaseDoc`. - +This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of `BaseDoc`s rather than just an array of non-homogenous `BaseDoc`s. -That being said, `AnyDocArray` can also be used to create a non-homogenous `AnyDocArray`: +That said, `AnyDocArray` can also be used to create a non-homogenous `AnyDocArray`: !!! note The default `DocList` can be used to create a non-homogenous list of `BaseDoc`. @@ -270,11 +266,11 @@ That being said, `AnyDocArray` can also be used to create a non-homogenous `AnyD !!! warning `DocVec` cannot store non-homogenous `BaseDoc` and always needs the `DocVec[DocType]` syntax. -The usage of non-homogenous `DocList` is similar to a normal Python list but still offers DocArray functionality -like serialization and sending over the wire (LINK). But it won't be able to extend the API of your custom schema to the Array level. - +The usage of a non-homogenous `DocList` is similar to a normal Python list but still offers DocArray functionality +like serialization and sending over the wire (LINK). However, it won't be able to extend the API of your custom schema to the Array level. 
Here is how you can instantiate a non-homogenous `DocList`: + ```python from docarray import BaseDoc, DocList from docarray.typing import ImageUrl, AudioUrl @@ -296,7 +292,7 @@ docs = DocList( ) ``` -But you will not have been able to do +But this is not possible: ```python try: @@ -320,30 +316,30 @@ ValueError: AudioDoc( ### `DocList` vs `DocVec` [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] are both -[`AnyDocArray`][docarray.array.any_array.AnyDocArray] but they have different use cases, and they differ in how -they store the data in memory. +[`AnyDocArray`][docarray.array.any_array.AnyDocArray] but they have different use cases, and differ in how +they store data in memory. They share almost everything that has been said in the previous sections, but they have some conceptual differences. -[`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python List. -You can append, extend, insert, pop , ... on it. In DocList, the data is individually owned by each `BaseDoc` collect just -different Document references. You want to use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able -to rearrange or re-rank your data. One flaw of `DocList` is that none of the data is contiguous in memory. So you cannot +[`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python Lists. +You can append, extend, insert, pop, and so on. In DocList, data is individually owned by each `BaseDoc` collect just +different Document references. Use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able +to rearrange or re-rank your data. One flaw of `DocList` is that none of the data is contiguous in memory, so you cannot leverage functions that require contiguous data without first copying the data in a continuous array. -[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. 
DocVec is always an array +[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. `DocVec` is always an array of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column. -This means that when you access the attribute of a `BaseDoc` at the Array level, we don't collect under the hood the data +This means that when you access the attribute of a `BaseDoc` at the Array level, we don't collect the data under the hood from all the documents (like `DocList`) before giving it back to you. We just return the column that is stored in memory. -This really matters when you need to handle multi-modal data that you will feed into an algorithm that require contiguous data, like matrix multiplication +This really matters when you need to handle multi-modal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication which is at the heart of Machine Learning, especially in Deep Learning. -Let's take an example to illustrate the difference +Let's take an example to illustrate the difference: +Let's say you want to work with an Image: -Let's say you want to work with Image: ```python from docarray import BaseDoc from docarray.typing import NdArray @@ -355,14 +351,14 @@ class ImageDoc(BaseDoc): ] = None # [3, 224, 224] this just mean we know in advance the shape of the tensor ``` -And that you have a function that take a contiguous array of image as input (like a deep learning model) +And that you have a function that takes a contiguous array of images as input (like a deep learning model): ```python def predict(image: NdArray['batch_size', 3, 224, 224]): ... 
``` -Let's create a `DocList` of `ImageDoc` and pass it to the function +Let's create a `DocList` of `ImageDoc`s and pass it to the function: ```python from docarray import DocList @@ -377,9 +373,7 @@ predict(np.stack(docs.image)) predict(np.stack(docs.image)) ``` -When you call `docs.image` DocList loop over the 10 documents and collect the image attribute of each document in a list - -it is similar to do +When you call `docs.image`, `DocList` loops over the ten documents and collects the image attribute of each document in a list. It is similar to doing: ```python images = [] @@ -388,9 +382,9 @@ for doc in docs: ``` this means that if you need to call `docs.image` multiple time, you will have to stack in the array in a contiguous batch array -multiple time. This is not optimal. +multiple times. This is not optimal. -Let's see how it will work with `DocVec` +Let's see how it will work with `DocVec`: ```python from docarray import DocList @@ -410,12 +404,12 @@ The second difference is that you just get the column and don't need to create i One of the other main differences between both of them is how you can access documents inside them. -If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e., a document. +If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e. a document. If you access a document inside a `DocVec` you will get a document view. A document view is a view of the columnar data structure which looks and behaves like a `BaseDoc` instance. It is a `BaseDoc` instance but with a different way to access the data. -When you do a change at the view level it will be reflected at the DocVec level. +When you make a change at the view level it will be reflected at the DocVec level: ```python from docarray import DocVec @@ -443,7 +437,7 @@ assert not my_doc.is_view() # False !!! 
Note - to summarize: you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange + To summarize: you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange or extend your data. See also: diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index 93ff0cdfc46..a92974d339d 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -8,8 +8,8 @@ the Pydantic world) to represent your data. !!! note - Naming convention. When we refer to a BaseDoc we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc]. - When we refer to a `Document` we refer to an instance of a BaseDoc class. + Naming convention: When we refer to a `BaseDoc` we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc]. + When we refer to a `Document` we refer to an instance of a `BaseDoc` class. ## Basic `Doc` usage. From dcaa575a6d4a6a193680372402342bf28cde13c5 Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 13:45:14 +0200 Subject: [PATCH 18/22] fix: fix add gpt to generative ai Signed-off-by: samsja --- docs/glossary.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/glossary.md b/docs/glossary.md index 389e0254134..b6810c9d25c 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -28,6 +28,7 @@ a requirement to build a strong AI system as argued by Yann Lecun in [this artic Generative AI is also in the epicenter of the latest AI revolution. These tools allow us to *generate* data. * [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*. +* LLM: Large Language Model, (GPT, Flan, LLama, Bloom). These models generate *text*. 
### `Neural Search` From 5dee5e991d06cf879fb8d7f22cb0d86179a9c7fd Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 13:48:08 +0200 Subject: [PATCH 19/22] fix: fix sentence Signed-off-by: samsja --- docs/user_guide/representing/array.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index f57dd3740c9..3fafa546ff8 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -381,8 +381,7 @@ for doc in docs: images.append(doc.image) ``` -this means that if you need to call `docs.image` multiple time, you will have to stack in the array in a contiguous batch array -multiple times. This is not optimal. +this means that if you need to call `docs.image` multiple time under the hood you will collect the image from each document and stack them several times. This is not optimal. Let's see how it will work with `DocVec`: From 7f3de3127c28ffae4a3da779ea101b16eac26667 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Tue, 11 Apr 2023 14:03:36 +0200 Subject: [PATCH 20/22] docs: fix english of fixed sentence Signed-off-by: Alex C-G --- docs/user_guide/representing/array.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 3fafa546ff8..2e1151b70d8 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -381,7 +381,7 @@ for doc in docs: images.append(doc.image) ``` -this means that if you need to call `docs.image` multiple time under the hood you will collect the image from each document and stack them several times. This is not optimal. +this means that if you call `docs.image` multiple times, under the hood you will collect the image from each document and stack them several times. This is not optimal. 
Let's see how it will work with `DocVec`: From 0eee05fef1e5533dd74c334ef4489f14fb722511 Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Tue, 11 Apr 2023 14:09:53 +0200 Subject: [PATCH 21/22] feat: apply alex suggestion Co-authored-by: Alex Cureton-Griffiths Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/representing/array.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 2e1151b70d8..140d89f5d0e 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -333,7 +333,7 @@ of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will This means that when you access the attribute of a `BaseDoc` at the Array level, we don't collect the data under the hood from all the documents (like `DocList`) before giving it back to you. We just return the column that is stored in memory. -This really matters when you need to handle multi-modal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication +This really matters when you need to handle multimodal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication which is at the heart of Machine Learning, especially in Deep Learning. 
Let's take an example to illustrate the difference: From bae09f991120ca19af1ab2be559797a62f02952d Mon Sep 17 00:00:00 2001 From: samsja Date: Tue, 11 Apr 2023 14:26:09 +0200 Subject: [PATCH 22/22] fix: fix link Signed-off-by: samsja --- docs/user_guide/representing/array.md | 11 ++++++----- docs/user_guide/representing/first_step.md | 10 +++++----- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index 140d89f5d0e..ac0311f00bf 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -441,8 +441,9 @@ assert not my_doc.is_view() # False See also: -* [`DocList`][docarray.array.doc_list.doc_list.DocList] -* [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] -* REPRESENTING REF -* STORING REF -* ... + +* [First step](./first_step.md) of the representing section +* API Reference for the [`DocList`][docarray.array.doc_list.doc_list.DocList] class +* API Reference for the [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] class +* The [Storing](../storing/first_step.md) section on how to store your data +* The [Sending](../sending/first_step.md) section on how to send your data diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index a92974d339d..ffdb9275f67 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -133,8 +133,8 @@ This representation can now be used to send (LINK) or to store (LINK) data. You See also: -* [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference -* DOCUMENT_ARARY REF -* DOCUMENT INDEX REF -* DOCUMENT STORE REF -* ... +* The [next section](./array.md) of the representing section +* API Reference for the [BaseDoc][docarray.base_doc.doc.BaseDoc] class +* The [Storing](../storing/first_step.md) section on how to store your data +* The [Sending](../sending/first_step.md) section on how to send your data +