diff --git a/docs/api_references/array/any_da.md b/docs/api_references/array/any_da.md
new file mode 100644
index 00000000000..e71d1999cf5
--- /dev/null
+++ b/docs/api_references/array/any_da.md
@@ -0,0 +1,3 @@
+# AnyDocArray
+
+::: docarray.array.doc_vec.doc_vec.DocVec
diff --git a/docs/api_references/array/da_stack.md b/docs/api_references/array/da_stack.md
index c0709f2e084..74f9ff637a0 100644
--- a/docs/api_references/array/da_stack.md
+++ b/docs/api_references/array/da_stack.md
@@ -1,3 +1,3 @@
 # DocVec
 
-::: docarray.array.doc_vec.doc_vec.DocVec
+::: docarray.array.any_array.AnyDocArray
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 00000000000..b6810c9d25c
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,66 @@
+# Glossary
+
+DocArray's scope is at the edge of different fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation.
+
+## Concept
+
+### `Multimodal Data`
+Multimodal data is data that is composed of different modalities, like Image, Text, Video, Audio, etc.
+For example, a YouTube video is composed of a video, a title, a description, a thumbnail, etc.
+
+Actually, most of the data we have in the world is multimodal.
+
+### `Multimodal AI`
+
+Multimodal AI is the field of AI that focuses on multimodal data.
+
+Most of the recent breakthroughs in AI are multimodal AI.
+
+* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [DALL-E 2](https://openai.com/product/dall-e-2) generate *images* from *text*.
+* [Whisper](https://openai.com/research/whisper) generates *text* from *speech*.
+* [GPT-4](https://openai.com/product/gpt-4) and [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLMs (Multimodal Large Language Models) that understand both *text* and *images*.
+
+One of the reasons that AI labs are focusing on multimodal AI is that it can solve a lot of practical problems and that it might actually be
+a requirement for building a strong AI system, as argued by Yann LeCun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/), where he stated that "a system trained on language alone will never approximate human intelligence."
+
+### `Generative AI`
+
+Generative AI is also at the epicenter of the latest AI revolution. Generative models allow us to *generate* data.
+
+* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [DALL-E 2](https://openai.com/product/dall-e-2) generate *images* from *text*.
+* LLMs: Large Language Models (GPT, Flan, LLaMA, BLOOM). These models generate *text*.
+
+### `Neural Search`
+
+Neural search is search powered by neural networks. Unlike traditional keyword-based search methods, neural search understands the context and semantic meaning of a user's query, allowing it to find relevant results even when the exact keywords are not present.
+
+### `Vector Database`
+
+A vector database is a specialized storage system designed to handle high-dimensional vectors, which are common representations of data in machine learning and AI applications. It enables efficient storage, indexing, and querying of these vectors, and typically supports operations like nearest neighbor search, similarity search, and clustering.
+
+## Tools
+
+### `Jina`
+
+[Jina](https://jina.ai) is a framework to build multimodal applications. It relies heavily on DocArray to represent and send data.
+
+DocArray was originally part of Jina but it became a standalone project that is now independent of Jina.
+
+### `Pydantic`
+
+[Pydantic](https://github.com/pydantic/pydantic/) is a Python library that allows data validation using Python type hints.
+DocArray relies on Pydantic.
+
+### `FastAPI`
+
+[FastAPI](https://fastapi.tiangolo.com/) is a Python library for building APIs using Python type hints.
+
+It is built on top of Pydantic and nicely extends to DocArray.
+
+### `Weaviate`
+
+[Weaviate](https://weaviate.io/) is an open-source vector database that is supported in DocArray.
+
+### `Qdrant`
+
+[Qdrant](https://qdrant.tech/) is an open-source vector database that is supported in DocArray.
diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md
new file mode 100644
index 00000000000..ac0311f00bf
--- /dev/null
+++ b/docs/user_guide/representing/array.md
@@ -0,0 +1,449 @@
+# Array of documents
+
+DocArray allows users to represent and manipulate multimodal data to build AI applications (Generative AI, neural search, etc).
+
+As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which represents a *single* document, a *single* datapoint.
+
+However, in machine learning we often need to work with an *array* of documents, that is, an *array* of data points.
+
+This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`s. The name of this library,
+`DocArray`, is derived from this concept and is short for `DocumentArray`.
+
+## AnyDocArray
+
+`AnyDocArray` is an abstract class that represents an array of `BaseDoc`s. It is not meant to be used directly, but to be subclassed.
+
+We provide two concrete implementations of `AnyDocArray`:
+
+- [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a Python list of `BaseDoc`s
+- [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc`s
+
+We will go into the difference between `DocList` and `DocVec` in the next section, but let's first focus on what they have in common.
+
+The spirit of `AnyDocArray`s is to extend the `BaseDoc` and `BaseModel` concepts to the array level in a *seamless* way.
+
+### Example
+
+Before going into detail let's look at a code example.
+
+!!! Note
+
+    `DocList` and `DocVec` are both `AnyDocArray`s. The following section will use `DocList` as an example, but the same
+    applies to `DocVec`.
+
+First you need to create a `Doc` class, our data schema. Let's say you want to represent a banner with an image, a title and a description:
+
+```python
+from docarray import BaseDoc, DocList
+from docarray.typing import ImageUrl
+
+
+class BannerDoc(BaseDoc):
+    image: ImageUrl
+    title: str
+    description: str
+```
+
+Let's instantiate several `BannerDoc`s:
+
+```python
+banner1 = BannerDoc(
+    image='https://example.com/image1.png',
+    title='Hello World',
+    description='This is a banner',
+)
+
+banner2 = BannerDoc(
+    image='https://example.com/image2.png',
+    title='Bye Bye World',
+    description='This is (distopic) banner',
+)
+```
+
+You can now collect them into a `DocList` of `BannerDoc`s:
+
+```python
+docs = DocList[BannerDoc]([banner1, banner2])
+
+docs.summary()
+```
+
+```cmd
+╭──────── DocList Summary ────────╮
+│                                 │
+│   Type     DocList[BannerDoc]   │
+│   Length   2                    │
+│                                 │
+╰─────────────────────────────────╯
+╭──── Document Schema ─────╮
+│                          │
+│   BannerDoc              │
+│   ├── image: ImageUrl    │
+│   ├── title: str         │
+│   └── description: str   │
+│                          │
+╰──────────────────────────╯
+```
+
+`docs` here is an array-like collection of `BannerDoc`s.
+
+You can access documents inside it with the usual Python array API:
+
+```python
+print(docs[0])
+```
+
+```cmd
+BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')
+```
+
+or iterate over it:
+
+```python
+for doc in docs:
+    print(doc)
+```
+
+```cmd
+BannerDoc(image='https://example.com/image1.png', title='Hello World', description='This is a banner')
+BannerDoc(image='https://example.com/image2.png', title='Bye Bye World', description='This is (distopic) banner')
+```
+
+!!! note
+    The syntax `DocList[BannerDoc]` might surprise you in this context.
+    It is actually at the heart of DocArray but we'll come back to it later LINK TO LATER and continue with this example for now.
+
+As we said earlier, `DocList` (or more generally `AnyDocArray`) extends the `BaseDoc` API at the array level.
+
+Concretely, this means that you can access your data at the Array level in just the same way you would access your data at the
+document level.
+
+Let's see what that looks like:
+
+
+At the document level:
+
+```python
+print(banner1.image)
+```
+
+```cmd
+https://example.com/image1.png
+```
+
+At the Array level:
+
+```python
+print(docs.image)
+```
+
+```cmd
+['https://example.com/image1.png', 'https://example.com/image2.png']
+```
+
+!!! Important
+    All the attributes of `BannerDoc` are accessible at the Array level.
+
+!!! Warning
+    Whereas this is true at runtime, static type analyzers like Mypy or IDEs like PyCharm will not be aware of it.
+    This limitation is known and will be fixed in the future by the introduction of plugins for Mypy, PyCharm and VSCode.
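
To make the array-level access described above more concrete, here is a minimal pure-Python sketch of the general idea. It is not DocArray's actual implementation: `TinyDoc` and `TinyDocList` are hypothetical names invented for this illustration, and the sketch only assumes that an attribute unknown to the list is looked up on every element.

```python
from dataclasses import dataclass


@dataclass
class TinyDoc:  # hypothetical stand-in for a BaseDoc subclass
    image: str
    title: str


class TinyDocList(list):  # hypothetical stand-in for DocList
    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails,
        # so regular list methods (append, pop, ...) are unaffected.
        if name.startswith('__'):
            raise AttributeError(name)
        # Collect the same attribute from every document in the list.
        return [getattr(doc, name) for doc in self]


docs = TinyDocList(
    [
        TinyDoc(image='https://example.com/image1.png', title='Hello World'),
        TinyDoc(image='https://example.com/image2.png', title='Bye Bye World'),
    ]
)

titles = docs.title
print(titles)
```

In DocArray itself you do not need to write anything like this: the array-level access shown above is built in.
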
+
+This even works when you have a nested `BaseDoc`:
+
+```python
+from docarray import BaseDoc, DocList
+from docarray.typing import ImageUrl
+
+
+class BannerDoc(BaseDoc):
+    image: ImageUrl
+    title: str
+    description: str
+
+
+class PageDoc(BaseDoc):
+    banner: BannerDoc
+    content: str
+
+
+page1 = PageDoc(
+    banner=BannerDoc(
+        image='https://example.com/image1.png',
+        title='Hello World',
+        description='This is a banner',
+    ),
+    content='Hello world is the most used example in programming, but do you know that? ...',
+)
+
+page2 = PageDoc(
+    banner=BannerDoc(
+        image='https://example.com/image2.png',
+        title='Bye Bye World',
+        description='This is (distopic) banner',
+    ),
+    content='What if the most used example in programming was Bye Bye World, would programming be that much fun? ...',
+)
+
+docs = DocList[PageDoc]([page1, page2])
+
+docs.summary()
+```
+
+```cmd
+╭─────── DocList Summary ───────╮
+│                               │
+│   Type     DocList[PageDoc]   │
+│   Length   2                  │
+│                               │
+╰───────────────────────────────╯
+╭────── Document Schema ───────╮
+│                              │
+│   PageDoc                    │
+│   ├── banner: BannerDoc      │
+│   │   ├── image: ImageUrl    │
+│   │   ├── title: str         │
+│   │   └── description: str   │
+│   └── content: str           │
+│                              │
+╰──────────────────────────────╯
+```
+
+```python
+print(docs.banner)
+```
+
+```cmd
+
+```
+
+Yes, `docs.banner` returns a nested `DocList` of `BannerDoc`s!
+
+You can even access the attributes of the nested `BaseDoc` at the Array level:
+
+```python
+print(docs.banner.image)
+```
+
+```cmd
+['https://example.com/image1.png', 'https://example.com/image2.png']
+```
+
+This is just the same way that you would do it with [BaseDoc][docarray.base_doc.doc.BaseDoc]:
+
+```python
+print(page1.banner.image)
+```
+
+```cmd
+https://example.com/image1.png
+```
+
+### `DocList[DocType]` syntax
+
+As you have seen in the previous section, `AnyDocArray` will expose the same attributes as the `BaseDoc`s it contains.
+
+But this only works if all of the `BaseDoc`s in the `AnyDocArray` have the same schema.
+
+If one of your `BaseDoc`s has an attribute that the others don't, you will get an error if you try to access it at
+the Array level.
+
+
+!!! note
+    To extend your schema to the Array level, `AnyDocArray` needs to contain homogeneous Documents.
+
+This is where the custom syntax `DocList[DocType]` comes into play.
+
+!!! note
+    `DocList[DocType]` creates a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Documents.
+
+This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of homogeneous `BaseDoc`s rather than just an array of non-homogeneous `BaseDoc`s.
+
+That said, a `DocList` can also hold non-homogeneous documents:
+
+!!! note
+    The default `DocList` can be used to create a non-homogeneous list of `BaseDoc`s.
+
+!!! warning
+    `DocVec` cannot store non-homogeneous `BaseDoc`s and always needs the `DocVec[DocType]` syntax.
+
+The usage of a non-homogeneous `DocList` is similar to a normal Python list but still offers DocArray functionality
+like serialization and sending over the wire (LINK). However, it won't be able to extend the API of your custom schema to the Array level.
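
As a rough illustration of how a `Container[Type]` syntax can restrict what a list accepts, here is a minimal pure-Python sketch. It is not how DocArray implements `DocList[DocType]`: `TinyTypedList` and the two empty document classes are hypothetical, and the sketch only assumes that Python's `__class_getitem__` hook is used to build a subclass that remembers the element type.

```python
class TinyTypedList(list):  # hypothetical, for illustration only
    doc_type = object  # the untyped default accepts any item

    def __class_getitem__(cls, item):
        # TinyTypedList[Foo] builds a subclass that remembers Foo.
        name = f'{cls.__name__}[{item.__name__}]'
        return type(name, (cls,), {'doc_type': item})

    def __init__(self, docs=()):
        docs = list(docs)
        # Reject any item that does not match the remembered type.
        for doc in docs:
            if not isinstance(doc, self.doc_type):
                raise ValueError(f'{doc!r} is not a {self.doc_type.__name__}')
        super().__init__(docs)


class TinyImageDoc:  # hypothetical document types
    pass


class TinyAudioDoc:
    pass


# The untyped list accepts mixed documents.
mixed = TinyTypedList([TinyImageDoc(), TinyAudioDoc()])

# The typed list rejects a document of the wrong type.
try:
    TinyTypedList[TinyImageDoc]([TinyImageDoc(), TinyAudioDoc()])
    rejected = False
except ValueError:
    rejected = True

print(len(mixed), rejected)
```

Because every element of the typed list is known to share one schema, attribute access at the list level is always well defined, which is the property the text above relies on.
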
+
+Here is how you can instantiate a non-homogeneous `DocList`:
+
+```python
+from docarray import BaseDoc, DocList
+from docarray.typing import ImageUrl, AudioUrl
+
+
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+
+
+class AudioDoc(BaseDoc):
+    url: AudioUrl
+
+
+docs = DocList(
+    [
+        ImageDoc(url='https://example.com/image1.png'),
+        AudioDoc(url='https://example.com/audio1.mp3'),
+    ]
+)
+```
+
+But this is not possible:
+
+```python
+try:
+    docs = DocList[ImageDoc](
+        [
+            ImageDoc(url='https://example.com/image1.png'),
+            AudioDoc(url='https://example.com/audio1.mp3'),
+        ]
+    )
+except ValueError as e:
+    print(e)
+```
+
+```cmd
+ValueError: AudioDoc(
+    id='e286b10f58533f48a0928460f0206441',
+    url=AudioUrl('https://example.com/audio1.mp3', host_type='domain')
+) is not a
+```
+
+### `DocList` vs `DocVec`
+
+[`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] are both
+[`AnyDocArray`][docarray.array.any_array.AnyDocArray]s but they have different use cases, and differ in how
+they store data in memory.
+
+Almost everything said in the previous sections applies to both, but they have some conceptual differences.
+
+[`DocList`][docarray.array.doc_list.doc_list.DocList] is based on Python lists.
+You can append, extend, insert, pop, and so on. In `DocList`, data is individually owned by each `BaseDoc`; the `DocList` just
+collects references to the different Documents. Use [`DocList`][docarray.array.doc_list.doc_list.DocList] when you want to be able
+to rearrange or re-rank your data. One flaw of `DocList` is that none of the data is contiguous in memory, so you cannot
+leverage functions that require contiguous data without first copying the data into a contiguous array.
+
+[`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] is a columnar data structure. `DocVec` is always an array
+of homogeneous Documents. The idea is that every attribute of the `BaseDoc` will be stored in a contiguous array: a column.
+
+This means that when you access the attribute of a `BaseDoc` at the Array level, we don't collect the data under the hood
+from all the documents (like `DocList`) before giving it back to you. We just return the column that is stored in memory.
+
+This really matters when you need to handle multimodal data that you will feed into an algorithm that requires contiguous data, like matrix multiplication
+which is at the heart of Machine Learning, especially in Deep Learning.
+
+Let's take an example to illustrate the difference:
+
+Let's say you want to work with an Image:
+
+```python
+from docarray import BaseDoc
+from docarray.typing import NdArray
+
+
+class ImageDoc(BaseDoc):
+    image: NdArray[
+        3, 224, 224
+    ] = None  # [3, 224, 224] this just means we know in advance the shape of the tensor
+```
+
+And that you have a function that takes a contiguous array of images as input (like a deep learning model):
+
+```python
+def predict(image: NdArray['batch_size', 3, 224, 224]):
+    ...
+```
+
+Let's create a `DocList` of `ImageDoc`s and pass it to the function:
+
+```python
+from docarray import DocList
+import numpy as np
+
+docs = DocList[ImageDoc](
+    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
+)
+
+predict(np.stack(docs.image))
+...
+predict(np.stack(docs.image))
+```
+
+When you call `docs.image`, `DocList` loops over the ten documents and collects the image attribute of each document in a list. It is similar to doing:
+
+```python
+images = []
+for doc in docs:
+    images.append(doc.image)
+```
+
+This means that if you call `docs.image` multiple times, under the hood you will collect the image from each document and stack them several times. This is not optimal.
+
+Let's see how it will work with `DocVec`:
+
+```python
+from docarray import DocVec
+import numpy as np
+
+docs = DocVec[ImageDoc](
+    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
+)
+
+predict(docs.image)
+...
+predict(docs.image)
+```
+
+The first difference is that you don't need to call `np.stack` on `docs.image` because `docs.image` is already a contiguous array.
+The second difference is that you just get the column and don't need to create it at each call.
+
+Another main difference between the two is how you access documents inside them.
+
+If you access a document inside a `DocList` you will get a `BaseDoc` instance, i.e. a document.
+
+If you access a document inside a `DocVec` you will get a document view. A document view is a view of the columnar data structure which
+looks and behaves like a `BaseDoc` instance. It is a `BaseDoc` instance but with a different way to access the data.
+
+When you make a change at the view level it will be reflected at the `DocVec` level:
+
+```python
+from docarray import DocVec
+
+docs = DocVec[ImageDoc](
+    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
+)
+
+my_doc = docs[0]
+
+assert my_doc.is_view()  # is_view() returns True
+```
+
+whereas with `DocList`:
+
+```python
+docs = DocList[ImageDoc](
+    [ImageDoc(image=np.random.rand(3, 224, 224)) for _ in range(10)]
+)
+
+my_doc = docs[0]
+
+assert not my_doc.is_view()  # is_view() returns False
+```
+
+
+!!! Note
+    To summarize: you should use `DocVec` when you need to work with contiguous data, and you should use `DocList` when you need to rearrange
+    or extend your data.
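
The contrast between the two layouts, and the idea of a document view, can be sketched in plain Python. This is only an illustration under simplifying assumptions, not DocArray's implementation: `Point` and `RowView` are hypothetical names, and ordinary Python lists stand in for the contiguous arrays that `DocVec` stores.

```python
from dataclasses import dataclass


@dataclass
class Point:  # hypothetical document with two numeric fields
    x: float
    y: float


# Row-oriented layout (DocList-like): a list of independent objects.
rows = [Point(1.0, 2.0), Point(3.0, 4.0), Point(5.0, 6.0)]
xs_from_rows = [p.x for p in rows]  # rebuilt on every access

# Column-oriented layout (DocVec-like): one sequence per field.
columns = {'x': [1.0, 3.0, 5.0], 'y': [2.0, 4.0, 6.0]}
xs_from_columns = columns['x']  # the stored column itself, no copy


class RowView:
    """Hypothetical view: reads and writes go to the shared columns."""

    def __init__(self, columns, index):
        # Bypass our own __setattr__ to store the view's bookkeeping.
        object.__setattr__(self, '_columns', columns)
        object.__setattr__(self, '_index', index)

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Writing through the view updates the column in place.
        self._columns[name][self._index] = value


view = RowView(columns, 0)
view.x = 10.0
print(columns['x'][0], view.y)
```

The write through `view` changes the shared column, which mirrors the behavior described above for document views; the row-oriented list, by contrast, has to gather `x` from each object every time it is requested.
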
+
+See also:
+
+
+* [First step](./first_step.md) of the representing section
+* API Reference for the [`DocList`][docarray.array.doc_list.doc_list.DocList] class
+* API Reference for the [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] class
+* The [Storing](../storing/first_step.md) section on how to store your data
+* The [Sending](../sending/first_step.md) section on how to send your data
diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index c20b0dc553f..ffdb9275f67 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -1,4 +1,4 @@
-# Representing
+# Document
 
 At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
 
@@ -6,6 +6,11 @@ A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https
 [`BaseModel`](https://docs.Pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`.
 It allows you to define custom `Document` schemas (or `Model` in the Pydantic world) to represent your data.
 
+
+!!! note
+    Naming convention: When we refer to a `BaseDoc` we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc].
+    When we refer to a `Document` we refer to an instance of a `BaseDoc` class.
+
 ## Basic `Doc` usage.
 
 Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
@@ -128,8 +133,8 @@ This representation can now be used to send (LINK) or to store (LINK) data. You
 
 See also:
 
-* [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference
-* DOCUMENT_ARARY REF
-* DOCUMENT INDEX REF
-* DOCUMENT STORE REF
-* ...
+* The [next section](./array.md) of the representing section +* API Reference for the [BaseDoc][docarray.base_doc.doc.BaseDoc] class +* The [Storing](../storing/first_step.md) section on how to store your data +* The [Sending](../sending/first_step.md) section on how to send your data + diff --git a/mkdocs.yml b/mkdocs.yml index ca72a966197..f4441995378 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -74,7 +74,9 @@ nav: - Home: README.md - Tutorial - User Guide: - user_guide/intro.md - - user_guide/representing/first_step.md + - Representing: + - user_guide/representing/first_step.md + - user_guide/representing/array.md - user_guide/sending/first_step.md - user_guide/storing/first_step.md @@ -84,4 +86,5 @@ nav: - how_to/optimize_performance_with_id_generation.md - how_to/audio2text.md - ... + - Glossary: glossary.md - Contributing: CONTRIBUTING.md