Merged
13 changes: 5 additions & 8 deletions docs/user_guide/intro.md
@@ -4,11 +4,11 @@ This user guide shows you how to use `DocArray` with most of its features.

There are three main sections:

- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "pydantic for ML".
- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases".
- [Representing data](representing/first_step.md): This section will show you how to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "Pydantic for ML".
- [Sending data](sending/first_step.md): This section will show you how to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
- [Storing data](storing/first_step.md): This section will show you how to store your data. This is a great starting point if you are looking for an "ORM for vector databases".

You should start by reading the [Representing Data](representing/first_step.md) section, and then the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
You should start by reading the [Representing data](representing/first_step.md) section, and then the [Sending data](sending/first_step.md) and [Storing data](storing/first_step.md) sections can be read in any order.

You will first need to install `DocArray` in your Python environment.

@@ -35,7 +35,7 @@ This will install the main dependencies of `DocArray` and will work with all the
```

Depending on your usage you might want to use `DocArray` with only a couple of specific modalities and their dependencies.
For instance, let's say you only want to work with images, you can install `DocArray` using the following command:
For instance, if you only want to work with images, you can install `DocArray` using the following command:

```
pip install "docarray[image]"
@@ -46,6 +46,3 @@ pip install "docarray[image]"
```
pip install "docarray[image, audio]"
```

!!! warning
This way of installing `DocArray` is only valid starting with version `0.30`
51 changes: 23 additions & 28 deletions docs/user_guide/representing/first_step.md
@@ -3,20 +3,19 @@
At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].

A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/)
[`BaseModel`](https://docs.Pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
[`BaseModel`](https://docs.pydantic.dev/usage/models) -- in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model`s in
the Pydantic world) to represent your data.


!!! note
Naming convention: When we refer to a `BaseDoc` we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc].
Naming convention: When we refer to a `BaseDoc`, we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc].
When we refer to a `Document`, we refer to an instance of a `BaseDoc` class.

## Basic `Doc` usage

Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
see what it looks like in practice.

The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner.
The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner:

```python
from docarray import BaseDoc
@@ -29,7 +28,7 @@ class BannerDoc(BaseDoc):
description: str
```

You can then instantiate a `BannerDoc` object and access its attributes.
You can then instantiate a `BannerDoc` object and access its attributes:

```python
banner = BannerDoc(
@@ -45,39 +44,36 @@ assert banner.description == 'This is a banner'

## `BaseDoc` is a Pydantic `BaseModel`

The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from Pydantic [BaseModel](https://docs.pydantic.dev/usage/models). So you can use
all the features of `BaseModel` in your `Doc` class.

This means that `BaseDoc`:
The [BaseDoc][docarray.base_doc.doc.BaseDoc] class inherits from Pydantic [BaseModel](https://docs.pydantic.dev/usage/models). This means you can use
all the features of `BaseModel` in your `Doc` class. `BaseDoc`:

* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later. (TODO add typing section)
* Can be configured using a nested `Config` class, see Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on what kind of config pydantic offers.
* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic like [FastAPI]('https://fastapi.tiangolo.com/').
* Can be configured using a nested `Config` class. See the Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on the configuration options Pydantic offers.
* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI](https://fastapi.tiangolo.com/).

### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)
### How is `BaseDoc` different to Pydantic's `BaseModel`? (INCOMPLETE)

LINK TO THE VERSUS (not ready)

[BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models),
[BaseDoc][docarray.base_doc.doc.BaseDoc] is not just a [BaseModel](https://docs.pydantic.dev/usage/models):

* You can use it with DocArray [Typed](docarray.typing) that are oriented toward MultiModal (image, audio, ...) data and for
* You can use it with DocArray [types](docarray.typing) that are oriented toward multimodal (image, audio, etc.) data and
machine learning use cases. TODO link the type section.
* [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field (generated by default) to uniquely identify a Document.

Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field that is generated by default that is used to uniquely identify a Document.

## `BaseDoc` allows representing multimodal and nested data
## Representing multimodal and nested data

Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos.
A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple).

All of these elements are from different [`modalities`](../../data_types/first_steps.md): the title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
All of these elements are from different [`modalities`](../../data_types/first_steps.md): the title and description are text, the thumbnail is an image, and the video itself is, well, a video.

DocArray allows to represent all of this multimodal data in a single object.
DocArray lets you represent all of this multimodal data in a single object.

Let's first create an `BaseDoc` for each of the elements that compose the YouTube video.
Let's first create a `BaseDoc` for each of the elements that compose the YouTube video.

First for the thumbnail which is an image:
First for the thumbnail image:

```python
from docarray import BaseDoc
@@ -105,7 +101,7 @@ class VideoDoc(BaseDoc):
)
```

Then for the title and description (which are text) we will just use a `str` type.
Then for the title and description (which are text) we'll just use a `str` type.

All the elements that compose a YouTube video are ready:

@@ -120,21 +116,20 @@ class YouTubeVideoDoc(BaseDoc):
video: VideoDoc
```

You now have `YouTubeVideoDoc` which is a pythonic representation of a YouTube video.
We now have `YouTubeVideoDoc`, which is a Pythonic representation of a YouTube video.

This representation can now be used to [send](../sending/first_step.md) or to [store](../storing/first_step.md) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation.
This representation can be used to [send](../sending/first_step.md) or [store](../storing/first_step.md) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [PyTorch](https://pytorch.org/docs/stable/index.html) model on this representation.

!!! note

You see here that `ImageDoc` and `VideoDoc` are also [BaseDoc][docarray.base_doc.doc.BaseDoc], and they later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]`.
You see here that `ImageDoc` and `VideoDoc` are also [BaseDoc][docarray.base_doc.doc.BaseDoc] classes, and they are later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc].
This is what we call nested data representation.

[BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy.

See also:

* The [next section](./array.md) of the representing section
* API Reference for the [BaseDoc][docarray.base_doc.doc.BaseDoc] class
* The [next part](./array.md) of the representing section
* API reference for the [BaseDoc][docarray.base_doc.doc.BaseDoc] class
* The [Storing](../storing/first_step.md) section on how to store your data
* The [Sending](../sending/first_step.md) section on how to send your data