4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -482,13 +482,13 @@ INFO - docarray - HnswDocumentIndex[SimpleDoc] has been initialized
To try out the alpha you can install it via git:

```shell
-pip install "git+https://github.com/docarray/[email protected]#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/[email protected]#egg=docarray[proto,torch,image]"
```

...or from the latest development branch

```shell
-pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[proto,torch,image]"
```

## See also
1 change: 0 additions & 1 deletion docs/user_guide/first_step.md

This file was deleted.

52 changes: 51 additions & 1 deletion docs/user_guide/intro.md
@@ -1 +1,51 @@
-# User Guide - Intro
+# User Guide - Introduction

This user guide shows you how to use `DocArray` with most of its features.

There are three main sections:

- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "pydantic for ML".
- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases".

Start with the [Representing Data](representing/first_step.md) section; the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can then be read in any order.

You will first need to install `DocArray` in your Python environment.

## Install DocArray

To install `DocArray`, you can use the following command:

```console
pip install "docarray[full]"
```

This will install the main dependencies of `DocArray` and will work with all the supported data modalities.

!!! note
    To install a very light version of `DocArray` with only the core dependencies, you can use the following command:
    ```
    pip install "docarray"
    ```

If you want to use `protobuf` and `DocArray`, you can run:

```
pip install "docarray[proto]"
```

Depending on your usage you might want to use `DocArray` with only a couple of specific modalities and their dependencies.
For instance, if you only want to work with images, you can install `DocArray` using the following command:

```
pip install "docarray[image]"
```

...or with images and audio:

```
pip install "docarray[image, audio]"
```

!!! warning
    This way of installing `DocArray` is only valid starting with version `0.30`.
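
Because the extras-based install only applies from version `0.30`, it can help to check what is installed before relying on it. The following is a minimal sketch using only the Python standard library. `installed_version` and `supports_extras` are hypothetical helper names, not part of `DocArray`, and the parser assumes a version string that starts with `major.minor`.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def supports_extras(ver: str) -> bool:
    """Assumption: `ver` starts with 'major.minor'; 0.30 is the cutoff named in the warning."""
    match = re.match(r"(\d+)\.(\d+)", ver)
    if match is None:
        return False  # unrecognised version format, be conservative
    return (int(match.group(1)), int(match.group(2))) >= (0, 30)


found = installed_version("docarray")
if found is None:
    print("docarray is not installed")
else:
    print(f"docarray {found}; extras-based install applies: {supports_extras(found)}")
```

Comparing `(major, minor)` as a tuple of integers avoids the string-comparison trap where `"0.9"` would sort after `"0.30"`.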
135 changes: 135 additions & 0 deletions docs/user_guide/representing/first_step.md
@@ -0,0 +1,135 @@
# Representing

At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].

A [`BaseDoc`][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/)
[`BaseModel`](https://docs.pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
the Pydantic world) to represent your data.

## Basic `Doc` usage

Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
see what it looks like in practice.

The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner.

```python
from docarray import BaseDoc
from docarray.typing import ImageUrl


class BannerDoc(BaseDoc):
    image_url: ImageUrl
    title: str
    description: str
```

You can then instantiate a `BannerDoc` object and access its attributes.

```python
banner = BannerDoc(
    image_url='https://example.com/image.png',
    title='Hello World',
    description='This is a banner',
)

assert banner.image_url == 'https://example.com/image.png'
assert banner.title == 'Hello World'
assert banner.description == 'This is a banner'
```

## `BaseDoc` is a Pydantic `BaseModel`

The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from Pydantic's [BaseModel](https://docs.pydantic.dev/usage/models), so you can use
all the features of `BaseModel` in your `Doc` class.

This means that `BaseDoc`:

* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
error. What counts as "valid" is defined by the type used in the type hint itself, but we will come back to this concept later. (TODO add typing section)
* Can be configured using a nested `Config` class; see the Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on the configuration options Pydantic offers.
* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI](https://fastapi.tiangolo.com/).

Review comment by @jupyterjazz (Contributor, Apr 3, 2023): Just a note to reference the FastAPI integration page here, which I'm currently working on, for whoever merges second.


### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)

LINK TO THE VERSUS (not ready)

[BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models):

* You can use it with DocArray [types](docarray.typing) that are oriented toward multimodal (image, audio, ...) data and
machine learning use cases. TODO link the type section.

Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field, generated by default, that uniquely identifies a Document.

## `BaseDoc` allows representing multimodal and nested data

Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos.
A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple).

All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): the title and description are text, the thumbnail is an image, and the video itself is, well, a video.

DocArray allows you to represent all of this multimodal data in a single object.

Let's first create a `BaseDoc` for each of the elements that compose the YouTube video.

First, for the thumbnail, which is an image:

```python
from docarray import BaseDoc
from docarray.typing import ImageUrl, ImageBytes


class ImageDoc(BaseDoc):
    url: ImageUrl
    bytes: ImageBytes = (
        None  # bytes are not always loaded in memory, so we make it optional
    )
```

Then for the video itself:

```python
from docarray import BaseDoc
from docarray.typing import VideoUrl, VideoBytes


class VideoDoc(BaseDoc):
    url: VideoUrl
    bytes: VideoBytes = (
        None  # bytes are not always loaded in memory, so we make it optional
    )
```

Review thread on `bytes: VideoBytes = (`:

- Member: same here
- Member Author: No, if you put `= None` it is optional by default.
- Member Author: Do you think it is more explicit to have `Optional`?

Then, for the title and description (which are text), we will just use the `str` type.

All the elements that compose a YouTube video are ready:

```python
from docarray import BaseDoc


class YouTubeVideoDoc(BaseDoc):
    title: str
    description: str
    thumbnail: ImageDoc
    video: VideoDoc
```

You now have `YouTubeVideoDoc`, which is a Pythonic representation of a YouTube video.

This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [PyTorch](https://pytorch.org/docs/stable/index.html) model on this representation.

!!! note

    You can see here that `ImageDoc` and `VideoDoc` are themselves [BaseDoc][docarray.base_doc.doc.BaseDoc] classes, and that they are later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc].
    This is what we call nested data representation.

    [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy.

See also:

* [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference
* DOCUMENT_ARRAY REF
* DOCUMENT INDEX REF
* DOCUMENT STORE REF
* ...
1 change: 1 addition & 0 deletions docs/user_guide/sending/first_step.md
@@ -0,0 +1 @@
# Sending
1 change: 1 addition & 0 deletions docs/user_guide/storing/first_step.md
@@ -0,0 +1 @@
# Storing
4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -74,7 +74,9 @@ nav:
   - Home: README.md
   - Tutorial - User Guide:
     - user_guide/intro.md
-    - user_guide/first_step.md
+    - user_guide/representing/first_step.md
+    - user_guide/sending/first_step.md
+    - user_guide/storing/first_step.md

   - How-to:
     - how_to/add_doc_index.md
4 changes: 2 additions & 2 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -29,7 +29,7 @@ smart-open = {version = ">=6.3.0", extras = ["s3"], optional = true}
jina-hubble-sdk = {version = ">=0.34.0", optional = true}

[tool.poetry.extras]
-common = ["protobuf", "lz4"]
+proto = ["protobuf", "lz4"]
pandas = ["pandas"]
image = ["pillow", "types-pillow"]
video = ["av"]