diff --git a/README.md b/README.md
index 52826b5cd8b..8d4b45ae264 100644
--- a/README.md
+++ b/README.md
@@ -482,13 +482,13 @@ INFO - docarray - HnswDocumentIndex[SimpleDoc] has been initialized
 To try out the alpha you can install it via git:
 
 ```shell
-pip install "git+https://github.com/docarray/docarray@2023.01.18.alpha#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/docarray@2023.01.18.alpha#egg=docarray[proto,torch,image]"
 ```
 
 ...or from the latest development branch
 
 ```shell
-pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[proto,torch,image]"
 ```
 
 ## See also
diff --git a/docs/user_guide/first_step.md b/docs/user_guide/first_step.md
deleted file mode 100644
index 0671e3a096a..00000000000
--- a/docs/user_guide/first_step.md
+++ /dev/null
@@ -1 +0,0 @@
-# First Step : BaseDoc
diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index c500c92629f..5c9fbb14d1f 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -1 +1,51 @@
-# User Guide - Intro
+# User Guide - Introduction
+
+This user guide shows you how to use `DocArray` with most of its features.
+
+There are three main sections:
+
+- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "pydantic for ML".
+- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
+- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases".
+
+You should start by reading the [Representing Data](representing/first_step.md) section, and then the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
+
+You will first need to install `DocArray` in your Python environment.
+
+## Install DocArray
+
+To install `DocArray`, you can use the following command:
+
+```console
+pip install "docarray[full]"
+```
+
+This will install the main dependencies of `DocArray` and will work with all the supported data modalities.
+
+!!! note
+    To install a very light version of `DocArray` with only the core dependencies, you can use the following command:
+    ```
+    pip install "docarray"
+    ```
+
+    If you want to use `protobuf` and `DocArray`, you can run:
+
+    ```
+    pip install "docarray[proto]"
+    ```
+
+Depending on your usage, you might want to use `DocArray` with only a couple of specific modalities and their dependencies.
+For instance, if you only want to work with images, you can install `DocArray` using the following command:
+
+```
+pip install "docarray[image]"
+```
+
+...or with images and audio:
+
+```
+pip install "docarray[image, audio]"
+```
+
+!!! warning
+    This way of installing `DocArray` is only valid starting with version `0.30`
diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
new file mode 100644
index 00000000000..c20b0dc553f
--- /dev/null
+++ b/docs/user_guide/representing/first_step.md
@@ -0,0 +1,135 @@
+# Representing
+
+At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
+
+A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/)
+[`BaseModel`](https://docs.pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
+the Pydantic world) to represent your data.
+
+## Basic `Doc` usage
+
+Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
+see what it looks like in practice.
+
+The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner.
+
+```python
+from docarray import BaseDoc
+from docarray.typing import ImageUrl
+
+
+class BannerDoc(BaseDoc):
+    image_url: ImageUrl
+    title: str
+    description: str
+```
+
+You can then instantiate a `BannerDoc` object and access its attributes.
+
+```python
+banner = BannerDoc(
+    image_url='https://example.com/image.png',
+    title='Hello World',
+    description='This is a banner',
+)
+
+assert banner.image_url == 'https://example.com/image.png'
+assert banner.title == 'Hello World'
+assert banner.description == 'This is a banner'
+```
+
+## `BaseDoc` is a Pydantic `BaseModel`
+
+The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from Pydantic [BaseModel](https://docs.pydantic.dev/usage/models), so you can use
+all the features of `BaseModel` in your `Doc` class.
+
+This means that `BaseDoc`:
+
+* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
+error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later. (TODO add typing section)
+* Can be configured using a nested `Config` class; see the Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on the configuration options Pydantic offers.
+* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI](https://fastapi.tiangolo.com/).
+
+### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)
+
+LINK TO THE VERSUS (not ready)
+
+[BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models):
+
+* You can use it with DocArray [types](docarray.typing) that are oriented toward multimodal (image, audio, ...) data and
+Machine Learning use cases. TODO link the type section.
+
+Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field, generated by default, that uniquely identifies a Document.
+
+## `BaseDoc` allows representing multimodal and nested data
+
+Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos.
+A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple).
+
+All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): the title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
+
+DocArray allows you to represent all of this multimodal data in a single object.
+
+Let's first create a `BaseDoc` for each of the elements that compose the YouTube video.
+
+First, for the thumbnail, which is an image:
+
+```python
+from docarray import BaseDoc
+from docarray.typing import ImageUrl, ImageBytes
+
+
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    bytes: ImageBytes = (
+        None  # bytes are not always loaded in memory, so we make it optional
+    )
+```
+
+Then for the video itself:
+
+```python
+from docarray import BaseDoc
+from docarray.typing import VideoUrl, VideoBytes
+
+
+class VideoDoc(BaseDoc):
+    url: VideoUrl
+    bytes: VideoBytes = (
+        None  # bytes are not always loaded in memory, so we make it optional
+    )
+```
+
+Then for the title and description (which are text) we will just use a `str` type.
+
+All the elements that compose a YouTube video are ready:
+
+```python
+from docarray import BaseDoc
+
+
+class YouTubeVideoDoc(BaseDoc):
+    title: str
+    description: str
+    thumbnail: ImageDoc
+    video: VideoDoc
+```
+
+You now have `YouTubeVideoDoc`, which is a Pythonic representation of a YouTube video.
+
+This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [PyTorch](https://pytorch.org/docs/stable/index.html) model on this representation.
+
+!!! note
+
+    You see here that `ImageDoc` and `VideoDoc` are also [BaseDoc][docarray.base_doc.doc.BaseDoc] classes, and they are later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc].
+    This is what we call nested data representation.
+
+    [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy.
+
+See also:
+
+* [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference
+* DOCUMENT_ARRAY REF
+* DOCUMENT INDEX REF
+* DOCUMENT STORE REF
+* ...
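The nesting and the default `id` field described in the added `representing/first_step.md` page can be sketched without installing DocArray. The block below is only a stand-in built on standard-library dataclasses, not DocArray's implementation: the `_new_id` helper, the `bytes_` field name, and the example URLs are assumptions made for illustration, and none of DocArray's validation is reproduced.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


def _new_id() -> str:
    # Hypothetical helper: any unique string would do for this sketch.
    return uuid4().hex


@dataclass
class ImageDoc:
    url: str
    bytes_: Optional[bytes] = None  # payload is not always loaded in memory
    id: str = field(default_factory=_new_id)


@dataclass
class VideoDoc:
    url: str
    bytes_: Optional[bytes] = None  # payload is not always loaded in memory
    id: str = field(default_factory=_new_id)


@dataclass
class YouTubeVideoDoc:
    title: str
    description: str
    thumbnail: ImageDoc  # nested document
    video: VideoDoc  # nested document
    id: str = field(default_factory=_new_id)


doc = YouTubeVideoDoc(
    title='Hello World',
    description='A short demo video',
    thumbnail=ImageDoc(url='https://example.com/thumb.png'),
    video=VideoDoc(url='https://example.com/video.mp4'),
)

# Nested documents are reached with plain attribute access.
assert doc.thumbnail.url == 'https://example.com/thumb.png'
# Every document, nested or not, received its own identifier.
assert len({doc.id, doc.thumbnail.id, doc.video.id}) == 3
```

With DocArray installed, the classes from the page itself would presumably be instantiated the same way, by passing `ImageDoc(...)` and `VideoDoc(...)` instances as the `thumbnail` and `video` arguments.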
diff --git a/docs/user_guide/sending/first_step.md b/docs/user_guide/sending/first_step.md
new file mode 100644
index 00000000000..a18433535b9
--- /dev/null
+++ b/docs/user_guide/sending/first_step.md
@@ -0,0 +1 @@
+# Sending
diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md
new file mode 100644
index 00000000000..5be8b39165b
--- /dev/null
+++ b/docs/user_guide/storing/first_step.md
@@ -0,0 +1 @@
+# Storing
diff --git a/mkdocs.yml b/mkdocs.yml
index e7749bc2874..9e4209520ef 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -74,7 +74,9 @@ nav:
   - Home: README.md
   - Tutorial - User Guide:
     - user_guide/intro.md
-    - user_guide/first_step.md
+    - user_guide/representing/first_step.md
+    - user_guide/sending/first_step.md
+    - user_guide/storing/first_step.md
   - How-to:
     - how_to/add_doc_index.md
 
diff --git a/poetry.lock b/poetry.lock
index a9bc680af7f..12dd4370927 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -4590,7 +4590,6 @@ testing = ["flake8 (<5)", "func-timeout", "jaraco.functools", "jaraco.itertools"
 [extras]
 audio = ["pydub"]
 aws = ["smart-open"]
-common = ["protobuf", "lz4"]
 elasticsearch = ["elasticsearch"]
 full = ["protobuf", "lz4", "pandas", "pillow", "types-pillow", "av", "pydub", "trimesh"]
 hnswlib = ["hnswlib"]
@@ -4598,6 +4597,7 @@ image = ["pillow", "types-pillow"]
 jac = ["jina-hubble-sdk"]
 mesh = ["trimesh"]
 pandas = ["pandas"]
+proto = ["protobuf", "lz4"]
 torch = ["torch"]
 video = ["av"]
 web = ["fastapi"]
@@ -4605,4 +4605,4 @@ web = ["fastapi"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.7,<4.0"
-content-hash = "821f6cd00f78c456f6146f39c14f0704e4f2d113c35db00c58462d8cfbe3a538"
+content-hash = "dd56d7cfa5b6758063baba58a5259f06535e0f425366360d042836aa479eab15"
diff --git a/pyproject.toml b/pyproject.toml
index 3114ff8dc61..6982b351b47 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -29,7 +29,7 @@ smart-open = {version = ">=6.3.0", extras = ["s3"], optional = true}
 jina-hubble-sdk = {version = ">=0.34.0", optional = true}
 
 [tool.poetry.extras]
-common = ["protobuf", "lz4"]
+proto = ["protobuf", "lz4"]
 pandas = ["pandas"]
 image = ["pillow", "types-pillow"]
 video = ["av"]