From ce3720085643a6c14e430dd009cdfd4c1a17a1b1 Mon Sep 17 00:00:00 2001
From: samsja
Date: Wed, 29 Mar 2023 15:03:16 +0200
Subject: [PATCH 01/11] docs: add user guide

Signed-off-by: samsja
---
 docs/user_guide/first_step.md | 71 +++++++++++++++++++++++++++++++++++
 docs/user_guide/intro.md | 43 +++++++++++++++++++++
 2 files changed, 114 insertions(+)

diff --git a/docs/user_guide/first_step.md b/docs/user_guide/first_step.md
index 0671e3a096a..f4850f11489 100644
--- a/docs/user_guide/first_step.md
+++ b/docs/user_guide/first_step.md
@@ -1 +1,72 @@
 # First Step : BaseDoc
+
+At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
+
+A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to [Pydantic](https://docs.pydantic.dev/)
+[`BaseModel`](https://docs.pydantic.dev/usage/models). It allows to define custom `Document` schema (or `Model` in
+the Pydantic world) to represent your data.
+
+## Basic `Doc` usage.
+
+Before going in detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
+take a look at how it looks like in practice.
+
+The following python code will define a `BannerDoc` class that will be used to represent banner data.
+
+```python
+from docarray import BaseDoc
+from docarray.typing import ImageUrl
+
+
+class BannerDoc(BaseDoc):
+    image_url: ImageUrl
+    title: str
+    description: str
+```
+
+you can then instantiate a `BannerDoc` object and access its attributes.
+
+```python
+banner = BannerDoc(
+    image_url="https://example.com/image.png",
+    title="Hello World",
+    description="This is a banner",
+)
+
+assert banner.image_url == "https://example.com/image.png"
+assert banner.title == "Hello World"
+assert banner.description == "This is a banner"
+```
+
+## `BaseDoc` allows to represent MultiModal and nested Data.
+
+more complex example
+
+
+## `BaseDoc` is a Pydantic `BaseModel`
+
+The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from pydantic [BaseModel](https://docs.pydantic.dev/usage/models) from Pydantic. So you can use
+all the features of `BaseModel` in your `Doc` class.
+
+This namely means that `BaseDoc`:
+
+* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
+  error. Data being "valid" is actually define by the type use in the docstring itself, but we will come back on this concept later (TODO add typing section)
+
+* Can be configured using a nested `Config` class, see pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more details on what kind of config Pydantic offer.
+
+* Can be used as a drop in replacement for `BaseModel` in your code and is compatible with tools using Pydantic like [FastAPI]('https://fastapi.tiangolo.com/').
+
+
+### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)
+
+here maybe need the link to the versus section
+
+[BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models),
+
+* it allows to be used with DocArray [Typed](docarray.typing) that are oriented toward MultiModal (image, audio, ...) data and for
+Machine Learning use case TODO link the type section.
+
+Another tiny difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a generated by default `id` field that is used to uniquely identify a document.
+
+
diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index c500c92629f..cb51945c154 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -1 +1,44 @@
 # User Guide - Intro
+
+This user guide show you how to use `DocArray` with most of its features, step by step.
+
+You wil first need to install `DocArray` in you python environment.
+## Install DocArray
+
+To install `DocArray` to follow this user guide, you can use the following command:
+
+```console
+pip install "docarray[full]"
+```
+
+This will install the main dependencies of `DocArray` and will work will all the modalities supported.
+
+
+!!! note
+    To install a very light version of `DocArray` with only the core dependencies, you can use the following command:
+    ```
+    pip install "docarray"
+    ```
+
+    If you want to install user protobuf with the minimal dependencies you can do
+
+    ```
+    pip install "docarray[common]"
+    ```
+
+Depending on your usage you might want to only use `DocArray` with only a couple of specific modalities.
+For instance lets say you only want to work with images, you can do install `DocArray` using the following command:
+
+```
+pip install "docarray[image]"
+```
+
+or with image and audio
+
+
+```
+pip install "docarray[image, audio]"
+```
+
+!!! warning
+    This way of installing `DocArray` is only valid starting with version `0.30`
\ No newline at end of file

From 7da84f4f51f762f7dad79ccafa1b46900e063085 Mon Sep 17 00:00:00 2001
From: samsja
Date: Wed, 29 Mar 2023 15:55:43 +0200
Subject: [PATCH 02/11] docs: add base docs docs

Signed-off-by: samsja
---
 docs/user_guide/first_step.md | 89 +++++++++++++++++++++++++++++++++--
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/docs/user_guide/first_step.md b/docs/user_guide/first_step.md
index f4850f11489..bc66baff83b 100644
--- a/docs/user_guide/first_step.md
+++ b/docs/user_guide/first_step.md
@@ -9,7 +9,7 @@ the Pydantic world) to represent your data.
 ## Basic `Doc` usage.
 
 Before going in detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
-take a look at how it looks like in practice.
+see how it looks like in practice.
 
 The following python code will define a `BannerDoc` class that will be used to represent banner data.
 
@@ -38,9 +38,7 @@ assert banner.title == "Hello World"
 assert banner.description == "This is a banner"
 ```
 
-## `BaseDoc` allows to represent MultiModal and nested Data.
 
-more complex example
 
 
 ## `BaseDoc` is a Pydantic `BaseModel`
@@ -60,13 +58,94 @@ This namely means that `BaseDoc`:
 
 ### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)
 
-here maybe need the link to the versus section
+LINK TO THE VERSUS (not ready)
 
 [BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models),
 
 * it allows to be used with DocArray [Typed](docarray.typing) that are oriented toward MultiModal (image, audio, ...) data and for
 Machine Learning use case TODO link the type section.
 
-Another tiny difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a generated by default `id` field that is used to uniquely identify a document.
+Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a generated by default `id` field that is used to uniquely identify a document.
+
+
+
+## `BaseDoc` allows to represent MultiModal and nested Data.
+
+Let's say you want to represent a Youtube video in your application. Maybe to build a search system of Youtube video.
+A Youtube video is not only composed of a video, but it also has a title, a description, a thumbnail (and more but let's keep it simple).
+
+All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready), title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
+
+DocArray allows to represent all of this Multi Modal data in a single object.
+
+Let's first create an `BaseDoc` for each of elements of that compose the Youtube video.
+
+First for the thumbnail which is an image
+```python
+from docarray import BaseDoc
+from docarray.typing import ImageUrl, ImageBytes
+
+
+class ImageDoc(BaseDoc):
+    url: ImageUrl
+    bytes: ImageBytes = (
+        None  # bytes are not always loaded in memory, so we make it optional
+    )
+```
+
+Then for the video which is a video
+```python
+from docarray import BaseDoc
+from docarray.typing import VideoUrl, VideoBytes
+
+
+class ImageDoc(BaseDoc):
+    url: VideoUrl
+    bytes: VideoBytes = (
+        None  # bytes are not always loaded in memory, so we make it optional
+    )
+```
+
+
+Then for the title and description which are text we will just use a `str` type.
+
+All the elements that compose a Youtube video are ready:
+
+```python
+from docarray import BaseDoc
+
+
+class YoutubeVideoDoc(BaseDoc):
+    title: str
+    description: str
+    thumbnail: ImageDoc
+    video: VideoDoc
+```
+
+
+You now hava `YoutubeVideoDoc` that is a pythonic representation of a Youtube video.
+
+This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation.
+
+
+!!! note
+
+    You see here that `ImageDoc` and `VideoDoc` are as well [BaseDoc][docarray.base_doc.doc.BaseDoc] that is later use inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]`.
+    This is what we call nested data representation.
+
+    [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy.
+
+
+
+
 
 
+See also:
+
+* [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference
+* DOCUMENT_ARARY REF
+* DOCUMENT INDEX REF
+* DOCUMENT STORE REF
+* ...
+
+See also
\ No newline at end of file

From 58679105c0bc7f26f3f21831542d0395bf305b3e Mon Sep 17 00:00:00 2001
From: samsja
Date: Wed, 29 Mar 2023 16:05:55 +0200
Subject: [PATCH 03/11] docs: add base docs docs

Signed-off-by: samsja
---
 docs/user_guide/intro.md | 10 +++++++++-
 docs/user_guide/{ => representing}/first_step.md | 4 ++--
 docs/user_guide/sending/first_step.md | 1 +
 docs/user_guide/storing/first_step.md | 1 +
 mkdocs.yml | 4 +++-
 5 files changed, 16 insertions(+), 4 deletions(-)
 rename docs/user_guide/{ => representing}/first_step.md (96%)
 create mode 100644 docs/user_guide/sending/first_step.md
 create mode 100644 docs/user_guide/storing/first_step.md

diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index cb51945c154..edd59f9663a 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -1,6 +1,14 @@
 # User Guide - Intro
 
-This user guide show you how to use `DocArray` with most of its features, step by step.
+This user guide show you how to use `DocArray` with most of its features.
+
+They are three main section:
+
+- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data.
+- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data.
+- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data.
+
+You should first start by reading the [Representing Data](representing/first_step.md) section and both the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) section can be read in any order.
 
 You wil first need to install `DocArray` in you python environment.
 ## Install DocArray
diff --git a/docs/user_guide/first_step.md b/docs/user_guide/representing/first_step.md
similarity index 96%
rename from docs/user_guide/first_step.md
rename to docs/user_guide/representing/first_step.md
index bc66baff83b..8fa16f6a0fc 100644
--- a/docs/user_guide/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -1,4 +1,4 @@
-# First Step : BaseDoc
+# Representing
 
 At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
 
@@ -125,7 +125,7 @@ class YoutubeVideoDoc(BaseDoc):
 
 You now hava `YoutubeVideoDoc` that is a pythonic representation of a Youtube video.
 
-This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation.
+This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation.
 
 
 !!! note
diff --git a/docs/user_guide/sending/first_step.md b/docs/user_guide/sending/first_step.md
new file mode 100644
index 00000000000..a18433535b9
--- /dev/null
+++ b/docs/user_guide/sending/first_step.md
@@ -0,0 +1 @@
+# Sending
diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md
new file mode 100644
index 00000000000..5be8b39165b
--- /dev/null
+++ b/docs/user_guide/storing/first_step.md
@@ -0,0 +1 @@
+# Storing
diff --git a/mkdocs.yml b/mkdocs.yml
index e7749bc2874..9e4209520ef 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -74,7 +74,9 @@ nav:
   - Home: README.md
   - Tutorial - User Guide:
     - user_guide/intro.md
-    - user_guide/first_step.md
+    - user_guide/representing/first_step.md
+    - user_guide/sending/first_step.md
+    - user_guide/storing/first_step.md
 
   - How-to:
     - how_to/add_doc_index.md

From 835ea7e87e59057541ae622f9a01b6c08a48f419 Mon Sep 17 00:00:00 2001
From: samsja
Date: Thu, 30 Mar 2023 11:11:06 +0200
Subject: [PATCH 04/11] fix: apply grammarly on frist step

Signed-off-by: samsja
---
 docs/user_guide/representing/first_step.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index 8fa16f6a0fc..2a93cec6032 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -49,11 +49,11 @@ all the features of `BaseModel` in your `Doc` class.
 This namely means that `BaseDoc`:
 
 * Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
-  error. Data being "valid" is actually define by the type use in the docstring itself, but we will come back on this concept later (TODO add typing section)
+  error. Data being "valid" is actually defined by the type used in the docstring itself, but we will come back to this concept later (TODO add typing section)
 
 * Can be configured using a nested `Config` class, see pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more details on what kind of config Pydantic offer.
 
-* Can be used as a drop in replacement for `BaseModel` in your code and is compatible with tools using Pydantic like [FastAPI]('https://fastapi.tiangolo.com/').
+* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools using Pydantic like [FastAPI]('https://fastapi.tiangolo.com/').
 
 
 ### What is the difference with Pydantic `BaseModel`? (INCOMPLETE)
@@ -71,14 +71,14 @@ Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a genera
 
 ## `BaseDoc` allows to represent MultiModal and nested Data.
 
-Let's say you want to represent a Youtube video in your application. Maybe to build a search system of Youtube video.
+Let's say you want to represent a Youtube video in your application. Maybe to build a search system for Youtube video.
 A Youtube video is not only composed of a video, but it also has a title, a description, a thumbnail (and more but let's keep it simple).
 
 All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready), title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
 
 DocArray allows to represent all of this Multi Modal data in a single object.
 
-Let's first create an `BaseDoc` for each of elements of that compose the Youtube video.
+Let's first create an `BaseDoc` for each of the elements that compose the Youtube video.
 
 First for the thumbnail which is an image
 ```python
@@ -123,14 +123,14 @@ class YoutubeVideoDoc(BaseDoc):
 ```
 
 
-You now hava `YoutubeVideoDoc` that is a pythonic representation of a Youtube video.
+You now have `YoutubeVideoDoc` which is a pythonic representation of a Youtube video.
 
 This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation.
 
 
 !!! note
 
-    You see here that `ImageDoc` and `VideoDoc` are as well [BaseDoc][docarray.base_doc.doc.BaseDoc] that is later use inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]`.
+    You see here that `ImageDoc` and `VideoDoc` are as well [BaseDoc][docarray.base_doc.doc.BaseDoc] that is later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]`.
     This is what we call nested data representation.
 
     [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy.

From 4dd920748099e26530f567cdab3a7fcf6de779b6 Mon Sep 17 00:00:00 2001
From: samsja
Date: Thu, 30 Mar 2023 11:12:14 +0200
Subject: [PATCH 05/11] fix: apply grammarly on isntall

Signed-off-by: samsja
---
 docs/user_guide/intro.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index edd59f9663a..146902dbcc8 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -1,14 +1,14 @@
 # User Guide - Intro
 
-This user guide show you how to use `DocArray` with most of its features.
+This user guide shows you how to use `DocArray` with most of its features.
 
-They are three main section:
+They are three main sections:
 
 - [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data.
 - [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data.
 - [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data.
 
-You should first start by reading the [Representing Data](representing/first_step.md) section and both the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) section can be read in any order.
+You should first start by reading the [Representing Data](representing/first_step.md) section and both the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
 
 You wil first need to install `DocArray` in you python environment.
 ## Install DocArray
@@ -28,14 +28,14 @@ This will install the main dependencies of `DocArray` and will work will all the
     pip install "docarray"
     ```
 
-    If you want to install user protobuf with the minimal dependencies you can do
+    If you want to install user protobuf with minimal dependencies you can do
 
     ```
     pip install "docarray[common]"
     ```
 
 Depending on your usage you might want to only use `DocArray` with only a couple of specific modalities.
-For instance lets say you only want to work with images, you can do install `DocArray` using the following command:
+For instance let's say you only want to work with images, you can install `DocArray` using the following command:
 
 ```
 pip install "docarray[image]"

From bfe447018d2985edea30b76481bc4bd1915eb7d8 Mon Sep 17 00:00:00 2001
From: samsja
Date: Fri, 31 Mar 2023 11:19:13 +0200
Subject: [PATCH 06/11] fix: apply johannes suggestion

Signed-off-by: samsja
---
 README.md | 4 ++--
 docs/user_guide/intro.md | 4 ++--
 pyproject.toml | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 52826b5cd8b..8d4b45ae264 100644
--- a/README.md
+++ b/README.md
@@ -482,13 +482,13 @@ INFO - docarray - HnswDocumentIndex[SimpleDoc] has been initialized
 To try out the alpha you can install it via git:
 
 ```shell
-pip install "git+https://github.com/docarray/docarray@2023.01.18.alpha#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/docarray@2023.01.18.alpha#egg=docarray[proto,torch,image]"
 ```
 
 ...or from the latest development branch
 
 ```shell
-pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[common,torch,image]"
+pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[proto,torch,image]"
 ```
 
 ## See also
diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index 146902dbcc8..bf3e14c1cba 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -28,10 +28,10 @@ This will install the main dependencies of `DocArray` and will work will all the
     pip install "docarray"
     ```
 
-    If you want to install user protobuf with minimal dependencies you can do
+    If you want to use protobuf and DocArray you can do
 
     ```
-    pip install "docarray[common]"
+    pip install "docarray[proto]"
     ```
 
 Depending on your usage you might want to only use `DocArray` with only a couple of specific modalities.
diff --git a/pyproject.toml b/pyproject.toml
index 3114ff8dc61..6982b351b47 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -29,7 +29,7 @@ smart-open = {version = ">=6.3.0", extras = ["s3"], optional = true}
 jina-hubble-sdk = {version = ">=0.34.0", optional = true}
 
 [tool.poetry.extras]
-common = ["protobuf", "lz4"]
+proto = ["protobuf", "lz4"]
 pandas = ["pandas"]
 image = ["pillow", "types-pillow"]
 video = ["av"]

From 2a3702149bb099f520e8a5d5d43b333e246bbf9d Mon Sep 17 00:00:00 2001
From: samsja
Date: Fri, 31 Mar 2023 11:25:56 +0200
Subject: [PATCH 07/11] fix: poetry lock

Signed-off-by: samsja
---
 poetry.lock | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/poetry.lock b/poetry.lock
index a9bc680af7f..12dd4370927 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -4590,7 +4590,6 @@ testing = ["flake8 (<5)", "func-timeout", "jaraco.functools", "jaraco.itertools"
 [extras]
 audio = ["pydub"]
 aws = ["smart-open"]
-common = ["protobuf", "lz4"]
 elasticsearch = ["elasticsearch"]
 full = ["protobuf", "lz4", "pandas", "pillow", "types-pillow", "av", "pydub", "trimesh"]
 hnswlib = ["hnswlib"]
@@ -4598,6 +4597,7 @@ image = ["pillow", "types-pillow"]
 jac = ["jina-hubble-sdk"]
 mesh = ["trimesh"]
 pandas = ["pandas"]
+proto = ["protobuf", "lz4"]
 torch = ["torch"]
 video = ["av"]
 web = ["fastapi"]
@@ -4605,4 +4605,4 @@ web = ["fastapi"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.7,<4.0"
-content-hash = "821f6cd00f78c456f6146f39c14f0704e4f2d113c35db00c58462d8cfbe3a538"
+content-hash = "dd56d7cfa5b6758063baba58a5259f06535e0f425366360d042836aa479eab15"

From b27e09a4c26d53ae33de9a4efa92aeab20327b89 Mon Sep 17 00:00:00 2001
From: samsja <55492238+samsja@users.noreply.github.com>
Date: Fri, 31 Mar 2023 11:50:53 +0200
Subject: [PATCH 08/11] feat: apply johannes suggestion

Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com>
Signed-off-by: samsja <55492238+samsja@users.noreply.github.com>
---
 docs/user_guide/intro.md | 16 +++++++--------
 docs/user_guide/representing/first_step.md | 24 +++++++++++-----------
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index bf3e14c1cba..9597bcd686d 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -2,15 +2,15 @@
 
 This user guide shows you how to use `DocArray` with most of its features.
 
-They are three main sections:
+There are three main sections:
 
-- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data.
-- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data.
-- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data.
+- [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "pydantic for ML".
+- [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
+- [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases".
 
 You should first start by reading the [Representing Data](representing/first_step.md) section and both the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
 
-You wil first need to install `DocArray` in you python environment.
+You will first need to install `DocArray` in your Python environment.
 ## Install DocArray
 
 To install `DocArray` to follow this user guide, you can use the following command:
@@ -19,7 +19,7 @@ To install `DocArray` to follow this user guide, you can use the following comma
 pip install "docarray[full]"
 ```
 
-This will install the main dependencies of `DocArray` and will work will all the modalities supported.
+This will install the main dependencies of `DocArray` and will work will all the supported data modalities.
 
 
 !!! note
@@ -34,8 +34,8 @@ This will install the main dependencies of `DocArray` and will work will all the
     pip install "docarray[proto]"
     ```
 
-Depending on your usage you might want to only use `DocArray` with only a couple of specific modalities.
-For instance let's say you only want to work with images, you can install `DocArray` using the following command:
+Depending on your usage you might want to use `DocArray` with only a couple of specific modalities and their dependencies.
+For instance, let's say you only want to work with images, you can install `DocArray` using the following command:
 
 ```
 pip install "docarray[image]"
diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index 2a93cec6032..c65e66fa976 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -2,16 +2,16 @@
 
 At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
 
-A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to [Pydantic](https://docs.pydantic.dev/)
-[`BaseModel`](https://docs.pydantic.dev/usage/models). It allows to define custom `Document` schema (or `Model` in
+A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/)
+[`BaseModel`](https://docs.pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
 the Pydantic world) to represent your data.
 
 ## Basic `Doc` usage.
 
 Before going in detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
-see how it looks like in practice.
+see what it looks like in practice.
 
-The following python code will define a `BannerDoc` class that will be used to represent banner data.
+The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner.
 
 ```python
 from docarray import BaseDoc
@@ -24,7 +24,7 @@ class BannerDoc(BaseDoc):
     description: str
 ```
 
-you can then instantiate a `BannerDoc` object and access its attributes.
+You can then instantiate a `BannerDoc` object and access its attributes.
 
 ```python
 banner = BannerDoc(
@@ -43,13 +43,13 @@ assert banner.description == "This is a banner"
 
 ## `BaseDoc` is a Pydantic `BaseModel`
 
-The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from pydantic [BaseModel](https://docs.pydantic.dev/usage/models) from Pydantic. So you can use
+The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from pydantic [BaseModel](https://docs.pydantic.dev/usage/models). So you can use
 all the features of `BaseModel` in your `Doc` class.
 
 This namely means that `BaseDoc`:
 
 * Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an
-  error. Data being "valid" is actually defined by the type used in the docstring itself, but we will come back to this concept later (TODO add typing section)
+  error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later (TODO add typing section)
 
 * Can be configured using a nested `Config` class, see pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more details on what kind of config Pydantic offer.
 
@@ -69,14 +69,14 @@ Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a genera
 
 
 
-## `BaseDoc` allows to represent MultiModal and nested Data.
+## `BaseDoc` allows to represent multimodal and nested data.
 
-Let's say you want to represent a Youtube video in your application. Maybe to build a search system for Youtube video.
-A Youtube video is not only composed of a video, but it also has a title, a description, a thumbnail (and more but let's keep it simple).
+Let's say you want to represent a Youtube video in your application, perhaps to build a search system for Youtube videos.
+A Youtube video is not only composed of a video, but it also has a title, a description, a thumbnail (and more, but let's keep it simple).
 
-All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready), title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
+All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): title and description are text, the thumbnail is an image, and the video in itself is, well, a video.
 
-DocArray allows to represent all of this Multi Modal data in a single object.
+DocArray allows to represent all of this multimodal data in a single object.
 
 Let's first create an `BaseDoc` for each of the elements that compose the Youtube video.
 

From 1a66cff7007689423ad9a32e5a6bbd57def168ea Mon Sep 17 00:00:00 2001
From: samsja
Date: Fri, 31 Mar 2023 12:27:06 +0200
Subject: [PATCH 09/11] fix: fix name

Signed-off-by: samsja
---
 docs/user_guide/representing/first_step.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index c65e66fa976..8de8ac00051 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -99,7 +99,7 @@ from docarray import BaseDoc
 from docarray.typing import VideoUrl, VideoBytes
 
 
-class ImageDoc(BaseDoc):
+class VideoDoc(BaseDoc):
     url: VideoUrl
     bytes: VideoBytes = (
         None  # bytes are not always loaded in memory, so we make it optional

From 09a75e6752d5441226dea96684990baa1b8d407b Mon Sep 17 00:00:00 2001
From: Alex C-G
Date: Fri, 31 Mar 2023 12:31:08 +0200
Subject: [PATCH 10/11] docs: tidy up wording

Signed-off-by: Alex C-G
---
 docs/user_guide/intro.md | 13 ++--
 docs/user_guide/representing/first_step.md | 76 +++++++++-------------
 2 files changed, 36 insertions(+), 53 deletions(-)

diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md
index 9597bcd686d..084805bddd2 100644
--- a/docs/user_guide/intro.md
+++ b/docs/user_guide/intro.md
@@ -1,4 +1,4 @@
-# User Guide - Intro
+# User Guide - Introduction
 
 This user guide shows you how to use `DocArray` with most of its features.
 
@@ -8,9 +8,10 @@ There are three main sections:
 - [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI.
 - [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases".
 
-You should first start by reading the [Representing Data](representing/first_step.md) section and both the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
+You should start by reading the [Representing Data](representing/first_step.md) section, and then the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order.
 
 You will first need to install `DocArray` in your Python environment.
+
 ## Install DocArray
 
 To install `DocArray` to follow this user guide, you can use the following command:
@@ -21,14 +22,13 @@ pip install "docarray[full]"
 
 This will install the main dependencies of `DocArray` and will work will all the supported data modalities.
 
-
 !!! note
     To install a very light version of `DocArray` with only the core dependencies, you can use the following command:
     ```
     pip install "docarray"
     ```
 
-    If you want to use protobuf and DocArray you can do
+    If you want to use protobuf and DocArray you can run:
 
     ```
     pip install "docarray[proto]"
@@ -41,12 +41,11 @@ For instance, let's say you only want to work with images, you can install `DocA
 pip install "docarray[image]"
 ```
 
-or with image and audio
-
+...or with images and audio:
 
 ```
 pip install "docarray[image, audio]"
 ```
 
 !!! warning
-    This way of installing `DocArray` is only valid starting with version `0.30`
\ No newline at end of file
+    This way of installing `DocArray` is only valid starting with version `0.30`
diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index 8de8ac00051..c20b0dc553f 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -3,12 +3,12 @@
 At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc].
 
 A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/)
-[`BaseModel`](https://docs.pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
+[`BaseModel`](https://docs.Pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in
 the Pydantic world) to represent your data.
 
 ## Basic `Doc` usage.
 
-Before going in detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
+Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's
 see what it looks like in practice.
 
 The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner.
@@ -28,33 +28,27 @@ You can then instantiate a `BannerDoc` object and access its attributes.
 
```python banner = BannerDoc( - image_url="https://example.com/image.png", - title="Hello World", - description="This is a banner", + image_url='https://example.com/image.png', + title='Hello World', + description='This is a banner', ) -assert banner.image_url == "https://example.com/image.png" -assert banner.title == "Hello World" -assert banner.description == "This is a banner" +assert banner.image_url == 'https://example.com/image.png' +assert banner.title == 'Hello World' +assert banner.description == 'This is a banner' ``` - - - ## `BaseDoc` is a Pydantic `BaseModel` -The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from pydantic [BaseModel](https://docs.pydantic.dev/usage/models). So you can use +The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from Pydantic [BaseModel](https://docs.pydantic.dev/usage/models). So you can use all the features of `BaseModel` in your `Doc` class. -This namely means that `BaseDoc`: - -* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an - error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later (TODO add typing section) - -* Can be configured using a nested `Config` class, see pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more details on what kind of config Pydantic offer. - -* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools using Pydantic like [FastAPI]('https://fastapi.tiangolo.com/'). +This means that `BaseDoc`: +* Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an +error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later. 
(TODO add typing section) +* Can be configured using a nested `Config` class, see Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on what kind of config Pydantic offers. +* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic like [FastAPI](https://fastapi.tiangolo.com/). ### What is the difference with Pydantic `BaseModel`? (INCOMPLETE) @@ -62,25 +56,24 @@ LINK TO THE VERSUS (not ready) [BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models), -* it allows to be used with DocArray [Typed](docarray.typing) that are oriented toward MultiModal (image, audio, ...) data and for +* You can use it with DocArray [types](docarray.typing) that are oriented toward MultiModal (image, audio, ...) data and for Machine Learning use case TODO link the type section. -Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has a generated by default `id` field that is used to uniquely identify a document. - +Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field, generated by default, that is used to uniquely identify a Document. +## `BaseDoc` allows representing multimodal and nested data -## `BaseDoc` allows to represent multimodal and nested data. +Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos. +A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple). -Let's say you want to represent a Youtube video in your application, perhaps to build a search system for Youtube videos. -A Youtube video is not only composed of a video, but it also has a title, a description, a thumbnail (and more, but let's keep it simple).
- -All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): title and description are text, the thumbnail is an image, and the video in itself is, well, a video. +All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): the title and description are text, the thumbnail is an image, and the video in itself is, well, a video. DocArray allows to represent all of this multimodal data in a single object. -Let's first create an `BaseDoc` for each of the elements that compose the Youtube video. +Let's first create a `BaseDoc` for each of the elements that compose the YouTube video. + +First for the thumbnail, which is an image: -First for the thumbnail which is an image ```python from docarray import BaseDoc from docarray.typing import ImageUrl, ImageBytes @@ -93,7 +86,8 @@ class ImageDoc(BaseDoc): ) ``` -Then for the video which is a video +Then for the video itself: + ```python from docarray import BaseDoc from docarray.typing import VideoUrl, VideoBytes @@ -106,37 +100,31 @@ class VideoDoc(BaseDoc): ) ``` +Then for the title and description (which are text) we will just use a `str` type. -Then for the title and description which are text we will just use a `str` type. - -All the elements that compose a Youtube video are ready: +All the elements that compose a YouTube video are ready: ```python from docarray import BaseDoc -class YoutubeVideoDoc(BaseDoc): +class YouTubeVideoDoc(BaseDoc): title: str description: str thumbnail: ImageDoc video: VideoDoc ``` - -You now have `YoutubeVideoDoc` which is a pythonic representation of a Youtube video. +You now have `YouTubeVideoDoc`, which is a Pythonic representation of a YouTube video. This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation. - !!!
note - You see here that `ImageDoc` and `VideoDoc` are as well [BaseDoc][docarray.base_doc.doc.BaseDoc] that is later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]`. + You see here that `ImageDoc` and `VideoDoc` are also [BaseDoc][docarray.base_doc.doc.BaseDoc], and they are later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]. This is what we call nested data representation. [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy. - - - See also: @@ -145,7 +133,3 @@ See also: * DOCUMENT INDEX REF * DOCUMENT STORE REF * ... - - - -See also \ No newline at end of file From 75624d0b77b32d530abd180c66de6f066dcb2d3a Mon Sep 17 00:00:00 2001 From: samsja <55492238+samsja@users.noreply.github.com> Date: Mon, 3 Apr 2023 08:59:02 +0200 Subject: [PATCH 11/11] feat: apply saba suggestion Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> Signed-off-by: samsja <55492238+samsja@users.noreply.github.com> --- docs/user_guide/intro.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/intro.md b/docs/user_guide/intro.md index 084805bddd2..5c9fbb14d1f 100644 --- a/docs/user_guide/intro.md +++ b/docs/user_guide/intro.md @@ -14,13 +14,13 @@ You will first need to install `DocArray` in your Python environment. ## Install DocArray -To install `DocArray` to follow this user guide, you can use the following command: +To install `DocArray`, you can use the following command: ```console pip install "docarray[full]" ``` -This will install the main dependencies of `DocArray` and will work will all the supported data modalities. +This will install the main dependencies of `DocArray` and will work with all the supported data modalities. !!!
note To install a very light version of `DocArray` with only the core dependencies, you can use the following command: @@ -28,7 +28,7 @@ This will install the main dependencies of `DocArray` and will work will all the pip install "docarray" ``` - If you want to use protobuf and DocArray you can run: + If you want to use `protobuf` and `DocArray`, you can run: ``` pip install "docarray[proto]"