docs: add user guide #1292
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -482,13 +482,13 @@ INFO - docarray - HnswDocumentIndex[SimpleDoc] has been initialized | |
| To try out the alpha you can install it via git: | ||
|
|
||
| ```shell | ||
| pip install "git+https://github.com/docarray/[email protected]#egg=docarray[common,torch,image]" | ||
| pip install "git+https://github.com/docarray/[email protected]#egg=docarray[proto,torch,image]" | ||
| ``` | ||
|
|
||
| ...or from the latest development branch | ||
|
|
||
| ```shell | ||
| pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[common,torch,image]" | ||
| pip install "git+https://github.com/docarray/docarray@feat-rewrite-v2#egg=docarray[proto,torch,image]" | ||
| ``` | ||
|
|
||
| ## See also | ||
|
|
||
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1,51 @@ | ||
| # User Guide - Intro | ||
| # User Guide - Introduction | ||
|
|
||
| This user guide shows you how to use `DocArray` with most of its features. | ||
|
|
||
| There are three main sections: | ||
|
|
||
| - [Representing Data](representing/first_step.md): This section will show you how to use `DocArray` to represent your data. This is a great starting point if you want to better organize the data in your ML models, or if you are looking for a "pydantic for ML". | ||
| - [Sending Data](sending/first_step.md): This section will show you how to use `DocArray` to send your data. This is a great starting point if you want to serve your ML model, for example through FastAPI. | ||
| - [Storing Data](storing/first_step.md): This section will show you how to use `DocArray` to store your data. This is a great starting point if you are looking for an "ORM for vector databases". | ||
|
|
||
| You should start by reading the [Representing Data](representing/first_step.md) section, and then the [Sending Data](sending/first_step.md) and [Storing Data](storing/first_step.md) sections can be read in any order. | ||
|
|
||
| You will first need to install `DocArray` in your Python environment. | ||
|
|
||
| ## Install DocArray | ||
|
|
||
| To install `DocArray`, you can use the following command: | ||
|
|
||
| ```console | ||
| pip install "docarray[full]" | ||
| ``` | ||
|
|
||
| This will install the main dependencies of `DocArray` and will work with all the supported data modalities. | ||
|
|
||
| !!! note | ||
| To install a very light version of `DocArray` with only the core dependencies, you can use the following command: | ||
| ``` | ||
| pip install "docarray" | ||
| ``` | ||
|
|
||
| If you want to use `protobuf` and `DocArray`, you can run: | ||
|
|
||
| ``` | ||
| pip install "docarray[proto]" | ||
| ``` | ||
|
|
||
| Depending on your usage you might want to use `DocArray` with only a couple of specific modalities and their dependencies. | ||
| For instance, if you only want to work with images, you can install `DocArray` using the following command: | ||
|
|
||
| ``` | ||
| pip install "docarray[image]" | ||
| ``` | ||
|
|
||
| ...or with images and audio: | ||
|
|
||
| ``` | ||
| pip install "docarray[image, audio]" | ||
| ``` | ||
|
|
||
| !!! warning | ||
| This way of installing `DocArray` is only valid starting with version `0.30` |
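
Since the extras syntax above is only valid from version `0.30`, it can help to check which version is installed before relying on it. The following is a minimal sketch using only the standard library; the helper names `supports_extras` and `installed_docarray_version` are hypothetical, and the parsing assumes the version string starts with `major.minor`.

```python
import re
from importlib import metadata
from typing import Optional


def supports_extras(version: str) -> bool:
    """Return True if `version` is 0.30 or newer (assumes a 'major.minor' prefix)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False  # unknown format: be conservative
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 30)


def installed_docarray_version() -> Optional[str]:
    """Return the installed docarray version, or None if it is not installed."""
    try:
        return metadata.version("docarray")
    except metadata.PackageNotFoundError:
        return None


version = installed_docarray_version()
if version is None:
    print("docarray is not installed")
else:
    print(f"docarray {version}, extras supported: {supports_extras(version)}")
```

The comparison is done on integer tuples rather than on strings, so that a version such as `0.4` is not mistaken for something newer than `0.30`.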
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,135 @@ | ||
| # Representing | ||
|
|
||
| At the heart of `DocArray` lies the concept of [`BaseDoc`][docarray.base_doc.doc.BaseDoc]. | ||
|
|
||
| A [BaseDoc][docarray.base_doc.doc.BaseDoc] is very similar to a [Pydantic](https://docs.pydantic.dev/) | ||
| [`BaseModel`](https://docs.pydantic.dev/usage/models) - in fact it _is_ a specialized Pydantic `BaseModel`. It allows you to define custom `Document` schemas (or `Model` in | ||
| the Pydantic world) to represent your data. | ||
|
|
||
| ## Basic `Doc` usage | ||
|
|
||
| Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's | ||
| see what it looks like in practice. | ||
|
|
||
| The following Python code defines a `BannerDoc` class that can be used to represent the data of a website banner. | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import ImageUrl | ||
|
|
||
|
|
||
| class BannerDoc(BaseDoc): | ||
| image_url: ImageUrl | ||
| title: str | ||
| description: str | ||
| ``` | ||
|
|
||
| You can then instantiate a `BannerDoc` object and access its attributes. | ||
|
|
||
| ```python | ||
| banner = BannerDoc( | ||
| image_url='https://example.com/image.png', | ||
| title='Hello World', | ||
| description='This is a banner', | ||
| ) | ||
|
|
||
| assert banner.image_url == 'https://example.com/image.png' | ||
| assert banner.title == 'Hello World' | ||
| assert banner.description == 'This is a banner' | ||
| ``` | ||
|
|
||
| ## `BaseDoc` is a Pydantic `BaseModel` | ||
|
|
||
| The class [BaseDoc][docarray.base_doc.doc.BaseDoc] inherits from Pydantic [BaseModel](https://docs.pydantic.dev/usage/models). So you can use | ||
| all the features of `BaseModel` in your `Doc` class. | ||
|
|
||
| This means that `BaseDoc`: | ||
|
|
||
| * Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an | ||
| error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept later. (TODO add typing section) | ||
| * Can be configured using a nested `Config` class. See the Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on the configuration options Pydantic offers. | ||
| * Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI](https://fastapi.tiangolo.com/). | ||
|
|
||
| ### What is the difference with Pydantic `BaseModel`? (INCOMPLETE) | ||
|
|
||
| LINK TO THE VERSUS (not ready) | ||
|
|
||
| [BaseDoc][docarray.base_doc.doc.BaseDoc] is not only a [BaseModel](https://docs.pydantic.dev/usage/models): | ||
|
|
||
| * You can use it with DocArray [types](docarray.typing) that are oriented toward multimodal (image, audio, ...) data and | ||
| machine learning use cases. TODO link the type section. | ||
|
|
||
| Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` field, generated by default, that uniquely identifies a Document. | ||
|
|
||
| ## `BaseDoc` allows representing multimodal and nested data | ||
|
|
||
| Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos. | ||
| A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple). | ||
|
|
||
| All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): the title and description are text, the thumbnail is an image, and the video itself is, well, a video. | ||
|
|
||
| DocArray allows you to represent all of this multimodal data in a single object. | ||
|
|
||
| Let's first create a `BaseDoc` for each of the elements that compose the YouTube video. | ||
|
|
||
| First, for the thumbnail, which is an image: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import ImageUrl, ImageBytes | ||
|
|
||
|
|
||
| class ImageDoc(BaseDoc): | ||
| url: ImageUrl | ||
| bytes: ImageBytes = ( | ||
|
||
| None # bytes are not always loaded in memory, so we make it optional | ||
| ) | ||
| ``` | ||
|
|
||
| Then for the video itself: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
| from docarray.typing import VideoUrl, VideoBytes | ||
|
|
||
|
|
||
| class VideoDoc(BaseDoc): | ||
| url: VideoUrl | ||
| bytes: VideoBytes = ( | ||
|
Member: same here
Member (Author): No if you put
Member (Author): do you think it is more explicit to have optional? |
||
| None # bytes are not always loaded in memory, so we make it optional | ||
| ) | ||
| ``` | ||
|
|
||
| Then for the title and description (which are text) we will just use a `str` type. | ||
|
|
||
| All the elements that compose a YouTube video are ready: | ||
|
|
||
| ```python | ||
| from docarray import BaseDoc | ||
|
|
||
|
|
||
| class YouTubeVideoDoc(BaseDoc): | ||
| title: str | ||
| description: str | ||
| thumbnail: ImageDoc | ||
| video: VideoDoc | ||
| ``` | ||
|
|
||
| You now have `YouTubeVideoDoc`, which is a Pythonic representation of a YouTube video. | ||
|
|
||
| This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning model](../../how_to/multimodal_training_and_serving.md) with [PyTorch](https://pytorch.org/docs/stable/index.html). | ||
|
|
||
| !!! note | ||
|
|
||
| You see here that `ImageDoc` and `VideoDoc` are also [BaseDoc][docarray.base_doc.doc.BaseDoc], and they are later used inside another [BaseDoc][docarray.base_doc.doc.BaseDoc]. | ||
| This is what we call nested data representation. | ||
|
|
||
| [BaseDoc][docarray.base_doc.doc.BaseDoc] can be nested to represent any kind of data hierarchy. | ||
|
|
||
| See also: | ||
|
|
||
| * [BaseDoc][docarray.base_doc.doc.BaseDoc] API Reference | ||
| * DOCUMENT_ARRAY REF | ||
| * DOCUMENT INDEX REF | ||
| * DOCUMENT STORE REF | ||
| * ... | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| # Sending |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| # Storing |
Just a note to reference FastAPI integration page here that I'm currently working on for whoever merges the second