From 8f1d49d1fac4b73add4cf7b535e69a586644af8e Mon Sep 17 00:00:00 2001
From: Alex C-G
Date: Mon, 17 Apr 2023 14:28:27 +0200
Subject: [PATCH 01/10] docs(readme): fix v1 notice

Signed-off-by: Alex C-G
---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 6b6e8bc488d..d31579acbc5 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,8 @@

-> ⬆️ **DocArray v2**: This readme refer to the second version of DocArray (starting at 0.30). If you want to use the old
-> DocArray v1 version (below 0.30) check out the [docarray-v1-fixe](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch
+> ⬆️ **DocArray v2**: This readme is for the second version of DocArray (starting at 0.30). If you want to use the older
+> DocArray version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch
 
 DocArray is a library for **representing, sending and storing multi-modal data**, perfect for **Machine Learning applications**.
 
@@ -804,4 +804,4 @@ pip install "git+https://github.com/docarray/docarray"
 - ["Legacy" DocArray github page](https://github.com/docarray/docarray/tree/docarray-v1-fixes)
 - ["Legacy" DocArray documentation](https://docarray.jina.ai/)
 
-> DocArray is a trademark of LF AI Projects, LLC
\ No newline at end of file
+> DocArray is a trademark of LF AI Projects, LLC

From b9f8ab43c8e38162777c4e3c5cad12fc8424c2fe Mon Sep 17 00:00:00 2001
From: Alex C-G
Date: Mon, 17 Apr 2023 14:28:37 +0200
Subject: [PATCH 02/10] docs(represent): fix header punctuation

Signed-off-by: Alex C-G
---
 docs/user_guide/representing/first_step.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md
index c1b41b623c6..700b6cb5686 100644
--- a/docs/user_guide/representing/first_step.md
+++ b/docs/user_guide/representing/first_step.md
@@ -10,7 +10,7 @@ the Pydantic world) to represent your data.
 Naming convention: When we refer to a `BaseDoc`, we refer to a class that inherits from [BaseDoc][docarray.base_doc.doc.BaseDoc].
 When we refer to a `Document` we refer to an instance of a `BaseDoc` class.
 
-## Basic `Doc` usage.
+## Basic `Doc` usage Before going into detail about what we can do with [BaseDoc][docarray.base_doc.doc.BaseDoc] and how to use it, let's see what it looks like in practice. From fbd400ef7237c6a8bd470a8b39f902c749e6cd77 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:03:46 +0200 Subject: [PATCH 03/10] docs: first round of final fixes Signed-off-by: Alex C-G --- docs/data_types/3d_mesh/3d_mesh.md | 37 ++-- docs/data_types/audio/audio.md | 28 +-- docs/data_types/first_steps.md | 6 +- docs/data_types/image/image.md | 17 +- docs/data_types/multimodal/multimodal.md | 16 +- docs/data_types/table/table.md | 35 ++-- docs/data_types/text/text.md | 14 +- docs/data_types/video/video.md | 33 ++- docs/how_to/add_doc_index.md | 191 ++++++++++-------- .../how_to/multimodal_training_and_serving.md | 87 ++++---- ...optimize_performance_with_id_generation.md | 6 +- docs/user_guide/sending/api/fastAPI.md | 28 +-- docs/user_guide/sending/api/jina.md | 53 +++-- docs/user_guide/sending/first_step.md | 11 +- docs/user_guide/sending/ser/send_doc.md | 25 ++- docs/user_guide/sending/ser/send_doclist.md | 62 ++++-- docs/user_guide/sending/ser/send_docvec.md | 20 +- 17 files changed, 369 insertions(+), 300 deletions(-) diff --git a/docs/data_types/3d_mesh/3d_mesh.md b/docs/data_types/3d_mesh/3d_mesh.md index 20db151bd23..4895b0b38e4 100644 --- a/docs/data_types/3d_mesh/3d_mesh.md +++ b/docs/data_types/3d_mesh/3d_mesh.md @@ -3,12 +3,12 @@ DocArray supports many different modalities including `3D Mesh`. This section will show you how to load and handle 3D data using DocArray. -A 3D mesh is the structural build of a 3D model consisting of polygons. Most 3D meshes are created via professional software packages, such as commercial suites like Unity, or the free open-source Blender 3D. - +A 3D mesh is the structural build of a 3D model consisting of polygons. 
+Most 3D meshes are created via professional software packages, such as commercial suites like [Unity](https://unity.com/), or the open-source [Blender](https://www.blender.org/).
-But for now, let's create a `MyMesh3D` instance with an URL to a remote `.obj` file: - +But for now, let's create a `MyMesh3D` instance with a URL to a remote `.obj` file: ```python from typing import Optional @@ -45,7 +48,7 @@ class MyMesh3D(BaseDoc): doc = MyMesh3D(mesh_url="https://people.sc.fsu.edu/~jburkardt/data/obj/al.obj") ``` -To load the vertices and faces information, you can simply call [`.load()`][docarray.typing.url.url_3d.mesh_url.Mesh3DUrl.load] on the [`Mesh3DUrl`][docarray.typing.url.url_3d.mesh_url.Mesh3DUrl] instance. This will return a [`VerticesAndFaces`][docarray.documents.mesh.vertices_and_faces.VerticesAndFaces] object. +To load the vertices and faces information, you can call [`.load()`][docarray.typing.url.url_3d.mesh_url.Mesh3DUrl.load] on the [`Mesh3DUrl`][docarray.typing.url.url_3d.mesh_url.Mesh3DUrl] instance. This will return a [`VerticesAndFaces`][docarray.documents.mesh.vertices_and_faces.VerticesAndFaces] object: ```python doc.tensors = doc.mesh_url.load() @@ -1329,10 +1332,9 @@ function render(){tracklight.position.copy(camera.position);renderer.render(scen init(); " width="100%" height="500px" style="border:none;"> - ## Point cloud representation -A point cloud is a representation of a 3D mesh. It is made by repeatedly and uniformly sampling points within the surface of the 3D body. Compared to the mesh representation, the point cloud is a fixed size ndarray and hence easier for deep learning algorithms to handle. +A point cloud is a representation of a 3D mesh. It is made by repeatedly and uniformly sampling points within the surface of the 3D body. Compared to the mesh representation, the point cloud is a fixed size `ndarray` and hence easier for deep learning algorithms to handle. ### Load point cloud @@ -1341,7 +1343,7 @@ A point cloud is a representation of a 3D mesh. 
Output ``` { .text .no-copy } @@ -1386,8 +1389,8 @@ doc.summary() ```
- ### Display 3D point cloud in notebook + You can display your point cloud and interact with it from its URL as well as from a PointsAndColors instance. The first will always display without color, whereas the display from [`PointsAndColors`][docarray.documents.point_cloud.points_and_colors.PointsAndColors] will show with color if `.colors` is not None. ``` { .python} @@ -2642,16 +2645,13 @@ function render(){tracklight.position.copy(camera.position);renderer.render(scen init(); " width="100%" height="500px" style="border:none;"> +## Getting started - Predefined documents - - - -## Getting started - Predefined Docs To get started and play around with the 3D modalities, DocArray provides the predefined documents [`Mesh3D`][docarray.documents.mesh.Mesh3D] and [`PointCloud3D`][docarray.documents.point_cloud.PointCloud3D], which includes all of the previously mentioned functionalities. ### `Mesh3D` -The [`Mesh3D`][docarray.documents.mesh.Mesh3D] class for instance provides a [`Mesh3DUrl`][docarray.typing.Mesh3DUrl] field as well as a [`VerticesAndFaces`][docarray.documents.mesh.vertices_and_faces.VerticesAndFaces] field. +The [`Mesh3D`][docarray.documents.mesh.Mesh3D] class provides a [`Mesh3DUrl`][docarray.typing.Mesh3DUrl] field and [`VerticesAndFaces`][docarray.documents.mesh.vertices_and_faces.VerticesAndFaces] field. ``` { .python } class Mesh3D(BaseDoc): @@ -2671,7 +2671,7 @@ class PointCloud3D(BaseDoc): bytes_: Optional[bytes] ``` -You can use them directly, extend or compose them. 
+You can use them directly, extend or compose them: ```python from docarray import BaseDoc @@ -2692,7 +2692,6 @@ doc = My3DObject( pc=PointCloud3D(url=obj_file), ) - doc.mesh.tensors = doc.mesh.url.load() doc.pc.tensors = doc.pc.url.load(samples=100) -``` \ No newline at end of file +``` diff --git a/docs/data_types/audio/audio.md b/docs/data_types/audio/audio.md index 2a73d22f2aa..ea12b0a5e35 100644 --- a/docs/data_types/audio/audio.md +++ b/docs/data_types/audio/audio.md @@ -7,28 +7,31 @@ Moreover, you will learn about DocArray's audio-specific types, to represent you !!! note This requires a `pydub` dependency. You can install all necessary dependencies via: + ```cmd pip install "docarray[audio]" ``` + Additionally, you have to install `ffmpeg` (see more info [here](https://github.com/jiaaro/pydub#getting-ffmpeg-set-up)): + ```cmd # on Mac with brew: brew install ffmpeg ``` + ```cmd # on Linux with apt-get apt-get install ffmpeg libavcodec-extra ``` - ## Load audio file -First, let's define a class, which extends [`BaseDoc`][docarray.base_doc.doc.BaseDoc] and has an `url` attribute of type [`AudioUrl`][docarray.typing.url.AudioUrl], and an optional `tensor` attribute of type [`AudioTensor`](../../../../api_references/typing/tensor/audio). +First, let's define a class which extends [`BaseDoc`][docarray.base_doc.doc.BaseDoc] and has a `url` attribute of type [`AudioUrl`][docarray.typing.url.AudioUrl], and an optional `tensor` attribute of type [`AudioTensor`](../../../../api_references/typing/tensor/audio). !!! tip Check out our predefined [`AudioDoc`](#getting-started-predefined-audiodoc) to get started and play around with our audio features. -Next, you can instantiate an object of that class with a local or remote URL. 
+Next, you can instantiate an object of that class with a local or remote URL: ```python from docarray import BaseDoc @@ -50,13 +53,14 @@ Loading the content of the audio file is as easy as calling [`.load()`][docarray This will return a tuple of: -- an [`AudioNdArray`][docarray.typing.tensor.audio.AudioNdArray] representing the audio file content -- an integer representing the frame rate (number of signals for a certain period of time) +- An [`AudioNdArray`][docarray.typing.tensor.audio.AudioNdArray] representing the audio file content +- An integer representing the frame rate (number of signals for a certain period of time) ```python doc.tensor, doc.frame_rate = doc.url.load() doc.summary() ``` +
Output ``` { .text .no-copy } @@ -72,7 +76,6 @@ doc.summary() ```
- ## AudioTensor DocArray offers several [`AudioTensor`s](../../../../api_references/typing/tensor/audio) to store your data to: @@ -105,7 +108,6 @@ assert isinstance(doc.tf_tensor, AudioTensorFlowTensor) assert isinstance(doc.torch_tensor, AudioTorchTensor) ``` - ## AudioBytes Alternatively, you can load your [`AudioUrl`][docarray.typing.url.AudioUrl] instance to [`AudioBytes`][docarray.typing.bytes.AudioBytes], and your [`AudioBytes`][docarray.typing.bytes.AudioBytes] instance to an [`AudioTensor`](../../../../api_references/typing/tensor/audio) of your choice: @@ -142,7 +144,9 @@ assert isinstance(bytes_from_tensor, AudioBytes) ``` ## Save audio to file + You can save your [`AudioTensor`](../../../../api_references/typing/tensor/audio) to an audio file of any format as follows: + ``` { .python } tensor_reversed = doc.tensor[::-1] tensor_reversed.save( @@ -152,7 +156,7 @@ tensor_reversed.save( ``` ## Play audio in a notebook -You can play your audio sound in a notebook from its URL as well as its tensor, by calling `.display()` on either one. +You can play your audio sound in a notebook from its URL or tensor, by calling `.display()` on either one. Play from `url`: ``` { .python } @@ -166,18 +170,17 @@ doc.url.display() Play from `tensor`: + ``` { .python } tensor_reversed.display() ``` +
- - - ## Getting started - Predefined `AudioDoc` To get started and play around with your audio data, DocArray provides a predefined [`AudioDoc`][docarray.documents.audio.AudioDoc], which includes all of the previously mentioned functionalities: @@ -192,6 +195,7 @@ class AudioDoc(BaseDoc): ``` You can use this class directly or extend it to your preference: + ```python from docarray.documents import AudioDoc from typing import Optional @@ -205,7 +209,7 @@ class MyAudio(AudioDoc): audio = MyAudio( url='https://github.com/docarray/docarray/blob/main/tests/toydata/hello.mp3?raw=true' ) + audio.name = 'My first audio doc!' audio.tensor, audio.frame_rate = audio.url.load() ``` - diff --git a/docs/data_types/first_steps.md b/docs/data_types/first_steps.md index 4119e9df1e8..542f60356ee 100644 --- a/docs/data_types/first_steps.md +++ b/docs/data_types/first_steps.md @@ -1,7 +1,7 @@ -# Intro +# Introduction With DocArray you can represent text, image, video, audio, and 3D meshes, whether separate, nested or combined, -and process them as a DocList. +and process them as a [`DocList`][docarray.array.doc_list.doc_list.DocList]. This section covers the following sections: @@ -11,4 +11,4 @@ This section covers the following sections: - [Video](video/video.md) - [3D Mesh](3d_mesh/3d_mesh.md) - [Table](table/table.md) -- [Multimodal data](multimodal/multimodal.md) \ No newline at end of file +- [Multimodal data](multimodal/multimodal.md) diff --git a/docs/data_types/image/image.md b/docs/data_types/image/image.md index 892542e3a45..27c7a5bfe0c 100644 --- a/docs/data_types/image/image.md +++ b/docs/data_types/image/image.md @@ -7,6 +7,7 @@ Moreover, we will introduce DocArray's image-specific types, to represent your i !!! note This requires `Pillow` dependency. You can install all necessary dependencies via: + ```cmd pip install "docarray[image]" ``` @@ -16,9 +17,9 @@ Moreover, we will introduce DocArray's image-specific types, to represent your i !!! 
tip Check out our predefined [`ImageDoc`](#getting-started-predefined-imagedoc) to get started and play around with our image features. -First, let's define our class `MyImage`, which extends [`BaseDoc`][docarray.base_doc.doc.BaseDoc] and has an `url` attribute of type [`ImageUrl`][docarray.typing.url.ImageUrl], as well as an optional `tensor` attribute of type [`ImageTensor`](../../../../api_references/typing/tensor/image). +First, let's define the class `MyImage`, which extends [`BaseDoc`][docarray.base_doc.doc.BaseDoc] and has a `url` attribute of type [`ImageUrl`][docarray.typing.url.ImageUrl], as well as an optional `tensor` attribute of type [`ImageTensor`](../../../../api_references/typing/tensor/image). -Next, let's instantiate a `MyImage` object with a local or remote URL. +Next, let's instantiate a `MyImage` object with a local or remote URL: ```python from docarray.typing import ImageTensor, ImageUrl @@ -35,7 +36,7 @@ img = MyImage( ) ``` -To load the image data you can call [`.load()`][docarray.typing.url.ImageUrl.load] on the `url` attribute. By default, [`ImageUrl.load()`][docarray.typing.url.ImageUrl.load] returns an [`ImageNdArray`][docarray.typing.tensor.image.image_ndarray.ImageNdArray] object. +To load the image data you can call [`.load()`][docarray.typing.url.ImageUrl.load] on the `url` attribute. 
+By default, [`ImageUrl.load()`][docarray.typing.url.ImageUrl.load] returns an [`ImageNdArray`][docarray.typing.tensor.image.image_ndarray.ImageNdArray] object:
![](display_notebook.jpg){ width="900" }
- ## Getting started - Predefined `ImageDoc` -To get started and play around with the image-modality, DocArray provides a predefined [`ImageDoc`][docarray.documents.image.ImageDoc], which includes all of the previously mentioned functionalities: +To get started and play around with the image modality, DocArray provides a predefined [`ImageDoc`][docarray.documents.image.ImageDoc], which includes all of the previously mentioned functionalities: ``` { .python } class ImageDoc(BaseDoc): @@ -181,6 +178,7 @@ class ImageDoc(BaseDoc): ``` You can use this class directly or extend it to your preference: + ``` { .python } from docarray.documents import ImageDoc from docarray.typing import AnyEmbedding @@ -197,6 +195,7 @@ image = MyImage( image_title='My first image', url='http://www.jina.ai/image.jpg', ) + image.tensor = image.url.load() model = SomeEmbeddingModel() image.embedding = model(image.tensor) diff --git a/docs/data_types/multimodal/multimodal.md b/docs/data_types/multimodal/multimodal.md index 57a2b1af56f..e99fbfad8b3 100644 --- a/docs/data_types/multimodal/multimodal.md +++ b/docs/data_types/multimodal/multimodal.md @@ -4,8 +4,8 @@ In this section, we will walk through how to use DocArray to process multiple da !!! tip "See also" In this section, we will work with image and text data. If you are not yet familiar with how to process these - modalities individually, you may want to check out the respective examples first: [`Image`](../image/image.md) - and [`Text`](../text/text.md) + modalities individually, you may want to check out the [`Image`](../image/image.md) + and [`Text`](../text/text.md) examples first. ## Model your data @@ -14,7 +14,7 @@ DocArray allows you to model your data and these relationships. 
### Define a schema -Let's suppose you want to model a page of a newspaper that contains a main text, an image URL, a corresponding tensor +Suppose you want to model a page of a newspaper that contains a main text, an image URL, a corresponding tensor as well as a description. You can model this example in the following way: ```python @@ -40,10 +40,12 @@ page = Page( img_url='https://github.com/docarray/docarray/blob/main/docs/assets/favicon.png?raw=true', img_description='This is the image of an apple', ) + page.img_tensor = page.img_url.load() page.summary() ``` +
Output ``` { .text .no-copy } @@ -71,6 +73,7 @@ print(page.img_url) print(page.img_description) print(page.img_tensor) ``` +
Output ``` { .text .no-copy } @@ -93,7 +96,7 @@ For this example, let's try to define a schema to represent a newspaper. The new any number of following pages, and some metadata. Further, each page contains a main text and can contain an image and an image description. -To implement this you can simply add a `Newspaper` class to our previous implementation. The newspaper has a required +To implement this you can add a `Newspaper` class to the previous implementation. The newspaper has a required `cover_page` attribute of type `Page` as well as a `pages` attribute, which is a `DocList` of `Page`s. ```python @@ -114,7 +117,7 @@ class Newspaper(BaseDoc): metadata: dict = None ``` -You can instantiate this more complex `Newspaper` object in the same way as before: +You can instantiate this more complex `Newspaper` object the same way as before: ```python cover_page = Page( @@ -142,6 +145,7 @@ docarray_daily = Newspaper( docarray_daily.summary() ``` +
Output ``` { .text .no-copy } @@ -179,4 +183,4 @@ docarray_daily.summary() │ ╰─────────────────────┴────────────────╯ └── ... 1 more Page documents ``` -
\ No newline at end of file +
diff --git a/docs/data_types/table/table.md b/docs/data_types/table/table.md index 701db376f19..45474ce80a1 100644 --- a/docs/data_types/table/table.md +++ b/docs/data_types/table/table.md @@ -6,8 +6,9 @@ This section will show you how to load and handle tabular data using DocArray. ## Load CSV table A common way to store tabular data is via `CSV` (comma-separated values) files. -You can easily load such data from a given `CSV` file into a [`DocList`][docarray.DocList]. -Let's take a look at the following example file, which includes data about books and their authors and publishing year. +You can load such data from a given `CSV` file into a [`DocList`][docarray.DocList]. + +Let's take a look at the following example file, which includes data about books and their authors and year of publication: ```text title,author,year @@ -16,7 +17,8 @@ Klara and the sun,Kazuo Ishiguro,2020 A little life,Hanya Yanagihara,2015 ``` -First, you have to define the Document schema describing the data. +First, define the Document schema describing the data: + ```python from docarray import BaseDoc @@ -26,7 +28,8 @@ class Book(BaseDoc): author: str year: int ``` -Next, you can load the content of the CSV file to a [`DocList`][docarray.DocList] instance of `Book`s via [`.from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv]. +Next, load the content of the CSV file to a [`DocList`][docarray.DocList] instance of `Book`s via [`.from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv]: + ```python from docarray import DocList @@ -36,6 +39,7 @@ docs = DocList[Book].from_csv( ) docs.summary() ``` +
Output ``` { .text .no-copy } @@ -56,19 +60,19 @@ docs.summary() ```
-The resulting [`DocList`][docarray.DocList] object contains three `Book`s since each row of the CSV file corresponds to one book and is assigned to one `Book` instance. - +The resulting [`DocList`][docarray.DocList] object contains three `Book`s, since each row of the CSV file corresponds to one book and is assigned to one `Book` instance. ## Save to CSV file -Vice versa, you can also store your [`DocList`][docarray.DocList] data in a `.csv` file using [`.to_csv()`][docarray.array.doc_list.io.IOMixinArray.to_csv]. +Vice versa, you can also store your [`DocList`][docarray.DocList] data in a `.csv` file using [`.to_csv()`][docarray.array.doc_list.io.IOMixinArray.to_csv]: + ``` { .python } docs.to_csv(file_path='/path/to/my_file.csv') ``` Tabular data is often not the best choice to represent nested Documents. Hence, nested Documents will be stored flattened and can be accessed by their `'__'`-separated access paths. -Let's take a look at an example. We now want to store not only the book data but moreover book review data. To do so, we define a `BookReview` class that has a nested `book` attribute as well as the non-nested attributes `n_ratings` and `stars`. +Let's take a look at an example. We now want to store not only the book data but moreover book review data. To do so, we define a `BookReview` class that has a nested `book` attribute as well as the non-nested attributes `n_ratings` and `stars`: ```python class BookReview(BaseDoc): @@ -82,6 +86,7 @@ review_docs = DocList[BookReview]( ) review_docs.summary() ``` +
Output ``` { .text .no-copy} @@ -105,16 +110,17 @@ review_docs.summary() ```
-As expected all nested attributes will be stored by there access path. +As expected all nested attributes will be stored by their access path: + ``` { .python } review_docs.to_csv(file_path='/path/to/nested_documents.csv') ``` + ``` { .text .no-copy hl_lines="1" } id,book__id,book__title,book__author,book__year,n_ratings,stars d6363aa3b78b4f4244fb976570a84ff7,8cd85fea52b3a3bc582cf56c9d612cbb,Harry Potter and the Philosopher's Stone,J. K. Rowling,1997,12345,5.0 5b53fff67e6b6cede5870f2ee09edb05,87b369b93593967226c525cf226e3325,Klara and the sun,Kazuo Ishiguro,2020,12345,5.0 addca0475756fc12cdec8faf8fb10d71,03194cec1b75927c2259b3c0fff1ab6f,A little life,Hanya Yanagihara,2015,12345,5.0 - ``` ## Handle TSV tables @@ -142,6 +148,7 @@ docs = DocList[Book].from_csv( for doc in docs: doc.summary() ``` +
Output ```text @@ -165,10 +172,10 @@ for doc in docs:
Great! All the data is correctly read and stored in `Book` instances. + ## Other separators -If your values are separated by yet another separator, you can create your own `csv.Dialect` class. -To do so you can create a class, that inherits from `csv.Dialect`. +If your values are separated by yet another separator, you can create your own class that inherits from `csv.Dialect`. Within this class, you can define your dialect's behavior by setting the provided [formatting parameters](https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters). For instance, let's assume you have a semicolon-separated table: @@ -180,6 +187,7 @@ John;Doe;1234 ``` Now, let's define our `SemicolonSeparator` class. Next to the `delimiter` parameter, we have to set some more formatting parameters such as `doublequote` and `lineterminator`. + ```python import csv @@ -191,7 +199,9 @@ class SemicolonSeparator(csv.Dialect): quotechar = '"' quoting = csv.QUOTE_MINIMAL ``` + Finally, you can load your data by setting the `dialect` parameter in [`.from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv] to an instance of your `SemicolonSeparator`. + ```python docs = DocList[Book].from_csv( file_path='https://github.com/docarray/docarray/blob/main/tests/toydata/books_semicolon_sep.csv?raw=true', @@ -200,6 +210,7 @@ docs = DocList[Book].from_csv( for doc in docs: doc.summary() ``` +
Output ```text diff --git a/docs/data_types/text/text.md b/docs/data_types/text/text.md index 7af4cbe7ab7..2b1ec16384c 100644 --- a/docs/data_types/text/text.md +++ b/docs/data_types/text/text.md @@ -1,4 +1,3 @@ - # 🔤 Text DocArray supports many different modalities including `Text`. @@ -20,7 +19,7 @@ class MyText(BaseDoc): doc = MyText(text='Hello world!') ``` -The text can include any type of character, including emojis: +Text can include any type of character, including emojis: ```python doc.text = '👋 नमस्ते दुनिया! 你好世界!こんにちは世界! Привет мир!' @@ -28,7 +27,7 @@ doc.text = '👋 नमस्ते दुनिया! 你好世界!こんに ## Load text file -If your text data is too long to be written inline or if it is stored in a file, you can also define the URL as a [`TextUrl`][docarray.typing.url.text_url.TextUrl] first and then load the text data. +If your text data is too long to be written inline or if it is stored in a file, you can first define the URL as a [`TextUrl`][docarray.typing.url.text_url.TextUrl] and then load the text data. Let's first define a schema: @@ -41,7 +40,8 @@ class MyText(BaseDoc): text: str = None url: TextUrl = None ``` -Next, you can instantiate a `MyText` object with a `url` attribute and load its content to the `text` field. +Next, instantiate a `MyText` object with a `url` attribute and load its content to the `text` field. + ```python doc = MyText( url='https://www.w3.org/History/19921103-hypertext/hypertext/README.html', @@ -53,8 +53,8 @@ assert doc.text.startswith('Read Me') ## Segment long texts -Often times when you index or search text data, you don’t want to consider thousands of words as one huge string. -Instead, some finer granularity would be nice. You can do this by leveraging nested fields. For example, let’s split some page content into its sentences by `'.'`. +When you index or search text data, you often don’t want to consider thousands of words as one huge string. +Instead, some finer granularity would be nice. 
+You can do this by leveraging nested fields. For example, let’s split some page content into its sentences by `'.'`:
Output ``` { .text .no-copy } @@ -105,4 +106,3 @@ class TextDoc(BaseDoc): embedding: Optional[AnyEmbedding] bytes_: Optional[bytes] ``` - diff --git a/docs/data_types/video/video.md b/docs/data_types/video/video.md index f7c55765c22..f619af91085 100644 --- a/docs/data_types/video/video.md +++ b/docs/data_types/video/video.md @@ -42,14 +42,12 @@ doc = MyVideo( Now you can load the video file content by simply calling [`.load()`][docarray.typing.url.audio_url.AudioUrl.load] on your [`AudioUrl`][docarray.typing.url.audio_url.AudioUrl] instance. This will return a [NamedTuple](https://docs.python.org/3/library/typing.html#typing.NamedTuple) of a **video tensor**, an **audio tensor**, and the **key frame indices**: -- The video tensor is a 4-dim array of shape `(n_frames, height, width, channels)`.
-The first dimension represents the frame id.
## VideoTensor @@ -74,7 +73,7 @@ DocArray offers several [`VideoTensor`s](../../../../api_references/typing/tenso - [`VideoTorchTensor`][docarray.typing.tensor.video.VideoTorchTensor] - [`VideoTensorFlowTensor`][docarray.typing.tensor.video.VideoTensorFlowTensor] -If you specify the type of your tensor to one of the above, it will be cast to that automatically: +If you specify the type of your tensor as one of the above, it will be cast to that automatically: ```python hl_lines="7 8 15 16" from docarray import BaseDoc @@ -98,8 +97,6 @@ assert isinstance(doc.tf_tensor, VideoTensorFlowTensor) assert isinstance(doc.torch_tensor, VideoTorchTensor) ``` - - ## VideoBytes Alternatively, you can load your [`VideoUrl`][docarray.typing.url.VideoUrl] instance to [`VideoBytes`][docarray.typing.bytes.VideoBytes], and your [`VideoBytes`][docarray.typing.bytes.VideoBytes] instance to a [`VideoTensor`](../../../../api_references/typing/tensor/video) of your choice: @@ -133,11 +130,10 @@ bytes_from_tensor = doc.video.to_bytes() assert isinstance(bytes_from_tensor, VideoBytes) ``` - ## Key frame extraction A key frame is defined as the starting point of any smooth transition. -Given the key frame indices, you can access selected scenes. +Given the key frame indices, you can access selected scenes: ```python indices = doc.key_frame_indices @@ -164,11 +160,10 @@ for frame in key_frames: ![](key_frames.png){ width="350" } - - ## Save video to file -You can save your video tensor to a file. In the example below you save the video with a framerate of 60 fps, which results in a 4-secOND video, instead of the original 10-second video with a frame rate of 25 fps. +You can save your video tensor to a file. In the example below you save the video with a framerate of 60 fps, which results in a 4-second video, instead of the original 10-second video with a frame rate of 25 fps. 
+
``` { .python }
doc.video.save(
    file_path="/path/my_video.mp4",
@@ -184,9 +179,8 @@ You can play a video in a notebook from its URL as well as its tensor, by callin
-doc_fast = MyAudio(url="/path/my_video.mp4")
+doc_fast = MyVideo(url="/path/my_video.mp4")
doc_fast.url.display()
```
-![type:video](mov_bbb_framerate_60.mp4){: style='width: 600px; height: 330px'}
-
+![type:video](mov_bbb_framerate_60.mp4){: style='width: 600px; height: 330px'}

## Getting started - Predefined `VideoDoc`
@@ -218,6 +212,7 @@ class MyVideo(VideoDoc):
video = MyVideo(
    url='https://github.com/docarray/docarray/blob/main/tests/toydata/mov_bbb.mp4?raw=true'
)
+
video.name = 'My first video doc!'
video.tensor = video.url.load().video
-```
\ No newline at end of file
+```
diff --git a/docs/how_to/add_doc_index.md b/docs/how_to/add_doc_index.md
index 3d4cfb9b8bc..37833b277af 100644
--- a/docs/how_to/add_doc_index.md
+++ b/docs/how_to/add_doc_index.md
@@ -1,16 +1,14 @@
# Add a new Document Index

-In DocArray there exists the concept of _Document Index_, a class that takes `Document`s, optionally persists them,
-and makes them searchable.
+In DocArray a _Document Index_ is a class that takes documents, optionally persists them,
+and makes them searchable. Different Document Indexes leverage different backends, like Weaviate, Qdrant, HNSWLib etc.

-There are different Document Indexes leveraging different backends, such as Weaviate, Qdrant, HNSWLib etc.
+This document covers adding a new Document Index to DocArray.

-This document shows how to add a new Document Index to DocArray.
+This can be broken down into a number of steps:

-That process can be broken down into a number of basic steps:
-
-1. Installation and user instructions
-2. Create a new class that inherits from `BaseDocIndex`Create a new class that inherits from `BaseDocIndex`
+1. Installation and user instructions
+2. Create a new class that inherits from `BaseDocIndex`
3. Declare default configurations for your Document Index
4. Implement abstract methods for indexing, searching, and deleting
5. Implement a Query Builder for your Document Index
@@ -20,22 +18,32 @@
In general, the steps above can be followed in roughly that order.
However, a Document Index implementation is usually very interconnected, so you will probably have to jump between these steps a bit,
both in your implementation and in the guide below.

-For an end-to-end example of this process, you can check out the [existing HNSWLib Document Index implementation](https://github.com/docarray/docarray/pull/1124).
+For an end-to-end example of this process, check out the [existing HNSWLib Document Index implementation](https://github.com/docarray/docarray/pull/1124).

-**Caution**: The HNSWLib Document Index implementation can be used as a reference, but it is special in some key ways.
-For example, HNSWLib can only index vectors, so it uses SQLite to store the rest of the Documents alongside it.
-This is _not_ how you should store Documents in your implementation! You can find guidance on how you _should_ do it below.
+!!! warning
+    The HNSWLib Document Index implementation can be used as a reference, but it is special in some key ways.
+    For example, HNSWLib can only index vectors, so it uses SQLite to store the rest of the documents alongside it.
+    This is _not_ how you should store documents in your implementation! You can find guidance on how you _should_ do it below.

## Installation and user instructions

-Add the library required for your Index via poetry: `poetry add {my_index_lib}`.
-In the `pyproject.toml` file, it will look like this: + +Add the library required for your Index via poetry: + +```shell +poetry add {my_index_lib} ``` + +The `pyproject.toml` file should now look like this: + +```toml [tool.poetry.dependencies] my_index_lib = ">=123.456.789" ``` -Mark it as optional and manually create an extra for it: -``` + +Mark the library as optional and manually create an `extra` for it: + +```toml [tool.poetry.dependencies] my_index_lib = {version = ">=0.6.2", optional = true } @@ -46,10 +54,12 @@ my_index_extra = ["my_index_lib"] In case the user tries to use your Index without the correct installs, we want to throw an error with corresponding instructions. To enable this, first, add instructions to the `INSTALL_INSTRUCTIONS` dictionary in `docarray/utils/misc.py`, such as + ```python {'my_index_lib': '"docarray[my_index_extra]"'} ``` -Next, ensure to add a case to the `__getattr__()` for your new Index to `docarray/index/__init__.py`. By doing so, the user will be given the instructions, when trying to import `MyIndex` without the correct libraries installed. + +Next, ensure you add a case to the `__getattr__()` in `docarray/index/__init__.py` for your new Index. By doing so, the user will be given the instructions when trying to import `MyIndex` without the correct libraries installed. ```python if TYPE_CHECKING: @@ -64,13 +74,13 @@ def __getattr__(name: str): __all__.append('MyIndex') return MyIndex ``` -Additionally, wrap the required imports in the file where the `MyIndex` class will be located, such as it was done in `docarray/index/backends/hnswlib.py`. + +Additionally, wrap the required imports in the file where the `MyIndex` class will be located, like it is done in `docarray/index/backends/hnswlib.py`. 
## Create a new Document Index class To get started, create a new class that inherits from `BaseDocIndex` and `typing.Generic`: - ```python TSchema = TypeVar('TSchema', bound=BaseDoc) @@ -92,28 +102,29 @@ def __init__(self, db_config=None, **kwargs): ... ``` -Make sure that you call the `super().__init__` method, which will do some basic initialization for you. +Ensure you call the `super().__init__` method, which will do some basic initialization for you. ### Set up your backend -Your backend (database or similar) should represent Documents in the following way: -- Every field of a Document is a column in the database -- Column types follow a default that you define, based on the type hint of the associated field, but can also be configured by the user -- Every row in your database thus represents a Document -- **Nesting:** The most common way to handle nested Document (and the one where the `AbstractDocumentIndex` will hold your hand the most), is to flatten out nested Documents. But if your backend natively supports nesting representations, then feel free to leverage those! +Your backend (database or similar) should represent documents in the following way: -**Caution**: Don't take too much inspiration from the HNSWLib Document Index implementation on this point, as it is a bit of a special case. +- Every field of a document is a column in the database. +- Column types follow a default that you define, based on the type hint of the associated field, but can also be configured by the user. +- Every row in your database thus represents a document. +- **Nesting:** The most common way to handle nested documents (and the one where the `AbstractDocumentIndex` will hold your hand the most), is to flatten out nested documents. But if your backend natively supports nesting representations, then feel free to leverage those! +!!! warning + Don't take too much inspiration from the HNSWLib Document Index implementation on this point, as it is a bit of a special case. 
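The parametric pattern described above can be sketched without any DocArray dependency; the stand-in classes below mimic `BaseDoc` and a `BaseDocIndex` subclass:

```python
from typing import Generic, TypeVar

class BaseDoc:  # stand-in for docarray.BaseDoc
    pass

TSchema = TypeVar('TSchema', bound=BaseDoc)

class MyDocumentIndex(Generic[TSchema]):  # stand-in for a BaseDocIndex subclass
    def __init__(self, db_config=None, **kwargs):
        # a real implementation would call super().__init__(db_config=db_config, **kwargs) here
        self._db_config = db_config or {}

class MyDoc(BaseDoc):
    pass

# the user instantiates the index parametrically, bound to a schema
index = MyDocumentIndex[MyDoc]()
```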
-Also, you should check if the Document Index is being set up "fresh", meaning no data was previously persisted.
-Then you should create a new database table (or the equivalent concept in you backend) for the Documents.
-Otherwise, the Document Index should connect to the existing database and table.
+Also, check whether the Document Index is being set up "fresh", meaning no data was previously persisted.
+If so, create a new database table (or the equivalent concept in your backend) for the documents; otherwise, the Document Index should connect to the existing database and table.
You can determine this based on `self._db_config` (see below).

-**Note:** If you are integrating a database, your Document Index should always assume that there is already a database running that it can connect to.
-It should _not_ spawn a new database instance.
+!!! note
+    If you are integrating a database, your Document Index should always assume there is already a database running that it can connect to.
+    It should _not_ spawn a new database instance.

-To help you with all of this, `super().__init__` inject a few helpful attributes for you (more info in the dedicated sections below):
+To help with all of this, `super().__init__` injects a few helpful attributes for you (more info in the dedicated sections below):

- `self._schema`
- `self._db_config`
@@ -122,7 +133,7 @@ To help you with all of this, `super().__init__` inject a few helpful attributes

### The `_schema`

-When a user instantiates a Document Index, they do so in a parametric way, like so:
+When a user instantiates a Document Index, they do so in a parametric way:

```python
class Inner(BaseDoc):
@@ -182,20 +193,21 @@ class _ColumnInfo:

- `docarray_type` is the type of the column in DocArray, e.g. `AbstractTensor` or `str`
- `db_type` is the type of the column in the Document Index, e.g. `np.ndarray` or `str`. You can customize the mapping from `docarray_type` to `db_type`, as we will see later.
-- `config` is a dictionary of configurations for the column. For example, for the `other_tensor` column above, this would contain the `space` and `dim` configurations.
+- `config` is a dictionary of configurations for the column. For example, the `other_tensor` column above would contain the `space` and `dim` configurations.
- `n_dim` is the dimensionality of the column, e.g. `100` for a 100-dimensional vector. See further guidance on this below.

Again, these are automatically populated for you, so you can just use them in your implementation.

-**Note:**
-`_ColumnInfo.docarray_type` contains the python type as specified in `self._schema`, whereas
-`_ColumnInfo.db_type` contains the data type of a particular database column.
-By default, it holds that `_ColumnInfo.docarray_type == self.python_type_to_db_type(_ColumnInfo.db_type)`, as we will see later.
-However, you should not rely on this, because a user can manually specify a different db_type.
-Therefore, your implementation should rely on `_ColumnInfo.db_type` and not directly call `python_type_to_db_type()`.
+!!! note
+    `_ColumnInfo.docarray_type` contains the Python type as specified in `self._schema`, whereas
+    `_ColumnInfo.db_type` contains the data type of a particular database column.
+
+    By default, it holds that `_ColumnInfo.db_type == self.python_type_to_db_type(_ColumnInfo.docarray_type)`, as we will see later.
+    However, you should not rely on this, because a user can manually specify a different `db_type`.
+    Therefore, your implementation should rely on `_ColumnInfo.db_type` and not directly call `python_type_to_db_type()`.

-**Caution**
-If a subclass of `AbstractTensor` appears in the Document Index's schema (i.e. `TorchTensor`, `NdArray`, or `TensorFlowTensor`), then `_ColumnInfo.docarray_type` will simply show `AbstractTensor` instead of the specific subclass. This is because the abstract class normalizes all input data of type `AbstractTensor` to `np.ndarray` anyways, which should make your life easier. Just be sure to properly handle `AbstractTensor` as a possible value or `_ColumnInfo.docarray_type`, and you won't have to worry about the differences between torch, tf, and np.
+!!! warning
+    If a subclass of `AbstractTensor` appears in the Document Index's schema (i.e. `TorchTensor`, `NdArray`, or `TensorFlowTensor`), then `_ColumnInfo.docarray_type` will simply show `AbstractTensor` instead of the specific subclass. This is because the abstract class normalizes all input data of type `AbstractTensor` to `np.ndarray` anyway, which should make your life easier. Just be sure to properly handle `AbstractTensor` as a possible value of `_ColumnInfo.docarray_type`, and you won't have to worry about the differences between torch, tf, and np.

### Properly handle `n_dim`

@@ -208,7 +220,7 @@ This leads to four possible scenarios:

**Scenario 1: Only `n_dim` is defined**

-Imagine the user defines a schema like the following:
+Imagine the user defines this schema:

```python
class MyDoc(BaseDoc):
@@ -219,11 +231,11 @@ index = MyDocumentIndex[MyDoc]()
```

In that case, the following will be true: `self._column_infos['tensor'].n_dim == 100` and `self._column_infos['tensor'].config == {}`.
-The `tensor` column in your backend should be configured to have dimensionality 100.
+The `tensor` column in your backend should be configured to have dimensionality `100`.

**Scenario 2: Only `Field(...)` is defined**

-Imagine the user defines a schema like the following:
+Now, imagine the user defines _this_ schema:

```python
class MyDoc(BaseDoc):
@@ -233,12 +245,12 @@ class MyDoc(BaseDoc):

index = MyDocumentIndex[MyDoc]()
```

-In that case, the following will be true: `self._column_infos['tensor'].n_dim is None` and `self._column_infos['tensor'].config['dim'] == 50`.
-The `tensor` column in your backend should be configured to have dimensionality 50. +In that case, `self._column_infos['tensor'].n_dim is None` and `self._column_infos['tensor'].config['dim'] == 50`. +The `tensor` column in your backend should be configured to have dimensionality `50`. **Scenario 3: Both `n_dim` and `Field(...)` are defined** -Imagine the user defines a schema like the following: +Now, imagine this schema: ```python class MyDoc(BaseDoc): @@ -248,12 +260,12 @@ class MyDoc(BaseDoc): index = MyDocumentIndex[MyDoc]() ``` -In that case, the following will be true: `self._column_infos['tensor'].n_dim == 100` and `self._column_infos['tensor'].config['dim'] == 50`. -The `tensor` column in your backend should be configured to have dimensionality 100, as **`n_dim` takes precedence over `Field(...)`**. +In this case, `self._column_infos['tensor'].n_dim == 100` and `self._column_infos['tensor'].config['dim'] == 50`. +The `tensor` column in your backend should be configured to have dimensionality `100`, as **`n_dim` takes precedence over `Field(...)`**. **Scenario 4: Neither `n_dim` nor `Field(...)` are defined** -Imagine the user defines a schema like the following: +Finally, imagine this: ```python class MyDoc(BaseDoc): @@ -263,15 +275,15 @@ class MyDoc(BaseDoc): index = MyDocumentIndex[MyDoc]() ``` -In that case, the following will be true: `self._column_infos['tensor'].n_dim is None` and `self._column_infos['tensor'].config == {}`. +In this case, `self._column_infos['tensor'].n_dim is None` and `self._column_infos['tensor'].config == {}`. If your backend can handle tensor/embedding columns without defined dimensionality, you should leverage that mechanism. Otherwise, raise an Exception. ## Declare default configurations -We already made reference to the `_db_config` and `_runtime_config` attributes. +We have already made reference to the `_db_config` and `_runtime_config` attributes. 
-In order to define what can be stored in them, and what the default values are, you need to create two inner classes: +To define what can be stored in them, and what the default values are, you need to create two inner classes: ```python @dataclass @@ -284,29 +296,30 @@ class RuntimeConfig(BaseDocIndex.RuntimeConfig): default_column_config: Dict[Type, Dict[str, Any]] = ... ``` -Note that: -- `DBConfig` inherits from `BaseDocIndex.DBConfig` and `RuntimeConfig` inherits from `BaseDocIndex.RuntimeConfig` -- All fields in each dataclass need to have default values. Choose these sensibly, as they will be used if the user does not specify a value. +!!! note + - `DBConfig` inherits from `BaseDocIndex.DBConfig` and `RuntimeConfig` inherits from `BaseDocIndex.RuntimeConfig` + - All fields in each dataclass need to have default values. Choose these sensibly, as they will be used if the user does not specify a value. ### The `DBConfig` class -The `DBConfig` class is used to define the static configurations of your Document Index. +The `DBConfig` class defines the static configurations of your Document Index. These are configurations that are tied to the database (or library) running in the background, such as `host`, `port`, etc. Here you should put everything that the user cannot or should not change after initialization. ### The `RuntimeConfig` class -The `RuntimeConfig` class is used to define the dynamic configurations of your Document Index. +The `RuntimeConfig` class defines the dynamic configurations of your Document Index. These are configurations that can be changed at runtime, for example default behaviours such as batch sizes, consistency levels, etc. It is a common pattern to allow such parameters both in the `RuntimeConfig`, where they will act as global defaults, and in specific methods (`index`, `find`, etc.), where they will act as local overrides. -**Important**: Every `RuntimeConfig` needs to contain a `default_column_config` field. 
-This is a dictionary that, for each possible column type in your database, defines a default configuration for that column type. -This will automatically be passed to a `_ColumnInfo` whenever a user does not manually specify a configuration for that column. +!!! note + Every `RuntimeConfig` needs to contain a `default_column_config` field. + This is a dictionary that, for each possible column type in your database, defines a default configuration for that column type. + This will automatically be passed to a `_ColumnInfo` whenever a user does not manually specify a configuration for that column. -For example, in the `MyDoc` schema above, the `tensor` `_ColumnInfo` would have a default configuration specified for `np.ndarray` columns. + For example, in the `MyDoc` schema above, the `tensor` `_ColumnInfo` would have a default configuration specified for `np.ndarray` columns. What is actually contained in these type-dependant configurations is up to you (and database specific). For example, for `np.ndarray` columns you could define the configurations `index_type` and `metric_type`, @@ -319,12 +332,16 @@ It is probably best to see this in action, so you should check out the `HnswDocu After you've done the basic setup above, you can jump into the good stuff: implementing the actual indexing, searching, and deleting. In general, the following is true: + - For every method that you need to implement, there is a public variant (e.g. `index`) and a private variant (e.g. `_index`) -- You should usually implement the private variant, which is called by the already implemented public variant. This should make your life easier, because some preprocessing and data normalization will already be done for you. +- You should usually implement the private variant, which is called by the already-implemented public variant. This should make your life easier, because some preprocessing and data normalization will already be done for you. 
- You can, however, also implement the public variant directly, if you want to do something special.
-  - **Caution**: While this is a perfectly fine thing to do, it might create more maintenance work for you in the future, because the public variant defined in the `BaseDocIndex` might change in the future, and you will have to update your implementation accordingly.
+
+!!! warning
+    While implementing the public variant directly is a perfectly fine thing to do, it may create more maintenance work for you, because the public variant defined in the `BaseDocIndex` might change in the future, and you will have to update your implementation accordingly.

Further:
+
- You don't absolutely have to implement everything. If a feature (e.g. `text_search`) is not supported by your backend, just raise a `NotImplementedError` in the corresponding method.
- Many methods come in a "singular" variant (e.g. `find`) and a "batched" variant (e.g. `find_batched`).
-- The "singular" variant expects a single input, be it an ANN query, a text query, a filter, etc., and return matches and scores for that single input
+- The "singular" variant expects a single input, be it an ANN query, a text query, a filter, etc., and returns matches and scores for that single input
@@ -338,9 +355,13 @@ The details of each method should become clear from the docstrings and type hint

### The `python_type_to_db_type()` method

-This method is slightly special, because 1) it is not exposed to the user, and 2) you absolutely have to implement it.
+This method is slightly special, because
+
+1. It is not exposed to the user
+2. You absolutely have to implement it
+
+It is intended to take a type of a field in the store's schema (e.g. `AbstractTensor` for `tensor`), and return the corresponding type in the database (e.g. `np.ndarray`).

-It is intended to do the following: It takes a type of a field in the store's schema (e.g. `AbstractTensor` for `tensor`), and returns the corresponding type in the database (e.g. `np.ndarray`).
The `BaseDocIndex` class uses this information to create and populate the `_ColumnInfo`s in `self._column_infos`.
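As an illustration, a toy `python_type_to_db_type` for a hypothetical SQL-like backend might look as follows (the column-type names are invented; real mappings are backend-specific):

```python
def python_type_to_db_type(python_type):
    # Hypothetical backend: map schema field types to database column types.
    # Order matters: bool is checked before int because bool subclasses int.
    type_map = {
        str: 'varchar',
        bool: 'boolean',
        int: 'integer',
        float: 'float64',
        bytes: 'blob',
    }
    for py_type, db_type in type_map.items():
        if issubclass(python_type, py_type):
            return db_type
    return 'blob'  # fall back to a raw-bytes column for unknown types
```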
If the user wants to change the default behaviour, one can set the db type by using the `col_type` field:

@@ -351,31 +372,33 @@
class MySchema(BaseDoc):
    my_num: float = Field(..., col_type='float64')
    my_text: str = Field(..., col_type='varchar', max_len=2048)
```

-In this case, the db type of `my_num` will be `'float64'` and the db type of `my_text` will be `'varchar'`.
-Additional information regarding the col_type, such as `max_len` for `varchar` will be stored in the `_ColumnsInfo.config`.
-The given col_type has to be a valid db type, meaning that has to be described in the index's `RuntimeConfig.default_column_config`.
+In this case, the `db_type` of `my_num` will be `'float64'` and the `db_type` of `my_text` will be `'varchar'`.
+Additional information regarding the `col_type`, such as `max_len` for `varchar`, will be stored in `_ColumnInfo.config`.
+The given `col_type` has to be a valid `db_type`, meaning it has to be described in the index's `RuntimeConfig.default_column_config`.

### The `_index()` method

-When indexing Documents, your implementation should behave in the following way:
+When indexing documents, your implementation should behave in the following way:

- Every field in the Document is mapped to a column in the database
- This includes the `id` field, which is mapped to the primary key of the database (if your backend has such a concept)
- The configuration of that column can be found in `self._column_infos[field_name].config`
-- In DocArray v1, we used to store a serialized representation of every Document. This is not needed anymore, as every row in your DB table should fully represent a single indexed Document.
+- In DocArray v1, we used to store a serialized representation of every document. This is not needed anymore, as every row in your database table should fully represent a single indexed document.

-To handle nested Documents, the public `index()` method already flattens every incoming Document for you.
+To handle nested documents, the public `index()` method already flattens every incoming document for you. This means that `_index()` already receives a flattened representation of the data, and you don't need to worry about that. Concretely, the `_index()` method takes as input a dictionary of column names to column data, flattened out. -**Note:** If you (or your backend) prefer to do bulk indexing on row-wise data, then you can use the `self._transpose_col_value_dict()` -helper method. Inside of `_index()` you can use this to transform `column_to_data` into a row-wise view of the data. + +!!! note + If you (or your backend) prefer to do bulk indexing on row-wise data, then you can use the `self._transpose_col_value_dict()` + helper method. Inside of `_index()` you can use this to transform `column_to_data` into a row-wise view of the data. **If your backend has native nesting capabilities:** You can also ignore most of the above, and implement the public `index()` method directly. That way you have full control over whether the input data gets flattened or not. **The `.id` field:** Every Document has an `.id` field, which is intended to act as a unique identifier or primary key -in your backend, if such a concepts exists in your case. In your implementation you can assume that `.id`s are **unique** and **non-empty**. +in your backend, if such a concept exists in your case. In your implementation you can assume that `.id`s are **unique** and **non-empty**. (Strictly speaking, this uniqueness property is not guaranteed, since a user could override the auto-generated `.id` field with a custom value. If your implementation encounters a duplicate `.id`, it is okay to fail and raise an Exception.) @@ -405,10 +428,11 @@ class QueryBuilder(BaseDocIndex.QueryBuilder): ``` The Query Builder exposes the following interface: + - The same query related methods as the `BaseDocIndex` class (e.g. 
`filter`, `find`, `text_search`, and their batched variants) - The `build()` method -The goal of it is to enable an interface for composing coplex queries, like this: +Its goal is to enable an interface for composing complex queries, like this: ```python index = MyDocumentIndex[MyDoc]() @@ -430,12 +454,13 @@ could eagerly build intermediate queries at every call. No matter what you do, you should stick to one design principle: **Every call to `find`, `filter`, `text_search` etc. should return a new instance of the Query Builder**, with updated state. -**If your backend does not support all operations:** -Most backends do not support compositions of all query operations, which is completely fine. -If that is the case, you should handle that in the following way: -- If an operation **is** supported by the Document Index that you are implementing, but **is not** supported by the Query Builder, you should use the pre-defined `_raise_not_composable()` helper method to raise a `NotImplementedError`. -- If an operation **is not** supported by the Document Index that you are implementing, and **is not** supported by the Query Builder, you should use the pre-defined `_raise_not_supported()` helper method to raise a `NotImplementedError`. -- If an operation **is** supported by the Document Index that you are implementing, and **is** supported by the Query Builder, but **is not** supported in combination with a certain other operation, you should raise a `RuntimeError`. Depending on how your Query Builder is set up, you might want to do that either eagerly during the conflicting method call, or lazily inside of `.build()`. +!!! note "If your backend does not support all operations" + Most backends do not support compositions of all query operations, which is completely fine. 
+ If that is the case, you should handle that in the following way: + + - If an operation **is** supported by the Document Index that you are implementing, but **is not** supported by the Query Builder, you should use the pre-defined `_raise_not_composable()` helper method to raise a `NotImplementedError`. + - If an operation **is not** supported by the Document Index that you are implementing, and **is not** supported by the Query Builder, you should use the pre-defined `_raise_not_supported()` helper method to raise a `NotImplementedError`. + - If an operation **is** supported by the Document Index that you are implementing, and **is** supported by the Query Builder, but **is not** supported in combination with a certain other operation, you should raise a `RuntimeError`. Depending on how your Query Builder is set up, you might want to do that either eagerly during the conflicting method call, or lazily inside of `.build()`. ### Implement the `build()` method diff --git a/docs/how_to/multimodal_training_and_serving.md b/docs/how_to/multimodal_training_and_serving.md index 805d17f9680..0886eb2b572 100644 --- a/docs/how_to/multimodal_training_and_serving.md +++ b/docs/how_to/multimodal_training_and_serving.md @@ -14,17 +14,17 @@ jupyter: # Multimodal deep learning with DocArray -DocArray is a library for representing, sending, and storing multi-modal data that can be used for a variety of different +DocArray is a library for representing, sending, and storing multimodal data that can be used for a variety of different use cases. -Here we will focus on a workflow familiar to many ML Engineers: Building and training a model, and then serving it to +Here we will focus on a workflow familiar to many ML engineers: Building and training a model, and then serving it to users. -This notebook contains two parts: +This document contains two parts: -1. **Representing**: We will use DocArray to represent multi-modal data while **building and training a PyTorch model**. 
-We will see how DocArray can help to organize and group your modalities and tensors and make clear what methods expect as inputs and return as outputs.
-2. **Sending**: We will take the model that we built and trained in part 1, and **serve it using FastAPI**.
+1. **Representing**: We will use DocArray to represent multimodal data while **building and training a PyTorch model**.
+We will see how DocArray can help to organize and group your modalities and tensors and make clear what methods expect as inputs and return as outputs.
+2. **Sending**: We will take the model that we built and trained in part one, and **serve it using FastAPI**.
We will see how DocArray narrows the gap between model development and model deployment, and how the same data models can be reused in both contexts.
That part will be very short, but that's the point!
@@ -32,15 +32,16 @@ So without further ado, let's dive into it!

## 1. Representing: Build and train a PyTorch model

-We will train a [CLIP](https://arxiv.org/abs/2103.00020)-like model on a dataset composes of text-image-pairs.
-The goal is to obtain a model that is able to understand both text and images and project them into a common embedding space.
+We will train a [CLIP](https://arxiv.org/abs/2103.00020)-like model on a dataset composed of text-image pairs.
+The goal is to obtain a model that can understand both text and images and project them into a common embedding space.

-We train the CLIP-like model on the [flickr8k](https://www.kaggle.com/datasets/adityajn105/flickr8k) dataset.
-To run this notebook you need to download and unzip the data into the same folder as the notebook.
+We train the CLIP-like model on the [Flickr8k](https://www.kaggle.com/datasets/adityajn105/flickr8k) dataset.
+To run this, you need to download and unzip the data into the same folder as your code.
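For intuition about the training objective, here is a dependency-free sketch of the symmetric contrastive (CLIP-style) loss; plain Python lists stand in for batches of text and image embeddings, and a real implementation would use torch operations instead:

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _norm(u):
    return math.sqrt(_dot(u, u)) or 1.0

def clip_loss(text_embs, image_embs, temperature=1.0):
    """Symmetric cross-entropy over the text-image cosine-similarity matrix."""
    sims = [
        [_dot(t, i) / (_norm(t) * _norm(i) * temperature) for i in image_embs]
        for t in text_embs
    ]

    def cross_entropy(rows):
        # the k-th row should put its highest score at column k
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    cols = [list(c) for c in zip(*sims)]
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))
```

A well-trained model drives the loss down by making matching text-image pairs more similar than mismatched ones.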
-Note that in this notebook by no means we aim at reproduce any CLIP results (our dataset is way too small anyways),
-but we rather want to show how DocArray datastructures help researchers and practitioners to write beautiful and
-pythonic multi-modal PyTorch code.
+!!! note
+    In this tutorial we do not aim to reproduce any CLIP results (our dataset is way too small anyway),
+    but rather we want to show how DocArray data structures help researchers and practitioners write beautiful and
+    Pythonic multimodal PyTorch code.
 
 ```bash
 #!pip install "docarray[torch,image]"
 ```
@@ -71,23 +72,23 @@ DEVICE = "cuda:0"  # change to your favourite device
 ```
 
-## Create the Documents for handling the Muti-Modal data
+### Create documents for handling multimodal data
 
-The first thing we are trying to achieve when using DocArray is to clearly model our data so that we never get confused
-about which tensors are supposed to represent what.
+The first thing we want to achieve when using DocArray is to clearly model our data so that we never get confused
+about which tensors represent what.
 
-To do that we are using a concept that is at the core of DocArray. The `Document`, a collection of multi-modal data.
-The `BaseDoc` class allows users to define their own (nested, multi-modal) Document schema to represent any kind of complex data.
+To do that we are using a concept that is at the core of DocArray: the document, a collection of multimodal data.
+The `BaseDoc` class allows users to define their own (nested, multimodal) document schema to represent any kind of complex data.
 
-Let's start by defining a few Documents to handle the different modalities that we will use during our training:
+Let's start by defining a few documents to handle the different modalities that we will use during our training:
 
 ```python
 from docarray import BaseDoc, DocList
 from docarray.typing import TorchTensor, ImageUrl
 ```
 
-Let's first create a Document for our Text modality. 
It will contain a number of `Tokens`, which we also define:
+Let's first create a document for our Text modality. It will contain a number of `Tokens`, which we also define:
 
 ```python
 from docarray.documents import TextDoc as BaseText
@@ -102,11 +103,12 @@ class Tokens(BaseDoc):
 class Text(BaseText):
     tokens: Optional[Tokens]
 ```
-Notice the [`TorchTensor`][docarray.typing.TorchTensor] type. It is a thin wrapper around `torch.Tensor` that can be use like any other torch tensor,
+
+Notice the [`TorchTensor`][docarray.typing.TorchTensor] type. It is a thin wrapper around `torch.Tensor` that can be used like any other Torch tensor,
 but also enables additional features. One such feature is shape parametrization (`TorchTensor[48]`), which lets you
 hint and even enforce the desired shape of any tensor!
 
-To represent our image data, we use the [`ImageDoc`][docarray.documents.ImageDoc] that is included in DocArray:
+To represent our image data, we use DocArray's [`ImageDoc`][docarray.documents.ImageDoc]:
 
 ```python
 from docarray.documents import ImageDoc
@@ -123,9 +125,9 @@ class ImageDoc(BaseDoc):
 ```
 
 Actually, the `BaseText` above also already includes `tensor`, `url` and `embedding` fields, so we can use those on our
-`Text` Document as well.
+`Text` document as well.
 
-The final Document used for training here is the `PairTextImage`, which simply combines the Text and Image modalities:
+The final document used for training here is the `PairTextImage`, which simply combines the Text and Image modalities:
 
 ```python
 class PairTextImage(BaseDoc):
@@ -133,10 +135,9 @@ class PairTextImage(BaseDoc):
     image: ImageDoc
 ```
 
-## Create the Dataset
-
+### Create the dataset
 
-In this section we will create a multi-modal pytorch dataset around the Flick8k dataset using DocArray.
+In this section we will create a multimodal PyTorch dataset around the Flickr8k dataset using DocArray.
 We will use DocArray's data loading functionality to load the data and use Torchvision and Transformers to preprocess
 the data before feeding it to our deep learning model:
@@ -191,7 +192,7 @@ def get_flickr8k_da(file: str = "captions.txt", N: Optional[int] = None):
     return da
 ```
 
-In the `get_flickr8k_da` method we process the Flickr8k dataset into a `DocList`.
+In the `get_flickr8k_da` method we process the [Flickr8k](https://www.kaggle.com/datasets/adityajn105/flickr8k) dataset into a `DocList`.
 
 Now let's instantiate this dataset using the [`MultiModalDataset`][docarray.data.MultiModalDataset] class.
 The constructor takes in the `da` and a dictionary of preprocessing transformations:
@@ -214,10 +215,9 @@ loader = DataLoader(
 )
 ```
 
-## Create the Pytorch model that works on DocArray
-
+### Create the PyTorch model that works on DocArray
 
-In this section we create two encoders, one per modality (Text and Image). These encoders are normal PyTorch `nn.Module`s.
+In this section we will create two encoders, one per modality (Text and Image). These encoders are normal PyTorch `nn.Module`s.
-The only difference is that they operate on `DocList` rather that on torch.Tensor:
+The only difference is that they operate on `DocList` rather than on `torch.Tensor`:
 
 ```python
@@ -243,7 +243,6 @@ class TextEncoder(nn.Module):
 
 The `TextEncoder` takes a `DocList` of `TextDoc`s as input, and returns an embedding `TorchTensor` as output.
 `DocList` can be seen as a list of `TextDoc` documents, and the encoder will treat it as one batch.
-
 ```python
 class VisionEncoder(nn.Module):
     def __init__(self):
@@ -257,7 +256,7 @@ class VisionEncoder(nn.Module):
 ```
 
 Similarly, the `VisionEncoder` also takes a `DocList` of `ImageDoc`s as input, and returns an embedding `TorchTensor` as output.
-However, it operates on the `tensor` attribute of each Document.
+However, it operates on the `tensor` attribute of each document.
 Now we can instantiate our encoders:
 
 ```python
@@ -266,10 +265,9 @@ vision_encoder = VisionEncoder().to(DEVICE)
 text_encoder = TextEncoder().to(DEVICE)
 ```
 
-As you can see, DocArray helps us to clearly convey what data is expected as input and output for each method, all through Python type hints.
-
-## Train the model in a contrastive way between Text and Image (CLIP)
+As you can see, DocArray helps us clearly convey what data is expected as input and output for each method, all through Python type hints.
 
+### Train the model in a contrastive way between Text and Image (CLIP)
 
-Now that we have defined our dataloader and our models, we can train the two encoders is a contrastive way.
+Now that we have defined our dataloader and our models, we can train the two encoders in a contrastive way.
 The goal is to match the representation of the text and the image for each pair in the dataset.
@@ -301,7 +299,7 @@ In the type hints of `cosine_sim` and `clip_loss` you can again notice that we c
 num_epoch = 1  # here you should do more epochs to really learn something
 ```
 
-One things to notice here is that our dataloader does not return a `torch.Tensor` but a `DocList[PairTextImage]`,
+One thing to notice here is that our dataloader does not return a `torch.Tensor` but a `DocList[PairTextImage]`,
 which is exactly what our model can operate on.
 
 So let's write a training loop and train our encoders:
@@ -327,20 +325,19 @@ with torch.autocast(device_type="cuda", dtype=torch.float16):
         optim.step()
 ```
 
-Here we can see how we can immediately group the output of each encoder with the Document (and modality) it belong to.
+Here we see how we can immediately group the output of each encoder with the document (and modality) it belongs to.
 
-And with all that, we've successfully trained a CLIP-like model without ever being confused the meaning of any tensors!
+And with all that, we've successfully trained a CLIP-like model without ever getting confused about the meaning of any tensors!
 
 ## 2. 
Sending: Serve the model using FastAPI Now that we have a trained CLIP model, let's see how we can serve this model with a REST API by reusing most of the code above. -Let's use our beloved [FastAPI](https://fastapi.tiangolo.com/) for that! - +Let's use [FastAPI](https://fastapi.tiangolo.com/) for that! FastAPI is powerful because it allows you to define your Rest API data schema in pure Python. -And DocArray is fully compatible with FastAPI and Pydantic, which means that as long as you have a function that takes a Document as input, -FastAPI will be able to automatically translate it into a fully fledged API with documentation, openAPI specification and more: +And DocArray is fully compatible with FastAPI and Pydantic, which means that as long as you have a function that takes a document as input, +FastAPI will be able to automatically translate it into a fully fledged API with documentation, OpenAPI specification and more: ```python from fastapi import FastAPI @@ -374,7 +371,7 @@ async def embed_text(doc: Text) -> Text: return doc ``` -You can see that our earlier definition of the `Text` Document now doubles as the API schema for the `/embed_text` endpoint. +You can see that our earlier definition of the `Text` document now doubles as the API schema for the `/embed_text` endpoint. With this running, we can query our model over the network: @@ -402,4 +399,4 @@ doc_resp = Text.parse_raw(response.content.decode()) doc_resp.embedding.shape ``` -And we're done! You have trained and served a mulit-modal ML model, with zero headache and a lot of DocArray! +And we're done! You have trained and served a multimodal ML model, with zero headaches and a lot of DocArray! 
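The contrastive objective that the tutorial above computes with its `cosine_sim` and `clip_loss` helpers is framework-agnostic. As a minimal sketch — the function names mirror the tutorial, but this NumPy implementation is an assumption, not the tutorial's actual PyTorch code — the symmetric CLIP loss can be written as:

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Pairwise cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def clip_loss(text_emb: np.ndarray, image_emb: np.ndarray, temperature: float = 0.07) -> float:
    # Symmetric cross-entropy over the similarity matrix:
    # the i-th text should match the i-th image, and vice versa.
    logits = cosine_sim(text_emb, image_emb) / temperature
    n = logits.shape[0]

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # stabilize before softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[np.arange(n), np.arange(n)].mean())

    return (xent(logits) + xent(logits.T)) / 2


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 32))
loss_mismatched = clip_loss(text_emb, rng.normal(size=(4, 32)))
loss_matched = clip_loss(text_emb, text_emb)  # perfectly aligned pairs
print(loss_matched, loss_mismatched)
```

Matched pairs along the diagonal drive the loss toward zero; in the tutorial the same objective is applied to the embeddings produced by `TextEncoder` and `VisionEncoder`.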
diff --git a/docs/how_to/optimize_performance_with_id_generation.md b/docs/how_to/optimize_performance_with_id_generation.md index 5d0df78e776..7893f3c7950 100644 --- a/docs/how_to/optimize_performance_with_id_generation.md +++ b/docs/how_to/optimize_performance_with_id_generation.md @@ -1,9 +1,9 @@ # Optimize performance -### `BaseDoc`'s id +### `BaseDoc`'s `id` DocArray's `BaseDoc` has an optional `id` field, which defaults to `ID(os.urandom(16).hex())`. This takes quite some time. -If you don't rely on the id anywhere, you can instead set the default to None. This increases the performance by a factor of approximately 1.4. +If you don't rely on the `id` anywhere, you can instead set the default to `None`. This increases the performance by a factor of approximately 1.4: ```python from docarray import BaseDoc @@ -15,7 +15,7 @@ class MyDoc(BaseDoc): title: str ``` -Since the `BaseDoc.id` is optional, you could also set the value to None, but this turns out to be a bit less efficient than the option above, and increases the performance by a factor of approximately 1.2. 
+Since `BaseDoc.id` is optional, you could also set the value to `None`, but this turns out to be a bit less efficient than the option above, and increases performance by a factor of approximately 1.2: ```python class MyDoc2(BaseDoc): diff --git a/docs/user_guide/sending/api/fastAPI.md b/docs/user_guide/sending/api/fastAPI.md index d35308fefce..039af9b5aca 100644 --- a/docs/user_guide/sending/api/fastAPI.md +++ b/docs/user_guide/sending/api/fastAPI.md @@ -10,8 +10,9 @@ and provide a seamless and efficient way to work with multimodal data in FastAPI pip install fastapi ``` +## Define schemas -First, you should define schemas for your input and/or output Documents: +First, you should define schemas for your input and/or output documents: ```python from docarray import BaseDoc from docarray.documents import ImageDoc @@ -27,7 +28,10 @@ class OutputDoc(BaseDoc): embedding_bert: NdArray ``` -Afterwards, you can use your Documents with FastAPI: +## Use documents with FastAPI + +After creating your schemas, you can use your documents with FastAPI: + ```python import numpy as np from fastapi import FastAPI @@ -56,7 +60,7 @@ async with AsyncClient(app=app, base_url="http://test") as ac: doc = OutputDoc.parse_raw(response.content.decode()) ``` -The big advantage here is **first-class support for ML centric data**, such as {Torch, TF, ...}Tensor, Embedding, etc. +The big advantage here is **first-class support for ML centric data**, such as `TorchTensor`, `TensorFlowTensor`, `Embedding`, etc. This includes handy features such as validating the shape of a tensor: @@ -92,11 +96,11 @@ Image( ``` -Further, you can send and receive lists of Documents represented as a `DocArray` object: +Further, you can send and receive lists of documents represented as a `DocList` object: !!! note - Currently, `FastAPI` receives `DocArray` objects as lists, so you have to construct a DocArray inside the function. 
- Also, if you want to return a `DocArray` object, first you have to convert it to a list. + Currently, `FastAPI` receives `DocList` objects as lists, so you have to construct a DocList inside the function. + Also, if you want to return a `DocList` object, first you have to convert it to a list. (Shown in the example below) ```python @@ -106,12 +110,12 @@ import numpy as np from fastapi import FastAPI from httpx import AsyncClient -from docarray import DocArray +from docarray import DocList from docarray.base_doc import DocArrayResponse from docarray.documents import TextDoc # Create a docarray -docs = DocArray[TextDoc]([TextDoc(text='first'), TextDoc(text='second')]) +docs = DocList[TextDoc]([TextDoc(text='first'), TextDoc(text='second')]) app = FastAPI() @@ -120,14 +124,14 @@ app = FastAPI() @app.post("/doc/", response_class=DocArrayResponse) async def create_embeddings(docs: List[TextDoc]) -> List[TextDoc]: # The docs FastAPI will receive will be treated as List[TextDoc] - # so you need to cast it to DocArray - docs = DocArray[TextDoc].construct(docs) + # so you need to cast it to DocList + docs = DocList[TextDoc].construct(docs) # Embed docs for doc in docs: doc.embedding = np.zeros((3, 224, 224)) - # Return your DocArray as a list + # Return your DocList as a list return list(docs) @@ -136,5 +140,5 @@ async with AsyncClient(app=app, base_url="http://test") as ac: assert response.status_code == 200 # You can read FastAPI's response in the following way -docs = DocArray[TextDoc].from_json(response.content.decode()) +docs = DocList[TextDoc].from_json(response.content.decode()) ``` diff --git a/docs/user_guide/sending/api/jina.md b/docs/user_guide/sending/api/jina.md index cbdf50acd2a..b2de99d12c3 100644 --- a/docs/user_guide/sending/api/jina.md +++ b/docs/user_guide/sending/api/jina.md @@ -1,17 +1,15 @@ # Jina -# Create an audio to text app with Jina and DocArray V2 - -This is how you can build an Audio to Text app using Jina, DocArray and Whisper. 
+In this example we'll build an audio-to-text app using [Jina](https://docs.jina.ai/), DocArray and [Whisper](https://openai.com/research/whisper).
 
 We will use:
 
-* DocArray V2: Helps us to load and preprocess multimodal data such as image, text and audio in our case
-* Jina: Helps us serve the model quickly and create a client
+* DocArray V2: To load and preprocess multimodal data such as image, text and audio.
+* Jina: To serve the model quickly and create a client.
 
-First let's install requirements
+## Install packages
 
-## 💾 Installation
+First let's install requirements:
 
 ```bash
 pip install transformers
@@ -19,8 +17,9 @@ pip install openai-whisper
 pip install jina
 ```
 
-Now let's import necessary libraries
+## Import libraries
 
+Let's import the necessary libraries:
 
 ```python
 import whisper
@@ -29,23 +28,27 @@ from docarray import BaseDoc, DocList
 from docarray.typing import AudioUrl
 ```
 
-Now we need to create the schema of our input and output documents. Since our input is an audio
-our input schema should contain an AudioUrl like the following
+## Create schemas
+
+Now we need to create the schema of our input and output documents. 
Since our input is an audio URL,
+our input schema should contain an `AudioUrl`:
 
 ```python
 class AudioURL(BaseDoc):
     audio: AudioUrl
 ```
 
-As for the output schema we would like to receive the transcribed text so we use the following:
+For the output schema we would like to receive the transcribed text:
 
 ```python
 class Response(BaseDoc):
     text: str
 ```
 
-Now it's time we create our model, we wrap our model into Jina Executor, this allows us to serve to model
-later on and expose its endpoint /transcribe
+## Create Executor
+
+To create our model, we wrap our model into a Jina [Executor](https://docs.jina.ai/concepts/serving/executor/), allowing us to serve the model
+later and expose the endpoint `/transcribe`:
 
 ```python
 class WhisperExecutor(Executor):
@@ -59,23 +62,33 @@ class WhisperExecutor(Executor):
         for doc in docs:
             transcribed_text = self.model.transcribe(str(doc.audio))['text']
             response_docs.append(Response(text=transcribed_text))
+
         return response_docs
 ```
 
-Now we can leverage Deployment object provided by Jina to use this executor
-then we send a request to transcribe endpoint. Here we are using an audio file previously recorded
-that says, "A Man reading a book" saved under resources/audio.mp3 but feel free to use your own audio.
+## Deploy Executor and get results
+
+Now we can leverage Jina's [Deployment object](https://docs.jina.ai/concepts/orchestration/deployment/) to deploy this Executor, then send a request to the `/transcribe` endpoint.
+
+Here we are using an audio file that says, "A man reading a book", saved as `resources/audio.mp3`:
 
 ```python
-with Deployment(
+dep = Deployment(
     uses=WhisperExecutor, uses_with={'device': "cpu"}, port=12349, timeout_ready=-1
-) as d:
+)
+
+with dep:
-    docs = d.post(
+    docs = dep.post(
         on='/transcribe',
         inputs=[AudioURL(audio='resources/audio.mp3')],
         return_type=DocList[Response],
     )
-    print(docs[0].text)
+
+print(docs[0].text)
 ```
 
-And we get the transcribed result! 
+And we get the transcribed result: + +```text +A man reading a book +``` diff --git a/docs/user_guide/sending/first_step.md b/docs/user_guide/sending/first_step.md index 6e2d2608943..c296e45d577 100644 --- a/docs/user_guide/sending/first_step.md +++ b/docs/user_guide/sending/first_step.md @@ -1,12 +1,9 @@ -# Intro +# Introduction In the representation section we saw how to use [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] -to represent multi-modal data. In this section we will see **how to send these data over the wire**. +to represent multi-modal data. In this section we will see **how to send such data over the wire**. +This section is divided into two parts: -This section is divided into two: - -- [Serialization](./ser/send_doc.md) of [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] +- [Serializing](./ser/send_doc.md) [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] - [Using DocArray with a web framework to build a multimodal API](./api/jina.md) - - diff --git a/docs/user_guide/sending/ser/send_doc.md b/docs/user_guide/sending/ser/send_doc.md index dd77557dbba..b239d9c3399 100644 --- a/docs/user_guide/sending/ser/send_doc.md +++ b/docs/user_guide/sending/ser/send_doc.md @@ -5,10 +5,10 @@ You need to serialize a [BaseDoc][docarray.base_doc.doc.BaseDoc] before you can !!! note [BaseDoc][docarray.base_doc.doc.BaseDoc] supports serialization to `protobuf` and `json` formats. 
-## Serialization to protobuf +## JSON -You can use [`to_protobuf`][docarray.base_doc.mixins.io.IOMixin.to_protobuf] to serialize a [BaseDoc][docarray.base_doc.doc.BaseDoc] to a protobuf message object -and use [`from_protobuf`][docarray.base_doc.mixins.io.IOMixin.from_protobuf] to deserialize it. +- [`json`][docarray.base_doc.doc.BaseDoc.json] serializes a [`BaseDoc`][docarray.base_doc.doc.BaseDoc] to a JSON string. +- [`parse_raw`][docarray.base_doc.doc.BaseDoc.parse_raw] deserializes a [`BaseDoc`][docarray.base_doc.doc.BaseDoc] from a JSON string. ```python from typing import List @@ -21,15 +21,15 @@ class MyDoc(BaseDoc): doc = MyDoc(text='hello world', tags=['hello', 'world']) -proto_message = doc.to_protobuf() -new_doc = MyDoc.from_protobuf(proto_message) +json_str = doc.json() +new_doc = MyDoc.parse_raw(json_str) assert doc == new_doc # True ``` -## Serialization to JSON +## protobuf -You can use [`json`][docarray.base_doc.doc.BaseDoc.json] to serialize a [BaseDoc][docarray.base_doc.doc.BaseDoc] to a json string -and use [`parse_raw`][docarray.base_doc.doc.BaseDoc.parse_raw] to deserialize it. +- [`to_protobuf`][docarray.base_doc.mixins.io.IOMixin.to_protobuf] serializes a [`BaseDoc`][docarray.base_doc.doc.BaseDoc] to a `protobuf` message object. +- [`from_protobuf`][docarray.base_doc.mixins.io.IOMixin.from_protobuf] deserializes a [`BaseDoc`][docarray.base_doc.doc.BaseDoc] from a `protobuf` object. 
```python from typing import List @@ -42,14 +42,13 @@ class MyDoc(BaseDoc): doc = MyDoc(text='hello world', tags=['hello', 'world']) -json_str = doc.json() -new_doc = MyDoc.parse_raw(json_str) +proto_message = doc.to_protobuf() +new_doc = MyDoc.from_protobuf(proto_message) assert doc == new_doc # True ``` See also: -* The serializing [DocList](./send_doclist.md) section -* The serializing [DocVec](./send_docvec.md) section - +* The serializing [`DocList`](./send_doclist.md) section +* The serializing [`DocVec`](./send_docvec.md) section diff --git a/docs/user_guide/sending/ser/send_doclist.md b/docs/user_guide/sending/ser/send_doclist.md index 70b1789ca5f..31d8ec919f2 100644 --- a/docs/user_guide/sending/ser/send_doclist.md +++ b/docs/user_guide/sending/ser/send_doclist.md @@ -1,8 +1,11 @@ # DocList -When sending or storing [`DocList`][docarray.array.doc_list.doc_list.DocList], you need to use serialization. [DocList][docarray.array.doc_list.doc_list.DocList] supports multiple ways to serialize the data. + +When sending or storing [`DocList`][docarray.array.doc_list.doc_list.DocList], you need to use serialization. [`DocList`][docarray.array.doc_list.doc_list.DocList] supports multiple ways to serialize the data. ## JSON -You can use [`to_json()`][docarray.array.doc_list.io.IOMixinArray.to_json] and [`from_json()`][docarray.array.doc_list.io.IOMixinArray.from_json] to serialize and deserialize a [DocList][docarray.array.doc_list.doc_list.DocList]: + +- [`to_json()`][docarray.array.doc_list.io.IOMixinArray.to_json] serializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] to JSON. It returns the binary representation of the JSON object. +- [`from_json()`][docarray.array.doc_list.io.IOMixinArray.from_json] deserializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] from JSON. It can load from either a `str` or `binary` representation of the JSON object. 
```python from docarray import BaseDoc, DocList @@ -24,14 +27,14 @@ with open('simple-dl.json', 'r') as f: print(dl_load_from_json) ``` -[to_json()][docarray.array.doc_list.io.IOMixinArray.to_json] returns the binary representation of the json object. [from_json()][docarray.array.doc_list.io.IOMixinArray.from_json] can load from either `str` or `binary` representation of the json object. - ```output b'[{"id":"5540e72d407ae81abb2390e9249ed066","text":"doc 0"},{"id":"fbe9f80d2fa03571e899a2887af1ac1b","text":"doc 1"}]' ``` -## Protobuf -To serialize a DocList with `protobuf`, you can use [`to_protobuf()`][docarray.array.doc_list.io.IOMixinArray.to_protobuf] and [`from_protobuf()`][docarray.array.doc_list.io.IOMixinArray.from_protobuf] to serialize and deserialize a [DocList][docarray.array.doc_list.doc_list.DocList]: +## protobuf + +- [`to_protobuf()`][docarray.array.doc_list.io.IOMixinArray.to_protobuf] serializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] to `protobuf`. It returns a `protobuf` object of `docarray_pb2.DocListProto` class. +- [`from_protobuf()`][docarray.array.doc_list.io.IOMixinArray.from_protobuf] deserializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] from `protobuf`. It accepts a `protobuf` message object to construct a [`DocList`][docarray.array.doc_list.doc_list.DocList]. ```python from docarray import BaseDoc, DocList @@ -49,16 +52,15 @@ print(type(proto_message_dl)) print(dl_from_proto) ``` -[to_protobuf()][docarray.array.doc_list.io.IOMixinArray.to_protobuf] returns a protobuf object of `docarray_pb2.DocListProto` class. [from_protobuf()][docarray.array.doc_list.io.IOMixinArray.from_protobuf] accepts a protobuf message object to construct a [DocList][docarray.array.doc_list.doc_list.DocList]. - ## Base64 -When transferring over the network, you can choose `Base64` format to serialize the [`DocList`][docarray.array.doc_list.doc_list.DocList]. 
-Serializing a [DocList][docarray.array.doc_list.doc_list.DocList] in Base64 supports both `pickle` and `protobuf` protocols. Besides, you can choose different compression methods.
-To serialize a [DocList][docarray.array.doc_list.doc_list.DocList] in Base64, you can use [`to_base64()`][docarray.array.doc_list.io.IOMixinArray.to_base64] and [`from_base64()`][docarray.array.doc_list.io.IOMixinArray.from_protobuf] to serialize and deserialize a [DocList][docarray.array.doc_list.doc_list.DocList]:
+When transferring data over the network, use `Base64` format to serialize the [`DocList`][docarray.array.doc_list.doc_list.DocList].
+Serializing a [`DocList`][docarray.array.doc_list.doc_list.DocList] in Base64 supports both the `pickle` and `protobuf` protocols. You can also choose different compression methods.
 
-We support multiple compression methods. (namely : `lz4`, `bz2`, `lzma`, `zlib`, `gzip`)
+- [`to_base64()`][docarray.array.doc_list.io.IOMixinArray.to_base64] serializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] to Base64.
+- [`from_base64()`][docarray.array.doc_list.io.IOMixinArray.from_base64] deserializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] from Base64.
 
+You can use multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -78,9 +80,11 @@ dl_from_base64 = DocList[SimpleDoc].from_base64(
 ```
 
 ## Binary
-Similar to `Base64` serialization, `Binary` serialization also supports different protocols and compression methods. 
-To save a [DocList][docarray.array.doc_list.doc_list.DocList] into a binary file, you can use [`save_binary()`][docarray.array.doc_list.io.IOMixinArray.to_base64] and [`load_binary()`][docarray.array.doc_list.io.IOMixinArray.from_protobuf] to serialize and deserialize a [DocList][docarray.array.doc_list.doc_list.DocList]:
+
+- [`save_binary()`][docarray.array.doc_list.io.IOMixinArray.save_binary] saves a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a binary file.
+- [`load_binary()`][docarray.array.doc_list.io.IOMixinArray.load_binary] loads a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a binary file.
+
+You can use multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -99,10 +103,20 @@ dl_from_binary = DocList[SimpleDoc].load_binary(
 )
 ```
 
-The [DocList][docarray.array.doc_list.doc_list.DocList] is stored at `simple-dl.pickle` file.
+In the above snippet, the [`DocList`][docarray.array.doc_list.doc_list.DocList] is stored as the file `simple-dl.pickle`.
 
 ### Bytes
-Under the hood, [save_binary()][docarray.array.doc_list.io.IOMixinArray.to_base64] prepares the file object and calls [to_bytes()][docarray.array.doc_list.io.IOMixinArray.to_bytes] function to convert the [DocList][docarray.array.doc_list.doc_list.DocList] into a byte object. You can use [to_bytes()][docarray.array.doc_list.io.IOMixinArray.to_bytes] function directly and use [from_bytes()][docarray.array.doc_list.io.IOMixinArray.from_bytes] to load the [DocList][docarray.array.doc_list.doc_list.DocList] from a byte object. You can use `protocol` to choose between `pickle` and `protobuf`. Besides, [to_bytes()][docarray.array.doc_list.io.IOMixinArray.to_bytes] and [save_binary()][docarray.array.doc_list.io.IOMixinArray.save_binary] support multiple options for `compress` as well. 
+
+- [to_bytes()][docarray.array.doc_list.io.IOMixinArray.to_bytes] saves a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a byte object.
+- [from_bytes()][docarray.array.doc_list.io.IOMixinArray.from_bytes] loads a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a byte object.
+
+!!! note
+    These methods are used under the hood by [save_binary()][docarray.array.doc_list.io.IOMixinArray.save_binary] and [`load_binary()`][docarray.array.doc_list.io.IOMixinArray.load_binary] to write to and read from a binary file. You can also use them directly to work with byte objects.
+
+As when working with binary files:
+
+- You can use `protocol` to choose between `pickle` and `protobuf`.
+- You can use multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.
 
 ```python
 from docarray import BaseDoc, DocList
@@ -121,9 +135,12 @@ dl_from_bytes = DocList[SimpleDoc].from_bytes(
 )
 ```
 
-
 ## CSV
-You can use [`from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv] and [`to_csv()`][docarray.array.doc_list.io.IOMixinArray.to_csv] to de-/serializae and deserialize the [DocList][docarray.array.doc_list.doc_list.DocList] from/to a CSV file. Use the `dialect` parameter to choose the dialect of the CSV format:
+
+- [`to_csv()`][docarray.array.doc_list.io.IOMixinArray.to_csv] serializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a CSV file.
+- [`from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv] deserializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a CSV file. 
+ +Use the `dialect` parameter to choose the [dialect of the CSV format](https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters): ```python from docarray import BaseDoc, DocList @@ -140,9 +157,10 @@ dl_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv') print(dl_from_csv) ``` - ## Pandas.Dataframe -You can use [`from_dataframe()`][docarray.array.doc_list.io.IOMixinArray.from_dataframe] and [`to_dataframe()`][docarray.array.doc_list.io.IOMixinArray.to_dataframe] to load/save the [DocList][docarray.array.doc_list.doc_list.DocList] from/to a pandas DataFrame: + +- [`from_dataframe()`][docarray.array.doc_list.io.IOMixinArray.from_dataframe] loads a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). +- [`to_dataframe()`][docarray.array.doc_list.io.IOMixinArray.to_dataframe] saves a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). ```python from docarray import BaseDoc, DocList @@ -161,5 +179,5 @@ print(dl_from_dataframe) See also: -* The serializing [BaseDoc](./send_doc.md) section -* The serializing [DocVec](./send_docvec.md) section +* The serializing [`BaseDoc`](./send_doc.md) section +* The serializing [`DocVec`](./send_docvec.md) section diff --git a/docs/user_guide/sending/ser/send_docvec.md b/docs/user_guide/sending/ser/send_docvec.md index 3fbaf759075..f313b9a7f1c 100644 --- a/docs/user_guide/sending/ser/send_docvec.md +++ b/docs/user_guide/sending/ser/send_docvec.md @@ -1,7 +1,14 @@ # DocVec -When sending or storing [`DocVec`][docarray.array.doc_list.doc_list.DocVec], you need to use serialization. [DocVec][docarray.array.doc_list.doc_list.DocVec] only supports protobuf to serialize the data. 
-You can use [`to_protobuf`][docarray.array.doc_list.doc_list.DocVec.to_protobuf] and [`from_protobuf`][docarray.array.doc_list.doc_list.DocVec.from_protobuf] to serialize and deserialize a [DocVec][docarray.array.doc_list.doc_list.DocVec] +When sending or storing [`DocVec`][docarray.array.doc_list.doc_list.DocVec], you need to use protobuf serialization. + +!!! note + We plan to add more serialization formats in the future, notably JSON. + +## protobuf + +- [`to_protobuf`][docarray.array.doc_list.doc_list.DocVec.to_protobuf] serializes a [DocVec][docarray.array.doc_list.doc_list.DocVec] to `protobuf`. It returns a `protobuf` object of `docarray_pb2.DocVecProto` class. +- [`from_protobuf`][docarray.array.doc_list.doc_list.DocVec.from_protobuf] deserializes a [DocVec][docarray.array.doc_list.doc_list.DocVec] from `protobuf`. It accepts a protobuf message object to construct a [DocVec][docarray.array.doc_list.doc_list.DocVec]. ```python import numpy as np @@ -21,10 +28,7 @@ proto_message_dv = dv.to_protobuf() dv_from_proto = DocVec[SimpleVecDoc].from_protobuf(proto_message_dv) ``` -!!! note - We are planning to add more serialization formats in the future, notably JSON. - -[`to_protobuf`][docarray.array.doc_list.doc_list.DocVec.to_protobuf] returns a protobuf object of `docarray_pb2.DocVecProto` class. [`from_protobuf`][docarray.array.doc_list.doc_list.DocVec.from_protobuf] accepts a protobuf message object to construct a [DocVec][docarray.array.doc_list.doc_list.DocVec]. 
+## See also -* The serializing [BaseDoc](./send_doc.md) section -* The serializing [DocList](./send_doclist.md) section +* The serializing [`BaseDoc`](./send_doc.md) section +* The serializing [`DocList`](./send_doclist.md) section From e8f4b1dcda32b1c4ae0f5a36afb6bfd2dd48affe Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:04:55 +0200 Subject: [PATCH 04/10] docs: menu item names Signed-off-by: Alex C-G --- mkdocs.yml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 100a4da336a..0edb5ad7aea 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -79,12 +79,12 @@ plugins: nav: - Home: README.md - - Tutorial/User Guide: + - User Guide: - user_guide/intro.md - Representing data: - user_guide/representing/first_step.md - user_guide/representing/array.md - - Sending: + - Sending data: - user_guide/sending/first_step.md - Serialization: - user_guide/sending/ser/send_doc.md @@ -93,7 +93,7 @@ nav: - Building API: - user_guide/sending/api/jina.md - user_guide/sending/api/fastAPI.md - - Storing: + - Storing data: - user_guide/storing/first_step.md - DocStore: - user_guide/storing/doc_store/store_file.md From aeea27a9d36eaee3d7016e178fc0236bb350945b Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:13:11 +0200 Subject: [PATCH 05/10] docs: rename api_references to better name Signed-off-by: Alex C-G --- docs/{api_references => API_reference}/.pages | 0 docs/{api_references => API_reference}/array/any_da.md | 0 docs/{api_references => API_reference}/array/da.md | 0 docs/{api_references => API_reference}/array/da_stack.md | 0 docs/{api_references => API_reference}/base_doc/base_doc.md | 0 docs/{api_references => API_reference}/data/data.md | 0 .../doc_index/backends/elastic.md | 0 .../doc_index/backends/elastic7.md | 0 .../doc_index/backends/hnswlib.md | 0 .../doc_index/backends/qdrant.md | 0 .../doc_index/backends/weaviate.md | 0 docs/{api_references => API_reference}/doc_index/doc_index.md | 0 
docs/{api_references => API_reference}/doc_store/doc_store.md | 0 .../{api_references => API_reference}/doc_store/file_doc_store.md | 0 docs/{api_references => API_reference}/doc_store/jac_doc_store.md | 0 docs/{api_references => API_reference}/doc_store/s3_doc_store.md | 0 docs/{api_references => API_reference}/documents/documents.md | 0 docs/{api_references => API_reference}/typing/bytes.md | 0 docs/{api_references => API_reference}/typing/id.md | 0 docs/{api_references => API_reference}/typing/tensor/audio.md | 0 docs/{api_references => API_reference}/typing/tensor/embedding.md | 0 docs/{api_references => API_reference}/typing/tensor/image.md | 0 docs/{api_references => API_reference}/typing/tensor/tensor.md | 0 docs/{api_references => API_reference}/typing/tensor/video.md | 0 docs/{api_references => API_reference}/typing/url.md | 0 docs/{api_references => API_reference}/utils/filter.md | 0 docs/{api_references => API_reference}/utils/find.md | 0 docs/{api_references => API_reference}/utils/maps_docs.md | 0 docs/{api_references => API_reference}/utils/reduce.md | 0 29 files changed, 0 insertions(+), 0 deletions(-) rename docs/{api_references => API_reference}/.pages (100%) rename docs/{api_references => API_reference}/array/any_da.md (100%) rename docs/{api_references => API_reference}/array/da.md (100%) rename docs/{api_references => API_reference}/array/da_stack.md (100%) rename docs/{api_references => API_reference}/base_doc/base_doc.md (100%) rename docs/{api_references => API_reference}/data/data.md (100%) rename docs/{api_references => API_reference}/doc_index/backends/elastic.md (100%) rename docs/{api_references => API_reference}/doc_index/backends/elastic7.md (100%) rename docs/{api_references => API_reference}/doc_index/backends/hnswlib.md (100%) rename docs/{api_references => API_reference}/doc_index/backends/qdrant.md (100%) rename docs/{api_references => API_reference}/doc_index/backends/weaviate.md (100%) rename docs/{api_references => 
API_reference}/doc_index/doc_index.md (100%) rename docs/{api_references => API_reference}/doc_store/doc_store.md (100%) rename docs/{api_references => API_reference}/doc_store/file_doc_store.md (100%) rename docs/{api_references => API_reference}/doc_store/jac_doc_store.md (100%) rename docs/{api_references => API_reference}/doc_store/s3_doc_store.md (100%) rename docs/{api_references => API_reference}/documents/documents.md (100%) rename docs/{api_references => API_reference}/typing/bytes.md (100%) rename docs/{api_references => API_reference}/typing/id.md (100%) rename docs/{api_references => API_reference}/typing/tensor/audio.md (100%) rename docs/{api_references => API_reference}/typing/tensor/embedding.md (100%) rename docs/{api_references => API_reference}/typing/tensor/image.md (100%) rename docs/{api_references => API_reference}/typing/tensor/tensor.md (100%) rename docs/{api_references => API_reference}/typing/tensor/video.md (100%) rename docs/{api_references => API_reference}/typing/url.md (100%) rename docs/{api_references => API_reference}/utils/filter.md (100%) rename docs/{api_references => API_reference}/utils/find.md (100%) rename docs/{api_references => API_reference}/utils/maps_docs.md (100%) rename docs/{api_references => API_reference}/utils/reduce.md (100%) diff --git a/docs/api_references/.pages b/docs/API_reference/.pages similarity index 100% rename from docs/api_references/.pages rename to docs/API_reference/.pages diff --git a/docs/api_references/array/any_da.md b/docs/API_reference/array/any_da.md similarity index 100% rename from docs/api_references/array/any_da.md rename to docs/API_reference/array/any_da.md diff --git a/docs/api_references/array/da.md b/docs/API_reference/array/da.md similarity index 100% rename from docs/api_references/array/da.md rename to docs/API_reference/array/da.md diff --git a/docs/api_references/array/da_stack.md b/docs/API_reference/array/da_stack.md similarity index 100% rename from 
docs/api_references/array/da_stack.md rename to docs/API_reference/array/da_stack.md diff --git a/docs/api_references/base_doc/base_doc.md b/docs/API_reference/base_doc/base_doc.md similarity index 100% rename from docs/api_references/base_doc/base_doc.md rename to docs/API_reference/base_doc/base_doc.md diff --git a/docs/api_references/data/data.md b/docs/API_reference/data/data.md similarity index 100% rename from docs/api_references/data/data.md rename to docs/API_reference/data/data.md diff --git a/docs/api_references/doc_index/backends/elastic.md b/docs/API_reference/doc_index/backends/elastic.md similarity index 100% rename from docs/api_references/doc_index/backends/elastic.md rename to docs/API_reference/doc_index/backends/elastic.md diff --git a/docs/api_references/doc_index/backends/elastic7.md b/docs/API_reference/doc_index/backends/elastic7.md similarity index 100% rename from docs/api_references/doc_index/backends/elastic7.md rename to docs/API_reference/doc_index/backends/elastic7.md diff --git a/docs/api_references/doc_index/backends/hnswlib.md b/docs/API_reference/doc_index/backends/hnswlib.md similarity index 100% rename from docs/api_references/doc_index/backends/hnswlib.md rename to docs/API_reference/doc_index/backends/hnswlib.md diff --git a/docs/api_references/doc_index/backends/qdrant.md b/docs/API_reference/doc_index/backends/qdrant.md similarity index 100% rename from docs/api_references/doc_index/backends/qdrant.md rename to docs/API_reference/doc_index/backends/qdrant.md diff --git a/docs/api_references/doc_index/backends/weaviate.md b/docs/API_reference/doc_index/backends/weaviate.md similarity index 100% rename from docs/api_references/doc_index/backends/weaviate.md rename to docs/API_reference/doc_index/backends/weaviate.md diff --git a/docs/api_references/doc_index/doc_index.md b/docs/API_reference/doc_index/doc_index.md similarity index 100% rename from docs/api_references/doc_index/doc_index.md rename to 
docs/API_reference/doc_index/doc_index.md diff --git a/docs/api_references/doc_store/doc_store.md b/docs/API_reference/doc_store/doc_store.md similarity index 100% rename from docs/api_references/doc_store/doc_store.md rename to docs/API_reference/doc_store/doc_store.md diff --git a/docs/api_references/doc_store/file_doc_store.md b/docs/API_reference/doc_store/file_doc_store.md similarity index 100% rename from docs/api_references/doc_store/file_doc_store.md rename to docs/API_reference/doc_store/file_doc_store.md diff --git a/docs/api_references/doc_store/jac_doc_store.md b/docs/API_reference/doc_store/jac_doc_store.md similarity index 100% rename from docs/api_references/doc_store/jac_doc_store.md rename to docs/API_reference/doc_store/jac_doc_store.md diff --git a/docs/api_references/doc_store/s3_doc_store.md b/docs/API_reference/doc_store/s3_doc_store.md similarity index 100% rename from docs/api_references/doc_store/s3_doc_store.md rename to docs/API_reference/doc_store/s3_doc_store.md diff --git a/docs/api_references/documents/documents.md b/docs/API_reference/documents/documents.md similarity index 100% rename from docs/api_references/documents/documents.md rename to docs/API_reference/documents/documents.md diff --git a/docs/api_references/typing/bytes.md b/docs/API_reference/typing/bytes.md similarity index 100% rename from docs/api_references/typing/bytes.md rename to docs/API_reference/typing/bytes.md diff --git a/docs/api_references/typing/id.md b/docs/API_reference/typing/id.md similarity index 100% rename from docs/api_references/typing/id.md rename to docs/API_reference/typing/id.md diff --git a/docs/api_references/typing/tensor/audio.md b/docs/API_reference/typing/tensor/audio.md similarity index 100% rename from docs/api_references/typing/tensor/audio.md rename to docs/API_reference/typing/tensor/audio.md diff --git a/docs/api_references/typing/tensor/embedding.md b/docs/API_reference/typing/tensor/embedding.md similarity index 100% rename from 
docs/api_references/typing/tensor/embedding.md rename to docs/API_reference/typing/tensor/embedding.md diff --git a/docs/api_references/typing/tensor/image.md b/docs/API_reference/typing/tensor/image.md similarity index 100% rename from docs/api_references/typing/tensor/image.md rename to docs/API_reference/typing/tensor/image.md diff --git a/docs/api_references/typing/tensor/tensor.md b/docs/API_reference/typing/tensor/tensor.md similarity index 100% rename from docs/api_references/typing/tensor/tensor.md rename to docs/API_reference/typing/tensor/tensor.md diff --git a/docs/api_references/typing/tensor/video.md b/docs/API_reference/typing/tensor/video.md similarity index 100% rename from docs/api_references/typing/tensor/video.md rename to docs/API_reference/typing/tensor/video.md diff --git a/docs/api_references/typing/url.md b/docs/API_reference/typing/url.md similarity index 100% rename from docs/api_references/typing/url.md rename to docs/API_reference/typing/url.md diff --git a/docs/api_references/utils/filter.md b/docs/API_reference/utils/filter.md similarity index 100% rename from docs/api_references/utils/filter.md rename to docs/API_reference/utils/filter.md diff --git a/docs/api_references/utils/find.md b/docs/API_reference/utils/find.md similarity index 100% rename from docs/api_references/utils/find.md rename to docs/API_reference/utils/find.md diff --git a/docs/api_references/utils/maps_docs.md b/docs/API_reference/utils/maps_docs.md similarity index 100% rename from docs/api_references/utils/maps_docs.md rename to docs/API_reference/utils/maps_docs.md diff --git a/docs/api_references/utils/reduce.md b/docs/API_reference/utils/reduce.md similarity index 100% rename from docs/api_references/utils/reduce.md rename to docs/API_reference/utils/reduce.md From a248c38ee0fb9ccb359a024cbe90ed846f2fc902 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:21:51 +0200 Subject: [PATCH 06/10] docs(glossary): fixes Signed-off-by: Alex C-G --- 
docs/glossary.md | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index b6810c9d25c..c131bae5b5c 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,14 +1,17 @@ # Glossary -DocArray's scope is at the edge of different fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation. +DocArray's scope covers several fields, from AI to web apps. To make it easier to understand, we have created a glossary of terms used in the documentation. -## Concept +## Concepts ### `Multimodal Data` -Multimodal data is data that is composed of different modalities, like Image, Text, Video, Audio, etc. -For example, a YouTube video is composed of a video, a title, a description, a thumbnail, etc. -Actually, most of the data we have in the world is multimodal. +Multimodal data is data that is composed of different modalities, like image, text, video, audio, etc. + +Actually, most of the data we have in the world is multimodal, for example: + +- Newspaper pages are made up of headline, author byline, image, text, etc. +- YouTube videos are made up of a video, title, description, thumbnail, etc. ### `Multimodal AI` @@ -16,19 +19,19 @@ Multimodal AI is the field of AI that focuses on multimodal data. Most of the recent breakthroughs in AI are multimodal AI. -* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [DALL-E 2](https://openai.com/product/dall-e-2) generate *images* from *text*. +* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F) and [DALL-E 2](https://openai.com/product/dall-e-2) generate *images* from *text*. * [Whisper](https://openai.com/research/whisper) generates *text* from *speech*. 
 * [GPT-4](https://openai.com/product/gpt-4) and [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) are MLLMs (Multimodal Large Language Models) that understand both *text* and *images*.
 
-One of the reasons that AI labs are focusing on multimodal AI is that it can solve a lot of practical problems and that it actually might be
-a requirement to build a strong AI system as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/) where he stated that "a system trained on language alone will never approximate human intelligence."
+Many AI labs are focusing on multimodal AI because it can solve a lot of practical problems, and because it might actually be
+a requirement for strong AI systems (as argued by Yann Lecun in [this article](https://www.noemamag.com/ai-and-the-limits-of-language/), where he states that "a system trained on language alone will never approximate human intelligence.")
 
 ### `Generative AI`
 
 Generative AI is also in the epicenter of the latest AI revolution. These tools allow us to *generate* data.
 
-* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*.
-* LLM: Large Language Model, (GPT, Flan, LLama, Bloom). These models generate *text*.
+* [StableDiffusion](https://stability.ai/blog/stable-diffusion-public-release), [MidJourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), and [Dalle-2](https://openai.com/product/dall-e-2) generate *images* from *text*.
+* LLMs: Large Language Models (GPT, Flan, LLaMA, Bloom). These models generate *text*.
 
 ### `Neural Search`
 
@@ -42,9 +45,9 @@ A vector database is a specialized storage system designed to handle high-dimens
 
 ### `Jina`
 
-[Jina](https://jina.ai) is a framework to build multimodal applications.
It relies heavily on DocArray to represent and send data.
+[Jina](https://github.com/jina-ai/jina/) is a framework for building multimodal applications. It relies heavily on DocArray to represent and send data.
 
-DocArray was originally part of Jina but it became a standalone project that is now independent of Jina.
+DocArray was originally part of Jina but it is now a standalone project independent of Jina.
 
 ### `Pydantic`
 
@@ -53,14 +56,12 @@ DocArray relies on Pydantic.
 
 ### `FastAPI`
 
-[FastAPI](https://fastapi.tiangolo.com/) is a Python library that allows building API using Python type hints.
-
-It is built on top of Pydantic and nicely extends to DocArray.
+[FastAPI](https://fastapi.tiangolo.com/) is a Python library that allows building APIs using Python type hints. It is built on top of Pydantic and nicely extends to DocArray.
 
 ### `Weaviate`
 
 [Weaviate](https://weaviate.io/) is an open-source vector database that is supported in DocArray.
 
-### `Weaviate`
+### `Qdrant`
 
 [Qdrant](https://qdrant.tech/) is an open-source vector database that is supported in DocArray.

From da10da33dd6005f002aa7eb4a90477e60539bc97 Mon Sep 17 00:00:00 2001
From: Alex C-G
Date: Mon, 17 Apr 2023 17:39:54 +0200
Subject: [PATCH 07/10] docs(readme): fixes

Signed-off-by: Alex C-G
---
 docs/README.md | 807 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 807 insertions(+)
 create mode 100644 docs/README.md

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 00000000000..d31579acbc5
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,807 @@
+

+DocArray logo
+
+The data structure for multimodal data
+
+[PyPI version, codecov, and PyPI downloads badges]
+
+
+> ⬆️ **DocArray v2**: This readme is for the second version of DocArray (starting at 0.30). If you want to use the older
+> DocArray version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch
+
+
+DocArray is a library for **representing, sending and storing multi-modal data**, perfect for **Machine Learning applications**.
+
+These are the three pillars of DocArray, and you can check them out individually:
+
+1. [**Represent**](#represent)
+2. [**Send**](#send)
+3. [**Store**](#store)
+
+DocArray handles your data while integrating seamlessly with the rest of your **Python and ML ecosystem**:
+
+- :fire: DocArray has native compatibility with **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)** and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, including for **model training use cases**
+- :zap: DocArray is built on **[Pydantic](https://github.com/pydantic/pydantic)** and out-of-the-box compatible with **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)**
+- :package: DocArray can index data in vector databases such as **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)** as well as **[HNSWLib](https://github.com/nmslib/hnswlib)**
+- :chains: DocArray data can be sent as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)**
+
+
+> :bulb: **Where are you coming from?** Depending on your use case and background, there are different ways to "get" DocArray.
+> You can navigate to the following section for an explanation that should fit your mindset: +> +> - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch) +> - [Coming from Pydantic](#coming-from-pydantic) +> - [Coming from FastAPI](#coming-from-fastapi) +> - [Coming from a vector database](#coming-from-vector-database) + +DocArray was released under the open-source [Apache License 2.0](https://github.com/docarray/docarray/blob/main/LICENSE) in January 2022. It is currently a sandbox project under [LF AI & Data Foundation](https://lfaidata.foundation/). + +## Represent + +DocArray allows you to **represent your data**, in an ML-native way. + +This is useful for different use cases: + +- :woman_running: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them +- :cloud: You are **serving a model**, for example through FastAPI, and you want to specify your API endpoints +- :card_index_dividers: You are **parsing data** for later use in your ML or DS applications + +> :bulb: **Coming from Pydantic?** If you're currently using Pydantic for the use cases above, you should be happy to hear +> that DocArray is built on top of, and fully compatible with, Pydantic! +> Also, we have [dedicated section](#coming-from-pydantic) just for you! + +Put simply, DocArray lets you represent your data in a dataclass-like way, with ML as a first class citizen: + +```python +from docarray import BaseDoc +from docarray.typing import TorchTensor, ImageUrl +import torch + + +# Define your data model +class MyDocument(BaseDoc): + description: str + image_url: ImageUrl # could also be VideoUrl, AudioUrl, etc. + image_tensor: TorchTensor[1704, 2272, 3] # you can express tensor shapes! 
+ + +# Stack multiple documents in a Document Vector +from docarray import DocVec + +vec = DocVec[MyDocument]( + [ + MyDocument( + description="A cat", + image_url="https://example.com/cat.jpg", + image_tensor=torch.rand(1704, 2272, 3), + ), + ] + * 10 +) +print(vec.image_tensor.shape) # (10, 1704, 2272, 3) +``` + +
+ Click for more details + +So let's take a closer look at how you can represent your data with DocArray: + +```python +from docarray import BaseDoc +from docarray.typing import TorchTensor, ImageUrl +from typing import Optional +import torch + + +# Define your data model +class MyDocument(BaseDoc): + description: str + image_url: ImageUrl # could also be VideoUrl, AudioUrl, etc. + image_tensor: Optional[ + TorchTensor[1704, 2272, 3] + ] # could also be NdArray or TensorflowTensor + embedding: Optional[TorchTensor] +``` + +So not only can you define the types of your data, you can even **specify the shape of your tensors!** + +Once you have your model in form of a `Document`, you can work with it! + +```python +# Create a document +doc = MyDocument( + description="This is a photo of a mountain", + image_url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", +) + +# Load image tensor from URL +doc.image_tensor = doc.image_url.load() + +# Compute embedding with any model of your choice + + +def clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor: # dummy function + return torch.rand(512) + + +doc.embedding = clip_image_encoder(doc.image_tensor) + +print(doc.embedding.shape) # torch.Size([512]) +``` + +### Compose nested Documents + +Of course you can compose Documents into a nested structure: + +```python +from docarray import BaseDoc +from docarray.documents import ImageDoc, TextDoc +import numpy as np + + +class MultiModalDocument(BaseDoc): + image_doc: ImageDoc + text_doc: TextDoc + + +doc = MultiModalDocument( + image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!') +) +``` + +Of course, you rarely work with a single data point at a time, especially in Machine Learning applications. + +That's why you can easily collect multiple `Documents`: + +### Collect multiple `Documents` + +When building or interacting with an ML system, usually you want to process multiple Documents (data points) at once. 
+ +DocArray offers two data structures for this: + +- **`DocVec`**: A vector of `Documents`. All tensors in the `Documents` are stacked up into a single tensor. **Perfect for batch processing and use inside of ML models**. +- **`DocList`**: A list of `Documents`. All tensors in the `Documents` are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**. + +Let's take a look at them, starting with `DocVec`: + +```python +from docarray import DocVec, BaseDoc +from docarray.typing import AnyTensor, ImageUrl +import numpy as np + + +class Image(BaseDoc): + url: ImageUrl + tensor: AnyTensor # this allows torch, numpy, and tensor flow tensors + + +vec = DocVec[Image]( # the DocVec is parametrized by your personal schema! + [ + Image( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + ) + for _ in range(100) + ] +) +``` + +As you can see in the code snippet above, `DocVec` is **parametrized by the type of Document** you want to use with it: `DocVec[Image]`. + +This may look slightly weird at first, but we're confident that you'll get used to it quickly! +Besides, it allows us to do some cool things, like giving you **bulk access to the fields that you defined** in your `Document`: + +```python +tensor = vec.tensor # gets all the tensors in the DocVec +print(tensor.shape) # which are stacked up into a single tensor! +print(vec.url) # you can bulk access any other field, too +``` + +The second data structure, `DocList`, works in a similar way: + +```python +from docarray import DocList + +dl = DocList[Image]( # the DocList is parametrized by your personal schema! 
+ [ + Image( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + ) + for _ in range(100) + ] +) +``` + +You can still bulk access the fields of your `Document`: + +```python +tensors = dl.tensor # gets all the tensors in the DocList +print(type(tensors)) # as a list of tensors +print(dl.url) # you can bulk access any other field, too +``` + +And you can insert, remove, and append `Documents` to your `DocList`: + +```python +# append +dl.append( + Image( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + ) +) +# delete +del dl[0] +# insert +dl.insert( + 0, + Image( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + ), +) +``` + +And you can seamlessly switch between `DocVec` and `DocList`: + +```python +vec_2 = dl.to_doc_vec() +assert isinstance(vec_2, DocVec) + +dl_2 = vec_2.to_doc_list() +assert isinstance(dl_2, DocList) +``` + +
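
The `DocList`/`DocVec` distinction above boils down to row-oriented versus column-oriented storage. Here is a toy, dependency-free sketch of that idea (plain Python for illustration only, not DocArray's actual implementation):

```python
from dataclasses import dataclass


@dataclass
class Doc:
    description: str
    pixels: list  # stand-in for an image tensor


# DocList-style: a plain list of documents (row-oriented)
docs = [Doc(description=f"doc {i}", pixels=[float(i)] * 4) for i in range(3)]
print([d.description for d in docs])  # per-document access

# DocVec-style: the same data grouped by field (column-oriented),
# one bulk object per field instead of one object per document
stacked = {
    "description": [d.description for d in docs],
    "pixels": [d.pixels for d in docs],  # a real DocVec stacks these into one tensor
}
print(len(stacked["pixels"]), len(stacked["pixels"][0]))  # 3 4
```

A real `DocVec` goes further: stacked tensor fields become a single batched tensor, which is what makes bulk access like `vec.tensor` cheap.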
+ +## Send + +DocArray allows you to **send your data**, in an ML-native way. + +This means there is native support for **Protobuf and gRPC**, on top of **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes. + +This is useful for different use cases: + +- :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)** +- :spider_web: You **distribute your model** across machines and need to send your data between nodes +- :gear: You are building a **microservice** architecture and need to send your data between microservices + +> :bulb: **Coming from FastAPI?** If you're currently using FastAPI for the use cases above, you should be happy to hear +> that DocArray is fully compatible with FastAPI! +> Also, we have [dedicated section](#coming-from-fastapi) just for you! + +Whenever you want to send your data you need to serialize it, so let's take a look at how that works with DocArray: + +```python +from docarray import BaseDoc +from docarray.typing import ImageTorchTensor +import torch + + +# model your data +class MyDocument(BaseDoc): + description: str + image: ImageTorchTensor[3, 224, 224] + + +# create a Document +doc = MyDocument( + description="This is a description", + image=torch.zeros((3, 224, 224)), +) + +# serialize it! +proto = doc.to_protobuf() +bytes_ = doc.to_bytes() +json = doc.json() + +# deserialize it! +doc_2 = MyDocument.from_protobuf(proto) +doc_4 = MyDocument.from_bytes(bytes_) +doc_5 = MyDocument.parse_raw(json) +``` + +Of course, serialization is not all you need. +So check out how DocArray integrates with FastAPI and Jina. + + +## Store + +Once you've modelled your data, and maybe sent it around, usually you want to **store it** somewhere. +But fret not! DocArray has you covered! 
+ +**Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface: + +- :cd: **On disk** as a file in your local file system +- :bucket: On **[AWS S3](https://aws.amazon.com/de/s3/)** +- :cloud: On **[Jina AI Cloud](https://cloud.jina.ai/)** + +
+ See Document Store usage + +The Document Store interface lets you push and pull Documents to and from multiple data sources, all with the same user interface. + +For example, let's see how that works with on-disk storage: + +```python +from docarray import BaseDoc, DocList + + +class SimpleDoc(BaseDoc): + text: str + + +docs = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(8)]) +docs.push('file://simple_docs') + +docs_pull = DocList[SimpleDoc].pull('file://simple_docs') +``` +
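
The `file://` prefix in the example above selects the storage backend through the URI scheme. A rough standalone sketch of that dispatch idea (plain Python; only `file://` appears in the example above, so the `s3` and `jac` scheme strings here are assumptions for illustration):

```python
from urllib.parse import urlparse


def describe_store(uri: str) -> str:
    # The backend is encoded in the URI scheme, like 'file://simple_docs' above.
    backends = {"file": "local file system", "s3": "AWS S3", "jac": "Jina AI Cloud"}
    scheme = urlparse(uri).scheme
    if scheme not in backends:
        raise ValueError(f"unsupported document store scheme: {scheme!r}")
    return backends[scheme]


print(describe_store("file://simple_docs"))   # local file system
print(describe_store("s3://my-bucket/docs"))  # AWS S3
```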
+ +**Document Indexes** let you index your Documents into a **vector database**, for efficient similarity-based retrieval. + +This is useful for: + +- :left_speech_bubble: Augmenting **LLMs and Chatbots** with domain knowledge ([Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401)) +- :mag: **Neural search** applications +- :bulb: **Recommender systems** + +Currently, DocArray Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come! + +
+ See Document Index usage + +The Document Index interface lets you index and retrieve Documents from multiple vector databases, all with the same user interface. + +It supports ANN vector search, text search, filtering, and hybrid search. + +```python +from docarray import DocList, BaseDoc +from docarray.index import HnswDocumentIndex +import numpy as np + +from docarray.typing import ImageUrl, ImageTensor, NdArray + + +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: ImageTensor + embedding: NdArray[128] + + +# create some data +dl = DocList[ImageDoc]( + [ + ImageDoc( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + embedding=np.random.random((128,)), + ) + for _ in range(100) + ] +) + +# create a Document Index +index = HnswDocumentIndex[ImageDoc](work_dir='/tmp/test_index') + + +# index your data +index.index(dl) + +# find similar Documents +query = dl[0] +results, scores = index.find(query, limit=10, search_field='embedding') +``` + +
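
What `find` computes can be approximated by a brute-force scan: score every stored embedding against the query and sort. A dependency-free sketch of that ranking (cosine similarity in plain Python; a real index such as HNSWLib avoids the full scan with an approximate nearest-neighbor structure):

```python
import math


def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# toy "index": document id -> embedding
db = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}
query = [1.0, 0.1]

# rank documents by similarity to the query, most similar first
ranked = sorted(db, key=lambda k: cosine(db[k], query), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```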
+ +Depending on your background and use case, there are different ways for you to _get_ DocArray. +Choose your own adventure! + +## Coming from old DocArray + +
+
+  Click to expand
+
+If you are using DocArray v<0.30.0, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/).
+
+_DocArray v2 is that idea, taken seriously._ Every `Document` is created through a dataclass-like interface,
+courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/).
+
+This gives the following advantages:
+- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema
+- **Multi-modality:** Easily store multiple modalities and multiple embeddings in the same Document
+- **Language agnostic:** At its core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.
+
+You may also be familiar with our old Document Stores for vector DB integration.
+They are now called **Document Indexes** and offer the following improvements (see [here](#store) for the new API):
+- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields
+- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain
+- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client
+
+For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come.
+
+ +## Coming from Pydantic + +
+ Click to expand + +If you come from Pydantic, you can see DocArray Documents as juiced up Pydantic models, and DocArray as a collection of goodies around them. + +More specifically, we set out to **make Pydantic fit for the ML world** - not by replacing it, but by building on top of it! + +This means that you get the following benefits: +- **ML focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation** +- Full compatibility with **FastAPI** +- **DocList** and **DocVec** generalize the idea of a model to a _sequence_ or _batch_ of models. Perfect for **use in ML models** and other batch processing tasks. +- **Types that are alive**: ImageUrl can `.load()` a URL to image tensor, TextUrl can load and tokenize text documents, etc. +- Cloud-ready: Serialization to **Protobuf** for use with microservices and **gRPC** +- **Pre-built multi-modal Documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models! +- **Document Stores** and **Document Indexes** let you store your data and retrieve it using **vector search** + +The most obvious advantage here is **first-class support for ML centric data**, such as {Torch, TF, ...}Tensor, Embedding, etc. 

This includes handy features such as validating the shape of a tensor:

```python
from docarray import BaseDoc
from docarray.typing import TorchTensor
import torch


class MyDoc(BaseDoc):
    tensor: TorchTensor[3, 224, 224]


doc = MyDoc(tensor=torch.zeros(3, 224, 224))  # works
doc = MyDoc(tensor=torch.zeros(224, 224, 3))  # works by reshaping

try:
    doc = MyDoc(tensor=torch.zeros(224))  # fails validation
except Exception as e:
    print(e)
    # tensor
    # Cannot reshape tensor of shape (224,) to shape (3, 224, 224) (type=value_error)


class Image(BaseDoc):
    tensor: TorchTensor[3, 'x', 'x']


Image(tensor=torch.zeros(3, 224, 224))  # works

try:
    Image(
        tensor=torch.zeros(3, 64, 128)
    )  # fails validation because second dimension does not match third
except Exception as e:
    print(e)


try:
    Image(
        tensor=torch.zeros(4, 224, 224)
    )  # fails validation because of the first dimension
except Exception as e:
    print(e)
    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (4, 224, 224) (type=value_error)

try:
    Image(
        tensor=torch.zeros(3, 64)
    )  # fails validation because it does not have enough dimensions
except Exception as e:
    print(e)
    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (3, 64) (type=value_error)
```

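The reshape-or-fail behavior shown above can be summarized in a few lines of plain Python. This is a conceptual sketch of the rule for a fully fixed target shape — not DocArray's actual validator, which also handles symbolic dimensions like `'x'`: an exact match passes, an input with the same total number of elements is reshaped, and everything else is rejected.

```python
from math import prod


def validate_shape(shape, expected):
    # accept an exact match, reshape when element counts agree, else fail
    if tuple(shape) == tuple(expected):
        return tuple(shape)
    if prod(shape) == prod(expected):
        return tuple(expected)  # reshaping is possible
    raise ValueError(f"Cannot reshape tensor of shape {shape} to shape {expected}")


print(validate_shape((224, 224, 3), (3, 224, 224)))  # (3, 224, 224)

try:
    validate_shape((224,), (3, 224, 224))
except ValueError as e:
    print(e)  # Cannot reshape tensor of shape (224,) to shape (3, 224, 224)
```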
+ + +## Coming from PyTorch + +
 Click to expand

If you come from PyTorch, you can see DocArray mainly as a way of _organizing your data as it flows through your model_.

It offers you several advantages:
- Express **tensors shapes in type hints**
- **Group tensors that belong to the same object**, e.g. an audio track and an image
- **Go directly to deployment**, by re-using your data model as a [FastAPI](https://fastapi.tiangolo.com/) or [Jina](https://github.com/jina-ai/jina) API schema
- Connect model components between **microservices**, using Protobuf and gRPC

DocArray can be used directly inside ML models to handle and represent multi-modal data.
This allows you to reason about your data using DocArray's abstractions deep inside of `nn.Module`,
and provides a (FastAPI-compatible) schema that eases the transition between model training and model serving.

To see the effect of this, let's first observe a vanilla PyTorch implementation of a tri-modal ML model:

```python
import torch
from torch import nn


def encoder(x):
    return torch.rand(512)


class MyMultiModalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_encoder = encoder  # dummy encoders; in practice these would be nn.Modules
        self.image_encoder = encoder
        self.text_encoder = encoder

    def forward(self, text_1, text_2, image_1, image_2, audio_1, audio_2):
        embedding_text_1 = self.text_encoder(text_1)
        embedding_text_2 = self.text_encoder(text_2)

        embedding_image_1 = self.image_encoder(image_1)
        embedding_image_2 = self.image_encoder(image_2)

        embedding_audio_1 = self.audio_encoder(audio_1)
        embedding_audio_2 = self.audio_encoder(audio_2)

        return (
            embedding_text_1,
            embedding_text_2,
            embedding_image_1,
            embedding_image_2,
            embedding_audio_1,
            embedding_audio_2,
        )
```

Not very easy on the eyes if you ask us. 
And even worse, if you need to add one more modality you have to touch every part of your code base, changing the `forward()` return type and making a whole lot of changes downstream from that.

So, now let's see what the same code looks like with DocArray:

```python
from docarray import DocList, BaseDoc
from docarray.documents import ImageDoc, TextDoc, AudioDoc
from docarray.typing import TorchTensor
from torch import nn
import torch


def encoder(x):
    return torch.rand(512)


class Podcast(BaseDoc):
    text: TextDoc
    image: ImageDoc
    audio: AudioDoc


class PairPodcast(BaseDoc):
    left: Podcast
    right: Podcast


class MyPodcastModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_encoder = encoder  # dummy encoders; in practice these would be nn.Modules
        self.image_encoder = encoder
        self.text_encoder = encoder

    def forward_podcast(self, docs: DocList[Podcast]) -> DocList[Podcast]:
        docs.audio.embedding = self.audio_encoder(docs.audio.tensor)
        docs.text.embedding = self.text_encoder(docs.text.tensor)
        docs.image.embedding = self.image_encoder(docs.image.tensor)

        return docs

    def forward(self, docs: DocList[PairPodcast]) -> DocList[PairPodcast]:
        docs.left = self.forward_podcast(docs.left)
        docs.right = self.forward_podcast(docs.right)

        return docs
```

Looks much better, doesn't it?
You instantly win in code readability and maintainability. And for the same price you can turn your PyTorch model into a FastAPI app and reuse your Document
schema definition (see [below](#coming-from-fastapi)). Everything is handled in a pythonic manner by relying on type hints.

+ + +## Coming from TensorFlow + +
 Click to expand

Similar to the [PyTorch approach](#coming-from-pytorch), you can also use DocArray with TensorFlow to handle and represent multi-modal data inside your ML model.

First off, to use DocArray with TensorFlow we first need to install it as follows:

```
pip install tensorflow==2.11.0
pip install protobuf==3.19.0
```

Compared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow:\
While DocArray's `TorchTensor` is a subclass of `torch.Tensor`, this is not the case for the `TensorFlowTensor`: Due to some technical limitations of `tf.Tensor`, DocArray's `TensorFlowTensor` is not a subclass of `tf.Tensor` but rather stores a `tf.Tensor` in its `.tensor` attribute.

How does this affect you? Whenever you want to access the tensor data to, let's say, do operations with it or hand it to your ML model, instead of handing over your `TensorFlowTensor` instance, you need to access its `.tensor` attribute.

This would look like the following:

```python
from typing import Optional

from docarray import DocList, BaseDoc
from docarray.typing import AudioTensorFlowTensor

import tensorflow as tf


class Podcast(BaseDoc):
    audio_tensor: Optional[AudioTensorFlowTensor]
    embedding: Optional[AudioTensorFlowTensor]


class MyPodcastModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.audio_encoder = AudioEncoder()  # a placeholder audio-encoding layer

    def call(self, inputs: DocList[Podcast]) -> DocList[Podcast]:
        inputs.embedding = self.audio_encoder(
            inputs.audio_tensor.tensor
        )  # access audio_tensor's .tensor attribute
        return inputs
```

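The wrapper pattern described above — holding the framework tensor in a `.tensor` attribute instead of subclassing the tensor type — is easy to sketch in plain Python. This toy class is ours, not DocArray's `TensorFlowTensor`, but it shows why you hand `.tensor` (not the wrapper itself) to TensorFlow ops:

```python
class WrappedTensor:
    """Toy stand-in for a tensor type that cannot be subclassed."""

    def __init__(self, tensor):
        self.tensor = tensor  # the underlying framework tensor

    def __repr__(self):
        return f"WrappedTensor({self.tensor!r})"


wrapped = WrappedTensor([1.0, 2.0, 3.0])  # pretend this list is a tf.Tensor
# operations go through the wrapped object, not the wrapper itself:
total = sum(wrapped.tensor)
print(total)  # 6.0
```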
+ + +## Coming from FastAPI + +
+ Click to expand + +Documents are Pydantic Models (with a twist), and as such they are fully compatible with FastAPI! + +But why should you use them, and not the Pydantic models you already know and love? +Good question! +- Because of the ML-first features, types and validations, [here](#coming-from-pydantic) +- Because DocArray can act as an [ORM for vector databases](#coming-from-a-vector-database), similar to what SQLModel does for SQL databases + +And to seal the deal, let us show you how easily Documents slot into your FastAPI app: + +```python +import numpy as np +from fastapi import FastAPI +from httpx import AsyncClient + +from docarray import BaseDoc +from docarray.documents import ImageDoc +from docarray.typing import NdArray +from docarray.base_doc import DocArrayResponse + + +class InputDoc(BaseDoc): + img: ImageDoc + + +class OutputDoc(BaseDoc): + embedding_clip: NdArray + embedding_bert: NdArray + + +input_doc = InputDoc(img=ImageDoc(tensor=np.zeros((3, 224, 224)))) + +app = FastAPI() + + +@app.post("/doc/", response_model=OutputDoc, response_class=DocArrayResponse) +async def create_item(doc: InputDoc) -> OutputDoc: + ## call my fancy model to generate the embeddings + doc = OutputDoc( + embedding_clip=np.zeros((100, 1)), embedding_bert=np.zeros((100, 1)) + ) + return doc + + +async with AsyncClient(app=app, base_url="http://test") as ac: + response = await ac.post("/doc/", data=input_doc.json()) + resp_doc = await ac.get("/docs") + resp_redoc = await ac.get("/redoc") +``` + +Just like a vanilla Pydantic model! + +
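One practical note on the snippet above: top-level `await` only works in an async context such as a notebook or an async test. In a plain script you would wrap the client calls in a coroutine and drive it with `asyncio.run`. The stand-in coroutine below is ours — in real code its body would be the `async with AsyncClient(...)` block from the example:

```python
import asyncio


async def call_endpoint():
    # stand-in for `async with AsyncClient(app=app, ...) as ac: await ac.post(...)`
    await asyncio.sleep(0)  # pretend we awaited the HTTP round trip
    return {"status": 200}


result = asyncio.run(call_endpoint())
print(result)  # {'status': 200}
```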
+ + +## Coming from a vector database + +
+ Click to expand + +If you came across DocArray as a universal vector database client, you can best think of it as **a new kind of ORM for vector databases**. +DocArray's job is to take multi-modal, nested and domain-specific data and to map it to a vector database, +store it there, and thus make it searchable: + +```python +from docarray import DocList, BaseDoc +from docarray.index import HnswDocumentIndex +import numpy as np + +from docarray.typing import ImageUrl, ImageTensor, NdArray + + +class ImageDoc(BaseDoc): + url: ImageUrl + tensor: ImageTensor + embedding: NdArray[128] + + +# create some data +dl = DocList[ImageDoc]( + [ + ImageDoc( + url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", + tensor=np.zeros((3, 224, 224)), + embedding=np.random.random((128,)), + ) + for _ in range(100) + ] +) + +# create a Document Index +index = HnswDocumentIndex[ImageDoc](work_dir='/tmp/test_index2') + + +# index your data +index.index(dl) + +# find similar Documents +query = dl[0] +results, scores = index.find(query, limit=10, search_field='embedding') +``` + +Currently, DocArray supports the following vector databases: +- [Weaviate](https://www.weaviate.io/) +- [Qdrant](https://qdrant.tech/) +- [Elasticsearch](https://www.elastic.co/elasticsearch/) v8 and v7 +- [HNSWlib](https://github.com/nmslib/hnswlib) as a local-first alternative + +An integration of [OpenSearch](https://opensearch.org/) is currently in progress. + +Legacy versions of DocArray also support [Redis](https://redis.io/) and [Milvus](https://milvus.io/), but these are not yet supported in the current version. + +Of course this is only one thing that DocArray can do, so we encourage you to check out the rest of this readme! + +
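The "ORM for vector databases" framing can be made concrete with a toy example of the mapping step: a nested document schema has to be flattened into the flat field namespace that most vector stores expect. The `'__'` separator and the function below are illustrative assumptions, not DocArray's actual internals.

```python
def flatten_schema(schema, prefix=""):
    # map nested document fields onto flat column names, e.g. meta.author -> "meta__author"
    columns = {}
    for name, field_type in schema.items():
        key = prefix + name
        if isinstance(field_type, dict):  # a nested document
            columns.update(flatten_schema(field_type, prefix=key + "__"))
        else:
            columns[key] = field_type
    return columns


image_doc_schema = {
    "url": "str",
    "tensor": "ndarray",
    "meta": {"author": "str", "year": "int"},
}
print(flatten_schema(image_doc_schema))
# {'url': 'str', 'tensor': 'ndarray', 'meta__author': 'str', 'meta__year': 'int'}
```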
+ + +## Install the alpha + +To try out the alpha you can install it via git: + +```shell +pip install "git+https://github.com/docarray/docarray" +``` + +## See also + +- [Documentation](https://docarray-v2--jina-docs.netlify.app/) +- [Join our Discord server](https://discord.gg/WaMp6PVPgR) +- [Donation to Linux Foundation AI&Data blog post](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/) +- ["Legacy" DocArray github page](https://github.com/docarray/docarray/tree/docarray-v1-fixes) +- ["Legacy" DocArray documentation](https://docarray.jina.ai/) + +> DocArray is a trademark of LF AI Projects, LLC From f4a1bf2405116c42e05ec69c6fe7639d2fa9525f Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:41:02 +0200 Subject: [PATCH 08/10] docs(readme): remove alpha notice, update install, placeholder url for legacy docs Signed-off-by: Alex C-G --- README.md | 127 +++++++++++++++++++++++++----------------------------- 1 file changed, 59 insertions(+), 68 deletions(-) diff --git a/README.md b/README.md index d31579acbc5..567c7ea3329 100644 --- a/README.md +++ b/README.md @@ -13,27 +13,24 @@

> ⬆️ **DocArray v2**: This readme is for the second version of DocArray (starting at 0.30). If you want to use the older -> DocArray version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch - +> version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch DocArray is a library for **representing, sending and storing multi-modal data**, perfect for **Machine Learning applications**. -Those are the three pillars of DocArray, and you can check them out individually: +With DocArray you can: -1. [**Represent**](#represent) -2. [**Send**](#send) -3. [**Store**](#store) +1. [**Represent data**](#represent) +2. [**Send data**](#send) +3. [**Store data**](#store) DocArray handles your data while integrating seamlessly with the rest of your **Python and ML ecosystem**: -- :fire: DocArray has native compatibility for **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)** and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, including for **model training use cases** -- :zap: DocArray is built on **[Pydantic](https://github.com/pydantic/pydantic)** and out-of-the-box compatible with **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)** -- :package: DocArray can index data in vector databases such as **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)** as well as **[HNSWLib](https://github.com/nmslib/hnswlib)** -- :chains: DocArray data can be sent as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)** - +- :fire: Native compatibility for **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)** and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, including for **model training use 
cases** +- :zap: Built on **[Pydantic](https://github.com/pydantic/pydantic)** and out-of-the-box compatible with **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)** +- :package: Support vector databases like **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)** and **[HNSWLib](https://github.com/nmslib/hnswlib)** +- :chains: Send data as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)** -> :bulb: **Where are you coming from?** Depending on your use case and background, there are different ways to "get" DocArray. -> You can navigate to the following section for an explanation that should fit your mindset: +> :bulb: **Where are you coming from?** Based on your use case and background, there are different ways to understand DocArray: > > - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch) > - [Coming from Pydantic](#coming-from-pydantic) @@ -48,13 +45,13 @@ DocArray allows you to **represent your data**, in an ML-native way. This is useful for different use cases: -- :woman_running: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them -- :cloud: You are **serving a model**, for example through FastAPI, and you want to specify your API endpoints -- :card_index_dividers: You are **parsing data** for later use in your ML or DS applications +- :woman_running: You are **training a model**: There are tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them. +- :cloud: You are **serving a model**: For example through FastAPI, and you want to specify your API endpoints. +- :card_index_dividers: You are **parsing data**: For later use in your ML or data science applications. 
-> :bulb: **Coming from Pydantic?** If you're currently using Pydantic for the use cases above, you should be happy to hear -> that DocArray is built on top of, and fully compatible with, Pydantic! -> Also, we have [dedicated section](#coming-from-pydantic) just for you! +> :bulb: **Coming from Pydantic?** You should be happy to hear +> that DocArray is built on top of, and is fully compatible with, Pydantic! +> Also, we have a [dedicated section](#coming-from-pydantic) just for you! Put simply, DocArray lets you represent your data in a dataclass-like way, with ML as a first class citizen: @@ -90,7 +87,7 @@ print(vec.image_tensor.shape) # (10, 1704, 2272, 3)
Click for more details -So let's take a closer look at how you can represent your data with DocArray: +Let's take a closer look at how you can represent your data with DocArray: ```python from docarray import BaseDoc @@ -111,7 +108,7 @@ class MyDocument(BaseDoc): So not only can you define the types of your data, you can even **specify the shape of your tensors!** -Once you have your model in form of a `Document`, you can work with it! +Once you have your model in the form of a document, you can work with it! ```python # Create a document @@ -124,8 +121,6 @@ doc = MyDocument( doc.image_tensor = doc.image_url.load() # Compute embedding with any model of your choice - - def clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor: # dummy function return torch.rand(512) @@ -137,7 +132,7 @@ print(doc.embedding.shape) # torch.Size([512]) ### Compose nested Documents -Of course you can compose Documents into a nested structure: +Of course, you can compose Documents into a nested structure: ```python from docarray import BaseDoc @@ -155,9 +150,7 @@ doc = MultiModalDocument( ) ``` -Of course, you rarely work with a single data point at a time, especially in Machine Learning applications. - -That's why you can easily collect multiple `Documents`: +You rarely work with a single data point at a time, especially in machine learning applications. That's why you can easily collect multiple `Documents`: ### Collect multiple `Documents` @@ -165,8 +158,8 @@ When building or interacting with an ML system, usually you want to process mult DocArray offers two data structures for this: -- **`DocVec`**: A vector of `Documents`. All tensors in the `Documents` are stacked up into a single tensor. **Perfect for batch processing and use inside of ML models**. -- **`DocList`**: A list of `Documents`. All tensors in the `Documents` are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**. +- **`DocVec`**: A vector of `Documents`. 
All tensors in the documents are stacked into a single tensor. **Perfect for batch processing and use inside of ML models**. +- **`DocList`**: A list of `Documents`. All tensors in the documents are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**. Let's take a look at them, starting with `DocVec`: @@ -192,10 +185,10 @@ vec = DocVec[Image]( # the DocVec is parametrized by your personal schema! ) ``` -As you can see in the code snippet above, `DocVec` is **parametrized by the type of Document** you want to use with it: `DocVec[Image]`. +In the code snippet above, `DocVec` is **parametrized by the type of document** you want to use with it: `DocVec[Image]`. -This may look slightly weird at first, but we're confident that you'll get used to it quickly! -Besides, it allows us to do some cool things, like giving you **bulk access to the fields that you defined** in your `Document`: +This may look weird at first, but we're confident that you'll get used to it quickly! +Besides, it lets us do some cool things, like having **bulk access to the fields that you defined** in your document: ```python tensor = vec.tensor # gets all the tensors in the DocVec @@ -219,7 +212,7 @@ dl = DocList[Image]( # the DocList is parametrized by your personal schema! 
) ``` -You can still bulk access the fields of your `Document`: +You can still bulk access the fields of your document: ```python tensors = dl.tensor # gets all the tensors in the DocList @@ -227,7 +220,7 @@ print(type(tensors)) # as a list of tensors print(dl.url) # you can bulk access any other field, too ``` -And you can insert, remove, and append `Documents` to your `DocList`: +And you can insert, remove, and append documents to your `DocList`: ```python # append @@ -270,12 +263,12 @@ This means there is native support for **Protobuf and gRPC**, on top of **HTTP** This is useful for different use cases: - :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)** -- :spider_web: You **distribute your model** across machines and need to send your data between nodes +- :spider_web: You are **distributing your model** across machines and need to send your data between nodes - :gear: You are building a **microservice** architecture and need to send your data between microservices -> :bulb: **Coming from FastAPI?** If you're currently using FastAPI for the use cases above, you should be happy to hear +> :bulb: **Coming from FastAPI?** You should be happy to hear > that DocArray is fully compatible with FastAPI! -> Also, we have [dedicated section](#coming-from-fastapi) just for you! +> Also, we have a [dedicated section](#coming-from-fastapi) just for you! Whenever you want to send your data you need to serialize it, so let's take a look at how that works with DocArray: @@ -308,14 +301,12 @@ doc_4 = MyDocument.from_bytes(bytes_) doc_5 = MyDocument.parse_raw(json) ``` -Of course, serialization is not all you need. -So check out how DocArray integrates with FastAPI and Jina. - +Of course, serialization is not all you need. So check out how DocArray integrates with FastAPI and Jina. 
## Store Once you've modelled your data, and maybe sent it around, usually you want to **store it** somewhere. -But fret not! DocArray has you covered! +DocArray has you covered! **Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface: @@ -353,7 +344,7 @@ This is useful for: - :mag: **Neural search** applications - :bulb: **Recommender systems** -Currently, DocArray Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come! +Currently, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come!
See Document Index usage @@ -402,26 +393,25 @@ results, scores = index.find(query, limit=10, search_field='embedding')
-Depending on your background and use case, there are different ways for you to _get_ DocArray. -Choose your own adventure! +Depending on your background and use case, there are different ways for you to understand DocArray. ## Coming from old DocArray
 Click to expand

If you are using a DocArray version lower than 0.30.0, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/).

_DocArray v2 is that idea, taken seriously._ Every document is created through a dataclass-like interface,
courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/).

This gives the following advantages:
- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema
- **Multimodality:** Easily store multiple modalities and multiple embeddings in the same document
- **Language agnostic:** At their core, documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.

You may also be familiar with our old Document Stores for vector DB integration.
They are now called **Document Indexes** and offer the following improvements (see [here](#store) for the new API):

- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields
- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain
- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client

For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come.

Click to expand -If you come from Pydantic, you can see DocArray Documents as juiced up Pydantic models, and DocArray as a collection of goodies around them. +If you come from Pydantic, you can see DocArray documents as juiced up Pydantic models, and DocArray as a collection of goodies around them. More specifically, we set out to **make Pydantic fit for the ML world** - not by replacing it, but by building on top of it! -This means that you get the following benefits: -- **ML focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation** +This means you get the following benefits: + +- **ML-focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation** - Full compatibility with **FastAPI** - **DocList** and **DocVec** generalize the idea of a model to a _sequence_ or _batch_ of models. Perfect for **use in ML models** and other batch processing tasks. - **Types that are alive**: ImageUrl can `.load()` a URL to image tensor, TextUrl can load and tokenize text documents, etc. - Cloud-ready: Serialization to **Protobuf** for use with microservices and **gRPC** -- **Pre-built multi-modal Documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models! +- **Pre-built multimodal documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models! - **Document Stores** and **Document Indexes** let you store your data and retrieve it using **vector search** -The most obvious advantage here is **first-class support for ML centric data**, such as {Torch, TF, ...}Tensor, Embedding, etc. +The most obvious advantage here is **first-class support for ML centric data**, such as `{Torch, TF, ...}Tensor`, `Embedding`, etc. This includes handy features such as validating the shape of a tensor: @@ -506,7 +497,6 @@ except Exception as e:
- ## Coming from PyTorch
@@ -515,14 +505,15 @@ except Exception as e: If you come from PyTorch, you can see DocArray mainly as a way of _organizing your data as it flows through your model_. It offers you several advantages: -- Express **tensors shapes in type hints** + +- Express **tensor shapes in type hints** - **Group tensors that belong to the same object**, e.g. an audio track and an image - **Go directly to deployment**, by re-using your data model as a [FastAPI](https://fastapi.tiangolo.com/) or [Jina](https://github.com/jina-ai/jina) API schema - Connect model components between **microservices**, using Protobuf and gRPC DocArray can be used directly inside ML models to handle and represent multi-modal data. This allows you to reason about your data using DocArray's abstractions deep inside of `nn.Module`, -and provides a (FastAPI-compatible) schema that eases the transition between model training and model serving. +and provides a FastAPI-compatible schema that eases the transition between model training and model serving. To see the effect of this, let's first observe a vanilla PyTorch implementation of a tri-modal ML model: @@ -623,7 +614,7 @@ schema definition (see [below](#coming-from-fastapi)). Everything is handled in
Click to expand -Similar to the [PyTorch approach](#coming-from-pytorch), you can also use DocArray with TensorFlow to handle and represent multi-modal data inside your ML model. +Like the [PyTorch approach](#coming-from-pytorch), you can also use DocArray with TensorFlow to handle and represent multimodal data inside your ML model. First off, to use DocArray with TensorFlow we first need to install it as follows: @@ -632,7 +623,7 @@ pip install tensorflow==2.11.0 pip install protobuf==3.19.0 ``` -Compared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow:\ +Compared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow: While DocArray's `TorchTensor` is a subclass of `torch.Tensor`, this is not the case for the `TensorFlowTensor`: Due to some technical limitations of `tf.Tensor`, DocArray's `TensorFlowTensor` is not a subclass of `tf.Tensor` but rather stores a `tf.Tensor` in its `.tensor` attribute. How does this affect you? Whenever you want to access the tensor data to, let's say, do operations with it or hand it to your ML model, instead of handing over your `TensorFlowTensor` instance, you need to access its `.tensor` attribute. @@ -666,7 +657,6 @@ class MyPodcastModel(tf.keras.Model):
- ## Coming from FastAPI
@@ -676,10 +666,11 @@ Documents are Pydantic Models (with a twist), and as such they are fully compati But why should you use them, and not the Pydantic models you already know and love? Good question! + - Because of the ML-first features, types and validations, [here](#coming-from-pydantic) - Because DocArray can act as an [ORM for vector databases](#coming-from-a-vector-database), similar to what SQLModel does for SQL databases -And to seal the deal, let us show you how easily Documents slot into your FastAPI app: +And to seal the deal, let us show you how easily documents slot into your FastAPI app: ```python import numpy as np @@ -725,14 +716,13 @@ Just like a vanilla Pydantic model!
- ## Coming from a vector database
Click to expand If you came across DocArray as a universal vector database client, you can best think of it as **a new kind of ORM for vector databases**. -DocArray's job is to take multi-modal, nested and domain-specific data and to map it to a vector database, +DocArray's job is to take multimodal, nested and domain-specific data and to map it to a vector database, store it there, and thus make it searchable: ```python @@ -774,6 +764,7 @@ results, scores = index.find(query, limit=10, search_field='embedding') ``` Currently, DocArray supports the following vector databases: + - [Weaviate](https://www.weaviate.io/) - [Qdrant](https://qdrant.tech/) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v8 and v7 @@ -783,25 +774,25 @@ An integration of [OpenSearch](https://opensearch.org/) is currently in progress Legacy versions of DocArray also support [Redis](https://redis.io/) and [Milvus](https://milvus.io/), but these are not yet supported in the current version. -Of course this is only one thing that DocArray can do, so we encourage you to check out the rest of this readme! +Of course this is only one of the things that DocArray can do, so we encourage you to check out the rest of this readme!
-## Install the alpha +## Installation -To try out the alpha you can install it via git: +To install DocArray from the CLI, run the following command: ```shell -pip install "git+https://github.com/docarray/docarray" +pip install docarray ``` ## See also -- [Documentation](https://docarray-v2--jina-docs.netlify.app/) +- [Documentation](https://docs.docarray.org) - [Join our Discord server](https://discord.gg/WaMp6PVPgR) - [Donation to Linux Foundation AI&Data blog post](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/) - ["Legacy" DocArray github page](https://github.com/docarray/docarray/tree/docarray-v1-fixes) -- ["Legacy" DocArray documentation](https://docarray.jina.ai/) +- ["Legacy" DocArray documentation](https://docarray-legacy.jina.ai/) > DocArray is a trademark of LF AI Projects, LLC From a4078bbc1791fd48591e162b2c226c7d49c970c5 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:46:54 +0200 Subject: [PATCH 09/10] docs: fix menu string Signed-off-by: Alex C-G --- mkdocs.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yml b/mkdocs.yml index bb1dffb933c..c7a992bde4e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -81,7 +81,7 @@ nav: - Home: README.md - User Guide: - user_guide/intro.md - - Represent: + - Representing data: - user_guide/representing/first_step.md - user_guide/representing/array.md - Sending data: From 7bb1624eabe180b67f5e704679b46b4b7fbfaa93 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Mon, 17 Apr 2023 17:52:34 +0200 Subject: [PATCH 10/10] docs: remove docs-readme Signed-off-by: Alex C-G --- docs/README.md | 807 ------------------------------------------------- 1 file changed, 807 deletions(-) delete mode 100644 docs/README.md diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index d31579acbc5..00000000000 --- a/docs/README.md +++ /dev/null @@ -1,807 +0,0 @@ -

-DocArray logo: The data structure for unstructured data -
-The data structure for multimodal data -

- -

-PyPI -Codecov branch - -PyPI - Downloads from official pypistats - -

-
-> ⬆️ **DocArray v2**: This readme is for the second version of DocArray (starting at 0.30). If you want to use the older
-> DocArray version (prior to 0.30) check out the [docarray-v1-fixes](https://github.com/docarray/docarray/tree/docarray-v1-fixes) branch
-
-
-DocArray is a library for **representing, sending and storing multi-modal data**, perfect for **Machine Learning applications**.
-
-These are the three pillars of DocArray, and you can check them out individually:
-
-1. [**Represent**](#represent)
-2. [**Send**](#send)
-3. [**Store**](#store)
-
-DocArray handles your data while integrating seamlessly with the rest of your **Python and ML ecosystem**:
-
-- :fire: DocArray has native compatibility with **[NumPy](https://github.com/numpy/numpy)**, **[PyTorch](https://github.com/pytorch/pytorch)** and **[TensorFlow](https://github.com/tensorflow/tensorflow)**, including for **model training use cases**
-- :zap: DocArray is built on **[Pydantic](https://github.com/pydantic/pydantic)** and out-of-the-box compatible with **[FastAPI](https://github.com/tiangolo/fastapi/)** and **[Jina](https://github.com/jina-ai/jina/)**
-- :package: DocArray can index data in vector databases such as **[Weaviate](https://weaviate.io/), [Qdrant](https://qdrant.tech/), [ElasticSearch](https://www.elastic.co/de/elasticsearch/)** as well as **[HNSWLib](https://github.com/nmslib/hnswlib)**
-- :chains: DocArray data can be sent as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)**
-
-
-> :bulb: **Where are you coming from?** Depending on your use case and background, there are different ways to "get" DocArray.
-> You can navigate to the following sections for an explanation that should fit your mindset:
->
-> - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch)
-> - [Coming from Pydantic](#coming-from-pydantic)
-> - [Coming from FastAPI](#coming-from-fastapi)
-> - [Coming from a vector database](#coming-from-vector-database)
-
-DocArray was released under the open-source [Apache License 2.0](https://github.com/docarray/docarray/blob/main/LICENSE) in January 2022. It is currently a sandbox project under [LF AI & Data Foundation](https://lfaidata.foundation/).
-
-## Represent
-
-DocArray allows you to **represent your data** in an ML-native way.
-
-This is useful for different use cases:
-
-- :woman_running: You are **training a model**: myriads of tensors of different shapes and sizes are flying around, representing different _things_, and you want to keep a straight head about them
-- :cloud: You are **serving a model**, for example through FastAPI, and you want to specify your API endpoints
-- :card_index_dividers: You are **parsing data** for later use in your ML or DS applications
-
-> :bulb: **Coming from Pydantic?** If you're currently using Pydantic for the use cases above, you should be happy to hear
-> that DocArray is built on top of, and fully compatible with, Pydantic!
-> Also, we have a [dedicated section](#coming-from-pydantic) just for you!
-
-Put simply, DocArray lets you represent your data in a dataclass-like way, with ML as a first-class citizen:
-
-```python
-from docarray import BaseDoc
-from docarray.typing import TorchTensor, ImageUrl
-import torch
-
-
-# Define your data model
-class MyDocument(BaseDoc):
-    description: str
-    image_url: ImageUrl  # could also be VideoUrl, AudioUrl, etc.
-    image_tensor: TorchTensor[1704, 2272, 3]  # you can express tensor shapes!
- - -# Stack multiple documents in a Document Vector -from docarray import DocVec - -vec = DocVec[MyDocument]( - [ - MyDocument( - description="A cat", - image_url="https://example.com/cat.jpg", - image_tensor=torch.rand(1704, 2272, 3), - ), - ] - * 10 -) -print(vec.image_tensor.shape) # (10, 1704, 2272, 3) -``` - -
- Click for more details - -So let's take a closer look at how you can represent your data with DocArray: - -```python -from docarray import BaseDoc -from docarray.typing import TorchTensor, ImageUrl -from typing import Optional -import torch - - -# Define your data model -class MyDocument(BaseDoc): - description: str - image_url: ImageUrl # could also be VideoUrl, AudioUrl, etc. - image_tensor: Optional[ - TorchTensor[1704, 2272, 3] - ] # could also be NdArray or TensorflowTensor - embedding: Optional[TorchTensor] -``` - -So not only can you define the types of your data, you can even **specify the shape of your tensors!** - -Once you have your model in form of a `Document`, you can work with it! - -```python -# Create a document -doc = MyDocument( - description="This is a photo of a mountain", - image_url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", -) - -# Load image tensor from URL -doc.image_tensor = doc.image_url.load() - -# Compute embedding with any model of your choice - - -def clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor: # dummy function - return torch.rand(512) - - -doc.embedding = clip_image_encoder(doc.image_tensor) - -print(doc.embedding.shape) # torch.Size([512]) -``` - -### Compose nested Documents - -Of course you can compose Documents into a nested structure: - -```python -from docarray import BaseDoc -from docarray.documents import ImageDoc, TextDoc -import numpy as np - - -class MultiModalDocument(BaseDoc): - image_doc: ImageDoc - text_doc: TextDoc - - -doc = MultiModalDocument( - image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!') -) -``` - -Of course, you rarely work with a single data point at a time, especially in Machine Learning applications. - -That's why you can easily collect multiple `Documents`: - -### Collect multiple `Documents` - -When building or interacting with an ML system, usually you want to process multiple Documents (data points) at once. 
-
-DocArray offers two data structures for this:
-
-- **`DocVec`**: A vector of `Documents`. All tensors in the `Documents` are stacked up into a single tensor. **Perfect for batch processing and use inside of ML models**.
-- **`DocList`**: A list of `Documents`. All tensors in the `Documents` are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**.
-
-Let's take a look at them, starting with `DocVec`:
-
-```python
-from docarray import DocVec, BaseDoc
-from docarray.typing import AnyTensor, ImageUrl
-import numpy as np
-
-
-class Image(BaseDoc):
-    url: ImageUrl
-    tensor: AnyTensor  # this allows torch, NumPy, and TensorFlow tensors
-
-
-vec = DocVec[Image](  # the DocVec is parametrized by your personal schema!
-    [
-        Image(
-            url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
-            tensor=np.zeros((3, 224, 224)),
-        )
-        for _ in range(100)
-    ]
-)
-```
-
-As you can see in the code snippet above, `DocVec` is **parametrized by the type of Document** you want to use with it: `DocVec[Image]`.
-
-This may look slightly weird at first, but we're confident that you'll get used to it quickly!
-Besides, it allows us to do some cool things, like giving you **bulk access to the fields that you defined** in your `Document`:
-
-```python
-tensor = vec.tensor  # gets all the tensors in the DocVec
-print(tensor.shape)  # which are stacked up into a single tensor!
-print(vec.url)  # you can bulk access any other field, too
-```
-
-The second data structure, `DocList`, works in a similar way:
-
-```python
-from docarray import DocList
-
-dl = DocList[Image](  # the DocList is parametrized by your personal schema!
- [ - Image( - url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", - tensor=np.zeros((3, 224, 224)), - ) - for _ in range(100) - ] -) -``` - -You can still bulk access the fields of your `Document`: - -```python -tensors = dl.tensor # gets all the tensors in the DocList -print(type(tensors)) # as a list of tensors -print(dl.url) # you can bulk access any other field, too -``` - -And you can insert, remove, and append `Documents` to your `DocList`: - -```python -# append -dl.append( - Image( - url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", - tensor=np.zeros((3, 224, 224)), - ) -) -# delete -del dl[0] -# insert -dl.insert( - 0, - Image( - url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", - tensor=np.zeros((3, 224, 224)), - ), -) -``` - -And you can seamlessly switch between `DocVec` and `DocList`: - -```python -vec_2 = dl.to_doc_vec() -assert isinstance(vec_2, DocVec) - -dl_2 = vec_2.to_doc_list() -assert isinstance(dl_2, DocList) -``` - -
- -## Send - -DocArray allows you to **send your data**, in an ML-native way. - -This means there is native support for **Protobuf and gRPC**, on top of **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes. - -This is useful for different use cases: - -- :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)** -- :spider_web: You **distribute your model** across machines and need to send your data between nodes -- :gear: You are building a **microservice** architecture and need to send your data between microservices - -> :bulb: **Coming from FastAPI?** If you're currently using FastAPI for the use cases above, you should be happy to hear -> that DocArray is fully compatible with FastAPI! -> Also, we have [dedicated section](#coming-from-fastapi) just for you! - -Whenever you want to send your data you need to serialize it, so let's take a look at how that works with DocArray: - -```python -from docarray import BaseDoc -from docarray.typing import ImageTorchTensor -import torch - - -# model your data -class MyDocument(BaseDoc): - description: str - image: ImageTorchTensor[3, 224, 224] - - -# create a Document -doc = MyDocument( - description="This is a description", - image=torch.zeros((3, 224, 224)), -) - -# serialize it! -proto = doc.to_protobuf() -bytes_ = doc.to_bytes() -json = doc.json() - -# deserialize it! -doc_2 = MyDocument.from_protobuf(proto) -doc_4 = MyDocument.from_bytes(bytes_) -doc_5 = MyDocument.parse_raw(json) -``` - -Of course, serialization is not all you need. -So check out how DocArray integrates with FastAPI and Jina. - - -## Store - -Once you've modelled your data, and maybe sent it around, usually you want to **store it** somewhere. -But fret not! DocArray has you covered! 
- -**Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface: - -- :cd: **On disk** as a file in your local file system -- :bucket: On **[AWS S3](https://aws.amazon.com/de/s3/)** -- :cloud: On **[Jina AI Cloud](https://cloud.jina.ai/)** - -
- See Document Store usage - -The Document Store interface lets you push and pull Documents to and from multiple data sources, all with the same user interface. - -For example, let's see how that works with on-disk storage: - -```python -from docarray import BaseDoc, DocList - - -class SimpleDoc(BaseDoc): - text: str - - -docs = DocList[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(8)]) -docs.push('file://simple_docs') - -docs_pull = DocList[SimpleDoc].pull('file://simple_docs') -``` -
- -**Document Indexes** let you index your Documents into a **vector database**, for efficient similarity-based retrieval. - -This is useful for: - -- :left_speech_bubble: Augmenting **LLMs and Chatbots** with domain knowledge ([Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401)) -- :mag: **Neural search** applications -- :bulb: **Recommender systems** - -Currently, DocArray Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come! - -
- See Document Index usage - -The Document Index interface lets you index and retrieve Documents from multiple vector databases, all with the same user interface. - -It supports ANN vector search, text search, filtering, and hybrid search. - -```python -from docarray import DocList, BaseDoc -from docarray.index import HnswDocumentIndex -import numpy as np - -from docarray.typing import ImageUrl, ImageTensor, NdArray - - -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: ImageTensor - embedding: NdArray[128] - - -# create some data -dl = DocList[ImageDoc]( - [ - ImageDoc( - url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", - tensor=np.zeros((3, 224, 224)), - embedding=np.random.random((128,)), - ) - for _ in range(100) - ] -) - -# create a Document Index -index = HnswDocumentIndex[ImageDoc](work_dir='/tmp/test_index') - - -# index your data -index.index(dl) - -# find similar Documents -query = dl[0] -results, scores = index.find(query, limit=10, search_field='embedding') -``` - -
- -Depending on your background and use case, there are different ways for you to _get_ DocArray. -Choose your own adventure! - -## Coming from old DocArray - -
- Click to expand - -If you are using DocArray v<0.30.0, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/). - -_DocArray v2 is that idea, taken seriously._ Every `Document` is created through dataclass-like interface, -courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/). - -This gives the following advantages: -- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema -- **Multi-modality:** Easily store multiple modalities and multiple embeddings in the same Document -- **Language agnostic:** At its core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python. - -You may also be familiar with our old Document Stores for vector DB integration. -They are now called **Document Indexes** and offer the following improvements (see [here](#store) for the new API): -- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields -- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain -- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client - -For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come. - -
- -## Coming from Pydantic - -
- Click to expand - -If you come from Pydantic, you can see DocArray Documents as juiced up Pydantic models, and DocArray as a collection of goodies around them. - -More specifically, we set out to **make Pydantic fit for the ML world** - not by replacing it, but by building on top of it! - -This means that you get the following benefits: -- **ML focused types**: Tensor, TorchTensor, Embedding, ..., including **tensor shape validation** -- Full compatibility with **FastAPI** -- **DocList** and **DocVec** generalize the idea of a model to a _sequence_ or _batch_ of models. Perfect for **use in ML models** and other batch processing tasks. -- **Types that are alive**: ImageUrl can `.load()` a URL to image tensor, TextUrl can load and tokenize text documents, etc. -- Cloud-ready: Serialization to **Protobuf** for use with microservices and **gRPC** -- **Pre-built multi-modal Documents** for different data modalities: Image, Text, 3DMesh, Video, Audio and more. Note that all of these are valid Pydantic models! -- **Document Stores** and **Document Indexes** let you store your data and retrieve it using **vector search** - -The most obvious advantage here is **first-class support for ML centric data**, such as {Torch, TF, ...}Tensor, Embedding, etc. 
-
-This includes handy features such as validating the shape of a tensor:
-
-```python
-from docarray import BaseDoc
-from docarray.typing import TorchTensor
-import torch
-
-
-class MyDoc(BaseDoc):
-    tensor: TorchTensor[3, 224, 224]
-
-
-doc = MyDoc(tensor=torch.zeros(3, 224, 224))  # works
-doc = MyDoc(tensor=torch.zeros(224, 224, 3))  # works by reshaping
-
-try:
-    doc = MyDoc(tensor=torch.zeros(224))  # fails validation
-except Exception as e:
-    print(e)
-    # tensor
-    # Cannot reshape tensor of shape (224,) to shape (3, 224, 224) (type=value_error)
-
-
-class Image(BaseDoc):
-    tensor: TorchTensor[3, 'x', 'x']
-
-
-Image(tensor=torch.zeros(3, 224, 224))  # works
-
-try:
-    Image(
-        tensor=torch.zeros(3, 64, 128)
-    )  # fails validation because the second dimension does not match the third
-except Exception as e:
-    print(e)
-
-
-try:
-    Image(
-        tensor=torch.zeros(4, 224, 224)
-    )  # fails validation because of the first dimension
-except Exception as e:
-    print(e)
-    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (4, 224, 224) (type=value_error)
-
-try:
-    Image(
-        tensor=torch.zeros(3, 64)
-    )  # fails validation because it does not have enough dimensions
-except Exception as e:
-    print(e)
-    # Tensor shape mismatch. Expected (3, 'x', 'x'), got (3, 64) (type=value_error)
-```
- - -## Coming from PyTorch - -
- Click to expand - -If you come from PyTorch, you can see DocArray mainly as a way of _organizing your data as it flows through your model_. - -It offers you several advantages: -- Express **tensors shapes in type hints** -- **Group tensors that belong to the same object**, e.g. an audio track and an image -- **Go directly to deployment**, by re-using your data model as a [FastAPI](https://fastapi.tiangolo.com/) or [Jina](https://github.com/jina-ai/jina) API schema -- Connect model components between **microservices**, using Protobuf and gRPC - -DocArray can be used directly inside ML models to handle and represent multi-modal data. -This allows you to reason about your data using DocArray's abstractions deep inside of `nn.Module`, -and provides a (FastAPI-compatible) schema that eases the transition between model training and model serving. - -To see the effect of this, let's first observe a vanilla PyTorch implementation of a tri-modal ML model: - -```python -import torch -from torch import nn -import torch - - -def encoder(x): - return torch.rand(512) - - -class MyMultiModalModel(nn.Module): - def __init__(self): - super().__init__() - self.audio_encoder = encoder() - self.image_encoder = encoder() - self.text_encoder = encoder() - - def forward(self, text_1, text_2, image_1, image_2, audio_1, audio_2): - embedding_text_1 = self.text_encoder(text_1) - embedding_text_2 = self.text_encoder(text_2) - - embedding_image_1 = self.image_encoder(image_1) - embedding_image_2 = self.image_encoder(image_2) - - embedding_audio_1 = self.image_encoder(audio_1) - embedding_audio_2 = self.image_encoder(audio_2) - - return ( - embedding_text_1, - embedding_text_2, - embedding_image_1, - embedding_image_2, - embedding_audio_1, - embedding_audio_2, - ) -``` - -Not very easy on the eyes if you ask us. 
And even worse, if you need to add one more modality, you have to touch every part of your code base, changing the `forward()` return type and making a whole lot of changes downstream from that.
-
-So, now let's see what the same code looks like with DocArray:
-
-```python
-from docarray import DocList, BaseDoc
-from docarray.documents import ImageDoc, TextDoc, AudioDoc
-from docarray.typing import TorchTensor
-from torch import nn
-import torch
-
-
-def encoder(x):
-    return torch.rand(512)
-
-
-class Podcast(BaseDoc):
-    text: TextDoc
-    image: ImageDoc
-    audio: AudioDoc
-
-
-class PairPodcast(BaseDoc):
-    left: Podcast
-    right: Podcast
-
-
-class MyPodcastModel(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.audio_encoder = encoder  # assign the (dummy) encoder callables themselves
-        self.image_encoder = encoder
-        self.text_encoder = encoder
-
-    def forward_podcast(self, docs: DocList[Podcast]) -> DocList[Podcast]:
-        docs.audio.embedding = self.audio_encoder(docs.audio.tensor)
-        docs.text.embedding = self.text_encoder(docs.text.tensor)
-        docs.image.embedding = self.image_encoder(docs.image.tensor)
-
-        return docs
-
-    def forward(self, docs: DocList[PairPodcast]) -> DocList[PairPodcast]:
-        docs.left = self.forward_podcast(docs.left)
-        docs.right = self.forward_podcast(docs.right)
-
-        return docs
-```
-
-Looks much better, doesn't it?
-You instantly win in code readability and maintainability. And for the same price you can turn your PyTorch model into a FastAPI app and reuse your Document
-schema definition (see [below](#coming-from-fastapi)). Everything is handled in a Pythonic manner by relying on type hints.
- - -## Coming from TensorFlow - -
- Click to expand - -Similar to the [PyTorch approach](#coming-from-pytorch), you can also use DocArray with TensorFlow to handle and represent multi-modal data inside your ML model. - -First off, to use DocArray with TensorFlow we first need to install it as follows: - -``` -pip install tensorflow==2.11.0 -pip install protobuf==3.19.0 -``` - -Compared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow:\ -While DocArray's `TorchTensor` is a subclass of `torch.Tensor`, this is not the case for the `TensorFlowTensor`: Due to some technical limitations of `tf.Tensor`, DocArray's `TensorFlowTensor` is not a subclass of `tf.Tensor` but rather stores a `tf.Tensor` in its `.tensor` attribute. - -How does this affect you? Whenever you want to access the tensor data to, let's say, do operations with it or hand it to your ML model, instead of handing over your `TensorFlowTensor` instance, you need to access its `.tensor` attribute. - -This would look like the following: - -```python -from typing import Optional - -from docarray import DocList, BaseDoc - -import tensorflow as tf - - -class Podcast(BaseDoc): - audio_tensor: Optional[AudioTensorFlowTensor] - embedding: Optional[AudioTensorFlowTensor] - - -class MyPodcastModel(tf.keras.Model): - def __init__(self): - super().__init__() - self.audio_encoder = AudioEncoder() - - def call(self, inputs: DocList[Podcast]) -> DocList[Podcast]: - inputs.audio_tensor.embedding = self.audio_encoder( - inputs.audio_tensor.tensor - ) # access audio_tensor's .tensor attribute - return inputs -``` - -
- - -## Coming from FastAPI - -
- Click to expand - -Documents are Pydantic Models (with a twist), and as such they are fully compatible with FastAPI! - -But why should you use them, and not the Pydantic models you already know and love? -Good question! -- Because of the ML-first features, types and validations, [here](#coming-from-pydantic) -- Because DocArray can act as an [ORM for vector databases](#coming-from-a-vector-database), similar to what SQLModel does for SQL databases - -And to seal the deal, let us show you how easily Documents slot into your FastAPI app: - -```python -import numpy as np -from fastapi import FastAPI -from httpx import AsyncClient - -from docarray import BaseDoc -from docarray.documents import ImageDoc -from docarray.typing import NdArray -from docarray.base_doc import DocArrayResponse - - -class InputDoc(BaseDoc): - img: ImageDoc - - -class OutputDoc(BaseDoc): - embedding_clip: NdArray - embedding_bert: NdArray - - -input_doc = InputDoc(img=ImageDoc(tensor=np.zeros((3, 224, 224)))) - -app = FastAPI() - - -@app.post("/doc/", response_model=OutputDoc, response_class=DocArrayResponse) -async def create_item(doc: InputDoc) -> OutputDoc: - ## call my fancy model to generate the embeddings - doc = OutputDoc( - embedding_clip=np.zeros((100, 1)), embedding_bert=np.zeros((100, 1)) - ) - return doc - - -async with AsyncClient(app=app, base_url="http://test") as ac: - response = await ac.post("/doc/", data=input_doc.json()) - resp_doc = await ac.get("/docs") - resp_redoc = await ac.get("/redoc") -``` - -Just like a vanilla Pydantic model! - -
- - -## Coming from a vector database - -
- Click to expand - -If you came across DocArray as a universal vector database client, you can best think of it as **a new kind of ORM for vector databases**. -DocArray's job is to take multi-modal, nested and domain-specific data and to map it to a vector database, -store it there, and thus make it searchable: - -```python -from docarray import DocList, BaseDoc -from docarray.index import HnswDocumentIndex -import numpy as np - -from docarray.typing import ImageUrl, ImageTensor, NdArray - - -class ImageDoc(BaseDoc): - url: ImageUrl - tensor: ImageTensor - embedding: NdArray[128] - - -# create some data -dl = DocList[ImageDoc]( - [ - ImageDoc( - url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg", - tensor=np.zeros((3, 224, 224)), - embedding=np.random.random((128,)), - ) - for _ in range(100) - ] -) - -# create a Document Index -index = HnswDocumentIndex[ImageDoc](work_dir='/tmp/test_index2') - - -# index your data -index.index(dl) - -# find similar Documents -query = dl[0] -results, scores = index.find(query, limit=10, search_field='embedding') -``` - -Currently, DocArray supports the following vector databases: -- [Weaviate](https://www.weaviate.io/) -- [Qdrant](https://qdrant.tech/) -- [Elasticsearch](https://www.elastic.co/elasticsearch/) v8 and v7 -- [HNSWlib](https://github.com/nmslib/hnswlib) as a local-first alternative - -An integration of [OpenSearch](https://opensearch.org/) is currently in progress. - -Legacy versions of DocArray also support [Redis](https://redis.io/) and [Milvus](https://milvus.io/), but these are not yet supported in the current version. - -Of course this is only one thing that DocArray can do, so we encourage you to check out the rest of this readme! - -
- - -## Install the alpha - -To try out the alpha you can install it via git: - -```shell -pip install "git+https://github.com/docarray/docarray" -``` - -## See also - -- [Documentation](https://docarray-v2--jina-docs.netlify.app/) -- [Join our Discord server](https://discord.gg/WaMp6PVPgR) -- [Donation to Linux Foundation AI&Data blog post](https://jina.ai/news/donate-docarray-lf-for-inclusive-standard-multimodal-data-model/) -- ["Legacy" DocArray github page](https://github.com/docarray/docarray/tree/docarray-v1-fixes) -- ["Legacy" DocArray documentation](https://docarray.jina.ai/) - -> DocArray is a trademark of LF AI Projects, LLC