Merged
19 changes: 14 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -32,8 +32,9 @@ DocArray handles your data while integrating seamlessly with the rest of your **
- :chains: DocArray data can be sent as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)**


> :bulb: **Where are you coming from?** Depending on your use case and background, there are different was to "get" DocArray.
> You can navigate to the following section for an explanation that should fit your mindest:
> :bulb: **Where are you coming from?** Depending on your use case and background, there are different ways to "get" DocArray.
> You can navigate to the following section for an explanation that should fit your mindset:
>
> - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch)
> - [Coming from Pydantic](#coming-from-pydantic)
> - [Coming from FastAPI](#coming-from-fastapi)
@@ -46,7 +47,8 @@ DocArray was released under the open-source [Apache License 2.0](https://github.
DocArray allows you to **represent your data**, in an ML-native way.

This is useful for different use cases:
- :running_woman: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them

- :woman_running: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them
- :cloud: You are **serving a model**, for example through FastAPI, and you want to specify your API endpoints
- :card_index_dividers: You are **parsing data** for later use in your ML or DS applications

@@ -61,6 +63,7 @@ from docarray import BaseDoc
from docarray.typing import TorchTensor, ImageUrl
import torch


# Define your data model
class MyDocument(BaseDoc):
description: str
@@ -95,6 +98,7 @@ from docarray.typing import TorchTensor, ImageUrl
from typing import Optional
import torch


# Define your data model
class MyDocument(BaseDoc):
description: str
@@ -160,6 +164,7 @@ That's why you can easily collect multiple `Documents`:
When building or interacting with an ML system, usually you want to process multiple Documents (data points) at once.

DocArray offers two data structures for this:

- **`DocVec`**: A vector of `Documents`. All tensors in the `Documents` are stacked up into a single tensor. **Perfect for batch processing and use inside of ML models**.
- **`DocList`**: A list of `Documents`. All tensors in the `Documents` are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**.

@@ -185,7 +190,7 @@ vec = DocVec[Image]( # the DocVec is parametrized by your personal schema!
for _ in range(100)
]
)
```
```

As you can see in the code snippet above, `DocVec` is **parametrized by the type of Document** you want to use with it: `DocVec[Image]`.

@@ -263,6 +268,7 @@ DocArray allows you to **send your data**, in an ML-native way.
This means there is native support for **Protobuf and gRPC**, on top of **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes.

This is useful for different use cases:

- :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)**
- :spider_web: You **distribute your model** across machines and need to send your data between nodes
- :gear: You are building a **microservice** architecture and need to send your data between microservices
@@ -278,6 +284,7 @@ from docarray import BaseDoc
from docarray.typing import ImageTorchTensor
import torch


# model your data
class MyDocument(BaseDoc):
description: str
@@ -302,7 +309,7 @@ doc_5 = MyDocument.parse_raw(json)
```

Of course, serialization is not all you need.
So check out how DocArray integrates with FatAPI and Jina.
So check out how DocArray integrates with FastAPI and Jina.


## Store
@@ -311,6 +318,7 @@ Once you've modelled your data, and maybe sent it around, usually you want to **
But fret not! DocArray has you covered!

**Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface:

- :cd: **On disk** as a file in your local file system
- :bucket: On **[AWS S3](https://aws.amazon.com/de/s3/)**
- :cloud: On **[Jina AI Cloud](https://cloud.jina.ai/)**
@@ -348,6 +356,7 @@ dl_2 = DocList[ImageDoc].pull('s3://my-bucket/my-documents', show_progress=True)
**Document Indexes** let you index your Documents into a **vector database**, for efficient similarity-based retrieval.

This is useful for:

- :left_speech_bubble: Augmenting **LLMs and Chatbots** with domain knowledge ([Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401))
- :mag: **Neural search** applications
- :bulb: **Recommender systems**
12 changes: 6 additions & 6 deletions docarray/array/doc_list/io.py
@@ -760,22 +760,22 @@ def save_binary(
"""Save DocList into a binary file.

It will use the protocol to pick how to save the DocList.
If used 'picke-doc_list` and `protobuf-array` the DocList will be stored
If used `pickle-array` or `protobuf-array` as protocol, the DocList will be stored
and compressed as a whole using `pickle` or `protobuf`.
When using `protobuf` or `pickle` as protocol, each Document in the DocList
will be stored individually, which makes it available for streaming.

:param file: File or filename to which the data is saved.
:param protocol: protocol to use. It can be 'pickle-array', 'protobuf-array', 'pickle' or 'protobuf'
:param compress: compress algorithm to use between `lz4`, `bz2`, `lzma`, `zlib`, `gzip`
:param show_progress: show progress bar, only works when protocol is `pickle` or `protobuf`

!!! note
If `file` is `str` it can specify `protocol` and `compress` as file extensions.
This functionality assumes `file=file_name.$protocol.$compress` where `$protocol` and `$compress` refer to a
string interpolation of the respective `protocol` and `compress` methods.
For example if `file=my_docarray.protobuf.lz4` then the binary data will be created using `protocol=protobuf`
and `compress=lz4`.

:param file: File or filename to which the data is saved.
:param protocol: protocol to use. It can be 'pickle-array', 'protobuf-array', 'pickle' or 'protobuf'
:param compress: compress algorithm to use between `lz4`, `bz2`, `lzma`, `zlib`, `gzip`
:param show_progress: show progress bar, only works when protocol is `pickle` or `protobuf`
"""
if isinstance(file, io.BufferedWriter):
file_ctx = nullcontext(file)
4 changes: 3 additions & 1 deletion docarray/array/doc_list/pushpull.py
@@ -38,7 +38,9 @@ def __len__(self) -> int:

@staticmethod
def resolve_url(url: str) -> Tuple[PUSH_PULL_PROTOCOL, str]:
"""Resolve the URL to the correct protocol and name."""
"""Resolve the URL to the correct protocol and name.

:param url: the URL to resolve
:return: the resolved protocol and name
"""
protocol, name = url.split('://', 2)
if protocol in SUPPORTED_PUSH_PULL_PROTOCOLS:
protocol = cast(PUSH_PULL_PROTOCOL, protocol)
14 changes: 9 additions & 5 deletions docarray/array/doc_list/sequence_indexing_mixin.py
@@ -41,12 +41,16 @@ class IndexingSequenceMixin(Iterable[T_item]):

You can index into, delete from, and set items in an IndexingSequenceMixin like a numpy array or torch tensor:

.. code-block:: python
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
---

```python
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
```

---

"""

14 changes: 9 additions & 5 deletions docarray/array/doc_vec/list_advance_indexing.py
@@ -11,12 +11,16 @@ class ListAdvancedIndexing(IndexingSequenceMixin[T_item]):

You can index into a ListAdvancedIndexing like a numpy array or torch tensor:

.. code-block:: python
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
---

```python
docs[0] # index by position
docs[0:5:2] # index by slice
docs[[0, 2, 3]] # index by list of indices
docs[True, False, True, True, ...] # index by boolean mask
```

---

"""

19 changes: 12 additions & 7 deletions docarray/base_doc/docarray_response.py
@@ -15,15 +15,20 @@ class DocArrayResponse(JSONResponse):
This is a custom Response class for FastAPI and Starlette. It is needed
to handle serialization of the Document types when using FastAPI.

EXAMPLE USAGE
.. code-block:: python
from docarray.documets import Text
from docarray.base_doc import DocResponse
---

```python
from docarray.documents import Text
from docarray.base_doc import DocArrayResponse


@app.post("/doc/", response_model=Text, response_class=DocArrayResponse)
async def create_item(doc: Text) -> Text:
    return doc
```

---

@app.post("/doc/", response_model=Text, response_class=DocResponse)
async def create_item(doc: Text) -> Text:
return doc
"""

def render(self, content: Any) -> bytes:
4 changes: 2 additions & 2 deletions docarray/base_doc/mixins/update.py
@@ -28,12 +28,12 @@ def update(self, other: T):
- Setting data properties of the second Document to the first Document
if they are not None
- Concatenating lists and updating sets
- Updating recursively Documents and DocArrays
- Updating recursively Documents and DocLists
- Updating Dictionaries of the left with the right

It behaves as an update operation for Dictionaries, except that since
it is applied to a static schema type, the presence of the field is
given by the field not having a None value and that DocArrays,
given by the field not having a None value and that DocLists,
lists and sets are concatenated. It is worth mentioning that Tuples
are not merged together since they are meant to be immutable,
so they behave as regular types and the value of `self` is updated
2 changes: 1 addition & 1 deletion docarray/computation/abstract_comp_backend.py
@@ -144,7 +144,7 @@ def minmax_normalize(
`tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then
normalization is row-based.

.. note::
!!! note
- with `t_range=(0, 1)`, the minimum of the data is normalized to 0 and the maximum to 1;
- with `t_range=(1, 0)`, the minimum of the data is normalized to 1 and the maximum to 0.
3 changes: 2 additions & 1 deletion docarray/computation/numpy_backend.py
@@ -91,7 +91,8 @@ def minmax_normalize(
`tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then
normalization is row-based.

.. note::
!!! note

- with `t_range=(0, 1)`, the minimum of the data is normalized to 0 and the maximum to 1;
- with `t_range=(1, 0)`, the minimum of the data is normalized to 1 and the maximum to 0.
3 changes: 2 additions & 1 deletion docarray/computation/torch_backend.py
@@ -147,7 +147,8 @@ def minmax_normalize(
`tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then
normalization is row-based.

.. note::
!!! note

- with `t_range=(0, 1)`, the minimum of the data is normalized to 0 and the maximum to 1;
- with `t_range=(1, 0)`, the minimum of the data is normalized to 1 and the maximum to 0.
80 changes: 42 additions & 38 deletions docarray/documents/helper.py
@@ -25,16 +25,6 @@ def create_doc(
) -> Type['T_doc']:
"""
Dynamically create a subclass of BaseDoc. This is a wrapper around pydantic's create_model.
:param __model_name: name of the created model
:param __config__: config class to use for the new model
:param __base__: base class for the new model to inherit from, must be BaseDoc or its subclass
:param __module__: module of the created model
:param __validators__: a dict of method names and @validator class methods
:param __cls_kwargs__: a dict for class creation
:param __slots__: Deprecated, `__slots__` should not be passed to `create_model`
:param field_definitions: fields of the model (or extra fields if a base is supplied)
in the format `<name>=(<type>, <default default>)` or `<name>=<default value>`
:return: the new Document class

```python
from docarray.documents import Audio
@@ -51,6 +41,17 @@
assert issubclass(MyAudio, BaseDoc)
assert issubclass(MyAudio, Audio)
```

:param __model_name: name of the created model
:param __config__: config class to use for the new model
:param __base__: base class for the new model to inherit from, must be BaseDoc or its subclass
:param __module__: module of the created model
:param __validators__: a dict of method names and @validator class methods
:param __cls_kwargs__: a dict for class creation
:param __slots__: Deprecated, `__slots__` should not be passed to `create_model`
:param field_definitions: fields of the model (or extra fields if a base is supplied)
in the format `<name>=(<type>, <default default>)` or `<name>=<default value>`
:return: the new Document class
"""

if not issubclass(__base__, BaseDoc):
@@ -76,32 +77,34 @@ def create_doc_from_typeddict(
):
"""
Create a subclass of BaseDoc based on the fields of a `TypedDict`. This is a wrapper around pydantic's create_model_from_typeddict.
:param typeddict_cls: TypedDict class to use for the new Document class
:param kwargs: extra arguments to pass to `create_model_from_typeddict`
:return: the new Document class

EXAMPLE USAGE
---

.. code-block:: python
```python
from typing_extensions import TypedDict

from typing_extensions import TypedDict
from docarray import BaseDoc
from docarray.documents import Audio
from docarray.documents.helper import create_doc_from_typeddict
from docarray.typing.tensor.audio import AudioNdArray

from docarray import BaseDoc
from docarray.documents import Audio
from docarray.documents.helper import create_doc_from_typeddict
from docarray.typing.tensor.audio import AudioNdArray

class MyAudio(TypedDict):
title: str
tensor: AudioNdArray

class MyAudio(TypedDict):
title: str
tensor: AudioNdArray

Doc = create_doc_from_typeddict(MyAudio, __base__=Audio)

Doc = create_doc_from_typeddict(MyAudio, __base__=Audio)
assert issubclass(Doc, BaseDoc)
assert issubclass(Doc, Audio)
```

assert issubclass(Doc, BaseDoc)
assert issubclass(Doc, Audio)
---

:param typeddict_cls: TypedDict class to use for the new Document class
:param kwargs: extra arguments to pass to `create_model_from_typeddict`
:return: the new Document class
"""

if '__base__' in kwargs:
@@ -122,24 +125,25 @@ def create_doc_from_dict(model_name: str, data_dict: Dict[str, Any]) -> Type['T_
In case `data_dict` contains `None` as a value,
the corresponding field will be viewed as the type `Any`.

:param model_name: Name of the new Document class
:param data_dict: Dictionary of field types to their corresponding values.
:return: the new Document class

EXAMPLE USAGE
---

.. code-block:: python
```python
import numpy as np
from docarray.documents import ImageDoc
from docarray.documents.helper import create_doc_from_dict

import numpy as np
from docarray.documents import ImageDoc
from docarray.documents.helper import create_doc_from_dict
data_dict = {'image': ImageDoc(tensor=np.random.rand(3, 224, 224)), 'author': 'me'}

data_dict = {'image': ImageDoc(tensor=np.random.rand(3, 224, 224)), 'author': 'me'}
MyDoc = create_doc_from_dict(model_name='MyDoc', data_dict=data_dict)

MyDoc = create_doc_from_dict(model_name='MyDoc', data_dict=data_dict)
assert issubclass(MyDoc, BaseDoc)
```

assert issubclass(MyDoc, BaseDoc)
---

:param model_name: Name of the new Document class
:param data_dict: Dictionary of field types to their corresponding values.
:return: the new Document class
"""
if not data_dict:
raise ValueError('`data_dict` should contain at least one item')