diff --git a/README.md b/README.md index 4468e767880..fbef28886bc 100644 --- a/README.md +++ b/README.md @@ -32,8 +32,9 @@ DocArray handles your data while integrating seamlessly with the rest of your ** - :chains: DocArray data can be sent as JSON over **HTTP** or as **[Protobuf](https://protobuf.dev/)** over **[gRPC](https://grpc.io/)** -> :bulb: **Where are you coming from?** Depending on your use case and background, there are different was to "get" DocArray. -> You can navigate to the following section for an explanation that should fit your mindest: +> :bulb: **Where are you coming from?** Depending on your use case and background, there are different ways to "get" DocArray. +> You can navigate to the following section for an explanation that should fit your mindset: +> > - [Coming from pure PyTorch or TensorFlow](#coming-from-pytorch) > - [Coming from Pydantic](#coming-from-pydantic) > - [Coming from FastAPI](#coming-from-fastapi) @@ -46,7 +47,8 @@ DocArray was released under the open-source [Apache License 2.0](https://github. DocArray allows you to **represent your data**, in an ML-native way. 
This is useful for different use cases: -- :running_woman: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them + +- :woman_running: You are **training a model**, there are myriads of tensors of different shapes and sizes flying around, representing different _things_, and you want to keep a straight head about them - :cloud: You are **serving a model**, for example through FastAPI, and you want to specify your API endpoints - :card_index_dividers: You are **parsing data** for later use in your ML or DS applications @@ -61,6 +63,7 @@ from docarray import BaseDoc from docarray.typing import TorchTensor, ImageUrl import torch + # Define your data model class MyDocument(BaseDoc): description: str @@ -95,6 +98,7 @@ from docarray.typing import TorchTensor, ImageUrl from typing import Optional import torch + # Define your data model class MyDocument(BaseDoc): description: str @@ -160,6 +164,7 @@ That's why you can easily collect multiple `Documents`: When building or interacting with an ML system, usually you want to process multiple Documents (data points) at once. DocArray offers two data structures for this: + - **`DocVec`**: A vector of `Documents`. All tensors in the `Documents` are stacked up into a single tensor. **Perfect for batch processing and use inside of ML models**. - **`DocList`**: A list of `Documents`. All tensors in the `Documents` are kept as-is. **Perfect for streaming, re-ranking, and shuffling of data**. @@ -185,7 +190,7 @@ vec = DocVec[Image]( # the DocVec is parametrized by your personal schema! for _ in range(100) ] ) -``` +``` As you can see in the code snippet above, `DocVec` is **parametrized by the type of Document** you want to use with it: `DocVec[Image]`. @@ -263,6 +268,7 @@ DocArray allows you to **send your data**, in an ML-native way. 
This means there is native support for **Protobuf and gRPC**, on top of **HTTP** and serialization to JSON, JSONSchema, Base64, and Bytes. This is useful for different use cases: + - :cloud: You are **serving a model**, for example through **[Jina](https://github.com/jina-ai/jina/)** or **[FastAPI](https://github.com/tiangolo/fastapi/)** - :spider_web: You **distribute your model** across machines and need to send your data between nodes - :gear: You are building a **microservice** architecture and need to send your data between microservices @@ -278,6 +284,7 @@ from docarray import BaseDoc from docarray.typing import ImageTorchTensor import torch + # model your data class MyDocument(BaseDoc): description: str @@ -302,7 +309,7 @@ doc_5 = MyDocument.parse_raw(json) ``` Of course, serialization is not all you need. -So check out how DocArray integrates with FatAPI and Jina. +So check out how DocArray integrates with FastAPI and Jina. ## Store @@ -311,6 +318,7 @@ Once you've modelled your data, and maybe sent it around, usually you want to ** But fret not! DocArray has you covered! **Document Stores** let you, well, store your Documents, locally or remotely, all with the same user interface: + - :cd: **On disk** as a file in your local file system - :bucket: On **[AWS S3](https://aws.amazon.com/de/s3/)** - :cloud: On **[Jina AI Cloud](https://cloud.jina.ai/)** @@ -348,6 +356,7 @@ dl_2 = DocList[ImageDoc].pull('s3://my-bucket/my-documents', show_progress=True) **Document Indexes** let you index your Documents into a **vector database**, for efficient similarity-based retrieval. 
This is useful for: + - :left_speech_bubble: Augmenting **LLMs and Chatbots** with domain knowledge ([Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401)) - :mag: **Neural search** applications - :bulb: **Recommender systems** diff --git a/docarray/array/doc_list/io.py b/docarray/array/doc_list/io.py index 16dca6a5bb0..7be91d98c41 100644 --- a/docarray/array/doc_list/io.py +++ b/docarray/array/doc_list/io.py @@ -760,22 +760,22 @@ def save_binary( """Save DocList into a binary file. It will use the protocol to pick how to save the DocList. - If used 'picke-doc_list` and `protobuf-array` the DocList will be stored + If used `pickle-array` and `protobuf-array` the DocList will be stored and compressed at complete level using `pickle` or `protobuf`. When using `protobuf` or `pickle` as protocol each Document in DocList will be stored individually and this would make it available for streaming. - :param file: File or filename to which the data is saved. - :param protocol: protocol to use. It can be 'pickle-array', 'protobuf-array', 'pickle' or 'protobuf' - :param compress: compress algorithm to use between `lz4`, `bz2`, `lzma`, `zlib`, `gzip` - :param show_progress: show progress bar, only works when protocol is `pickle` or `protobuf` - !!! note If `file` is `str` it can specify `protocol` and `compress` as file extensions. This functionality assumes `file=file_name.$protocol.$compress` where `$protocol` and `$compress` refer to a string interpolation of the respective `protocol` and `compress` methods. For example if `file=my_docarray.protobuf.lz4` then the binary data will be created using `protocol=protobuf` and `compress=lz4`. + + :param file: File or filename to which the data is saved. + :param protocol: protocol to use.
It can be 'pickle-array', 'protobuf-array', 'pickle' or 'protobuf' + :param compress: compress algorithm to use between `lz4`, `bz2`, `lzma`, `zlib`, `gzip` + :param show_progress: show progress bar, only works when protocol is `pickle` or `protobuf` """ if isinstance(file, io.BufferedWriter): file_ctx = nullcontext(file) diff --git a/docarray/array/doc_list/pushpull.py b/docarray/array/doc_list/pushpull.py index 0d0f9384758..2bfe6764061 100644 --- a/docarray/array/doc_list/pushpull.py +++ b/docarray/array/doc_list/pushpull.py @@ -38,7 +38,9 @@ def __len__(self) -> int: @staticmethod def resolve_url(url: str) -> Tuple[PUSH_PULL_PROTOCOL, str]: - """Resolve the URL to the correct protocol and name.""" + """Resolve the URL to the correct protocol and name. + :param url: url to resolve + """ protocol, name = url.split('://', 2) if protocol in SUPPORTED_PUSH_PULL_PROTOCOLS: protocol = cast(PUSH_PULL_PROTOCOL, protocol) diff --git a/docarray/array/doc_list/sequence_indexing_mixin.py b/docarray/array/doc_list/sequence_indexing_mixin.py index 85bad64429f..8513c82bee0 100644 --- a/docarray/array/doc_list/sequence_indexing_mixin.py +++ b/docarray/array/doc_list/sequence_indexing_mixin.py @@ -41,12 +41,16 @@ class IndexingSequenceMixin(Iterable[T_item]): You can index into, delete from, and set items in a IndexingSequenceMixin like a numpy doc_list or torch tensor: - .. code-block:: python - docs[0] # index by position - docs[0:5:2] # index by slice - docs[[0, 2, 3]] # index by list of indices - docs[True, False, True, True, ...] # index by boolean mask + --- + ```python + docs[0] # index by position + docs[0:5:2] # index by slice + docs[[0, 2, 3]] # index by list of indices + docs[True, False, True, True, ...] 
# index by boolean mask + ``` + + --- """ diff --git a/docarray/array/doc_vec/list_advance_indexing.py b/docarray/array/doc_vec/list_advance_indexing.py index e0eaf2e970c..bc5c07d9c83 100644 --- a/docarray/array/doc_vec/list_advance_indexing.py +++ b/docarray/array/doc_vec/list_advance_indexing.py @@ -11,12 +11,16 @@ class ListAdvancedIndexing(IndexingSequenceMixin[T_item]): You can index into a ListAdvanceIndex like a numpy array or torch tensor: - .. code-block:: python - docs[0] # index by position - docs[0:5:2] # index by slice - docs[[0, 2, 3]] # index by list of indices - docs[True, False, True, True, ...] # index by boolean mask + --- + ```python + docs[0] # index by position + docs[0:5:2] # index by slice + docs[[0, 2, 3]] # index by list of indices + docs[True, False, True, True, ...] # index by boolean mask + ``` + + --- """ diff --git a/docarray/base_doc/docarray_response.py b/docarray/base_doc/docarray_response.py index 3e62cf64f9b..a9f807ab6b4 100644 --- a/docarray/base_doc/docarray_response.py +++ b/docarray/base_doc/docarray_response.py @@ -15,15 +15,20 @@ class DocArrayResponse(JSONResponse): This is a custom Response class for FastAPI and starlette. This is needed to handle serialization of the Document types when using FastAPI - EXAMPLE USAGE - .. 
code-block:: python - from docarray.documets import Text - from docarray.base_doc import DocResponse + --- + ```python + from docarray.documents import Text + from docarray.base_doc import DocResponse + + + @app.post("/doc/", response_model=Text, response_class=DocResponse) + async def create_item(doc: Text) -> Text: + return doc + ``` + + --- - @app.post("/doc/", response_model=Text, response_class=DocResponse) - async def create_item(doc: Text) -> Text: - return doc """ def render(self, content: Any) -> bytes: diff --git a/docarray/base_doc/mixins/update.py b/docarray/base_doc/mixins/update.py index 754e6c9b789..e463bcb6af1 100644 --- a/docarray/base_doc/mixins/update.py +++ b/docarray/base_doc/mixins/update.py @@ -28,12 +28,12 @@ def update(self, other: T): - Setting data properties of the second Document to the first Document if they are not None - Concatenating lists and updating sets - - Updating recursively Documents and DocArrays + - Updating recursively Documents and DocLists - Updating Dictionaries of the left with the right It behaves as an update operation for Dictionaries, except that since it is applied to a static schema type, the presence of the field is - given by the field not having a None value and that DocArrays, + given by the field not having a None value and that DocLists, lists and sets are concatenated. It is worth mentioning that Tuples are not merged together since they are meant to be immutable, so they behave as regular types and the value of `self` is updated diff --git a/docarray/computation/abstract_comp_backend.py b/docarray/computation/abstract_comp_backend.py index da80ad9f841..8e2be24cbfb 100644 --- a/docarray/computation/abstract_comp_backend.py +++ b/docarray/computation/abstract_comp_backend.py @@ -144,7 +144,7 @@ def minmax_normalize( `tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then normalization is row-based. - .. note:: + !!!
note - with `t_range=(0, 1)` will normalize the min-value of data to 0, max to 1; - with `t_range=(1, 0)` will normalize the min-value of data to 1, max value of the data to 0. diff --git a/docarray/computation/numpy_backend.py b/docarray/computation/numpy_backend.py index 45b43d763d4..30d50cc0174 100644 --- a/docarray/computation/numpy_backend.py +++ b/docarray/computation/numpy_backend.py @@ -91,7 +91,8 @@ def minmax_normalize( `tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then normalization is row-based. - .. note:: + !!! note + - with `t_range=(0, 1)` will normalize the min-value of data to 0, max to 1; - with `t_range=(1, 0)` will normalize the min-value of data to 1, max value of the data to 0. diff --git a/docarray/computation/torch_backend.py b/docarray/computation/torch_backend.py index b1419f20200..be6d4ea03fd 100644 --- a/docarray/computation/torch_backend.py +++ b/docarray/computation/torch_backend.py @@ -147,7 +147,8 @@ def minmax_normalize( `tensor` can be a 1D array or a 2D array. When `tensor` is a 2D array, then normalization is row-based. - .. note:: + !!! note + - with `t_range=(0, 1)` will normalize the min-value of data to 0, max to 1; - with `t_range=(1, 0)` will normalize the min-value of data to 1, max value of the data to 0. diff --git a/docarray/documents/helper.py b/docarray/documents/helper.py index 94be2db2739..039ada7ae71 100644 --- a/docarray/documents/helper.py +++ b/docarray/documents/helper.py @@ -25,16 +25,6 @@ def create_doc( ) -> Type['T_doc']: """ Dynamically create a subclass of BaseDoc. This is a wrapper around pydantic's create_model. 
- :param __model_name: name of the created model - :param __config__: config class to use for the new model - :param __base__: base class for the new model to inherit from, must be BaseDoc or its subclass - :param __module__: module of the created model - :param __validators__: a dict of method names and @validator class methods - :param __cls_kwargs__: a dict for class creation - :param __slots__: Deprecated, `__slots__` should not be passed to `create_model` - :param field_definitions: fields of the model (or extra fields if a base is supplied) - in the format `<name>=(<type>, <default>)` or `<name>=<default>` - :return: the new Document class ```python from docarray.documents import Audio @@ -51,6 +41,17 @@ def create_doc( assert issubclass(MyAudio, BaseDoc) assert issubclass(MyAudio, Audio) ``` + + :param __model_name: name of the created model + :param __config__: config class to use for the new model + :param __base__: base class for the new model to inherit from, must be BaseDoc or its subclass + :param __module__: module of the created model + :param __validators__: a dict of method names and @validator class methods + :param __cls_kwargs__: a dict for class creation + :param __slots__: Deprecated, `__slots__` should not be passed to `create_model` + :param field_definitions: fields of the model (or extra fields if a base is supplied) + in the format `<name>=(<type>, <default>)` or `<name>=<default>` + :return: the new Document class """ if not issubclass(__base__, BaseDoc): @@ -76,32 +77,34 @@ def create_doc_from_typeddict( ): """ Create a subclass of BaseDoc based on the fields of a `TypedDict`. This is a wrapper around pydantic's create_model_from_typeddict. - :param typeddict_cls: TypedDict class to use for the new Document class - :param kwargs: extra arguments to pass to `create_model_from_typeddict` - :return: the new Document class - EXAMPLE USAGE + --- - ..
code-block:: python + ```python + from typing_extensions import TypedDict - from typing_extensions import TypedDict + from docarray import BaseDoc + from docarray.documents import Audio + from docarray.documents.helper import create_doc_from_typeddict + from docarray.typing.tensor.audio import AudioNdArray - from docarray import BaseDoc - from docarray.documents import Audio - from docarray.documents.helper import create_doc_from_typeddict - from docarray.typing.tensor.audio import AudioNdArray + class MyAudio(TypedDict): + title: str + tensor: AudioNdArray - class MyAudio(TypedDict): - title: str - tensor: AudioNdArray + Doc = create_doc_from_typeddict(MyAudio, __base__=Audio) - Doc = create_doc_from_typeddict(MyAudio, __base__=Audio) + assert issubclass(Doc, BaseDoc) + assert issubclass(Doc, Audio) + ``` - assert issubclass(Doc, BaseDoc) - assert issubclass(Doc, Audio) + --- + :param typeddict_cls: TypedDict class to use for the new Document class + :param kwargs: extra arguments to pass to `create_model_from_typeddict` + :return: the new Document class """ if '__base__' in kwargs: @@ -122,24 +125,25 @@ def create_doc_from_dict(model_name: str, data_dict: Dict[str, Any]) -> Type['T_ In case the example contains None as a value, corresponding field will be viewed as the type Any. - :param model_name: Name of the new Document class - :param data_dict: Dictionary of field types to their corresponding values. - :return: the new Document class - - EXAMPLE USAGE + --- - .. 
code-block:: python + ```python + import numpy as np + from docarray.documents import ImageDoc + from docarray.documents.helper import create_doc_from_dict - import numpy as np - from docarray.documents import ImageDoc - from docarray.documents.helper import create_doc_from_dict + data_dict = {'image': ImageDoc(tensor=np.random.rand(3, 224, 224)), 'author': 'me'} - data_dict = {'image': ImageDoc(tensor=np.random.rand(3, 224, 224)), 'author': 'me'} + MyDoc = create_doc_from_dict(model_name='MyDoc', data_dict=data_dict) - MyDoc = create_doc_from_dict(model_name='MyDoc', data_dict=data_dict) + assert issubclass(MyDoc, BaseDoc) + ``` - assert issubclass(MyDoc, BaseDoc) + --- + :param model_name: Name of the new Document class + :param data_dict: Dictionary of field types to their corresponding values. + :return: the new Document class """ if not data_dict: raise ValueError('`data_dict` should contain at least one item') diff --git a/docarray/helper.py b/docarray/helper.py index 7c8972b4735..21469ca2acd 100644 --- a/docarray/helper.py +++ b/docarray/helper.py @@ -41,9 +41,9 @@ def _access_path_to_dict(access_path: str, value) -> Dict[str, Any]: """ Convert an access path ("__"-separated) and its value to a (potentially) nested dict. - EXAMPLE USAGE - .. code-block:: python - assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}} + ```python + assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}} + ``` """ fields = access_path.split('__') for field in reversed(fields): @@ -56,14 +56,16 @@ def _access_path_dict_to_nested_dict(access_path2val: Dict[str, Any]) -> Dict[An """ Convert a dict, where the keys are access paths ("__"-separated) to a nested dictionary. - EXAMPLE USAGE + --- - .. 
code-block:: python + ```python + access_path2val = {'image__url': 'some.png'} + assert access_path_dict_to_nested_dict(access_path2val) == { + 'image': {'url': 'some.png'} + } + ``` - access_path2val = {'image__url': 'some.png'} - assert access_path_dict_to_nested_dict(access_path2val) == { - 'image': {'url': 'some.png'} - } + --- :param access_path2val: dict with access_paths as keys :return: nested dict where the access path keys are split into separate field names and nested keys @@ -83,9 +85,10 @@ def _dict_to_access_paths(d: dict) -> Dict[str, Any]: Convert a (nested) dict to a Dict[access_path, value]. Access paths are defined as a path of field(s) separated by "__". - EXAMPLE USAGE - .. code-block:: python - assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url', 'img.png'} + ```python + assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url': 'img.png'} + ``` + """ result = {} for k, v in d.items(): @@ -105,15 +108,13 @@ def _update_nested_dicts( """ Update a dict with another one, while considering shared nested keys. - EXAMPLE USAGE: - - .. code-block:: python - - d1 = {'image': {'tensor': None}, 'title': 'hello'} - d2 = {'image': {'url': 'some.png'}} + ```python + d1 = {'image': {'tensor': None}, 'title': 'hello'} + d2 = {'image': {'url': 'some.png'}} - update_nested_dicts(d1, d2) - assert d1 == {'image': {'tensor': None, 'url': 'some.png'}, 'title': 'hello'} + update_nested_dicts(d1, d2) + assert d1 == {'image': {'tensor': None, 'url': 'some.png'}, 'title': 'hello'} + ``` :param to_update: dict that should be updated :param update_with: dict to update with @@ -131,6 +132,7 @@ def _get_field_type_by_access_path( ) -> Optional[Type]: """ Get field type by "__"-separated access path. + :param doc_type: type of document :param access_path: "__"-separated access path :return: field type of accessed attribute. If access path is invalid, return None.
@@ -175,30 +177,32 @@ def get_paths( """ Yield file paths described by `patterns`. - EXAMPLE USAGE + --- - .. code-block:: python + ```python + from typing import Optional + from docarray import BaseDoc, DocList + from docarray.helper import get_paths + from docarray.typing import TextUrl, ImageUrl - from typing import Optional - from docarray import BaseDoc, DocList - from docarray.helper import get_paths - from docarray.typing import TextUrl, ImageUrl + class Banner(BaseDoc): + text_url: TextUrl + image_url: Optional[ImageUrl] - class Banner(BaseDoc): - text_url: TextUrl - image_url: Optional[ImageUrl] + # you can call it in the constructor + docs = DocList[Banner]([Banner(text_url=url) for url in get_paths(patterns='*.txt')]) - # you can call it in the constructor - docs = DocList[Banner]([Banner(text_url=url) for url in get_paths(patterns='*.txt')]) + # and call it after construction to set the urls + docs.image_url = list(get_paths(patterns='*.jpg', exclude_regex='test')) - # and call it after construction to set the urls - docs.image_url = list(get_paths(patterns='*.jpg', exclude_regex='test')) + for doc in docs: + assert doc.image_url.endswith('.jpg') + assert doc.text_url.endswith('.txt') + ``` - for doc in docs: - assert doc.image_url.endswith('.txt') - assert doc.text_url.endswith('.jpg') + --- :param patterns: The pattern may contain simple shell-style wildcards, e.g. '\*.py', '[\*.zip, \*.gz]' diff --git a/docarray/index/abstract.py b/docarray/index/abstract.py index 58118dc312d..a6eecbf185c 100644 --- a/docarray/index/abstract.py +++ b/docarray/index/abstract.py @@ -197,9 +197,10 @@ def execute_query(self, query: Any, *args, **kwargs) -> Any: Execute a query on the database. Can take two kinds of inputs: - - A native query of the underlying database. This is meant as a passthrough so that you + + 1. A native query of the underlying database.
This is meant as a passthrough so that you can enjoy any functionality that is not available through the Document index API. - - The output of this Document index' `QueryBuilder.build()` method. + 2. The output of this Document index' `QueryBuilder.build()` method. :param query: the query to execute :param args: positional arguments to pass to the query @@ -268,8 +269,8 @@ def _filter_batched( :param filter_queries: the DB specific filter queries to execute :param limit: maximum number of documents to return per query - :return: List of DocArrays containing the documents - that match the filter queries + :return: List of DocLists containing the documents that match the filter + queries """ ... @@ -377,14 +378,14 @@ def configure(self, runtime_config=None, **kwargs): def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs): """index Documents into the index. - :param docs: Documents to index. - !!! note Passing a sequence of Documents that is not a DocList (such as a List of Docs) comes at a performance penalty. This is because the Index needs to check compatibility between itself and the data. With a DocList as input this is a single check; for other inputs compatibility needs to be checked for every Document individually. + + :param docs: Documents to index. 
""" self._logger.debug(f'Indexing {len(docs)} documents') docs_validated = self._validate_docs(docs) diff --git a/docarray/index/backends/elastic.py b/docarray/index/backends/elastic.py index 0374f2ec162..862b2673389 100644 --- a/docarray/index/backends/elastic.py +++ b/docarray/index/backends/elastic.py @@ -56,6 +56,7 @@ class ElasticDocIndex(BaseDocIndex, Generic[TSchema]): def __init__(self, db_config=None, **kwargs): + """Initialize ElasticDocIndex""" super().__init__(db_config=db_config, **kwargs) self._db_config = cast(self.DBConfig, self._db_config) @@ -116,6 +117,7 @@ def __init__(self, outer_instance, **kwargs): } def build(self, *args, **kwargs) -> Any: + """Build the elastic search query object.""" if len(self._query['query']) == 0: del self._query['query'] elif 'knn' in self._query: @@ -131,6 +133,15 @@ def find( limit: int = 10, num_candidates: Optional[int] = None, ): + """ + Find k-nearest neighbors of the query. + + :param query: query vector for KNN/ANN search. Has single axis. 
+ :param search_field: name of the field to search on + :param limit: maximum number of documents to return per query + :param num_candidates: number of approximate nearest neighbor candidates to consider per shard + :return: self + """ self._outer_instance._validate_search_field(search_field) if isinstance(query, BaseDoc): query_vec = BaseDocIndex._get_values_by_column([query], search_field)[0] @@ -149,11 +160,24 @@ def find( # filter accepts Leaf/Compound query clauses # https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html def filter(self, query: Dict[str, Any], limit: int = 10): + """Find documents in the index based on a filter query + + :param query: the query to execute + :param limit: maximum number of documents to return + :return: self + """ self._query['size'] = limit self._query['query']['bool']['filter'].append(query) return self def text_search(self, query: str, search_field: str = 'text', limit: int = 10): + """Find documents in the index based on a text search query + + :param query: The text to search for + :param search_field: name of the field to search on + :param limit: maximum number of documents to find + :return: self + """ self._outer_instance._validate_search_field(search_field) self._query['size'] = limit self._query['query']['bool']['must'].append( @@ -167,12 +191,16 @@ def text_search(self, query: str, search_field: str = 'text', limit: int = 10): def build_query(self, **kwargs) -> QueryBuilder: """ - Build a query for this DocumentIndex. + Build a query for ElasticDocIndex.
+ :param kwargs: parameters to forward to QueryBuilder initialization + :return: QueryBuilder object """ return self.QueryBuilder(self, **kwargs) @dataclass class DBConfig(BaseDocIndex.DBConfig): + """Dataclass that contains all "static" configurations of ElasticDocIndex.""" + hosts: Union[ str, List[Union[str, Mapping[str, Union[str, int]], NodeConfig]], None ] = 'http://localhost:9200' @@ -183,6 +211,8 @@ class DBConfig(BaseDocIndex.DBConfig): @dataclass class RuntimeConfig(BaseDocIndex.RuntimeConfig): + """Dataclass that contains all "dynamic" configurations of ElasticDocIndex.""" + default_column_config: Dict[Any, Dict[str, Any]] = field(default_factory=dict) chunk_size: int = 500 @@ -234,6 +264,7 @@ def __post_init__(self): self.default_column_config['dense_vector'] = self.dense_vector_config() def dense_vector_config(self): + """Get the dense vector config.""" config = { 'dims': -1, 'index': True, @@ -250,7 +281,13 @@ def dense_vector_config(self): ############################################### def python_type_to_db_type(self, python_type: Type) -> Any: - """Map python type to database type.""" + """Map python type to database type. + Takes any python type and returns the corresponding database column type. + + :param python_type: a python type. + :return: the corresponding database column type, + or None if ``python_type`` is not supported. + """ for allowed_type in ELASTIC_PY_VEC_TYPES: if issubclass(python_type, allowed_type): return 'dense_vector' @@ -302,6 +339,9 @@ def _index( self._refresh(self._index_name) def num_docs(self) -> int: + """ + Get the number of documents. + """ return self._client.count(index=self._index_name)['count'] def _del_items( @@ -344,6 +384,20 @@ def _get_items(self, doc_ids: Sequence[str]) -> Sequence[TSchema]: return accumulated_docs def execute_query(self, query: Dict[str, Any], *args, **kwargs) -> Any: + """ + Execute a query on the ElasticDocIndex. + + Can take two kinds of inputs: + + 1. 
A native query of the underlying database. This is meant as a passthrough so that you + can enjoy any functionality that is not available through the Document index API. + 2. The output of this Document index' `QueryBuilder.build()` method. + + :param query: the query to execute + :param args: positional arguments to pass to the query + :param kwargs: keyword arguments to pass to the query + :return: the result of the query + """ if args or kwargs: raise ValueError( f'args and kwargs not supported for `execute_query` on {type(self)}' diff --git a/docarray/index/backends/elasticv7.py b/docarray/index/backends/elasticv7.py index e77aedfc2b4..83c35606912 100644 --- a/docarray/index/backends/elasticv7.py +++ b/docarray/index/backends/elasticv7.py @@ -16,6 +16,7 @@ class ElasticV7DocIndex(ElasticDocIndex): def __init__(self, db_config=None, **kwargs): + """Initialize ElasticV7DocIndex""" from elasticsearch import __version__ as __es__version__ if __es__version__[0] > 7: @@ -31,6 +32,7 @@ def __init__(self, db_config=None, **kwargs): class QueryBuilder(ElasticDocIndex.QueryBuilder): def build(self, *args, **kwargs) -> Any: + """Build the elastic search v7 query object.""" if ( 'script_score' in self._query['query'] and 'bool' in self._query['query'] @@ -51,6 +53,14 @@ def find( limit: int = 10, num_candidates: Optional[int] = None, ): + """ + Find k-nearest neighbors of the query. + + :param query: query vector for KNN/ANN search. Has single axis. 
+ :param search_field: name of the field to search on + :param limit: maximum number of documents to return per query + :return: self + """ if num_candidates: warnings.warn('`num_candidates` is not supported in ElasticV7DocIndex') @@ -74,10 +84,14 @@ def find( @dataclass class DBConfig(ElasticDocIndex.DBConfig): + """Dataclass that contains all "static" configurations of ElasticV7DocIndex.""" + hosts: Union[str, List[str], None] = 'http://localhost:9200' # type: ignore @dataclass class RuntimeConfig(ElasticDocIndex.RuntimeConfig): + """Dataclass that contains all "dynamic" configurations of ElasticV7DocIndex.""" + def dense_vector_config(self): return {'dims': 128} @@ -86,6 +100,18 @@ def dense_vector_config(self): ############################################### def execute_query(self, query: Dict[str, Any], *args, **kwargs) -> Any: + """ + Execute a query on the ElasticV7DocIndex. + + Can take two kinds of inputs: + + 1. A native query of the underlying database. This is meant as a passthrough so that you + can enjoy any functionality that is not available through the Document index API. + 2. The output of this Document index' `QueryBuilder.build()` method.
+ + :param query: the query to execute + :return: the result of the query + """ if args or kwargs: raise ValueError( f'args and kwargs not supported for `execute_query` on {type(self)}' diff --git a/docarray/index/backends/hnswlib.py b/docarray/index/backends/hnswlib.py index acf9a4d2332..a15606661e2 100644 --- a/docarray/index/backends/hnswlib.py +++ b/docarray/index/backends/hnswlib.py @@ -77,6 +77,7 @@ def inner(self, *args, **kwargs): class HnswDocumentIndex(BaseDocIndex, Generic[TSchema]): def __init__(self, db_config=None, **kwargs): + """Initialize HnswDocumentIndex""" super().__init__(db_config=db_config, **kwargs) self._db_config = cast(HnswDocumentIndex.DBConfig, self._db_config) self._work_dir = self._db_config.work_dir @@ -138,6 +139,7 @@ def __init__(self, query: Optional[List[Tuple[str, Dict]]] = None): self._queries: List[Tuple[str, Dict]] = query or [] def build(self, *args, **kwargs) -> Any: + """Build the query object.""" return self._queries find = _collect_query_args('find') @@ -149,10 +151,14 @@ def build(self, *args, **kwargs) -> Any: @dataclass class DBConfig(BaseDocIndex.DBConfig): + """Dataclass that contains all "static" configurations of HnswDocumentIndex.""" + work_dir: str = '.' @dataclass class RuntimeConfig(BaseDocIndex.RuntimeConfig): + """Dataclass that contains all "dynamic" configurations of HnswDocumentIndex.""" + default_column_config: Dict[Type, Dict[str, Any]] = field( default_factory=lambda: { np.ndarray: { @@ -176,7 +182,13 @@ class RuntimeConfig(BaseDocIndex.RuntimeConfig): ############################################### def python_type_to_db_type(self, python_type: Type) -> Any: - """Map python type to database type.""" + """Map python type to database type. + Takes any python type and returns the corresponding database column type. + + :param python_type: a python type. + :return: the corresponding database column type, + or None if ``python_type`` is not supported.
+ """ for allowed_type in HNSWLIB_PY_VEC_TYPES: if issubclass(python_type, allowed_type): return np.ndarray @@ -188,7 +200,17 @@ def _index(self, column_data_dic, **kwargs): ... def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs): - """index a document into the store""" + """Index Documents into the index. + + !!! note + Passing a sequence of Documents that is not a DocList + (such as a List of Docs) comes at a performance penalty. + This is because the Index needs to check compatibility between itself and + the data. With a DocList as input this is a single check; for other inputs + compatibility needs to be checked for every Document individually. + + :param docs: Documents to index. + """ if kwargs: raise ValueError(f'{list(kwargs.keys())} are not valid keyword arguments') @@ -209,6 +231,20 @@ def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs): self._sqlite_conn.commit() def execute_query(self, query: List[Tuple[str, Dict]], *args, **kwargs) -> Any: + """ + Execute a query on the WeaviateDocumentIndex. + + Can take two kinds of inputs: + + 1. A native query of the underlying database. This is meant as a passthrough so that you + can enjoy any functionality that is not available through the Document index API. + 2. The output of this Document index' `QueryBuilder.build()` method. + + :param query: the query to execute + :param args: positional arguments to pass to the query + :param kwargs: keyword arguments to pass to the query + :return: the result of the query + """ if args or kwargs: raise ValueError( f'args and kwargs not supported for `execute_query` on {type(self)}' @@ -324,6 +360,9 @@ def _get_items(self, doc_ids: Sequence[str]) -> Sequence[TSchema]: return out_docs def num_docs(self) -> int: + """ + Get the number of documents. 
+ """ return self._get_num_docs_sqlite() ############################################### diff --git a/docarray/index/backends/qdrant.py b/docarray/index/backends/qdrant.py index 3acd09d6b68..a1f3f23ab1b 100644 --- a/docarray/index/backends/qdrant.py +++ b/docarray/index/backends/qdrant.py @@ -57,6 +57,7 @@ class QdrantDocumentIndex(BaseDocIndex, Generic[TSchema]): UUID_NAMESPACE = uuid.UUID('3896d314-1e95-4a3a-b45a-945f9f0b541d') def __init__(self, db_config=None, **kwargs): + """Initialize QdrantDocumentIndex""" super().__init__(db_config=db_config, **kwargs) self._db_config: QdrantDocumentIndex.DBConfig = cast( QdrantDocumentIndex.DBConfig, self._db_config @@ -79,6 +80,8 @@ def __init__(self, db_config=None, **kwargs): @dataclass class Query: + """Dataclass describing a query.""" + vector_field: Optional[str] vector_query: Optional[NdArray] filter: Optional[rest.Filter] @@ -98,6 +101,10 @@ def __init__( self._text_search_filters: List[Tuple[str, str]] = text_search_filters or [] def build(self, limit: int) -> 'QdrantDocumentIndex.Query': + """ + Build a query object for QdrantDocumentIndex. + :return: QdrantDocumentIndex.Query object + """ vector_query = None if len(self._vector_filters) > 0: # If there are multiple vector queries applied, we can average them and @@ -127,6 +134,13 @@ def build(self, limit: int) -> 'QdrantDocumentIndex.Query': def find( # type: ignore[override] self, query: NdArray, search_field: str = '' ) -> 'QdrantDocumentIndex.QueryBuilder': + """ + Find k-nearest neighbors of the query. + + :param query: query vector for search. Has single axis. 
+ :param search_field: field to perform search on + :return: QueryBuilder object + """ if self._vector_search_field and self._vector_search_field != search_field: raise ValueError( f'Trying to call .find for search_field = {search_field}, but ' @@ -143,6 +157,10 @@ def find( # type: ignore[override] def filter( # type: ignore[override] self, filter_query: rest.Filter ) -> 'QdrantDocumentIndex.QueryBuilder': + """Find documents in the index based on a filter query + :param filter_query: a filter + :return: QueryBuilder object + """ return QdrantDocumentIndex.QueryBuilder( vector_search_field=self._vector_search_field, vector_filters=self._vector_filters, @@ -153,6 +171,12 @@ def filter( # type: ignore[override] def text_search( # type: ignore[override] self, query: str, search_field: str = '' ) -> 'QdrantDocumentIndex.QueryBuilder': + """Find documents in the index based on a text search query + + :param query: The text to search for + :param search_field: name of the field to search on + :return: QueryBuilder object + """ return QdrantDocumentIndex.QueryBuilder( vector_search_field=self._vector_search_field, vector_filters=self._vector_filters, @@ -166,6 +190,8 @@ def text_search( # type: ignore[override] @dataclass class DBConfig(BaseDocIndex.DBConfig): + """Dataclass that contains all "static" configurations of QdrantDocumentIndex.""" + location: Optional[str] = None url: Optional[str] = None port: Optional[int] = 6333 @@ -189,6 +215,8 @@ class DBConfig(BaseDocIndex.DBConfig): @dataclass class RuntimeConfig(BaseDocIndex.RuntimeConfig): + """Dataclass that contains all "dynamic" configurations of QdrantDocumentIndex.""" + default_column_config: Dict[Type, Dict[str, Any]] = field( default_factory=lambda: { 'id': {}, # type: ignore[dict-item] @@ -198,6 +226,12 @@ class RuntimeConfig(BaseDocIndex.RuntimeConfig): ) def python_type_to_db_type(self, python_type: Type) -> Any: + """Map python type to database type. 
+ Takes any python type and returns the corresponding database column type. + + :param python_type: a python type. + :return: the corresponding database column type. + """ if any(issubclass(python_type, vt) for vt in QDRANT_PY_VECTOR_TYPES): return 'vector' @@ -243,6 +277,9 @@ def _index(self, column_to_data: Dict[str, Generator[Any, None, None]]): ) def num_docs(self) -> int: + """ + Get the number of documents. + """ return self._client.count(collection_name=self._db_config.collection_name).count def _del_items(self, doc_ids: Sequence[str]): @@ -278,6 +315,20 @@ def _get_items( return [self._convert_to_doc(point) for point in response] def execute_query(self, query: Union[Query, RawQuery], *args, **kwargs) -> DocList: + """ + Execute a query on the QdrantDocumentIndex. + + Can take two kinds of inputs: + + 1. A native query of the underlying database. This is meant as a passthrough so that you + can enjoy any functionality that is not available through the Document index API. + 2. The output of this Document index's `QueryBuilder.build()` method. 
+ + :param query: the query to execute + :param args: positional arguments to pass to the query + :param kwargs: keyword arguments to pass to the query + :return: the result of the query + """ if not isinstance(query, QdrantDocumentIndex.Query): points = self._execute_raw_query(query.copy()) elif query.vector_field: diff --git a/docarray/index/backends/weaviate.py b/docarray/index/backends/weaviate.py index c54d3e76f47..4fb44636860 100644 --- a/docarray/index/backends/weaviate.py +++ b/docarray/index/backends/weaviate.py @@ -71,6 +71,8 @@ class EmbeddedOptions: class WeaviateDocumentIndex(BaseDocIndex, Generic[TSchema]): def __init__(self, db_config=None, **kwargs) -> None: + """Initialize WeaviateDocumentIndex""" + self.embedding_column: Optional[str] = None self.properties: Optional[List[str]] = None # keep track of the column name that contains the bytes @@ -159,6 +161,16 @@ def _build_auth_credentials(self): return None def configure(self, runtime_config=None, **kwargs) -> None: + """ + Configure the WeaviateDocumentIndex. + You can either pass a config object to `config` or pass individual config + parameters as keyword arguments. + If a configuration object is passed, it will replace the current configuration. + If keyword arguments are passed, they will update the current configuration. 
+ + :param runtime_config: the configuration to apply + :param kwargs: individual configuration parameters + """ super().configure(runtime_config, **kwargs) self._configure_client() @@ -202,6 +214,8 @@ def _create_schema(self) -> None: @dataclass class DBConfig(BaseDocIndex.DBConfig): + """Dataclass that contains all "static" configurations of WeaviateDocumentIndex.""" + host: str = 'http://localhost:8080' index_name: str = 'Document' username: Optional[str] = None @@ -212,6 +226,8 @@ class DBConfig(BaseDocIndex.DBConfig): @dataclass class RuntimeConfig(BaseDocIndex.RuntimeConfig): + """Dataclass that contains all "dynamic" configurations of WeaviateDocumentIndex.""" + default_column_config: Dict[Any, Dict[str, Any]] = field( default_factory=lambda: { np.ndarray: {}, @@ -297,6 +313,14 @@ def find( limit: int = 10, **kwargs, ): + """ + Find k-nearest neighbors of the query. + + :param query: query vector for KNN/ANN search. Has single axis. + :param search_field: name of the field to search on + :param limit: maximum number of documents to return per query + :return: a named tuple containing `documents` and `scores` + """ self._logger.debug('Executing `find`') if search_field != '': raise ValueError( @@ -392,6 +416,18 @@ def find_batched( limit: int = 10, **kwargs, ) -> FindResultBatched: + """Find documents in the index using nearest neighbor search. + + :param queries: query vector for KNN/ANN search. + Can be either a tensor-like (np.array, torch.Tensor, etc.), + or a DocList. + If a tensor-like is passed, it should have shape (batch_size, vector_dim) + :param search_field: name of the field to search on. + Documents in the index are retrieved based on the similarity + of this field to the query.
+ :param limit: maximum number of documents to return per query + :return: a named tuple containing `documents` and `scores` + """ self._logger.debug('Executing `find_batched`') if search_field != '': raise ValueError( @@ -580,6 +616,20 @@ def _text_search_batched( return _FindResultBatched(list(docs), list(scores)) def execute_query(self, query: Any, *args, **kwargs) -> Any: + """ + Execute a query on the WeaviateDocumentIndex. + + Can take two kinds of inputs: + + 1. A native query of the underlying database. This is meant as a passthrough so that you + can enjoy any functionality that is not available through the Document index API. + 2. The output of this Document index's `QueryBuilder.build()` method. + + :param query: the query to execute + :param args: positional arguments to pass to the query + :param kwargs: keyword arguments to pass to the query + :return: the result of the query + """ da_class = DocList.__class_getitem__(cast(Type[BaseDoc], self._schema)) if isinstance(query, self.QueryBuilder): @@ -604,6 +654,9 @@ def f(doc): return self._client.query.raw(query) def num_docs(self) -> int: + """ + Get the number of documents. + """ index_name = self._db_config.index_name result = self._client.query.aggregate(index_name).with_meta_count().do() # TODO: decorator to check for errors @@ -612,7 +665,13 @@ def num_docs(self) -> int: return total_docs def python_type_to_db_type(self, python_type: Type) -> Any: - """Map python type to database type.""" + """Map python type to database type. + Takes any python type and returns the corresponding database column type. + + :param python_type: a python type. + :return: the corresponding database column type, + or None if ``python_type`` is not supported.
+ """ for allowed_type in WEAVIATE_PY_VEC_TYPES: if issubclass(python_type, allowed_type): return 'number[]' @@ -634,6 +693,10 @@ def python_type_to_db_type(self, python_type: Type) -> Any: raise ValueError(f'Unsupported column type for {type(self)}: {python_type}') def build_query(self) -> BaseDocIndex.QueryBuilder: + """ + Build a query for WeaviateDocumentIndex. + :return: QueryBuilder object + """ return self.QueryBuilder(self) def _get_embedding_field(self): @@ -670,6 +733,7 @@ def __init__(self, document_index): ] def build(self) -> Any: + """Build the query object.""" num_queries = len(self._queries) for i in range(num_queries): @@ -730,6 +794,14 @@ def find( score_name: Literal["certainty", "distance"] = "certainty", score_threshold: Optional[float] = None, ) -> Any: + """ + Find k-nearest neighbors of the query. + + :param query: query vector for search. Has single axis. + :param score_name: either `"certainty"` (default) or `"distance"` + :param score_threshold: the threshold of the score + :return: self + """ near_vector = { "vector": query, } @@ -745,6 +817,16 @@ def find_batched( score_name: Literal["certainty", "distance"] = "certainty", score_threshold: Optional[float] = None, ) -> Any: + """Find k-nearest neighbors of the query vectors. + + :param queries: query vector for KNN/ANN search. + Can be either a tensor-like (np.array, torch.Tensor, etc.) with a, + or a DocList. 
+ If a tensor-like is passed, it should have shape `(batch_size, vector_dim)` + :param score_name: either `"certainty"` (default) or `"distance"` + :param score_threshold: the threshold of the score + :return: self + """ adj_queries, adj_clauses = self._resize_queries_and_clauses( self._queries, queries ) @@ -764,12 +846,20 @@ def find_batched( return self def filter(self, where_filter) -> Any: + """Find documents in the index based on a filter query + :param where_filter: a filter + :return: self + """ where_filter = where_filter.copy() self._overwrite_id(where_filter) self._queries[0] = self._queries[0].with_where(where_filter) return self def filter_batched(self, filters) -> Any: + """Find documents in the index based on a filter query + :param filters: filters + :return: self + """ adj_queries, adj_clauses = self._resize_queries_and_clauses( self._queries, filters ) @@ -785,11 +875,23 @@ def filter_batched(self, filters) -> Any: return self def text_search(self, query, search_field) -> Any: + """Find documents in the index based on a text search query + + :param query: The text to search for + :param search_field: name of the field to search on + :return: self + """ bm25 = {"query": query, "properties": [search_field]} self._queries[0] = self._queries[0].with_bm25(**bm25) return self def text_search_batched(self, queries, search_field) -> Any: + """Find documents in the index based on a text search query + + :param queries: The texts to search for + :param search_field: name of the field to search on + :return: self + """ adj_queries, adj_clauses = self._resize_queries_and_clauses( self._queries, queries ) diff --git a/docarray/store/abstract_doc_store.py b/docarray/store/abstract_doc_store.py index 16c17227a64..df7788f584a 100644 --- a/docarray/store/abstract_doc_store.py +++ b/docarray/store/abstract_doc_store.py @@ -11,7 +11,7 @@ class AbstractDocStore(ABC): @staticmethod @abstractmethod def list(namespace: str, show_table: bool) -> List[str]: - """List all 
DocArrays in the specified backend at the namespace. + """List all DocLists in the specified backend at the namespace. :param namespace: The namespace to list :param show_table: If true, a table is printed to the console diff --git a/docarray/typing/proto_register.py b/docarray/typing/proto_register.py index ff7dd1038dd..700fe744ad8 100644 --- a/docarray/typing/proto_register.py +++ b/docarray/typing/proto_register.py @@ -15,17 +15,19 @@ def _register_proto( This will add the type key to the global registry of types key used in the proto serialization and deserialization. This is for internal usage only. - EXAMPLE USAGE + --- - .. code-block:: python + ```python + from docarray.typing.proto_register import register_proto + from docarray.typing.abstract_type import AbstractType - from docarray.typing.proto_register import register_proto - from docarray.typing.abstract_type import AbstractType + @register_proto(proto_type_name='my_type') + class MyType(AbstractType): + ... + ``` - @register_proto(proto_type_name='my_type') - class MyType(AbstractType): - ... + --- :param cls: the class to register :return: the class diff --git a/docarray/typing/tensor/audio/abstract_audio_tensor.py b/docarray/typing/tensor/audio/abstract_audio_tensor.py index b987b2addfd..213d69b2ee0 100644 --- a/docarray/typing/tensor/audio/abstract_audio_tensor.py +++ b/docarray/typing/tensor/audio/abstract_audio_tensor.py @@ -16,7 +16,7 @@ class AbstractAudioTensor(AbstractTensor, ABC): def to_bytes(self) -> 'AudioBytes': """ - Convert audio tensor to [`AudioBytes`][docarray.typrin.AudioBytes]. + Convert audio tensor to [`AudioBytes`][docarray.typing.AudioBytes]. 
""" from docarray.typing.bytes.audio_bytes import AudioBytes diff --git a/docs/.gitignore b/docs/.gitignore index 38f32345848..006c5fe7420 100644 --- a/docs/.gitignore +++ b/docs/.gitignore @@ -1,6 +1,6 @@ api/* proto/* -README.md +../README.md index.md CONTRIBUTING.md \ No newline at end of file diff --git a/docs/api_references/doc_index/backends/elastic.md b/docs/api_references/doc_index/backends/elastic.md new file mode 100644 index 00000000000..287d6c3d42a --- /dev/null +++ b/docs/api_references/doc_index/backends/elastic.md @@ -0,0 +1,3 @@ +# ElasticDocIndex + +::: docarray.index.backends.elastic.ElasticDocIndex diff --git a/docs/api_references/doc_index/backends/elastic7.md b/docs/api_references/doc_index/backends/elastic7.md new file mode 100644 index 00000000000..0838ce7e89c --- /dev/null +++ b/docs/api_references/doc_index/backends/elastic7.md @@ -0,0 +1,3 @@ +# ElasticV7DocIndex + +::: docarray.index.backends.elasticv7.ElasticV7DocIndex diff --git a/docs/api_references/doc_index/backends/hnswlib.md b/docs/api_references/doc_index/backends/hnswlib.md new file mode 100644 index 00000000000..dfb1d9b582c --- /dev/null +++ b/docs/api_references/doc_index/backends/hnswlib.md @@ -0,0 +1,3 @@ +# HnswDocumentIndex + +::: docarray.index.backends.hnswlib.HnswDocumentIndex diff --git a/docs/api_references/doc_index/backends/qdrant.md b/docs/api_references/doc_index/backends/qdrant.md new file mode 100644 index 00000000000..3a4d6e4c98e --- /dev/null +++ b/docs/api_references/doc_index/backends/qdrant.md @@ -0,0 +1,3 @@ +# QdrantDocumentIndex + +::: docarray.index.backends.qdrant.QdrantDocumentIndex \ No newline at end of file diff --git a/docs/api_references/doc_index/backends/weaviate.md b/docs/api_references/doc_index/backends/weaviate.md new file mode 100644 index 00000000000..86bdf435619 --- /dev/null +++ b/docs/api_references/doc_index/backends/weaviate.md @@ -0,0 +1,3 @@ +# WeaviateDocumentIndex + +::: docarray.index.backends.weaviate.WeaviateDocumentIndex \ No 
newline at end of file diff --git a/docs/api_references/doc_index/doc_index.md b/docs/api_references/doc_index/doc_index.md new file mode 100644 index 00000000000..0cbcbe8cb74 --- /dev/null +++ b/docs/api_references/doc_index/doc_index.md @@ -0,0 +1,3 @@ +# DocIndex + +::: docarray.index.abstract.BaseDocIndex diff --git a/docs/data_types/first_steps.md b/docs/data_types/first_steps.md new file mode 100644 index 00000000000..4119e9df1e8 --- /dev/null +++ b/docs/data_types/first_steps.md @@ -0,0 +1,14 @@ +# Intro + +With DocArray you can represent text, image, video, audio, and 3D meshes, whether separate, nested or combined, +and process them as a DocList. + +This section covers the following data types: + +- [Text](text/text.md) +- [Image](image/image.md) +- [Audio](audio/audio.md) +- [Video](video/video.md) +- [3D Mesh](3d_mesh/3d_mesh.md) +- [Table](table/table.md) +- [Multimodal data](multimodal/multimodal.md) \ No newline at end of file diff --git a/docs/how_to/multimodal_training_and_serving.md b/docs/how_to/multimodal_training_and_serving.md index b89b852297f..9f5a1f73e3c 100644 --- a/docs/how_to/multimodal_training_and_serving.md +++ b/docs/how_to/multimodal_training_and_serving.md @@ -42,7 +42,7 @@ Note that in this notebook by no means we aim at reproduce any CLIP results (our but we rather want to show how DocList datastructures help researchers and practitioners to write beautiful and pythonic multi-modal PyTorch code.
-```python tags=[] +```bash #!pip install "git+https://github.com/DocList/DocList@feat-rewrite-v2#egg=DocList[torch,image]" #!pip install torchvision #!pip install transformers @@ -83,14 +83,14 @@ The `BaseDoc` class allows users to define their own (nested, multi-modal) Documents Let's start by defining a few Documents to handle the different modalities that we will use during our training: ```python -from DocList import BaseDoc, DocList -from DocList.typing import TorchTensor, ImageUrl +from docarray import BaseDoc, DocList +from docarray.typing import TorchTensor, ImageUrl ``` Let's first create a Document for our Text modality. It will contain a number of `Tokens`, which we also define: ```python -from DocList.documents import TextDoc as BaseText +from docarray.documents import TextDoc as BaseText class Tokens(BaseDoc): @@ -102,43 +102,43 @@ class Tokens(BaseDoc): class Text(BaseText): tokens: Optional[Tokens] ``` -Notice the `TorchTensor` type. It is a thin wrapper around `torch.Tensor` that can be use like any other torch tensor, +Notice the [`TorchTensor`][docarray.typing.TorchTensor] type. It is a thin wrapper around `torch.Tensor` that can be used like any other torch tensor, but also enables additional features. One such feature is shape parametrization (`TorchTensor[48]`), which lets you hint and even enforce the desired shape of any tensor!
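As an aside to the doc change above: the shape-parametrization feature it mentions (`TorchTensor[48]`) can be sketched with plain `__class_getitem__`. This is a minimal stdlib illustration of the idea only, not DocArray's actual `TorchTensor` implementation, and the class name is made up:

```python
# Minimal sketch of the shape-parametrization idea behind `TorchTensor[48]`.
# Illustrative stdlib code only -- not DocArray's real implementation.
class ShapedList:
    _length = None  # expected length, set by parametrization

    def __class_getitem__(cls, length):
        # ShapedList[48] returns a subclass that remembers the expected length
        return type(f'ShapedList[{length}]', (cls,), {'_length': length})

    @classmethod
    def validate(cls, data):
        # enforce the hinted shape, analogous to validating a parametrized tensor field
        if cls._length is not None and len(data) != cls._length:
            raise ValueError(f'expected length {cls._length}, got {len(data)}')
        return list(data)
```

The parametrized subclass carries the shape hint at the type level, which is what lets validation happen on assignment.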
-To represent our image data, we use the `Image` Document that is included in DocList: +To represent our image data, we use the [`ImageDoc`][docarray.documents.ImageDoc] that is included in DocList: ```python -from DocList.documents import ImageDoc +from docarray.documents import ImageDoc ``` -Under the hood, an `Image` looks something like this (with the only main difference that it can take tensors from any +Under the hood, an `ImageDoc` looks something like this (with the only main difference that it can take tensors from any supported ML framework): ```python -# class Image(BaseDoc): -# url: Optional[ImageUrl] -# tensor: Optional[TorchTesor] -# embedding: Optional[TorchTensor] +class ImageDoc(BaseDoc): + url: Optional[ImageUrl] + tensor: Optional[TorchTensor] + embedding: Optional[TorchTensor] ``` -Actually, the `BaseText` above also alredy includes `tensor`, `url` and `embedding` fields, so we can use those on our +Actually, the `BaseText` above also already includes `tensor`, `url` and `embedding` fields, so we can use those on our `Text` Document as well. The final Document used for training here is the `PairTextImage`, which simply combines the Text and Image modalities: ```python class PairTextImage(BaseDoc): - text: Text - image: Image + text: TextDoc + image: ImageDoc ``` ## Create the Dataset -In this section we will create a multi-modal pytorch dataset around the Flick8k dataset using DocList. +In this section we will create a multi-modal PyTorch dataset around the Flickr8k dataset using `DocList`.
-We will use DocList data loading functionality to load the data and use Torchvision and Transformers to preprocess the data before feeding it to our deep learning model: +We will use `DocList` data loading functionality to load the data and use Torchvision and Transformers to preprocess the data before feeding it to our deep learning model: ```python from torch.utils.data import DataLoader, Dataset @@ -193,7 +193,7 @@ def get_flickr8k_da(file: str = "captions.txt", N: Optional[int] = None): In the `get_flickr8k_da` method we process the Flickr8k dataset into a `DocList`. -Now let's instantiate this dataset using the `MultiModalDataset` class. The constructor takes in the `da` and a dictionary of preprocessing transformations: +Now let's instantiate this dataset using the [`MultiModalDataset`][docarray.data.MultiModalDataset] class. The constructor takes in the `da` and a dictionary of preprocessing transformations: ```python da = get_flickr8k_da() @@ -201,7 +201,7 @@ preprocessing = {"image": VisionPreprocess(), "text": TextPreprocess()} ``` ```python -from DocList.data import MultiModalDataset +from docarray.data import MultiModalDataset dataset = MultiModalDataset[PairTextImage](da=da, preprocessing=preprocessing) loader = DataLoader( @@ -218,7 +218,7 @@ loader = DataLoader( In this section we create two encoders, one per modality (Text and Image). These encoders are normal PyTorch `nn.Module`s. 
-The only difference is that they operate on DocList rather that on torch.Tensor: +The only difference is that they operate on `DocList` rather than on `torch.Tensor`: ```python class TextEncoder(nn.Module): @@ -226,7 +226,7 @@ class TextEncoder(nn.Module): super().__init__() self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased") - def forward(self, texts: DocList[Text]) -> TorchTensor: + def forward(self, texts: DocList[TextDoc]) -> TorchTensor: last_hidden_state = self.bert( input_ids=texts.tokens.input_ids, attention_mask=texts.tokens.attention_mask ).last_hidden_state @@ -240,8 +240,8 @@ class TextEncoder(nn.Module): return masked_output.sum(dim=1) / attention_mask.sum(-1, keepdim=True) ``` -The `TextEncoder` takes a `DocList` of `Text`s as input, and returns an embedding `TorchTensor` as output. -`DocList` can be seen as a list of `Text` documents, and the encoder will treat it as one batch. +The `TextEncoder` takes a `DocList` of `TextDoc`s as input, and returns an embedding `TorchTensor` as output. +`DocList` can be seen as a list of `TextDoc` documents, and the encoder will treat it as one batch. ```python @@ -251,13 +251,13 @@ class VisionEncoder(nn.Module): super().__init__() self.backbone = torchvision.models.resnet18(pretrained=True) self.linear = nn.LazyLinear(out_features=768) - def forward(self, images: DocList[Image]) -> TorchTensor: + def forward(self, images: DocList[ImageDoc]) -> TorchTensor: x = self.backbone(images.tensor) return self.linear(x) ``` -Similarly, the `VisionEncoder` also takes a `DocList` of `Image`s as input, and returns an embedding `TorchTensor` as output. -However, it operates on the `image` attribute of each Document. +Similarly, the `VisionEncoder` also takes a `DocList` of `ImageDoc`s as input, and returns an embedding `TorchTensor` as output. +However, it operates on the `tensor` attribute of each Document. Now we can instantiate our encoders: @@ -306,12 +306,14 @@ which is exactly what our model can operate on.
So let's write a training loop and train our encoders: -```python tags=[] +```python from tqdm import tqdm with torch.autocast(device_type="cuda", dtype=torch.float16): for epoch in range(num_epoch): - for i, batch in tqdm(enumerate(loader), total=len(loader), desc=f"Epoch {epoch}"): + for i, batch in tqdm( + enumerate(loader), total=len(loader), desc=f"Epoch {epoch}" + ): batch.to(DEVICE) # DocList can be moved to device optim.zero_grad() @@ -342,7 +344,7 @@ FastAPI will be able to automatically translate it into a fully fledged API with ```python from fastapi import FastAPI -from DocList.base_doc import DocumentResponse +from docarray.base_doc import DocumentResponse ``` ```python diff --git a/docs/user_guide/representing/array.md b/docs/user_guide/representing/array.md index ac0311f00bf..9e31e63cbaf 100644 --- a/docs/user_guide/representing/array.md +++ b/docs/user_guide/representing/array.md @@ -1,19 +1,19 @@ # Array of documents -DocArray allows users to represent and manipulate multi-modal data to build AI applications (Generative AI, neural search, etc). +DocArray allows users to represent and manipulate multi-modal data to build AI applications such as neural search and generative AI. -As you have seen in the last section (LINK), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which represents a *single* document, a *single* datapoint. +As you have seen in the [previous section](array.md), the fundamental building block of DocArray is the [`BaseDoc`][docarray.base_doc.doc.BaseDoc] class which represents a *single* document, a *single* datapoint. However, in machine learning we often need to work with an *array* of documents, and an *array* of data points. -This section introduces the concept of `AnyDocArray` LINK which is an (abstract) collection of `BaseDoc`. This name of this library -- -`DocArray` -- is derived from this concept and it is short for `DocumentArray`. 
+This section introduces the concept of [`AnyDocArray`][docarray.array.AnyDocArray] which is an (abstract) collection of `BaseDoc`. The name of this library -- +`DocArray` -- is derived from this concept and is short for `DocumentArray`. ## AnyDocArray -`AnyDocArray` is an abstract class that represents an array of `BaseDoc`s which is not meant to be used directly, but to be subclassed. +[`AnyDocArray`][docarray.array.AnyDocArray] is an abstract class that represents an array of [`BaseDoc`][docarray.BaseDoc]s which is not meant to be used directly, but to be subclassed. -We provide two concrete implementations of `AnyDocArray` : +We provide two concrete implementations of [`AnyDocArray`][docarray.array.AnyDocArray]: - [`DocList`][docarray.array.doc_list.doc_list.DocList] which is a Python list of `BaseDoc`s - [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] which is a column based representation of `BaseDoc`s @@ -24,11 +24,11 @@ The spirit of `AnyDocArray`s is to extend the `BaseDoc` and `BaseModel` concepts ### Example -Before going into detail lets look at a code example. +Before going into detail let's look at a code example. !!! Note - `DocList` and `DocVec` are both `AnyDocArray`. The following section will use `DocList` as an example, but the same + `DocList` and `DocVec` are both `AnyDocArray`s. The following section will use `DocList` as an example, but the same applies to `DocVec`. First you need to create a `Doc` class, our data schema. Let's say you want to represent a banner with an image, a title and a description: @@ -253,7 +253,7 @@ the Array level. This is where the custom syntax `DocList[DocType]` comes into play. -!!! +!!! note `DocList[DocType]` creates a custom [`DocList`][docarray.array.doc_list.doc_list.DocList] that can only contain `DocType` Documents.
This syntax is inspired by more statically typed languages, and even though it might offend Python purists, we believe that it is a good user experience to think of an Array of `BaseDoc`s rather than just an array of non-homogenous `BaseDoc`s. @@ -267,7 +267,7 @@ That said, `AnyDocArray` can also be used to create a non-homogenous `AnyDocArra `DocVec` cannot store non-homogenous `BaseDoc` and always needs the `DocVec[DocType]` syntax. The usage of a non-homogenous `DocList` is similar to a normal Python list but still offers DocArray functionality -like serialization and sending over the wire (LINK). However, it won't be able to extend the API of your custom schema to the Array level. +like [serialization and sending over the wire](../sending/first_step.md). However, it won't be able to extend the API of your custom schema to the Array level. Here is how you can instantiate a non-homogenous `DocList`: diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index ffdb9275f67..08ae1d52f00 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -71,7 +71,7 @@ Another difference is that [BaseDoc][docarray.base_doc.doc.BaseDoc] has an `id` Let's say you want to represent a YouTube video in your application, perhaps to build a search system for YouTube videos. A YouTube video is not only composed of a video, but also has a title, description, thumbnail (and more, but let's keep it simple). -All of these elements are from different `modalities` LINK TO MODALITIES SECTION (not ready): the title and description are text, the thumbnail is an image, and the video in itself is, well, a video. +All of these elements are from different [`modalities`](../../data_types/first_steps.md): the title and description are text, the thumbnail is an image, and the video in itself is, well, a video. DocArray allows to represent all of this multimodal data in a single object. 
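As an aside to the doc change above: the nested multimodal document it describes (a YouTube video with title, description, thumbnail, and video) can be approximated with plain stdlib dataclasses. The class and field names below follow the prose but are otherwise hypothetical; the docs' real `YouTubeVideoDoc` is a `BaseDoc` with typed URL fields:

```python
from dataclasses import dataclass
from typing import Optional

# Plain-dataclass sketch of the nested multimodal document described above.
# Illustrative structure only -- not DocArray's BaseDoc-based YouTubeVideoDoc.
@dataclass
class Thumbnail:
    url: Optional[str] = None

@dataclass
class Video:
    url: Optional[str] = None

@dataclass
class YouTubeVideo:
    title: str
    description: str
    thumbnail: Thumbnail
    video: Video

# each modality lives in its own nested field of a single object
clip = YouTubeVideo(
    title='DocArray demo',
    description='A short walkthrough',
    thumbnail=Thumbnail(url='thumb.png'),
    video=Video(url='demo.mp4'),
)
```

The point of the `BaseDoc` version is that, unlike this sketch, it also gives you validation, typed tensors/URLs, and serialization for free.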
@@ -122,7 +122,7 @@ class YouTubeVideoDoc(BaseDoc): You now have `YouTubeVideoDoc` which is a pythonic representation of a YouTube video. -This representation can now be used to send (LINK) or to store (LINK) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation. +This representation can now be used to [send](../sending/first_step.md) or to [store](../storing/first_step.md) data. You can even use it directly to [train a machine learning](../../how_to/multimodal_training_and_serving.md) [Pytorch](https://pytorch.org/docs/stable/index.html) model on this representation. !!! note diff --git a/mkdocs.yml b/mkdocs.yml index fe4f600a3e4..100a4da336a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -73,6 +73,9 @@ plugins: options: docstring_style: sphinx inherited_members: true + show_root_toc_entry: false + show_root_heading: true + show_submodules: yes nav: - Home: README.md @@ -102,6 +105,7 @@ nav: - how_to/multimodal_training_and_serving.md - how_to/optimize_performance_with_id_generation.md - Data Types: + - data_types/first_steps.md - data_types/text/text.md - data_types/image/image.md - data_types/audio/audio.md
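A closing aside: the `python_type_to_db_type` docstrings added throughout this diff all describe the same first-match-wins mapping from python types to backend column types. A stdlib sketch of that pattern follows; the allowed-type tuple and column names are illustrative placeholders, not the backends' real `HNSWLIB_PY_VEC_TYPES`, `QDRANT_PY_VECTOR_TYPES`, or `WEAVIATE_PY_VEC_TYPES` constants:

```python
# First-match-wins type mapping, as documented by the python_type_to_db_type
# docstrings above. Type tuples and column names here are placeholders.
ALLOWED_VEC_TYPES = (list, tuple)

def python_type_to_db_type(python_type):
    # return the column type of the first matching allowed type, else None
    for allowed_type in ALLOWED_VEC_TYPES:
        if issubclass(python_type, allowed_type):
            return 'vector'
    if issubclass(python_type, str):
        return 'text'
    return None
```

Returning `None` (or raising, as Weaviate's backend does) for unsupported types is what the ":return: ... or None if ``python_type`` is not supported" lines refer to.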