From df30de83efd57a8c8716940a83e90b214872f940 Mon Sep 17 00:00:00 2001
From: Nicholas Dunham <11730795+NicholasDunham@users.noreply.github.com>
Date: Mon, 7 Nov 2022 20:50:45 -0800
Subject: [PATCH 1/2] docs: refactor getting started section

---
 docs/datatypes/index.md                       |   4 +-
 docs/fundamentals/document/construct.md       |   6 +-
 docs/fundamentals/documentarray/construct.md  |  20 +--
 docs/fundamentals/documentarray/index.md      |  13 +-
 .../documentarray/interaction-cloud.md        |  41 +++++
 .../documentarray/serialization.md            |  87 +++--------
 docs/get-started/first-steps.md               | 144 ++++++++++++++++++
 docs/get-started/install.md                   |  16 +-
 docs/index.md                                 |   3 +-
 9 files changed, 240 insertions(+), 94 deletions(-)
 create mode 100644 docs/fundamentals/documentarray/interaction-cloud.md
 create mode 100644 docs/get-started/first-steps.md

diff --git a/docs/datatypes/index.md b/docs/datatypes/index.md
index 6d0688793ca..cc5cb6a5ae6 100644
--- a/docs/datatypes/index.md
+++ b/docs/datatypes/index.md
@@ -1,6 +1,6 @@
-# Multimodal Data
+# Multimodal data

-Whether you’re working with text, image, video, audio, 3D meshes or the nested or the combined of them, you can always represent them as Documents and process them as DocumentArray. Here are some motivate examples:
+Whether you’re working with text, image, video, audio, 3D meshes, nested data, or some combination of these, you can always represent them as Documents and process them as DocumentArrays. Here are some motivating examples:


 ```{toctree}
diff --git a/docs/fundamentals/document/construct.md b/docs/fundamentals/document/construct.md
index fbf97b20043..2ad13112fcd 100644
--- a/docs/fundamentals/document/construct.md
+++ b/docs/fundamentals/document/construct.md
@@ -1,7 +1,7 @@
 (construct-doc)=
 # Construct

-Initializing a Document object is super easy. This chapter introduces the ways of constructing empty Document, filled Document. One can also construct Document from bytes, JSON, Protobuf message as introduced {ref}`in the next chapter`.
+Initializing a Document object is super easy. This chapter introduces the ways of constructing empty Documents and filled Documents. One can also construct Documents from bytes, JSON, and Protobuf messages, as introduced {ref}`in the next chapter`.

 ## Construct an empty Document

@@ -15,7 +15,7 @@ d = Document()

 ```

-Every Document will have a unique random `id` that helps you identify this Document. It can be used to {ref}`access this Document inside a DocumentArray`.
+Every Document has a unique random `id` that helps you identify the Document. It can be used to {ref}`access this Document inside a DocumentArray`.

 ````{tip}
 The random `id` is the hex value of [UUID1](https://docs.python.org/3/library/uuid.html#uuid.uuid1). To convert it into the string of UUID:
@@ -230,4 +230,4 @@ world

 ## What's next?

-One can also construct Document from bytes, JSON, Protobuf message. These methods are introduced {ref}`in the next chapter`.
+You can also construct Documents from bytes, JSON, and Protobuf messages. These methods are introduced {ref}`in the next chapter`.
diff --git a/docs/fundamentals/documentarray/construct.md b/docs/fundamentals/documentarray/construct.md
index 3455dc44473..97636f717c6 100644
--- a/docs/fundamentals/documentarray/construct.md
+++ b/docs/fundamentals/documentarray/construct.md
@@ -13,7 +13,7 @@ da = DocumentArray()

 ```

-Now you can use list-like interfaces such as `.append()` and `.extend()` as you would add elements to a Python List.
+Now you can use list-like interfaces such as `.append()` and `.extend()` as you would to add elements to a Python List.

 ```python
 da.append(Document(text='hello world!'))
@@ -24,7 +24,7 @@ da.extend([Document(text='hello'), Document(text='world!')])

 ```

-Directly printing a DocumentArray does not show you too much useful information, you can use {meth}`~docarray.array.mixins.plot.PlotMixin.summary`.
+Directly printing a DocumentArray doesn't show much useful information. Instead, you can use {meth}`~docarray.array.mixins.plot.PlotMixin.summary`.


 ```python
@@ -49,7 +49,7 @@ da.summary()

 ## Construct with empty Documents

-Like `numpy.zeros()`, you can quickly build a DocumentArray with only empty Documents:
+You can quickly build a DocumentArray with only empty Documents, similar to `numpy.zeros()`:

 ```python
 from docarray import DocumentArray
@@ -63,7 +63,7 @@ da = DocumentArray.empty(10)

 ## Construct from list-like objects

-You can construct DocumentArray from a `Sequence`, `List`, `Tuple` or `Iterator` that yields `Document` object.
+You can construct a DocumentArray from a `Sequence`, `List`, `Tuple`, or an `Iterator` that yields `Document` objects.

 ````{tab} From list of Documents
 ```python
@@ -90,7 +90,7 @@ da = DocumentArray((Document() for _ in range(10)))
 ````


-As DocumentArray itself is also a "list-like object that yields `Document`", you can also construct DocumentArray from another DocumentArray:
+As DocumentArray itself is also a "list-like object that yields `Document` objects", you can also construct a DocumentArray from another DocumentArray:

 ```python
 da = DocumentArray(...)
@@ -98,7 +98,7 @@ da1 = DocumentArray(da)
 ```


-## Construct from multiple DocumentArray
+## Construct from multiple DocumentArrays

 You can use `+` or `+=` to concatenate DocumentArrays together:

@@ -135,7 +135,7 @@ da = DocumentArray(d1)

 ## Deep copy on elements

-Note that, as in Python list, adding Document object into DocumentArray only adds its memory reference. The original Document is *not* copied. If you change the original Document afterwards, then the one inside DocumentArray will also change. Here is an example,
+Note that, as in Python list, adding a Document object into DocumentArray only adds its memory reference. The original Document is *not* copied. If you change the original Document afterwards, then the one inside the DocumentArray will also change. Here is an example:

 ```python
 from docarray import DocumentArray, Document
@@ -189,7 +189,7 @@ hello

 ## Construct from local files

-You may recall the common pattern that {ref}`I mentioned here`. With {meth}`~docarray.document.generators.from_files` One can easily construct a DocumentArray object with all file paths defined by a glob expression.
+You may recall the common pattern that {ref}`I mentioned here`. With {meth}`~docarray.document.generators.from_files`, one can easily construct a DocumentArray object with all file paths defined by a glob expression.

 ```python
 from docarray import DocumentArray
@@ -199,11 +199,11 @@ da_png = DocumentArray.from_files('images/*.png')
 da_all = DocumentArray.from_files(['images/**/*.png', 'images/**/*.jpg', 'images/**/*.jpeg'])
 ```

-This will scan all filenames that match the expression and construct Documents with filled `.uri` attribute. You can control if to read each as text or binary with `read_mode` argument.
+This will scan all filenames that match the expression and construct Documents with filled `.uri` attributes. You can specify whether to read each as text or binary with the `read_mode` argument.




 ## What's next?

-In the next chapter, we will see how to construct DocumentArray from binary bytes, JSON, CSV, dataframe, Protobuf message.
\ No newline at end of file
+In the next chapter, we will see how to construct DocumentArrays from binary bytes, JSON, CSV, dataframe, and Protobuf message.
\ No newline at end of file
diff --git a/docs/fundamentals/documentarray/index.md b/docs/fundamentals/documentarray/index.md
index 2e01be4c281..c7ddb495ac9 100644
--- a/docs/fundamentals/documentarray/index.md
+++ b/docs/fundamentals/documentarray/index.md
@@ -1,7 +1,7 @@
 (documentarray)=
 # DocumentArray

-This is a Document, we already know it can be a mix in data types and nested in structure:
+This is a Document. We already know it can be a mix of data types and nested in structure:

 ```{figure} images/docarray-single.svg
 :width: 30%
@@ -14,15 +14,15 @@ Then this is a DocumentArray:
 ```


-{class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** when working with multiple Documents.
+{class}`~docarray.array.document.DocumentArray` is a list-like container of {class}`~docarray.document.Document` objects. It is **the best way** of working with multiple Documents.

-In a nutshell, you can simply consider it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use Python `list`, you already know how to use DocumentArray.
+In a nutshell, you can simply think of it as a Python `list`, as it implements **all** list interfaces. That is, if you know how to use a Python `list`, you already know how to use DocumentArray.

-It is also powerful as Numpy `ndarray` and Pandas `DataFrame`, allowing you to efficiently [access elements](access-elements.md) and [attributes](access-attributes.md) of contained Documents.
+It is also as powerful as Numpy's `ndarray` and Pandas's `DataFrame`, allowing you to efficiently access [elements](access-elements.md) and [attributes](access-attributes.md) of contained Documents.

-What makes it more exciting is those advanced features of DocumentArray. These features greatly accelerate data scientists work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching etc.
+What makes it more exciting is the advanced features of DocumentArray. These features greatly accelerate data scientists' work on accessing nested elements, evaluating, visualizing, parallel computing, serializing, matching etc.

-Finally, if your data is too big to fit into memory, you can simply switch to an {ref}`on-disk/remote document store`. All API and user experiences remain the same. No need to learn anything else.
+Finally, if your data is too big to fit into memory, you can simply switch to an {ref}`on-disk/remote document store`. All APIs and user experiences remain the same. No need to learn anything else.

 ## What's next?

@@ -43,4 +43,5 @@ embedding
 matching
 subindex
 evaluation
+interaction-cloud
 ```
diff --git a/docs/fundamentals/documentarray/interaction-cloud.md b/docs/fundamentals/documentarray/interaction-cloud.md
new file mode 100644
index 00000000000..23001abc5f8
--- /dev/null
+++ b/docs/fundamentals/documentarray/interaction-cloud.md
@@ -0,0 +1,41 @@
+(interaction-cloud)=
+# Interaction with Jina AI Cloud
+
+```{important}
+This feature requires the `rich` and `requests` dependencies. You can do `pip install "docarray[full]"` to install them.
+```
+
+The {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push` and {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.pull` methods allow you to serialize a DocumentArray object to Jina AI Cloud and share it across machines.
+
+Imagine you're working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you have everything you need in a DocumentArray. You can easily store it to the cloud via:
+
+```python
+from docarray import DocumentArray
+
+da = DocumentArray(...) # heavy lifting, processing, GPU tasks...
+da.push('myda123', show_progress=True)
+```
+
+```{figure} images/da-push.png
+
+```
+
+Then on your local laptop, simply pull it:
+
+```python
+from docarray import DocumentArray
+
+da = DocumentArray.pull('myda123', show_progress=True)
+```
+
+Now you can continue your work locally, analyzing `da` or visualizing it. Your friends & colleagues who know the token `myda123` can also pull that DocumentArray. It's useful when you want to quickly share the results with your colleagues & friends.
+
+The maximum size of an upload is 4GB under the `protocol='protobuf'` and `compress='gzip'` settings. The lifetime of an upload is one week after its creation.
+
+To avoid unnecessary downloads when the upstream DocumentArray is unchanged, you can add `DocumentArray.pull(..., local_cache=True)`.
+
+```{seealso}
+DocArray allows pushing, pulling, and managing your DocumentArrays in Jina AI Cloud.
+Read more about how to manage your data in Jina AI Cloud, using either the console or the DocArray Python API, in the
+{ref}`Data Management section `.
+```
diff --git a/docs/fundamentals/documentarray/serialization.md b/docs/fundamentals/documentarray/serialization.md
index c6d2c568fc8..5d63a2eb2f5 100644
--- a/docs/fundamentals/documentarray/serialization.md
+++ b/docs/fundamentals/documentarray/serialization.md
@@ -1,9 +1,9 @@
 (docarray-serialization)=
 # Serialization

-DocArray is designed to be "ready-to-wire" at anytime. Serialization is important.
-DocumentArray provides multiple serialization methods that allows one transfer DocumentArray object over network and across different microservices.
-Moreover, there is the ability to store/load `DocumentArray` objects to/from disk.
+DocArray is designed to be "ready-to-wire" at any time. Serialization is important.
+DocumentArray provides multiple serialization methods that allow one to transfer DocumentArray objects over the network and across different microservices.
+Moreover, it provides the ability to store/load `DocumentArray` objects to/from disk.

 - JSON string: `.from_json()`/`.to_json()`
 - Pydantic model: `.from_pydantic_model()`/`.to_pydantic_model()`
@@ -13,7 +13,6 @@ Moreover, there is the ability to store/load `DocumentArray` objects to/from dis
 - Protobuf Message: `.from_protobuf()`/`.to_protobuf()`
 - Python List: `.from_list()`/`.to_list()`
 - Pandas Dataframe: `.from_dataframe()`/`.to_dataframe()`
-- Cloud: `.push()`/`.pull()`



@@ -22,11 +21,11 @@ Moreover, there is the ability to store/load `DocumentArray` objects to/from dis


 ```{tip}
-If you are building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, it is highly recommended to check out {ref}`fastapi-support` and follow the methods there.
+If you are building a webservice and want to use JSON for passing DocArray objects, then data validation and field-filtering can be crucial. In this case, we highly recommend checking out {ref}`fastapi-support` and following the methods there.
 ```

 ```{important}
-Depending on which protocol you use, this feature requires `pydantic` or `protobuf` dependency. You can do `pip install "docarray[common]"` to install it.
+Depending on which protocol you use, this feature requires the `pydantic` or `protobuf` dependency. You can do `pip install "docarray[common]"` to install both.
 ```


@@ -77,10 +76,10 @@ More parameters and usages can be found in the Document-level {ref}`doc-json`.
 ## From/to bytes

 ```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install it.
+Depending on the values of your `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install them.
 ```

-Serialization into bytes often yield more compact representation than in JSON. Similar to {ref}`the Document serialization`, DocumentArray can be serialized with different `protocol` and `compress` combinations. In its most simple form,
+Serialization into bytes often yields more compact representation than JSON. Similar to {ref}`the Document serialization`, DocumentArray can be serialized with different `protocol` and `compress` combinations. In its most simple form,

 ```python
 from docarray import DocumentArray, Document
@@ -116,23 +115,23 @@ da_r.summary()
 ```

 ```{tip}
-If you go with default `protcol` and `compress` settings, you can simply use `bytes(da)`, which is more Pythonic.
+If you go with the default `protcol` and `compress` settings, you can simply use `bytes(da)`, which is more Pythonic.
 ```

 The table below summarize the supported serialization protocols and compressions:

 | `protocol=...` | Description | Remarks |
 |--------------------------|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
-| `pickle-array` (default) | Serialize the whole array in one-shot using Python `pickle` | Often fastest. Not portable to other languages. Insecure in production. |
-| `protobuf-array` | Serialize the whole array using [`DocumentArrayProto`](../../../proto/#docarray.DocumentArrayProto). | Portable to other languages if they implement `DocumentArrayProto`. 2GB max-size (pre-compression) restriction by Protobuf. |
-| `pickle` | Serialize elements one-by-one using Python `pickle`. | Allow streaming. Not portable to other languages. Insecure in production. |
-| `protobuf` | Serialize elements one-by-one using [`DocumentProto`](../../../proto/#docarray.DocumentProto). | Allow streaming. Portable to other languages if they implement `DocumentProto`. No max-size restriction |
+| `pickle-array` (default) | Serialize the whole array in one shot using Python `pickle` | Often fastest. Not portable to other languages. Insecure in production. |
+| `protobuf-array` | Serialize the whole array using [`DocumentArrayProto`](../../../proto/#docarray.DocumentArrayProto). | Portable to other languages if they implement `DocumentArrayProto`. 2GB max size (pre-compression) restriction by Protobuf. |
+| `pickle` | Serialize elements one-by-one using Python `pickle`. | Allows streaming. Not portable to other languages. Insecure in production. |
+| `protobuf` | Serialize elements one-by-one using [`DocumentProto`](../../../proto/#docarray.DocumentProto). | Allows streaming. Portable to other languages if they implement `DocumentProto`. No max size restriction |

-For compressions, the following algorithms are supported: `lz4`, `bz2`, `lzma`, `zlib`, `gzip`. The most frequently used ones are `lz4` (fastest) and `gzip` (most widely used).
+For compression, the following algorithms are supported: `lz4`, `bz2`, `lzma`, `zlib`, `gzip`. The most frequently used ones are `lz4` (fastest) and `gzip` (most widely used).

-If you specified non-default `protocol` and `compress` in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, you will need to specify the same in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_bytes`.
+If you specified non-default `protocol` and `compress` settings in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_bytes`, you will need to specify the same in {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_bytes`.

-Depending on the use cases, you can choose the one works best for you. Here is a benchmark on serializing a DocumentArray with one million near-empty Documents (i.e. init with `DocumentArray.empty(...)` where each Document has only `id`).
+Depending on your use case, you can choose the one that works best for you. Here is a benchmark on serializing a DocumentArray with one million near-empty Documents (i.e. init with `DocumentArray.empty(...)` where each Document has only `id`).

 ```{figure} images/benchmark-size.svg
 ```
@@ -142,12 +141,12 @@ Depending on the use cases, you can choose the one works best for you. Here is a

 The benchmark was conducted [on the codebase of Jan. 5, 2022](https://github.com/jina-ai/docarray/tree/a56067e486d2318e05bcf6088bd1436040107ad2).

-Depending on how you want to interpret the results, the figures above can be an over-estimation/under-estimation of the serialization latency: one may argue that near-empty Documents are not realistic, but serializing a DocumentArray with one million Documents is also unreal. In practice, DocumentArray passing across microservices are relatively small, say at thousands, for better overlapping the network latency and computational overhead.
+Depending on how you want to interpret the results, the figures above can be an over-estimation/under-estimation of the serialization latency: one may argue that near-empty Documents are not realistic, but serializing a DocumentArray with one million Documents is also unreal. In practice, DocumentArrays passing across microservices are relatively small, say in the thousands of Documents, for better overlapping the network latency and computational overhead.

 (wire-format)=
 ### Wire format of `pickle` and `protobuf`

-When set `protocol=pickle` or `protobuf`, the resulting bytes look like the following:
+When you set `protocol=pickle` or `protobuf`, the resulting bytes look like the following:

 ```text
 --------------------------------------------------------------------------------------------------------
@@ -167,7 +166,7 @@ The pattern `dock_bytes` and `dock.to_bytes` is repeated `len(docs)` times.

 ### From/to disk

-If you want to store a `DocumentArray` to disk you can use `.save_binary(filename, protocol, compress)` where `protocol` and `compress` refer to the protocol and compression methods used to serialize the data.
+If you want to store a `DocumentArray` to disk you can use `.save_binary(filename, protocol, compress)`, where `protocol` and `compress` refer to the protocol and compression methods used to serialize the data.
 If you want to load a `DocumentArray` from disk you can use `.load_binary(filename, protocol, compress)`.

 For example, the following snippet shows how to save/load a `DocumentArray` in `my_docarray.bin`.
@@ -202,7 +201,7 @@ da_rec.summary()

 ```

-User do not need to remember the protocol and compression methods on loading. You can simply specify `protocol` and `compress` in the file extension via:
+You do not need to remember the protocol and compression methods on loading. You can simply specify `protocol` and `compress` in the file extension via:

 ```text
 filename.protobuf.gzip
@@ -214,7 +213,7 @@ filename.protobuf.gzip
 ```


-When a filename is given as the above format in `.save_binary`, you can simply load it back with `.load_binary` without specifying the protocol and compress method again.
+When a filename is given in the above format in `.save_binary`, you can simply load it back with `.load_binary` without specifying the protocol and compression methods again.

 The previous code snippet can be simplified to

@@ -245,10 +244,10 @@ for d in da_generator:
 ## From/to base64

 ```{important}
-Depending on your values of `protocol` and `compress` arguments, this feature may require `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install it.
+Depending on the values of your `protocol` and `compress` arguments, this feature may require the `protobuf` and `lz4` dependencies. You can do `pip install "docarray[full]"` to install them.
 ```

-Serialize into base64 can be useful when binary string is not allowed, e.g. in REST API. This can be easily done via {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_base64` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_base64`. Like in binary serialization, one can specify `protocol` and `compress`:
+Serialization into base64 can be useful when a binary string is not allowed, e.g. in a REST API. This can be easily done via {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.to_base64` and {meth}`~docarray.array.mixins.io.binary.BinaryIOMixin.from_base64`. Like in binary serialization, one can specify `protocol` and `compress`:

 ```python
 from docarray import DocumentArray
@@ -334,7 +333,7 @@ More parameters and usages can be found in the Document-level {ref}`doc-dict`.
 ## From/to dataframe

 ```{important}
-This feature requires `pandas` dependency. You can do `pip install "docarray[full]"` to install it.
+This feature requires the `pandas` dependency. You can do `pip install "docarray[full]"` to install it.
 ```

 One can convert between a DocumentArray object and a `pandas.dataframe` object.
@@ -358,43 +357,3 @@ To build a DocumentArray from dataframe,
 df = ...
 da = DocumentArray.from_dataframe(df)
 ```
-
-## From/to cloud
-
-```{important}
-This feature requires `rich` and `requests` dependency. You can do `pip install "docarray[full]"` to install it.
-```
-
-{meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push` and {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.pull` allows you to serialize a DocumentArray object to Jina Cloud and share it across machines.
-
-Considering you are working on a GPU machine via Google Colab/Jupyter. After preprocessing and embedding, you got everything you need in a DocumentArray. You can easily store it to the cloud via:
-
-```python
-from docarray import DocumentArray
-
-da = DocumentArray(...) # heavylifting, processing, GPU task, ...
-da.push('myda123', show_progress=True)
-```
-
-```{figure} images/da-push.png
-```
-
-Then on your local laptop, simply pull it:
-
-```python
-from docarray import DocumentArray
-
-da = DocumentArray.pull('myda123', show_progress=True)
-```
-
-Now you can continue the work at local, analyzing `da` or visualizing it. Your friends & colleagues who know the token `myda123` can also pull that DocumentArray. It's useful when you want to quickly share the results with your colleagues & friends.
-
-The maximum size of an upload is 4GB under the `protocol='protobuf'` and `compress='gzip'` setting. The lifetime of an upload is one week after its creation.
-
-To avoid unnecessary download when upstream DocumentArray is unchanged, you can add `DocumentArray.pull(..., local_cache=True)`.
-
-```{seealso}
-DocArray allows pushing, pulling, and managing your DocumentArrays in Jina AI Cloud.
-Read more about how to manage your data in Jina AI Cloud, using either the console or the DocArray Python API, in the
-{ref}`Data Management section `.
-```
diff --git a/docs/get-started/first-steps.md b/docs/get-started/first-steps.md
new file mode 100644
index 00000000000..82a9d95066c
--- /dev/null
+++ b/docs/get-started/first-steps.md
@@ -0,0 +1,144 @@
+(first-steps)=
+# First steps
+
+## Creating Documents
+
+You can create a Document by creating a new instance of the `Document` class, and optionally pass arguments to the constructor.
+
+```python
+from docarray import Document
+import numpy
+
+d0 = Document()
+d1 = Document(text='hello')
+d2 = Document(blob=b'\f1')
+d3 = Document(tensor=numpy.array([1, 2, 3]))
+d4 = Document(
+    uri='https://docarray.jina.ai',
+    mime_type='text/plain',
+    granularity=1,
+    adjacency=3,
+    tags={'foo': 'bar'},
+)
+```
+
+```text
+
+
+
+
+
+```
+
+Every Document has a unique random `id` that helps you identify the Document. It can be used to {ref}`access this Document inside a DocumentArray`.
+
+````{tip}
+When you `print()` a Document, you get a string representation such as ``. It shows the non-empty attributes of that Document as well as its `id`, which helps you understand the content of that Document.
+
+```text
+
+          ^^^^^^^^^^^^^^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                |                          |
+                |                          |
+        non-empty fields                   |
+                                      Document.id
+```
+````
+
+You can learn more about constructing new Documents in the {ref}`Construct chapter`.
+
+You can also construct Documents from bytes, JSON, and Protobuf messages. These methods are introduced in the {ref}`Serialization chapter`.
+
+One of the most powerful features of Documents is their ability to hold nested data. This is explained further in the {ref}`Dataclass section`.
+
+## Constructing DocumentArrays
+
+A DocumentArray is a list-like container of Document objects. To create an empty DocumentArray:
+
+```python
+from docarray import Document, DocumentArray
+
+da = DocumentArray()
+```
+
+```text
+
+```
+
+Now you can use list-like interfaces such as `.append()` and `.extend()` as you would to add elements to a Python list—in fact, DocumentArray implements **all** list interfaces. This means that if you know how to use a Python `list`, you already know how to use a DocumentArray.
+
+```python
+da.append(Document(text='hello world!'))
+da.extend([Document(text='hello'), Document(text='world!')])
+```
+
+```text
+
+```
+
+Directly printing a DocumentArray doesn't show much useful information. Instead, you can use {meth}`~docarray.array.mixins.plot.PlotMixin.summary`.
+
+
+```python
+da.summary()
+```
+
+```text
+              Documents Summary
+
+  Type                   DocumentArrayInMemory
+  Length                 3
+  Homogenous Documents   True
+  Common Attributes      ('id', 'text')
+  Multimodal dataclass   False
+
+                     Attributes Summary
+
+  Attribute   Data type   #Unique values   Has empty value
+ ──────────────────────────────────────────────────────────
+  id          ('str',)    3                False
+  text        ('str',)    3                False
+```
+
+## Serializing DocumentArrays
+
+You can serialize your DocumentArray in a variety of ways:
+
+```python
+da.to_json()
+
+da.save_binary('my_docarray.bin', protocol='protobuf', compress='lz4')
+
+da.to_dataframe()
+```
+
+These and other serialization formats are detailed in the {ref}`Serialization chapter`.
+
+## Interaction with Jina AI Cloud
+
+```{important}
+This feature requires the `rich` and `requests` dependencies. You can do `pip install "docarray[full]"` to install them.
+```
+
+The {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.push` and {meth}`~docarray.array.mixins.io.pushpull.PushPullMixin.pull` methods allow you to serialize a DocumentArray object to Jina AI Cloud and share it across machines.
+
+```python
+from docarray import DocumentArray
+
+da = DocumentArray(...) # heavy lifting, processing, GPU tasks...
+da.push('myda123', show_progress=True)
+```
+
+```{figure} ../documentarray/images/da-push.png
+
+```
+
+Then on your local laptop, simply pull it:
+
+```python
+from docarray import DocumentArray
+
+da = DocumentArray.pull('myda123', show_progress=True)
+```
+
+Further details are available in the {ref}`Interaction with Jina AI Cloud chapter`.
\ No newline at end of file
diff --git a/docs/get-started/install.md b/docs/get-started/install.md
index c16c9b98558..85c66cc099c 100644
--- a/docs/get-started/install.md
+++ b/docs/get-started/install.md
@@ -11,7 +11,7 @@ Make sure you have Python 3.7+ and `numpy` installed on Linux/Mac/Windows:
 pip install docarray
 ```

-No extra dependency will be installed.
+No extra dependencies will be installed.
 ````

 ````{tab} Basic install via Conda
@@ -20,7 +20,7 @@ No extra dependency will be installed.
 conda install -c conda-forge docarray
 ```

-No extra dependency will be installed.
+No extra dependencies will be installed.
 ````

 ````{tab} Common install
@@ -84,7 +84,7 @@ This will install all requirements for reproducing tests on your local dev envir

 ## On Apple Silicon

-If you own a MacOS device with an Apple Silicon M1/M2 chip, you can run DocArray **natively** on it (instead of running under Rosetta) and enjoy much better performance. This section summarizes how to install DocArray on Apple Silicon device.
+If you own a MacOS device with an Apple silicon M1/M2 chip, you can run DocArray **natively** on it (instead of running under Rosetta) and enjoy much better performance. This section summarizes how to install DocArray on Apple Silicon devices.

 ### Check terminal and device

@@ -103,7 +103,7 @@ arm64

 ### Install Homebrew

-`brew` is a package manager for macOS. If you already install it you need to confirm it is actually installed for Apple Silicon not for Rosetta. To check that, run
+`brew` is a package manager for macOS. If you've already installed it you need to confirm it is correctly installed for Apple silicon, not for Rosetta. To check that, run

 ```bash
 which brew
@@ -113,7 +113,7 @@ which brew
 /opt/homebrew/bin/brew
 ```

-If you find it is installed under `/usr/local/` instead of `/opt/homebrew/`, it means your `brew` is installed for Rosetta not for Apple Silicon. You need to reinstall it. [Here is an article on how to do it](https://apple.stackexchange.com/a/410829).
+If you find it is installed under `/usr/local/` instead of `/opt/homebrew/`, it means your `brew` is installed for Rosetta, not for Apple Silicon. You need to reinstall it. [Here is an article on how to do it](https://apple.stackexchange.com/a/410829).

 ```{danger}
 Reinstalling `brew` can be a destructive operation. Please make sure you have backed up your data before proceeding.
@@ -125,7 +125,7 @@ To (re)install brew, run
 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
 ```

-You may want to observe the output to check if it contains `/opt/homebrew` to make sure you are installing for Apple Silicon.
+You may want to observe the output to make sure it contains `/opt/homebrew`.

 ### Install Python

@@ -152,7 +152,7 @@ brew install python3

 As of Aug 2022, this will install Python 3.10 natively for Apple Silicon.

-Make sure to note down where `python` and `pip` are installed to. In this example, they are installed to `/opt/homebrew/bin/python3` and `/opt/homebrew/opt/python@3.10/libexec/bin/pip` respectively.
+Make sure to note down where `python` and `pip` are installed to. In this example, they are installed to `/opt/homebrew/bin/python3` and `/opt/homebrew/opt/python@3.10/libexec/bin/pip`, respectively.

 ### Install dependencies wheels

@@ -171,7 +171,7 @@ Now we can install Jina via `pip`. Note you need to use the right one:
 ```


-Congratulations! You have successfully installed Jina on Apple Silicon.
+Congratulations! You have successfully installed Jina on Apple silicon.


 ````{tip}
diff --git a/docs/index.md b/docs/index.md
index 5f278236c15..c5c099272a9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -30,7 +30,7 @@ get-started/what-is
 :hidden:

 get-started/install
-datatypes/index
+get-started/first-steps
 ```

 ```{toctree}
@@ -39,6 +39,7 @@ datatypes/index

 fundamentals/document/index
 fundamentals/documentarray/index
+datatypes/index
 fundamentals/dataclass/index
 advanced/document-store/index
 fundamentals/cloud-support/index

From 5ae9c7749666095bf20e0396fdf96905dcaaba8b Mon Sep 17 00:00:00 2001
From: Nicholas Dunham <11730795+NicholasDunham@users.noreply.github.com>
Date: Tue, 8 Nov 2022 07:58:32 -0800
Subject: [PATCH 2/2] docs: fix typo

Co-authored-by: Joan Fontanals
Signed-off-by: Nicholas Dunham <11730795+NicholasDunham@users.noreply.github.com>
---
 docs/fundamentals/documentarray/serialization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/fundamentals/documentarray/serialization.md b/docs/fundamentals/documentarray/serialization.md
index 5d63a2eb2f5..9105a16a851 100644
--- a/docs/fundamentals/documentarray/serialization.md
+++ b/docs/fundamentals/documentarray/serialization.md
@@ -115,7 +115,7 @@ da_r.summary()
 ```

 ```{tip}
-If you go with the default `protcol` and `compress` settings, you can simply use `bytes(da)`, which is more Pythonic.
+If you go with the default `protocol` and `compress` settings, you can simply use `bytes(da)`, which is more Pythonic.
 ```

 The table below summarize the supported serialization protocols and compressions: