From 60ef7fdce0009361cda2de2910c15231782376bb Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 11:03:36 +0200 Subject: [PATCH 1/8] fix: rename files with very similar and not useful names Signed-off-by: Alex C-G --- docs/user_guide/storing/{first_steps.md => docindex.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/user_guide/storing/{first_steps.md => docindex.md} (100%) diff --git a/docs/user_guide/storing/first_steps.md b/docs/user_guide/storing/docindex.md similarity index 100% rename from docs/user_guide/storing/first_steps.md rename to docs/user_guide/storing/docindex.md From aed6574e1f6192d89f2a39acbc73c487c14b05ce Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 11:20:22 +0200 Subject: [PATCH 2/8] docs(store): fix up english Signed-off-by: Alex C-G --- docs/user_guide/storing/docindex.md | 116 +++++++++++++------------- docs/user_guide/storing/first_step.md | 22 ++--- 2 files changed, 68 insertions(+), 70 deletions(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index 79cf1384ec1..fd2ae18111f 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -2,25 +2,25 @@ A Document Index lets you store your Documents and search through them using vector similarity. -This is useful if you want to store a bunch of data, and at a later point retrieve Documents that are similar to +This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to some query that you provide. -Concrete examples where this is relevant are neural search application, Augmenting LLMs and Chatbots with domain knowledge ([Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401)), +Relevant concrete examples are neural search applications, augmenting LLMs and chatbots with domain knowledge ([Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401)), or recommender systems. !!! 
question "How does vector similarity search work?" Without going into too much detail, the idea behind vector similarity search is the following: - You represent every data point that you have (in our case, a Document) as a _vector_, or _embedding_. + You represent every data point that you have (in our case, a document) as a _vector_, or _embedding_. This vector should represent as much semantic information about your data as possible: Similar data points should be represented by similar vectors. These vectors (embeddings) are usually obtained by passing the data through a suitable neural network that has been trained to produce such semantic representations - this is the _encoding_ step. - Once you have your vector that represent your data, you can store them, for example in a vector database. + Once you have your vectors that represent your data, you can store them, for example in a vector database. To perform similarity search, you take your input query and encode it in the same way as the data in your database. - Then, the database will search through the stored vectors and return the ones that are most similar to your query. + Then, the database will search through the stored vectors and return those that are most similar to your query. This similarity is measured by a _similarity metric_, which can be [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), or any other metric that you can think of. @@ -43,10 +43,10 @@ For this user guide you will use the [HnswDocumentIndex][docarray.index.backends because it doesn't require you to launch a database server. Instead, it will store your data locally. !!! note "Using a different vector database" - You can easily use Weaviate, Qdrant, or Elasticsearch instead, they share the same API! - To do so, check out their respective documentation sections. 
+ You can easily use Weaviate, Qdrant, or Elasticsearch instead -- they share the same API! + To do so, check their respective documentation sections. -!!! note "HNSWLib-specific settings" +!!! note "Hnswlib-specific settings" The following sections explain the general concept of Document Index by using [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] as an example. For HNSWLib-specific settings, check out the [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] documentation @@ -61,7 +61,7 @@ because it doesn't require you to launch a database server. Instead, it will sto pip install "docarray[hnswlib]" ``` -To create a Document Index, your first need a Document that defines the schema of your index. +To create a Document Index, you first need a document that defines the schema of your index: ```python from docarray import BaseDoc @@ -80,10 +80,10 @@ db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db') **Schema definition:** In this code snippet, `HnswDocumentIndex` takes a schema of the form of `MyDoc`. -The Document Index then _creates column for each field in `MyDoc`_. +The Document Index then _creates a column for each field in `MyDoc`_. -The column types in the backend database are determined the type hints of the fields in the Document. -Optionally, you can customize the database types for every field, as you can see [here](#customize-configurations). +The column types in the backend database are determined by the type hints of the document's fields. +Optionally, you can [customize the database types for every field](#customize-configurations). Most vector databases need to know the dimensionality of the vectors that will be stored. Here, that is automatically inferred from the type hint of the `embedding` field: `NdArray[128]` means that @@ -91,7 +91,7 @@ the database will store vectors with 128 dimensions. !!! 
note "PyTorch and TensorFlow support" Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that - for you. This is supported for all Document Index backends. No need to convert your tensors to numpy arrays manually! + for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually! **Database location:** @@ -126,11 +126,11 @@ This means that they share the same schema, and in general, the schema of a Docu need to have compatible schemas. !!! question "When are two schemas compatible?" - The schema of your Document Index and of your data need to be compatible with each other. + The schemas of your Document Index and data need to be compatible with each other. Let's say A is the schema of your Document Index and B is the schema of your data. - There are a few rules that determine if a schema A is compatible with a schema B. - If _any_ of the following is true, then A and B are compatible: + There are a few rules that determine if schema A is compatible with schema B. + If _any_ of the following are true, then A and B are compatible: - A and B are the same class - A and B have the same field names and field types @@ -140,9 +140,8 @@ need to have compatible schemas. Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. - -Provided with a Document of type `MyDoc`, [find()][docarray.index.abstract.BaseDocIndex.find] can find -similar Documents in the Document Index. 
+Given a document of type `MyDoc`, you can use [find()][docarray.index.abstract.BaseDocIndex.find] to find
+similar documents in the Document Index:

=== "Search by Document"

@@ -186,7 +185,7 @@ How these scores are calculated depends on the backend, and can usually be [conf

**Batched search:**

-You can also search for multiple Documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.
+You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

=== "Search by Documents"

    ```python
    # create some query Documents
    queries = DocList[MyDoc](
        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
    )

    # find similar documents
    matches, scores = db.find_batched(queries, limit=5, search_field='embedding')

    print(f'{matches=}')
    print(f'{matches.text=}')
    print(f'{scores=}')
    ```

-=== "Search by raw vector"
+=== "Search by raw vectors"

    ```python
    # create some query vectors
    query = np.random.rand(3, 128)

    # find similar documents
    matches, scores = db.find_batched(query, limit=5, search_field='embedding')

    print(f'{matches=}')
    print(f'{matches.text=}')
    print(f'{scores=}')
    ```

The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
-a list of `DocList`s, one for each query, containing the closest matching documents; and the associated similarity scores.
+a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.
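As a quick sketch of consuming that per-query structure (plain Python stand-ins here rather than real `DocList`s, so all names and values below are illustrative):

```python
from collections import namedtuple

# Hypothetical stand-in for the batched result: one list of matches
# and one list of scores per query
FindResultBatched = namedtuple('FindResultBatched', ['documents', 'scores'])

results = FindResultBatched(
    documents=[['doc_a', 'doc_c'], ['doc_b', 'doc_a']],  # matches per query
    scores=[[0.97, 0.83], [0.91, 0.64]],  # one similarity score per match
)

# iterate query by query, pairing each match list with its score list
for i, (matches, scores) in enumerate(zip(results.documents, results.scores)):
    print(f'query {i}: best match {matches[0]} with score {scores[0]}')
```

The real return value pairs each query's `DocList` of matches with a parallel array of scores in the same positional way.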
-## Perform filter search and text search +## Filter search and text search -In addition to vector similarity search, the Document Index interface offers methods for text search and filter search: +In addition to vector similarity search, the Document Index interface offers methods for text search and filtered search: [text_search()][docarray.index.abstract.BaseDocIndex.text_search] and [filter()][docarray.index.abstract.BaseDocIndex.filter], -as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched] +as well as their batched versions [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched] and [filter_batched()][docarray.index.abstract.BaseDocIndex.filter_batched]. -The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter -or text search. +!!! note + The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not offer support for filter + or text search. -To see how to perform these operations, you can check out other backends that do. + To see how to perform filter or text search, you can check out other backends that offer support. -## Perform hybrid search through the query builder +## Hybrid search through the query builder -Document Index support atomic operations for vector similarity search, text search and filter search. +Document Index supports atomic operations for vector similarity search, text search and filter search. 

-In order to combine these operations into a singe, hybrid search query, you can use the query builder that is accessible
+To combine these operations into a single, hybrid search query, you can use the query builder that is accessible
through [build_query()][docarray.index.abstract.BaseDocIndex.build_query]:

```python
# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()  # get empty query object
    .find(query=q_doc, search_field='embedding')  # add vector similarity search
    .filter(filter_query={'text': {'$exists': True}})  # add filter search
    .build()  # build the query
)

# execute the combined query and return the results
results = db.execute_query(query)
print(f'{results=}')
```

-In the example above you can see how to form a hybrid query that combines vector similarity search and filter search
+In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
to obtain a combined set of results.

-What kinds of atomic queries can be combined in this way depends on the backend.
-Some can combine text search and vector search, others can perform filters and vectors search, etc.
+The kinds of atomic queries that can be combined in this way depend on the backend.
+Some backends can combine text search and vector search, while others can perform filter and vector search, etc.
To see what backend can do what, check out the [specific docs](#document-index).

-## Access Documents by id
+## Access documents by `id`

-To retrieve a Document from a Document Index, you don't necessarily need to perform some fancy search.
+To retrieve a document from a Document Index, you don't necessarily need to perform a fancy search.
-You can also access data by the id that as assigned to every Document: +You can also access data by the `id` that was assigned to each document: ```python # prepare some data @@ -285,7 +285,7 @@ docs = db[ids] # get by list of ids ## Delete Documents -In the same way you can access Documents by id, you can delete them: +In the same way you can access Documents by id, you can also delete them: ```python # prepare some data @@ -304,15 +304,16 @@ del db[ids[1:]] # del by list of ids ## Customize configurations -It is DocArray's philosophy that each Document Index should "just work", meaning that it comes with a sane set of default -settings that can get you most of the way there. +DocArray's philosophy is that each Document Index should "just work", meaning that it comes with a sane set of defaults +that get you most of the way there. However, there are different configurations that you may want to tweak, including: + - The [ANN](https://ignite.apache.org/docs/latest/machine-learning/binary-classification/ann) algorithm used, for example [HNSW](https://www.pinecone.io/learn/hnsw/) or [ScaNN](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) - Hyperparameters of the ANN algorithm, such as `ef_construction` for HNSW - The distance metric to use, such as cosine or L2 distance - The data type of each column in the database -- ... +- And many more... The specific configurations that you can tweak depend on the backend, but the interface to do so is universal. @@ -320,17 +321,17 @@ Document Indexes differentiate between three different kind of configurations: **Database configurations** -_Database configurations_ are configurations that pertain to the entire DB or DB table (as opposed to just a specific column), +_Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column), and that you _don't_ dynamically change at runtime. 

This commonly includes:
+
- host and port
- index or collection name
- authentication settings
- ...
-
-For every backend, you can get the full list of configurations, and their defaults, like this:
+For every backend, you can get a full list of configurations and their defaults:

```python
from docarray.index import HnswDocumentIndex

db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')

print(db._db_config)
```

@@ -372,18 +373,17 @@ You can customize every field in this configuration:

**Runtime configurations**

-_Runtime configurations_ are configurations that pertain to the entire DB or DB table (as opposed to just a specific column),
+_Runtime configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column),
and that you can dynamically change at runtime.

This commonly includes:

- default batch size for batching operations
-- default mapping from pythong types to DB column types
-- default consistency level for various DB operations
+- default mapping from Python types to database column types
+- default consistency level for various database operations
- ...
-
-For every backend, you can get the full list of configurations, and their defaults, like this:
+For every backend, you can get the full list of configurations and their defaults:

```python
from docarray.index import HnswDocumentIndex

db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')

runtime_config = db._runtime_config
print(runtime_config)
```

As you can see, `HnswDocumentIndex.RuntimeConfig` is a dataclass that contains only one configuration:
-`default_column_config`, which is a mapping from python types to database column configurations.
+`default_column_config`, which is a mapping from Python types to database column configurations.

You can customize every field in this configuration using the [configure()][docarray.index.abstract.BaseDocIndex.configure] method:

@@ -464,11 +464,11 @@ After this change, the new setting will be applied to _every_ column that corres

For many vector databases, individual columns can have different configurations.
This commonly includes:

-- The data type of the column, e.g. `vector` vs `varchar`
-- If it is a vector column, the dimensionality of the vector
-- Whether an index should be built for a specific column
+- the data type of the column, e.g. `vector` vs `varchar`
+- the dimensionality of the vector (if it is a vector column)
+- whether an index should be built for a specific column

-The exact configurations that are available different from backend to backend, but in any case you can pass them
+The available configurations vary from backend to backend, but in any case you can pass them
directly in the schema of your Document Index, using the `Field()` syntax:

```python
class Schema(BaseDoc):
    tens: NdArray[100] = Field(max_elements=12, space='cosine')
    tens_two: NdArray[10] = Field(M=4, space='ip')


db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db')
```

The `HnswDocumentIndex` above contains two columns which are configured differently:
-- `tens` has a dimensionality of 100, can take up to 12 elements, and uses the `cosine` similarity space
-- `tens_two` has a dimensionality of 10, and uses the `ip` similarity space, and an `M` hyperparameter of 4
+- `tens` has a dimensionality of `100`, can take up to `12` elements, and uses the `cosine` similarity space
+- `tens_two` has a dimensionality of `10` and uses the `ip` similarity space with an `M` hyperparameter of `4`

All configurations that are not explicitly set will be taken from the `default_column_config` of the `RuntimeConfig`.

@@ -544,11 +544,9 @@ index_docs = [

doc_index.index(index_docs)
```

-
**Search nested data:**

-You can perform search on any nesting level.
-To do so, use the dunder operator to specify the field defined in the nested data.
+You can perform search on any nesting level by using the dunder operator to specify the field defined in the nested data.
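Conceptually, a dunder path such as `thumbnail__tensor` is just a nested attribute lookup split on `__`; a DocArray-free sketch of that resolution (the classes and helper below are illustrative, not library code):

```python
from dataclasses import dataclass


# Hypothetical stand-ins for a nested document schema
@dataclass
class Thumbnail:
    tensor: list


@dataclass
class VideoDoc:
    title: str
    thumbnail: Thumbnail


def resolve(doc, dunder_path):
    """Walk a dunder-separated field path like 'thumbnail__tensor'."""
    for part in dunder_path.split('__'):
        doc = getattr(doc, part)
    return doc


doc = VideoDoc(title='intro', thumbnail=Thumbnail(tensor=[0.1, 0.2]))
print(resolve(doc, 'thumbnail__tensor'))  # [0.1, 0.2]
```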
In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the nested `thumbnail` and `video` fields:

diff --git a/docs/user_guide/storing/first_step.md b/docs/user_guide/storing/first_step.md
index d17efae8b7e..e987c9698d5 100644
--- a/docs/user_guide/storing/first_step.md
+++ b/docs/user_guide/storing/first_step.md
@@ -3,10 +3,10 @@

In the previous sections we saw how to use [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] to represent multi-modal data and send it over the wire. In this section we will see how to store and persist this data.

-DocArray offers to ways of storing your data, each of which have their own documentation sections:
+DocArray offers two ways of storing your data, each of which has its own documentation section:

-1. In a **[Document Store](#document-store)** for simple long-term storage
-2. In a **[Document Index](#document-index)** for fast retrieval using vector similarity
+1. **[Document Store](#document-store)** for simple long-term storage
+2. **[Document Index](#document-index)** for fast retrieval using vector similarity

## Document Store

@@ -14,22 +14,22 @@ DocArray offers to ways of storing your data, each of which have their own docum
[`.push()`][docarray.array.doc_list.pushpull.PushPullMixin.push] and [`.pull()`][docarray.array.doc_list.pushpull.PushPullMixin.pull] methods. Under the hood, [DocStore][docarray.store.abstract_doc_store.AbstractDocStore] is used to persist a `DocList`.

-You can store your documents on-disk. Alternatively, you can upload them to [AWS S3](https://aws.amazon.com/s3/),
+You can either store your documents on-disk or upload them to [AWS S3](https://aws.amazon.com/s3/),
[minio](https://min.io) or [Jina AI Cloud](https://cloud.jina.ai/user/storage).
This section covers the following three topics:

- - [Store](doc_store/store_file.md) of [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] on-disk
- - [Store on Jina AI Cloud](doc_store/store_jac.md)
- - [Store on S3](doc_store/store_s3.md)
+ - [Storing](doc_store/store_file.md) [`BaseDoc`][docarray.base_doc.doc.BaseDoc], [`DocList`][docarray.array.doc_list.doc_list.DocList] and [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] on-disk
+ - [Storing on Jina AI Cloud](doc_store/store_jac.md)
+ - [Storing on S3](doc_store/store_s3.md)

## Document Index

A Document Index lets you store your Documents and search through them using vector similarity.

-This is useful if you want to store a bunch of data, and at a later point retrieve Documents that are similar to
-some query that you provide.
-Concrete examples where this is relevant are neural search application, Augmenting LLMs and Chatbots with domain knowledge ([Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401))]),
+This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to
+a query that you provide.
+Relevant concrete examples are neural search applications, augmenting LLMs and chatbots with domain knowledge ([Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401)),
or recommender systems.

DocArray's Document Index concept achieves this by providing a unified interface to a number of [vector databases](https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/vectordb).
@@ -40,4 +40,4 @@ Currently, DocArray supports the following vector databases: - [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md) - [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md) - [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md) -- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) +- [Hnswlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md) From 068bc0cb7398702e73fb5acc0b1db6b4ba5dbfff Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 11:45:37 +0200 Subject: [PATCH 3/8] docs(store): fix up backend pages Signed-off-by: Alex C-G --- docs/user_guide/storing/index_elastic.md | 86 +++++++++++++---------- docs/user_guide/storing/index_hnswlib.md | 45 ++++++------ docs/user_guide/storing/index_weaviate.md | 35 +++++---- 3 files changed, 88 insertions(+), 78 deletions(-) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index f7e331deb9a..a0d6f9e218b 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -10,9 +10,9 @@ DocArray comes with two Document Indexes for [Elasticsearch](https://www.elastic **native vector search (ANN) support**, alongside text and range search. [Elasticsearch v7.10](https://www.elastic.co/downloads/past-releases/elasticsearch-7-10-0) can store vectors, but - **does _not_ support native ANN vector search**, but only exhaustive (=slow) vector search, alongside text and range search. + **does _not_ support native ANN vector search**, but only exhaustive (i.e. slow) vector search, alongside text and range search. - Some users prefer to use ES v7.10 because it is available under a [different license](https://www.elastic.co/pricing/faq/licensing) compared to ES v8.0.0. + Some users prefer to use ES v7.10 because it is available under a [different license](https://www.elastic.co/pricing/faq/licensing) to ES v8.0.0. !!! 
note "Installation" To use [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex], you need to install the following dependencies: @@ -30,7 +30,7 @@ DocArray comes with two Document Indexes for [Elasticsearch](https://www.elastic ``` -The following examples is based on [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex], +The following example is based on [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex], but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex]. # Start Elasticsearch @@ -63,17 +63,18 @@ docker-compose up ``` ## Construct + To construct an index, you first need to define a schema in the form of a `Document`. There are a number of configurations you can pack into your schema: + - Every field in your schema will become one column in the database - For vector fields, such as `NdArray`, `TorchTensor`, or `TensorflowTensor`, you need to specify a dimensionality to be able to perform vector search -- You can override the default column type for every field. To do that, you can pass any [ES field data type](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping-types.html) to `field_name: Type = Field(col_type=...)`. You can see an example of this on the [section on keyword filters](#keyword-filter). +- You can override the default column type for every field by passing any [ES field data type](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping-types.html) to `field_name: Type = Field(col_type=...)`. You can see an example of this in the [section on keyword filters](#keyword-filter). Additionally, you can pass a `hosts` argument to the `__init__()` method to connect to an ES instance. By default, it is `http://localhost:9200`. 

-
```python
import numpy as np
from pydantic import Field

from docarray import BaseDoc
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray


class SimpleDoc(BaseDoc):
    tensor: NdArray[128] = Field(similarity='l2_norm', m=32, num_candidates=100)


doc_index = ElasticDocIndex[SimpleDoc](hosts='http://localhost:9200')
```

-## Index Documents
-Use `.index()` to add Documents into the index.
-The`.num_docs()` method returns the total number of Documents in the index.
+## Index documents
+
+Use `.index()` to add documents into the index.
+The `.num_docs()` method returns the total number of documents in the index.

```python
index_docs = [SimpleDoc(tensor=np.ones(128)) for _ in range(64)]

doc_index.index(index_docs)

print(f'number of docs in the index: {doc_index.num_docs()}')
```

-## Access Documents
-To access the `Doc`, you need to specify the `id`. You can also pass a list of `id` to access multiple Documents.
+## Access documents
+
+To access the `Doc`, you need to specify the `id`. You can also pass a list of `id`s to access multiple documents.

```python
# access a single Doc
doc_index[index_docs[16].id]

# access multiple Docs
doc_index[index_docs[16].id, index_docs[17].id]
```

### Persistence
-You can hood into a database index that was persisted during a previous session.
+
+You can hook into a database index that was persisted during a previous session.
To do so, you need to specify `index_name` and the `hosts`:

```python
doc_index = ElasticDocIndex[SimpleDoc](
    hosts='http://localhost:9200', index_name='previously_stored'
)
doc_index.index(index_docs)

valid_id = index_docs[0].id
del doc_index

doc_index2 = ElasticDocIndex[SimpleDoc](
    hosts='http://localhost:9200', index_name='previously_stored'
)
print(f'number of docs in the persisted index: {doc_index2.num_docs()}')
```

-## Delete Documents
-To delete the Documents, use the built-in function `del` with the `id` of the Documents that you want to delete.
-You can also pass a list of ids to delete multiple Documents.
+## Delete documents
+
+To delete documents, use the built-in function `del` with the `id` of the documents that you want to delete.
+You can also pass a list of `id`s to delete multiple documents.

```python
# delete a single Doc
del doc_index[index_docs[16].id]

# delete multiple Docs
del doc_index[index_docs[17].id, index_docs[18].id]
```

## Find nearest neighbors
+
The `.find()` method is used to find the nearest neighbors of a vector.
-You need to specify `search_field` that is used when performing the vector search. -This is the field that serves as the basis of comparison between your query and your indexed Documents. +You need to specify the `search_field` that is used when performing the vector search. +This is the field that serves as the basis of comparison between your query and indexed Documents. -You can use the `limit` argument to configurate how may Documents to return. +You can use the `limit` argument to configure how many documents to return. !!! note - [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex] is using Elasticsearch v7.10.1 which does not support approximate nearest neighbour algorithms such as HNSW. - This can lead to a poor performance when the search involves many vectors. + [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex] uses Elasticsearch v7.10.1, which does not support approximate nearest neighbour algorithms such as HNSW. + This can lead to poor performance when the search involves many vectors. [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] does not have this limitation. ```python @@ -165,13 +171,14 @@ query = SimpleDoc(tensor=np.ones(128)) docs, scores = doc_index.find(query, limit=5, search_field='tensor') ``` - ## Nested data -When using the index you can define multiple fields, including nesting Documents inside another Document. + +When using the index you can define multiple fields, including nesting documents inside another document. Consider the following example: -You have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. -Besides, `YouTbueVideoDoc` has `thumbnail` and `video` field, each of which has its own `tensor`. + +- You have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. +- `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`. 
```python from docarray.typing import ImageUrl, VideoUrl, AnyTensor @@ -209,10 +216,9 @@ index_docs = [ doc_index.index(index_docs) ``` -**You can perform search on any nesting level.** -To do so, use the dunder operator to specify the field defined in the nested data. +**You can perform search on any nesting level** by using the dunder operator to specify the field defined in the nested data. -In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the `thumbnail` and `video` field: +In the following example, you can see how to perform vector search on the `tensor` field of the `YouTubeVideoDoc` or the `tensor` field of the `thumbnail` and `video` field: ```python # example of find nested and flat index @@ -237,7 +243,7 @@ docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) To delete a nested data, you need to specify the `id`. !!! note - You can only delete `Doc` at the top level. Deletion of the `Doc` on the lower level is not supported yet. + You can only delete `Doc` at the top level. Deletion of `Doc`s on lower levels is not yet supported. ```python # example of delete nested and flat index @@ -245,9 +251,11 @@ del doc_index[index_docs[3].id, index_docs[4].id] ``` ## Other Elasticsearch queries -Besides the vector search, you can also perform other queries supported by Elasticsearch, such as text search, and various filters. + +Besides vector search, you can also perform other queries supported by Elasticsearch, such as text search, and various filters. 
### Text search + As in "pure" Elasticsearch, you can use text search directly on the field of type `str`: ```python @@ -269,12 +277,14 @@ docs, scores = doc_index.text_search(query, search_field='text') ``` ### Query Filter -The `filter()` method accepts queries that follow the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) and consist of leaf and compound clauses. + +The `filter()` method accepts queries that follow the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) and consists of leaf and compound clauses. Using this, you can perform [keyword filters](#keyword-filter), [geolocation filters](#geolocation-filter) and [range filters](#range-filter). #### Keyword filter -To filter the Documents in your index by keyword, you can use `Field(col_type='keyword')` to enable keyword search for a given fields: + +To filter documents in your index by keyword, you can use `Field(col_type='keyword')` to enable keyword search for given fields: ```python class NewsDoc(BaseDoc): @@ -296,7 +306,8 @@ docs = doc_index.filter(query_filter) ``` #### Geolocation filter -To filter the Documents in your index by geolocation, you can use `Field(col_type='geo_point')` on a given field. + +To filter documents in your index by geolocation, you can use `Field(col_type='geo_point')` on a given field: ```python class NewsDoc(BaseDoc): @@ -330,7 +341,8 @@ docs = doc_index.filter(query) ``` #### Range filter -You can have [range field types](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/range.html) in your Document schema and set `Field(col_type='integer_range')`(or also `date_range`, etc.) to filter the docs based on the range of the field. + +You can have [range field types](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/range.html) in your document schema and set `Field(col_type='integer_range')`(or also `date_range`, etc.) 
to filter documents based on the range of the field. ```python class NewsDoc(BaseDoc): @@ -365,6 +377,7 @@ docs = doc_index.filter(query) ``` ### Hybrid search and query builder + To combine any of the "atomic" search approaches above, you can use the `QueryBuilder` to build your own hybrid query. For this the `find()`, `filter()` and `text_search()` methods and their combination are supported. @@ -400,6 +413,7 @@ You can also manually build a valid ES query and directly pass it to the `execut ## Configuration options ### DBConfig + The following configs can be set in `DBConfig`: | Name | Description | Default | @@ -411,11 +425,11 @@ The following configs can be set in `DBConfig`: | `index_mappings` | Other [index mappings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping.html) in a Dict for creating the index | dict | You can pass any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object. -To see how, see [here](first_steps.md#configuration-options#customize-configurations). +See [here](first_steps.md#configuration-options#customize-configurations) for more information. ### RuntimeConfig -The `RuntimeConfig` dataclass of `ElasticDocIndex` consists of `default_column_config` and `chunk_size`. You can change `chunk_size` for batch operations. +The `RuntimeConfig` dataclass of `ElasticDocIndex` consists of `default_column_config` and `chunk_size`. You can change `chunk_size` for batch operations: ```python doc_index = ElasticDocIndex[SimpleDoc]() @@ -432,5 +446,5 @@ class SimpleDoc(BaseDoc): doc_index = ElasticDocIndex[SimpleDoc]() ``` -You can pass the above as a keyword arguments the `configure()` method or pass an entire configuration object. +You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object.
+See [here](first_steps.md#configuration-options#customize-configurations) for more information. diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 88530cc2fde..8665bfe86f5 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -1,26 +1,25 @@ # Hnswlib Document Index !!! note - To use [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex], one need to install the extra dependency with the following command + To use [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex], you need to install the extra dependency with the following command: ```console pip install "docarray[hnswlib]" ``` [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] is a lightweight Document Index implementation -that runs fully locally and is best suited for small to medium sized datasets. -It stores vectors on disc in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html). +that runs fully locally and is best suited for small- to medium-sized datasets. +It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html). !!! note "Production readiness" [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] is a great starting point - for small to medium sized datasets, but it is not battle tested in production. If scalability, uptime, etc. are - important to you, we recommend you eventually transition to one of our database backed Document Index implementations: + for small- to medium-sized datasets, but it is not battle tested in production. If scalability, uptime, etc. 
are + important to you, we recommend you eventually transition to one of our database-backed Document Index implementations: - [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] - [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex] - [ElasticDocumentIndex][docarray.index.backends.elastic.ElasticDocIndex] - ## Basic Usage To see how to create a [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] instance, add Documents, @@ -35,7 +34,7 @@ This section lays out the configurations and options that are specific to [HnswD The `DBConfig` of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] expects only one argument: `work_dir`. -This is the location where all of the Index's data will be stored: The vaious HNSWLib indexes, as well as the SQLite database. +This is the location where all of the Index's data will be stored, namely the various Hnswlib indexes and the SQLite database. You can pass this directly to the constructor: @@ -53,21 +52,20 @@ class MyDoc(BaseDoc): db = HnswDocumentIndex[MyDoc](work_dir='./path/to/db') ``` -You can specify and existing directory that holds that from a previous session. -In that case, the Index will load the data from that directory. +To load existing data, you can specify a directory that stores data from a previous session. -!!! note "HNSWLib file lock" - HNSWLib uses a file lock to prevent multiple processes from accessing the same index at the same time. +!!! note "Hnswlib file lock" + Hnswlib uses a file lock to prevent multiple processes from accessing the same index at the same time. This means that if you try to open an index that is already open in another process, you will get an error. To avoid this, you can specify a different `work_dir` for each process.
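The per-process `work_dir` workaround can be sketched in plain Python. The `make_work_dir` helper and its directory-naming scheme are illustrative assumptions, not part of DocArray's API:

```python
import os
import tempfile


def make_work_dir(base: str = tempfile.gettempdir()) -> str:
    """Return a working directory unique to the current process.

    Giving every process its own directory means the Hnswlib file lock
    in one index directory never collides with another process.
    """
    work_dir = os.path.join(base, f'hnsw_index_{os.getpid()}')
    os.makedirs(work_dir, exist_ok=True)
    return work_dir


# hypothetical usage: db = HnswDocumentIndex[MyDoc](work_dir=make_work_dir())
```

Any scheme that guarantees a distinct directory per process works equally well; the process ID is simply a convenient unique suffix.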
### RuntimeConfig -The `RuntimeConfig` of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] contains only one entry, +The `RuntimeConfig` of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] contains only one entry: the default mapping from Python types to column configurations. You can see in the [section below](#field-wise-configurations) how to override configurations for specific fields. -If you want to set configurations globally, i.e. for all vector fields in your Documents, you can do that using `RuntimeConfig`: +If you want to set configurations globally, i.e. for all vector fields in your documents, you can do that using `RuntimeConfig`: ```python import numpy as np @@ -95,7 +93,7 @@ db.configure( This will set the default configuration for all vector fields to the one specified in the example above. !!! note - Even if your vectors come from PyTorch or TensorFlow, you can and should still use the `np.ndarray` configuration. + Even if your vectors come from PyTorch or TensorFlow, you can (and should) still use the `np.ndarray` configuration. This is because all tensors are converted to `np.ndarray` under the hood. For more information on these settings, see [below](#field-wise-configurations). @@ -105,7 +103,7 @@ stored as-is in a SQLite database. ### Field-wise configurations -There are various setting that you can tweak for every vector field that you index into HNSWLib. +There are various settings that you can tweak for every vector field that you index into Hnswlib. You pass all of those using the `field: Type = Field(...)` syntax: ```python @@ -123,7 +121,7 @@ db = HnswDocumentIndex[Schema](work_dir='/tmp/my_db') In the example above you can see how to configure two different vector fields, with two different sets of settings.
-In this way, you can pass [all options that HNSWLib supports](https://github.com/nmslib/hnswlib#api-description): +In this way, you can pass [all options that Hnswlib supports](https://github.com/nmslib/hnswlib#api-description): | Keyword | Description | Default | |-------------------|--------------------------------------------------------------------------------------------------------------------------------|---------| @@ -136,10 +134,11 @@ In this way, you can pass [all options that HNSWLib supports](https://github.com | `allow_replace_deleted` | enables replacing of deleted elements with new added ones | True | | `num_threads` | sets the number of cpu threads to use | 1 | -You can find more details on there parameters [here](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md). +You can find more details on the parameters [here](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md). ## Nested Index -When using the index, you can define multiple fields as well as the nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. Besides, `YouTbueVideoDoc` has `thumbnail` and `video` field, each of which has its own `tensor`. + +When using the index, you can define multiple fields and their nested structure. In the following example, you have `YouTubeVideoDoc` including the `tensor` field calculated based on the description. `YouTubeVideoDoc` has `thumbnail` and `video` fields, each with their own `tensor`. ```python from docarray.typing import ImageUrl, VideoUrl, AnyTensor @@ -177,7 +176,7 @@ index_docs = [ doc_index.index(index_docs) ``` -Use the `search_field` to specify which field to be used when performing the vector search. You can use the dunder operator to specify the field defined in the nested data. In the following codes, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the `thumbnail` and `video` field. 
+You can use the `search_field` to specify which field to use when performing the vector search. You can use the dunder operator to specify the field defined in the nested data. In the following code, you can perform vector search on the `tensor` field of the `YouTubeVideoDoc` or on the `tensor` field of the `thumbnail` and `video` field: ```python # example of find nested and flat index @@ -196,12 +195,12 @@ docs, scores = doc_index.find(query_doc, search_field='thumbnail__tensor', limit docs, scores = doc_index.find(query_doc, search_field='video__tensor', limit=3) ``` -To delete a nested data, you need to specify the `id`. +To delete nested data, you need to specify the `id`. !!! note - You can only delete `Doc` at the top level. Deletion of the `Doc` on the lower level is not supported yet. + You can only delete `Doc` at the top level. Deletion of the `Doc` on lower levels is not yet supported. ```python -# example of delete nested and flat index +# example of deleting nested and flat index del doc_index[index_docs[6].id] -``` \ No newline at end of file +``` diff --git a/docs/user_guide/storing/index_weaviate.md b/docs/user_guide/storing/index_weaviate.md index f43c387d875..83d84dd8fa9 100644 --- a/docs/user_guide/storing/index_weaviate.md +++ b/docs/user_guide/storing/index_weaviate.md @@ -22,21 +22,18 @@ jupyter: ``` This is the user guide for the [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], -focussing on special features and configurations of Weaviate. - -For general usage of a Document Index, see the [general user guide](./first_steps.md#document-index). +focusing on special features and configurations of Weaviate. +For general usage of a Document Index, see the [general user guide](./docindex.md). # 1. Start Weaviate service -To use [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], it needs to hook into a running Weaviate service. 
+To use [WeaviateDocumentIndex][docarray.index.backends.weaviate.WeaviateDocumentIndex], DocArray needs to hook into a running Weaviate service. There are multiple ways to start a Weaviate instance, depending on your use case. ## 1.1. Options - Overview -There are multiple ways to start a Weaviate instance. - | Instance type | General use case | Configurability | Notes | | ----- | ----- | ----- | ----- | | **Weaviate Cloud Services (WCS)** | Development and production | Limited | **Recommended for most users** | @@ -111,6 +108,7 @@ embedded_options = EmbeddedOptions() Weaviate offers [multiple authentication options](https://weaviate.io/developers/weaviate/configuration/authentication), as well as [authorization options](https://weaviate.io/developers/weaviate/configuration/authorization). With DocArray, you can use any of: + - Anonymous access (public instance), - OIDC with username & password, and - API-key based authentication. @@ -210,13 +208,13 @@ Additionally, you can specify the below settings when you instantiate a configur | **Category: General** | | host | str | Weaviate instance url | http://localhost:8080 | | **Category: Authentication** | -| username | str | username known to the specified authentication provider (e.g. WCS) | None | `jp@weaviate.io` | -| password | str | corresponding password | None | `p@ssw0rd` | +| username | str | Username known to the specified authentication provider (e.g. 
WCS) | None | `jp@weaviate.io` | +| password | str | Corresponding password | None | `p@ssw0rd` | | auth_api_key | str | API key known to the Weaviate instance | None | `mys3cretk3y` | | **Category: Data schema** | | index_name | str | Class name to use to store the document | `Document` | | **Category: Embedded Weaviate** | -| embedded_options| EmbeddedOptions | options for embedded weaviate | None | +| embedded_options | EmbeddedOptions | Options for embedded Weaviate | None | The type `EmbeddedOptions` can be specified as described [here](https://weaviate.io/developers/weaviate/installation/embedded#embedded-options) @@ -246,15 +244,16 @@ store.configure(runtimeconfig) # Batch settings being passed on | batch_config | Dict[str, Any] | dictionary to configure the weaviate client's batching logic | see below | Read more: + - Weaviate [docs on batching with the Python client](https://weaviate.io/developers/weaviate/client-libraries/python#batching) ## 3. Available column types -Python data types are mapped to Weaviate type according to the below convention. +Python data types are mapped to Weaviate types according to the below conventions. -| python type | weaviate type | +| Python type | Weaviate type | | ----------- | ------------- | | docarray.typing.ID | string | | str | text | @@ -279,7 +278,7 @@ A list of available Weaviate data types [is here](https://weaviate.io/developers ## 4. Adding example data -Putting it together, we can add data as shown below using Weaviate as the document store. +Putting it together, we can add the data below using Weaviate as the Document Index: ```python import numpy as np @@ -332,23 +331,21 @@ store.index(docs) ### 4.1. Notes -- In order to use vector search, you need to specify `is_embedding` for exactly one field. - - This is as Weaviate is configured to allow one vector per data object. +- To use vector search, you need to specify `is_embedding` for exactly one field.
+ - This is because Weaviate is configured to allow one vector per data object. - If you would like to see Weaviate support multiple vectors per object, [upvote the issue](https://github.com/weaviate/weaviate/issues/2465) which will help to prioritize it. - For a field to be considered as an embedding, its type needs to be of subclass `np.ndarray` or `AbstractTensor` and `is_embedding` needs to be set to `True`. - If `is_embedding` is set to `False` or not provided, the field will be treated as a `number[]`, and as a result, it will not be added to Weaviate's vector index. - It is possible to create a schema without specifying `is_embedding` for any field. - This will however mean that the document will not be vectorized and cannot be searched using vector search. - ## 5. Query Builder/Hybrid Search - ### 5.1. Text search To perform a text search, follow the below syntax. -This will perform a text search for the word "hello" in the field "text" and return the first 2 results: +This will perform a text search for the word "world" in the field "text" and return the first two results: ```python q = store.build_query().text_search("world", search_field="text").limit(2).build() @@ -361,7 +358,7 @@ docs To perform a vector similarity search, follow the below syntax. -This will perform a vector similarity search for the vector [1, 2] and return the first 2 results: +This will perform a vector similarity search for the vector [1, 2] and return the first two results: ```python q = store.build_query().find([1, 2]).limit(2).build() @@ -374,7 +371,7 @@ docs To perform a hybrid search, follow the below syntax. -This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first 2 results: +This will perform a hybrid search for the word "hello" and the vector [1, 2] and return the first two results: **Note**: Hybrid search searches through the object vector and all fields. Accordingly, the `search_field` keyword will have no effect.
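Conceptually, a hybrid query blends a keyword score and a vector-similarity score into a single ranking. The sketch below only illustrates that idea; the scoring functions and the `alpha` weighting are simplified assumptions, not Weaviate's actual fusion algorithm:

```python
import math


def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def hybrid_score(doc, query_text, query_vec, alpha=0.5):
    # keyword part: fraction of query words found in the document text
    words = query_text.lower().split()
    keyword = sum(w in doc['text'].lower() for w in words) / len(words)
    # vector part: cosine similarity of the embeddings
    vector = cosine(doc['embedding'], query_vec)
    # alpha balances the two signals (loosely analogous to Weaviate's alpha)
    return alpha * vector + (1 - alpha) * keyword


docs = [
    {'text': 'hello world', 'embedding': [1.0, 2.0]},
    {'text': 'goodbye moon', 'embedding': [2.0, 1.0]},
]
ranked = sorted(
    docs, key=lambda d: hybrid_score(d, 'hello', [1.0, 2.0]), reverse=True
)
# the document matching both the keyword and the vector ranks first
```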
From 1e2bc8b69ebee952480e8864b7e20279b238fec4 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 12:22:01 +0200 Subject: [PATCH 4/8] docs(docindex): heading wording Signed-off-by: Alex C-G --- docs/user_guide/storing/docindex.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/storing/docindex.md b/docs/user_guide/storing/docindex.md index fd2ae18111f..91c29197e5f 100644 --- a/docs/user_guide/storing/docindex.md +++ b/docs/user_guide/storing/docindex.md @@ -136,7 +136,7 @@ need to have compatible schemas. - A and B have the same field names and field types - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A -## Perform vector similarity search +## Vector similarity search Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method. From 244fc70fe1faffba342b1f936fc14d5a96e19141 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 12:25:37 +0200 Subject: [PATCH 5/8] docs(store): fix docstores Signed-off-by: Alex C-G --- .../storing/doc_store/store_file.md | 24 +++++++----- .../user_guide/storing/doc_store/store_jac.md | 38 ++++++++++--------- 2 files changed, 34 insertions(+), 28 deletions(-) diff --git a/docs/user_guide/storing/doc_store/store_file.md b/docs/user_guide/storing/doc_store/store_file.md index 8602eb71adb..53df1bf2b7c 100644 --- a/docs/user_guide/storing/doc_store/store_file.md +++ b/docs/user_guide/storing/doc_store/store_file.md @@ -1,10 +1,13 @@ # Store on-disk -When you want to use your [DocList][docarray.array.doc_list.doc_list.DocList] in another place, you can use the -[`.push()`][docarray.array.doc_list.pushpull.PushPullMixin.push] function to push the [DocList][docarray.array.doc_list.doc_list.DocList] -to one place and later use the [`.pull()`][docarray.array.doc_list.pushpull.PushPullMixin.pull] function to pull its content back. 
+When you want to use your [DocList][docarray.array.doc_list.doc_list.DocList] in another place, you can use: + +- the [`.push()`][docarray.array.doc_list.pushpull.PushPullMixin.push] method to push the [DocList][docarray.array.doc_list.doc_list.DocList] +to one place. +- the [`.pull()`][docarray.array.doc_list.pushpull.PushPullMixin.pull] method to pull its content back. + +## Push and pull -## Push & pull To use the store locally, you need to pass a local file path to the function starting with `'file://'`. ```python @@ -21,14 +24,15 @@ dl.push('file://simple_dl') dl_pull = DocList[SimpleDoc].pull('file://simple_dl') ``` -A file with the name of `simple_dl.docs` being created to store the `DocList`. +A file named `simple_dl.docs` will be created in `$HOME/.docarray/cache` to store the `DocList`. + +## Push and pull with streaming -## Push & pull with streaming -When you have a large amount of documents to push and pull, you could use the streaming function. +When you have a large number of documents to push and pull, you can use the streaming methods: [`.push_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.push_stream] and -[`.pull_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.pull_stream] can help you to stream the `DocList` in -order to save the memory usage. You set multiple `DocList` to pull from the same source as well. +[`.pull_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.pull_stream] stream the `DocList` +to save memory usage.
You can set multiple `DocList`s to pull from the same source as well: ```python from docarray import BaseDoc, DocList @@ -63,4 +67,4 @@ for d1, d2 in zip(dl_pull_stream_1, dl_pull_stream_2): get SimpleDoc(id='1389877ac97b3e6d0e8eb17568934708', text='doc 6'), get SimpleDoc(id='1389877ac97b3e6d0e8eb17568934708', text='doc 6') get SimpleDoc(id='264b0eff2cd138d296f15c685e15bf23', text='doc 7'), get SimpleDoc(id='264b0eff2cd138d296f15c685e15bf23', text='doc 7') ``` - \ No newline at end of file + diff --git a/docs/user_guide/storing/doc_store/store_jac.md b/docs/user_guide/storing/doc_store/store_jac.md index 2975df7311f..fd5b69be56b 100644 --- a/docs/user_guide/storing/doc_store/store_jac.md +++ b/docs/user_guide/storing/doc_store/store_jac.md @@ -1,18 +1,21 @@ # Store on Jina AI Cloud -When you want to use your [`DocList`][docarray.DocList] in another place, you can use the -[`.push()`][docarray.array.doc_list.pushpull.PushPullMixin.push] method to push the `DocList` to Jina AI Cloud and later use the -[`.pull()`][docarray.array.doc_list.pushpull.PushPullMixin.pull] function to pull its content back. + +When you want to use your [`DocList`][docarray.DocList] in another place, you can use: +- the [`.push()`][docarray.array.doc_list.pushpull.PushPullMixin.push] method to push the `DocList` to Jina AI Cloud. +- the [`.pull()`][docarray.array.doc_list.pushpull.PushPullMixin.pull] method to pull its content back. !!! note - To store on Jina AI Cloud, you need to install the extra dependency with the following line + To store documents on Jina AI Cloud, you need to install the extra dependency with the following line: + ```cmd pip install "docarray[jac]" ``` -## Push & pull +## Push and pull + To use the store [`DocList`][docarray.DocList] on Jina AI Cloud, you need to pass a Jina AI Cloud path to the function starting with `'jac://'`.
-Before getting started, you need to have an account at [Jina AI Cloud](http://cloud.jina.ai/) and created a [Personal Access Token (PAT)](https://cloud.jina.ai/settings/tokens). +Before getting started, create an account at [Jina AI Cloud](http://cloud.jina.ai/) and a [Personal Access Token (PAT)](https://cloud.jina.ai/settings/tokens). ```python from docarray import BaseDoc, DocList @@ -34,26 +37,25 @@ dl.push(f'jac://{DL_NAME}') dl_pull = DocList[SimpleDoc].pull(f'jac://{DL_NAME}') ``` - !!! note - When using `.push()` and `.pull()`, `DocList` calls the default boto3 client. Be sure your default session is correctly set up. + When using `.push()` and `.pull()`, `DocList` calls the default `boto3` client. Be sure your default session is correctly set up. +## Push and pull with streaming -## Push & pull with streaming -When you have a large amount of documents to push and pull, you could use the streaming function. +When you have a large number of documents to push and pull, you can use the streaming methods. [`.push_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.push_stream] and -[`.pull_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.pull_stream] can help you to stream the -[`DocList`][docarray.DocList] in order to save the memory usage. -You set multiple `DocList` to pull from the same source as well. -The usage is the same as using streaming with local files. -Please refer to [Push & Pull with streaming with local files](store_file.md#push-pull-with-streaming). +[`.pull_stream()`][docarray.array.doc_list.pushpull.PushPullMixin.pull_stream] stream the +[`DocList`][docarray.DocList] to save memory usage. +You can set multiple `DocList`s to pull from the same source as well. +The usage is the same as streaming with local files. +Please refer to [push and pull with streaming](store_file.md#push-and-pull-with-streaming) with local files.
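The memory saving of the streaming methods comes from handling one document at a time instead of materializing the whole list. The generator-based sketch below illustrates that pattern only; it is a simplified assumption, not DocArray's actual implementation:

```python
from typing import Iterable, Iterator


def push_stream(docs: Iterable[dict], sink: list) -> None:
    # consume documents one by one; only one document is held in memory
    for doc in docs:
        sink.append(doc)


def pull_stream(source: Iterable[dict]) -> Iterator[dict]:
    # yield documents lazily instead of returning a full list
    yield from source


store: list = []
push_stream(({'text': f'doc {i}'} for i in range(3)), store)
pulled = [d['text'] for d in pull_stream(store)]
```

Because both sides are iterators, a producer and a consumer never need the full `DocList` in memory at once.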
## Delete -To delete the store, you need to use the static method [`.delete()`][docarray.store.jac.JACDocStore.delete] of [`JACDocStore`][docarray.store.jac.JACDocStore] class. + +To delete the store, you need to use the static method [`.delete()`][docarray.store.jac.JACDocStore.delete] of [`JACDocStore`][docarray.store.jac.JACDocStore] class: ```python from docarray.store import JACDocStore JACDocStore.delete(f'jac://{DL_NAME}') -``` \ No newline at end of file +``` From 5a38583286c730cf3ebc33d6205ecf71df8fa3ad Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 12:25:54 +0200 Subject: [PATCH 6/8] docs(mkdocs): fix docindex path Signed-off-by: Alex C-G --- mkdocs.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yml b/mkdocs.yml index e02ebac51a1..0fc22490b16 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -100,7 +100,7 @@ nav: - Storing data: - user_guide/storing/first_step.md - DocIndex: - - user_guide/storing/first_steps.md + - user_guide/storing/docindex.md - user_guide/storing/index_hnswlib.md - user_guide/storing/index_weaviate.md - user_guide/storing/index_elastic.md From 51c72bf804db3375909b64c044baf1400591b6d0 Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Wed, 19 Apr 2023 15:39:36 +0200 Subject: [PATCH 7/8] docs(storage): fix broken links Signed-off-by: Alex C-G --- docs/user_guide/storing/index_elastic.md | 4 ++-- docs/user_guide/storing/index_hnswlib.md | 2 +- docs/user_guide/storing/index_qdrant.md | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/user_guide/storing/index_elastic.md b/docs/user_guide/storing/index_elastic.md index a0d6f9e218b..a21528c46b8 100644 --- a/docs/user_guide/storing/index_elastic.md +++ b/docs/user_guide/storing/index_elastic.md @@ -425,7 +425,7 @@ The following configs can be set in `DBConfig`: | `index_mappings` | Other [index mappings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping.html) in a Dict for creating the index | dict | You can pass 
any of the above as keyword arguments to the `__init__()` method or pass an entire configuration object. -See [here](first_steps.md#configuration-options#customize-configurations) for more information. +See [here](docindex.md#configuration-options#customize-configurations) for more information. ### RuntimeConfig @@ -447,4 +447,4 @@ doc_index = ElasticDocIndex[SimpleDoc]() ``` You can pass the above as keyword arguments to the `configure()` method or pass an entire configuration object. -See [here](first_steps.md#configuration-options#customize-configurations) for more information. +See [here](docindex.md#configuration-options#customize-configurations) for more information. diff --git a/docs/user_guide/storing/index_hnswlib.md b/docs/user_guide/storing/index_hnswlib.md index 8665bfe86f5..d873c059a57 100644 --- a/docs/user_guide/storing/index_hnswlib.md +++ b/docs/user_guide/storing/index_hnswlib.md @@ -23,7 +23,7 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s ## Basic Usage To see how to create a [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] instance, add Documents, -perform search, etc. see the [general user guide](./first_steps.md#document-index). +perform search, etc. see the [general user guide](./docindex.md). ## Configuration diff --git a/docs/user_guide/storing/index_qdrant.md b/docs/user_guide/storing/index_qdrant.md index d03a12e4e37..7d832f1dd67 100644 --- a/docs/user_guide/storing/index_qdrant.md +++ b/docs/user_guide/storing/index_qdrant.md @@ -10,7 +10,7 @@ The following is a starter script for using the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], based on the [Qdrant](https://qdrant.tech/) vector search engine. -For general usage of a Document Index, see the [general user guide](./first_steps.md#document-index). +For general usage of a Document Index, see the [general user guide](./docindex.md#document-index). !!! 
tip "See all configuration options" To see all configuration options for the [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex], @@ -111,4 +111,4 @@ results = doc_index.filter( ], ), ) -``` \ No newline at end of file +``` From f9cdfa7f7055c64575a1242faee73c0d85af0e6d Mon Sep 17 00:00:00 2001 From: Alex C-G Date: Fri, 21 Apr 2023 17:02:24 +0200 Subject: [PATCH 8/8] fix: docindex link Signed-off-by: Alex C-G --- docs/migration_guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/migration_guide.md b/docs/migration_guide.md index a9f2d3000e4..ab347f5eac2 100644 --- a/docs/migration_guide.md +++ b/docs/migration_guide.md @@ -99,7 +99,7 @@ book_titles = docs.title # returns a list[str] ## Changes to Document Store -In v2 the `Document Store` has been renamed to [`DocIndex`](user_guide/storing/first_steps.md) and can be used for fast retrieval using vector similarity. +In v2 the `Document Store` has been renamed to [`DocIndex`](user_guide/storing/docindex.md) and can be used for fast retrieval using vector similarity. DocArray v2 `DocIndex` supports: - [Weaviate](https://weaviate.io/) @@ -114,4 +114,4 @@ in v2 you can initialize a `DocIndex` object of your choice, such as: db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir') ``` -In contrast, [`DocStore`](user_guide/storing/first_step.md#document-store) in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud. +In contrast, [`DocStore`](user_guide/storing/docindex.md#document-store) in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.