From 9d04395143b711244fd420ad9faa148b737d0cb2 Mon Sep 17 00:00:00 2001
From: Johannes Messner
Date: Mon, 14 Nov 2022 11:50:04 +0100
Subject: [PATCH] docs: add list-like to table of parameters for all backends

Signed-off-by: Johannes Messner
---
 docs/advanced/document-store/annlite.md | 19 ++++-----
 docs/advanced/document-store/elasticsearch.md | 25 ++++++------
 docs/advanced/document-store/index.md | 3 +-
 docs/advanced/document-store/qdrant.md | 35 +++++++++--------
 docs/advanced/document-store/redis.md | 39 ++++++++++---------
 docs/advanced/document-store/sqlite.md | 3 +-
 docs/advanced/document-store/weaviate.md | 2 +-
 7 files changed, 66 insertions(+), 60 deletions(-)

diff --git a/docs/advanced/document-store/annlite.md b/docs/advanced/document-store/annlite.md
index b5be46af7ef..922b4938306 100644
--- a/docs/advanced/document-store/annlite.md
+++ b/docs/advanced/document-store/annlite.md
@@ -38,15 +38,16 @@ Other functions behave the same as in-memory DocumentArray.
 The following configs can be set:
-| Name | Description | Default |
-|-------------------|---------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
-| `n_dim` | Number of dimensions of embeddings to be stored and retrieved | **This is always required** |
-| `data_path` | The data folder where the data is located | **A random temp folder** |
-| `metric` | Distance metric to be used during search. Can be 'cosine', 'dot' or 'euclidean' | 'cosine' |
-| `ef_construction` | The size of the dynamic list for the nearest neighbors (used during the construction) | `None`, defaults to the default value in the AnnLite package* |
-| `ef_search` | The size of the dynamic list for the nearest neighbors (used during the search) | `None`, defaults to the default value in the AnnLite package* |
-| `max_connection` | The number of bi-directional links created for every new element during construction.
| `None`, defaults to the default value in the AnnLite package* |
-| `n_components` | The output dimension of PCA model. Should be a positive number and less than `n_dim` if it's not `None` | `None`, defaults to the default value in the AnnLite package* |
+| Name | Description | Default |
+|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
+| `n_dim` | Number of dimensions of embeddings to be stored and retrieved | **This is always required** |
+| `data_path` | The data folder where the data is located | **A random temp folder** |
+| `metric` | Distance metric to be used during search. Can be 'cosine', 'dot' or 'euclidean' | 'cosine' |
+| `ef_construction` | The size of the dynamic list for the nearest neighbors (used during the construction) | `None`, defaults to the default value in the AnnLite package* |
+| `ef_search` | The size of the dynamic list for the nearest neighbors (used during the search) | `None`, defaults to the default value in the AnnLite package* |
+| `max_connection` | The number of bi-directional links created for every new element during construction. | `None`, defaults to the default value in the AnnLite package* |
+| `n_components` | The output dimension of PCA model. Should be a positive number and less than `n_dim` if it's not `None` | `None`, defaults to the default value in the AnnLite package* |
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance.
| True |
 *You can check the default values in [the AnnLite source code](https://github.com/jina-ai/annlite/blob/main/annlite/core/index/hnsw/index.py)
diff --git a/docs/advanced/document-store/elasticsearch.md b/docs/advanced/document-store/elasticsearch.md
index a6f83a22679..b55e3ba3172 100644
--- a/docs/advanced/document-store/elasticsearch.md
+++ b/docs/advanced/document-store/elasticsearch.md
@@ -391,18 +391,19 @@ results = da.find('cheap', index='price')
 The following configs can be set:
-| Name | Description | Default |
-|-------------------|-------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
-| `hosts` | Hostname of the Elasticsearch server | `http://localhost:9200` |
-| `es_config` | Other ES configs in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | None |
-| `index_name` | Elasticsearch index name; the class name of Elasticsearch index object to set this DocumentArray | None |
-| `n_dim` | Dimensionality of the embeddings | None |
-| `distance` | Similarity metric in Elasticsearch | `cosine` |
-| `ef_construction` | The size of the dynamic list for the nearest neighbors.
| `None`, defaults to the default value in ElasticSearch* |
-| `m` | Similarity metric in Elasticsearch | `None`, defaults to the default value in ElasticSearch* |
-| `index_text` | Boolean flag indicating whether to index `.text` or not | False |
-| `tag_indices` | List of tags to index | False |
-| `batch_size` | Batch size used to handle storage refreshes/updates | 64 |
+| Name | Description | Default |
+|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
+| `hosts` | Hostname of the Elasticsearch server | `http://localhost:9200` |
+| `es_config` | Other ES configs in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | None |
+| `index_name` | Elasticsearch index name; the class name of Elasticsearch index object to set this DocumentArray | None |
+| `n_dim` | Dimensionality of the embeddings | None |
+| `distance` | Similarity metric in Elasticsearch | `cosine` |
+| `ef_construction` | The size of the dynamic list for the nearest neighbors. | `None`, defaults to the default value in ElasticSearch* |
+| `m` | Similarity metric in Elasticsearch | `None`, defaults to the default value in ElasticSearch* |
+| `index_text` | Boolean flag indicating whether to index `.text` or not | False |
+| `tag_indices` | List of tags to index | False |
+| `batch_size` | Batch size used to handle storage refreshes/updates | 64 |
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance.
| True |
 ```{tip}
 You can read more about HNSW parameters and their default values [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-params)
diff --git a/docs/advanced/document-store/index.md b/docs/advanced/document-store/index.md
index aa43be24210..ee71868fbab 100644
--- a/docs/advanced/document-store/index.md
+++ b/docs/advanced/document-store/index.md
@@ -564,7 +564,8 @@ The solution is simple: use {ref}`column-selector`:
 da[0, 'text'] = 'hello'
 ```
-### Performance Issue caused by List-like structure
+### Performance issue caused by list-like structure
+
 DocArray allows list-like behavior by adding an offset-to-id mapping structure to storage backends.
 Such feature (adding this structure) means the database stores, along with documents, meta information about document order.
 However, list_like behavior is not useful in indexers where concurrent usage is possible and users do not need information about document location.
diff --git a/docs/advanced/document-store/qdrant.md b/docs/advanced/document-store/qdrant.md
index 10c98cb7a4f..b7438f5499d 100644
--- a/docs/advanced/document-store/qdrant.md
+++ b/docs/advanced/document-store/qdrant.md
@@ -76,23 +76,24 @@ Other functions behave the same as in-memory DocumentArray.
 The following configs can be set:
-| Name | Description | Default |
-|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
-| `n_dim` | Number of dimensions of embeddings to be stored and retrieved | **This is always required** |
-| `collection_name` | Qdrant collection name client | **Random collection name generated** |
-| `distance` | Distance metric to use during search.
Can be 'cosine', 'dot' or 'euclidean' | `'cosine'` |
-| `host` | Hostname of the Qdrant server | `'localhost'` |
-| `port` | Port of the Qdrant server | `6333` |
-| `grpc_port` | Port of the Qdrant gRPC interface | `6334` |
-| `prefer_grpc` | Set `True` to use gPRC interface whenever possible in custom methods | `False` |
-| `api_key` | API key for authentication in Qdrant Cloud | `None` |
-| `https` | Set `True` to use HTTPS(SSL) protocol | `None` |
-| `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | `None` |
-| `scroll_batch_size` | Batch size used when scrolling over the storage | `64` |
-| `ef_construct` | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | `None`, defaults to the default value in Qdrant* |
-| `full_scan_threshold` | Minimal size (in KiloBytes) of vectors for additional payload-based indexing | `None`, defaults to the default value in Qdrant* |
-| `m` | Number of edges per node in the index graph. Larger = more accurate search, more space required | `None`, defaults to the default value in Qdrant* |
-| `columns` | Other fields to store in Document | `None` |
+| Name | Description | Default |
+|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
+| `n_dim` | Number of dimensions of embeddings to be stored and retrieved | **This is always required** |
+| `collection_name` | Qdrant collection name client | **Random collection name generated** |
+| `distance` | Distance metric to use during search.
Can be 'cosine', 'dot' or 'euclidean' | `'cosine'` |
+| `host` | Hostname of the Qdrant server | `'localhost'` |
+| `port` | Port of the Qdrant server | `6333` |
+| `grpc_port` | Port of the Qdrant gRPC interface | `6334` |
+| `prefer_grpc` | Set `True` to use gRPC interface whenever possible in custom methods | `False` |
+| `api_key` | API key for authentication in Qdrant Cloud | `None` |
+| `https` | Set `True` to use HTTPS(SSL) protocol | `None` |
+| `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | `None` |
+| `scroll_batch_size` | Batch size used when scrolling over the storage | `64` |
+| `ef_construct` | Number of neighbours to consider during the index building. Larger = more accurate search, more time to build index | `None`, defaults to the default value in Qdrant* |
+| `full_scan_threshold` | Minimal size (in KiloBytes) of vectors for additional payload-based indexing | `None`, defaults to the default value in Qdrant* |
+| `m` | Number of edges per node in the index graph. Larger = more accurate search, more space required | `None`, defaults to the default value in Qdrant* |
+| `columns` | Other fields to store in Document | `None` |
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
 *You can read more about the HNSW parameters and their default values [here](https://qdrant.tech/documentation/indexing/#vector-index)
diff --git a/docs/advanced/document-store/redis.md b/docs/advanced/document-store/redis.md
index 523882e9db8..aee5d43560b 100644
--- a/docs/advanced/document-store/redis.md
+++ b/docs/advanced/document-store/redis.md
@@ -117,25 +117,26 @@ Other functions behave the same as in-memory DocumentArray.
 The following configs can be set:
-| Name | Description | Default |
-|-------------------|---------------------------------------------------------------------------------------------------|-------------------------------------------------- |
-| `host` | Host address of the Redis server | `'localhost'` |
-| `port` | Port of the Redis Server | `6379` |
-| `redis_config` | Other Redis configs in a Dict and pass to `Redis` client constructor, e.g. `socket_timeout`, `ssl`| `{}` |
-| `index_name` | Redis index name; the name of RedisSearch index to set this DocumentArray | `None` |
-| `n_dim` | Dimensionality of the embeddings | `None` |
-| `update_schema` | Boolean flag indicating whether to update Redis Search schema | `True` |
-| `distance` | Similarity distance metric in Redis, one of {`'L2'`, `'IP'`, `'COSINE'`} | `'COSINE'` |
-| `batch_size` | Batch size used to handle storage updates | `64` |
-| `method` | Vector similarity index algorithm in Redis, either `FLAT` or `HNSW` | `'HNSW'` |
-| `index_text` | Boolean flag indicating whether to index `.text`.
`True` will enable full text search on `.text` | `None` |
-| `tag_indices` | List of tags to index as text field | `[]` |
-| `ef_construction` | Optional parameter for Redis HNSW algorithm | `200` |
-| `m` | Optional parameter for Redis HNSW algorithm | `16` |
-| `ef_runtime` | Optional parameter for Redis HNSW algorithm | `10` |
-| `block_size` | Optional parameter for Redis FLAT algorithm | `1048576` |
-| `initial_cap` | Optional parameter for Redis HNSW and FLAT algorithm | `None`, defaults to the default value in Redis |
-| `columns` | Other fields to store in Document and build schema | `None` |
+| Name | Description | Default |
+|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------- |
+| `host` | Host address of the Redis server | `'localhost'` |
+| `port` | Port of the Redis Server | `6379` |
+| `redis_config` | Other Redis configs in a Dict and pass to `Redis` client constructor, e.g. `socket_timeout`, `ssl` | `{}` |
+| `index_name` | Redis index name; the name of RedisSearch index to set this DocumentArray | `None` |
+| `n_dim` | Dimensionality of the embeddings | `None` |
+| `update_schema` | Boolean flag indicating whether to update Redis Search schema | `True` |
+| `distance` | Similarity distance metric in Redis, one of {`'L2'`, `'IP'`, `'COSINE'`} | `'COSINE'` |
+| `batch_size` | Batch size used to handle storage updates | `64` |
+| `method` | Vector similarity index algorithm in Redis, either `FLAT` or `HNSW` | `'HNSW'` |
+| `index_text` | Boolean flag indicating whether to index `.text`.
`True` will enable full text search on `.text` | `None` |
+| `tag_indices` | List of tags to index as text field | `[]` |
+| `ef_construction` | Optional parameter for Redis HNSW algorithm | `200` |
+| `m` | Optional parameter for Redis HNSW algorithm | `16` |
+| `ef_runtime` | Optional parameter for Redis HNSW algorithm | `10` |
+| `block_size` | Optional parameter for Redis FLAT algorithm | `1048576` |
+| `initial_cap` | Optional parameter for Redis HNSW and FLAT algorithm | `None`, defaults to the default value in Redis |
+| `columns` | Other fields to store in Document and build schema | `None` |
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
 You can check the default values in [the docarray source code](https://github.com/jina-ai/docarray/blob/main/docarray/array/storage/redis/backend.py). For vector search configurations, default values are those of the database backend, which you can find in the [Redis documentation](https://redis.io/docs/stack/search/reference/vectors/).
diff --git a/docs/advanced/document-store/sqlite.md b/docs/advanced/document-store/sqlite.md
index 968950f5b03..6184588d1de 100644
--- a/docs/advanced/document-store/sqlite.md
+++ b/docs/advanced/document-store/sqlite.md
@@ -39,5 +39,6 @@ The following configs can be set:
 | `table_name` | SQLite table name | a random name |
 | `serialize_config` | [Serialization config of each Document](../../../fundamentals/document/serialization.md) | None |
 | `conn_config` | [Connection config pass to `sqlite3.connect`](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection) | None |
-| `journal_mode` | [SQLite Pragma: journal mode](https://www.sqlite.org/pragma.html#pragma_journal_mode) | `'DELETE'` |
+| `journal_mode` | [SQLite Pragma: journal mode](https://www.sqlite.org/pragma.html#pragma_journal_mode) | `'DELETE'` |
 | `synchronous` | [SQLite Pragma: synchronous](https://www.sqlite.org/pragma.html#pragma_synchronous) | `'OFF'` |
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
diff --git a/docs/advanced/document-store/weaviate.md b/docs/advanced/document-store/weaviate.md
index 0604fc8ea3a..3563a9232cd 100644
--- a/docs/advanced/document-store/weaviate.md
+++ b/docs/advanced/document-store/weaviate.md
@@ -103,7 +103,7 @@ The following configs can be set:
 | `flat_search_cutoff` | Absolute number of objects configured as the threshold for a flat-search cutoff. If a filter on a filtered vector search matches fewer than the specified elements, the HNSW index is bypassed entirely and a flat (brute-force) search is performed instead. This can speed up queries with very restrictive filters considerably. Optional, defaults to 40000. Set to 0 to turn off flat-search cutoff entirely. | `None`, defaults to the default value in Weaviate* |
 | `cleanup_interval_seconds` | How often the async process runs that “repairs” the HNSW graph after deletes and updates.
(Prior to the repair/cleanup process, deleted objects are simply marked as deleted, but still a fully connected member of the HNSW graph. After the repair has run, the edges are reassigned and the datapoints deleted for good). Typically this value does not need to be adjusted, but if deletes or updates are very frequent it might make sense to adjust the value up or down. (Higher value means it runs less frequently, but cleans up more in a single batch. Lower value means it runs more frequently, but might not be as efficient with each run). | `None`, defaults to the default value in Weaviate* |
 | `skip` | There are situations where it doesn’t make sense to vectorize a class. For example if the class is just meant as glue between two other class (consisting only of references) or if the class contains mostly duplicate elements (Note that importing duplicate vectors into HNSW is very expensive as the algorithm uses a check whether a candidate’s distance is higher than the worst candidate’s distance for an early exit condition. With (mostly) identical vectors, this early exit condition is never met leading to an exhaustive search on each import or query). In this case, you can skip indexing a vector all-together. To do so, set "skip" to "true". skip defaults to false; if not set to true, classes will be indexed normally. This setting is immutable after class initialization. | `None`, defaults to the default value in Weaviate* |
-
+| `list_like` | Controls if ordering of Documents is persisted in the Database. Disabling this breaks list-like features, but can improve performance. | True |
 *You can read more about the HNSW parameters and their default values [here](https://weaviate.io/developers/weaviate/current/vector-index-plugins/hnsw.html#how-to-use-hnsw-and-parameters)
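
Reviewer note on the `list_like` rows added by this patch: the sketch below models, in plain Python, the trade-off those rows describe. `TinyStore` and its methods are hypothetical names invented for illustration. This is not DocArray's implementation, only a minimal model of an offset-to-id mapping that is maintained when `list_like` is on and skipped when it is off.

```python
from typing import Dict, List, Optional, Union


class TinyStore:
    """Hypothetical id-keyed store with an optional offset-to-id mapping."""

    def __init__(self, list_like: bool = True) -> None:
        self._docs: Dict[str, str] = {}
        # The ordering metadata exists only when list_like is enabled.
        self._offset2id: Optional[List[str]] = [] if list_like else None

    def add(self, doc_id: str, text: str) -> None:
        is_new = doc_id not in self._docs
        self._docs[doc_id] = text
        if is_new and self._offset2id is not None:
            # Extra bookkeeping on every insert: this is the cost
            # that disabling list_like avoids.
            self._offset2id.append(doc_id)

    def __getitem__(self, key: Union[int, str]) -> str:
        if isinstance(key, int):
            if self._offset2id is None:
                raise IndexError("integer indexing requires list_like=True")
            return self._docs[self._offset2id[key]]
        return self._docs[key]


ordered = TinyStore(list_like=True)
ordered.add("a", "hello")
ordered.add("b", "world")
second = ordered[1]        # position lookup goes through the mapping

unordered = TinyStore(list_like=False)
unordered.add("a", "hello")
by_id = unordered["a"]     # id lookup still works without the mapping
```

With the mapping disabled, lookups by id are unaffected, while positional access fails, which mirrors the "breaks list-like features" wording in the new table rows.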