From 879ee6531001a1439a384b49fc59355a3fee84c4 Mon Sep 17 00:00:00 2001 From: anna-charlotte Date: Mon, 17 Apr 2023 13:53:19 +0200 Subject: [PATCH 1/5] docs: resolve todos and incomplete sections and missing links Signed-off-by: anna-charlotte --- docs/user_guide/representing/first_step.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index c1b41b623c6..f61510cb978 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -50,7 +50,7 @@ all the features of `BaseModel` in your `Doc` class. `BaseDoc`: * Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept [later](../../data_types/first_steps.md). * Can be configured using a nested `Config` class, see Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on what kind of config Pydantic offers. -* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI]('https://fastapi.tiangolo.com/'). 
+* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI]('https://fastapi.tiangolo.com/').======= ## Representing multimodal and nested data From e3347ee072d26d6d667e4cf8c8b13e430566683f Mon Sep 17 00:00:00 2001 From: anna-charlotte Date: Mon, 17 Apr 2023 16:40:11 +0200 Subject: [PATCH 2/5] docs: first draft migration guide Signed-off-by: anna-charlotte --- docarray/documents/legacy/legacy_document.py | 4 +- docs/migration_guide.md | 71 ++++++++++++++++++++ mkdocs.yml | 1 + 3 files changed, 74 insertions(+), 2 deletions(-) create mode 100644 docs/migration_guide.md diff --git a/docarray/documents/legacy/legacy_document.py b/docarray/documents/legacy/legacy_document.py index 3d9cde62d13..eea42f1d93e 100644 --- a/docarray/documents/legacy/legacy_document.py +++ b/docarray/documents/legacy/legacy_document.py @@ -8,10 +8,10 @@ class LegacyDocument(BaseDoc): """ - This Document is the LegacyDocument. It follows the same schema as in DocList v1. + This Document is the LegacyDocument. It follows the same schema as in DocArray v1. It can be useful to start migrating a codebase from v1 to v2. - Nevertheless, the API is not totally compatible with DocAray v1 `Document`. + Nevertheless, the API is not totally compatible with DocArray v1 `Document`. Indeed, none of the method associated with `Document` are present. Only the schema of the data is similar. diff --git a/docs/migration_guide.md b/docs/migration_guide.md new file mode 100644 index 00000000000..a13e2e1d331 --- /dev/null +++ b/docs/migration_guide.md @@ -0,0 +1,71 @@ +# Migration guide + +## Document + +- `Document` has been renamed to [`BaseDoc`][docarray.BaseDoc]. +- `BaseDoc` can not be used directly, but instead has to be extended. 
+- Following from the previous point, the extending of `BaseDoc` allows for a flexible schema while the +`Document` class in v1 only allowed for a fixed schema, with one of `tensor`, `text` and `blob`, +and additional `chunks` and `matches`. +- In v2 we have the [`LegacyDocument`][docarray.documents.legacy.LegacyDocument] class, + which extends `BaseDoc` while following the same schema as in DocArray v1. + The `LegacyDocument` can be useful to start migrating your codebase from v1 to v2. + Nevertheless, the API is not totally compatible with DocArray v1 `Document`. + Indeed, due to the added flexibility none of the method associated with `Document` are present. + Only the schema of the data is similar. + +## DocumentArray + +### DocList + +- The `DocumentArray` class from v1 has been renamed to [`DocList`][docarray.array.DocList], to be more descriptive of its actual functionality. + +### DocVec + +- Additionally, we added the class [`DocVec`][docarray.array.DocVec]. Both `DocVec` and `DocList` extend `AnyDocArray`. +- `DocVec` is a container of Documents appropriates to perform computation that require batches of data +(ex: matrix multiplication, distance calculation, deep learning forward pass). +- A `DocVec` has a similar interface as `DocList` +but with an underlying implementation that is column based instead of row based. +Each field of the schema of the `DocVec` (the `.doc_type` which is a +`BaseDoc`) will be stored in a column. +If the field is a tensor, the data from all Documents will be stored as a single +doc_vec (torch/np/tf) tensor. If the tensor field is `AnyTensor` or a Union of tensor types, the +`.tensor_type` will be used to determine the type of the doc_vec column. + + +### Access attributes of your DocumentArray + +In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocArray instance. 
+In v2 you don't have to use the plural, but instead just use the document's attribute name. +This will return a list of `type(attribute)`. + +```python +from docarray import BaseDoc, DocList + + +class Book(BaseDoc): + title: str + author: str = None + + +docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)]) +book_titles = docs.title + +assert isinstance(book_titles, list) + +for title in book_titles: + assert isinstance(title, str) +``` + +## Document Store + +In v2 the `Document Store` has been renamed to [`DocIndex`][docarray.index.Doc] and can be used for fast retrieval using vector similarity. +DocArray v2 `DocIndex` supports: + +- Weaviate +- Qdrant +- ElasticSearch +- HNSWLib + +In contrast, `DocStore` in v2 can be used for simple long-term storage, such as with AWS S3 buckets or JINA AI Cloud. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 00d66e44129..0c770dc64c1 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -120,6 +120,7 @@ nav: - data_types/3d_mesh/3d_mesh.md - data_types/table/table.md - data_types/multimodal/multimodal.md + - Migration guide: migration_guide.md - ... - Glossary: glossary.md - Contributing: CONTRIBUTING.md From 5b6f7654cc7c25bc784b5b77513eb84c1aad7258 Mon Sep 17 00:00:00 2001 From: anna-charlotte Date: Mon, 17 Apr 2023 17:34:45 +0200 Subject: [PATCH 3/5] docs: update guide Signed-off-by: anna-charlotte --- docs/migration_guide.md | 51 ++++++++++++++++++++++++++++------------- 1 file changed, 35 insertions(+), 16 deletions(-) diff --git a/docs/migration_guide.md b/docs/migration_guide.md index a13e2e1d331..a68b8c98654 100644 --- a/docs/migration_guide.md +++ b/docs/migration_guide.md @@ -1,6 +1,6 @@ # Migration guide -## Document +## Changes to `Document` - `Document` has been renamed to [`BaseDoc`][docarray.BaseDoc]. - `BaseDoc` can not be used directly, but instead has to be extended. 
@@ -8,21 +8,23 @@ `Document` class in v1 only allowed for a fixed schema, with one of `tensor`, `text` and `blob`, and additional `chunks` and `matches`. - In v2 we have the [`LegacyDocument`][docarray.documents.legacy.LegacyDocument] class, - which extends `BaseDoc` while following the same schema as in DocArray v1. + which extends `BaseDoc` while following the same schema as v1's `Document`. The `LegacyDocument` can be useful to start migrating your codebase from v1 to v2. - Nevertheless, the API is not totally compatible with DocArray v1 `Document`. + Nevertheless, the API is not fully compatible with DocArray v1 `Document`. Indeed, due to the added flexibility none of the method associated with `Document` are present. Only the schema of the data is similar. -## DocumentArray +## Changes to `DocumentArray` ### DocList -- The `DocumentArray` class from v1 has been renamed to [`DocList`][docarray.array.DocList], to be more descriptive of its actual functionality. +- The `DocumentArray` class from v1 has been renamed to [`DocList`][docarray.array.DocList], +to be more descriptive of its actual functionality, since it is a list of `BaseDoc`s ### DocVec -- Additionally, we added the class [`DocVec`][docarray.array.DocVec]. Both `DocVec` and `DocList` extend `AnyDocArray`. +- Additionally, we introduced the class [`DocVec`][docarray.array.DocVec], which is a column based representation of `BaseDoc`s. +Both `DocVec` and `DocList` extend `AnyDocArray`. - `DocVec` is a container of Documents appropriates to perform computation that require batches of data (ex: matrix multiplication, distance calculation, deep learning forward pass). - A `DocVec` has a similar interface as `DocList` @@ -33,12 +35,27 @@ If the field is a tensor, the data from all Documents will be stored as a single doc_vec (torch/np/tf) tensor. If the tensor field is `AnyTensor` or a Union of tensor types, the `.tensor_type` will be used to determine the type of the doc_vec column. 
+### Parameterized DocList +- With the added flexibility of your document schema, and therefore endless options to design your document schema, +when initializing a `DocList` it does not necessarily have to be homogeneous. +- If you want a homogeneous `DocList` you can parameterize it at initialization time: +```python +from docarray import DocList +from docarray.documents import ImageDoc + +docs = DocList[ImageDoc]() +``` + +Methods like `.from_csv()` or `.pull()` only work with parameterized `DocList`s. ### Access attributes of your DocumentArray -In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocArray instance. -In v2 you don't have to use the plural, but instead just use the document's attribute name. +- In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural +of the attribute's name on your DocArray instance. +- In v2 you don't have to use the plural, but instead just use the document's attribute name, +since `AnyDocArray` will expose the same attributes as the `BaseDoc`s it contains. This will return a list of `type(attribute)`. +However, this only works if all the `BaseDoc`s in the `AnyDocArray` have the same schema. ```python from docarray import BaseDoc, DocList @@ -50,17 +67,16 @@ class Book(BaseDoc): docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)]) -book_titles = docs.title - -assert isinstance(book_titles, list) +book_titles = docs.title # returns a list[str] -for title in book_titles: - assert isinstance(title, str) +# this would fail +# docs = DocList([Book(title=f'title {i}') for i in range(5)]) +# book_titles = docs.title ``` -## Document Store +## Changes to Document Store -In v2 the `Document Store` has been renamed to [`DocIndex`][docarray.index.Doc] and can be used for fast retrieval using vector similarity. 
+In v2 the `Document Store` has been renamed to [`DocIndex`](user_guide/storing/first_steps.md) and can be used for fast retrieval using vector similarity. DocArray v2 `DocIndex` supports: - Weaviate @@ -68,4 +84,7 @@ DocArray v2 `DocIndex` supports: - ElasticSearch - HNSWLib -In contrast, `DocStore` in v2 can be used for simple long-term storage, such as with AWS S3 buckets or JINA AI Cloud. \ No newline at end of file +Instead of creating a `DocumentArray` instance and setting the `storage` parameter to a vector database of your choice, +in v2 you can initialize a `DocIndex` object of your choice, such as `db = HnswDocumentIndex\[MyDoc](work_dir='/my/work/dir'). + +In contrast, [`DocStore`](user_guide/storing/first_step.md#document-store) in v2 can be used for simple long-term storage, such as with AWS S3 buckets or JINA AI Cloud. \ No newline at end of file From 07ab67bf12e5b788a0a5b750da5f369c36deb21e Mon Sep 17 00:00:00 2001 From: anna-charlotte Date: Mon, 17 Apr 2023 17:39:31 +0200 Subject: [PATCH 4/5] fix: remove typo Signed-off-by: anna-charlotte --- docs/user_guide/representing/first_step.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/representing/first_step.md b/docs/user_guide/representing/first_step.md index f61510cb978..c1b41b623c6 100644 --- a/docs/user_guide/representing/first_step.md +++ b/docs/user_guide/representing/first_step.md @@ -50,7 +50,7 @@ all the features of `BaseModel` in your `Doc` class. `BaseDoc`: * Will perform data validation: `BaseDoc` will check that the data you pass to it is valid. If not, it will raise an error. Data being "valid" is actually defined by the type used in the type hint itself, but we will come back to this concept [later](../../data_types/first_steps.md). * Can be configured using a nested `Config` class, see Pydantic [documentation](https://docs.pydantic.dev/usage/model_config/) for more detail on what kind of config Pydantic offers. 
-* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI]('https://fastapi.tiangolo.com/').======= +* Can be used as a drop-in replacement for `BaseModel` in your code and is compatible with tools that use Pydantic, like [FastAPI]('https://fastapi.tiangolo.com/'). ## Representing multimodal and nested data From 42a1d1547772e1878fa40e9e28134be134d38289 Mon Sep 17 00:00:00 2001 From: anna-charlotte Date: Mon, 17 Apr 2023 17:59:35 +0200 Subject: [PATCH 5/5] fix: add introduction Signed-off-by: anna-charlotte --- docs/migration_guide.md | 43 +++++++++++++++++++++++++++++++++-------- 1 file changed, 35 insertions(+), 8 deletions(-) diff --git a/docs/migration_guide.md b/docs/migration_guide.md index a68b8c98654..797b7bcefd7 100644 --- a/docs/migration_guide.md +++ b/docs/migration_guide.md @@ -1,17 +1,40 @@ # Migration guide +If you are using DocArray v<0.30.0, you will be familiar with its [dataclass API](https://docarray.jina.ai/fundamentals/dataclass/). + +_DocArray v2 is that idea, taken seriously._ Every document is created through a dataclass-like interface, +courtesy of [Pydantic](https://pydantic-docs.helpmanual.io/usage/models/). + +This gives the following advantages: + +- **Flexibility:** No need to conform to a fixed set of fields -- your data defines the schema +- **Multi-modality:** Easily store multiple modalities and multiple embeddings in the same Document +- **Language agnostic:** At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python. + +You may also be familiar with our old Document Stores for vector DB integration. 
+They are now called **Document Indexes** and offer the following improvements: + +- **Hybrid search:** You can now combine vector search with text search, and even filter by arbitrary fields +- **Production-ready:** The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain +- **Increased flexibility:** We strive to support any configuration or setting that you could perform through the DB's first-party client + +For now, Document Indexes support **[Weaviate](https://weaviate.io/)**, **[Qdrant](https://qdrant.tech/)**, **[ElasticSearch](https://www.elastic.co/)**, and **[HNSWLib](https://github.com/nmslib/hnswlib)**, with more to come. + ## Changes to `Document` - `Document` has been renamed to [`BaseDoc`][docarray.BaseDoc]. -- `BaseDoc` can not be used directly, but instead has to be extended. +- `BaseDoc` can not be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface. - Following from the previous point, the extending of `BaseDoc` allows for a flexible schema while the `Document` class in v1 only allowed for a fixed schema, with one of `tensor`, `text` and `blob`, and additional `chunks` and `matches`. +- Due to the added flexibility, one can not know what fields your document class will provide. + Therefore, various methods from v1 (such as `.load_uri_to_image_tensor()`) are not supported in v2. + Instead, we provide some of those methods on [typing-level](data_types/first_steps.md). - In v2 we have the [`LegacyDocument`][docarray.documents.legacy.LegacyDocument] class, which extends `BaseDoc` while following the same schema as v1's `Document`. The `LegacyDocument` can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 `Document`. - Indeed, due to the added flexibility none of the method associated with `Document` are present. 
+ Indeed, none of the methods associated with `Document` are present. Only the schema of the data is similar. ## Changes to `DocumentArray` @@ -46,7 +69,7 @@ from docarray.documents import ImageDoc docs = DocList[ImageDoc]() ``` -Methods like `.from_csv()` or `.pull()` only work with parameterized `DocList`s. +- Methods like `.from_csv()` or `.pull()` only work with parameterized `DocList`s. ### Access attributes of your DocumentArray @@ -79,12 +102,16 @@ book_titles = docs.title # returns a list[str] In v2 the `Document Store` has been renamed to [`DocIndex`](user_guide/storing/first_steps.md) and can be used for fast retrieval using vector similarity. DocArray v2 `DocIndex` supports: -- Weaviate -- Qdrant -- ElasticSearch -- HNSWLib +- [Weaviate](https://weaviate.io/) +- [Qdrant](https://qdrant.tech/) +- [ElasticSearch](https://www.elastic.co/) +- [HNSWLib](https://github.com/nmslib/hnswlib) Instead of creating a `DocumentArray` instance and setting the `storage` parameter to a vector database of your choice, -in v2 you can initialize a `DocIndex` object of your choice, such as `db = HnswDocumentIndex\[MyDoc](work_dir='/my/work/dir'). +in v2 you can initialize a `DocIndex` object of your choice, such as: + +```python +db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir') +``` In contrast, [`DocStore`](user_guide/storing/first_step.md#document-store) in v2 can be used for simple long-term storage, such as with AWS S3 buckets or JINA AI Cloud. \ No newline at end of file