docs/user_guide/storing/docindex.md
```python
from docarray import BaseDoc
from docarray.index import HnswDocumentIndex


class MyDoc(BaseDoc):
    ...  # fields elided in this excerpt


db = HnswDocumentIndex[MyDoc](work_dir='./my_test_db')
```

### Schema definition

In this code snippet, `HnswDocumentIndex` takes a schema in the form of `MyDoc`.
The Document Index then _creates a column for each field in `MyDoc`_.
the database will store vectors with 128 dimensions.
Instead of using `NdArray` you can use `TorchTensor` or `TensorFlowTensor` and the Document Index will handle that
for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually!


### Using a predefined Document as schema

DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.

The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding`
field. But this is crucial information for any vector database to work properly!

You can work around this problem by subclassing the predefined Document and adding the dimensionality information:

=== "Using type hint"
```python
from docarray.documents import TextDoc
from docarray.typing import NdArray
from docarray.index import HnswDocumentIndex


class MyDoc(TextDoc):
embedding: NdArray[128]


db = HnswDocumentIndex[MyDoc](work_dir='test_db')
```

=== "Using Field()"
```python
from docarray.documents import TextDoc
from docarray.typing import AnyTensor
from docarray.index import HnswDocumentIndex
from pydantic import Field


class MyDoc(TextDoc):
embedding: AnyTensor = Field(n_dim=128)


db = HnswDocumentIndex[MyDoc](work_dir='test_db3')
```

Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the
predefined Document types, or your custom Document type.

The [next section](#index-data) goes into more detail about data indexing, but note that if you have some `TextDoc`s, `ImageDoc`s etc. that you want to index, you _don't_ need to cast them to `MyDoc`:

```python
import numpy as np

from docarray import DocList
from docarray.documents import TextDoc

# data of type TextDoc
data = DocList[TextDoc](
[
TextDoc(text='hello world', embedding=np.random.rand(128)),
TextDoc(text='hello world', embedding=np.random.rand(128)),
TextDoc(text='hello world', embedding=np.random.rand(128)),
]
)

# you can index this into Document Index of type MyDoc
db.index(data)
```


### Database location

For `HnswDocumentIndex` you need to specify a `work_dir` where the data will be stored; for other backends you
need to have compatible schemas.
- A and B have the same field names and field types
- A and B have the same field names, and, for every field, the type of B is a subclass of the type of A

In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.

## Vector similarity search

Now that you have indexed your data, you can perform vector similarity search using the [find()][docarray.index.abstract.BaseDocIndex.find] method.
matching documents and their associated similarity scores.

How these scores are calculated depends on the backend, and can usually be [configured](#customize-configurations).

### Batched search

You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

The specific configurations that you can tweak depend on the backend, but the in

Document Indexes differentiate between three different kinds of configurations:

### Database configurations

_Database configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column),
and that you _don't_ dynamically change at runtime.
You can customize every field in this configuration:
```python
# ...
# > HnswDocumentIndex.DBConfig(work_dir='/tmp/my_db')
```

### Runtime configurations

_Runtime configurations_ are configurations that pertain to the entire database or table (as opposed to just a specific column),
and that you can dynamically change at runtime.
You can customize every field in this configuration using the [configure()][docarray.index.abstract.BaseDocIndex.configure] method:

After this change, the new setting will be applied to _every_ column that corresponds to a `np.ndarray` type.

### Column configurations

For many vector databases, individual columns can have different configurations.
