Storing different documents in weaviate does not work because the index_name is not inferred from the BaseDoc subclass' name #1455
Description
Storing two different types of documents in Weaviate does not work.
I don't understand the API design here. Is this intentional?
For example, suppose I want to do a multilingual search and need different embedding models to achieve this. Or take the example from the documentation and add a second document class: this results in an error.
Example:
import numpy as np
from pydantic import Field
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from docarray.index.backends.weaviate import WeaviateDocumentIndex
# Define a document schema
class Document(BaseDoc):
    text: str
    embedding: NdArray[2] = Field(
        dims=2, is_embedding=True
    )  # Embedding column -> vector representation of the document
    file: NdArray[100] = Field(dims=100)
# define a book schema
class Book(BaseDoc):
    title: str
    embedding: NdArray = Field(is_embedding=True)
# Make a list of 3 docs to index
docs = DocList[Document]([
    Document(
        text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1"
    ),
    Document(
        text="Hello world, how are you?",
        embedding=np.array([3, 4]),
        file=np.random.rand(100),
        id="2",
    ),
    Document(
        text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut",
        embedding=np.array([5, 6]),
        file=np.random.rand(100),
        id="3",
    ),
])
# Make a list of 2 books to index
books = DocList[Book]([
    Book(
        title="Harry Potter", embedding=np.array([1, 2, 3]), id="1"
    ),
    Book(
        title="Lords of the rings", embedding=np.array([3, 4, 5]), id="2"
    ),
])
batch_config = {
    "batch_size": 20,
    "dynamic": False,
    "timeout_retries": 3,
    "num_workers": 1,
}
runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
dbconfig = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080"
)
store = WeaviateDocumentIndex[Document](db_config=dbconfig)
store.configure(runtimeconfig) # Batch settings being passed on
store.index(docs)
store = WeaviateDocumentIndex[Book](db_config=dbconfig)
store.configure(runtimeconfig) # Batch settings being passed on
store.index(books)
This gives:
---------------------------------------------------------------------------
UnexpectedStatusCodeException Traceback (most recent call last)
/tmp/ipykernel_22411/2065625298.py in ()
56 runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
57
---> 58 store = WeaviateDocumentIndex[Book](db_config=dbconfig)
59 store.configure(runtimeconfig) # Batch settings being passed on
60 store.index(docs)
~/isco-labeling-tool/.venv/lib/python3.9/site-packages/docarray/index/backends/weaviate.py in __init__(self, db_config, **kwargs)
110 self._set_embedding_column()
111 self._set_properties()
--> 112 self._create_schema()
113
114 def _set_properties(self) -> None:
~/isco-labeling-tool/.venv/lib/python3.9/site-packages/docarray/index/backends/weaviate.py in _create_schema(self)
217 )
218 else:
--> 219 self._client.schema.create_class(schema)
220
221 @dataclass
~/isco-labeling-tool/.venv/lib/python3.9/site-packages/weaviate/schema/crud_schema.py in create_class(self, schema_class)
180 # validate the class before loading
...
--> 708 raise UnexpectedStatusCodeException("Create class", response)
709
710 def _create_classes_with_primitives(self, schema_classes_list: list) -> None:
UnexpectedStatusCodeException: Create class! Unexpected status code: 422, with response body: {'error': [{'message': "Name 'Document' already used as a name for an Object class"}]}.
This happens because in docarray/index/backends/weaviate.py on line 210:
schema["class"] = self._db_config.index_name
So the class that is added to Weaviate (the equivalent of a table) is named after the _db_config's index_name rather than after the schema class. Two indexes built from the same DBConfig therefore try to create a class with the same name.
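To make the failure mode concrete, here is a minimal sketch in plain Python, with no Weaviate or docarray involved. ToyDBConfig, ToySchemaStore and create_schema are hypothetical stand-ins, and the default index_name of "Document" is an assumption chosen to match the error message above. Only the schema["class"] assignment mirrors the quoted line; the real _create_schema does more than this.

```python
from dataclasses import dataclass


@dataclass
class ToyDBConfig:
    # Assumed default, picked to match the name in the 422 error.
    index_name: str = "Document"


class ToySchemaStore:
    """Hypothetical stand-in for the Weaviate schema: class names are unique."""

    def __init__(self):
        self.classes = {}

    def create_class(self, schema):
        name = schema["class"]
        if name in self.classes:
            raise ValueError(
                f"Name '{name}' already used as a name for an Object class"
            )
        self.classes[name] = schema


def create_schema(store, db_config, doc_cls):
    # The class name comes from the config; doc_cls.__name__ is never used.
    schema = {
        "class": db_config.index_name,
        "properties": list(doc_cls.__annotations__),
    }
    store.create_class(schema)


class Document:
    text: str


class Book:
    title: str


store = ToySchemaStore()
config = ToyDBConfig()

create_schema(store, config, Document)  # registers the class "Document"
try:
    create_schema(store, config, Book)  # tries to register "Document" again
    collided = False
except ValueError as err:
    collided = True
    print(err)
```

In this toy model the second call fails even though Book is a different class, because both calls resolve to the same configured name.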
One can work around this in the following (quite hacky) manner:
dbconfig = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080",
    index_name=books.doc_type.__name__
)
store = WeaviateDocumentIndex[Book](db_config=dbconfig)
store.configure(runtimeconfig) # Batch settings being passed on
store.index(books)
Alternatively, pass index_name=Book.__name__ directly.
Anyway, my proposed solution would be to use the name of the schema class by default rather than an undocumented index_name in the db_config. The current behaviour is quite unexpected, to be honest.
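A rough sketch of the fallback I have in mind, assuming index_name could be left unset (None) in the config. resolve_index_name is a hypothetical helper for illustration, not an existing docarray function:

```python
from typing import Optional


def resolve_index_name(configured: Optional[str], doc_cls: type) -> str:
    """Use an explicitly configured name if given, else the schema class name."""
    return configured if configured else doc_cls.__name__


class Document:
    pass


class Book:
    pass


# With no explicit index_name, each schema class gets its own name.
document_index = resolve_index_name(None, Document)
book_index = resolve_index_name(None, Book)

# An explicit index_name still takes precedence.
custom_index = resolve_index_name("MyBooks", Book)

print(document_index, book_index, custom_index)
```

An explicit index_name would keep working as it does today, so this should only change the behaviour for users who never set it.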