Storing different documents in weaviate does not work because the index_name is not inferred from the BaseDoc subclass' name #1455

@hugocool

Storing two different types of documents in Weaviate does not work.
I don't understand the API design here. Is this intentional?

For example, suppose I want to do a multilingual search and need a different embedding model per language. Or take the example from the documentation and add a second document class: this results in an error.
Example:

import numpy as np
from pydantic import Field
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from docarray.index.backends.weaviate import WeaviateDocumentIndex

# Define a document schema
class Document(BaseDoc):
    text: str
    embedding: NdArray[2] = Field(
        dims=2, is_embedding=True
    )  # Embedding column -> vector representation of the document
    file: NdArray[100] = Field(dims=100)

# define a book schema
class Book(BaseDoc):
    title: str
    embedding: NdArray = Field(is_embedding=True)

# Make a list of 3 docs to index
docs = DocList[Document]([
    Document(
        text="Hello world", embedding=np.array([1, 2]), file=np.random.rand(100), id="1"
    ),
    Document(
        text="Hello world, how are you?",
        embedding=np.array([3, 4]),
        file=np.random.rand(100),
        id="2",
    ),
    Document(
        text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut",
        embedding=np.array([5, 6]),
        file=np.random.rand(100),
        id="3",
    ),
])

books = DocList[Book]([
    Book(
        title="Harry Potter", embedding=np.array([1, 2, 3]), id="1"
    ),
    Book(
        title="Lord of the Rings", embedding=np.array([3, 4, 5]), id="2"
    ),
])

batch_config = {
    "batch_size": 20,
    "dynamic": False,
    "timeout_retries": 3,
    "num_workers": 1,
}

runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
dbconfig = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080"
)
store = WeaviateDocumentIndex[Document](db_config=dbconfig)
store.configure(runtimeconfig)  # Batch settings being passed on
store.index(docs)

store = WeaviateDocumentIndex[Book](db_config=dbconfig)
store.configure(runtimeconfig)  # Batch settings being passed on
store.index(books)

This gives:

---------------------------------------------------------------------------
UnexpectedStatusCodeException             Traceback (most recent call last)
/tmp/ipykernel_22411/2065625298.py in ()
     56 runtimeconfig = WeaviateDocumentIndex.RuntimeConfig(batch_config=batch_config)
     57 
---> 58 store = WeaviateDocumentIndex[Book](db_config=dbconfig)
     59 store.configure(runtimeconfig)  # Batch settings being passed on
     60 store.index(docs)

~/isco-labeling-tool/.venv/lib/python3.9/site-packages/docarray/index/backends/weaviate.py in __init__(self, db_config, **kwargs)
    110         self._set_embedding_column()
    111         self._set_properties()
--> 112         self._create_schema()
    113 
    114     def _set_properties(self) -> None:

~/isco-labeling-tool/.venv/lib/python3.9/site-packages/docarray/index/backends/weaviate.py in _create_schema(self)
    217             )
    218         else:
--> 219             self._client.schema.create_class(schema)
    220 
    221     @dataclass

~/isco-labeling-tool/.venv/lib/python3.9/site-packages/weaviate/schema/crud_schema.py in create_class(self, schema_class)
    180         # validate the class before loading
...
--> 708             raise UnexpectedStatusCodeException("Create class", response)
    709 
    710     def _create_classes_with_primitives(self, schema_classes_list: list) -> None:

UnexpectedStatusCodeException: Create class! Unexpected status code: 422, with response body: {'error': [{'message': "Name 'Document' already used as a name for an Object class"}]}.

This happens because of line 210 in docarray/index/backends/weaviate.py:

schema["class"] = self._db_config.index_name

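So the class that is created in Weaviate (the equivalent of a table) is named after the _db_config's index_name rather than after the schema class. Both indexes therefore try to create a class called 'Document', and the second attempt is rejected.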
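To make the collision concrete, here is a minimal toy model that needs neither Weaviate nor docarray. Everything in it (FakeSchemaRegistry, FakeDBConfig, create_schema, ClassAlreadyExists) is a hypothetical stand-in, not the library's code. The default index_name of "Document" is an assumption inferred from the error message above, and the branch that precedes the `else:` in the traceback is not modeled.

```python
from dataclasses import dataclass


class ClassAlreadyExists(Exception):
    """Stand-in for the 422 response returned for a duplicate class name."""


class FakeSchemaRegistry:
    """Hypothetical in-memory stand-in for a Weaviate schema (not a real client)."""

    def __init__(self):
        self.classes = set()

    def create_class(self, schema):
        name = schema["class"]
        if name in self.classes:
            raise ClassAlreadyExists(
                f"Name '{name}' already used as a name for an Object class"
            )
        self.classes.add(name)


@dataclass
class FakeDBConfig:
    # Assumed default, inferred from the error message in this issue.
    index_name: str = "Document"


def create_schema(registry, db_config):
    # Mirrors the quoted line: the class name comes from the config,
    # not from the document type the index was parametrized with.
    schema = {"class": db_config.index_name}
    registry.create_class(schema)


registry = FakeSchemaRegistry()
create_schema(registry, FakeDBConfig())  # first index creates "Document"

try:
    create_schema(registry, FakeDBConfig())  # second index reuses the default name
    collided = False
except ClassAlreadyExists:
    collided = True

# An explicit, distinct index_name avoids the clash.
create_schema(registry, FakeDBConfig(index_name="Book"))
```

In this sketch the second call fails regardless of which document type it was meant for, which is the behaviour reported above.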

One can work around this in the following (quite hacky) manner:

dbconfig = WeaviateDocumentIndex.DBConfig(
    host="http://localhost:8080",
    index_name=books.doc_type.__name__
)
store = WeaviateDocumentIndex[Book](db_config=dbconfig)
store.configure(runtimeconfig)  # Batch settings being passed on
store.index(books)

Alternatively, pass Book.__name__ as the index_name directly.

My proposed solution would be to use the name of the schema class by default, rather than an undocumented name in the db_config. The current behaviour is quite unexpected.
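A rough sketch of what that default could look like, assuming an explicitly configured name should still take precedence. The helper resolve_index_name and its parameters are hypothetical and not part of docarray, and the two classes are plain placeholders for the BaseDoc subclasses above.

```python
from typing import Optional


def resolve_index_name(doc_type: type, configured_name: Optional[str] = None) -> str:
    """Return the configured name if one was given, else the schema class name.

    Hypothetical helper: it only illustrates the proposed fallback order.
    """
    if configured_name:
        return configured_name
    return doc_type.__name__


# Plain placeholder classes standing in for the BaseDoc subclasses.
class Document:
    pass


class Book:
    pass


document_index_name = resolve_index_name(Document)
book_index_name = resolve_index_name(Book)
custom_index_name = resolve_index_name(Book, configured_name="Library")
```

With a fallback like this, the two indexes in the example would get distinct class names without any extra configuration, while users who set index_name explicitly would see no change.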

Labels: area/document-index (Concerning Document Index or a Document Index backend), difficulty/medium (Suited after 1 or 2 good-first-issues)
Status: Done