Support root_id for storage backends #775

@JohannesMessner

Description

Some of our users work with deeply nested data: they perform vector search on some nesting level, but are actually interested in retrieving the root-level documents.

In memory, this can be solved by traversing the nested structure on the fly. With a database backend this is not feasible: nested levels are only present in serialized form, so everything would have to be loaded into memory before the structure could be traversed.

To tackle this, we propose the following:

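  • We create a function get_root_doc(da, doc) that returns the root document of doc. The implementation could look something like this:
def get_root_doc(da, doc):
    # Flatten all nesting levels so that any document can be looked up by id.
    root_da_flat = da[...]
    result = doc
    # Follow parent_id links upward until reaching a document without a parent.
    while result.parent_id:
        result = root_da_flat[result.parent_id]
    return result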
  • For storage backends, we expose an API that lets you search on some nesting level but retrieve documents on the root level: da.find(..., return_root=True)
  • It works as follows:
    1. When inserting a Document (or a batch of Documents), the backend calls get_root_doc() on each of them.
    2. It stores the root document's id as a separate column in the database.
    3. When searching with return_root=True, it performs the search, takes each result's stored root_id, and returns the corresponding root document.
    4. The level the user searches on needs to exist as a subindex (this is already the case), and the root level is always properly indexed anyway. The intermediate nesting levels can stay serialized.
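
The flow above can be sketched with a small, self-contained toy model. This is only an illustration under stated assumptions, not the proposed implementation: `Doc`, `flatten`, and `ToyBackend` are hypothetical stand-ins (not DocArray classes), the "database" is a pair of dictionaries, and for brevity every nested document that has an embedding goes into a single subindex.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a nested Document."""
    id: str
    embedding: Optional[List[float]] = None
    parent_id: Optional[str] = None
    chunks: List["Doc"] = field(default_factory=list)


def flatten(docs):
    """Yield every document on every nesting level (plays the role of da[...])."""
    for doc in docs:
        yield doc
        yield from flatten(doc.chunks)


def get_root_doc(flat_by_id, doc):
    """Follow parent_id links upward until a document without a parent is reached."""
    result = doc
    while result.parent_id:
        result = flat_by_id[result.parent_id]
    return result


class ToyBackend:
    """Hypothetical backend: a root table plus one subindex with a root_id column."""

    def __init__(self):
        self.roots: Dict[str, Doc] = {}
        self.subindex: Dict[str, dict] = {}

    def index(self, docs):
        flat_by_id = {d.id: d for d in flatten(docs)}
        for doc in docs:
            self.roots[doc.id] = doc
        for doc in flat_by_id.values():
            if doc.parent_id and doc.embedding is not None:
                # Steps 1 and 2: resolve the root once, at insertion time,
                # and store its id next to the embedding.
                root = get_root_doc(flat_by_id, doc)
                self.subindex[doc.id] = {
                    "doc": doc,
                    "embedding": doc.embedding,
                    "root_id": root.id,
                }

    def find(self, query, limit=1, return_root=False):
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row["embedding"], query))

        hits = sorted(self.subindex.values(), key=sq_dist)[:limit]
        if return_root:
            # Step 3: map each hit to its root through the stored root_id.
            return [self.roots[row["root_id"]] for row in hits]
        return [row["doc"] for row in hits]


# Two roots, each with one intermediate level and one searchable leaf.
g1 = Doc(id="g1", embedding=[0.0, 0.0], parent_id="c1")
c1 = Doc(id="c1", parent_id="r1", chunks=[g1])
r1 = Doc(id="r1", chunks=[c1])
g2 = Doc(id="g2", embedding=[1.0, 1.0], parent_id="c2")
c2 = Doc(id="c2", parent_id="r2", chunks=[g2])
r2 = Doc(id="r2", chunks=[c2])

backend = ToyBackend()
backend.index([r1, r2])
nested_hit = backend.find([0.9, 1.0])[0]
root_hit = backend.find([0.9, 1.0], return_root=True)[0]
```

In this sketch the intermediate documents (c1, c2) are never indexed or traversed at query time. The parent chain is walked only once, during insertion, which is the point of storing root_id as its own column.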
