Skip to content

test(sqlite): add more test to cover sqlite backend#81

Merged
hanxiao merged 43 commits intomainfrom
tests-sqlite
Feb 1, 2022
Merged

test(sqlite): add more test to cover sqlite backend#81
hanxiao merged 43 commits intomainfrom
tests-sqlite

Conversation

@hanxiao
Copy link
Copy Markdown
Member

@hanxiao hanxiao commented Jan 26, 2022

No description provided.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 26, 2022

Codecov Report

Merging #81 (459b6c8) into main (2d5e93b) will increase coverage by 0.43%.
The diff coverage is 88.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #81      +/-   ##
==========================================
+ Coverage   83.75%   84.18%   +0.43%     
==========================================
  Files          92       93       +1     
  Lines        4117     4147      +30     
==========================================
+ Hits         3448     3491      +43     
+ Misses        669      656      -13     
Flag Coverage Δ
docarray 84.18% <88.37%> (+0.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/storage/sqlite/getsetdel.py 100.00% <ø> (+1.66%) ⬆️
docarray/array/storage/weaviate/getsetdel.py 80.00% <ø> (-19.02%) ⬇️
docarray/array/mixins/setitem.py 81.30% <80.88%> (-4.84%) ⬇️
docarray/array/mixins/group.py 87.17% <81.81%> (-2.48%) ⬇️
docarray/math/ndarray.py 90.52% <83.33%> (-0.30%) ⬇️
docarray/array/storage/sqlite/binary.py 89.65% <89.65%> (ø)
docarray/array/storage/base/getsetdel.py 81.25% <92.85%> (+27.72%) ⬆️
docarray/array/mixins/content.py 98.52% <100.00%> (-0.11%) ⬇️
docarray/array/mixins/embed.py 90.90% <100.00%> (ø)
docarray/array/mixins/traverse.py 94.44% <100.00%> (+1.90%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d5e93b...459b6c8. Read the comment docs.

Comment on lines +7 to +8
@pytest.mark.parametrize('da_cls', [DocumentArray, DocumentArraySqlite])
def test_empty_non_zero(da_cls):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although empty is possible to DocumentArraySqlite (basicly because we allow config-free construction via temp files), I don't think it should be supported for storage backends in general (and no backend in particular).
If we think about it, where does DocumentArrayWeaviate.empty(10) store ? Should docs go to the default localhost:8080 server configuration ?

Copy link
Copy Markdown
Member Author

@hanxiao hanxiao Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref1: https://github.com/RaRe-Technologies/sqlitedict/blob/master/sqlitedict.py
ref2: https://github.com/osoken/sqlitecollections

they all support default connection as temp file, so it's not invented by us. let's follow this behavior

Comment on lines +68 to +72
def batch_indices(
self,
batch_size: int,
shuffle: bool = False,
) -> Generator[list, None, None]:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function seems never used anywhere?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, it can be removed or kept as an alternative tobatch_ids since docarray suports both indexing by ids and indexing by indices. Probably for in_memory it's faster to do it by index and in sqlite to do it by ids.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed, we can restore it later if needed

Comment on lines +16 to +38
class SqliteBinaryIOMixin(BinaryIOMixin):
"""Save/load an array to a binary file."""

@classmethod
def load_binary(
cls: Type['T'],
file: Union[str, BinaryIO, bytes],
protocol: str = 'protobuf-array',
compress: Optional[str] = None,
_show_progress: bool = False,
streaming: bool = False,
) -> Union['DocumentArray', Generator['Document', None, None]]:
"""Load array elements from a compressed binary file.

:param file: File or filename or serialized bytes where the data is stored.
:param protocol: protocol to use. 'pickle-array' is not supported for DocumentArraySqlite
:param compress: compress algorithm to use
:param _show_progress: show progress bar, only works when protocol is `pickle` or `protobuf`
:param streaming: if `True` returns a generator over `Document` objects.
In case protocol is pickle the `Documents` are streamed from disk to save memory usage
:return: a DocumentArray object
"""
_check_protocol(protocol)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think a better way to do this is the following:

  1. moving _check_protocol(protocol) to DocumentArray level mixin
  2. add class variable available_protocols , e.g.
class IOMixin:

   available_protocols = ['a', 'b', 'c', 'd']
   def _check_protocols(self, value: str):
       if value not in self.available_protocols:
          raise ...

   def from_bytes():
      self._check_protocols
  1. Finally, all DA subclass just need to specify a white list by overriding available_protocols class variable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanxiao I'll still need to subclass the mix and override methods like load_binary because I need to change the default protocol. Is it okay to implement your suggestion while still overriding the methods ?

There's another option which is adding attribute default_protocol in the same way, and modifying the protocol parameter to be None (will bad for autocompletion).
Your thoughts ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, okay then let's leave it like this.

Copy link
Copy Markdown
Member Author

@hanxiao hanxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implement available_protocols on the top-level

for si, _val in zip(index, value):
self[si] = _val # leverage existing setter
for si, _val in zip(index, value):
self[si] = _val # leverage existing setter
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid types in the sequence should be caught and not fail silently; particularly things like da[1.0, 2.0] coming from np or torch might be a pitfall.

Suggested change
self[si] = _val # leverage existing setter
else:
raise IndexError(f"{index} should be either a sequence of bool, int or str")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single quote btw

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oups not used to it yet 😕

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +164 to +170
if _a in ('tensor', 'embedding'):
if _a == 'tensor':
_docs.tensors = _v
elif _a == 'embedding':
_docs.embeddings = _v
for _d in _docs:
self._set_doc_by_id(_d.id, _d)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if _a in ('tensor', 'embedding'):
if _a == 'tensor':
_docs.tensors = _v
elif _a == 'embedding':
_docs.embeddings = _v
for _d in _docs:
self._set_doc_by_id(_d.id, _d)
if _a == 'tensor':
_docs.tensors = _v
elif _a == 'embedding':
_docs.embeddings = _v
for _d in _docs:
self._set_doc_by_id(_d.id, _d)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for _d in _docs:
    self._set_doc_by_id(_d.id, _d)

should be common to both branches

@hanxiao hanxiao merged commit 3c4ba0b into main Feb 1, 2022
@hanxiao hanxiao deleted the tests-sqlite branch February 1, 2022 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants