test(sqlite): add more tests to cover sqlite backend #81
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##             main      #81      +/-   ##
==========================================
+ Coverage   83.75%   84.18%   +0.43%
==========================================
  Files          92       93       +1
  Lines        4117     4147      +30
==========================================
+ Hits         3448     3491      +43
+ Misses        669      656      -13
```
```python
@pytest.mark.parametrize('da_cls', [DocumentArray, DocumentArraySqlite])
def test_empty_non_zero(da_cls):
```
Although `empty` is possible for DocumentArraySqlite (basically because we allow config-free construction via temp files), I don't think it should be supported for storage backends in general (and for no backend in particular).
If we think about it, where would `DocumentArrayWeaviate.empty(10)` store its docs? Should they go to the default `localhost:8080` server configuration?
ref1: https://github.com/RaRe-Technologies/sqlitedict/blob/master/sqlitedict.py
ref2: https://github.com/osoken/sqlitecollections
They both support a default connection as a temp file, so this is not invented by us. Let's follow this behavior.
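To make the temp-file fallback concrete, here is a minimal sketch of the pattern the two references above use. `default_connection` is a hypothetical name for illustration, not docarray's actual implementation; only the stdlib `sqlite3` and `tempfile` modules are assumed.

```python
import sqlite3
import tempfile

def default_connection(path=None):
    """Open a sqlite connection; with no config, fall back to a temp file."""
    if path is None:
        # a throwaway on-disk database the caller never has to configure,
        # mirroring sqlitedict's config-free default connection
        path = tempfile.NamedTemporaryFile(suffix='.db', delete=False).name
    return sqlite3.connect(path)

conn = default_connection()
conn.execute('CREATE TABLE docs (doc_id TEXT PRIMARY KEY, body BLOB)')
conn.execute("INSERT INTO docs VALUES ('a', x'00')")
count = conn.execute('SELECT COUNT(*) FROM docs').fetchone()[0]
```

This is why a config-free `empty()` is unambiguous for sqlite (every call gets its own throwaway file) but not for a server-backed store like Weaviate, where "no config" still has to pick a shared server.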
docarray/array/mixins/group.py (outdated)
```python
def batch_indices(
    self,
    batch_size: int,
    shuffle: bool = False,
) -> Generator[list, None, None]:
```
this function doesn't seem to be used anywhere?
It's not. It can be removed, or kept as an alternative to `batch_ids`, since docarray supports both indexing by ids and indexing by indices. For in-memory it's probably faster to do it by index, and in sqlite by ids.
Removed; we can restore it later if needed.
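For reference, the two batching strategies mentioned above can be sketched in a few lines. This is an illustrative standalone version, not the removed `batch_indices` method itself: it yields index batches, which are then mapped back to document ids for backends (like sqlite) where id lookup is cheaper.

```python
import random
from typing import Generator, List

def batch_indices(n: int, batch_size: int, shuffle: bool = False) -> Generator[List[int], None, None]:
    """Yield batches of positional indices over a collection of size n."""
    idx = list(range(n))
    if shuffle:
        random.shuffle(idx)
    for i in range(0, n, batch_size):
        yield idx[i:i + batch_size]

ids = ['doc0', 'doc1', 'doc2', 'doc3', 'doc4']
index_batches = list(batch_indices(len(ids), 2))
# batching by ids: translate each index batch through the id list
id_batches = [[ids[i] for i in b] for b in index_batches]
```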
```python
class SqliteBinaryIOMixin(BinaryIOMixin):
    """Save/load an array to a binary file."""

    @classmethod
    def load_binary(
        cls: Type['T'],
        file: Union[str, BinaryIO, bytes],
        protocol: str = 'protobuf-array',
        compress: Optional[str] = None,
        _show_progress: bool = False,
        streaming: bool = False,
    ) -> Union['DocumentArray', Generator['Document', None, None]]:
        """Load array elements from a compressed binary file.

        :param file: File or filename or serialized bytes where the data is stored.
        :param protocol: protocol to use. 'pickle-array' is not supported for DocumentArraySqlite
        :param compress: compress algorithm to use
        :param _show_progress: show progress bar, only works when protocol is `pickle` or `protobuf`
        :param streaming: if `True` returns a generator over `Document` objects.
            In case protocol is pickle the `Documents` are streamed from disk to save memory usage
        :return: a DocumentArray object
        """
        _check_protocol(protocol)
```
I think a better way to do this is the following:

- move `_check_protocol(protocol)` to a `DocumentArray`-level mixin
- add a class variable `available_protocols`, e.g.

```python
class IOMixin:
    available_protocols = ['a', 'b', 'c', 'd']

    def _check_protocols(self, value: str):
        if value not in self.available_protocols:
            raise ...

    def from_bytes(self):
        self._check_protocols
```

- Finally, every DA subclass just needs to specify a whitelist by overriding the `available_protocols` class variable.
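A runnable sketch of the whitelist-override pattern proposed above (class names `IOMixin`/`SqliteIOMixin` and the protocol list are illustrative, not docarray's actual classes):

```python
class IOMixin:
    # full protocol whitelist; the single check lives at the top level
    available_protocols = ('pickle', 'pickle-array', 'protobuf', 'protobuf-array')

    def _check_protocol(self, value: str) -> None:
        if value not in self.available_protocols:
            raise ValueError(
                f'protocol {value!r} not supported, choose from {self.available_protocols}'
            )

class SqliteIOMixin(IOMixin):
    # the sqlite backend narrows the list: 'pickle-array' is excluded
    available_protocols = ('pickle', 'protobuf', 'protobuf-array')

SqliteIOMixin()._check_protocol('protobuf')  # accepted
try:
    SqliteIOMixin()._check_protocol('pickle-array')
    rejected = False
except ValueError:
    rejected = True
```

The design keeps the validation logic in one place while each backend only declares data (its whitelist), instead of every subclass overriding the checking method.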
@hanxiao I'll still need to subclass the mixin and override methods like `load_binary`, because I need to change the default protocol. Is it okay to implement your suggestion while still overriding the methods?
There's another option, which is adding a `default_protocol` attribute in the same way and changing the `protocol` parameter to `None` (which will be bad for autocompletion).
Your thoughts?
you are right, okay then let's leave it like this.
hanxiao left a comment:
implement `available_protocols` on the top level
```python
for si, _val in zip(index, value):
    self[si] = _val  # leverage existing setter
```
Invalid types in the sequence should be caught rather than failing silently; in particular, things like `da[1.0, 2.0]` coming from np or torch might be a pitfall.
Suggested change:

```diff
     self[si] = _val  # leverage existing setter
+else:
+    raise IndexError(f"{index} should be either a sequence of bool, int or str")
```
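A self-contained sketch of the validation suggested above (the function name `assign_by_sequence` and the plain-dict store are illustrative stand-ins for docarray's internals): unsupported element types such as floats raise instead of silently assigning nothing.

```python
def assign_by_sequence(store, index, values):
    """Assign values at the given positions, validating index element types."""
    if all(isinstance(i, int) and not isinstance(i, bool) for i in index):
        for si, _val in zip(index, values):
            store[si] = _val  # leverage existing setter
    elif all(isinstance(i, str) for i in index):
        for si, _val in zip(index, values):
            store[si] = _val  # id-based path
    else:
        # floats sneaking in from np/torch tensors end up here
        raise IndexError(f'{index} should be either a sequence of bool, int or str')

store = {0: 'a', 1: 'b'}
assign_by_sequence(store, [0, 1], ['x', 'y'])  # int path: ok
try:
    assign_by_sequence(store, [1.0, 2.0], ['x', 'y'])  # floats: rejected
    caught = False
except IndexError:
    caught = True
```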
```python
if _a in ('tensor', 'embedding'):
    if _a == 'tensor':
        _docs.tensors = _v
    elif _a == 'embedding':
        _docs.embeddings = _v
    for _d in _docs:
        self._set_doc_by_id(_d.id, _d)
```
Suggested change:

```diff
-if _a in ('tensor', 'embedding'):
-    if _a == 'tensor':
-        _docs.tensors = _v
-    elif _a == 'embedding':
-        _docs.embeddings = _v
-    for _d in _docs:
-        self._set_doc_by_id(_d.id, _d)
+if _a == 'tensor':
+    _docs.tensors = _v
+elif _a == 'embedding':
+    _docs.embeddings = _v
+for _d in _docs:
+    self._set_doc_by_id(_d.id, _d)
```
```python
for _d in _docs:
    self._set_doc_by_id(_d.id, _d)
```

should be common to both branches