Conversation
docarray/array/mixins/io/binary.py
Outdated
| while True: | ||
| b = current_bytes + fp.read(500) | ||
| if delimiter is None: | ||
| _len = len(random_uuid().bytes) |
There was a problem hiding this comment.
this length can be computed only once
There was a problem hiding this comment.
I think for efficiency and mantainability, this method should rely on using load_binary from DocumentArray itself
There was a problem hiding this comment.
if delimiter is None:
_len = len(random_uuid().bytes)
this should be move out of the loop and the first _len chars are always the delimiter.
- make sure to cover
compressas otherwise the usage is limited. Note, all compressions I used there they already implemented streamed file handler, you justopenit, check Python official docs & lz4 docs for details. Refer to other part of the code on how I implement compression. - Agree with Joan,
load_binary(..., stream: bool = False) -> Union[DocumentArray, Generator[DocumentArray, ...]]is better. Note my type hint syntax onGeneratormay not be correct.
| def load_binary_stream( | ||
| cls: Type['T'], | ||
| file: Union[str, BinaryIO, bytes], | ||
| def _load_binary_stream( |
There was a problem hiding this comment.
I thin _load_binary_stream should leverage _load_binary_all by just giving a subset of bytes. I think like this we can leverage optimizations from both sides
|
fyi @tadejsv |
|
@davidbp continues working on this pr |
hanxiao
left a comment
There was a problem hiding this comment.
hold the progress, @alaeddine-13 will implement as https://jinaai.slack.com/archives/C02AC4T8Y5T/p1642243391185600?thread_ts=1642136195.148700&cid=C02AC4T8Y5T
Codecov Report
@@ Coverage Diff @@
## main #46 +/- ##
===========================================
- Coverage 82.58% 57.95% -24.64%
===========================================
Files 67 70 +3
Lines 3228 3377 +149
===========================================
- Hits 2666 1957 -709
- Misses 562 1420 +858
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
hanxiao
left a comment
There was a problem hiding this comment.
there is no need cls._load_binary_all, a simple DocumenArray(generator) will do the work.
…into feat-read-stream
fec2726 to
d7ae1a5
Compare
d7ae1a5 to
fec2726
Compare
|
Thanks for your contribution ❤️ Note, other CI tests will not start until the commit messages get fixed. This message will be deleted automatically when the commit messages get fixed. |
|
📝 Docs are deployed on https://ft-feat-read-stream--jina-docs.netlify.app 🎉 |
|
closing because commit history is messed up and it changed the original idea that we started with florian. New PR in #62 |
Currently, when reading from bytes, we load everything into memory. Instead, we should just read the docs as stream.
(pr together with @davidbp)