feat: read stream by florian-hoenicke · Pull Request #46 · docarray/docarray

florian-hoenicke · 2022-01-13T18:23:18Z

Currently, when reading from bytes, we load everything into memory. Instead, we should read the docs as a stream.
(PR together with @davidbp)
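
To make the contrast concrete, here is a minimal sketch of reading records lazily rather than loading the whole payload first. It is not the docarray implementation: it assumes a hypothetical format in which every serialized record is preceded by a 4-byte big-endian length, and the names `write_records` and `iter_records` are invented for this example.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator


def write_records(fp: BinaryIO, records: Iterable[bytes]) -> None:
    """Write each record preceded by a 4-byte big-endian length (assumed format)."""
    for record in records:
        fp.write(struct.pack('>I', len(record)))
        fp.write(record)


def iter_records(fp: BinaryIO) -> Iterator[bytes]:
    """Yield one record at a time; only the current record is held in memory."""
    while True:
        header = fp.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError('truncated length header')
        (size,) = struct.unpack('>I', header)
        payload = fp.read(size)
        if len(payload) < size:
            raise ValueError('truncated record')
        yield payload


buffer = io.BytesIO()
write_records(buffer, [b'doc-1', b'', b'doc-3'])
buffer.seek(0)
records = list(iter_records(buffer))
```

The sketch relies on `read(n)` returning fewer than `n` bytes only at end of file, which holds for `io.BytesIO` and buffered file objects; a raw socket would need a read loop.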

JoanFM · 2022-01-13T18:32:02Z

docarray/array/mixins/io/binary.py

+            while True:
+                b = current_bytes + fp.read(500)
+                if delimiter is None:
+                    _len = len(random_uuid().bytes)


This length can be computed only once.

I think that, for efficiency and maintainability, this method should rely on load_binary from DocumentArray itself.

hanxiao · 2022-01-13

if delimiter is None: _len = len(random_uuid().bytes)

This should be moved out of the loop; the first _len bytes are always the delimiter.
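
The point about hoisting the length out of the loop can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: it assumes the stream is laid out as the delimiter followed by the payloads joined by that delimiter, and `iter_delimited` is a hypothetical name. The standard-library `uuid` module stands in for `random_uuid()`.

```python
import io
import uuid
from typing import BinaryIO, Iterator

# A UUID is always 16 bytes, so this length is computed once, not per iteration.
DELIMITER_LEN = len(uuid.uuid4().bytes)


def iter_delimited(fp: BinaryIO, chunk_size: int = 500) -> Iterator[bytes]:
    """Yield payloads from a stream laid out as delimiter + delimiter.join(payloads).

    The layout and the function name are assumptions made for this example.
    """
    delimiter = fp.read(DELIMITER_LEN)  # the first bytes are always the delimiter
    if len(delimiter) < DELIMITER_LEN:
        return  # empty or truncated stream
    pending = b''
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Every piece except the last is complete. The last one may be cut
        # mid-payload or mid-delimiter, so it is carried into the next round.
        *complete, pending = pending.split(delimiter)
        yield from complete
    yield pending  # the final payload has no trailing delimiter


# Fixed delimiter so the example is reproducible; real code would use a random UUID.
delimiter = uuid.UUID(int=1).bytes
payloads = [b'first doc', b'second doc', b'third doc']
stream = io.BytesIO(delimiter + delimiter.join(payloads))
decoded = list(iter_delimited(stream, chunk_size=8))
```

Carrying the last piece forward is what makes a delimiter that straddles two reads safe: it is only recognized once both halves are in `pending`. A stream that contains the delimiter and nothing else yields a single empty payload, mirroring what `bytes.split` returns.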

Make sure to cover compress; otherwise the usage is limited. Note that all the compression algorithms I used there already implement a streamed file handler: you just open it. Check the official Python docs and the lz4 docs for details, and refer to the other parts of the code to see how I implemented compression.
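
As a rough illustration of "you just open it", the sketch below uses only standard-library codecs, whose `open()` functions return file objects that (de)compress incrementally. The helper names `open_maybe_compressed` and `streamed_size` are hypothetical, and the set of algorithms shown is an assumption rather than the list docarray supports; the third-party `lz4.frame.open` is understood to offer a similar interface but is left out to keep the example dependency-free.

```python
import bz2
import gzip
import lzma
import os
import tempfile
from typing import IO, Optional

# Standard-library modules whose open() returns a streaming file object.
_OPENERS = {'gzip': gzip.open, 'bz2': bz2.open, 'lzma': lzma.open}


def open_maybe_compressed(
    path: str, mode: str = 'rb', compress: Optional[str] = None
) -> IO[bytes]:
    """Open `path` in binary mode, (de)compressing on the fly if requested."""
    if compress is None:
        return open(path, mode)
    if compress not in _OPENERS:
        raise ValueError(f'unsupported compression: {compress!r}')
    return _OPENERS[compress](path, mode)


def streamed_size(compress: Optional[str]) -> int:
    """Write 10,000 bytes, then count them back using 500-byte reads."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'docs.bin')
        with open_maybe_compressed(path, 'wb', compress) as fp:
            fp.write(b'x' * 10_000)
        with open_maybe_compressed(path, 'rb', compress) as fp:
            # The reader never needs the whole decompressed payload at once.
            return sum(len(chunk) for chunk in iter(lambda: fp.read(500), b''))
```

Because each opener returns an object with the usual `read` method, a streaming loader written against a plain file object works unchanged on compressed input.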

Agree with Joan: load_binary(..., stream: bool = False) -> Union[DocumentArray, Generator[DocumentArray, ...]] is better. Note that my type hint syntax for Generator may not be correct.
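
On the type hint: `typing.Generator` takes three parameters, `Generator[YieldType, SendType, ReturnType]`, so a generator that only yields is written with `None` for the last two. The sketch below shows the shape of such a signature on a toy function; `load_items` is a hypothetical stand-in, not the docarray `load_binary`.

```python
from typing import Generator, List, Union


def load_items(
    raw: List[bytes], stream: bool = False
) -> Union[List[str], Generator[str, None, None]]:
    """Return a fully built list, or a lazy generator when stream=True."""
    decoded = (item.decode('utf-8') for item in raw)  # nothing is decoded yet
    return decoded if stream else list(decoded)


eager = load_items([b'a', b'b'])
lazy = load_items([b'a', b'b'], stream=True)
first = next(lazy)  # items are produced only on demand
```

When the send and return types are not needed, `Iterator[str]` is a shorter annotation that also type-checks for a generator.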

JoanFM · 2022-01-14T08:37:38Z

docarray/array/mixins/io/binary.py

-    def load_binary_stream(
-        cls: Type['T'],
-        file: Union[str, BinaryIO, bytes],
+    def _load_binary_stream(


I think _load_binary_stream should leverage _load_binary_all by just giving it a subset of the bytes. That way we can leverage optimizations from both sides.
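
One way to read this suggestion is that the streaming path only decides where a complete slice of bytes ends and then hands that slice to the eager parser. The sketch below shows the idea on newline-separated records; `parse_all` and `parse_stream` are hypothetical stand-ins for `_load_binary_all` and `_load_binary_stream`, and the record format is an assumption chosen for brevity.

```python
import io
from typing import BinaryIO, Iterator, List


def parse_all(data: bytes) -> List[str]:
    """Eager parser: decode every non-empty newline-separated record in `data`."""
    return [line.decode('utf-8') for line in data.split(b'\n') if line]


def parse_stream(fp: BinaryIO, chunk_size: int = 64) -> Iterator[str]:
    """Streaming parser that delegates each complete slice to parse_all."""
    pending = b''
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        cut = pending.rfind(b'\n') + 1  # end of the last complete record, or 0
        if cut:
            yield from parse_all(pending[:cut])
            pending = pending[cut:]
    if pending:
        yield from parse_all(pending)  # trailing record without a newline


parsed = list(parse_stream(io.BytesIO(b'alpha\nbeta\ngamma'), chunk_size=4))
```

With this split, any optimization made in the eager parser benefits the streaming path as well, at the cost of buffering one chunk's worth of records at a time.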

florian-hoenicke · 2022-01-14T09:04:56Z

fyi @tadejsv

florian-hoenicke · 2022-01-14T10:55:07Z

@davidbp continues working on this PR

hanxiao

Hold the progress; @alaeddine-13 will implement this as discussed in https://jinaai.slack.com/archives/C02AC4T8Y5T/p1642243391185600?thread_ts=1642136195.148700&cid=C02AC4T8Y5T


codecov · 2022-01-17T07:25:20Z

Codecov Report

Merging #46 (1827ff0) into main (107d227) will decrease coverage by 24.63%.
The diff coverage is 13.04%.

❗ Current head 1827ff0 differs from pull request most recent head fec2726. Consider uploading reports for the commit fec2726 to get more accurate results

@@             Coverage Diff             @@
##             main      #46       +/-   ##
===========================================
- Coverage   82.58%   57.95%   -24.64%     
===========================================
  Files          67       70        +3     
  Lines        3228     3377      +149     
===========================================
- Hits         2666     1957      -709     
- Misses        562     1420      +858

| Flag | Coverage Δ | |
|---|---|---|
| docarray | `57.95% <13.04%> (-24.64%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| docarray/array/mixins/io/binary.py | `22.22% <11.11%> (-73.34%)` | ⬇️ |
| docarray/__init__.py | `100.00% <100.00%> (ø)` | |
| docarray/math/evaluation.py | `0.00% <0.00%> (-82.15%)` | ⬇️ |
| docarray/array/mixins/embed.py | `15.15% <0.00%> (-75.76%)` | ⬇️ |
| docarray/array/mixins/evaluation.py | `20.00% <0.00%> (-68.58%)` | ⬇️ |
| docarray/array/mixins/reduce.py | `28.57% <0.00%> (-67.86%)` | ⬇️ |
| docarray/math/distance/paddle.py | `0.00% <0.00%> (-66.67%)` | ⬇️ |
| docarray/array/mixins/io/pushpull.py | `28.75% <0.00%> (-66.25%)` | ⬇️ |
| docarray/array/mixins/io/csv.py | `25.00% <0.00%> (-63.89%)` | ⬇️ |
| docarray/math/distance/tensorflow.py | `0.00% <0.00%> (-61.91%)` | ⬇️ |
| ... and 36 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d764341...fec2726.

hanxiao

There is no need for cls._load_binary_all; a simple DocumentArray(generator) will do the work.
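
The pattern described here, where the eager loader is just the container built from the streaming loader, can be sketched with a toy container. `DocList`, `load_stream`, and `load_all` are hypothetical names used only for this illustration; `DocList` stands in for any container that accepts an iterable, as the comment assumes DocumentArray does.

```python
from typing import Iterable, Iterator, List, Optional


class DocList:
    """Hypothetical stand-in for a container that can be built from any iterable."""

    def __init__(self, docs: Optional[Iterable[str]] = None) -> None:
        self._docs: List[str] = list(docs) if docs is not None else []

    def __len__(self) -> int:
        return len(self._docs)

    def __iter__(self) -> Iterator[str]:
        return iter(self._docs)


def load_stream(raw: Iterable[bytes]) -> Iterator[str]:
    """Streaming loader: decode one item at a time."""
    for item in raw:
        yield item.decode('utf-8')


def load_all(raw: Iterable[bytes]) -> DocList:
    """Eager loader: consume the streaming loader into the container."""
    return DocList(load_stream(raw))


docs = load_all([b'a', b'b', b'c'])
```

This keeps a single decoding code path, so the eager and streaming loaders cannot drift apart.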

docarray/array/mixins/io/binary.py


github-actions · 2022-01-18T15:25:15Z

Thanks for your contribution ❤️
💔 Unfortunately, this PR has one or more bad commit messages, so it cannot be merged. To fix this problem, please refer to:

Note, other CI tests will not start until the commit messages get fixed.

This message will be deleted automatically when the commit messages get fixed.

github-actions · 2022-01-18T15:27:34Z

📝 Docs are deployed on https://ft-feat-read-stream--jina-docs.netlify.app 🎉

davidbp · 2022-01-18T15:30:48Z

Closing because the commit history is messed up and the PR has drifted from the original idea that we started with Florian. New PR in #62.

feat: initial draft

3c1fbc7

florian-hoenicke assigned davidbp and florian-hoenicke Jan 13, 2022

github-actions bot added size/s area/core component/array labels Jan 13, 2022

JoanFM reviewed Jan 13, 2022


florian-hoenicke added 3 commits January 13, 2022 20:12

feat: constant delimiter len

5403b03

fix: offset error

b6a6491

test: load stream

c30138f

github-actions bot added size/m area/testing and removed size/s labels Jan 14, 2022

JoanFM reviewed Jan 14, 2022


florian-hoenicke removed their assignment Jan 14, 2022

hanxiao requested changes Jan 15, 2022


davidbp added 3 commits January 17, 2022 08:16

tests: add protocol test to stream load

715f510

test: add protocol test to stream load

281786d

Merge branch 'feat-read-stream' of https://github.com/jina-ai/docarray …

e4d72df

…into feat-read-stream

hanxiao requested changes Jan 17, 2022


docs: v1 serialization format

1827ff0

github-actions bot added the area/docs label Jan 17, 2022

davidbp added 3 commits January 17, 2022 19:19

refactor: to bytes v1 serialization

1400933

refactor: load v1 serialization protocol

c8c1024

refactor: load binary stream

0cac456

numb3r3 reviewed Jan 18, 2022


docarray/array/mixins/io/binary.py (resolved)

refactor: load binary fix

b7f12de

davidbp added 6 commits January 18, 2022 09:07

update: test with new serialization

f066282

test: fix stream test

a6689d1

test: fix stream test

d182852

test: fix stream test

927e5cb

test: add complete test with all compression

189f3b5

Merge branch 'feat-read-stream' of https://github.com/jina-ai/docarray …

4a91d93

…into feat-read-stream

github-actions bot added the component/document label Jan 18, 2022

davidbp added 13 commits January 18, 2022 15:17

docs: v1 serialization format

6e6ec5a

refactor: to bytes v1 serialization

9a799a8

refactor: load v1 serialization protocol

57f769b

refactor: load binary stream

f7fe6e7

refactor: load binary fix

e7207aa

update: test with new serialization

1c73390

test: fix stream test

f60d13a

test: fix stream test

780b32c

test: add complete test with all compression

30266be

Merge branch 'feat-read-stream' of https://github.com/jina-ai/docarray …

fc21e74

…into feat-read-stream

refactor: remove prints

f593b73

refactor: fix black

6ca89d6

refactor: remove batches

fec2726

davidbp force-pushed the feat-read-stream branch from fec2726 to d7ae1a5 Compare January 18, 2022 15:23

alaeddine-13 force-pushed the feat-read-stream branch from d7ae1a5 to fec2726 Compare January 18, 2022 15:24

davidbp closed this Jan 18, 2022

florian-hoenicke wants to merge 31 commits into main from feat-read-stream


5 participants
