Skip to content

feat: read stream#46

Closed
florian-hoenicke wants to merge 31 commits intomainfrom
feat-read-stream
Closed

feat: read stream#46
florian-hoenicke wants to merge 31 commits intomainfrom
feat-read-stream

Conversation

@florian-hoenicke
Copy link
Copy Markdown

@florian-hoenicke florian-hoenicke commented Jan 13, 2022

Currently, when reading from bytes, we load everything into memory. Instead, we should just read the docs as stream.
(pr together with @davidbp)

while True:
b = current_bytes + fp.read(500)
if delimiter is None:
_len = len(random_uuid().bytes)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this length can be computed only once

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for efficiency and mantainability, this method should rely on using load_binary from DocumentArray itself

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if delimiter is None:
                    _len = len(random_uuid().bytes)

this should be move out of the loop and the first _len chars are always the delimiter.

  1. make sure to cover compress as otherwise the usage is limited. Note, all compressions I used there they already implemented streamed file handler, you just open it, check Python official docs & lz4 docs for details. Refer to other part of the code on how I implement compression.
  2. Agree with Joan, load_binary(..., stream: bool = False) -> Union[DocumentArray, Generator[DocumentArray, ...]] is better. Note my type hint syntax on Generator may not be correct.

def load_binary_stream(
cls: Type['T'],
file: Union[str, BinaryIO, bytes],
def _load_binary_stream(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thin _load_binary_stream should leverage _load_binary_all by just giving a subset of bytes. I think like this we can leverage optimizations from both sides

@florian-hoenicke
Copy link
Copy Markdown
Author

fyi @tadejsv

@florian-hoenicke
Copy link
Copy Markdown
Author

@davidbp continues working on this pr

@florian-hoenicke florian-hoenicke removed their assignment Jan 14, 2022
Copy link
Copy Markdown
Member

@hanxiao hanxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 17, 2022

Codecov Report

Merging #46 (1827ff0) into main (107d227) will decrease coverage by 24.63%.
The diff coverage is 13.04%.

❗ Current head 1827ff0 differs from pull request most recent head fec2726. Consider uploading reports for the commit fec2726 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##             main      #46       +/-   ##
===========================================
- Coverage   82.58%   57.95%   -24.64%     
===========================================
  Files          67       70        +3     
  Lines        3228     3377      +149     
===========================================
- Hits         2666     1957      -709     
- Misses        562     1420      +858     
Flag Coverage Δ
docarray 57.95% <13.04%> (-24.64%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
docarray/array/mixins/io/binary.py 22.22% <11.11%> (-73.34%) ⬇️
docarray/__init__.py 100.00% <100.00%> (ø)
docarray/math/evaluation.py 0.00% <0.00%> (-82.15%) ⬇️
docarray/array/mixins/embed.py 15.15% <0.00%> (-75.76%) ⬇️
docarray/array/mixins/evaluation.py 20.00% <0.00%> (-68.58%) ⬇️
docarray/array/mixins/reduce.py 28.57% <0.00%> (-67.86%) ⬇️
docarray/math/distance/paddle.py 0.00% <0.00%> (-66.67%) ⬇️
docarray/array/mixins/io/pushpull.py 28.75% <0.00%> (-66.25%) ⬇️
docarray/array/mixins/io/csv.py 25.00% <0.00%> (-63.89%) ⬇️
docarray/math/distance/tensorflow.py 0.00% <0.00%> (-61.91%) ⬇️
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d764341...fec2726. Read the comment docs.

Copy link
Copy Markdown
Member

@hanxiao hanxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no need cls._load_binary_all, a simple DocumenArray(generator) will do the work.

@github-actions
Copy link
Copy Markdown

Thanks for your contribution ❤️
💔 Unfortunately, this PR has one ore more bad commit messages, it can not be merged. To fix this problem, please refer to:

Note, other CI tests will not start until the commit messages get fixed.

This message will be deleted automatically when the commit messages get fixed.

@github-actions
Copy link
Copy Markdown

📝 Docs are deployed on https://ft-feat-read-stream--jina-docs.netlify.app 🎉

@davidbp
Copy link
Copy Markdown
Contributor

davidbp commented Jan 18, 2022

closing because commit history is messed up and it changed the original idea that we started with florian. New PR in #62

@davidbp davidbp closed this Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants