Use multipart upload in S3 push pull backend #1230

@Jackmin801

Description

Currently, pushing a Document stream to S3 works fine.

```python
from docarray import DocumentArray
from docarray.documents import TextDoc

N: int = 2 ** 20
DocumentArray[TextDoc].push_stream((TextDoc(text=f'text {i}') for i in range(N)), url=f's3://da-pushpull/da-{N}', show_progress=True)
```

However, the upload does not use multipart upload. This is because smart_open, which we use to write to the S3 object, is very slow when uploading any part other than the first.

Multipart upload was therefore turned off so that the upload succeeds, but this makes the upload use more memory than necessary.
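For context, a minimal sketch of what a non-multipart write via smart_open could look like. This is an assumption about the current approach, not a copy of the actual implementation; `serialized_docs` is a hypothetical iterator of serialized Documents, and the object key is taken from the example above.

```python
# Sketch (assumption): writing via smart_open with multipart disabled.
# With multipart_upload=False, smart_open buffers the whole object in memory
# and uploads it in a single request, so memory grows with the stream size.
import smart_open

with smart_open.open(
    's3://da-pushpull/da-1048576',  # illustrative key, matching N = 2 ** 20 above
    'wb',
    transport_params={'multipart_upload': False},
) as fileobj:
    for doc_bytes in serialized_docs:  # hypothetical iterator of serialized Documents
        fileobj.write(doc_bytes)
```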

Describe the solution you'd like
Uploading a DocumentArray whose size exceeds the multipart part size should have constant memory usage,
i.e. uploading a 1 GB Document stream should use the same amount of memory as uploading a 2 GB Document stream.

Describe alternatives you've considered
Maybe we can implement the multipart upload ourselves?
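As a rough illustration of that alternative, here is a minimal sketch of a hand-rolled multipart upload with boto3 that keeps at most one part in memory at a time. The bucket/key, the part size, and the `chunks` iterable are placeholders, not part of the existing codebase.

```python
# Sketch (assumption): manual multipart upload with boto3,
# keeping memory bounded by the part size instead of the full object size.
import boto3

s3 = boto3.client('s3')
bucket, key = 'da-pushpull', 'da-1048576'  # placeholder bucket/key
PART_SIZE = 16 * 1024 * 1024               # placeholder part size (S3 minimum is 5 MiB)

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, buffer, part_number = [], bytearray(), 1
try:
    for chunk in chunks:  # hypothetical iterator of serialized Document bytes
        buffer.extend(chunk)
        if len(buffer) >= PART_SIZE:
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload['UploadId'],
                PartNumber=part_number, Body=bytes(buffer),
            )
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
            part_number += 1
            buffer.clear()
    if buffer:  # flush the final, possibly smaller, part
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload['UploadId'],
            PartNumber=part_number, Body=bytes(buffer),
        )
        parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts},
    )
except Exception:
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload['UploadId'])
    raise
```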

Additional context
The multipart upload works fine in local tests and CI, which use a minio container as the S3 endpoint.
However, it gets stuck when uploading to "the real" S3.
