Use multipart upload in S3 push pull backend #1230
Description
Currently, pushing a Document stream to S3 works fine:

```python
N: int = 2 ** 20
DocumentArray[TextDoc].push_stream(
    (TextDoc(text=f'text {i}') for i in range(N)),
    url=f's3://da-pushpull/da-{N}',
    show_progress=True,
)
```

However, the upload does not use multipart upload. This is because smart_open, which we use to write to the S3 object, becomes very slow when uploading any part after the first one.
Multipart upload was therefore turned off so that uploads succeed, but this causes the upload to use more memory than necessary.
Describe the solution you'd like
Uploading a DocumentArray larger than the multipart part size should have constant memory usage, i.e. uploading a 1 GB Document stream should use the same amount of memory as uploading a 2 GB one.
Describe alternatives you've considered
Maybe we can implement multipart uploading ourselves?
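A minimal sketch of the buffering logic such a hand-rolled implementation would need: re-chunk the incoming byte stream into fixed-size parts (S3 requires every part except the last to be at least 5 MiB), keeping only one part in memory at a time. The `iter_parts` helper and part size here are illustrative, not existing docarray code; a full implementation would feed each yielded part to boto3's `create_multipart_upload` / `upload_part` / `complete_multipart_upload` calls.

```python
from typing import Iterable, Iterator

# S3 requires every part except the last one to be at least 5 MiB.
MIN_PART_SIZE = 5 * 1024 * 1024


def iter_parts(
    chunks: Iterable[bytes], part_size: int = MIN_PART_SIZE
) -> Iterator[bytes]:
    """Re-chunk a byte stream into parts of ``part_size`` bytes.

    Only one part is buffered at a time, so memory usage stays
    constant regardless of the total stream length.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= part_size:
            yield bytes(buffer[:part_size])
            del buffer[:part_size]
    if buffer:  # the final part may be smaller than part_size
        yield bytes(buffer)
```

Since the serialized Document stream is already produced incrementally by `push_stream`, driving the upload from this generator would keep peak memory bounded by one part, which is exactly the constant-memory behavior described above.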
Additional context
Multipart upload works fine in local tests and CI, which use a MinIO container as the S3 endpoint.
However, it gets stuck when uploading to "the real" S3.