refactor: better management of stack mod by samsja · Pull Request #1008 · docarray/docarray

samsja · 2023-01-11T12:47:00Z

Signed-off-by: Sami Jaghouar [email protected]

Context:

To give some context, when a DocumentArray enter in stacked mode the tensor data from each document is moved to the column and the document does not own the data anymore.

What we were doing before this PR was to put None (or even recently to even delete the attribute) inside each document attributes data that has been moved to the DocumentArrayStacked columns, and only when accessing the DA stacked we were putting a view of the column in the data.

This cause several problems: this might lead to thread problem because we don't currently have any-lock to prevent access to the undefined data (still in the column but not in the document as a view) and the stack operation let the DocumentArray in a very weird state where it cannot really be use because the data is missing.

This PR change this behavior. Now the data of each document is replace with a view to the column. This mean the DocumentArray is the same after and before the stacking, just that the data in the document is just a view but everything will work the same.

Signed-off-by: Sami Jaghouar <[email protected]>

github-actions · 2023-01-11T13:29:41Z

📝 Docs are deployed on https://ft-refactor-better-stacked-management--jina-docs.netlify.app 🎉

JohannesMessner · 2023-01-11T15:00:10Z

docarray/array/array_stacked.py

                    columns[field_to_stack] = type_.__docarray_stack__(to_stack)  # type: ignore # noqa: E501

+        for field_name, column in columns.items():
+            for doc, val in zip(docs, column):


This is doing a second iteration over all documents, I find that a bit concerning from a performance perspective. Is there a way to eagerly assign ever docs view on the fly, in the loop above?

Not they are no way (that I can see at least).
First you collect the data, then you stack the data, then you put back the view in each document.

You cannot put back the view until you stack and you cannot stack until you collect everything. Therefore I think we always need two loop.

One thing though that is stupid in my implementation is that I should switch the two for loop line 170, 171

One thing though that is stupid in my implementation is that I should switch the two for loop line 170, 171

Actually I think that this makes zero difference

Okay I think there is a way to do only one loop.

We could:

pre-allocate the data for the tensor column (we can predict the shape and the type by looking at the first row of the da)

when collecting the actual data we could replace the data with a view to the pre alocated column

when stacking put the data into the preallocate column in place. Therefore each document has already a view to the good part of the column and we don't need iterate again

yeah probably makes no difference, but it is O(#columns x #documents) which does not sound great to me...

what do you mean by pre-allocate?

Here is the hacky idea that I had:

While doing the first loop over all the documents, we already know the index that every view of every document will have: it is the same index as the document has in the documentarray.
Thus we can modify its getattr directly there to index into the big tensor in the right place. Of course at this moment in time the big tensor doesnt exist yet, but that's not an issue, it will be created later.

what i dont like about this is the messing with getattr, but there is probably a cleaner way to follow the same general idea.
I think your idea is also along those lines?

What you are saying is basically what we did before

My idea is just to pre-alocate so that we know in advance where the column data will stay so that we can get the view (pointer kind of ) to the right place even before the actual data is created

tbh I think it is okay not to do the optimization now

I think we are basically talking about the same idea ^^ but let's leave it for now, as premature optimization is the root of all evil, as we all know ;)

refactor: better management of stack mod

0db7ef9

Signed-off-by: Sami Jaghouar <[email protected]>

github-actions bot added size/s area/core component/array labels Jan 11, 2023

fix: mypy

f9b4af6

Signed-off-by: Sami Jaghouar <[email protected]>

samsja requested review from Jackmin801 and anna-charlotte January 11, 2023 14:49

JohannesMessner requested changes Jan 11, 2023

View reviewed changes

JohannesMessner approved these changes Jan 12, 2023

View reviewed changes

samsja merged commit 1ac338f into feat-rewrite-v2 Jan 12, 2023

samsja deleted the refactor-better-stacked-management branch January 12, 2023 08:11

samsja mentioned this pull request Jan 13, 2023

refactor: optimization on stacking #1016

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: better management of stack mod#1008

refactor: better management of stack mod#1008
samsja merged 2 commits intofeat-rewrite-v2from
refactor-better-stacked-management

samsja commented Jan 11, 2023 •

edited

Loading

Uh oh!

github-actions bot commented Jan 11, 2023

Uh oh!

JohannesMessner Jan 11, 2023

Uh oh!

samsja Jan 11, 2023

Uh oh!

samsja Jan 11, 2023

Uh oh!

samsja Jan 11, 2023

Uh oh!

JohannesMessner Jan 11, 2023

Uh oh!

JohannesMessner Jan 11, 2023

Uh oh!

JohannesMessner Jan 11, 2023

Uh oh!

samsja Jan 11, 2023

Uh oh!

samsja Jan 11, 2023

Uh oh!

JohannesMessner Jan 12, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

samsja commented Jan 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context:

Uh oh!

github-actions bot commented Jan 11, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

samsja commented Jan 11, 2023 •

edited

Loading