
refactor: better management of stack mod#1008

Merged
samsja merged 2 commits into feat-rewrite-v2 from refactor-better-stacked-management
Jan 12, 2023

Conversation

@samsja (Member) commented Jan 11, 2023

Signed-off-by: Sami Jaghouar [email protected]

Context:

To give some context: when a DocumentArray enters stacked mode, the tensor data of each document is moved into a column, and the document no longer owns the data.

Before this PR, we set each moved attribute to None inside the document (or, more recently, even deleted the attribute), and only when accessing the stacked DA did we put a view of the column back into the document's data.

This causes several problems: it can lead to threading issues, because we currently have no lock preventing access to the undefined data (still in the column but no longer in the document as a view), and the stack operation leaves the DocumentArray in a strange state where it cannot really be used because its data is missing.

This PR changes that behavior. Now the data of each document is replaced with a view into the column. This means the DocumentArray is the same before and after stacking; the data in each document is just a view, but everything works the same.
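A minimal sketch of the new behaviour, using plain dicts and NumPy rather than the actual docarray classes: after stacking, each "document" holds a view into the shared column, so reads and writes go through to the same memory.

```python
import numpy as np

docs = [{"tensor": np.ones(3)}, {"tensor": np.zeros(3)}]

# first pass: collect and stack the per-document tensors into one column
column = np.stack([d["tensor"] for d in docs])

# second pass: replace each document's tensor with a view into the column
for i, d in enumerate(docs):
    d["tensor"] = column[i]  # a view, not a copy

# mutating the column is visible through the document's view
column[0, :] = 5.0
assert docs[0]["tensor"][0] == 5.0
```

Unlike the old behaviour, the documents never hold None here: accessing `docs[0]["tensor"]` between or after the two passes always yields valid data.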

Signed-off-by: Sami Jaghouar <[email protected]>
@github-actions

📝 Docs are deployed on https://ft-refactor-better-stacked-management--jina-docs.netlify.app 🎉

columns[field_to_stack] = type_.__docarray_stack__(to_stack)  # type: ignore # noqa: E501

for field_name, column in columns.items():
    for doc, val in zip(docs, column):
        setattr(doc, field_name, val)  # put a view of the column back into each doc
Member:

This does a second iteration over all documents, which I find a bit concerning from a performance perspective. Is there a way to eagerly assign every doc's view on the fly, in the loop above?

Member (Author):

No, there is no way (that I can see, at least).
First you collect the data, then you stack the data, then you put the view back into each document.

You cannot put the view back until you stack, and you cannot stack until you have collected everything. Therefore I think we always need two loops.

One thing that is silly in my implementation, though, is that I should swap the two for loops on lines 170, 171.

Member (Author):

One thing that is silly in my implementation, though, is that I should swap the two for loops on lines 170, 171.

Actually, I think that makes zero difference.

Member (Author):

Okay, I think there is a way to do only one loop.

We could:

  • pre-allocate the data for the tensor column (we can predict the shape and the dtype by looking at the first row of the DA)
  • when collecting the actual data, replace each document's data with a view into the pre-allocated column
  • when stacking, put the data into the pre-allocated column in place; each document then already has a view into the right part of the column, and we don't need to iterate again
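The single-loop idea above can be sketched like this (again with plain dicts and NumPy, not the actual docarray code): pre-allocate the column from the first row's shape and dtype, then fill it and hand out views in the same pass.

```python
import numpy as np

docs = [{"tensor": np.full(3, i, dtype=np.float64)} for i in range(4)]

# pre-allocate the column: shape and dtype are predicted from the first row
first = docs[0]["tensor"]
column = np.empty((len(docs),) + first.shape, dtype=first.dtype)

# single pass: copy the data into the column in place and hand out the view
for i, d in enumerate(docs):
    column[i] = d["tensor"]   # "stack" this row into the pre-allocated column
    d["tensor"] = column[i]   # the document now holds a view; no second loop

assert np.shares_memory(docs[2]["tensor"], column)
```

The trade-off is that this only works when every row has the same shape and dtype as the first one, which is exactly what pre-allocation assumes.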

Member:

Yeah, it probably makes no difference, but it is O(#columns × #documents), which does not sound great to me...

Member:

What do you mean by pre-allocate?

Member:

Here is the hacky idea that I had:

While doing the first loop over all the documents, we already know the index that every document's view will have: it is the same index the document has in the DocumentArray.
Thus we can modify its __getattr__ right there to index into the big tensor in the right place. Of course, at that moment the big tensor doesn't exist yet, but that's not an issue; it will be created later.

What I don't like about this is the messing with __getattr__, but there is probably a cleaner way to follow the same general idea.
I think your idea is also along those lines?
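A rough sketch of this __getattr__ idea (the class name is hypothetical; this is not the docarray implementation): each document stores only its row index, and attribute access resolves into the shared column lazily, so the big tensor can be created after the views are handed out.

```python
import numpy as np

class LazyViewDoc:
    """Document stand-in whose attributes resolve lazily into shared columns."""

    def __init__(self, storage, index):
        self._storage = storage  # dict mapping field name -> column tensor
        self._index = index      # this doc's row index in the DocumentArray

    def __getattr__(self, name):
        # only called when normal attribute lookup fails:
        # index into the shared column at access time
        return self._storage[name][self._index]

storage = {}
docs = [LazyViewDoc(storage, i) for i in range(3)]

# the big tensor is created *after* the docs already exist
storage["tensor"] = np.arange(9).reshape(3, 3)
```

For example, `docs[1].tensor` resolves to row 1 of the column. The downside the reviewer points out applies here too: accessing an attribute before its column exists raises a KeyError, and overriding attribute lookup makes the class harder to reason about.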

Member (Author):

What you are saying is basically what we did before.

My idea is just to pre-allocate, so that we know in advance where the column data will live and can get the view (a pointer, kind of) to the right place even before the actual data is created.

Member (Author):

TBH, I think it is okay not to do this optimization now.

Member:

I think we are basically talking about the same idea ^^ but let's leave it for now, as premature optimization is the root of all evil, as we all know ;)

@samsja samsja merged commit 1ac338f into feat-rewrite-v2 Jan 12, 2023
@samsja samsja deleted the refactor-better-stacked-management branch January 12, 2023 08:11