refactor: better management of stack mod#1008
Conversation
Signed-off-by: Sami Jaghouar <[email protected]>
Signed-off-by: Sami Jaghouar <[email protected]>
|
📝 Docs are deployed on https://ft-refactor-better-stacked-management--jina-docs.netlify.app 🎉 |
| columns[field_to_stack] = type_.__docarray_stack__(to_stack) # type: ignore # noqa: E501 | ||
|
|
||
| for field_name, column in columns.items(): | ||
| for doc, val in zip(docs, column): |
There was a problem hiding this comment.
This is doing a second iteration over all documents, I find that a bit concerning from a performance perspective. Is there a way to eagerly assign ever docs view on the fly, in the loop above?
There was a problem hiding this comment.
Not they are no way (that I can see at least).
First you collect the data, then you stack the data, then you put back the view in each document.
You cannot put back the view until you stack and you cannot stack until you collect everything. Therefore I think we always need two loop.
One thing though that is stupid in my implementation is that I should switch the two for loop line 170, 171
There was a problem hiding this comment.
One thing though that is stupid in my implementation is that I should switch the two for loop line 170, 171
Actually I think that this makes zero difference
There was a problem hiding this comment.
Okay I think there is a way to do only one loop.
We could:
- pre-allocate the data for the tensor column (we can predict the shape and the type by looking at the first row of the da)
- when collecting the actual data we could replace the data with a view to the pre alocated column
- when stacking put the data into the preallocate column in place. Therefore each document has already a view to the good part of the column and we don't need iterate again
There was a problem hiding this comment.
yeah probably makes no difference, but it is O(#columns x #documents) which does not sound great to me...
There was a problem hiding this comment.
what do you mean by pre-allocate?
There was a problem hiding this comment.
Here is the hacky idea that I had:
While doing the first loop over all the documents, we already know the index that every view of every document will have: it is the same index as the document has in the documentarray.
Thus we can modify its getattr directly there to index into the big tensor in the right place. Of course at this moment in time the big tensor doesnt exist yet, but that's not an issue, it will be created later.
what i dont like about this is the messing with getattr, but there is probably a cleaner way to follow the same general idea.
I think your idea is also along those lines?
There was a problem hiding this comment.
What you are saying is basically what we did before
My idea is just to pre-alocate so that we know in advance where the column data will stay so that we can get the view (pointer kind of ) to the right place even before the actual data is created
There was a problem hiding this comment.
tbh I think it is okay not to do the optimization now
There was a problem hiding this comment.
I think we are basically talking about the same idea ^^ but let's leave it for now, as premature optimization is the root of all evil, as we all know ;)
Signed-off-by: Sami Jaghouar [email protected]
Context:
To give some context, when a DocumentArray enter in stacked mode the tensor data from each document is moved to the column and the document does not own the data anymore.
What we were doing before this PR was to put None (or even recently to even delete the attribute) inside each document attributes data that has been moved to the DocumentArrayStacked columns, and only when accessing the DA stacked we were putting a view of the column in the data.
This cause several problems: this might lead to thread problem because we don't currently have any-lock to prevent access to the undefined data (still in the column but not in the document as a view) and the
stackoperation let the DocumentArray in a very weird state where it cannot really be use because the data is missing.This PR change this behavior. Now the data of each document is replace with a view to the column. This mean the DocumentArray is the same after and before the stacking, just that the data in the document is just a view but everything will work the same.