Merged
19 changes: 5 additions & 14 deletions docarray/array/array_stacked.py
@@ -152,7 +152,6 @@ def _create_columns(
if is_tensor_union(type_):
val = tensor_type.get_comp_backend().none_value()
columns_to_stack[field_to_stack].append(val)
delattr(doc, field_to_stack)

for field_to_stack, to_stack in columns_to_stack.items():

@@ -168,6 +167,10 @@ def _create_columns(
elif issubclass(type_, AbstractTensor):
columns[field_to_stack] = type_.__docarray_stack__(to_stack) # type: ignore # noqa: E501

for field_name, column in columns.items():
for doc, val in zip(docs, column):
Member

This is doing a second iteration over all documents, which I find a bit concerning from a performance perspective. Is there a way to eagerly assign every doc's view on the fly, in the loop above?

Member Author

No, there is no way (that I can see, at least).
First you collect the data, then you stack the data, then you put the view back in each document.

You cannot put back the view until you stack, and you cannot stack until you collect everything. Therefore I think we always need two loops.

One thing that is suboptimal in my implementation, though, is that I should swap the two for loops on lines 170 and 171.
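
The collect → stack → assign-back flow can be sketched like this (a minimal illustration, not the DocArray implementation: plain dicts stand in for documents and numpy for the tensor backend):

```python
import numpy as np

# Stand-ins for documents: plain dicts with one tensor field.
docs = [{"tensor": np.full(3, float(i))} for i in range(4)]

# Pass 1: collect the per-document tensors into a list.
to_stack = [doc["tensor"] for doc in docs]

# Stack them into one contiguous column; only now does the column exist.
column = np.stack(to_stack)

# Pass 2: give each document back a view into its row of the column.
for doc, row in zip(docs, column):
    doc["tensor"] = row
```

The second pass cannot be merged into the first precisely because `column` does not exist until `np.stack` runs.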

Member Author


> One thing that is suboptimal in my implementation, though, is that I should swap the two for loops on lines 170 and 171.

Actually, I think this makes zero difference.

Member Author


Okay, I think there is a way to do only one loop.

We could:

  • pre-allocate the data for the tensor column (we can predict the shape and the type by looking at the first row of the da)
  • when collecting the actual data, replace it with a view into the pre-allocated column
  • when stacking, put the data into the pre-allocated column in place; each document then already has a view into the right part of the column, and we don't need to iterate again
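
A rough sketch of this pre-allocation idea (assumptions: dicts stand in for documents, numpy for the backend; the shape/dtype inference and names are illustrative, not DocArray API):

```python
import numpy as np

docs = [{"tensor": np.full(3, float(i))} for i in range(4)]

# Predict shape and dtype from the first row and pre-allocate the column.
first = docs[0]["tensor"]
column = np.empty((len(docs), *first.shape), dtype=first.dtype)

# Single pass: stack in place and hand out the view immediately.
for i, doc in enumerate(docs):
    column[i] = doc["tensor"]   # copy the data into its row
    doc["tensor"] = column[i]   # the doc now views the right slice
```

This assumes all rows share one shape and dtype, which the first-row inference implicitly relies on.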

Member


Yeah, it probably makes no difference, but it is O(#columns × #documents), which does not sound great to me...

Member


What do you mean by pre-allocate?

Member


Here is the hacky idea that I had:

While doing the first loop over all the documents, we already know the index that every view of every document will have: it is the same index as the document has in the DocumentArray. Thus we can modify its getattr directly there to index into the big tensor in the right place. Of course, at that moment the big tensor doesn't exist yet, but that's not an issue; it will be created later.

What I don't like about this is the messing with getattr, but there is probably a cleaner way to follow the same general idea.
I think your idea is also along those lines?
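
The lazy-resolution idea could look roughly like this (all names here are hypothetical, for illustration only — the real approach would hook into the document's attribute access instead of an explicit `resolve()`):

```python
import numpy as np

class _RowView:
    """Placeholder that resolves to row `index` of the owner's column
    only when accessed -- the column may not exist yet at wiring time."""
    def __init__(self, owner, index):
        self._owner, self._index = owner, index

    def resolve(self):
        return self._owner.column[self._index]

class _Columns:
    def __init__(self):
        self.column = None  # the big tensor is created later

owner = _Columns()
# Single loop over the documents: wire each view by its index now.
views = [_RowView(owner, i) for i in range(4)]

# The stacked column is created afterwards; the views become valid.
owner.column = np.stack([np.full(3, float(i)) for i in range(4)])
```

The views are wired before the column exists, so no second pass over the documents is needed.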

Member Author


What you are saying is basically what we did before.

My idea is just to pre-allocate, so that we know in advance where the column data will live and can get the view (a kind of pointer) to the right place even before the actual data is created.

Member Author


Tbh, I think it is okay not to do the optimization now.

Member


I think we are basically talking about the same idea ^^ but let's leave it for now, as premature optimization is the root of all evil, as we all know ;)

setattr(doc, field_name, val)

return columns

def _get_array_attribute(
@@ -224,18 +227,6 @@ def __iter__(self):
for i in range(len(self)):
yield self[i]

def __getitem_without_columns__(self, item): # note this should handle slices
"""Return the document at the given index with the columns item put to None"""
doc = self._docs[item]
for field in self._columns.keys():
if hasattr(doc, field):
delattr(doc, field)
return doc

def __iter_without_columns__(self):
for i in range(len(self)):
yield self.__getitem_without_columns__(i)

def __len__(self):
return len(self._docs)

@@ -262,7 +253,7 @@ def to_protobuf(self) -> 'DocumentArrayStackedProto':
)

da_proto = DocumentArrayProto()
for doc in self.__iter_without_columns__():
for doc in self:
da_proto.docs.append(doc.to_protobuf())

columns_proto: Dict[str, UnionArrayProto] = dict()