bug: _ipython_display_ stacking tensors/embeddings #130

@davidbp

Displaying a DocumentArray in a Jupyter notebook calls _ipython_display_, which calls summary, which currently stacks the embeddings/tensors.

This fails with ValueError: all input arrays must have the same shape when the documents hold tensors of different shapes:

from docarray import DocumentArray,Document
import numpy as np
da = DocumentArray([Document(tensor=np.zeros(3)), Document(tensor=np.zeros(4))])
da._ipython_display_()

But this works as expected, since both tensors have the same shape:

In [4]: from docarray import DocumentArray,Document
   ...: import numpy as np
   ...: da = DocumentArray([Document(tensor=np.zeros(3)), Document(tensor=np.zeros(3))])
   ...: da._ipython_display_()
   ...: 
   ...: 
             Documents Summary             
                                           
  Length                 2                 
  Homogenous Documents   True              
  Common Attributes      ('id', 'tensor')  
                                           
                      Attributes Summary                       
                                                               
  Attribute   Data type      #Unique values   Has empty value  
 ───────────────────────────────────────────────────────────── 
  id          ('str',)       2                False            
  tensor      ('ndarray',)   2                False            
                                                               
          Storage Summary          
                                   
  Class     DocumentArrayInMemory  
  Backend   In Memory      

Why this happens

When a DocumentArray is displayed in a Jupyter notebook, _ipython_display_ is called, which calls summary, which calls all_attrs_values = self._get_attributes(*all_attrs_names). If there are tensor or embedding fields, all_attrs_names contains them. This means .tensors or .embeddings can be called, which breaks when the data can't be stacked.
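
The underlying failure can be reproduced with NumPy alone. This is a minimal sketch, assuming the stacked accessors end up doing something equivalent to np.stack on the per-document arrays (the issue does not show those internals, so treat that as an assumption):

```python
import numpy as np

# Same shapes as in the failing example: one tensor of length 3, one of length 4.
tensors = [np.zeros(3), np.zeros(4)]

try:
    # Assumption: stacking per-document arrays behaves like np.stack.
    np.stack(tensors)
    error = None
except ValueError as exc:
    error = str(exc)

# The message matches the one reported above.
print(error)
```

Stacking needs every array to have the same shape, so any DocumentArray with ragged tensors would hit this as soon as the stacked view is requested.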

Workaround

Never call .tensors or .embeddings. Doing so is also quite dangerous for big datasets, because it allocates memory for all the vectors at once.
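
One way a summary could avoid stacking is to inspect the values one document at a time. The sketch below is illustrative only: summarize_attribute is a hypothetical helper, not part of docarray, and a plain list of arrays stands in for the per-document tensor values.

```python
import numpy as np

def summarize_attribute(values):
    """Hypothetical helper: summarise values one by one, without stacking them."""
    # Collect the type names of all non-empty values.
    type_names = tuple(sorted({type(v).__name__ for v in values if v is not None}))
    # An attribute "has empty value" if any document leaves it unset.
    has_empty = any(v is None for v in values)
    return {"data_type": type_names, "has_empty": has_empty}

# Ragged tensors, as in the failing example: lengths 3 and 4.
tensors = [np.zeros(3), np.zeros(4)]
summary = summarize_attribute(tensors)
print(summary)
```

Because each value is only looked at individually, the shapes never need to agree and no single large array is allocated.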
