[Data] Support List Types for Unique Aggregator and encode_lists flag #58916
richardliaw merged 10 commits into ray-project:master
Conversation
Code Review
This pull request adds support for non-hashable types to the Unique aggregator by using pickling, and introduces an encode_lists flag for more flexible handling of list data. The implementation is sound and includes a comprehensive set of tests. My review includes one suggestion to refactor the aggregate_block method for improved performance and readability.
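For context, the pickle-based approach deduplicates non-hashable values by keying on their serialized bytes. A minimal sketch of the idea (the `unique_via_pickle` helper is hypothetical, not the PR's actual code):

```python
import pickle

# Hypothetical sketch of the pickle-based dedup idea: lists aren't hashable,
# so we key a dict on their pickled bytes and keep first occurrences.
def unique_via_pickle(values):
    seen = {}
    for v in values:
        key = pickle.dumps(v)
        if key not in seen:
            seen[key] = v
    return list(seen.values())

print(unique_via_pickle([[1, 2], [1, 2], [1, 2, 3]]))  # [[1, 2], [1, 2, 3]]
```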
…ion; previous ignore_nulls was broken
```diff
  Std("B", alias_name="std_b", ignore_nulls=ignore_nulls),
  Quantile("B", alias_name="quantile_b", ignore_nulls=ignore_nulls),
- Unique("B", alias_name="unique_b"),
+ Unique("B", alias_name="unique_b", ignore_nulls=False),
```
This is to fix the tests: ignore_nulls was broken before, so we weren't filtering out nulls at all.
Signed-off-by: Daniel Shin <[email protected]>
python/ray/data/aggregate.py (outdated)
```python
col = pc.list_flatten(col)
if self._ignore_nulls:
    col = pc.drop_null(col)
pickled = [pickle.dumps(v.as_py()).hex() for v in col]
```
why do you need to do this pickle?
PyArrow's unique implementation doesn't work with lists or object types.
This PR is part of a broader effort to optimize preprocessors like LabelEncoder by using Unique rather than iterating over the data in the driver.
@kyuds One concern I have is the performance implication of pickling everything. Lemme talk with my colleague, and I'll get back to you
yeah this is really expensive, would recommend not doing this if possible
@bveeramani , on further investigation, I think we can drop using pyarrow entirely (it's not really friendly for the heterogeneous types we need to support when encoding lists) and use pandas instead.
I'm thinking something along the lines of this:
```python
col = BlockAccessor.for_block(block).to_pandas()[self._target_col_name]
if pseudo_function_check_first_elem_is_list(col):
    if self._encode_lists:
        col = col.explode()
    else:
        col = pseudo_function_change_elements_to_tuple(col)
if self._ignore_nulls:
    col = col.dropna()
unique_values = col.drop_duplicates().tolist()
```
Should support heterogeneous types and everything.
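A runnable version of the sketch above, with the `pseudo_function_*` placeholders filled in with one plausible interpretation (my assumptions: the first element is representative of the column, and tuples stand in for lists so `drop_duplicates` can hash them):

```python
import pandas as pd

def unique_pandas(col, encode_lists, ignore_nulls):
    if len(col) > 0 and isinstance(col.iloc[0], list):
        if encode_lists:
            col = col.explode()   # unique over individual list elements
        else:
            col = col.map(tuple)  # tuples are hashable, lists aren't
    if ignore_nulls:
        col = col.dropna()
    return col.drop_duplicates().tolist()

s = pd.Series([[1, 2], [1, 2], [1, 2, 3]])
print(unique_pandas(s, encode_lists=True, ignore_nulls=True))   # [1, 2, 3]
print(unique_pandas(s, encode_lists=False, ignore_nulls=True))  # [(1, 2), (1, 2, 3)]
```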
cc @richardliaw was there a reason for implementing the Unique aggregator with pyarrow in the first place that I should be aware of?
We can't use pandas; it has to be PyArrow for performance reasons.
python/ray/data/aggregate.py (outdated)
```python
col = BlockAccessor.for_block(block).to_arrow().column(self._target_col_name)
return pac.unique(col).to_pylist()
if pa.types.is_list(col.type) and self._encode_lists:
    col = pc.list_flatten(col)
```
That's the documented behavior for the encode_lists parameter:
ray/python/ray/data/preprocessors/encoder.py, lines 101 to 104 in 76be448
For when we want to calculate Unique over individual list elements instead of calculating Unique over entire lists:
[[1, 2], [1, 2], [1, 2, 3]] -> unique is [1, 2] and [1, 2, 3]
vs.
[[1, 2], [1, 2], [1, 2, 3]] -> unique is 1, 2, 3
The latter behavior is triggered by the encode_lists flag.
python/ray/data/aggregate.py (outdated)
```python
col = pc.list_flatten(col)
if self._ignore_nulls:
    col = pc.drop_null(col)
pickled = [pickle.dumps(v.as_py()).hex() for v in col]
```
We can't use pandas; it has to be PyArrow for performance reasons.
@richardliaw @bveeramani @alexeykudinkin , converted to use pyarrow for most cases, with special handling via pandas for when we need to calculate uniques over list types. This was done because there is no straightforward way to calculate unique lists in pyarrow (it requires serializing/deserializing via json/pickle, which I assume would be even worse).
Description
Basically the same idea as #58659.
The `Unique` aggregator uses the `pyarrow.compute.unique` function internally, which doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.
Other improvements:
- The `ignore_nulls` flag didn't work at all; it now works properly.
- Had to force `ignore_nulls=False` for the Datasets `unique` API for backwards compatibility (we set `ignore_nulls` to `True` by default, so the behavior of the `unique` API would change now that `ignore_nulls` actually works).
Related issues
This PR replaces #58538.
Additional information
Design doc on my notion: https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a