
[Data] Support List Types for Unique Aggregator and encode_lists flag #58916

Merged
richardliaw merged 10 commits into ray-project:master from kyuds:improve-unique-agg on Dec 9, 2025

Conversation

@kyuds (Member) commented Nov 22, 2025

Description

Basically the same idea as #58659

The `Unique` aggregator uses the `pyarrow.compute.unique` function internally, which doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.

Other improvements:

  • The `ignore_nulls` flag didn't work at all; it now works properly.
  • Had to force `ignore_nulls=False` for the Dataset `unique` API for backwards compatibility (we set `ignore_nulls` to `True` by default, so the behavior of the Dataset `unique` API would otherwise change now that `ignore_nulls` actually works).
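The pickle-based deduplication idea above can be sketched in isolation (a minimal illustration, not the PR's actual code; the helper name `unique_nonhashable` is hypothetical):

```python
import pickle

def unique_nonhashable(values):
    # Lists aren't hashable, so they can't go straight into a set or be
    # passed to pyarrow.compute.unique. Pickling each value yields
    # hashable bytes that can key a dict; keep one representative each.
    # (Caveat: byte-equality of pickles is stricter than value equality
    # for some types, e.g. dicts with different insertion order.)
    seen = {}
    for v in values:
        seen.setdefault(pickle.dumps(v), v)
    return list(seen.values())

print(unique_nonhashable([[1, 2], [1, 2], [1, 2, 3]]))  # [[1, 2], [1, 2, 3]]
```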

Related issues

This PR replaces #58538

Additional information

Design doc on my notion: https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a

@gemini-code-assist (bot, Contributor) left a comment:


Code Review

This pull request adds support for non-hashable types to the Unique aggregator by using pickling, and introduces an encode_lists flag for more flexible handling of list data. The implementation is sound and includes a comprehensive set of tests. My review includes one suggestion to refactor the aggregate_block method for improved performance and readability.

@ray-gardener bot added labels python (Pull requests that update Python code), data (Ray Data-related issues), community-contribution (Contributed by the community) on Nov 22, 2025
```python
Std("B", alias_name="std_b", ignore_nulls=ignore_nulls),
Quantile("B", alias_name="quantile_b", ignore_nulls=ignore_nulls),
Unique("B", alias_name="unique_b"),
Unique("B", alias_name="unique_b", ignore_nulls=False),
```
@kyuds (Member, Author):

This is to fix the tests: because `ignore_nulls` was broken before, we didn't filter out nulls at all.

```python
col = pc.list_flatten(col)
if self._ignore_nulls:
    col = pc.drop_null(col)
pickled = [pickle.dumps(v.as_py()).hex() for v in col]
```
Contributor:

why do you need to do this pickle?

Member:

PyArrow's unique implementation doesn't work with lists or object types.

This PR is part of a broader effort to optimize preprocessors like LabelEncoder by using `Unique` rather than iterating over the data in the driver.

Member:

@kyuds One concern I have is the performance implication of pickling everything. Lemme talk with my colleague, and I'll get back to you.

Contributor:

Yeah, this is really expensive; I'd recommend not doing this if possible.

@kyuds (Member, Author):

@bveeramani, on further investigation, I think we can drop using pyarrow entirely (it's not really friendly for the heterogeneous types we need to support when encoding lists) and use pandas instead.

I'm thinking something along the lines of this (with the pseudo-helper steps spelled out):

```python
col = BlockAccessor.for_block(block).to_pandas()[self._target_col_name]

if len(col) > 0 and isinstance(col.iloc[0], list):
    if self._encode_lists:
        col = col.explode()
    else:
        # Lists aren't hashable; tuples are, so drop_duplicates works.
        col = col.map(tuple)
if self._ignore_nulls:
    col = col.dropna()

unique_values = col.drop_duplicates().tolist()
```

Should support heterogeneous types and everything.

cc @richardliaw, was there a reason for implementing the Unique aggregator with pyarrow in the first place that I should be aware of?
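As a concrete illustration of the pandas behavior the sketch above relies on (`explode` for element-wise uniques, tuple conversion to make whole lists hashable; the sample data here is made up):

```python
import pandas as pd

col = pd.Series([[1, 2], [1, 2], [1, 2, 3], None])

# encode_lists=True analogue: flatten list elements into rows, then dedupe.
exploded = col.explode().dropna()
print(exploded.drop_duplicates().tolist())  # [1, 2, 3]

# encode_lists=False analogue: treat each whole list as one value;
# tuples are hashable, so drop_duplicates can handle them.
as_tuples = col.dropna().map(tuple)
print(as_tuples.drop_duplicates().tolist())  # [(1, 2), (1, 2, 3)]
```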

Contributor:

We can't use pandas; it has to be PyArrow for performance reasons.

```python
col = BlockAccessor.for_block(block).to_arrow().column(self._target_col_name)
return pac.unique(col).to_pylist()
```
```python
if pa.types.is_list(col.type) and self._encode_lists:
    col = pc.list_flatten(col)
```
Contributor:

Why flattening?

Member:

That's the documented behavior for the `encode_lists` parameter:

```
encode_lists: If ``True``, encode list elements. If ``False``, encode
    whole lists (i.e., replace each list with an integer). ``True``
    by default.
```

@kyuds (Member, Author):

This is for when we want to calculate `Unique` over individual list elements instead of over the entire list:

```
encode_lists=False: [[1, 2], [1, 2], [1, 2, 3]] -> unique values are [1, 2] and [1, 2, 3]
encode_lists=True:  [[1, 2], [1, 2], [1, 2, 3]] -> unique values are 1, 2, 3
```

This is triggered by the `encode_lists` flag.

@kyuds (Member, Author) commented Nov 27, 2025

@richardliaw @bveeramani @alexeykudinkin, converted to use pyarrow for most cases, with special handling via pandas for when we need to calculate uniques over list types. This was done because there is no straightforward way to calculate unique lists in pyarrow (it would require serializing/deserializing via json/pickle, which I assume would be even worse).

@richardliaw richardliaw self-assigned this Dec 3, 2025
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Dec 3, 2025
@kyuds kyuds changed the title [Data] Support Non-Hashable Types for Unique Aggregator and encode_lists flag [Data] Support List Types for Unique Aggregator and encode_lists flag Dec 6, 2025
@richardliaw richardliaw merged commit 1180868 into ray-project:master Dec 9, 2025
7 checks passed
@kyuds kyuds deleted the improve-unique-agg branch December 10, 2025 01:23
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
