
[Data] Support List Types for Unique Aggregator and encode_lists flag #58916

Merged
richardliaw merged 10 commits into ray-project:master from kyuds:improve-unique-agg on Dec 9, 2025

Conversation

@kyuds (Member) commented Nov 22, 2025

Description

Basically the same idea as #58659

The `Unique` aggregator uses the `pyarrow.compute.unique` function internally, which doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.

Other improvements:

  • The `ignore_nulls` flag didn't work at all; it now works properly.
  • Had to force `ignore_nulls=False` for the Dataset `unique` API for backwards compatibility (we set `ignore_nulls` to `True` by default, so the behavior of the Dataset `unique` API would otherwise change now that `ignore_nulls` actually works).
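The pickle-based deduplication idea above can be sketched in isolation (a minimal illustration, not the PR's actual code; the helper name `unique_nonhashable` is hypothetical):

```python
import pickle

def unique_nonhashable(values):
    # Lists aren't hashable, so they can't go straight into a set or be
    # passed to pyarrow.compute.unique. Pickling each value yields
    # hashable bytes that can key a dict; keep one representative each.
    # (Caveat: byte-equality of pickles is stricter than value equality
    # for some types, e.g. dicts with different insertion order.)
    seen = {}
    for v in values:
        seen.setdefault(pickle.dumps(v), v)
    return list(seen.values())

print(unique_nonhashable([[1, 2], [1, 2], [1, 2, 3]]))  # [[1, 2], [1, 2, 3]]
```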

Related issues

This PR replaces #58538

Additional information

Design doc on my notion: https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a

@gemini-code-assist (bot, Contributor) left a comment:


Code Review

This pull request adds support for non-hashable types to the Unique aggregator by using pickling, and introduces an encode_lists flag for more flexible handling of list data. The implementation is sound and includes a comprehensive set of tests. My review includes one suggestion to refactor the aggregate_block method for improved performance and readability.

@ray-gardener bot added labels python (Pull requests that update Python code), data (Ray Data-related issues), community-contribution (Contributed by the community) on Nov 22, 2025
```python
Std("B", alias_name="std_b", ignore_nulls=ignore_nulls),
Quantile("B", alias_name="quantile_b", ignore_nulls=ignore_nulls),
Unique("B", alias_name="unique_b"),
Unique("B", alias_name="unique_b", ignore_nulls=False),
```
@kyuds (Member, Author):

This is to fix the tests: because `ignore_nulls` was broken before, we didn't filter out nulls at all.

```python
col = pc.list_flatten(col)
if self._ignore_nulls:
    col = pc.drop_null(col)
pickled = [pickle.dumps(v.as_py()).hex() for v in col]
```
Contributor:

why do you need to do this pickle?

Member:

PyArrow's unique implementation doesn't work with lists or object types.

This PR is part of a broader effort to optimize preprocessors like LabelEncoder by using `Unique` rather than iterating over the data in the driver.

Member:

@kyuds One concern I have is the performance implication of pickling everything. Lemme talk with my colleague, and I'll get back to you.

Contributor:

Yeah, this is really expensive; I'd recommend not doing this if possible.

@kyuds (Member, Author):

@bveeramani, on further investigation, I think we can drop using pyarrow entirely (it's not really friendly for the heterogeneous types we need to support when encoding lists) and use pandas instead.

I'm thinking something along the lines of this (with the pseudo-helper steps spelled out):

```python
col = BlockAccessor.for_block(block).to_pandas()[self._target_col_name]

if len(col) > 0 and isinstance(col.iloc[0], list):
    if self._encode_lists:
        col = col.explode()
    else:
        # Lists aren't hashable; tuples are, so drop_duplicates works.
        col = col.map(tuple)
if self._ignore_nulls:
    col = col.dropna()

unique_values = col.drop_duplicates().tolist()
```

Should support heterogeneous types and everything.

cc @richardliaw, was there a reason for implementing the Unique aggregator with pyarrow in the first place that I should be aware of?
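As a concrete illustration of the pandas behavior the sketch above relies on (`explode` for element-wise uniques, tuple conversion to make whole lists hashable; the sample data here is made up):

```python
import pandas as pd

col = pd.Series([[1, 2], [1, 2], [1, 2, 3], None])

# encode_lists=True analogue: flatten list elements into rows, then dedupe.
exploded = col.explode().dropna()
print(exploded.drop_duplicates().tolist())  # [1, 2, 3]

# encode_lists=False analogue: treat each whole list as one value;
# tuples are hashable, so drop_duplicates can handle them.
as_tuples = col.dropna().map(tuple)
print(as_tuples.drop_duplicates().tolist())  # [(1, 2), (1, 2, 3)]
```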

Contributor:

We can't use pandas; it has to be PyArrow for performance reasons.

```python
col = BlockAccessor.for_block(block).to_arrow().column(self._target_col_name)
return pac.unique(col).to_pylist()
```
```python
if pa.types.is_list(col.type) and self._encode_lists:
    col = pc.list_flatten(col)
```
Contributor:

Why flattening?

Member:

That's the documented behavior for the `encode_lists` parameter:

```
encode_lists: If ``True``, encode list elements. If ``False``, encode
    whole lists (i.e., replace each list with an integer). ``True``
    by default.
```

@kyuds (Member, Author):

This is for when we want to calculate `Unique` over individual list elements instead of over the entire list:

```
encode_lists=False: [[1, 2], [1, 2], [1, 2, 3]] -> unique values are [1, 2] and [1, 2, 3]
encode_lists=True:  [[1, 2], [1, 2], [1, 2, 3]] -> unique values are 1, 2, 3
```

This is triggered by the `encode_lists` flag.

@kyuds (Member, Author) commented Nov 27, 2025

@richardliaw @bveeramani @alexeykudinkin, converted to use pyarrow for most cases, with special handling via pandas for when we need to calculate uniques over list types. This was done because there is no straightforward way to calculate unique lists in pyarrow (it would require serializing/deserializing via json/pickle, which I assume would be even worse).

@richardliaw richardliaw self-assigned this Dec 3, 2025
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Dec 3, 2025
@kyuds kyuds changed the title [Data] Support Non-Hashable Types for Unique Aggregator and encode_lists flag [Data] Support List Types for Unique Aggregator and encode_lists flag Dec 6, 2025
@richardliaw richardliaw merged commit 1180868 into ray-project:master Dec 9, 2025
7 checks passed
@kyuds kyuds deleted the improve-unique-agg branch December 10, 2025 01:23
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
