
[Data] Fuse StreamingRepartition with MapBatches operators to scale collate #59108

Merged
raulchen merged 19 commits into ray-project:master from xinyuangui2:fuse-streaming-repartition
Dec 3, 2025

Conversation

@xinyuangui2 (Contributor) commented Dec 2, 2025

Why are these changes needed?

This PR adds operator fusion support for StreamingRepartition and MapBatches operators. When batch_size matches target_num_rows_per_block, these operators can be fused together to reduce scheduling overhead and improve performance.

What changes were included?

Fusion Rules:

  • MapBatches -> StreamingRepartition: Fuses when batch_size == target_num_rows_per_block

Both orders ensure that the map_batches function receives the correct number of rows.
For MapBatches -> StreamingRepartition, we also ensure each output block has batch_size rows.

Fusion Behavior:

  • Fused operators don't fuse further with surrounding map operators
  • Example: map -> s_r -> map -> map results in (map -> s_r) -> (map -> map)
  • Example: s_r -> map fuses into a single operator, which doesn't fuse further
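The rule and the no-further-fusion behavior above can be sketched as a small optimizer pass over simplified operator descriptors (`Op` and `try_fuse` are illustrative names, not Ray's actual API):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, simplified operator descriptors; Ray's real physical
# operators carry much more state than this.
@dataclass
class Op:
    kind: str                         # "map" or "s_r" (StreamingRepartition)
    batch_size: Optional[int] = None  # set on map ops
    target_rows: Optional[int] = None # set on s_r ops
    fused: bool = False               # fused ops don't fuse again

def try_fuse(ops: List[Op]) -> List[str]:
    """Fuse adjacent map/s_r pairs (in either order) when the map's
    batch_size equals the repartition's target_num_rows_per_block."""
    out: List[Op] = []
    for op in ops:
        prev = out[-1] if out else None
        if prev is not None and not prev.fused and not op.fused:
            if {prev.kind, op.kind} == {"map", "s_r"}:
                m = prev if prev.kind == "map" else op
                r = prev if prev.kind == "s_r" else op
                if m.batch_size == r.target_rows:
                    # Replace the pair with a fused op that refuses
                    # further fusion, matching the behavior above.
                    out[-1] = Op(kind=f"({prev.kind} -> {op.kind})", fused=True)
                    continue
        out.append(op)
    return [o.kind for o in out]
```

For example, `try_fuse([Op("map", batch_size=10), Op("s_r", target_rows=10), Op("map", batch_size=10), Op("map", batch_size=10)])` returns `["(map -> s_r)", "map", "map"]`; merging the remaining plain maps is left to the separate map-fusion rule.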

return (
    self._max_num_rows_per_block() is not None
    and block.num_rows()
    >= MAX_SAFE_ROWS_PER_BLOCK_FACTOR * self._max_num_rows_per_block()
)
@xinyuangui2 (Contributor Author) Dec 2, 2025:
For the case when batch_size == 10, blocks: [14], we want to get 2 result blocks: [10], [4] instead of [14].
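A minimal sketch of that splitting behavior (a standalone helper, not Ray's actual implementation): a 14-row block with target 10 should yield two blocks rather than pass through unchanged.

```python
def split_block(num_rows: int, target: int) -> list:
    """Split a block's row count into chunks of at most `target` rows,
    so a 14-row block with target 10 yields [10, 4] instead of [14]."""
    sizes = []
    while num_rows > 0:
        take = min(target, num_rows)
        sizes.append(take)
        num_rows -= take
    return sizes
```

For example, `split_block(14, 10)` returns `[10, 4]`, which is the behavior the check above is guarding.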

Contributor:

This makes sense, but why didn't the original streaming repartition PR catch this?

Contributor Author:

The unit tests didn't capture this edge case.

Signed-off-by: xgui <[email protected]>
@xinyuangui2 changed the title from "Fuse streaming repartition" to "[Data] Fuse StreamingRepartition with MapBatches operators to scale collate" Dec 2, 2025
@xinyuangui2 xinyuangui2 marked this pull request as ready for review December 2, 2025 18:50
@xinyuangui2 xinyuangui2 requested review from a team as code owners December 2, 2025 18:50
@xinyuangui2 xinyuangui2 requested a review from raulchen December 2, 2025 18:50
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Dec 2, 2025
@srinathk10 (Contributor) left a comment:

Minor comments. LGTM otherwise.

Signed-off-by: xgui <[email protected]>

def apply(self, plan: PhysicalPlan) -> PhysicalPlan:
    self._op_map = plan.op_map.copy()
    # Firstly fuse StreamingRepartition with MapBatches.
    fused_dag = self._fuse_streaming_repartition_operators_in_dag(plan.dag)
Contributor:

We should fuse map ops first, so that multiple maps can potentially be fused with streaming repartition.

Contributor Author:

It's kinda tricky to do so, because map fusion returns a single MapOperator without a batch_size.

op = MapOperator.create(
    up_op.get_map_transformer().fuse(down_op.get_map_transformer()),
    input_op,
    up_op.data_context,
    target_max_block_size_override=target_max_block_size,
    name=name,
    compute_strategy=compute,
    min_rows_per_bundle=min_rows_per_bundled_input,
    map_task_kwargs=map_task_kwargs,
    ray_remote_args=ray_remote_args,
    ray_remote_args_fn=ray_remote_args_fn,
)

The batch_size information is hidden inside the transform_fn. That's why I put streaming_repartition fusion before map_fusion.
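The difficulty can be seen with a toy version of transform composition (hypothetical helpers, not Ray's actual `MapTransformer` API): once map fusion composes two transforms into one closure, the upstream batch_size lives only inside that closure, so a rule running afterwards has nothing to compare against target_num_rows_per_block.

```python
def make_map_transform(fn, batch_size):
    # batch_size is captured in this closure; nothing on the returned
    # callable exposes it to later optimizer rules.
    def transform(rows):
        out = []
        for i in range(0, len(rows), batch_size):
            out.extend(fn(rows[i:i + batch_size]))
        return out
    return transform

def fuse_transforms(upstream, downstream):
    # Map fusion: compose the two transforms into a single callable.
    def fused(rows):
        return downstream(upstream(rows))
    return fused

fused = fuse_transforms(
    make_map_transform(lambda batch: batch, 10),
    make_map_transform(lambda batch: batch, 10),
)
# The fused callable still works, but its batch_size is no longer
# inspectable, so checking batch_size == target_num_rows_per_block
# after map fusion is impossible.
```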

Contributor:

ok, can you add a comment explaining this?

Contributor:

nvm, saw that

Contributor:

@xinyuangui2 why don't we fuse this at logical operator level instead of physical?

# For now, we don't want to over-fuse StreamingRepartition with other map operators,
# so the result operator does not support further fusion.
supports_fusion=False,
)
Contributor:

Where do we handle the target number of rows per block?

Contributor Author:

This is handled by the transform_fn. As long as the streaming_repartition's transform_fn runs last, the target number of rows per block is guaranteed.
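That ordering guarantee can be sketched as follows (a simplified model with assumed names, not Ray's actual fused transform chain): because the repartition slicing step runs last, output block sizes stay bounded by the target no matter how many rows the map step emits.

```python
def apply_fused(rows, map_fn, target_rows):
    """Run the map step first, then re-slice its output into blocks of
    at most target_rows rows; the last step enforces the block size."""
    mapped = map_fn(rows)
    blocks = []
    for i in range(0, len(mapped), target_rows):
        blocks.append(mapped[i:i + target_rows])
    return blocks
```

For example, feeding 25 rows through an identity map with a target of 10 yields blocks of sizes 10, 10, and 5.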

@raulchen (Contributor) left a comment:

approved by mistake.

@xinyuangui2 xinyuangui2 requested a review from raulchen December 3, 2025 16:27
@raulchen raulchen enabled auto-merge (squash) December 3, 2025 22:19
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Dec 3, 2025
@raulchen raulchen merged commit 7df47e9 into ray-project:master Dec 3, 2025
8 checks passed
@xinyuangui2 (Contributor Author):

Resolved #58837

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ollate (ray-project#59108)
