
[Data] Improve Streaming Repartition #58728

Merged
raulchen merged 5 commits into ray-project:master from owenowenisme:data/improvement-on-streaming-repartition
Nov 19, 2025

Conversation

@owenowenisme owenowenisme commented Nov 18, 2025

Description

  • Document the internal logic of Streaming Repartition Implementation
  • Add num_rows_per_block to Streaming Repartition name

Related issues

Additional information

@owenowenisme owenowenisme added the go (add ONLY when ready to merge, run all tests) label Nov 18, 2025
Signed-off-by: You-Cheng Lin <[email protected]>
@owenowenisme owenowenisme marked this pull request as ready for review November 18, 2025 06:09
@owenowenisme owenowenisme requested a review from a team as a code owner November 18, 2025 06:09
@ray-gardener ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Nov 18, 2025
@xinyuangui2 xinyuangui2 left a comment

Personally, I think using `BatchMapTransformFn` with `disabled_shaping` is easier to understand than `OutputBlockSizeOption.target_num_rows_per_block`.

2. Whenever the pending total reaches the target row count, try to build a ready bundle.
3. Determine the slice needed from the final bundle so the ready bundle holds an exact multiple of the target rows.
4. Submit that ready bundle to a remote map task; the task slices each block according to the slice metadata stored in the RefBundle (the bundle now contains n × target rows for n ≥ 1).
5. `plan_streaming_repartition_op` sets `OutputBlockSizeOption.target_num_rows_per_block` to the target number of rows per block, so the output buffer further splits the n × target rows into n blocks of exactly the target size.
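
For readers following the steps above, here is a minimal sketch of the bundling logic in plain Python, with no Ray dependency. It models blocks by row count only. `BlockSlice`, `build_ready_bundles`, and `split_into_blocks` are hypothetical names made up for illustration, not Ray APIs, and flushing the remaining rows as one smaller bundle at end of input is an assumption rather than something stated in this PR.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class BlockSlice:
    """Hypothetical stand-in for one block's slice metadata in a bundle."""

    block_id: int
    start: int  # first row included
    end: int  # one past the last row included

    @property
    def num_rows(self) -> int:
        return self.end - self.start


def build_ready_bundles(
    block_rows: Iterable[int], target: int
) -> Iterator[List[BlockSlice]]:
    """Yield bundles whose row count is an exact multiple of `target`."""
    if target <= 0:
        raise ValueError("target must be positive")
    pending: List[BlockSlice] = []
    pending_rows = 0
    for block_id, rows in enumerate(block_rows):
        if rows <= 0:
            continue
        pending.append(BlockSlice(block_id, 0, rows))
        pending_rows += rows
        if pending_rows < target:
            continue  # Step 2: not enough rows pending yet.
        # Step 3: cut the newest block so the bundle holds n * target rows.
        # Fewer than `target` rows were pending before this block arrived,
        # so the leftover always fits inside the newest block.
        leftover = pending_rows % target
        last = pending.pop()
        cut = last.end - leftover
        # Step 4: this list is what would be handed to the map task.
        yield pending + [BlockSlice(last.block_id, last.start, cut)]
        # The tail of the newest block waits for the next bundle.
        pending = [BlockSlice(last.block_id, cut, last.end)] if leftover else []
        pending_rows = leftover
    if pending:
        # Assumption: rows left at end of input form one smaller bundle.
        yield pending


def split_into_blocks(bundle: List[BlockSlice], target: int) -> List[int]:
    """Step 5: row counts of the output blocks cut from one bundle."""
    total = sum(s.num_rows for s in bundle)
    full, rest = divmod(total, target)
    return [target] * full + ([rest] if rest else [])


if __name__ == "__main__":
    for bundle in build_ready_bundles([3, 4, 10, 1], target=5):
        print(bundle, split_into_blocks(bundle, 5))
```

With input blocks of 3, 4, 10, and 1 rows and a target of 5, the sketch produces bundles of 5, 10, and 3 rows. The 10-row bundle is the n = 2 case, which the final step splits into two blocks of 5.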

Let's also add a comment to `MAX_SAFE_ROWS_PER_BLOCK_FACTOR` saying it should be < 2.

owenowenisme and others added 3 commits November 19, 2025 08:28
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
@raulchen raulchen merged commit 736ba10 into ray-project:master Nov 19, 2025
6 checks passed
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
## Description
 - Document the internal logic of Streaming Repartition Implementation
 - Add `num_rows_per_block` to Streaming Repartition name

## Related issues


## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: Xinyuan <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <[email protected]>

Labels

data (Ray Data-related issues), docs (An issue or change related to documentation), go (add ONLY when ready to merge, run all tests)

3 participants