[Data] feat: add streaming train test split implementation #56803

Merged
richardliaw merged 3 commits into ray-project:master from martinbomio:streaming-train-test-split
Oct 3, 2025

@martinbomio
Contributor

@martinbomio martinbomio commented Sep 22, 2025

Why are these changes needed?

To allow for a streaming implementation of train/test split

Related issue number

Closes #56780

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Note

Adds Dataset.streaming_train_test_split for streaming train/test splits (random or hash-based) with validations and tests.

  • API (Dataset):
    • Add Dataset.streaming_train_test_split(test_size, split_type="random"|"hash", hash_column=None, seed=None, **ray_remote_kwargs) for streaming train/test splits.
      • Implements per-batch bucketing via map_batches and splits via filtering; supports deterministic random seeding per task and hash-based partitioning on hash_column.
      • Validates inputs (test_size in (0,1), required hash_column for hash, disallow seed with hash, etc.).
      • Returns two Datasets (train, test).
  • Tests:
    • Add large-scale tests for hash and random splits verifying proportions and disjointness.
    • Add param validation tests covering invalid test_size, split_type, hash_column, and seed combinations.

Written by Cursor Bugbot for commit 24f678c. This will update automatically on new commits.
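
The mechanism summarized in the note (tag each batch with a bucket, then derive the two outputs by filtering) can be sketched without Ray. The following is a minimal plain-Python illustration, not the PR's implementation: the helper names (`is_test_row`, `tag_batches`, `select`), the choice of MD5, and the bucket count are assumptions made for the example.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]
Batch = List[Row]

NUM_BUCKETS = 10_000  # assumed bucket resolution, chosen for illustration


def is_test_row(row: Row, hash_column: str, test_size: float) -> bool:
    """Decide membership from a stable hash of the row's hash_column value."""
    digest = hashlib.md5(str(row[hash_column]).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    return bucket < test_size * NUM_BUCKETS


def tag_batches(
    batches: Iterable[Batch], hash_column: str, test_size: float
) -> Iterator[List[Tuple[Row, bool]]]:
    """Per-batch bucketing: annotate every row with an in-test flag."""
    for batch in batches:
        yield [(row, is_test_row(row, hash_column, test_size)) for row in batch]


def select(
    batches: Iterable[Batch], hash_column: str, test_size: float, want_test: bool
) -> Iterator[Row]:
    """Filter the tagged stream down to one side of the split."""
    for tagged in tag_batches(batches, hash_column, test_size):
        for row, in_test in tagged:
            if in_test == want_test:
                yield row


# 10,000 rows arriving in batches of 100; no batch needs to see any other batch.
batches = [[{"id": i} for i in range(s, s + 100)] for s in range(0, 10_000, 100)]
train = list(select(batches, "id", 0.2, want_test=False))
test = list(select(batches, "id", 0.2, want_test=True))
```

Because the decision depends only on the value in the hash column, the same row lands on the same side on every run, and neither output requires materializing the whole dataset.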

)
return ds_length

@ConsumptionAPI
Contributor

nit, this isn't a consumption API

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 4787091 to 28943f3 Compare September 22, 2025 23:26
Examples with Bernoulli split:

>>> import ray
>>> ds = ray.data.range(8, override_num_blocks=1)
Contributor Author

I had to override num_blocks when the data is very small; otherwise it will create one block per record and the Bernoulli split will put all records into the same bucket
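
One plausible reading of this behaviour, assuming each block is handled by a task that builds its own generator from the same seed, can be reproduced in plain Python. The function name `split_block` and the constants are hypothetical; this is a sketch of the effect, not the code under review.

```python
import random

SEED = 42
TEST_SIZE = 0.25


def split_block(block, seed, test_size):
    """Tag each row of one block using a generator created for that block."""
    rng = random.Random(seed)
    return [(row, rng.random() < test_size) for row in block]


# Eight rows in eight single-row blocks: every block's generator starts from
# the same seed, so every row receives the same first draw.
one_row_blocks = [[i] for i in range(8)]
flags_many_blocks = [
    flag
    for block in one_row_blocks
    for _, flag in split_block(block, SEED, TEST_SIZE)
]

# The same rows in a single block consume successive draws instead.
flags_one_block = [flag for _, flag in split_block(list(range(8)), SEED, TEST_SIZE)]
```

With one row per block, `flags_many_blocks` contains a single repeated value, which matches the "all records in the same bucket" symptom described in the comment.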

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 28943f3 to 9b23914 Compare September 22, 2025 23:32
@richardliaw richardliaw marked this pull request as ready for review September 23, 2025 00:10
@richardliaw richardliaw requested a review from a team as a code owner September 23, 2025 00:10
return ds_length

@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
def streaming_train_test_split(
Contributor

For my own understanding:
What was the rationale for having a separate API for streaming? Was introducing a streaming bool flag and split_type not sufficient in the existing train_test_split API?

Contributor Author

I didn't see any of the other APIs extended this way, so I thought it would be better to create a new one. I also didn't want to support everything that the train_test_split implementation supports, at least not in the first version of it.

Contributor

I also thought more about this - the other API is public/stable and has a ConsumptionAPI tag + the kwargs are noncomposable, so it made sense to separate it for now

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 68d14bd to 57d6a6e Compare September 23, 2025 01:29
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Sep 23, 2025
# Each Ray task that processes a batch has a unique, stable ID (UUID-like).
# We grab the last 16 hex digits of the task_id to create a 32-bit integer.
# This ensures *different tasks* don't reuse the same RNG sequence.
task_id_str = ray.get_runtime_context().get_task_id()
Contributor

As I recall, map_batches() may reuse Ray tasks, so different batches (of the same block) may get the same seed here, which should be easy to test. In random_sample() we initialize the Generator once per task and store it in a TaskContext.

Contributor Author

ah I see, so you are saying that it is better to just save the rng and simplify this logic?

Contributor Author

yeah that works pretty nicely! thank you!
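
The suggestion in this thread (create the generator once per task and reuse it across batches) can be sketched as follows. A plain dict stands in for the per-task context here; it is a hypothetical substitute, not Ray's TaskContext, and the function names are made up for the example.

```python
import random
from typing import Dict, List


def tag_batch_reseeded(batch: List[int], seed: int, test_size: float) -> List[bool]:
    """Problematic variant: a fresh generator per batch repeats the sequence."""
    rng = random.Random(seed)
    return [rng.random() < test_size for _ in batch]


def tag_batch_cached(
    batch: List[int], seed: int, test_size: float, task_ctx: Dict[str, random.Random]
) -> List[bool]:
    """Create the generator on the first batch and reuse it afterwards."""
    if "rng" not in task_ctx:
        task_ctx["rng"] = random.Random(seed)
    rng = task_ctx["rng"]
    return [rng.random() < test_size for _ in batch]


batch_a, batch_b = list(range(5)), list(range(5, 10))

# Re-seeding per batch: both batches receive identical flags.
reseeded = (
    tag_batch_reseeded(batch_a, 7, 0.3),
    tag_batch_reseeded(batch_b, 7, 0.3),
)

# Cached generator: the second batch continues where the first one stopped.
ctx: Dict[str, random.Random] = {}
cached = tag_batch_cached(batch_a, 7, 0.3, ctx) + tag_batch_cached(batch_b, 7, 0.3, ctx)
```

The cached variant draws one continuous sequence per task, so batches processed by the same task no longer share their random decisions.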

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 57d6a6e to 1d6c030 Compare September 24, 2025 01:58
@martinbomio
Contributor Author

@goutamvenkat-anyscale @wingkitlee0 mind taking another look?

@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
def streaming_train_test_split(
self,
test_proportion: float,
Contributor

I suggest we use test_size to follow train_test_split API.

Comment on lines +2479 to +2481
The split type can be either "hash" or "bernoulli".
- "hash": The dataset is split into train and test subsets based on the hash of the key column.
- "bernoulli": The dataset is split into train and test subsets based on the Bernoulli distribution.
Contributor

I suggest putting Bernoulli distribution first as it's the default. We could add that this is the simple per-row random selection (in case people don't know the name Bernoulli, like myself).

Contributor

yeah maybe not bernoulli but just random

Contributor Author

done

test_proportion: float,
*,
split_type: Literal["hash", "bernoulli"] = "bernoulli",
key_column: Optional[str] = None,
Contributor

So this key_column has nothing to do with stratify (which is in the train_test_split API). Do you see this API being extended to stratified splits as well? It seems doable with a groupby.

I suggest we call this hash_column or split_column to distinguish it from key, which is commonly used in other APIs (groupby, repartition). It also makes it more obvious that it should not be used with "bernoulli" mode.

Later if we add stratify arg we do not need to deal with the confusion of stratify vs key_column.

Thoughts?

Contributor

agree

Contributor Author

done

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 1d6c030 to d35dc0a Compare September 25, 2025 00:54
@martinbomio martinbomio force-pushed the streaming-train-test-split branch from d35dc0a to 4a9390c Compare September 25, 2025 13:01
@martinbomio
Contributor Author

@richardliaw @wingkitlee0 I am seeing an error in the docs build, which looks unrelated. Mind taking a quick look?

Comment on lines +2509 to +2510
ignored for Bernoulli split.
seed: The seed to use for the Bernoulli split. Ignored for hash split.
Contributor

Docs should be updated to say random instead of Bernoulli

if hash_column is not None and split_type == "random":
raise ValueError("hash_column is not supported for random split")
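
The check above is one of the validations listed in the PR summary (test_size in (0, 1), hash_column required for hash, seed disallowed with hash). A minimal sketch of the full set, assuming a standalone helper: the name `validate_split_args` and every error message except the one quoted above are hypothetical.

```python
from typing import Optional


def validate_split_args(
    test_size: float,
    split_type: str,
    hash_column: Optional[str],
    seed: Optional[int],
) -> None:
    """Raise ValueError for argument combinations the summary says are rejected."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    if split_type not in ("random", "hash"):
        raise ValueError("split_type must be 'random' or 'hash'")
    if split_type == "hash":
        if hash_column is None:
            raise ValueError("hash_column is required for hash split")
        if seed is not None:
            raise ValueError("seed is not supported for hash split")
    if split_type == "random" and hash_column is not None:
        raise ValueError("hash_column is not supported for random split")


# Valid combinations pass silently.
validate_split_args(0.2, "random", None, 42)
validate_split_args(0.2, "hash", "id", None)
```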

def random_split(batch: pa.Table):
Contributor

@srinathk10 srinathk10 Sep 25, 2025

When seed is provided, we would still need to ground this with key column, similar to hash for determinism?

Contributor Author

@martinbomio martinbomio Sep 25, 2025

can you expand on this? it doesn't seem necessary to do this and it will require a unique key, which is not always available

Contributor

Yeah I don't think you need to do that. You can seed to generate indices.

Contributor

@martinbomio We can use preserve_order flag in ExecutionOptions to guarantee determinism here.
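
Why row order matters for a seeded random split can be shown in plain Python. This is an illustrative sketch with a hypothetical helper (`random_assign`), not the PR's code: a seeded generator fixes the sequence of draws, but which row receives which draw depends on the order in which rows arrive.

```python
import random
from typing import Iterable, Set


def random_assign(row_ids: Iterable[int], test_size: float, seed: int) -> Set[int]:
    """Return the ids that land in the test set for a given arrival order."""
    rng = random.Random(seed)
    return {rid for rid in row_ids if rng.random() < test_size}


rows = list(range(100))

in_order = random_assign(rows, 0.3, seed=0)
same_order_again = random_assign(rows, 0.3, seed=0)
reordered = random_assign(list(reversed(rows)), 0.3, seed=0)
```

Repeating the run with the same arrival order reproduces the same test set, while a different arrival order pairs the same draws with different rows. That is the situation the preserve_order suggestion above is meant to rule out.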

@martinbomio martinbomio force-pushed the streaming-train-test-split branch 2 times, most recently from ead808e to 9e4c4c6 Compare September 25, 2025 20:05
@martinbomio martinbomio force-pushed the streaming-train-test-split branch 2 times, most recently from e91c8a7 to 51d39e1 Compare September 25, 2025 20:13
Contributor

@wingkitlee0 wingkitlee0 left a comment

Looks great.

)
return ds_length

@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
Contributor

Typo for api_group.

This should fix the doc and show this API in the shuffling section.

- "random": The dataset is split into random train and test subsets.
- "hash": The dataset is split into train and test subsets based on the hash of the key column.

Important: Make sure to set the `preserve_order` flag in the `ExecutionOptions` to True
Contributor

Optional: you can add .. note:: or .. tip:: to highlight this. See map_batches.

@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 7b2b6cd to 8a078cd Compare September 26, 2025 02:22
@martinbomio martinbomio force-pushed the streaming-train-test-split branch from 1ef5e9d to 2afa0b0 Compare September 29, 2025 00:23
Signed-off-by: Martin Bomio <[email protected]>

add comments and docs for bernoulli_split function

Signed-off-by: Martin Bomio <[email protected]>

store per task rnd

Signed-off-by: Martin Bomio <[email protected]>

rename bernulli to random

Signed-off-by: Martin Bomio <[email protected]>

add note about determinism on random_split

Signed-off-by: Martin Bomio <[email protected]>

fix example for random split

Signed-off-by: Martin Bomio <[email protected]>
Signed-off-by: Martin Bomio <[email protected]>
@martinbomio martinbomio force-pushed the streaming-train-test-split branch from f8770ea to e8798e0 Compare September 30, 2025 13:28
@martinbomio
Contributor Author

@richardliaw build is finally green

@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Sep 30, 2025
@richardliaw richardliaw merged commit 8ddc461 into ray-project:master Oct 3, 2025
7 checks passed
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
## Why are these changes needed?

To allow for a streaming implementation of train/test split

## Related issue number

Closes ray-project#56780

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds Dataset.streaming_train_test_split for streaming train/test
splits (random or hash-based) with validations and tests.
>
> - **API (Dataset)**:
> - Add `Dataset.streaming_train_test_split(test_size,
split_type="random"|"hash", hash_column=None, seed=None,
**ray_remote_kwargs)` for streaming train/test splits.
> - Implements per-batch bucketing via `map_batches` and splits via
filtering; supports deterministic random seeding per task and hash-based
partitioning on `hash_column`.
> - Validates inputs (`test_size` in (0,1), required `hash_column` for
hash, disallow `seed` with hash, etc.).
>     - Returns two `Dataset`s (`train`, `test`).
> - **Tests**:
> - Add large-scale tests for hash and random splits verifying
proportions and disjointness.
> - Add param validation tests covering invalid `test_size`,
`split_type`, `hash_column`, and `seed` combinations.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
24f678c. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Martin Bomio <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
## Why are these changes needed?

To allow for a streaming implementation of train/test split

## Related issue number

Closes ray-project#56780

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adds `Dataset.streaming_train_test_split` for streaming train/test splits (random or hash-based) with validations and tests.
>
> - **API (Dataset)**:
>   - Add `Dataset.streaming_train_test_split(test_size, split_type="random"|"hash", hash_column=None, seed=None, **ray_remote_kwargs)` for streaming train/test splits.
>     - Implements per-batch bucketing via `map_batches` and splits via filtering; supports deterministic random seeding per task and hash-based partitioning on `hash_column`.
>     - Validates inputs (`test_size` in (0,1), required `hash_column` for hash, disallow `seed` with hash, etc.).
>     - Returns two `Dataset`s (`train`, `test`).
> - **Tests**:
>   - Add large-scale tests for hash and random splits verifying proportions and disjointness.
>   - Add param validation tests covering invalid `test_size`, `split_type`, `hash_column`, and `seed` combinations.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 24f678c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
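
The hash-based split described in the summary can be illustrated with a minimal, Ray-free sketch. This is not the code merged in this PR: the helper names `assign_bucket` and `split_batch`, the use of MD5, and the list-of-dicts batch format are all assumptions made for illustration. It only shows the general idea of bucketing each row by a hash of one column and then filtering rows into two outputs.

```python
import hashlib


def assign_bucket(value, test_size):
    """Deterministically map a value to "test" or "train" (hypothetical helper)."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be in (0, 1)")
    # Hash the value and read the first 8 bytes as an integer in [0, 2**64).
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Values whose hash falls below the threshold go to the test split.
    threshold = int(test_size * 2**64)
    return "test" if bucket < threshold else "train"


def split_batch(batch, hash_column, test_size):
    """Split one batch (a list of dict rows) into (train, test) lists."""
    train, test = [], []
    for row in batch:
        if assign_bucket(row[hash_column], test_size) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test


if __name__ == "__main__":
    rows = [{"id": i} for i in range(10_000)]
    train_rows, test_rows = split_batch(rows, hash_column="id", test_size=0.2)
    print(len(train_rows), len(test_rows))
```

Because each row's assignment depends only on the value in the hashed column, the same row lands in the same split no matter which batch or task processes it. That property would explain why no `seed` is needed for a hash split, though the actual implementation may differ in its hash function and batch handling.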

---------

Signed-off-by: Martin Bomio <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Martin Bomio <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Martin Bomio <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Martin Bomio <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Labels

- community-contribution: Contributed by the community
- data: Ray Data-related issues
- go: add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data]: Streaming train/test split

5 participants