[Data] feat: add streaming train test split implementation #56803
richardliaw merged 3 commits into ray-project:master
Conversation
python/ray/data/dataset.py
Outdated
)
return ds_length


@ConsumptionAPI
Nit: this isn't a consumption API.
python/ray/data/dataset.py
Outdated
Examples with Bernoulli split:

>>> import ray
>>> ds = ray.data.range(8, override_num_blocks=1)
I had to override num_blocks when the data is very small; otherwise it creates one block per record and the Bernoulli split puts all records into the same bucket.
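The effect is easy to reproduce outside Ray with a minimal per-row Bernoulli split (plain-Python sketch; `bernoulli_split` is a hypothetical helper, not the PR's implementation):

```python
import random

def bernoulli_split(rows, test_proportion, seed=None):
    # Hypothetical helper: assign each row independently to train or test
    # with a single Bernoulli draw per row.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < test_proportion else train).append(row)
    return train, test

# With only 8 rows the realized split can deviate wildly from the target
# proportion -- sometimes every row lands in the same bucket.
train, test = bernoulli_split(list(range(8)), test_proportion=0.25, seed=0)
```

The split is only proportional in expectation, which is why tiny doctest datasets need care.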
return ds_length


@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
def streaming_train_test_split(
For my own understanding:
What was the rationale for having a separate API for streaming? Was introducing a streaming bool flag and split_type not sufficient in the existing train_test_split API?
I didn't see any of the other APIs extended this way, so I thought it would be better to create a new one. I also didn't want to support everything that the train_test_split implementation supports, at least not in the first version.
I also thought more about this: the other API is public/stable and has a ConsumptionAPI tag, plus the kwargs are non-composable, so it made sense to keep them separate for now.
python/ray/data/dataset.py
Outdated
# Each Ray task that processes a batch has a unique, stable ID (UUID-like).
# We grab the last 16 hex digits of the task_id to create a 32-bit integer.
# This ensures *different tasks* don't reuse the same RNG sequence.
task_id_str = ray.get_runtime_context().get_task_id()
As I recall, map_batches() may reuse Ray tasks. So different batches (of the same block) may get the same seed here, which should be easy to test. In random_sample() we initialize the Generator once per task and store it in a TaskContext.
Ah, I see. So you're saying it's better to just save the RNG and simplify this logic?
yeah that works pretty nicely! thank you!
@goutamvenkat-anyscale @wingkitlee0 mind taking another look?
python/ray/data/dataset.py
Outdated
@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
def streaming_train_test_split(
    self,
    test_proportion: float,
I suggest we use test_size to follow the train_test_split API.
python/ray/data/dataset.py
Outdated
The split type can be either "hash" or "bernoulli".
- "hash": The dataset is split into train and test subsets based on the hash of the key column.
- "bernoulli": The dataset is split into train and test subsets based on the Bernoulli distribution.
I suggest putting the Bernoulli distribution first, as it's the default. We could add that this is simple per-row random selection (in case people don't know the name Bernoulli, like myself).
Yeah, and maybe call it not "bernoulli" but just "random".
python/ray/data/dataset.py
Outdated
test_proportion: float,
*,
split_type: Literal["hash", "bernoulli"] = "bernoulli",
key_column: Optional[str] = None,
So this key_column has nothing to do with stratify (which is in the train_test_split API). Do you see this API being extended to stratified splits as well? It seems doable with a groupby.
I suggest we call this hash_column or split_column to distinguish it from key, which is commonly used in other APIs (groupby, repartition). It's also more obviously not for use with "bernoulli" mode.
If we later add a stratify arg, we won't need to deal with the confusion of stratify vs. key_column.
Thoughts?
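For reference, the hash-based routing under discussion can be sketched with a stable digest (stdlib sketch; `hash_bucket` and the bucket count are illustrative, not the PR's internals):

```python
import hashlib

def hash_bucket(value, test_proportion, num_buckets=1000):
    # Route a row deterministically based on a stable hash of its split
    # column, so the same key always lands in the same subset.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return "test" if bucket < test_proportion * num_buckets else "train"

# Rows sharing a key (e.g. a user id) can never leak across the split
# boundary, because the assignment depends only on the key itself.
assignments = {uid: hash_bucket(uid, 0.25) for uid in ["u1", "u2", "u3"]}
```

Using a fixed digest (rather than Python's process-salted `hash()`) keeps assignments stable across tasks and runs.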
@richardliaw @wingkitlee0 I am seeing an error in the docs build, which looks unrelated. Mind taking a quick look?
python/ray/data/dataset.py
Outdated
    ignored for Bernoulli split.
seed: The seed to use for the Bernoulli split. Ignored for hash split.
Docs should be updated to say random instead of Bernoulli.
if hash_column is not None and split_type == "random":
    raise ValueError("hash_column is not supported for random split")


def random_split(batch: pa.Table):
When a seed is provided, would we still need to ground this with a key column, similar to hash, for determinism?
Can you expand on this? It doesn't seem necessary, and it would require a unique key, which is not always available.
Yeah, I don't think you need to do that. You can use the seed to generate indices.
@martinbomio We can use preserve_order flag in ExecutionOptions to guarantee determinism here.
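The order-dependence being discussed is easy to demonstrate: a seeded per-row split is only reproducible when rows arrive in the same order, which is what preserving order guarantees (stdlib sketch; `seeded_split` is illustrative, not the PR's code):

```python
import random

def seeded_split(rows, test_proportion, seed):
    # Illustrative: select the "test" rows with a seeded per-row draw.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < test_proportion]

rows = list(range(100))
same_order = seeded_split(rows, 0.25, seed=7)
again = seeded_split(rows, 0.25, seed=7)
reordered = seeded_split(list(reversed(rows)), 0.25, seed=7)
# same_order == again, but reordered generally selects different rows:
# the seed fixes the draw sequence, not which rows the draws apply to.
```

This is why a seed alone is not enough for determinism in a streaming pipeline where task scheduling can reorder batches.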
python/ray/data/dataset.py
Outdated
)
return ds_length


@PublicAPI(stability="alpha", pi_group=SMJ_API_GROUP)
Typo: this should be api_group. Fixing it should repair the doc and show this API in the shuffling section.
python/ray/data/dataset.py
Outdated
- "random": The dataset is split into random train and test subsets.
- "hash": The dataset is split into train and test subsets based on the hash of the key column.

Important: Make sure to set the `preserve_order` flag in the `ExecutionOptions` to True
Optional: you can add a .. note:: or .. tip:: directive to highlight this. See map_batches.
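A minimal sketch of the suggested directive (the wording is illustrative, not the final docstring):

```rst
.. note::
    Set ``preserve_order`` to ``True`` in the ``ExecutionOptions`` to make
    the random split deterministic across runs.
```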
Commits:
- add comments and docs for bernoulli_split function
- store per-task rng
- rename bernoulli to random
- add note about determinism on random_split
- fix example for random split

Signed-off-by: Martin Bomio <[email protected]>
@richardliaw build is finally green
## Why are these changes needed?

To allow for a streaming implementation of train/test split.

## Related issue number

Closes ray-project#56780

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

> [!NOTE]
> Adds `Dataset.streaming_train_test_split` for streaming train/test splits (random or hash-based) with validations and tests.
>
> - **API (Dataset)**:
>   - Add `Dataset.streaming_train_test_split(test_size, split_type="random"|"hash", hash_column=None, seed=None, **ray_remote_kwargs)` for streaming train/test splits.
>   - Implements per-batch bucketing via `map_batches` and splits via filtering; supports deterministic random seeding per task and hash-based partitioning on `hash_column`.
>   - Validates inputs (`test_size` in (0,1), required `hash_column` for hash, disallow `seed` with hash, etc.).
>   - Returns two `Dataset`s (`train`, `test`).
> - **Tests**:
>   - Add large-scale tests for hash and random splits verifying proportions and disjointness.
>   - Add param validation tests covering invalid `test_size`, `split_type`, `hash_column`, and `seed` combinations.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 24f678c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Signed-off-by: Martin Bomio <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>