[data] remove metadata for hashing + truncate warning logs#56093
Conversation
Signed-off-by: iamjustinhsu <[email protected]>
There was a problem hiding this comment.
Code Review
This pull request introduces two valuable improvements. First, it correctly centralizes the logic for removing metadata from pyarrow schemas within the unify_schemas function. This ensures schemas are hashable, which is essential for deduplication and overall correctness, addressing a shortcoming of the previous implementation. Second, it enhances the logging by truncating long schema representations in warning messages, which significantly improves readability. The changes are well-justified, correctly implemented, and positively impact both the robustness and user experience of the library. The code is clean and I have no suggestions for improvement.
python/ray/data/_internal/execution/streaming_executor_state.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/streaming_executor_state.py
Outdated
Show resolved
Hide resolved
Signed-off-by: iamjustinhsu <[email protected]>
| except Exception as e: | ||
| # Unsure if there are cases where schemas are NOT hashable | ||
| logger.warning(f"Failed to hash the schemas (for deduplication): {e}") | ||
| logger.debug(f"Failed to hash the schemas (for deduplication): {e}") |
There was a problem hiding this comment.
changing this debug because this can bloat the logs
Signed-off-by: iamjustinhsu <[email protected]>
…ct#56093) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder ``` def concat( blocks: List["pyarrow.Table"], *, promote_types: bool = False ) -> "pyarrow.Table": ``` or implicity called when calling `count` on a dataset ``` def _cached_output_metadata # will grab all the metadata(including schema), not just rows ``` or in `BlockOutputBuffer` - This PR also reduces the log warning to truncate too. To centralize, added it in unify_schemas <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]> Signed-off-by: sampan <[email protected]>
…ct#56093) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder ``` def concat( blocks: List["pyarrow.Table"], *, promote_types: bool = False ) -> "pyarrow.Table": ``` or implicity called when calling `count` on a dataset ``` def _cached_output_metadata # will grab all the metadata(including schema), not just rows ``` or in `BlockOutputBuffer` - This PR also reduces the log warning to truncate too. To centralize, added it in unify_schemas <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]> Signed-off-by: jugalshah291 <[email protected]>
…ct#56093) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder ``` def concat( blocks: List["pyarrow.Table"], *, promote_types: bool = False ) -> "pyarrow.Table": ``` or implicity called when calling `count` on a dataset ``` def _cached_output_metadata # will grab all the metadata(including schema), not just rows ``` or in `BlockOutputBuffer` - This PR also reduces the log warning to truncate too. To centralize, added it in unify_schemas <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]> Signed-off-by: yenhong.wong <[email protected]>
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder ``` def concat( blocks: List["pyarrow.Table"], *, promote_types: bool = False ) -> "pyarrow.Table": ``` or implicity called when calling `count` on a dataset ``` def _cached_output_metadata # will grab all the metadata(including schema), not just rows ``` or in `BlockOutputBuffer` - This PR also reduces the log warning to truncate too. To centralize, added it in unify_schemas <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
…ct#56093) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder ``` def concat( blocks: List["pyarrow.Table"], *, promote_types: bool = False ) -> "pyarrow.Table": ``` or implicity called when calling `count` on a dataset ``` def _cached_output_metadata # will grab all the metadata(including schema), not just rows ``` or in `BlockOutputBuffer` - This PR also reduces the log warning to truncate too. To centralize, added it in unify_schemas <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: iamjustinhsu <[email protected]>
Why are these changes needed?
We need schemas to be hashable for schema deduplication. We previously removed metadata in Refbundle Creation, however, it can be called without a refbundle. For example, it can happen in delegating block builder
or implicity called when calling
counton a datasetor in
BlockOutputBufferTo centralize, added it in unify_schemas
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.