[data] remove metadata for hashing + truncate warning logs #56093

Merged
alexeykudinkin merged 3 commits into ray-project:master from iamjustinhsu:jhsu/suppress-warnings-in-schema-divergence
Sep 3, 2025

Conversation

@iamjustinhsu
Contributor

Why are these changes needed?

We need schemas to be hashable for schema deduplication. We previously removed metadata during RefBundle creation; however, schemas can also be unified without a RefBundle. For example, this can happen in the delegating block builder:

def concat(
    blocks: List["pyarrow.Table"], *, promote_types: bool = False
) -> "pyarrow.Table":

or implicitly when calling count on a dataset:

def _cached_output_metadata
# will grab all the metadata(including schema), not just rows 

or in BlockOutputBuffer
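
To illustrate why metadata gets in the way of deduplication, here is a minimal sketch. It does not use pyarrow: FakeSchema is a hypothetical stand-in whose dict metadata makes it unhashable, which is assumed to mirror the behavior this PR works around, and dedupe_schemas is an illustrative name, not Ray's API.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FakeSchema:
    """Hypothetical stand-in for a schema: hashable fields plus optional dict metadata."""

    fields: Tuple[Tuple[str, str], ...]
    metadata: Optional[Dict[str, str]] = None

    def remove_metadata(self) -> "FakeSchema":
        # Return a copy without metadata; the copy is hashable because
        # the generated __hash__ no longer has to hash a dict.
        return replace(self, metadata=None)


def dedupe_schemas(schemas: List[FakeSchema]) -> List[FakeSchema]:
    """Strip metadata first, then drop duplicates in first-seen order."""
    seen = set()
    unique = []
    for schema in schemas:
        stripped = schema.remove_metadata()
        if stripped not in seen:
            seen.add(stripped)
            unique.append(stripped)
    return unique


a = FakeSchema((("id", "int64"),), {"origin": "reader-a"})
b = FakeSchema((("id", "int64"),), {"origin": "reader-b"})
unique = dedupe_schemas([a, b])  # both collapse to one metadata-free schema
```

Calling hash(a) directly raises TypeError because of the dict, so stripping has to happen wherever schemas are unified, not only on one code path.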

  • This PR also truncates the schemas printed in the warning log.
    To centralize the logic, metadata removal was added to unify_schemas.
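
As a rough sketch of the truncation idea, a helper along these lines would keep long schema strings from flooding a warning. The name truncate_for_log, the 200-character default, and the suffix format are assumptions for illustration, not what the PR actually implements.

```python
def truncate_for_log(text: str, max_chars: int = 200) -> str:
    """Return text unchanged if short, else cut it and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... ({omitted} more characters)"


# Hypothetical usage: a long schema repr shortened before it is logged.
long_repr = "col: int64, " * 100
short_repr = truncate_for_log(long_repr, max_chars=40)
```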

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iamjustinhsu iamjustinhsu requested a review from a team as a code owner August 29, 2025 20:13
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces two valuable improvements. First, it correctly centralizes the logic for removing metadata from pyarrow schemas within the unify_schemas function. This ensures schemas are hashable, which is essential for deduplication and overall correctness, addressing a shortcoming of the previous implementation. Second, it enhances the logging by truncating long schema representations in warning messages, which significantly improves readability. The changes are well-justified, correctly implemented, and positively impact both the robustness and user experience of the library. The code is clean and I have no suggestions for improvement.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Aug 30, 2025
Signed-off-by: iamjustinhsu <[email protected]>
  except Exception as e:
      # Unsure if there are cases where schemas are NOT hashable
-     logger.warning(f"Failed to hash the schemas (for deduplication): {e}")
+     logger.debug(f"Failed to hash the schemas (for deduplication): {e}")
Contributor Author

Changing this to debug because it can bloat the logs.
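
For context, the pattern under discussion can be sketched as below. Only the except clause and the debug-level message come from the diff; the function name dedupe_or_passthrough and the fallback of returning the inputs unchanged are assumptions made for the example.

```python
import logging
from typing import Any, List, Sequence

logger = logging.getLogger(__name__)


def dedupe_or_passthrough(items: Sequence[Any]) -> List[Any]:
    """Deduplicate hashable items; if hashing fails, keep every item."""
    try:
        # dict.fromkeys preserves first-seen order and requires hashable keys.
        return list(dict.fromkeys(items))
    except Exception as e:
        # Debug rather than warning: this can fire once per call and bloat logs.
        logger.debug(f"Failed to hash the schemas (for deduplication): {e}")
        return list(items)
```

With hashable inputs the duplicates are dropped; with unhashable ones the list comes back as-is and the failure is only visible at debug level.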

Signed-off-by: iamjustinhsu <[email protected]>
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Sep 2, 2025
@alexeykudinkin alexeykudinkin merged commit 3143071 into ray-project:master Sep 3, 2025
6 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/suppress-warnings-in-schema-divergence branch September 3, 2025 23:15
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
Signed-off-by: iamjustinhsu <[email protected]>
Signed-off-by: sampan <[email protected]>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: iamjustinhsu <[email protected]>
Signed-off-by: jugalshah291 <[email protected]>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
Signed-off-by: iamjustinhsu <[email protected]>
Signed-off-by: yenhong.wong <[email protected]>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Signed-off-by: iamjustinhsu <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: iamjustinhsu <[email protected]>
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests
