[data] Tensor Type __repr__ should be custom tensor types#56457

Merged
bveeramani merged 18 commits into ray-project:master from iamjustinhsu:jhsu/do-not-impersonate-schema-type
Sep 23, 2025
Conversation

Contributor

@iamjustinhsu iamjustinhsu commented Sep 11, 2025

Why are these changes needed?

When batch_format="pandas" is specified in map_batches, we convert to and from pandas blocks. With tensor extensions, we impersonate the types as numpy arrays, when they should be objects. This can cause confusion and lead to random errors in conversion, since pyarrow will use the dtype to reconstruct the object.
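
To make the failure mode concrete, here is a minimal toy model of a dtype whose recorded name is later used to rebuild the column. It is only a sketch: `ToyTensorDtype`, `REGISTRY`, and `reconstruct` are hypothetical names made up for illustration, and the lookup is a simplification, not pyarrow's or Ray's actual reconstruction logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyTensorDtype:
    """Hypothetical stand-in for a tensor extension dtype (not Ray's class)."""

    shape: Tuple[int, ...]
    element_dtype: str
    impersonate_numpy: bool = False

    @property
    def name(self) -> str:
        # The string a serializer would record in the column metadata.
        prefix = "numpy.ndarray" if self.impersonate_numpy else "ToyTensorDtype"
        return f"{prefix}(shape={self.shape}, dtype={self.element_dtype})"


# Toy registry (an assumption of this sketch): name prefix -> column kind.
REGISTRY: Dict[str, str] = {"ToyTensorDtype": "tensor extension column"}


def reconstruct(recorded_name: str) -> Optional[str]:
    """Resolve a recorded dtype name; None means the name matched nothing."""
    prefix = recorded_name.split("(", 1)[0]
    return REGISTRY.get(prefix)


honest = ToyTensorDtype((2, 2), "int64")
impersonating = ToyTensorDtype((2, 2), "int64", impersonate_numpy=True)

print(honest.name, "->", reconstruct(honest.name))
print(impersonating.name, "->", reconstruct(impersonating.name))
```

In this toy model, the honest name resolves to the extension column, while the impersonating name no longer identifies the type that produced it, so the reader has to guess.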

Old PR: iamjustinhsu#3

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: iamjustinhsu <[email protected]>
@iamjustinhsu iamjustinhsu requested review from a team as code owners September 11, 2025 17:48
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the string representation of TensorDtype to show TensorDtype(...) instead of numpy.ndarray(...). It also introduces a DataContext flag, pandas_block_ignore_metadata, to handle incorrect behavior in older pyarrow versions when converting Arrow tables with extension types to pandas DataFrames. A new test for string tensor roundtrips is also added.

My main feedback is that the new DataContext flag should have a version-dependent default value to automatically apply the fix for users with older pyarrow versions. The rest of the changes look good.
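
A version-dependent default along the lines suggested above could be sketched as follows. This is an illustration under stated assumptions, not the change that was merged: `parse_version` and `default_ignore_metadata` are hypothetical helpers, and the cutoff passed as `fixed_in` is a placeholder, since the discussion does not say which pyarrow release fixed the behavior.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '17.0.0rc1' -> (17, 0, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # Stop at suffixes such as 'dev123'.
        parts.append(int(digits))
    # Pad so that '15' and '15.0.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def default_ignore_metadata(pyarrow_version: str, fixed_in: Tuple[int, int, int]) -> bool:
    """Default the flag to True only for versions older than the cutoff."""
    return parse_version(pyarrow_version) < fixed_in


# Placeholder cutoff, chosen only for this example.
HYPOTHETICAL_CUTOFF = (15, 0, 0)

print(default_ignore_metadata("14.0.2", HYPOTHETICAL_CUTOFF))
print(default_ignore_metadata("15.0.0", HYPOTHETICAL_CUTOFF))
```

Computing the default once from the installed version would let users on older pyarrow get the workaround without setting the flag by hand, while still allowing an explicit override.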

@iamjustinhsu iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Sep 11, 2025
@ray-gardener ray-gardener bot added the "data" label (Ray Data-related issues) Sep 11, 2025
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 12, 2025 17:58
@iamjustinhsu iamjustinhsu force-pushed the jhsu/do-not-impersonate-schema-type branch from 1b419fa to 05874cc Compare September 12, 2025 18:00
@iamjustinhsu iamjustinhsu changed the title from "[data] Tensor Type __repr__ should be TensorDtype" to "[data] Tensor Type __repr__ should be custom tensor types" Sep 12, 2025
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner September 12, 2025 19:26
Contributor

@JasonLi1909 JasonLi1909 left a comment

stamped 👍

Contributor

@simonsays1980 simonsays1980 left a comment

LGTM. Thanks for the change @iamjustinhsu. Dumb question: Are there any changes in our Ray data pipelines?

@bveeramani bveeramani merged commit b89ca5f into ray-project:master Sep 23, 2025
5 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/do-not-impersonate-schema-type branch September 23, 2025 20:32
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…t#56457)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
When specifying batch_format="pandas" in map_batches, we convert to and
from pandas blocks. With tensor extensions, we impersonate the types as
numpy arrays, when they should be objects. This can cause confusion +
lead to random errors in conversion, since pyarrow will use the dtype to
reconstruct the object

Old PR: iamjustinhsu#3
<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <[email protected]>
Signed-off-by: zac <[email protected]>
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

5 participants