[data] Tensor Type __repr__ should be custom tensor types by iamjustinhsu · Pull Request #56457 · ray-project/ray

iamjustinhsu · 2025-09-11T17:48:57Z

Why are these changes needed?

When specifying batch_format="pandas" in map_batches, we convert to and from pandas blocks. With tensor extensions, we impersonate the types as numpy arrays, when they should be objects. This can cause confusion and lead to random errors in conversion, since pyarrow uses the dtype to reconstruct the object.
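
To illustrate the failure mode described above, here is a toy model in plain Python. It is a sketch only: `ToyTensorDtype` and `reconstruct_kind` are hypothetical names invented for this example and are not Ray's `TensorDtype` or pyarrow's conversion code. The assumption being modeled is that a consumer picks its reconstruction path from the dtype's printed name, so a dtype that prints as `numpy.ndarray(...)` gets routed down the wrong path.

```python
class ToyTensorDtype:
    """Hypothetical stand-in for a tensor extension dtype (not Ray's class)."""

    def __init__(self, shape, element_type, impersonate=False):
        self.shape = tuple(shape)
        self.element_type = element_type
        # When True, the dtype prints as if it were a plain numpy array type.
        self.impersonate = impersonate

    def __repr__(self):
        label = "numpy.ndarray" if self.impersonate else "ToyTensorDtype"
        return f"{label}(shape={self.shape}, dtype={self.element_type})"


def reconstruct_kind(dtype_name):
    """Mimic a consumer that chooses how to rebuild a column from its dtype name."""
    if dtype_name.startswith("ToyTensorDtype("):
        return "extension"
    return "plain"


impersonating = ToyTensorDtype((2, 2), "int64", impersonate=True)
honest = ToyTensorDtype((2, 2), "int64")

print(repr(impersonating), "->", reconstruct_kind(repr(impersonating)))
print(repr(honest), "->", reconstruct_kind(repr(honest)))
```

In this toy, only the dtype that reports its own name is recognized as an extension type, which is the behavior the title change is aiming for.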

Old PR: iamjustinhsu#3

Related issue number

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
    corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(


gemini-code-assist

Code Review

This pull request updates the string representation of TensorDtype to show TensorDtype(...) instead of numpy.ndarray(...). It also introduces a DataContext flag, pandas_block_ignore_metadata, to handle incorrect behavior in older pyarrow versions when converting Arrow tables with extension types to pandas DataFrames. A new test for string tensor roundtrips is also added.

My main feedback is that the new DataContext flag should have a version-dependent default value to automatically apply the fix for users with older pyarrow versions. The rest of the changes look good.
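
One way to read the suggestion about a version-dependent default, as a sketch only: compute the flag's default from the installed pyarrow version, and let an environment variable override it. Everything named here is an assumption for illustration: the environment variable `HYPOTHETICAL_IGNORE_METADATA`, the helper names, and especially the threshold `ASSUMED_FIRST_FIXED_VERSION`, since the discussion does not say which pyarrow versions are affected. The version string is passed in as an argument so the example runs without pyarrow installed.

```python
import os
from itertools import takewhile

# Placeholder threshold: the review does not state which pyarrow versions misbehave.
ASSUMED_FIRST_FIXED_VERSION = (14, 0, 0)


def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH...' into a 3-tuple of ints, ignoring suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def default_ignore_metadata(pyarrow_version, env=None):
    """Return the default for a hypothetical 'ignore metadata' flag.

    An explicit environment setting wins; otherwise the default is True only
    for versions older than the assumed first fixed release.
    """
    env = os.environ if env is None else env
    raw = env.get("HYPOTHETICAL_IGNORE_METADATA")
    if raw is not None:
        return raw.strip().lower() in ("1", "true", "yes")
    return parse_version(pyarrow_version) < ASSUMED_FIRST_FIXED_VERSION


print(default_ignore_metadata("13.0.0", env={}))
print(default_ignore_metadata("17.0.0", env={}))
```

With a default like this, users on older versions would get the workaround automatically, while anyone who needs the other behavior could still set the flag explicitly.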

python/ray/data/context.py


JasonLi1909

stamped 👍

simonsays1980

LGTM. Thanks for the change @iamjustinhsu. Dumb question: Are there any changes in our Ray data pipelines?


iamjustinhsu added 11 commits September 9, 2025 17:23

[data] ignore metadata for pandas block

1f7dcec

Signed-off-by: iamjustinhsu <[email protected]>

test

ed0a738

Signed-off-by: iamjustinhsu <[email protected]>

[data] Tensor Type __repr__ should be object

4092b7a

Signed-off-by: iamjustinhsu <[email protected]>

add comment

d49f3c3

Signed-off-by: iamjustinhsu <[email protected]>

lint

f6fc010

Signed-off-by: iamjustinhsu <[email protected]>

add config param

4c9a4e1

Signed-off-by: iamjustinhsu <[email protected]>

rename

1c341ad

Signed-off-by: iamjustinhsu <[email protected]>

fix

4190dc5

Signed-off-by: iamjustinhsu <[email protected]>

Merge branch 'jhsu/fix-tensor-strings' of https://github.com/iamjusti…

66e6ae3

…nhsu/ray into jhsu/do-not-impersonate-schema-type

env_bool

e727999

Signed-off-by: iamjustinhsu <[email protected]>

comment

e655da0

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu requested review from a team as code owners September 11, 2025 17:48

rebase

fcecbf8

Signed-off-by: iamjustinhsu <[email protected]>

gemini-code-assist bot reviewed Sep 11, 2025


python/ray/data/context.py (resolved)

iamjustinhsu added the "go" label (add ONLY when ready to merge, run all tests) Sep 11, 2025

ray-gardener bot added the "data" label (Ray Data-related issues) Sep 11, 2025

fix

09f5d90

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu requested a review from a team as a code owner September 12, 2025 17:58

fix docs

05874cc

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu force-pushed the jhsu/do-not-impersonate-schema-type branch from 1b419fa to 05874cc Compare September 12, 2025 18:00

iamjustinhsu changed the title ~~[data] Tensor Type __repr__ should be TensorDtype~~ [data] Tensor Type __repr__ should be custom tensor types Sep 12, 2025

more docs

6b4dc4d

Signed-off-by: iamjustinhsu <[email protected]>

iamjustinhsu requested a review from a team as a code owner September 12, 2025 19:26

iamjustinhsu added 3 commits September 12, 2025 13:59

doc test

c6a5241

Signed-off-by: iamjustinhsu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into jhsu…

e450e1f

…/do-not-impersonate-schema-type

doc

48fd55c

Signed-off-by: iamjustinhsu <[email protected]>

alexeykudinkin approved these changes Sep 16, 2025


JasonLi1909 approved these changes Sep 18, 2025


simonsays1980 approved these changes Sep 22, 2025


bveeramani merged commit b89ca5f into ray-project:master Sep 23, 2025
5 checks passed

iamjustinhsu deleted the jhsu/do-not-impersonate-schema-type branch September 23, 2025 20:32

bveeramani merged 18 commits into ray-project:master from iamjustinhsu:jhsu/do-not-impersonate-schema-type

5 participants
