Skip to content

[Data] - Add DataType to Expression class#55915

Merged
alexeykudinkin merged 24 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/datatype_expr
Sep 4, 2025
Merged

[Data] - Add DataType to Expression class#55915
alexeykudinkin merged 24 commits intoray-project:masterfrom
goutamvenkat-anyscale:goutam/datatype_expr

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

Why are these changes needed?

Add return_dtype to all Expr types.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner August 25, 2025 23:05
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new DataType class to represent data types within Ray Data expressions, and integrates it into the Expr class hierarchy. The new DataType class provides a unified abstraction over PyArrow, NumPy, and Python types, with methods for conversion and type inference.

The implementation is a good first step. I've identified a few areas for improvement in the new DataType class to enhance its robustness and correctness, particularly around type conversion fallbacks and exception handling. My comments focus on ensuring the new DataType class is as reliable and performant as possible.

@goutamvenkat-anyscale goutamvenkat-anyscale added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Aug 25, 2025
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)

# Type checking methods
def is_arrow_type(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could use a field like self._internal_type_kind or similar to store the kind of type on construction and add some type of enum to check against rather than having to constantly call these functions / the underlying isinstance function.


def __post_init__(self):
"""Validate the _internal_type after initialization."""
if not isinstance(self._internal_type, (pa.DataType, np.dtype, type)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to match support of batch UDFs: ie we'd have to support PA, Numpy, Pandas, Python (not a fan of the latter but since we already support it)

Copy link
Contributor Author

@goutamvenkat-anyscale goutamvenkat-anyscale Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if I add Pandas support in a separate PR? I've added a TODO for this.

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V <[email protected]>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 4, 2025 01:23
@github-actions github-actions bot disabled auto-merge September 4, 2025 03:28
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 4, 2025 04:12
Signed-off-by: Goutam V. <[email protected]>
@github-actions github-actions bot disabled auto-merge September 4, 2025 17:56
@alexeykudinkin alexeykudinkin merged commit 5022884 into ray-project:master Sep 4, 2025
5 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/datatype_expr branch September 4, 2025 19:17
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Add return_dtype to all Expr types.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: sampan <[email protected]>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Add return_dtype to all Expr types.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: jugalshah291 <[email protected]>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Add return_dtype to all Expr types.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: yenhong.wong <[email protected]>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Add return_dtype to all Expr types.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
snorkelopstesting4-web pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_55915_6b89ff95-a69e-45de-bced-b95a4351578a that referenced this pull request Oct 22, 2025
Original PR #55915 by goutamvenkat-anyscale
Original: ray-project/ray#55915
snorkelopstesting3-bot added a commit to snorkel-marlin-repos/ray-project_ray_pr_55915_6b89ff95-a69e-45de-bced-b95a4351578a that referenced this pull request Oct 22, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Add return_dtype to all Expr types.


## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Goutam V <[email protected]>
Signed-off-by: Goutam V. <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants