[Data] Refine batch typing in docs by owenowenisme · Pull Request #58971 · ray-project/ray

owenowenisme · 2025-11-25T14:03:07Z

Description

- PyArrow Batch Format Integration: The documentation now explicitly includes pyarrow.Table as a supported batch format for Ray Data operations, enhancing flexibility for users.
- Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling.
- Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with take_batch and map_batches have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.
- Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for iter_batches.

Related issues

Closes #58615

Additional information

Signed-off-by: You-Cheng Lin <[email protected]>

owenowenisme · 2025-11-25T14:03:50Z

/gemini summary

gemini-code-assist

Code Review

This pull request refines the documentation for batch typing in Ray Data, primarily by adding information and examples for the pyarrow batch format. The changes are generally good and improve the documentation's completeness. I've found a few minor issues: a typo in an example output, an incorrect representation of a pyarrow.Table in a doctest, and an outdated type hint for DataBatch. I've left specific comments for these. Additionally, please note that the DataBatch type hint on line 225 of doc/source/data/transforming-data.rst also needs to be updated to include pyarrow.Table for consistency.

doc/source/data/inspecting-data.rst

doc/source/data/transforming-data.rst

doc/source/ray-references/glossary.rst

gemini-code-assist · 2025-11-25T14:05:18Z

Summary of Changes

This pull request significantly enhances the Ray Data documentation by refining the explanation and usage of batch formats. It introduces pyarrow.Table as a first-class batch format, clarifies its relationship with internal data storage, and provides comprehensive examples and strategic guidance for users to select the most appropriate batch format for their data processing needs.

Highlights

- PyArrow Batch Format Integration: The documentation now explicitly includes pyarrow.Table as a supported batch format for Ray Data operations, enhancing flexibility for users.
- Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling.
- Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with take_batch and map_batches have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.
- Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for iter_batches.

Changelog

doc/source/data/inspecting-data.rst
- Clarified that batch format is independent of internal block representation.
- Added a pyarrow example for Dataset.take_batch().
doc/source/data/transforming-data.rst
- Updated the description of supported batch formats to include pyarrow Tables.
- Added a pyarrow example for Dataset.map_batches().
- Introduced a new section "Choosing the right batch format" with recommendations for numpy, pandas, and pyarrow.
- Updated the DataBatch type hint description to include pyarrow.Table.
doc/source/ray-references/glossary.rst
- Expanded the "Batch format" definition to emphasize its independence from internal block storage.
- Modified the ray.data.range example to 15 and added a pyarrow example for iter_batches.

Activity

owenowenisme requested a summary of the pull request.
gemini-code-assist[bot] provided feedback on a minor typo in a pyarrow example output for sepal width (cm).
gemini-code-assist[bot] suggested updating the DataBatch type hint in transforming-data.rst to include pyarrow.Table.
gemini-code-assist[bot] pointed out an incorrect string representation for pyarrow.Table in a glossary example and suggested a more accurate format.

alexeykudinkin · 2025-11-25T18:38:09Z

doc/source/data/transforming-data.rst

+* Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations.
+* Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms.
+* Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems.


Suggested change

* Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations.

* Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms.

* Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems.

When choosing a batch format for ``map_batches``, the primary consideration is the trade-off between convenience and performance:

1. Batches serve as a sliding window into the underlying block: Ray Data invokes your UDF with a subset of the block's rows (a batch).

2. Depending on the batch format, this view is either zero-copy (when the batch format matches the block type) or copying (when the batch format differs from the block type).

For example, if you prefer to work with pandas or NumPy batches, you can specify ``batch_format="pandas"`` or ``batch_format="numpy"`` (the default), which might copy the underlying data when converting it from the underlying block type (for example, if the block type is Arrow).

Note that the default block type is Arrow, which is what most Ray Data readers produce. However, Ray Data strives to minimize the number of data conversions: for example, if your ``map_batches`` operation returns pandas batches, Ray Data combines them into blocks *without* conversion and propagates them further as pandas blocks.

Depending on the batch format, this view is either zero-copy (when the batch format matches the block type) or copying (when the batch format differs from the block type).

Does copying happen every time the batch format and block type don't match, or are there combinations where it doesn't happen?

@alexeykudinkin Updated, but shouldn't we have some instruction to show the use cases for each format?

Does copying happen every time the batch format and block type don't match, or are there combinations where it doesn't happen?

Yeah, a conversion happens every time the batch format and block type don't match.

@alexeykudinkin On second thought, we shouldn't expose too much internal detail here; users only care about which format to choose. I added "Use numpy in map_batches..." back and summarized the zero-copy detail into:

Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.

OK, I fixed this.

@owenowenisme We shouldn't give users confusing guidance; instead, we should clearly call out considerations they might not be aware of:

Users can figure out for themselves whether they prefer numpy, pandas, or pyarrow based on personal preference.

Our goal here is to explain the effects of their choice (that is, copying, potential type-system misalignment when going from pyarrow to numpy, and so on).

I think we need to have both; I think you need to mention the copy implications, but you also need to mention that beyond that, you can use any of the formats (and usually people use X format for Y purpose)

The reason for the latter is because I think early users will often appreciate the explicit guidance, even if it seems obvious

I believe that is what the current text has, which I made the edit on, but let's align

Richard, we're totally on the same page here. However, our current guidance is preferential and at best hand-wavy.

My point was simply that we should instead guide users to weigh the trade-offs of their decisions by putting some of these hard-to-know aspects front and center, so that the user can ultimately make the choice.

If you feel my framing was too technical, I'm happy to discuss alternatives, but I want to make sure we're not confusing users with our guidance here.

Signed-off-by: You-Cheng Lin <[email protected]>

doc/source/data/transforming-data.rst

Signed-off-by: You-Cheng Lin <[email protected]>

owenowenisme · 2025-11-26T05:04:56Z

@richardliaw I'll add polars in another PR, since it's not related to batch formats:
#58996

bveeramani

Overall LGTM

bveeramani · 2025-11-26T06:37:55Z

doc/source/data/inspecting-data.rst

+            print(batch)
+
+        .. testoutput::
+            :options: +MOCK


Any reason we can't actually test this?

I just copied the pandas test code and converted it to pyarrow.
I didn't think about it. Do you think we should add the numpy and pandas tests back?

Ah, gotcha. I don't recall the historical context for why we're mocking this.

For simplicity, and to avoid slowing down this PR, I think we should remove +MOCK for pyarrow only.

doc/source/data/transforming-data.rst

bveeramani · 2025-11-26T06:42:29Z

doc/source/data/transforming-data.rst

+    Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.
+

 The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches


Maybe out of scope for this PR, but I felt like the previous paragraph didn't really transition to this paragraph at all.

Do you think it's better to move this part to the batch entry in the glossary?

Not if we want people to read this. My guess is that few people actually read the glossary.

Okay, I'll just remove it then.

(To clarify, I think this list is useful and we should keep it somewhere)

OK, I made it flow better.

Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]>

Signed-off-by: You-Cheng Lin <[email protected]>

cursor · 2025-11-26T17:40:19Z

doc/source/data/transforming-data.rst

+            import ray
+
+            def drop_nas(batch: pa.Table) -> pa.Table:
+                return pc.drop_null(batch)


Bug: Incorrect PyArrow function usage on Table

The pc.drop_null() function operates on individual arrays or chunked arrays, not on pa.Table objects. This means the function won't work as intended to drop rows with null values (analogous to batch.dropna() in pandas). The correct approach would be using pc.filter() with a mask or iterating over columns, not directly calling pc.drop_null() on the table.

bveeramani

LGTM

bveeramani · 2025-11-26T19:11:27Z

doc/source/data/transforming-data.rst

+    Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.
+

 The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches


(To clarify, I think this list is useful and we should keep it somewhere)

Signed-off-by: Richard Liaw <[email protected]>

doc/source/data/transforming-data.rst

Signed-off-by: Richard Liaw <[email protected]>

Signed-off-by: You-Cheng Lin <[email protected]>

alexeykudinkin

#58971 (comment)

doc/source/data/transforming-data.rst

Signed-off-by: Richard Liaw <[email protected]>

doc/source/data/transforming-data.rst

Signed-off-by: Richard Liaw <[email protected]>

Signed-off-by: Alexey Kudinkin <[email protected]>

Signed-off-by: You-Cheng Lin <[email protected]>

richardliaw

don't merge yet - will chat sync

Signed-off-by: Richard Liaw <[email protected]>

Addressed

## Description

- PyArrow Batch Format Integration: The documentation now explicitly includes `pyarrow.Table` as a supported batch format for Ray Data operations, enhancing flexibility for users.
- Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling.
- Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with `take_batch` and `map_batches` have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.
- Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for `iter_batches`.

## Related issues

Closes ray-project#58615

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Signed-off-by: peterxcli <[email protected]>

owenowenisme added 7 commits November 25, 2025 21:25

add pyarrow example in inspect-data docs

323442f

Signed-off-by: You-Cheng Lin <[email protected]>

add pyarrow to transform data

af29e1e

Signed-off-by: You-Cheng Lin <[email protected]>

add pyarrow to glossary

c446373

Signed-off-by: You-Cheng Lin <[email protected]>

add description batch format is not the same as block

a10b809

Signed-off-by: You-Cheng Lin <[email protected]>

update Configuring batch format description

5e03792

Signed-off-by: You-Cheng Lin <[email protected]>

add guide of choosing batch format

d443ac7

Signed-off-by: You-Cheng Lin <[email protected]>

update

d85598d

Signed-off-by: You-Cheng Lin <[email protected]>

owenowenisme requested review from a team as code owners November 25, 2025 14:03

owenowenisme added the docs, data, and go labels Nov 25, 2025

gemini-code-assist bot reviewed Nov 25, 2025

doc/source/data/inspecting-data.rst (resolved)

doc/source/data/transforming-data.rst (outdated, resolved)

doc/source/ray-references/glossary.rst (resolved)

alexeykudinkin reviewed Nov 25, 2025


update

fa992b2

Signed-off-by: You-Cheng Lin <[email protected]>

richardliaw reviewed Nov 26, 2025

doc/source/data/transforming-data.rst (resolved)

owenowenisme and others added 2 commits November 26, 2025 13:00

revert internal details

fe186d9

Signed-off-by: You-Cheng Lin <[email protected]>

Merge branch 'master' into data/refine-batch-typing-doc

4321350

bveeramani reviewed Nov 26, 2025


owenowenisme and others added 5 commits November 26, 2025 14:53

Update doc/source/data/transforming-data.rst

8508d38

Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]>

add tip

4756a75

Signed-off-by: You-Cheng Lin <[email protected]>

remove mock for pyarrow

2e0e94f

Signed-off-by: You-Cheng Lin <[email protected]>

remove internal detail of zero-copy

f35684c

Signed-off-by: You-Cheng Lin <[email protected]>

remove trailing space

a7f5e82

Signed-off-by: You-Cheng Lin <[email protected]>

cursor bot reviewed Nov 26, 2025


bveeramani approved these changes Nov 26, 2025


Update transforming-data.rst

486e47b

Signed-off-by: Richard Liaw <[email protected]>

richardliaw reviewed Nov 27, 2025

doc/source/data/transforming-data.rst (outdated, resolved)

Apply suggestion from @richardliaw

4ee05ed

Signed-off-by: Richard Liaw <[email protected]>

richardliaw enabled auto-merge (squash) November 27, 2025 00:28

fix linting

c768022

Signed-off-by: You-Cheng Lin <[email protected]>

github-actions bot disabled auto-merge November 27, 2025 02:46

alexeykudinkin requested changes Nov 27, 2025


richardliaw reviewed Dec 2, 2025

doc/source/data/transforming-data.rst (outdated, resolved)

Apply suggestion from @richardliaw

2e1066f

Signed-off-by: Richard Liaw <[email protected]>

richardliaw approved these changes Dec 2, 2025


richardliaw reviewed Dec 2, 2025

doc/source/data/transforming-data.rst (outdated, resolved)

richardliaw and others added 2 commits December 2, 2025 15:56

Apply suggestion from @richardliaw

9373a1b

Signed-off-by: Richard Liaw <[email protected]>

Tidying up

81e5272

Signed-off-by: Alexey Kudinkin <[email protected]>

alexeykudinkin approved these changes Dec 3, 2025


alexeykudinkin enabled auto-merge (squash) December 3, 2025 02:19

github-actions bot disabled auto-merge December 3, 2025 07:26

fix linting

8cc030c

Signed-off-by: You-Cheng Lin <[email protected]>

owenowenisme force-pushed the data/refine-batch-typing-doc branch from cded09e to 8cc030c December 3, 2025 07:39

Merge branch 'master' into data/refine-batch-typing-doc

96f2b4e

richardliaw previously requested changes Dec 3, 2025


richardliaw added 3 commits December 3, 2025 12:14

update

be55973

Signed-off-by: Richard Liaw <[email protected]>

vale

1d3b691

Signed-off-by: Richard Liaw <[email protected]>

typo

d88fd2c

Signed-off-by: Richard Liaw <[email protected]>

alexeykudinkin approved these changes Dec 3, 2025


alexeykudinkin enabled auto-merge (squash) December 3, 2025 21:22

alexeykudinkin requested a review from richardliaw December 3, 2025 21:23

alexeykudinkin merged commit 1235600 into ray-project:master Dec 3, 2025
7 checks passed

daiping8 mentioned this pull request Dec 4, 2025

[CI][Doc] Update the example output of the dataset iterator. #59169

Merged

owenowenisme mentioned this pull request Dec 4, 2025

[Data] Add polars usage in Ray Data docs #58996

Closed

