Skip to content

[Data] Refine batch typing in docs#58971

Merged
alexeykudinkin merged 26 commits intoray-project:masterfrom
owenowenisme:data/refine-batch-typing-doc
Dec 3, 2025
Merged

[Data] Refine batch typing in docs#58971
alexeykudinkin merged 26 commits intoray-project:masterfrom
owenowenisme:data/refine-batch-typing-doc

Conversation

@owenowenisme
Copy link
Member

@owenowenisme owenowenisme commented Nov 25, 2025

Description

  • PyArrow Batch Format Integration: The documentation now explicitly includes pyarrow.Table as a supported batch format for Ray Data operations, enhancing flexibility for users.
  • Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling.
  • Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with take_batch and map_batches have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.
  • Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for iter_batches.

Related issues

Closes #58615

Additional information

@owenowenisme owenowenisme requested review from a team as code owners November 25, 2025 14:03
@owenowenisme
Copy link
Member Author

/gemini summary

@owenowenisme owenowenisme added docs An issue or change related to documentation data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Nov 25, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refines the documentation for batch typing in Ray Data, primarily by adding information and examples for the pyarrow batch format. The changes are generally good and improve the documentation's completeness. I've found a few minor issues: a typo in an example output, an incorrect representation of a pyarrow.Table in a doctest, and an outdated type hint for DataBatch. I've left specific comments for these. Additionally, please note that the DataBatch type hint on line 225 of doc/source/data/transforming-data.rst also needs to be updated to include pyarrow.Table for consistency.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

This pull request significantly enhances the Ray Data documentation by refining the explanation and usage of batch formats. It introduces pyarrow.Table as a first-class batch format, clarifies its relationship with internal data storage, and provides comprehensive examples and strategic guidance for users to select the most appropriate batch format for their data processing needs.

Highlights

  • PyArrow Batch Format Integration: The documentation now explicitly includes pyarrow.Table as a supported batch format for Ray Data operations, enhancing flexibility for users.
  • Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling.
  • Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with take_batch and map_batches have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.
  • Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for iter_batches.
Changelog
  • doc/source/data/inspecting-data.rst
    • Clarified that batch format is independent of internal block representation.
    • Added a pyarrow example for Dataset.take_batch().
  • doc/source/data/transforming-data.rst
    • Updated the description of supported batch formats to include pyarrow Tables.
    • Added a pyarrow example for Dataset.map_batches().
    • Introduced a new section "Choosing the right batch format" with recommendations for numpy, pandas, and pyarrow.
    • Updated the DataBatch type hint description to include pyarrow.Table.
  • doc/source/ray-references/glossary.rst
    • Expanded the "Batch format" definition to emphasize its independence from internal block storage.
    • Modified the ray.data.range example to 15 and added a pyarrow example for iter_batches.
Activity
  • owenowenisme requested a summary of the pull request.
  • gemini-code-assist[bot] provided feedback on a minor typo in a pyarrow example output for sepal width (cm).
  • gemini-code-assist[bot] suggested updating the DataBatch type hint in transforming-data.rst to include pyarrow.Table.
  • gemini-code-assist[bot] pointed out an incorrect string representation for pyarrow.Table in a glossary example and suggested a more accurate format.

Comment on lines +202 to +204
* Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations.
* Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms.
* Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations.
* Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms.
* Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems.
When choosing appropriate batch format for your ``map_batches`` primary consideration is a trade off of convenience vs performance:
1. Batches serve as a sliding window into the underlying block -- your UDF will be invoked with a subset of rows (making up a batch) of the underlying block.
2. Depending on the batch format such view can either be zero-copy (when batch format matches block type) or copying (for ex, when using batch format differing from the block type).
For ex, if you prefer to work with Panda's or Numpy batches you can specify either ``batch_format="pandas"`` or ``batch_format="numpy"`` (default) which might copy the underlying data when converting it from the underlying block type (for ex, if the block type is Arrow).
Note that, by default block type is Arrow (what most Ray Data readers are producing). However, Ray Data strives to minimize amount of data conversions: for ex, if your ``map_batches`` operation returns Panda's batches then these batches will be combined into blocks *without* conversion and propagated further as Panda's blocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the batch format such view can either be zero-copy (when batch format matches block type) or copying (for ex, when using batch format differing from the block type).

Does copying happen everytime if batch format and block type dont match? or are there combinations where it doesnt happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin Updated, but shouldn't we have some instruction to show the use cases for each format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does copying happen everytime if batch format and block type dont match? or are there combinations where it doesnt happen?

Yeah, every time batch format and block type don't match the conversion is happened

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin On second thought we should not expose too much internal detail here, users only care about which format to choose, I added "Use numpy in map_batches..." back and summarize zero-copy detail into

 Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK i fixed this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@owenowenisme we shouldn't provide confusing guidance to the users and instead we'd clearly call out considerations they might not be aware of:

  • Users can figure out for themselves when they prefer to use numpy/pandas/pyarrow based on personal preferences
  • Our goal here is to guide them to explain the effects of their choice (ie copying, potential type-system misalignment when going pyarrow > numpy, etc)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have both; I think you need to mention the copy implications, but you also need to mention that beyond that, you can use any of the formats (and usually people use X format for Y purpose)

The reason for the latter is because I think early users will often appreciate the explicit guidance, even if it seems obvious

I believe that is what the current text has, which I made the edit on, but let's align

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Richard, we're totally on the same page here. However, our current guidance is preferential and is at best hand-waivy.

My point was simply that rather that we'd guide users to weigh in on tradeoffs of their decisions by putting out some of these hard-to-know aspects front and center for the user to ultimately to take a choice.

If you feel my framing was too technical happy to discuss alternatives, but i want to make sure we're not confusing users with our guidance here.

Signed-off-by: You-Cheng Lin <[email protected]>
@owenowenisme
Copy link
Member Author

@richardliaw Will add polars in another pr
Since its not related to batch
#58996

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

print(batch)

.. testoutput::
:options: +MOCK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we can't actually test this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just copying pandas test code and convert to pyarrow.
I didn't thought about it. Do you think we should add numpy & pandas test back?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha. I don't recall the historical context for why we're mocking this.

I think for simplicity/not slow down this PR, I think we should remove +MOCK for pyarrow only

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.


The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe out-of-scope for this PR, but I felt like the previous paragraph didn;t really transition to this paragraph at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think its better to move this part to batch in glossary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if we want people to read this. My guess is that few people actually read the glossary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I'll just remove it then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To clarify, I think this list is useful and we should keep it somewhere)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK i made it flow better.

owenowenisme and others added 5 commits November 26, 2025 14:53
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
import ray

def drop_nas(batch: pa.Table) -> pa.Table:
return pc.drop_null(batch)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect PyArrow function usage on Table

The pc.drop_null() function operates on individual arrays or chunked arrays, not on pa.Table objects. This means the function won't work as intended to drop rows with null values (analogous to batch.dropna() in pandas). The correct approach would be using pc.filter() with a mask or iterating over columns, not directly calling pc.drop_null() on the table.

Fix in Cursor Fix in Web

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.


The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(To clarify, I think this list is useful and we should keep it somewhere)

Signed-off-by: Richard Liaw <[email protected]>
@richardliaw richardliaw enabled auto-merge (squash) November 27, 2025 00:28
Signed-off-by: You-Cheng Lin <[email protected]>
@github-actions github-actions bot disabled auto-merge November 27, 2025 02:46
Copy link
Contributor

@alexeykudinkin alexeykudinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

richardliaw and others added 2 commits December 2, 2025 15:56
Signed-off-by: Alexey Kudinkin <[email protected]>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) December 3, 2025 02:19
@github-actions github-actions bot disabled auto-merge December 3, 2025 07:26
Signed-off-by: You-Cheng Lin <[email protected]>
@owenowenisme owenowenisme force-pushed the data/refine-batch-typing-doc branch from cded09e to 8cc030c Compare December 3, 2025 07:39
Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't merge yet - will chat sync

Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) December 3, 2025 21:22
@alexeykudinkin alexeykudinkin merged commit 1235600 into ray-project:master Dec 3, 2025
7 checks passed
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
## Description

- PyArrow Batch Format Integration: The documentation now explicitly
includes `pyarrow.Table` as a supported batch format for Ray Data
operations, enhancing flexibility for users.
- Batch Format Independence Clarification: Added text to clarify that
the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of
how Ray Data stores its underlying blocks, providing a clearer
understanding of data handling.
- Enhanced Examples and Guidance: New examples demonstrating pyarrow
batch usage with `take_batch` and `map_batches` have been added, along
with strategic guidance on when to choose numpy, pandas, or pyarrow
formats for different use cases.
- Glossary Update: The glossary definition for 'Batch format' has been
expanded to include its independence from internal block representation
and now features a pyarrow example for `iter_batches`.
## Related issues
Closes ray-project#58615

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Alexey Kudinkin <[email protected]>
Signed-off-by: peterxcli <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues docs An issue or change related to documentation go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[data] Improve documentation around typing and batch formats

5 participants