[Data] Refine batch typing in docs#58971
Conversation
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
|
/gemini summary |
There was a problem hiding this comment.
Code Review
This pull request refines the documentation for batch typing in Ray Data, primarily by adding information and examples for the pyarrow batch format. The changes are generally good and improve the documentation's completeness. I've found a few minor issues: a typo in an example output, an incorrect representation of a pyarrow.Table in a doctest, and an outdated type hint for DataBatch. I've left specific comments for these. Additionally, please note that the DataBatch type hint on line 225 of doc/source/data/transforming-data.rst also needs to be updated to include pyarrow.Table for consistency.
Summary of ChangesThis pull request significantly enhances the Ray Data documentation by refining the explanation and usage of batch formats. It introduces Highlights
Changelog
Activity
|
| * Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations. | ||
| * Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms. | ||
| * Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems. |
There was a problem hiding this comment.
| * Use numpy in ``map_batches`` when your batch function needs numeric or tensor-style operations. | |
| * Use pandas in ``map_batches`` when your batch function needs a DataFrame API, such as for tabular cleaning, joins, grouping, or row/column-wise transforms. | |
| * Use pyarrow in ``map_batches`` when your batch function benefits from columnar processing, high-performance I/O, or zero-copy conversion to other systems. | |
| When choosing appropriate batch format for your ``map_batches`` primary consideration is a trade off of convenience vs performance: | |
| 1. Batches serve as a sliding window into the underlying block -- your UDF will be invoked with a subset of rows (making up a batch) of the underlying block. | |
| 2. Depending on the batch format such view can either be zero-copy (when batch format matches block type) or copying (for ex, when using batch format differing from the block type). | |
| For ex, if you prefer to work with Panda's or Numpy batches you can specify either ``batch_format="pandas"`` or ``batch_format="numpy"`` (default) which might copy the underlying data when converting it from the underlying block type (for ex, if the block type is Arrow). | |
| Note that, by default block type is Arrow (what most Ray Data readers are producing). However, Ray Data strives to minimize amount of data conversions: for ex, if your ``map_batches`` operation returns Panda's batches then these batches will be combined into blocks *without* conversion and propagated further as Panda's blocks. |
There was a problem hiding this comment.
Depending on the batch format such view can either be zero-copy (when batch format matches block type) or copying (for ex, when using batch format differing from the block type).
Does copying happen everytime if batch format and block type dont match? or are there combinations where it doesnt happen?
There was a problem hiding this comment.
@alexeykudinkin Updated, but shouldn't we have some instruction to show the use cases for each format?
There was a problem hiding this comment.
Does copying happen everytime if batch format and block type dont match? or are there combinations where it doesnt happen?
Yeah, every time batch format and block type don't match the conversion is happened
There was a problem hiding this comment.
@alexeykudinkin On second thought we should not expose too much internal detail here, users only care about which format to choose, I added "Use numpy in map_batches..." back and summarize zero-copy detail into
Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion.
There was a problem hiding this comment.
@owenowenisme we shouldn't provide confusing guidance to the users and instead we'd clearly call out considerations they might not be aware of:
- Users can figure out for themselves when they prefer to use numpy/pandas/pyarrow based on personal preferences
- Our goal here is to guide them to explain the effects of their choice (ie copying, potential type-system misalignment when going pyarrow > numpy, etc)
There was a problem hiding this comment.
I think we need to have both; I think you need to mention the copy implications, but you also need to mention that beyond that, you can use any of the formats (and usually people use X format for Y purpose)
The reason for the latter is because I think early users will often appreciate the explicit guidance, even if it seems obvious
I believe that is what the current text has, which I made the edit on, but let's align
There was a problem hiding this comment.
Richard, we're totally on the same page here. However, our current guidance is preferential and is at best hand-waivy.
My point was simply that rather that we'd guide users to weigh in on tradeoffs of their decisions by putting out some of these hard-to-know aspects front and center for the user to ultimately to take a choice.
If you feel my framing was too technical happy to discuss alternatives, but i want to make sure we're not confusing users with our guidance here.
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
|
@richardliaw Will add polars in another pr |
doc/source/data/inspecting-data.rst
Outdated
| print(batch) | ||
|
|
||
| .. testoutput:: | ||
| :options: +MOCK |
There was a problem hiding this comment.
Any reason we can't actually test this?
There was a problem hiding this comment.
I'm just copying pandas test code and convert to pyarrow.
I didn't thought about it. Do you think we should add numpy & pandas test back?
There was a problem hiding this comment.
Ah, gotcha. I don't recall the historical context for why we're mocking this.
I think for simplicity/not slow down this PR, I think we should remove +MOCK for pyarrow only
| Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion. | ||
|
|
||
|
|
||
| The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches |
There was a problem hiding this comment.
Maybe out-of-scope for this PR, but I felt like the previous paragraph didn;t really transition to this paragraph at all
There was a problem hiding this comment.
Do you think its better to move this part to batch in glossary?
There was a problem hiding this comment.
Not if we want people to read this. My guess is that few people actually read the glossary
There was a problem hiding this comment.
Okay I'll just remove it then.
There was a problem hiding this comment.
(To clarify, I think this list is useful and we should keep it somewhere)
There was a problem hiding this comment.
OK i made it flow better.
Co-authored-by: Balaji Veeramani <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
| import ray | ||
|
|
||
| def drop_nas(batch: pa.Table) -> pa.Table: | ||
| return pc.drop_null(batch) |
There was a problem hiding this comment.
Bug: Incorrect PyArrow function usage on Table
The pc.drop_null() function operates on individual arrays or chunked arrays, not on pa.Table objects. This means the function won't work as intended to drop rows with null values (analogous to batch.dropna() in pandas). The correct approach would be using pc.filter() with a mask or iterating over columns, not directly calling pc.drop_null() on the table.
| Note that Ray Data uses zero-copy when the batch format matches the underlying block type (for example, Arrow blocks with ``batch_format="pyarrow"``). When they differ, Ray Data copies the data during conversion. | ||
|
|
||
|
|
||
| The user defined function you pass to :meth:`~ray.data.Dataset.map_batches` is more flexible. Because you can represent batches |
There was a problem hiding this comment.
(To clarify, I think this list is useful and we should keep it somewhere)
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
cded09e to
8cc030c
Compare
richardliaw
left a comment
There was a problem hiding this comment.
don't merge yet - will chat sync
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
## Description - PyArrow Batch Format Integration: The documentation now explicitly includes `pyarrow.Table` as a supported batch format for Ray Data operations, enhancing flexibility for users. - Batch Format Independence Clarification: Added text to clarify that the chosen batch format (e.g., NumPy, Pandas, PyArrow) is independent of how Ray Data stores its underlying blocks, providing a clearer understanding of data handling. - Enhanced Examples and Guidance: New examples demonstrating pyarrow batch usage with `take_batch` and `map_batches` have been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases. - Glossary Update: The glossary definition for 'Batch format' has been expanded to include its independence from internal block representation and now features a pyarrow example for `iter_batches`. ## Related issues Closes ray-project#58615 ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Signed-off-by: Alexey Kudinkin <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: Alexey Kudinkin <[email protected]> Signed-off-by: peterxcli <[email protected]>
Description
pyarrow.Tableas a supported batch format for Ray Data operations, enhancing flexibility for users.take_batchandmap_batcheshave been added, along with strategic guidance on when to choose numpy, pandas, or pyarrow formats for different use cases.iter_batches.Related issues
Closes #58615
Additional information