Cache PyArrow schema operations #58583
Merged
raulchen merged 26 commits into ray-project:master on Nov 17, 2025
Conversation
xinyuangui2 commented Nov 13, 2025
  # Remove metadata for hashability
- schemas[0].remove_metadata()
+ schemas[0] = schemas[0].remove_metadata()
Contributor (Author)
This `remove_metadata()` call doesn't mutate the schema in place.
Contributor (Author)
I don't think we can remove the metadata in place; doing so fails some release tests:
[2025-11-14T23:48:47Z] raise ValueError(msg.format(self.feature_names, feature_names))
[2025-11-14T23:48:47Z] ValueError: feature_names mismatch: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', 'partition'] ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', '__index_level_0__', 'partition']
[2025-11-14T23:48:47Z] training data did not have the following fields: __index_level_0__
in https://buildkite.com/ray-project/premerge/builds/53867#019a84b7-88cd-4186-8a4e-b89b9e4604e1
I updated this a bit @goutamvenkat-anyscale
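For background (this is standard PyArrow behavior, not specific to this PR): `pyarrow.Schema` is immutable, so `remove_metadata()` returns a new schema rather than modifying the receiver, which is why the result has to be assigned back. A quick illustration:

```python
import pyarrow as pa

schema = pa.schema([pa.field("x", pa.int64())], metadata={"k": "v"})

schema.remove_metadata()           # no-op here: the new schema is discarded
assert schema.metadata is not None

schema = schema.remove_metadata()  # correct: rebind to the returned schema
assert schema.metadata is None
```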
xinyuangui2 commented Nov 13, 2025
# NOTE: Type promotions aren't available in Arrow < 14.0
subset_blocks = []
for block in blocks:
    cols_to_select = [
Contributor (Author)
The profiler shows that `col_name in block.schema.names` is heavy, so we use a set here.
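As a sketch of that optimization (the helper and variable names below are illustrative, not Ray's actual code): `Schema.names` builds a Python list on every access, and `x in list` is a linear scan, so materializing the names into a set once per block makes each membership test O(1):

```python
import pyarrow as pa

def select_common_columns(blocks, wanted_names):
    # Hypothetical helper: keep only the wanted columns present in each
    # block, where `blocks` is a list of pyarrow.Table.
    subset_blocks = []
    for block in blocks:
        # Materialize the schema's names once; `name in block_names` is then
        # an O(1) set lookup instead of an O(n) list scan per column.
        block_names = set(block.schema.names)
        cols_to_select = [name for name in wanted_names if name in block_names]
        subset_blocks.append(block.select(cols_to_select))
    return subset_blocks

t = pa.table({"a": [1], "b": [2]})
print(select_common_columns([t], ["a"])[0].column_names)  # ['a']
```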
This reverts commit aece4fd.
raulchen reviewed Nov 14, 2025
raulchen approved these changes Nov 14, 2025
with self._cache_lock:
    if self._serialize_cache is None:
        self._serialize_cache = self._arrow_ext_serialize_compute()
    return self._serialize_cache
Contributor
If it's already serialized, you can skip the lock.
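A minimal sketch of that suggestion, assuming the class shape implied by the quoted snippet (the attribute and method names mirror it; the payload and surrounding class are illustrative): read the cache without the lock first, and only lock on a miss. This double-checked pattern is safe in CPython because a single attribute read is atomic.

```python
import pickle
import threading

class CachedSerializeMixin:
    """Hypothetical mixin showing the double-checked locking suggested above."""

    def __init__(self, payload):
        self._payload = payload
        self._cache_lock = threading.Lock()
        self._serialize_cache = None

    def _arrow_ext_serialize_compute(self) -> bytes:
        # Stand-in for the expensive pickle-based serialization.
        return pickle.dumps(self._payload)

    def __arrow_ext_serialize__(self) -> bytes:
        # Fast path: a populated cache is returned without taking the lock.
        cache = self._serialize_cache
        if cache is not None:
            return cache
        # Slow path: lock and re-check so the payload is computed at most
        # once even under concurrent first calls.
        with self._cache_lock:
            if self._serialize_cache is None:
                self._serialize_cache = self._arrow_ext_serialize_compute()
            return self._serialize_cache
```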
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Description
This PR adds caching for PyArrow schema operations to improve performance during batching operations, especially for tables with a large number of columns.
Main Changes
- Caching for Tensor Type Serialization/Deserialization: Added a cache for tensor type serialization and deserialization operations. This significantly reduces overhead for frequently accessed tensor types during schema operations.
Performance Impact
This optimization is particularly beneficial during batching operations for tables with a large number of columns. In one of our tests with 200 columns, the batching time per batch decreased from 0.30s to 0.11s (~63% improvement).
Without cache:
[Profile screenshot] `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` show up in several places; each `__arrow_ext_deserialize__` call creates a new object, and `__arrow_ext_serialize__` involves an expensive pickle.
With cache:
[Profile screenshot] The time spent in `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` is no longer a bottleneck.
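To illustrate the caching idea from the description (a sketch only; the PR's actual cache structure and Ray's tensor extension types differ, and `CachedTensorType` plus `_DESERIALIZE_CACHE` are hypothetical names), `__arrow_ext_deserialize__` can memoize on its `(storage_type, serialized)` key so repeated schema operations reuse one extension-type instance instead of constructing a fresh one each time:

```python
import json

import pyarrow as pa

# Hypothetical module-level cache keyed by (storage_type, payload bytes).
_DESERIALIZE_CACHE = {}

class CachedTensorType(pa.ExtensionType):
    """Illustrative fixed-shape tensor type with a deserialize cache."""

    def __init__(self, shape, value_type):
        self._shape = tuple(shape)
        super().__init__(pa.list_(value_type), "example.cached_tensor")

    def __arrow_ext_serialize__(self) -> bytes:
        # Deterministic payload; Ray's real types pickle richer state.
        return json.dumps(list(self._shape)).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Reuse one instance per (storage_type, payload) rather than
        # re-parsing the payload and building a new type on every call.
        key = (storage_type, bytes(serialized))
        cached = _DESERIALIZE_CACHE.get(key)
        if cached is None:
            shape = json.loads(bytes(serialized).decode())
            cached = cls(shape, storage_type.value_type)
            _DESERIALIZE_CACHE[key] = cached
        return cached
```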