[Data] Improve numerical stability in scalers by handling near-zero values #60488
Conversation
…alues

This commit addresses TODO comments in scaler.py by extending division-by-zero handling to also cover near-zero values, preventing numerical instability.

Changes:
- Add `_EPSILON` constant (1e-8) for the near-zero threshold
- Update StandardScaler to handle near-zero standard deviations
- Update MinMaxScaler to handle near-zero ranges
- Add comprehensive test cases for edge cases

Fixes numerical instability when scaling columns with very small variance or range, similar to sklearn's approach.

Signed-off-by: slfan1989 <[email protected]>
Code Review
This pull request introduces a good improvement for numerical stability in StandardScaler and MinMaxScaler by handling near-zero divisors using an epsilon threshold. The changes are well-implemented and include thorough tests for the modified scalers.
My review notes that `MaxAbsScaler` and `RobustScaler` in the same file have similar numerical-stability issues with near-zero divisors but were not updated. I've added a comment suggesting that the same epsilon-based handling be applied to them for consistency and completeness, which would make the scaler implementations more robust across the board.
    # Handle division by zero and near-zero values for numerical stability.
    # If the range is very small (constant or near-constant column),
    # treat it as 1 to avoid numerical instability.
    if diff < _EPSILON:
This change is great for improving the numerical stability of MinMaxScaler.
For completeness and consistency, could you also apply similar logic to MaxAbsScaler and RobustScaler? They also perform divisions and could benefit from this epsilon-based handling to avoid instability with near-zero divisors.
- For `MaxAbsScaler` (L386), the check could be changed from `if s_abs_max == 0:` to `if s_abs_max < _EPSILON:`.
- For `RobustScaler` (L542), the check could be changed from `if diff == 0:` to `if diff < _EPSILON:`.
Applying these changes would make all scalers in this module more robust. If you decide to add these, please also add corresponding tests for these scalers.
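As a standalone sketch of the suggested epsilon-based handling (not the actual Ray implementation; the `safe_scale` helper and its signature are illustrative), the pattern looks like this:

```python
import numpy as np

_EPSILON = 1e-8  # near-zero threshold proposed in this PR


def safe_scale(values, divisor):
    # Illustrative helper: divide by `divisor`, but treat a near-zero
    # divisor as 1 so near-constant columns don't blow up to NaN/inf.
    if abs(divisor) < _EPSILON:
        divisor = 1.0
    return np.asarray(values, dtype=float) / divisor


# With an exact or near-zero divisor, the values pass through unchanged
# instead of becoming inf or astronomically large.
print(safe_scale([1.0, 2.0], 1e-12))  # -> [1. 2.]
print(safe_scale([1.0, 2.0], 2.0))    # -> [0.5 1. ]
```

The same `< _EPSILON` comparison would replace the `== 0` checks in `MaxAbsScaler` and `RobustScaler`, mirroring what this PR does for `StandardScaler` and `MinMaxScaler`.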
lgtm. ty for the contribution
Description
This PR improves numerical stability in preprocessor scalers (`StandardScaler` and `MinMaxScaler`) by extending division-by-zero handling to also cover near-zero values.

Current behavior:
The scalers only check for exact zero values (e.g., `std == 0` or `diff == 0`), which can lead to numerical instability when dealing with near-zero values (e.g., `std = 1e-10`). This is a common edge case in real-world data preprocessing where columns have extremely small variance or range.

Changes made:
- Added `_EPSILON = 1e-8` constant to define the near-zero threshold (following sklearn's approach)
- Updated `StandardScaler._transform_pandas()` and `_scale_column()` to use `< _EPSILON` instead of `== 0`
- Updated `MinMaxScaler._transform_pandas()` similarly
- Added comprehensive test cases covering near-zero and exact-zero edge cases

Impact:
This change prevents numerical instability (NaN/inf values) when scaling columns with very small but non-zero variance/range, while maintaining backward compatibility for normal use cases.
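To make the instability concrete, here is a minimal numpy-only reproduction (a sketch using the PR's stated threshold, not Ray code): standardizing a near-constant column with the old exact-zero check amplifies floating-point noise, while the epsilon check maps the column to ~0.

```python
import numpy as np

_EPSILON = 1e-8

col = np.array([1.0, 1.0 + 1e-10, 1.0])  # near-constant column
std = col.std()  # ~4.7e-11: non-zero, so `std == 0` does not catch it

# Old behavior: divide by the tiny std, turning float rounding noise
# into order-one values that carry no real signal.
old = (col - col.mean()) / std

# New behavior (the PR's approach): treat near-zero std as 1, so a
# near-constant column maps to ~0, just as a constant column would.
divisor = 1.0 if std < _EPSILON else std
new = (col - col.mean()) / divisor

assert np.allclose(new, 0.0, atol=1e-9)   # stable: effectively zero
assert np.abs(old).max() > 0.5            # unstable: noise blown up to O(1)
```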
Related issues
Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`:

- Line 117: `# TODO: extend this to handle near-zero values.`
- Line 271: `# TODO: extend this to handle near-zero values.`

Additional information
Implementation Details
Epsilon Value Selection:
The threshold `_EPSILON = 1e-8` was chosen to align with industry-standard practices (e.g., sklearn, numpy). This value handles floating-point precision issues without incorrectly treating legitimate small variances as zero.

Modified Methods:
1. `StandardScaler._transform_pandas()` - Pandas transformation path
2. `StandardScaler._scale_column()` - PyArrow transformation path
3. `MinMaxScaler._transform_pandas()` - Pandas transformation path

Backward Compatibility:
✅ For normal data (variance/range > 1e-8), behavior is identical to before
✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8)
✅ All existing tests pass without modification
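As context for why 1e-8 leaves normal data untouched (a quick standalone check, not part of the PR): float64 rounding noise sits roughly eight orders of magnitude below the threshold, so well-scaled real data never trips the near-zero branch.

```python
import numpy as np

# float64 machine epsilon (~2.22e-16) vs. the PR's 1e-8 threshold:
# ordinary rounding error is far below the cutoff, so only genuinely
# (near-)constant columns take the new code path.
eps = np.finfo(np.float64).eps
print(eps)  # 2.220446049250313e-16
assert eps < 1e-8
```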
Test Coverage
Added three new test cases:
1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈ 4.7e-11
2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈ 1e-10
3. `test_standard_scaler_exact_zero_std()` - Regression test for the exact-zero case
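The listed tests run against Ray Datasets; as a self-contained approximation of what they assert (numpy only; the `standardize` helper is illustrative, not Ray's API):

```python
import numpy as np

_EPSILON = 1e-8


def standardize(col):
    # Sketch of the PR's StandardScaler logic for a single column.
    std = col.std()
    divisor = 1.0 if std < _EPSILON else std
    return (col - col.mean()) / divisor


# Near-zero std (~4.7e-11): output must be finite and ~0, not NaN/inf.
near_constant = np.array([1.0, 1.0 + 1e-10, 1.0])
out = standardize(near_constant)
assert np.all(np.isfinite(out)) and np.allclose(out, 0.0, atol=1e-9)

# Exact-zero std (regression): a constant column also maps to 0.
assert np.allclose(standardize(np.array([5.0, 5.0, 5.0])), 0.0)

# Normal data is unchanged: unit variance after scaling, as before.
normal = np.array([1.0, 2.0, 3.0])
assert np.isclose(standardize(normal).std(), 1.0)
```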