
[Data] Improve numerical stability in scalers by handling near-zero values#60488

Merged
bveeramani merged 1 commit into ray-project:master from slfan1989:data/scaler-epsilon-handling
Jan 27, 2026

Conversation

@slfan1989 (Contributor)

Description

This PR improves numerical stability in preprocessor scalers (StandardScaler and MinMaxScaler) by extending division-by-zero handling to also cover near-zero values.

Current behavior:
The scalers only check for exact zero values (e.g., std == 0 or diff == 0), which can lead to numerical instability when dealing with near-zero values (e.g., std = 1e-10). This is a common edge case in real-world data preprocessing where columns have extremely small variance or range.

Changes made:

  • Added _EPSILON = 1e-8 constant to define near-zero threshold (following sklearn's approach)
  • Updated StandardScaler._transform_pandas() and _scale_column() to use < _EPSILON instead of == 0
  • Updated MinMaxScaler._transform_pandas() similarly
  • Added comprehensive test cases covering near-zero and exact-zero edge cases

Impact:
This change prevents numerical instability (NaN/inf values) when scaling columns with very small but non-zero variance/range, while maintaining backward compatibility for normal use cases.
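The guard described above can be sketched as follows. This is a minimal standalone illustration based on the PR description, not the actual Ray code; `standard_scale_column` here is a hypothetical helper, and only the `_EPSILON` value comes from the PR.

```python
import numpy as np
import pandas as pd

# Threshold mirroring the PR; the real constant lives in
# python/ray/data/preprocessors/scaler.py.
_EPSILON = 1e-8

def standard_scale_column(col: pd.Series, mean: float, std: float) -> pd.Series:
    """Standard-scale one column, guarding zero and near-zero std."""
    # Pre-PR behavior only caught `std == 0`; a tiny nonzero std slipped
    # through and amplified floating-point noise. Treat both cases as 1.
    if std < _EPSILON:
        std = 1.0
    return (col - mean) / std

col = pd.Series([1.0, 1.0 + 1e-10, 1.0])  # std is tiny but nonzero
scaled = standard_scale_column(col, col.mean(), col.std(ddof=0))
assert np.isfinite(scaled).all()  # no NaN/inf in the output
```

The same shape of guard applies to MinMaxScaler, with the min–max range `diff` in place of `std`.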

Related issues

Addresses TODO comments in python/ray/data/preprocessors/scaler.py:

  • Line 117: # TODO: extend this to handle near-zero values.
  • Line 271: # TODO: extend this to handle near-zero values.

Additional information

Implementation Details

Epsilon Value Selection:
The threshold _EPSILON = 1e-8 was chosen to align with industry-standard practices (e.g., sklearn, numpy). This value effectively handles floating-point precision issues without incorrectly treating legitimate small variances as zero.
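A quick illustration of where the boundary falls (the sample std values are assumptions chosen for the demo; only the threshold itself is from the PR):

```python
_EPSILON = 1e-8  # threshold added by this PR

# How the old and new guards classify a few representative std values:
for std in (0.0, 1e-10, 1e-6, 1.0):
    old_guard = std == 0        # pre-PR: exact-zero check only
    new_guard = std < _EPSILON  # post-PR: also catches near-zero
    print(f"std={std:g}  old={old_guard}  new={new_guard}")
# std=0      old=True   new=True   (still guarded)
# std=1e-10  old=False  new=True   (newly guarded)
# std=1e-06  old=False  new=False  (legitimate small variance, untouched)
# std=1      old=False  new=False
```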

Modified Methods:

  1. StandardScaler._transform_pandas() - Pandas transformation path
  2. StandardScaler._scale_column() - PyArrow transformation path
  3. MinMaxScaler._transform_pandas() - Pandas transformation path

Backward Compatibility:
✅ For normal data (variance/range > 1e-8), behavior is identical to before
✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8)
✅ All existing tests pass without modification

Test Coverage

Added three new test cases:

  1. test_standard_scaler_near_zero_std() - Tests data with std ≈ 4.7e-11
  2. test_min_max_scaler_near_zero_range() - Tests data with range ≈ 1e-10
  3. test_standard_scaler_exact_zero_std() - Regression test for exact zero case
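A sketch of what the near-zero-std case plausibly checks, as a standalone function rather than the actual test code (the real tests exercise `ray.data.preprocessors.StandardScaler`; the helper below is an assumption):

```python
import numpy as np
import pandas as pd

def test_near_zero_std_sketch():
    _EPSILON = 1e-8
    col = pd.Series([1.0, 1.0 + 1e-10, 1.0])
    std = col.std(ddof=0)                # ~4.7e-11, well below the threshold
    divisor = 1.0 if std < _EPSILON else std
    scaled = (col - col.mean()) / divisor
    # With the guard, outputs stay finite and tiny instead of blowing up.
    assert np.isfinite(scaled).all()
    assert scaled.abs().max() < 1e-9

test_near_zero_std_sketch()
```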

…alues (commit message)

This commit addresses TODO comments in scaler.py by extending
division-by-zero handling to also cover near-zero values, preventing
numerical instability.

Changes:
- Add _EPSILON constant (1e-8) for near-zero threshold
- Update StandardScaler to handle near-zero standard deviations
- Update MinMaxScaler to handle near-zero ranges
- Add comprehensive test cases for edge cases

Fixes numerical instability when scaling columns with very small
variance or range, similar to sklearn's approach.

Signed-off-by: slfan1989 <[email protected]>
@slfan1989 slfan1989 requested a review from a team as a code owner January 26, 2026 05:43
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a good improvement for numerical stability in StandardScaler and MinMaxScaler by handling near-zero divisors using an epsilon threshold. The changes are well-implemented and include thorough tests for the modified scalers.

My review identifies that MaxAbsScaler and RobustScaler in the same file have similar numerical stability issues with near-zero divisors but were not updated. I've added a comment suggesting to apply the same epsilon-based handling to them for consistency and completeness. This would make the scaler implementations more robust across the board.

# Handle division by zero and near-zero values for numerical stability.
# If range is very small (constant or near-constant column),
# treat it as 1 to avoid numerical instability.
if diff < _EPSILON:
Severity: medium

This change is great for improving the numerical stability of MinMaxScaler.

For completeness and consistency, could you also apply similar logic to MaxAbsScaler and RobustScaler? They also perform divisions and could benefit from this epsilon-based handling to avoid instability with near-zero divisors.

  • For MaxAbsScaler (L386), the check could be changed from if s_abs_max == 0: to if s_abs_max < _EPSILON:.
  • For RobustScaler (L542), the check could be changed from if diff == 0: to if diff < _EPSILON:.

Applying these changes would make all scalers in this module more robust. If you decide to add these, please also add corresponding tests for these scalers.
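The reviewer's suggested change could look roughly like this. It is a hedged sketch of the proposal, not code merged in this PR; `max_abs_scale` is a hypothetical standalone helper, and only the guard pattern and threshold follow the discussion above.

```python
import pandas as pd

_EPSILON = 1e-8  # same threshold the PR introduces for the other scalers

def max_abs_scale(col: pd.Series, abs_max: float) -> pd.Series:
    """MaxAbs-style scaling with the suggested near-zero guard."""
    # Suggested change: `abs_max == 0` -> `abs_max < _EPSILON`.
    if abs_max < _EPSILON:
        abs_max = 1.0
    return col / abs_max

# A near-zero divisor now passes values through instead of amplifying noise.
out = max_abs_scale(pd.Series([1e-12, -1e-12]), abs_max=1e-12)
assert out.tolist() == [1e-12, -1e-12]
```

RobustScaler's `diff == 0` check would take the analogous one-line change.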

@ray-gardener (bot) added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Jan 26, 2026
@bveeramani bveeramani enabled auto-merge (squash) January 27, 2026 18:39
@bveeramani (Member)

lgtm. ty for the contribution

@github-actions (bot) added the go (add ONLY when ready to merge, run all tests) label on Jan 27, 2026
@bveeramani bveeramani merged commit 2cd56f0 into ray-project:master Jan 27, 2026
7 checks passed
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026