[Data] Improve numerical stability in scalers by handling near-zero values #60488
Conversation
…alues

This commit addresses TODO comments in scaler.py by extending division-by-zero handling to also cover near-zero values, preventing numerical instability.

Changes:
- Add `_EPSILON` constant (1e-8) for the near-zero threshold
- Update StandardScaler to handle near-zero standard deviations
- Update MinMaxScaler to handle near-zero ranges
- Add comprehensive test cases for edge cases

Fixes numerical instability when scaling columns with very small variance or range, similar to sklearn's approach.

Signed-off-by: slfan1989 <[email protected]>
Code Review
This pull request introduces a good improvement for numerical stability in StandardScaler and MinMaxScaler by handling near-zero divisors using an epsilon threshold. The changes are well-implemented and include thorough tests for the modified scalers.
My review notes that `MaxAbsScaler` and `RobustScaler` in the same file have similar numerical-stability issues with near-zero divisors but were not updated. I've added a comment suggesting that the same epsilon-based handling be applied to them for consistency and completeness, which would make the scaler implementations more robust across the board.
    # Handle division by zero and near-zero values for numerical stability.
    # If the range is very small (constant or near-constant column),
    # treat it as 1 to avoid numerical instability.
    if diff < _EPSILON:
This change is great for improving the numerical stability of MinMaxScaler.
For completeness and consistency, could you also apply similar logic to MaxAbsScaler and RobustScaler? They also perform divisions and could benefit from this epsilon-based handling to avoid instability with near-zero divisors.
- For `MaxAbsScaler` (L386), the check could be changed from `if s_abs_max == 0:` to `if s_abs_max < _EPSILON:`.
- For `RobustScaler` (L542), the check could be changed from `if diff == 0:` to `if diff < _EPSILON:`.
Applying these changes would make all scalers in this module more robust. If you decide to add these, please also add corresponding tests for these scalers.
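As a standalone sketch of the suggested epsilon-based handling (not the actual Ray implementation; the `safe_scale` helper and its signature are illustrative), the pattern looks like this:

```python
import numpy as np

_EPSILON = 1e-8  # near-zero threshold proposed in this PR


def safe_scale(values, divisor):
    # Illustrative helper: divide by `divisor`, but treat a near-zero
    # divisor as 1 so near-constant columns don't blow up to NaN/inf.
    if abs(divisor) < _EPSILON:
        divisor = 1.0
    return np.asarray(values, dtype=float) / divisor


# With an exact or near-zero divisor, the values pass through unchanged
# instead of becoming inf or astronomically large.
print(safe_scale([1.0, 2.0], 1e-12))  # -> [1. 2.]
print(safe_scale([1.0, 2.0], 2.0))    # -> [0.5 1. ]
```

The same `< _EPSILON` comparison would replace the `== 0` checks in `MaxAbsScaler` and `RobustScaler`, mirroring what this PR does for `StandardScaler` and `MinMaxScaler`.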
lgtm. ty for the contribution
Description
This PR improves numerical stability in preprocessor scalers (`StandardScaler` and `MinMaxScaler`) by extending division-by-zero handling to also cover near-zero values.

Current behavior:
The scalers only check for exact zero values (e.g., `std == 0` or `diff == 0`), which can lead to numerical instability when dealing with near-zero values (e.g., `std = 1e-10`). This is a common edge case in real-world data preprocessing where columns have extremely small variance or range.

Changes made:
- Added `_EPSILON = 1e-8` constant to define the near-zero threshold (following sklearn's approach)
- Updated `StandardScaler._transform_pandas()` and `_scale_column()` to use `< _EPSILON` instead of `== 0`
- Updated `MinMaxScaler._transform_pandas()` similarly
- Added comprehensive test cases covering near-zero and exact-zero edge cases

Impact:
This change prevents numerical instability (NaN/inf values) when scaling columns with very small but non-zero variance/range, while maintaining backward compatibility for normal use cases.
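To make the instability concrete, here is a minimal numpy-only reproduction (a sketch using the PR's stated threshold, not Ray code): standardizing a near-constant column with the old exact-zero check amplifies floating-point noise, while the epsilon check maps the column to ~0.

```python
import numpy as np

_EPSILON = 1e-8

col = np.array([1.0, 1.0 + 1e-10, 1.0])  # near-constant column
std = col.std()  # ~4.7e-11: non-zero, so `std == 0` does not catch it

# Old behavior: divide by the tiny std, turning float rounding noise
# into order-one values that carry no real signal.
old = (col - col.mean()) / std

# New behavior (the PR's approach): treat near-zero std as 1, so a
# near-constant column maps to ~0, just as a constant column would.
divisor = 1.0 if std < _EPSILON else std
new = (col - col.mean()) / divisor

assert np.allclose(new, 0.0, atol=1e-9)   # stable: effectively zero
assert np.abs(old).max() > 0.5            # unstable: noise blown up to O(1)
```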
Related issues
Addresses TODO comments in `python/ray/data/preprocessors/scaler.py`:

- Line 117: `# TODO: extend this to handle near-zero values.`
- Line 271: `# TODO: extend this to handle near-zero values.`

Additional information
Implementation Details
Epsilon Value Selection:
The threshold `_EPSILON = 1e-8` was chosen to align with industry-standard practices (e.g., sklearn, numpy). This value handles floating-point precision issues without incorrectly treating legitimate small variances as zero.

Modified Methods:
1. `StandardScaler._transform_pandas()` - Pandas transformation path
2. `StandardScaler._scale_column()` - PyArrow transformation path
3. `MinMaxScaler._transform_pandas()` - Pandas transformation path

Backward Compatibility:
✅ For normal data (variance/range > 1e-8), behavior is identical to before
✅ Only triggers new logic for extreme edge cases (variance/range < 1e-8)
✅ All existing tests pass without modification
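As context for why 1e-8 leaves normal data untouched (a quick standalone check, not part of the PR): float64 rounding noise sits roughly eight orders of magnitude below the threshold, so well-scaled real data never trips the near-zero branch.

```python
import numpy as np

# float64 machine epsilon (~2.22e-16) vs. the PR's 1e-8 threshold:
# ordinary rounding error is far below the cutoff, so only genuinely
# (near-)constant columns take the new code path.
eps = np.finfo(np.float64).eps
print(eps)  # 2.220446049250313e-16
assert eps < 1e-8
```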
Test Coverage
Added three new test cases:
1. `test_standard_scaler_near_zero_std()` - Tests data with std ≈ 4.7e-11
2. `test_min_max_scaler_near_zero_range()` - Tests data with range ≈ 1e-10
3. `test_standard_scaler_exact_zero_std()` - Regression test for the exact-zero case
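The listed tests run against Ray Datasets; as a self-contained approximation of what they assert (numpy only; the `standardize` helper is illustrative, not Ray's API):

```python
import numpy as np

_EPSILON = 1e-8


def standardize(col):
    # Sketch of the PR's StandardScaler logic for a single column.
    std = col.std()
    divisor = 1.0 if std < _EPSILON else std
    return (col - col.mean()) / divisor


# Near-zero std (~4.7e-11): output must be finite and ~0, not NaN/inf.
near_constant = np.array([1.0, 1.0 + 1e-10, 1.0])
out = standardize(near_constant)
assert np.all(np.isfinite(out)) and np.allclose(out, 0.0, atol=1e-9)

# Exact-zero std (regression): a constant column also maps to 0.
assert np.allclose(standardize(np.array([5.0, 5.0, 5.0])), 0.0)

# Normal data is unchanged: unit variance after scaling, as before.
normal = np.array([1.0, 2.0, 3.0])
assert np.isclose(standardize(normal).std(), 1.0)
```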