[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit by DenineLu · Pull Request #48470 · apache/spark

DenineLu · 2024-10-15T05:13:03Z

What changes were proposed in this pull request?

After SPARK-40194, the current behavior of the split function is as follows:

select split('hello', 'h', 1) // result is ["hello"]
select split('hello', '-', 1) // result is ["hello"]
select split('hello', '', 1)  // result is ["h"]

select split('1A2A3A4', 'A', 3) // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.

Arguments:
      * str - a string expression to split.
      * regex - a string representing a regular expression. The regex string should be a Java regular expression.
      * limit - an integer expression which controls the number of times the regex is applied.
          * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
          * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size.

So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:

select split('hello', '', 1)    // result is ["hello"]

select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]
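The intended semantics for an empty regex can be sketched at the level of plain Java strings. This is a minimal illustration, not the Spark implementation: the class and method names are hypothetical, and it assumes that an empty regex means "split between every character" and that a positive limit caps the array length while the last entry keeps the rest of the input.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyRegexSplitSketch {
    // Hypothetical helper: splits `s` into single code points.
    // When limit > 0, at most `limit` entries are produced and the
    // last entry holds everything that was not split off.
    static List<String> splitEachChar(String s, int limit) {
        List<String> out = new ArrayList<>();
        if (s.isEmpty()) {
            out.add("");
            return out;
        }
        int numChars = s.codePointCount(0, s.length());
        // A non-positive or oversized limit means "one entry per character".
        int pieces = (limit <= 0 || limit > numChars) ? numChars : limit;
        int idx = 0;
        for (int i = 0; i < pieces - 1; i++) {
            int next = s.offsetByCodePoints(idx, 1);
            out.add(s.substring(idx, next));
            idx = next;
        }
        out.add(s.substring(idx)); // remainder of the input
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitEachChar("hello", 1));    // [hello]
        System.out.println(splitEachChar("1A2A3A4", 3));  // [1, A, 2A3A4]
        System.out.println(splitEachChar("1A2A3A4", -1)); // [1, A, 2, A, 3, A, 4]
    }
}
```

The first two calls reproduce the corrected results listed above; the third shows that a non-positive limit still yields one entry per character.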

Why are the changes needed?

Fix correctness issue.

Does this PR introduce any user-facing change?

Yes.
When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes.
Before this patch

select split('hello', '', 1)          // result is ["h"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2"]

After this patch

select split('hello', '', 1)          // result is ["hello"]
select split('1A2A3A4', '', 3)  // result is ["1","A","2A3A4"]

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

TongWei1105 · 2024-10-15T06:13:49Z

cc @wangyum @cloud-fan

wangyum · 2024-10-15T07:50:00Z

cc @vitaliili-db

uros-db

thanks for making this change - however, please add collation-related tests as well, see:

test("StringSplit expression with collated strings")

in CollationRegexpExpressionsSuite.scala

DenineLu · 2024-10-17T08:08:32Z

Thank you for your guidance. The relevant tests have been added.

...c/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala

sql/core/src/test/resources/sql-tests/inputs/string-functions.sql

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

uros-db · 2024-10-17T11:44:28Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+      for (int i = 0; i < newLimit - 1; i++) {
        int currCharNumBytes = numBytesForFirstByte(input[byteIndex]);
-        result[charIndex++] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
+        result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);


Suggested change

- result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
+ result[charIndex] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);

uros-db · 2024-10-17T11:46:18Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+        result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
        byteIndex += currCharNumBytes;
      }
+      result[newLimit - 1] = UTF8String.fromBytes(input, byteIndex, numBytes() - byteIndex);


is ArrayIndexOutOfBoundsException possible here?
what if newLimit=0 (i.e. numChars()=0, limit=-1)

no, this code block will only be entered when the following conditions are met.

if (numBytes() != 0 && pattern.numBytes() == 0)
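The argument in this thread can be checked with a standalone sketch of the loop. It is an illustration under assumptions rather than the code in UTF8String.java: the class name is hypothetical, `numBytesForFirstByte` is a simplified stand-in that assumes valid UTF-8, and `newLimit` is assumed to be the character count whenever the limit is non-positive or larger than that count. Under those assumptions a non-empty input always gives `newLimit >= 1`, so the final `result[newLimit - 1]` write stays in bounds.

```java
import java.nio.charset.StandardCharsets;

public class Utf8CharSplitSketch {
    // Simplified stand-in: byte length of a UTF-8 sequence, derived
    // from its first byte. Valid UTF-8 input is assumed.
    static int numBytesForFirstByte(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;
        if (v >= 0xF0) return 4;
        if (v >= 0xE0) return 3;
        return 2;
    }

    // Splits non-empty UTF-8 bytes into single characters, keeping the
    // remainder of the input in the last entry when limit > 0.
    static String[] splitChars(byte[] input, int limit) {
        if (input.length == 0) {
            // Mirrors the guard quoted above: empty input never gets here.
            throw new IllegalArgumentException("empty input is handled elsewhere");
        }
        int numChars = 0;
        for (int b = 0; b < input.length; b += numBytesForFirstByte(input[b])) {
            numChars++;
        }
        // numChars >= 1 here, so newLimit >= 1 as well.
        int newLimit = (limit > numChars || limit <= 0) ? numChars : limit;
        String[] result = new String[newLimit];
        int byteIndex = 0;
        for (int i = 0; i < newLimit - 1; i++) {
            int n = numBytesForFirstByte(input[byteIndex]);
            result[i] = new String(input, byteIndex, n, StandardCharsets.UTF_8);
            byteIndex += n;
        }
        // The last entry takes every remaining byte.
        result[newLimit - 1] = new String(
            input, byteIndex, input.length - byteIndex, StandardCharsets.UTF_8);
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(String.join("|", splitChars(bytes, 2)));
        System.out.println(String.join("|", splitChars(bytes, -1)));
    }
}
```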

uros-db · 2024-10-17T11:52:28Z

sql/core/src/test/resources/sql-tests/inputs/string-functions.sql

 SELECT split('hello', '');
+SELECT split('hello', '', 1);
+SELECT split('hello', '', 3);
 SELECT split('', '');


I would also prefer to see:

SELECT split('', '', -1);
SELECT split('', '', 0);
SELECT split('', '', 1);

here, for more complete testing

Thanks, already added.
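For reference, plain `java.lang.String.split` can be probed with the same inputs. This sketch only shows what the JDK does for an empty input and an empty regex. Whether Spark's result for these queries is identical depends on the rest of `UTF8String.split`, which is not shown in this conversation, so that link is an assumption. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class EmptyInputSplitSketch {
    // Splits the empty string on the empty regex with the given limit,
    // using only the JDK's String.split.
    static String[] splitEmpty(int limit) {
        return "".split("", limit);
    }

    public static void main(String[] args) {
        for (int limit : new int[] {-1, 0, 1}) {
            String[] parts = splitEmpty(limit);
            System.out.println(limit + " -> " + Arrays.toString(parts)
                + " (length " + parts.length + ")");
        }
    }
}
```

On Java 8 and later, each call returns a one-element array holding the empty string, because the JDK returns the input itself when no usable match is found.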

uros-db · 2024-10-18T09:20:40Z

...c/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala

Suggested change

- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), -1),
- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 0),
- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1A2B3C"), 1),
- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2B3C"), 3),
- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3C"), 5),
- StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 100),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), -1),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 0),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1A2B3C"), 1),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2B3C"), 3),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3C"), 5),
+ StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 100),

sorry, I meant to request using collation (other than UTF8_BINARY) here

In the current situation, if UTF8_LCASE is applied to an empty pattern, the condition below is not met, because the pattern becomes (?ui) after being collated by collationAwareRegex. This means that the trailing-empty-string truncation from #37631 does not apply when the pattern is (?ui).

public UTF8String[] split(UTF8String pattern, int limit) {
  // For the empty `pattern` a `split` function ignores trailing empty strings unless original
  // string is empty.
  if (numBytes() != 0 && pattern.numBytes() == 0) {

Therefore, it seems that the result is not what we want when limit <= 0.

select split('1A2B3C', '(?ui)', -1);  // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 0);   // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 1);   // result is ["1A2B3C"]
select split('1A2B3C', '(?ui)', 3);   // result is ["1", "A", "2B3C"]
select split('1A2B3C', '(?ui)', 6);   // result is ["1", "A", "2", "B", "3", "C"]
select split('1A2B3C', '(?ui)', 100); // result is ["1", "A", "2", "B", "3", "C"]
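The trailing empty string can be reproduced outside Spark with the JDK alone, since a pattern made only of inline flags still matches the empty string at every position. The sketch below uses only `java.lang.String.split`; the class and method names are hypothetical, and it does not model anything Spark does before or after calling into the regex engine.

```java
import java.util.Arrays;

public class FlagOnlyPatternSplitSketch {
    // Splits on "(?ui)": a pattern that sets flags but consumes no input.
    static String[] splitOnFlagOnlyPattern(String input, int limit) {
        return input.split("(?ui)", limit);
    }

    public static void main(String[] args) {
        // A negative limit keeps the trailing empty string.
        System.out.println(Arrays.toString(splitOnFlagOnlyPattern("1A2B3C", -1)));
        // A positive limit leaves the remainder in the last entry.
        System.out.println(Arrays.toString(splitOnFlagOnlyPattern("1A2B3C", 3)));
    }
}
```

On Java 8 and later, the first call yields seven entries ending in an empty string and the second yields `["1", "A", "2B3C"]`, in line with the `-1` and `3` rows reported above.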

When the pattern is "(?ui)", a simple and direct approach can be taken to correct the result.

public UTF8String[] split(UTF8String pattern, int limit) {
  // For the empty `pattern` a `split` function ignores trailing empty strings unless original
  // string is empty.
  if (numBytes() != 0 && (pattern.numBytes() == 0 || lowercaseRegexPrefix.equals(pattern))) {

However, when the pattern is "(?ui)(?ui)" or "(?ui)(?ui)(?ui)", the result still contains a trailing empty string, and I haven't thought of an efficient way to match and resolve it. Should we consider this a reasonable situation?
Additionally, do you think it is necessary to check in the CollationSupport.lowercaseRegex method whether the regex already has a (?ui) prefix?

uros-db

left just one more comment, otherwise lgtm (mostly focusing on collation behaviour)

adding @vitaliili-db @cloud-fan to carefully review, extending on #37631

uros-db · 2024-10-21T09:17:42Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

how about this: instead of checking whether the pattern equals the (?ui) prefix, we modify the collation implementation (the prefixing logic) so that it does not add the prefix at all when the pattern is an empty string

I agree with what you're saying, but should we consider that the user's pattern itself might be (?ui) and is unrelated to prefixing logic?

that is an interesting observation, although in that case I don't see why the user's pattern can't be any other flag modifier combination, such as: (?m), (?s), (?x), (?a)

taking this into consideration, there is really nothing special about lowercaseRegexPrefix. instead, you should look for a library method that can discern whether a pattern is "functionally" empty, instead of doing a manual check against lowercaseRegexPrefix

Thank you for your explanation. It looks like there’s no way to validate this "weird" situation without losing performance. I made changes according to your advice. Thanks again.
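The approach suggested in this thread, leaving an empty pattern unprefixed, can be sketched as follows. This is a hypothetical stand-in, not the code in CollationSupport: the class name, the constant name, and the method body are assumptions made for illustration, and only the "(?ui)" prefix itself comes from the discussion above.

```java
public class LowercaseRegexSketch {
    // Assumed to mirror the prefix discussed in this thread.
    static final String LOWERCASE_REGEX_PREFIX = "(?ui)";

    // Hypothetical prefixing logic: an empty pattern stays empty, so a
    // later "is the pattern empty?" check still recognizes it.
    static String lowercaseRegex(String regex) {
        return regex.isEmpty() ? regex : LOWERCASE_REGEX_PREFIX + regex;
    }

    public static void main(String[] args) {
        System.out.println("[" + lowercaseRegex("") + "]");   // []
        System.out.println("[" + lowercaseRegex("b") + "]");  // [(?ui)b]
    }
}
```

With this shape, a user-supplied pattern such as `(?ui)` or `(?m)` is still treated as an ordinary regex, which matches the conclusion reached in the thread.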

uros-db

passing on to @vitaliili-db @cloud-fan @MaxGekk for further review

github-actions bot added the SQL label Oct 15, 2024

DenineLu changed the title ~~[][] Fix the split function with limit being cut off incorrectly~~ [SPARK-49968][SQL] Fix the split function with limit being cut off incorrectly Oct 15, 2024

DenineLu force-pushed the fix_split_on_empty_regex branch 5 times, most recently from d611ac7 to 458fa12 Compare October 15, 2024 06:04

DenineLu changed the title ~~[SPARK-49968][SQL] Fix the split function with limit being cut off incorrectly~~ [SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit Oct 15, 2024

DenineLu force-pushed the fix_split_on_empty_regex branch from 458fa12 to fc9b461 Compare October 15, 2024 08:46

uros-db suggested changes Oct 16, 2024

DenineLu requested a review from uros-db October 17, 2024 11:34

uros-db reviewed Oct 17, 2024

...c/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala

uros-db reviewed Oct 17, 2024

sql/core/src/test/resources/sql-tests/inputs/string-functions.sql

uros-db reviewed Oct 17, 2024

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java Outdated

uros-db reviewed Oct 17, 2024

DenineLu force-pushed the fix_split_on_empty_regex branch 2 times, most recently from 9abf1c4 to 4ebf280 Compare October 18, 2024 03:02

DenineLu requested a review from uros-db October 18, 2024 06:32

uros-db reviewed Oct 18, 2024

uros-db approved these changes Oct 18, 2024

DenineLu force-pushed the fix_split_on_empty_regex branch from 3d295d8 to 5eb4889 Compare October 21, 2024 09:06

uros-db reviewed Oct 21, 2024

DenineLu force-pushed the fix_split_on_empty_regex branch from 5eb4889 to 7b240c3 Compare October 21, 2024 11:04

DenineLu requested a review from uros-db October 22, 2024 01:53

uros-db approved these changes Oct 22, 2024

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 25, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

d9f93ed

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 26, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

2d7d72b

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 26, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

b7a6737

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 26, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

6495bdb

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 29, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

5819819

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 29, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

b92e1e6

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 30, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

a451fff

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 30, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

3339d3b

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

03fa9a4

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

9d6a39b

see apache/spark#48470

baibaichen added a commit to apache/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

e49505f

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

9005d29

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

79d75e2

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

9c83f84

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

1cf65ec

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

a23210d

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

6802976

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

8c71909

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

e69a307

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

8c71d4f

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

ce4e7ef

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

fd50819

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

e87f1aa

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[4.1.0] Exclude SPLIT test in VeloxStringFunctionsSuite

c994fa5

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[4.1.0] Exclude split test in VeloxStringFunctionsSuite

031b12e

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[4.1.0] Exclude SPLIT test in GlutenRegexpExpressionsSuite

d4fa99b

see apache/spark#48470

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[4.1.0] Exclude split test in GlutenRegexpExpressionsSuite

586c8f1

see apache/spark#48470

This was referenced Jan 5, 2026

[GLUTEN-11346][CORE][VL] Add Spark 4.1 Shim Layer apache/gluten#11347

Merged

[GLUTEN-11343][CORE][VL] Support Spark 4.1 UT apache/gluten#11353

Merged

baibaichen mentioned this pull request Jan 13, 2026

[VL] Track on Spark-4.1.x failed unit tests apache/gluten#11400

Open
