[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit#48470
[SPARK-49968][SQL] The split function produces incorrect results with an empty regex and a limit#48470DenineLu wants to merge 9 commits intoapache:masterfrom
Conversation
d611ac7 to
458fa12
Compare
458fa12 to
fc9b461
Compare
uros-db
left a comment
There was a problem hiding this comment.
thanks for making this change - however, please add collation-related tests as well, see:
test("StringSplit expression with collated strings")
in CollationRegexpExpressionsSuite.scala
Thank you for your guidance. The relevant tests have been added. |
...c/test/scala/org/apache/spark/sql/catalyst/expressions/CollationRegexpExpressionsSuite.scala
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Outdated
Show resolved
Hide resolved
| for (int i = 0; i < newLimit - 1; i++) { | ||
| int currCharNumBytes = numBytesForFirstByte(input[byteIndex]); | ||
| result[charIndex++] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes); | ||
| result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes); |
There was a problem hiding this comment.
| result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes); | |
| result[charIndex] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes); |
| result[i] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes); | ||
| byteIndex += currCharNumBytes; | ||
| } | ||
| result[newLimit - 1] = UTF8String.fromBytes(input, byteIndex, numBytes() - byteIndex); |
There was a problem hiding this comment.
is ArrayIndexOutOfBoundsException possible here?
what if newLimit=0 (i.e. numChars()=0, limit=-1)
There was a problem hiding this comment.
is
ArrayIndexOutOfBoundsExceptionpossible here? what if newLimit=0 (i.e. numChars()=0, limit=-1)
no, this code block will only be entered when the following conditions are met.
if (numBytes() != 0 && pattern.numBytes() == 0)
| SELECT split('hello', ''); | ||
| SELECT split('hello', '', 1); | ||
| SELECT split('hello', '', 3); | ||
| SELECT split('', ''); |
There was a problem hiding this comment.
I would also prefer to see:
SELECT split('', '', -1);
SELECT split('', '', 0);
SELECT split('', '', 1);
here, for more complete testing
There was a problem hiding this comment.
Thanks, already added.
9abf1c4 to
4ebf280
Compare
There was a problem hiding this comment.
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), -1), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 0), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1A2B3C"), 1), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2B3C"), 3), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3C"), 5), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_BINARY", Seq("1", "A", "2", "B", "3", "C"), 100), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), -1), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 0), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1A2B3C"), 1), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2B3C"), 3), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3C"), 5), | |
| StringSplitTestCase("1A2B3C", "", "UTF8_LCASE", Seq("1", "A", "2", "B", "3", "C"), 100), |
sorry, I meant to request using collation (other than UTF8_BINARY) here
There was a problem hiding this comment.
In the current situation, if UTF8_LCASE is applied to an empty string, the condition here will not be met because the value of pattern after being collated by collationAwareRegex is(?ui), meaning that #37631 does not support truncating the trailing empty string when the pattern is (?ui).
public UTF8String[] split(UTF8String pattern, int limit) {
// For the empty `pattern` a `split` function ignores trailing empty strings unless original
// string is empty.
if (numBytes() != 0 && pattern.numBytes() == 0) {
Therefore, it seems that the result is not what we want when limit <= 0.
select split('1A2B3C', '(?ui)', -1); // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 0); // result is ["1", "A", "2", "B", "3", "C", ""]
select split('1A2B3C', '(?ui)', 1); // result is ["1A2B3C"]
select split('1A2B3C', '(?ui)', 3); // result is ["1", "A", "2B3C"]
select split('1A2B3C', '(?ui)', 6); // result is ["1", "A", "2", "B", "3", "C"]
select split('1A2B3C', '(?ui)', 100); // result is ["1", "A", "2", "B", "3", "C"]
When the pattern is "(?ui)", a simple and direct approach can be taken to correct the result.
public UTF8String[] split(UTF8String pattern, int limit) {
// For the empty `pattern` a `split` function ignores trailing empty strings unless original
// string is empty.
if (numBytes() != 0 && (pattern.numBytes() == 0 || lowercaseRegexPrefix.equals(pattern))) {
However, when the pattern is "(?ui)(?ui)" or "(?ui)(?ui)(?ui)", the result still contains a trailing empty string, and I haven't thought of an efficient way to match and resolve it. Should we consider this a reasonable situation?
Additionally, do you think it is necessary to check in the CollationSupport.lowercaseRegex method whether the regex already has a (?ui) prefix?
uros-db
left a comment
There was a problem hiding this comment.
left just one more comment, otherwise lgtm (mostly focusing on collation behaviour)
adding @vitaliili-db @cloud-fan to carefully review, extending on #37631
3d295d8 to
5eb4889
Compare
There was a problem hiding this comment.
how about: instead of checking whether the pattern equals to the (?ui) prefix, we modify the collation implementation (prefixing logic) to avoid appending the prefix at all in the case that pattern is an empty string
There was a problem hiding this comment.
I agree with what you're saying, but should we consider that the user's pattern itself might be (?ui) and is unrelated to prefixing logic?
There was a problem hiding this comment.
that is an interesting observation, although in that case I don't see why the user's pattern can't be any other flag modifier combination, such as: (?m), (?s), (?x), (?a)
taking this into consideration, there is really nothing special about lowercaseRegexPrefix. instead, you should look for a library method that can discern whether a pattern is "functionally" empty, instead of doing a manual check against lowercaseRegexPrefix
There was a problem hiding this comment.
Thank you for your explanation. It looks like there’s no way to validate this "weird" situation without losing performance. I made changes according to your advice. Thanks again.
5eb4889 to
7b240c3
Compare
uros-db
left a comment
There was a problem hiding this comment.
passing on to @vitaliili-db @cloud-fan @MaxGekk for further review
What changes were proposed in this pull request?
After SPARK-40194, the current behavior of the split function is as follows:
However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.
So, the split function produces incorrect results with an empty regex and a limit. The correct result should be:
Why are the changes needed?
Fix correctness issue.
Does this PR introduce any user-facing change?
Yes.
When the empty regex parameter is provided along with a limit parameter greater than 0, the output of the split function changes.
Before this patch
After this patch
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.