Improve choice of IndexOfXx routine for some TryFindNextStartingPosition implementations #89099
Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones. When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or any non-ASCII character, and if it encounters a non-ASCII character, it falls back to a linear scan. This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching. For example, for the set `[^aA]`, which searches case-insensitively for anything other than an 'A', with this scheme we'd emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a'). This fixes the compiler / source generator to prefer such helpers when available.
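To make the comparison concrete, here is a minimal sketch that puts the two strategies side by side for the set `[^Aa]`. It assumes .NET 8 (for `SearchValues<char>`); the class, method, and field names (`SetSearchSketch`, `General`, `Direct`, `s_aA`) are hypothetical and are not what the source generator emits. `General` only mirrors the shape of the SearchValues-plus-fallback helper described above, and `Direct` is the single call this change prefers.

```csharp
using System;
using System.Buffers;

static class SetSearchSketch
{
    // Hypothetical field name; stands in for the generated SearchValues field.
    private static readonly SearchValues<char> s_aA = SearchValues.Create("Aa");

    // Shape of the general-purpose helper: a vectorized search first,
    // then a scalar scan once a non-ASCII character is encountered.
    public static int General(ReadOnlySpan<char> span)
    {
        int i = span.IndexOfAnyExcept(s_aA);
        if ((uint)i < (uint)span.Length)
        {
            if (char.IsAscii(span[i]))
            {
                return i;
            }

            do
            {
                // Set membership check for [^Aa]: fold case, compare to 'a'.
                if ((span[i] | 0x20) != 'a')
                {
                    return i;
                }
                i++;
            }
            while ((uint)i < (uint)span.Length);
        }

        return -1;
    }

    // The simpler alternative: one built-in call, no custom helper needed.
    public static int Direct(ReadOnlySpan<char> span) => span.IndexOfAnyExcept('A', 'a');

    static void Main()
    {
        // "Aa\u00E9" ends in a non-ASCII character to exercise the fallback path.
        foreach (string s in new[] { "AaAb", "AaAa", "bAa", "Aa\u00E9", "" })
        {
            Console.WriteLine($"\"{s}\": General={General(s)}, Direct={Direct(s)}");
        }
    }
}
```

For this particular set the two methods should agree on every input, which is why the dedicated helper is unnecessary here; the fallback shape only earns its keep for sets that no built-in IndexOfXx overload can express.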
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions

For example, previously `[GeneratedRegex(@"[^Aa]")]` would result in this being emitted:

    /// <summary>Finds the next index of any character that matches a character in the set [^Aa].</summary>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static int IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F(this ReadOnlySpan<char> span)
    {
        int i = span.IndexOfAnyExcept(Utilities.s_ascii_200000002000000);
        if ((uint)i < (uint)span.Length)
        {
            if (char.IsAscii(span[i]))
            {
                return i;
            }

            do
            {
                if (((span[i] | 0x20) != 'a'))
                {
                    return i;
                }
                i++;
            }
            while ((uint)i < (uint)span.Length);
        }

        return -1;
    }

    /// <summary>Supports searching for characters in or not in "Aa".</summary>
    internal static readonly SearchValues<char> s_ascii_200000002000000 = SearchValues.Create("Aa");

which is then used like:

    int i = inputSpan.Slice(pos).IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F();

Now, that method isn't emitted, and the usage ends up just being:

    int i = inputSpan.Slice(pos).IndexOfAnyExcept('A', 'a');
MihaZupan
left a comment
Nice!
I appreciate the extra comments :)
Fixes #84150