port BitArray.CopyTo to xplat hardware intrinsics#72788
port BitArray.CopyTo to xplat hardware intrinsics#72788adamsitnik wants to merge 10 commits intodotnet:mainfrom
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-collections Issue DetailsOn x64 I have hit a performance regression: BenchmarkDotNet=v0.13.1.1820-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
@tannergooding I am not sure if I read the disassembly properly: but it seems that
|
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
It's a known issue, currently it only works fine when the shuffle mask parameter is a constant vector which is passed directly, not by a local (I assume in your code you can do it) |
@EgorBo do you mean something like this? - Vector256.Shuffle(scalar.AsByte(), shuffleMask);
+ Vector256.Shuffle(scalar.AsByte(), Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));It does not solve the problem |
Yes. If it does not solve the problem then it's worth filing an issue 🙂 should work, my bet that it's either not implemented as intrinsic (only fallback) for V256 or JIT hoists |
Done: #72793 |
| if (Avx2.IsSupported) | ||
| if (Vector256.IsHardwareAccelerated) | ||
| { | ||
| Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray); |
There was a problem hiding this comment.
This is going to rely on the JIT doing "several" optimizations to be efficient and "may" pessimize certain scenarios due to current JIT limitations.
We should really just have Vector256<byte> shuffleMask = Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x02020202_02020202, 0x03030303_03030303).AsByte(); and then below likewise just initialize lowerShuffleMask and upperShuffleMask directly.
There was a problem hiding this comment.
Actually noting that this should be Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313).AsByte();.
As explained in #72793 (comment), Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128<T>.Count to ensure the "correct" indices are selected.
This is "logically the same" since all uint of the input vector are identical, but the JIT isn't advanced enough to know or track that.
There was a problem hiding this comment.
Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128.Count to ensure the "correct" indices are selected.
@tannergooding Thanks a lot! That was very helpful!
src/libraries/System.Collections/src/System/Collections/BitArray.cs
Outdated
Show resolved
Hide resolved
| } | ||
| int bits = m_array[i / (uint)BitsPerInt32]; | ||
| Vector256<int> scalar = Vector256.Create(bits); | ||
| Vector256<byte> shuffled = Vector256.Shuffle(scalar.AsByte(), shuffleMask); |
There was a problem hiding this comment.
You'll want to make this Vector256.Shuffle(scalar.AsByte(), Vector256.Create(cns, ..., cns).AsByte()) that way its interpreted directly as a GT_CNS_VEC and recognized as something it can treat intrinsically.
We ran into timing issues getting the full support for Shuffle in and so we can only recognize simple "constant" inputs. The recognition needs to be moved down to morph for more complex cases like the mask being indirect here.
If that's done, I would expect this correctly generates vpshufb dst, src, [addr_of_cns] (where [addr_of_cns] is likely hoisted).
…codegen as of now
| Vector128<byte> lowerShuffleMask = Vector128.Create(0, 0x01010101_01010101).AsByte(); | ||
| Vector128<byte> upperShuffleMask = Vector128.Create(0x02020202_02020202, 0x03030303_03030303).AsByte(); | ||
|
|
||
| Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask, upperShuffleMask); |
There was a problem hiding this comment.
Was it not generating the correct code with:
Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), Vector256.Create().AsByte(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313));?
| ref byte destination = ref Unsafe.As<bool, byte>(ref Unsafe.Add<bool>(ref MemoryMarshal.GetArrayDataReference<bool>(boolArray), index)); | ||
|
|
||
| for (; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount) | ||
| fixed (bool* destination = &boolArray[index]) |
There was a problem hiding this comment.
It would be better to use Vector256 APIs for everything except shuffle if only shuffle has an issue (but local tests show it shouldn't).
…without big perf penalty
|
@tannergooding I've run the benchmarks for x64 for 3 configs: AVX2 enabled (Vector256 code path), AVX2 disabled (Vector128 code path) and HI disabled (no vectors): Vector256 code pathBenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
[Host] : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
Job-KKTMRQ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-WXOJSJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
There is 17% regression. I've profiled it with VTune: It seems that instead of one shuffle operation, two are being performed. @tannergooding edit: I assume that this is expected because of the way Vector128 code pathBenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
[Host] : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
Job-FGKDJJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-LCEERO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
There is no regression after switching from Ssse3 to Vector128! There is even an improvement, but it's most likely caused by the micro optimizations I have applied based on code review suggestions. No vectorsBenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
[Host] : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
Job-ZOEWRR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-JCVZAO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
We can see an improvement that comes from accessing the array and length as local variables instead of fields. |
|
ARM64 results: Vector128 code pathBenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
[Host] : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
Job-ENSRSO : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-MEXSMW : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
There is a 9% regression but honestly, it's still better than I expected. I'll try to get the disassembly with a profile, but it might take me a while (I need to update my Surface Pro X to Win 11 to be able to install VS 2022 ARM64, build dotnet/runtime and use the new internal ARM64 profiler). No vectorsBenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
[Host] : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
Job-VPBVBQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
Job-ZRUZHS : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
|
|
I am closing as I was not able to avoid regressions. It's acceptable. |


On x64 I have hit a performance regression:
@tannergooding I am not sure if I read the disassembly properly:
but it seems that
Vector256.Shuffleis not producing optimal codegen? Is that expected, if so which issue tracks it?