Skip to content

port BitArray.CopyTo to xplat hardware intrinsics#72788

Closed
adamsitnik wants to merge 10 commits intodotnet:mainfrom
adamsitnik:bitArrayCopyTo
Closed

port BitArray.CopyTo to xplat hardware intrinsics#72788
adamsitnik wants to merge 10 commits intodotnet:mainfrom
adamsitnik:bitArrayCopyTo

Conversation

@adamsitnik
Copy link
Member

On x64 I have hit a performance regression:

BenchmarkDotNet=v0.13.1.1820-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Method Job Toolchain Size Mean Ratio
BitArrayCopyToBoolArray Job-RBRLTR \PR\corerun.exe 512 3,048.2 ns 26.84
BitArrayCopyToBoolArray Job-DRMRCR \baseline\corerun.exe 512 113.1 ns 1.00

@tannergooding I am not sure if I read the disassembly properly:

image

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

@ghost
Copy link

ghost commented Jul 25, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

On x64 I have hit a performance regression:

BenchmarkDotNet=v0.13.1.1820-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Method Job Toolchain Size Mean Ratio
BitArrayCopyToBoolArray Job-RBRLTR \PR\corerun.exe 512 3,048.2 ns 26.84
BitArrayCopyToBoolArray Job-DRMRCR \baseline\corerun.exe 512 113.1 ns 1.00

@tannergooding I am not sure if I read the disassembly properly:

image

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections

Milestone: -

@EgorBo
Copy link
Member

EgorBo commented Jul 25, 2022

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

It's a known issue, currently it only works fine when the shuffle mask parameter is a constant vector which is passed directly, not by a local (I assume in your code you can do it)

@adamsitnik
Copy link
Member Author

It's a known issue, currently it only works fine when the shuffle mask parameter is a constant vector which is passed directly, not by a local (I assume in your code you can do it)

@EgorBo do you mean something like this?

- Vector256.Shuffle(scalar.AsByte(), shuffleMask);
+ Vector256.Shuffle(scalar.AsByte(), Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));

It does not solve the problem

@EgorBo
Copy link
Member

EgorBo commented Jul 25, 2022

Vector256.Shuffle(scalar.AsByte(), Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));

Yes. If it does not solve the problem then it's worth filing an issue 🙂 should work, my bet that it's either not implemented as intrinsic (only fallback) for V256 or JIT hoists Vector256.Create from loop leading to bad codegen

@adamsitnik
Copy link
Member Author

it's worth filing an issue

Done: #72793

if (Avx2.IsSupported)
if (Vector256.IsHardwareAccelerated)
{
Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to rely on the JIT doing "several" optimizations to be efficient and "may" pessimize certain scenarios due to current JIT limitations.

We should really just have Vector256<byte> shuffleMask = Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x02020202_02020202, 0x03030303_03030303).AsByte(); and then below likewise just initialize lowerShuffleMask and upperShuffleMask directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually noting that this should be Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313).AsByte();.

As explained in #72793 (comment), Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128<T>.Count to ensure the "correct" indices are selected.

This is "logically the same" since all uint of the input vector are identical, but the JIT isn't advanced enough to know or track that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128.Count to ensure the "correct" indices are selected.

@tannergooding Thanks a lot! That was very helpful!

}
int bits = m_array[i / (uint)BitsPerInt32];
Vector256<int> scalar = Vector256.Create(bits);
Vector256<byte> shuffled = Vector256.Shuffle(scalar.AsByte(), shuffleMask);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to make this Vector256.Shuffle(scalar.AsByte(), Vector256.Create(cns, ..., cns).AsByte()) that way its interpreted directly as a GT_CNS_VEC and recognized as something it can treat intrinsically.

We ran into timing issues getting the full support for Shuffle in and so we can only recognize simple "constant" inputs. The recognition needs to be moved down to morph for more complex cases like the mask being indirect here.

If that's done, I would expect this correctly generates vpshufb dst, src, [addr_of_cns] (where [addr_of_cns] is likely hoisted).

Comment on lines +817 to +820
Vector128<byte> lowerShuffleMask = Vector128.Create(0, 0x01010101_01010101).AsByte();
Vector128<byte> upperShuffleMask = Vector128.Create(0x02020202_02020202, 0x03030303_03030303).AsByte();

Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask, upperShuffleMask);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it not generating the correct code with:

Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), Vector256.Create().AsByte(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313));

?

ref byte destination = ref Unsafe.As<bool, byte>(ref Unsafe.Add<bool>(ref MemoryMarshal.GetArrayDataReference<bool>(boolArray), index));

for (; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount)
fixed (bool* destination = &boolArray[index])
Copy link
Member

@tannergooding tannergooding Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to use Vector256 APIs for everything except shuffle if only shuffle has an issue (but local tests show it shouldn't).

@adamsitnik
Copy link
Member Author

adamsitnik commented Jul 27, 2022

@tannergooding I've run the benchmarks for x64 for 3 configs: AVX2 enabled (Vector256 code path), AVX2 disabled (Vector128 code path) and HI disabled (no vectors):

Vector256 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-KKTMRQ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-WXOJSJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Method Toolchain Size Mean Ratio
BitArrayCopyToBoolArray \pr\corerun.exe 4 17.65 ns 1.11
BitArrayCopyToBoolArray \baseline\corerun.exe 4 16.01 ns 1.00
BitArrayCopyToBoolArray \pr\corerun.exe 512 176.51 ns 1.17
BitArrayCopyToBoolArray \baseline\corerun.exe 512 150.67 ns 1.00

There is 17% regression. I've profiled it with VTune:

image

It seems that instead of one shuffle operation, two are being performed.

@tannergooding edit: I assume that this is expected because of the way Vector256.Shuffle and Avx2.Shuffle treat the vector.

Vector128 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-FGKDJJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-LCEERO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0
Method Toolchain Size Mean Ratio
BitArrayCopyToBoolArray \pr\corerun.exe 4 13.54 ns 1.01
BitArrayCopyToBoolArray \baseline\corerun.exe 4 13.39 ns 1.00
BitArrayCopyToBoolArray \pr\corerun.exe 512 177.09 ns 0.93
BitArrayCopyToBoolArray \baseline\corerun.exe 512 190.79 ns 1.00

There is no regression after switching from Ssse3 to Vector128! There is even an improvement, but it's most likely caused by the micro optimizations I have applied based on code review suggestions.

No vectors

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-ZOEWRR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-JCVZAO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
Method Toolchain Size Mean Ratio
BitArrayCopyToBoolArray \pr\corerun.exe 4 66.64 ns 1.00
BitArrayCopyToBoolArray \baseline\corerun.exe 4 66.32 ns 1.00
BitArrayCopyToBoolArray \pr\corerun.exe 512 5,493.61 ns 0.93
BitArrayCopyToBoolArray \baseline\corerun.exe 512 5,896.87 ns 1.00

We can see an improvement that comes from accessing the array and length as local variables instead of fields.

@adamsitnik
Copy link
Member Author

ARM64 results:

Vector128 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
  Job-ENSRSO : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MEXSMW : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Size Mean Ratio
BitArrayCopyToBoolArray /PR/corerun 4 42.40 ns 1.04
BitArrayCopyToBoolArray /baseline/corerun 4 40.81 ns 1.00
BitArrayCopyToBoolArray /PR/corerun 512 613.74 ns 1.09
BitArrayCopyToBoolArray /baseline/corerun 512 562.77 ns 1.00

There is a 9% regression but honestly, it's still better than I expected.

I'll try to get the disassembly with a profile, but it might take me a while (I need to update my Surface Pro X to Win 11 to be able to install VS 2022 ARM64, build dotnet/runtime and use the new internal ARM64 profiler).

No vectors

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
  Job-VPBVBQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
  Job-ZRUZHS : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
Method Toolchain Size Mean Ratio
BitArrayCopyToBoolArray /PR/corerun 4 625.5 ns 0.93
BitArrayCopyToBoolArray /baseline/corerun 4 672.5 ns 1.00
BitArrayCopyToBoolArray /PR/corerun 512 53,927.4 ns 0.90
BitArrayCopyToBoolArray /baseline/corerun 512 60,200.5 ns 1.00

@adamsitnik
Copy link
Member Author

I am closing as I was not able to avoid regressions. It's acceptable.

@adamsitnik adamsitnik closed this Aug 19, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants