port BitArray.CopyTo to xplat hardware intrinsics by adamsitnik · Pull Request #72788 · dotnet/runtime

adamsitnik · 2022-07-25T13:18:08Z

On x64 I have hit a performance regression:

BenchmarkDotNet=v0.13.1.1820-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

Method	Job	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	Job-RBRLTR	\PR\corerun.exe	512	3,048.2 ns	26.84
BitArrayCopyToBoolArray	Job-DRMRCR	\baseline\corerun.exe	512	113.1 ns	1.00

@tannergooding I am not sure if I read the disassembly properly:

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

ghost · 2022-07-25T13:18:27Z

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

On x64 I have hit a performance regression:

BenchmarkDotNet=v0.13.1.1820-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

Method	Job	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	Job-RBRLTR	\PR\corerun.exe	512	3,048.2 ns	26.84
BitArrayCopyToBoolArray	Job-DRMRCR	\baseline\corerun.exe	512	113.1 ns	1.00

@tannergooding I am not sure if I read the disassembly properly:

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Collections`
Milestone:	-

src/libraries/System.Collections/src/System/Collections/BitArray.cs

EgorBo · 2022-07-25T14:00:17Z

but it seems that Vector256.Shuffle is not producing optimal codegen? Is that expected, if so which issue tracks it?

It's a known issue, currently it only works fine when the shuffle mask parameter is a constant vector which is passed directly, not by a local (I assume in your code you can do it)

adamsitnik · 2022-07-25T15:08:48Z

It's a known issue, currently it only works fine when the shuffle mask parameter is a constant vector which is passed directly, not by a local (I assume in your code you can do it)

@EgorBo do you mean something like this?

- Vector256.Shuffle(scalar.AsByte(), shuffleMask);
+ Vector256.Shuffle(scalar.AsByte(), Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));

It does not solve the problem

EgorBo · 2022-07-25T15:12:13Z

Vector256.Shuffle(scalar.AsByte(), Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3));

Yes. If it does not solve the problem then it's worth filing an issue 🙂 should work, my bet that it's either not implemented as intrinsic (only fallback) for V256 or JIT hoists Vector256.Create from loop leading to bad codegen

adamsitnik · 2022-07-25T15:59:47Z

it's worth filing an issue

Done: #72793

tannergooding · 2022-07-25T17:17:58Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

-                if (Avx2.IsSupported)
+                if (Vector256.IsHardwareAccelerated)
                {
                    Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);


This is going to rely on the JIT doing "several" optimizations to be efficient and "may" pessimize certain scenarios due to current JIT limitations.

We should really just have Vector256<byte> shuffleMask = Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x02020202_02020202, 0x03030303_03030303).AsByte(); and then below likewise just initialize lowerShuffleMask and upperShuffleMask directly.

Actually noting that this should be Vector256.Create(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313).AsByte();.

As explained in #72793 (comment), Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128<T>.Count to ensure the "correct" indices are selected.

This is "logically the same" since all uint of the input vector are identical, but the JIT isn't advanced enough to know or track that.

Vector256.Shuffle treats it as 1x256-bit vector not as 2x128-bit vector like Avx2 does and so you need to offset the "upper" elements by Vector128.Count to ensure the "correct" indices are selected.

@tannergooding Thanks a lot! That was very helpful!

src/libraries/System.Collections/src/System/Collections/BitArray.cs

tannergooding · 2022-07-25T17:27:37Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

-                        }
+                        int bits = m_array[i / (uint)BitsPerInt32];
+                        Vector256<int> scalar = Vector256.Create(bits);
+                        Vector256<byte> shuffled = Vector256.Shuffle(scalar.AsByte(), shuffleMask);


You'll want to make this Vector256.Shuffle(scalar.AsByte(), Vector256.Create(cns, ..., cns).AsByte()) that way its interpreted directly as a GT_CNS_VEC and recognized as something it can treat intrinsically.

We ran into timing issues getting the full support for Shuffle in and so we can only recognize simple "constant" inputs. The recognition needs to be moved down to morph for more complex cases like the mask being indirect here.

If that's done, I would expect this correctly generates vpshufb dst, src, [addr_of_cns] (where [addr_of_cns] is likely hoisted).

…codegen as of now

…rivate const

…han field

tannergooding · 2022-07-26T15:29:05Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

+                    Vector128<byte> lowerShuffleMask = Vector128.Create(0, 0x01010101_01010101).AsByte();
+                    Vector128<byte> upperShuffleMask = Vector128.Create(0x02020202_02020202, 0x03030303_03030303).AsByte();
+
+                    Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask, upperShuffleMask);


Was it not generating the correct code with:

Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), Vector256.Create().AsByte(0x0000_0000_0000_0000, 0x0101_0101_0101_0101, 0x12121212_12121212, 0x13131313_13131313));

?

tannergooding · 2022-07-26T15:30:25Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

-                    ref byte destination = ref Unsafe.As<bool, byte>(ref Unsafe.Add<bool>(ref MemoryMarshal.GetArrayDataReference<bool>(boolArray), index));
-
-                    for (; (i + Vector256ByteCount) <= (uint)m_length; i += Vector256ByteCount)
+                    fixed (bool* destination = &boolArray[index])


It would be better to use Vector256 APIs for everything except shuffle if only shuffle has an issue (but local tests show it shouldn't).

…without big perf penalty

adamsitnik · 2022-07-27T08:05:20Z

@tannergooding I've run the benchmarks for x64 for 3 configs: AVX2 enabled (Vector256 code path), AVX2 disabled (Vector128 code path) and HI disabled (no vectors):

Vector256 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-KKTMRQ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-WXOJSJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	\pr\corerun.exe	4	17.65 ns	1.11
BitArrayCopyToBoolArray	\baseline\corerun.exe	4	16.01 ns	1.00

BitArrayCopyToBoolArray	\pr\corerun.exe	512	176.51 ns	1.17
BitArrayCopyToBoolArray	\baseline\corerun.exe	512	150.67 ns	1.00

There is 17% regression. I've profiled it with VTune:

It seems that instead of one shuffle operation, two are being performed.

@tannergooding edit: I assume that this is expected because of the way Vector256.Shuffle and Avx2.Shuffle treat the vector.

Vector128 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-FGKDJJ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-LCEERO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	\pr\corerun.exe	4	13.54 ns	1.01
BitArrayCopyToBoolArray	\baseline\corerun.exe	4	13.39 ns	1.00

BitArrayCopyToBoolArray	\pr\corerun.exe	512	177.09 ns	0.93
BitArrayCopyToBoolArray	\baseline\corerun.exe	512	190.79 ns	1.00

There is no regression after switching from Ssse3 to Vector128! There is even an improvement, but it's most likely caused by the micro optimizations I have applied based on code review suggestions.

No vectors

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 10 (10.0.18363.2212/1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT AVX2
  Job-ZOEWRR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-JCVZAO : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0

Method	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	\pr\corerun.exe	4	66.64 ns	1.00
BitArrayCopyToBoolArray	\baseline\corerun.exe	4	66.32 ns	1.00

BitArrayCopyToBoolArray	\pr\corerun.exe	512	5,493.61 ns	0.93
BitArrayCopyToBoolArray	\baseline\corerun.exe	512	5,896.87 ns	1.00

We can see an improvement that comes from accessing the array and length as local variables instead of fields.

src/libraries/System.Collections/src/System/Collections/BitArray.cs

adamsitnik · 2022-07-27T09:02:12Z

ARM64 results:

Vector128 code path

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
  Job-ENSRSO : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-MEXSMW : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	/PR/corerun	4	42.40 ns	1.04
BitArrayCopyToBoolArray	/baseline/corerun	4	40.81 ns	1.00

BitArrayCopyToBoolArray	/PR/corerun	512	613.74 ns	1.09
BitArrayCopyToBoolArray	/baseline/corerun	512	562.77 ns	1.00

There is a 9% regression but honestly, it's still better than I expected.

I'll try to get the disassembly with a profile, but it might take me a while (I need to update my Surface Pro X to Win 11 to be able to install VS 2022 ARM64, build dotnet/runtime and use the new internal ARM64 profiler).

No vectors

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22376.4
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT AdvSIMD
  Job-VPBVBQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
  Job-ZRUZHS : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT

EnvironmentVariables=COMPlus_EnableHWIntrinsic=0

Method	Toolchain	Size	Mean	Ratio
BitArrayCopyToBoolArray	/PR/corerun	4	625.5 ns	0.93
BitArrayCopyToBoolArray	/baseline/corerun	4	672.5 ns	1.00

BitArrayCopyToBoolArray	/PR/corerun	512	53,927.4 ns	0.90
BitArrayCopyToBoolArray	/baseline/corerun	512	60,200.5 ns	1.00

adamsitnik · 2022-08-19T14:36:18Z

I am closing as I was not able to avoid regressions. It's acceptable.

adamsitnik added 4 commits July 25, 2022 14:25

port BitArray.CopyTo to cross platform APIs

15f6a0b

remove ARM code path

c64ed21

Merge remote-tracking branch 'upstream/main' into bitArrayCopyTo

e402e8a

remove unnecessary usings

1a345d4

adamsitnik added the area-System.Collections label Jul 25, 2022

adamsitnik requested a review from tannergooding July 25, 2022 13:18

ghost assigned adamsitnik Jul 25, 2022

gfoidl reviewed Jul 25, 2022

View reviewed changes

src/libraries/System.Collections/src/System/Collections/BitArray.cs Outdated Show resolved Hide resolved

src/libraries/System.Collections/src/System/Collections/BitArray.cs Outdated Show resolved Hide resolved

adamsitnik mentioned this pull request Jul 25, 2022

extend printed Runtime Info with simple Hardware Intrinsics information dotnet/BenchmarkDotNet#2051

Merged

adamsitnik mentioned this pull request Jul 25, 2022

Vector256.Shuffle does not produce optimal codegen #72793

Closed

tannergooding reviewed Jul 25, 2022

View reviewed changes

src/libraries/System.Collections/src/System/Collections/BitArray.cs Outdated Show resolved Hide resolved

tannergooding reviewed Jul 25, 2022

View reviewed changes

adamsitnik added 5 commits July 26, 2022 14:43

restore Avx2 code path as Vector256.Shuffle does not produce optimal …

918baf1

…codegen as of now

optimize Vector128 codepath to avoid regression

7965d15

address code review feedback: use Vector128<int>.Count instead of a p…

7d4757c

…rivate const

address code review feedback: access array as local variable rather t…

2ea70af

…han field

address code review feedback: optimize loop

5aca325

tannergooding reviewed Jul 26, 2022

View reviewed changes

use magic bitmask provided by Tanner and port Avx2 code to Vector256 …

bf41da6

…without big perf penalty

gfoidl reviewed Jul 27, 2022

View reviewed changes

src/libraries/System.Collections/src/System/Collections/BitArray.cs Show resolved Hide resolved

adamsitnik closed this Aug 19, 2022

ghost locked as resolved and limited conversation to collaborators Sep 18, 2022

Conversation

adamsitnik commented Jul 25, 2022

Uh oh!

ghost commented Jul 25, 2022

Uh oh!

Uh oh!

Uh oh!

EgorBo commented Jul 25, 2022

Uh oh!

adamsitnik commented Jul 25, 2022

Uh oh!

EgorBo commented Jul 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adamsitnik commented Jul 25, 2022

Uh oh!

tannergooding Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

tannergooding Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

adamsitnik Jul 27, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tannergooding Jul 25, 2022

Choose a reason for hiding this comment

Uh oh!

tannergooding Jul 26, 2022

Choose a reason for hiding this comment

Uh oh!

tannergooding Jul 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamsitnik commented Jul 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Vector256 code path

Vector128 code path

No vectors

Uh oh!

Uh oh!

adamsitnik commented Jul 27, 2022

Vector128 code path

No vectors

Uh oh!

adamsitnik commented Aug 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

EgorBo commented Jul 25, 2022 •

edited

Loading

tannergooding Jul 26, 2022 •

edited

Loading

adamsitnik commented Jul 27, 2022 •

edited

Loading