Use mul+add+permute sequence for DotProduct when AVX is available by alexcovington · Pull Request #125666 · dotnet/runtime

alexcovington · 2026-03-17T17:35:17Z

On x86 with AVX available, computing a dot product with a multiply+permute+addition sequence is generally faster than using vdpps/vdppd.

This PR modifies lowering to use the multiply+permute+addition sequence if AVX is available.
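
As a rough illustration of the idea, here is a portable scalar model of the multiply+permute+addition sequence for four floats. It is a sketch only, not the JIT's lowering code: the names `Vec4`, `swap_pairs`, `swap_halves`, and `dot_all_lanes` are hypothetical, and each helper stands in for one SIMD instruction.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;  // stand-in for a 128-bit vector of floats

// Lane-wise multiply (models vmulps).
Vec4 mul(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) r[i] = a[i] * b[i];
    return r;
}

// Lane-wise add (models vaddps).
Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) r[i] = a[i] + b[i];
    return r;
}

// Swap adjacent lanes: [x0, x1, x2, x3] -> [x1, x0, x3, x2].
Vec4 swap_pairs(const Vec4& v) { return Vec4{v[1], v[0], v[3], v[2]}; }

// Swap the two 64-bit halves: [x0, x1, x2, x3] -> [x2, x3, x0, x1].
Vec4 swap_halves(const Vec4& v) { return Vec4{v[2], v[3], v[0], v[1]}; }

// Multiply, then two permute+add steps. Every lane ends up holding the dot product.
Vec4 dot_all_lanes(const Vec4& a, const Vec4& b) {
    Vec4 p = mul(a, b);
    Vec4 s = add(swap_pairs(p), p);  // [p0+p1, p0+p1, p2+p3, p2+p3]
    return add(swap_halves(s), s);   // p0+p1+p2+p3 in all four lanes
}
```

For example, `dot_all_lanes({1, 2, 3, 4}, {5, 6, 7, 8})` yields 70 in every lane, so the result can be consumed either as a vector or as a scalar taken from lane 0.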

| Namespace                       | Type                     | Method       | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------------------------------- |------------------------- |------------- |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.708 ns | 0.0591 ns | 0.0680 ns | 1.673 ns | 1.645 ns | 1.840 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0098 ns | 0.0076 ns | 1.296 ns | 1.284 ns | 1.308 ns |  0.76 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.655 ns | 0.0324 ns | 0.0333 ns | 1.638 ns | 1.628 ns | 1.740 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.302 ns | 0.0288 ns | 0.0308 ns | 1.287 ns | 1.278 ns | 1.373 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.664 ns | 0.0231 ns | 0.0205 ns | 1.667 ns | 1.632 ns | 1.709 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0167 ns | 0.0130 ns | 1.294 ns | 1.276 ns | 1.313 ns |  0.78 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.706 ns | 0.0923 ns | 0.1063 ns | 1.648 ns | 1.624 ns | 1.961 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.314 ns | 0.0369 ns | 0.0425 ns | 1.302 ns | 1.273 ns | 1.420 ns |  0.77 |    0.05 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.476 ns | 0.0282 ns | 0.0313 ns | 1.474 ns | 1.443 ns | 1.534 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.131 ns | 0.0327 ns | 0.0377 ns | 1.116 ns | 1.098 ns | 1.219 ns |  0.77 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.652 ns | 0.0278 ns | 0.0260 ns | 1.651 ns | 1.620 ns | 1.710 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.301 ns | 0.0238 ns | 0.0199 ns | 1.301 ns | 1.274 ns | 1.347 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.474 ns | 0.0163 ns | 0.0127 ns | 1.468 ns | 1.462 ns | 1.501 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.291 ns | 0.0109 ns | 0.0085 ns | 1.289 ns | 1.282 ns | 1.311 ns |  0.88 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.887 ns | 0.0756 ns | 0.0841 ns | 1.853 ns | 1.811 ns | 2.095 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0105 ns | 0.0082 ns | 1.293 ns | 1.286 ns | 1.311 ns |  0.69 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.704 ns | 0.0420 ns | 0.0467 ns | 1.702 ns | 1.641 ns | 1.806 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.331 ns | 0.0423 ns | 0.0488 ns | 1.317 ns | 1.283 ns | 1.412 ns |  0.78 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.675 ns | 0.0402 ns | 0.0430 ns | 1.666 ns | 1.633 ns | 1.781 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.293 ns | 0.0133 ns | 0.0111 ns | 1.289 ns | 1.280 ns | 1.315 ns |  0.77 |    0.02 |         - |          NA |
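
The Ratio column can be cross-checked by dividing each diff mean by its base mean. The helper below is a hypothetical sketch of that arithmetic; BenchmarkDotNet derives Ratio from its own statistics, so this is only a sanity check on the table, not its actual method.

```cpp
#include <cassert>
#include <cmath>

// Diff mean divided by base mean, rounded to two decimals like the Ratio column.
double mean_ratio(double base_ns, double diff_ns) {
    assert(base_ns > 0.0);
    return std::round(diff_ns / base_ns * 100.0) / 100.0;
}
```

With the Perf_Plane row, `mean_ratio(1.708, 1.295)` gives 0.76, and with the Perf_VectorOf<Single> row, `mean_ratio(1.887, 1.295)` gives 0.69, matching the table.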

Disasm

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdpps     xmm0,xmm0,xmm1,0FF
       ret
; Total bytes of code 15
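
For readers unfamiliar with the `0FF` immediate in the base code: in the 128-bit form of `vdpps`, the high four bits of the immediate select which lane products enter the sum and the low four bits select which result lanes receive it. The scalar model below is a sketch of that behavior; `dpps_model` is a hypothetical name and the code ignores floating-point exception and denormal details.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Scalar model of 128-bit dpps/vdpps:
//   imm8 bits 4..7 choose which lane products enter the sum,
//   imm8 bits 0..3 choose which result lanes receive the sum (others are zero).
Vec4 dpps_model(const Vec4& a, const Vec4& b, std::uint8_t imm8) {
    Vec4 p{};  // masked products; unselected lanes stay 0
    for (int i = 0; i < 4; ++i)
        if (imm8 & (1u << (4 + i))) p[i] = a[i] * b[i];

    // The products are summed pairwise: (p0 + p1) + (p2 + p3).
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        if (imm8 & (1u << i)) r[i] = sum;
    return r;
}
```

With `imm8 = 0xFF` all four products are summed and the sum is written to all four lanes, which is the property the replacement sequence has to preserve.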

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,0B1
       vaddps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,4E
       vaddps    xmm0,xmm1,xmm0
       ret
; Total bytes of code 33
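
The two `vpermilps` immediates in the diff, `0B1` and `4E`, can be decoded two bits per result lane. The sketch below models the immediate form on a single 128-bit lane; `permilps_model` is a hypothetical helper, not part of the runtime.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Scalar model of vpermilps with an immediate on one 128-bit lane:
// result lane i takes source element (imm8 >> (2 * i)) & 3.
Vec4 permilps_model(const Vec4& v, std::uint8_t imm8) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i) r[i] = v[(imm8 >> (2 * i)) & 3];
    return r;
}
```

Applied to `[10, 11, 12, 13]`, the immediate `0xB1` produces `[11, 10, 13, 12]` (adjacent lanes swapped) and `0x4E` produces `[12, 13, 10, 11]` (64-bit halves swapped), which are the two exchanges needed to reduce four lanes in two steps.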

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdppd     xmm0,xmm0,xmm1,33
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulpd    xmm0,xmm1,xmm0
       vpermilpd xmm1,xmm0,1
       vaddpd    xmm0,xmm1,xmm0
       ret
; Total bytes of code 23

System.Numerics.Tests.Perf_VectorOf<Single>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C5A4]
       vdpps     ymm0,ymm0,[0C5C0],0FF
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 33

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C8E8]
       vmulps    ymm0,ymm0,dword bcst [0C8EC]
       vpermilps ymm1,ymm0,0B1
       vaddps    ymm0,ymm1,ymm0
       vpermilps ymm1,ymm0,4E
       vaddps    ymm0,ymm0,ymm1
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 53
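
The 256-bit float sequence can be read as an in-lane reduction on each 128-bit half followed by one cross-half exchange. The scalar model below sketches that reading; `Vec8`, `permute_in_lane`, `swap_halves`, and `dot8_all_lanes` are hypothetical names that stand in for `vpermilps`, `vperm2f128 ..., 1` with both sources equal, and the whole sequence.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec8 = std::array<float, 8>;  // stand-in for a 256-bit vector of floats

Vec8 add(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = a[i] + b[i];
    return r;
}

// In-lane permute: the same 2-bit-per-element pattern is applied
// independently to each 128-bit half (models vpermilps ymm, imm8).
Vec8 permute_in_lane(const Vec8& v, unsigned imm8) {
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        std::size_t base = i & ~std::size_t{3};  // 0 for the low half, 4 for the high half
        r[i] = v[base + ((imm8 >> (2 * (i & 3))) & 3)];
    }
    return r;
}

// Swap the low and high 128-bit halves (models vperm2f128 v, v, 1).
Vec8 swap_halves(const Vec8& v) {
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = v[(i + 4) & 7];
    return r;
}

Vec8 dot8_all_lanes(const Vec8& a, const Vec8& b) {
    Vec8 p{};
    for (std::size_t i = 0; i < 8; ++i) p[i] = a[i] * b[i];
    Vec8 s = add(permute_in_lane(p, 0xB1), p);  // pair sums within each half
    s = add(s, permute_in_lane(s, 0x4E));       // each half holds its own 4-element sum
    return add(swap_halves(s), s);              // all 8 lanes hold the full sum
}
```

Three exchange+add steps cover eight elements, which matches the log2(count) step count mentioned in the review discussion.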

System.Numerics.Tests.Perf_VectorOf<Double>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E6D8]
       vmulpd    ymm0,ymm0,qword bcst [0E6E0]
       vhaddpd   ymm0,ymm0,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 37

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E880]
       vmulpd    ymm0,ymm0,qword bcst [0E888]
       vpermilpd ymm1,ymm0,5
       vaddpd    ymm0,ymm1,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 43
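
For the double-precision case, `vpermilpd` uses one immediate bit per result element, selecting the low or high double of that element's 128-bit lane. The sketch below models the 256-bit form and the resulting dot product; `Vec4d`, `permilpd_model`, and `dot4d_all_lanes` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4d = std::array<double, 4>;  // stand-in for a 256-bit vector of doubles

// Scalar model of vpermilpd ymm, imm8: result element i takes the low (bit = 0)
// or high (bit = 1) double of its own 128-bit lane, using imm8 bit i.
Vec4d permilpd_model(const Vec4d& v, unsigned imm8) {
    Vec4d r{};
    for (std::size_t i = 0; i < 4; ++i) {
        std::size_t base = i & ~std::size_t{1};  // first element of the 128-bit lane
        r[i] = v[base + ((imm8 >> i) & 1)];
    }
    return r;
}

Vec4d dot4d_all_lanes(const Vec4d& a, const Vec4d& b) {
    Vec4d p{};
    for (std::size_t i = 0; i < 4; ++i) p[i] = a[i] * b[i];

    Vec4d q = permilpd_model(p, 5);  // imm8 = 0101b swaps the pair in each lane
    Vec4d s{};
    for (std::size_t i = 0; i < 4; ++i) s[i] = q[i] + p[i];

    Vec4d r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = s[(i + 2) & 3] + s[i];  // swap halves, add
    return r;
}
```

The same decoding explains the 128-bit case: with two elements, an immediate of `1` swaps them, so a single permute+add is enough.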

dotnet-policy-service · 2026-03-17T17:36:27Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

tannergooding · 2026-03-17T19:24:16Z

src/coreclr/jit/lowerxarch.cpp

-                tmp1 = m_compiler->gtNewSimdHWIntrinsicNode(simdType, op1, op2, idx, NI_AVX_DotProduct, simdBaseType,
-                                                            simdSize);
-                BlockRange().InsertAfter(idx, tmp1);
+                tmp1 = m_compiler->gtNewSimdBinOpNode(GT_MUL, simdType, op1, op2, simdBaseType, simdSize);


Can we instead lower this to SUM(MUL(op1, op2)) and then relower SUM, just to avoid duplicating the log2(count) shuffle+add steps?

alexcovington · 2026-03-17

NI_X86Base_DotProduct has a SIMD output type, and the proposed sequence keeps SIMD types through the entire calculation, converting to a scalar only at the end if necessary. This makes the proposed sequence a "drop-in" replacement for cases where the NI_X86Base_DotProduct index uses all elements in a Vector128/256.

I think SUM(MUL(op1, op2)) would require gtNewSimdSumNode, which, from what I can tell, always outputs a scalar type. I don't know whether gtNewSimdSumNode generates a sequence that guarantees every element of the intermediate SIMD value is the same (which is how NI_X86Base_DotProduct works), so matching the SIMD output of NI_X86Base_DotProduct would always need an extra broadcast or some other transformation. Duplicating the shuffle+add steps avoids having to generate that extra broadcast.
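
The trade-off in this exchange can be sketched with a portable model. The first function is a butterfly reduction that leaves the sum in every lane after log2(N) exchange+add steps; the second returns the same vector by computing a scalar sum and then broadcasting it. Both names are hypothetical, and neither is claimed to match what gtNewSimdSumNode or the PR actually emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Butterfly reduction over N lanes (N a power of two). After log2(N)
// exchange+add steps every lane holds the full sum, so no broadcast is needed.
template <std::size_t N>
std::array<float, N> sum_all_lanes(std::array<float, N> v, std::size_t* steps = nullptr) {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
    std::size_t count = 0;
    for (std::size_t stride = 1; stride < N; stride *= 2, ++count) {
        std::array<float, N> partner{};
        for (std::size_t i = 0; i < N; ++i) partner[i] = v[i ^ stride];  // permute
        for (std::size_t i = 0; i < N; ++i) v[i] += partner[i];          // add
    }
    if (steps) *steps = count;
    return v;
}

// Alternative shape: reduce to a scalar first, then broadcast it back
// to a vector when a SIMD-typed result is required.
template <std::size_t N>
std::array<float, N> sum_then_broadcast(const std::array<float, N>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    std::array<float, N> r{};
    r.fill(s);  // the extra broadcast step
    return r;
}
```

For eight lanes the butterfly takes three steps and produces the same vector as the scalar sum plus broadcast (for inputs where rounding order does not matter), which is why keeping the reduction in SIMD form avoids the extra instruction.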

Use mul+add+permute sequence for DotProduct when AVX is available

2a32831

github-actions bot added the area-System.Memory label Mar 17, 2026

dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Mar 17, 2026

Correct TYP_DOUBLE case for Vector128

e6e5d54

tannergooding reviewed Mar 17, 2026

Fix typo in index

682188f

This was referenced Mar 18, 2026

[android] Android.Device_Emulator.JIT.Test failing on emulators with CoreCLR #112633

Open

[Android][CoreCLR] System.Security.Cryptography.Tests killed by lowmemorykiller #118603

Open

MsQuic fails with QUIC_STATUS_OUT_OF_MEMORY on AzureLinux #123216

Open

Mark correct node as unused

853508a

alexcovington wants to merge 4 commits into dotnet:main from alexcovington:avx-dotproduct

2 participants
