Use mul+add+permute sequence for DotProduct when AVX is available #125666

Open
alexcovington wants to merge 4 commits into dotnet:main from alexcovington:avx-dotproduct

@alexcovington
Contributor

On x86 when AVX is available, it is generally more performant to calculate dot products using a multiply+permute+addition sequence instead of vdpps/vdppd.

This PR modifies lowering to use the multiply+permute+addition sequence if AVX is available.

| Namespace                       | Type                     | Method       | Job        | Toolchain                   | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------------------------------- |------------------------- |------------- |----------- |---------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.708 ns | 0.0591 ns | 0.0680 ns | 1.673 ns | 1.645 ns | 1.840 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Plane               | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0098 ns | 0.0076 ns | 1.296 ns | 1.284 ns | 1.308 ns |  0.76 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.655 ns | 0.0324 ns | 0.0333 ns | 1.638 ns | 1.628 ns | 1.740 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Quaternion          | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.302 ns | 0.0288 ns | 0.0308 ns | 1.287 ns | 1.278 ns | 1.373 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.664 ns | 0.0231 ns | 0.0205 ns | 1.667 ns | 1.632 ns | 1.709 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector2             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0167 ns | 0.0130 ns | 1.294 ns | 1.276 ns | 1.313 ns |  0.78 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.706 ns | 0.0923 ns | 0.1063 ns | 1.648 ns | 1.624 ns | 1.961 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Float      | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.314 ns | 0.0369 ns | 0.0425 ns | 1.302 ns | 1.273 ns | 1.420 ns |  0.77 |    0.05 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.476 ns | 0.0282 ns | 0.0313 ns | 1.474 ns | 1.443 ns | 1.534 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Double> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.131 ns | 0.0327 ns | 0.0377 ns | 1.116 ns | 1.098 ns | 1.219 ns |  0.77 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.652 ns | 0.0278 ns | 0.0260 ns | 1.651 ns | 1.620 ns | 1.710 ns |  1.00 |    0.00 |         - |          NA |
| System.Runtime.Intrinsics.Tests | Perf_Vector128Of<Single> | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.301 ns | 0.0238 ns | 0.0199 ns | 1.301 ns | 1.274 ns | 1.347 ns |  0.79 |    0.02 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.474 ns | 0.0163 ns | 0.0127 ns | 1.468 ns | 1.462 ns | 1.501 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Double>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.291 ns | 0.0109 ns | 0.0085 ns | 1.289 ns | 1.282 ns | 1.311 ns |  0.88 |    0.01 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.887 ns | 0.0756 ns | 0.0841 ns | 1.853 ns | 1.811 ns | 2.095 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_VectorOf<Single>    | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.295 ns | 0.0105 ns | 0.0082 ns | 1.293 ns | 1.286 ns | 1.311 ns |  0.69 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.704 ns | 0.0420 ns | 0.0467 ns | 1.702 ns | 1.641 ns | 1.806 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector3             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.331 ns | 0.0423 ns | 0.0488 ns | 1.317 ns | 1.283 ns | 1.412 ns |  0.78 |    0.03 |         - |          NA |
|                                 |                          |              |            |                             |          |           |           |          |          |          |       |         |           |             |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-IJGDOK | \base\Core_Root\corerun.exe | 1.675 ns | 0.0402 ns | 0.0430 ns | 1.666 ns | 1.633 ns | 1.781 ns |  1.00 |    0.00 |         - |          NA |
| System.Numerics.Tests           | Perf_Vector4             | DotBenchmark | Job-XUUJWJ | \diff\Core_Root\corerun.exe | 1.293 ns | 0.0133 ns | 0.0111 ns | 1.289 ns | 1.280 ns | 1.315 ns |  0.77 |    0.02 |         - |          NA |
Disasm

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdpps     xmm0,xmm0,xmm1,0FF
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,0B1
       vaddps    xmm0,xmm1,xmm0
       vpermilps xmm1,xmm0,4E
       vaddps    xmm0,xmm1,xmm0
       ret
; Total bytes of code 33
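
To make the immediates in the Diff listing easier to follow, here is a minimal scalar sketch of the `vmulps` / `vpermilps 0B1` / `vaddps` / `vpermilps 4E` / `vaddps` sequence. It is a model in plain C++, not JIT code and not real intrinsics; the names `Vec4`, `permute4`, `mul4`, `add4`, and `dot4_all_lanes` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // hypothetical stand-in for one xmm register of floats

// Scalar model of `vpermilps xmm, xmm, imm8`: result lane i takes the source
// lane selected by bits [2i+1 : 2i] of the immediate.
Vec4 permute4(const Vec4& v, unsigned imm8) {
    assert(imm8 <= 0xFF);  // the immediate is a single byte
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[i] = v[(imm8 >> (2 * i)) & 0x3];
    }
    return r;
}

// Lane-wise multiply (models vmulps).
Vec4 mul4(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * b[i];
    return r;
}

// Lane-wise add (models vaddps).
Vec4 add4(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + b[i];
    return r;
}

// Mirrors the order of operations in the Diff listing above.
Vec4 dot4_all_lanes(const Vec4& a, const Vec4& b) {
    Vec4 t = mul4(a, b);             // per-lane products
    t = add4(permute4(t, 0xB1), t);  // 0xB1 swaps lanes 0<->1 and 2<->3, then add
    t = add4(permute4(t, 0x4E), t);  // 0x4E swaps the two 64-bit halves, then add
    return t;                        // every lane now holds the full sum
}
```

With inputs such as `{1, 2, 3, 4}` and `{5, 6, 7, 8}`, every lane of the result holds the same dot product, which is the property that lets the sequence stand in for a full-mask `vdpps`.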

System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>

Base

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vdppd     xmm0,xmm0,xmm1,33
       ret
; Total bytes of code 15

Diff

; System.Runtime.Intrinsics.Tests.Perf_Vector128Of`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vpcmpeqd  xmm0,xmm0,xmm0
       vpcmpeqd  xmm1,xmm1,xmm1
       vmulpd    xmm0,xmm1,xmm0
       vpermilpd xmm1,xmm0,1
       vaddpd    xmm0,xmm1,xmm0
       ret
; Total bytes of code 23

System.Numerics.Tests.Perf_VectorOf<Single>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C5A4]
       vdpps     ymm0,ymm0,[0C5C0],0FF
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 33

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Single, System.Private.CoreLib]].DotBenchmark()
       vbroadcastss ymm0,dword ptr [0C8E8]
       vmulps    ymm0,ymm0,dword bcst [0C8EC]
       vpermilps ymm1,ymm0,0B1
       vaddps    ymm0,ymm1,ymm0
       vpermilps ymm1,ymm0,4E
       vaddps    ymm0,ymm0,ymm1
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddps    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 53
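
For the 256-bit case, the extra `vperm2f128` + `vaddps` step is needed because `vpermilps` on a ymm register only shuffles within each 128-bit lane. The following is a rough scalar sketch of that behaviour, again a plain C++ model rather than JIT code; `Vec8`, `permute_in_lane8`, `swap_halves8`, and `dot8_all_lanes` are hypothetical names used only here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec8 = std::array<float, 8>;  // hypothetical stand-in for one ymm register of floats

// Scalar model of `vpermilps ymm, ymm, imm8`: the same 4-element selection is
// applied independently to the low and the high 128-bit lane.
Vec8 permute_in_lane8(const Vec8& v, unsigned imm8) {
    assert(imm8 <= 0xFF);  // the immediate is a single byte
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        const std::size_t base = (i / 4) * 4;  // start of this element's 128-bit lane
        r[i] = v[base + ((imm8 >> (2 * (i % 4))) & 0x3)];
    }
    return r;
}

// Scalar model of `vperm2f128 ymm1, ymm0, ymm0, 1`: swaps the two 128-bit lanes.
Vec8 swap_halves8(const Vec8& v) {
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = v[(i + 4) % 8];
    return r;
}

// Lane-wise add (models vaddps).
Vec8 add8(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (std::size_t i = 0; i < 8; ++i) r[i] = a[i] + b[i];
    return r;
}

// Mirrors the order of operations in the Diff listing above.
Vec8 dot8_all_lanes(const Vec8& a, const Vec8& b) {
    Vec8 t{};
    for (std::size_t i = 0; i < 8; ++i) t[i] = a[i] * b[i];  // vmulps
    t = add8(permute_in_lane8(t, 0xB1), t);  // pairwise sums within each 128-bit lane
    t = add8(permute_in_lane8(t, 0x4E), t);  // each 128-bit lane now holds its own 4-element sum
    t = add8(swap_halves8(t), t);            // combine the two lanes: all 8 elements hold the total
    return t;
}
```

The Base listing needs the same final cross-lane step, since the 256-bit form of `vdpps` also computes a separate dot product per 128-bit lane.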

System.Numerics.Tests.Perf_VectorOf<Double>

Base

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E6D8]
       vmulpd    ymm0,ymm0,qword bcst [0E6E0]
       vhaddpd   ymm0,ymm0,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 37

Diff

; System.Numerics.Tests.Perf_VectorOf`1[[System.Double, System.Private.CoreLib]].DotBenchmark()
       vbroadcastsd ymm0,qword ptr [0E880]
       vmulpd    ymm0,ymm0,qword bcst [0E888]
       vpermilpd ymm1,ymm0,5
       vaddpd    ymm0,ymm1,ymm0
       vperm2f128 ymm1,ymm0,ymm0,1
       vaddpd    ymm0,ymm1,ymm0
       vzeroupper
       ret
; Total bytes of code 43

dotnet-policy-service (bot) added the community-contribution label (Indicates that the PR has been added by a community member) on Mar 17, 2026

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

tmp1 = m_compiler->gtNewSimdHWIntrinsicNode(simdType, op1, op2, idx, NI_AVX_DotProduct, simdBaseType,
simdSize);
BlockRange().InsertAfter(idx, tmp1);
tmp1 = m_compiler->gtNewSimdBinOpNode(GT_MUL, simdType, op1, op2, simdBaseType, simdSize);
Member

Can we instead lower this to SUM(MUL(op1, op2)) and then relower SUM, just to avoid the duplicative work of the log2(count) shuffle+add permutations?

Contributor Author

NI_X86Base_DotProduct has a SIMD output type, and the proposed sequence keeps SIMD types through the entire calculation, converting to a scalar only at the end if necessary. This makes the proposed sequence a "drop-in" replacement for cases where the NI_X86Base_DotProduct index uses all elements in a Vector128/256.

I think SUM(MUL(op1, op2)) would require gtNewSimdSumNode, which always outputs a scalar type from what I can tell. I don't know whether gtNewSimdSumNode generates a sequence that guarantees every element of the intermediate SIMD result is the same (which is how NI_X86Base_DotProduct behaves), so we would need either an extra broadcast every time or some other transformation to match the SIMD output of NI_X86Base_DotProduct. Duplicating the shuffle+add permutations avoids having to generate that extra broadcast.
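
The trade-off discussed in this thread can be sketched in scalar C++. This is only an illustration of the two shapes of computation, under the assumption that a sum reduction yields a single scalar; it does not reflect what gtNewSimdSumNode actually emits. The names `butterfly_sum`, `scalar_sum`, and `broadcast` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Butterfly reduction: log2(n) shuffle+add steps. Each step pairs lane i with
// lane i ^ stride, so after the last step every lane holds the total.
std::vector<float> butterfly_sum(std::vector<float> v) {
    const std::size_t n = v.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // lane count must be a power of two
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        std::vector<float> shuffled(n);
        for (std::size_t i = 0; i < n; ++i) shuffled[i] = v[i ^ stride];
        for (std::size_t i = 0; i < n; ++i) v[i] += shuffled[i];
    }
    return v;
}

// Assumed shape of a reduction that returns a scalar (illustrative only).
float scalar_sum(const std::vector<float>& v) {
    float total = 0.0f;
    for (float x : v) total += x;
    return total;
}

// The extra step needed to get back to a SIMD-shaped result from a scalar.
std::vector<float> broadcast(float x, std::size_t n) {
    return std::vector<float>(n, x);
}
```

In this model, `butterfly_sum(products)` already has the all-lanes-equal shape, while `broadcast(scalar_sum(products), n)` reaches the same shape only through the additional broadcast step.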
