Use mul+add+permute sequence for DotProduct when AVX is available #125666
alexcovington wants to merge 4 commits into dotnet:main from
Tagging subscribers to this area: @dotnet/area-system-memory
tmp1 = m_compiler->gtNewSimdHWIntrinsicNode(simdType, op1, op2, idx, NI_AVX_DotProduct, simdBaseType,
                                            simdSize);
BlockRange().InsertAfter(idx, tmp1);
tmp1 = m_compiler->gtNewSimdBinOpNode(GT_MUL, simdType, op1, op2, simdBaseType, simdSize);
Can we instead lower this to SUM(MUL(op1, op2)) and then relower SUM, just to avoid the duplicative work of the log2(count) shuffle+add permutations?
NI_X86Base_DotProduct has a SIMD output type, and the proposed sequence uses SIMD types through the entire calculation until the end where it is only converted to a scalar if necessary. This makes the proposed sequence a "drop-in" replacement for cases where the NI_X86Base_DotProduct index uses all elements in a Vector128/256.
I think SUM(MUL(op1, op2)) would require gtNewSimdSumNode, which, from what I can tell, always outputs a scalar type. I don't know whether gtNewSimdSumNode generates a sequence that guarantees every element of the intermediate SIMD value is the same (which is how NI_X86Base_DotProduct behaves), so matching the SIMD output of NI_X86Base_DotProduct would always need either an extra broadcast or some other transformation. Duplicating the shuffle+add permutation work avoids generating that extra broadcast.
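To make the shape mismatch concrete, here is a minimal scalar model in C++ of the SUM(MUL(op1, op2)) route. It is only a sketch: `Vec4`, `mul`, `sum_to_scalar`, `broadcast`, `get_element`, and `dot_via_sum` are hypothetical names invented for illustration, not JIT APIs, and the model says nothing about what gtNewSimdSumNode actually emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical stand-in for a 128-bit vector of four floats.
using Vec4 = std::array<float, 4>;

// Lane-wise multiply of two vectors.
Vec4 mul(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = a[i] * b[i];
    }
    return r;
}

// Models a SUM node whose result type is a scalar, not a vector.
float sum_to_scalar(const Vec4& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Models the extra step needed to get back to a vector-typed result.
Vec4 broadcast(float x) {
    Vec4 r{};
    r.fill(x);
    return r;
}

// Reads one lane; a consumer of a vector-typed dot product may read any lane.
float get_element(const Vec4& v, std::size_t i) {
    assert(i < v.size());
    return v[i];
}

// SUM(MUL(a, b)) followed by a broadcast so that every lane holds the result.
Vec4 dot_via_sum(const Vec4& a, const Vec4& b) {
    const float s = sum_to_scalar(mul(a, b));
    return broadcast(s);
}
```

In this model the broadcast is the step the reply wants to avoid: a consumer that expects a vector with the dot product in every lane cannot use the scalar from `sum_to_scalar` directly.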
On x86 when AVX is available, it is generally more performant to calculate dot products using a multiply+permute+addition sequence instead of vdpps/vdppd. This PR modifies lowering to use the multiply+permute+addition sequence if AVX is available.
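As a rough illustration of the idea, the following C++ sketch models a multiply+permute+add dot product on a four-lane float vector using plain arrays rather than real intrinsics. The names (`Vec4`, `mul`, `add`, `permute_xor`, `dot_all_lanes`) and the specific XOR-based permute pattern are assumptions made for this example; the exact instructions and shuffle controls the PR emits may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 128-bit vector of four floats.
using Vec4 = std::array<float, 4>;

// Lane-wise multiply.
Vec4 mul(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = a[i] * b[i];
    }
    return r;
}

// Lane-wise add.
Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = a[i] + b[i];
    }
    return r;
}

// Permute with a fixed pattern: result lane i takes input lane (i XOR mask).
Vec4 permute_xor(const Vec4& v, std::size_t mask) {
    assert(mask < v.size());
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = v[i ^ mask];
    }
    return r;
}

// Dot product that stays in vector form: one multiply, then log2(4) = 2
// permute+add steps. Every lane of the result holds the full dot product.
Vec4 dot_all_lanes(const Vec4& a, const Vec4& b) {
    Vec4 t = mul(a, b);
    t = add(t, permute_xor(t, 1)); // combine neighbouring lanes
    t = add(t, permute_xor(t, 2)); // combine the two halves
    return t;
}
```

Because each step adds a permuted copy of the running value to itself, all lanes end up equal, so no separate broadcast is needed and a scalar can be read from any lane if required.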
Disasm
System.Runtime.Intrinsics.Tests.Perf_Vector128Of (Base, Diff)
System.Runtime.Intrinsics.Tests.Perf_VectorOf (Base, Diff)