Improving the SIMD codegen for SIMD12 load/store #80083
tannergooding merged 9 commits into dotnet:main
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
Issue Details
This adds support for containment on TYP_SIMD12 loads/stores and improves the codegen to require fewer temporary registers and to use better instructions when available.
Improved Load:
- lea rax, bword ptr [rcx+30H]
- vmovss xmm0, dword ptr [rax+08H]
- vmovsd xmm1, qword ptr [rax]
- vshufps xmm1, xmm0, 68
+ vmovsd xmm0, qword ptr [rcx+30H]
+ vinsertps xmm0, dword ptr [rcx+38H], 2
Improved Store:
- vmovsd qword ptr [rdx], xmm1
- vpshufd xmm0, xmm1, 2
- vmovss dword ptr [rdx+08H], xmm0
+ vmovsd qword ptr [rdx], xmm0
+ vextractps dword ptr [rdx+08H], xmm0, 2
Combined, this saves 9 bytes of codegen and improves the PerfScore by 1.5.
Total diffs are all relatively similar: they emit vmovsd + vinsertps or vmovsd + vextractps, and remove the now-unnecessary lea in favor of containing the address in the memory operands.
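To make the load and store diffs concrete, here is a small C sketch that models each instruction on a four-lane float array and checks that the old and new sequences leave the same values in the register and in memory. It is a hypothetical scalar model, not JIT code: `xmm_t` and all helper names are invented for illustration. Immediates follow the Intel SDM encoding: `68` is `0x44` for `shufps`, and for `insertps` bits 5:4 select the destination lane and bits 3:0 form a zero mask, so inserting into lane 2 is written as `0x20` here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit xmm register as four float lanes. */
typedef struct { float f[4]; } xmm_t;

/* vmovss xmm, m32: lane 0 = 4 bytes from memory, lanes 1..3 zeroed. */
static xmm_t movss_load(const uint8_t *p) {
    xmm_t r = {{0}};
    memcpy(&r.f[0], p, 4);
    return r;
}

/* vmovsd xmm, m64: lanes 0..1 = 8 bytes from memory, lanes 2..3 zeroed. */
static xmm_t movsd_load(const uint8_t *p) {
    xmm_t r = {{0}};
    memcpy(r.f, p, 8);
    return r;
}

/* shufps: lanes 0..1 come from a, lanes 2..3 from b, each chosen by a 2-bit field of imm. */
static xmm_t shufps(xmm_t a, xmm_t b, uint8_t imm) {
    xmm_t r;
    r.f[0] = a.f[imm & 3];
    r.f[1] = a.f[(imm >> 2) & 3];
    r.f[2] = b.f[(imm >> 4) & 3];
    r.f[3] = b.f[(imm >> 6) & 3];
    return r;
}

/* insertps with a memory source: imm bits 5:4 pick the destination lane,
   bits 3:0 are a mask of lanes zeroed afterwards. */
static xmm_t insertps_mem(xmm_t a, const uint8_t *p, uint8_t imm) {
    memcpy(&a.f[(imm >> 4) & 3], p, 4);
    for (int i = 0; i < 4; i++)
        if (imm & (1u << i)) a.f[i] = 0.0f;
    return a;
}

/* pshufd: destination dword i = source dword selected by bits 2i+1:2i of imm. */
static xmm_t pshufd(xmm_t a, uint8_t imm) {
    xmm_t r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[(imm >> (2 * i)) & 3];
    return r;
}

/* vmovsd m64, xmm / vmovss m32, xmm: store the low 8 or 4 bytes. */
static void movsd_store(uint8_t *p, xmm_t v) { memcpy(p, v.f, 8); }
static void movss_store(uint8_t *p, xmm_t v) { memcpy(p, &v.f[0], 4); }

/* vextractps m32, xmm, imm: store lane imm[1:0] directly to memory. */
static void extractps_store(uint8_t *p, xmm_t v, uint8_t imm) {
    memcpy(p, &v.f[imm & 3], 4);
}

/* Old load: vmovss [base+8]; vmovsd [base]; vshufps with 68 (0x44). */
static xmm_t load_simd12_old(const uint8_t *base) {
    xmm_t z = movss_load(base + 8);
    xmm_t xy = movsd_load(base);
    return shufps(xy, z, 0x44);
}

/* New load: vmovsd [base]; vinsertps [base+8] into lane 2. */
static xmm_t load_simd12_new(const uint8_t *base) {
    return insertps_mem(movsd_load(base), base + 8, 0x20);
}

/* Old store: vmovsd [base]; vpshufd 2 moves lane 2 to lane 0 of a temp; vmovss [base+8]. */
static void store_simd12_old(uint8_t *base, xmm_t v) {
    movsd_store(base, v);
    movss_store(base + 8, pshufd(v, 2));
}

/* New store: vmovsd [base]; vextractps writes lane 2 straight to [base+8]. */
static void store_simd12_new(uint8_t *base, xmm_t v) {
    movsd_store(base, v);
    extractps_store(base + 8, v, 2);
}
```

Both load shapes yield (x, y, z, 0) in the register. The new load reads z from memory directly into lane 2, and the new store writes lane 2 to memory directly, so neither needs a second xmm register as a temporary.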
CC @dotnet/jit-contrib, this should be ready for review. It gives some small size savings on x64 (~2k bytes in FullOpts and ~0.5k bytes in MinOpts) and a small throughput (TP) win on x64.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
Fixed the jitstress failure. This results in ~3.3k bytes of savings on x86/x64 and a -0.01% TP improvement.
TIHan left a comment:
This looks good to me. There seems to be a failure for ARM though, but it looks like just one test case at the moment.
It's an unrelated, pre-existing GC timeout. I've retriggered it, and it should pass on rerun.
The branch was force-pushed from 5c4afad to ca3ada1.
The branch was force-pushed from ca3ada1 to 7bee874.
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
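This adds support for containment on TYP_SIMD12 loads/stores and improves the codegen to require fewer temporary registers and to use better instructions when available.
Improved Load:
- lea rax, bword ptr [rcx+30H]
- vmovss xmm0, dword ptr [rax+08H]
- vmovsd xmm1, qword ptr [rax]
- vshufps xmm1, xmm0, 68
+ vmovsd xmm0, qword ptr [rcx+30H]
+ vinsertps xmm0, dword ptr [rcx+38H], 2
Improved Store:
- vmovsd qword ptr [rdx], xmm1
- vpshufd xmm0, xmm1, 2
- vmovss dword ptr [rdx+08H], xmm0
+ vmovsd qword ptr [rdx], xmm0
+ vextractps dword ptr [rdx+08H], xmm0, 2
Combined, this saves 9 bytes of codegen and improves the PerfScore by 1.5.
Total diffs are all relatively similar: they emit vmovsd + vinsertps or vmovsd + vextractps, and remove the now-unnecessary lea in favor of containing the address in the memory operands.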
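The lea removal can be pictured the same way. The C sketch below models only the address arithmetic; a C compiler may well emit identical code for both functions. `object_t`, `vec3_t`, and the 0x30 field offset are hypothetical, chosen to match the `[rcx+30H]` / `[rcx+38H]` operands in the load diff. Without containment, the field address is first materialized into a temporary and both halves of the 12-byte value are addressed relative to it; with containment, each half folds base + displacement into its own memory operand (8 bytes at disp, 4 bytes at disp + 8).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A 12-byte, three-float value: the shape the JIT types as TYP_SIMD12. */
typedef struct { float x, y, z; } vec3_t;
_Static_assert(sizeof(vec3_t) == 12, "vec3_t must be 12 bytes");

/* Hypothetical object whose vec3_t field sits at offset 0x30. */
typedef struct {
    uint8_t header[0x30];
    vec3_t pos;
} object_t;
_Static_assert(offsetof(object_t, pos) == 0x30, "pos must be at offset 0x30");

/* Uncontained shape: compute the field address into a temporary (the lea),
   then address both halves relative to that temporary. */
static vec3_t load_pos_with_lea(const object_t *obj) {
    const uint8_t *tmp = (const uint8_t *)obj + 0x30; /* lea rax, [rcx+30H] */
    vec3_t v;
    memcpy((uint8_t *)&v + 8, tmp + 8, 4);            /* dword ptr [rax+08H] */
    memcpy(&v, tmp, 8);                               /* qword ptr [rax] */
    return v;
}

/* Contained shape: each access carries base + displacement itself,
   so no temporary address register is needed. */
static vec3_t load_pos_contained(const object_t *obj) {
    const uint8_t *base = (const uint8_t *)obj;       /* rcx */
    vec3_t v;
    memcpy(&v, base + 0x30, 8);                       /* qword ptr [rcx+30H] */
    memcpy((uint8_t *)&v + 8, base + 0x38, 4);        /* dword ptr [rcx+38H] */
    return v;
}
```

Both functions read the same 12 bytes; the contained form simply expresses the second access as disp + 8 off the original base instead of going through a separately computed address.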