[Mono] [Arm64] Added SIMD support for vector 2/3/4 methods #98761
jkurdek merged 16 commits into dotnet:main
Conversation
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).
matouskozak
left a comment
Looking good, thanks for implementing these. I left a few comments with questions.
Before merging, please check that fullAOT LLVM compiles these new intrinsics correctly.
src/mono/mono/mini/simd-intrinsics.c
Outdated
case SN_Lerp: {
#if defined (TARGET_ARM64)
    MonoInst* v1 = args [1];
    if (!strcmp ("Quaternion", m_class_get_name (klass))) {
Quaternion.Lerp is not marked as an intrinsic in the libraries:
However, if this implementation generates better codegen we should probably keep it.
The intrinsified version is around 40% faster on my machine.
Why is the intrinsified version faster here? Is it fundamentally doing something differently from the managed implementation or is there potentially a missing JIT optimization?
Or perhaps a change is simply missing on the managed side, so it's using scalar logic rather than any actual vectorization, and a better fix would be to update the managed implementation?
We've typically tried to keep a clear separation between intrinsic functionality and more complex methods.
APIs like operator + or Sqrt generally map to exactly one hardware instruction, and this is the case on most platforms.
APIs like DotProduct or even Create may map to exactly one hardware instruction on some platforms and are fairly "core" to the general throughput considerations of many platforms.
APIs like Quaternion.Lerp or CopyTo are more complex methods that use multiple instructions on all platforms and may even require branching or masking logic. So we've typically tried to keep them in managed code and have them use the intrinsic APIs instead.
I agree that we should align Mono's behavior with CoreCLR and not intrinsify Quaternion.Lerp or CopyTo for Mono either.
Why is the intrinsified version faster here? Is it fundamentally doing something differently from the managed implementation or is there potentially a missing JIT optimization?
In general, Mono's mini JIT doesn't have optimizations as comprehensive as CoreCLR's RyuJIT.
src/mono/mono/mini/simd-intrinsics.c
Outdated
MonoInst *sum = emit_simd_ins (cfg, klass, OP_ARM64_XADDV, arg->dreg, -1);
sum->inst_c0 = INTRINS_AARCH64_ADV_SIMD_FADDV;
sum->inst_c1 = MONO_TYPE_R4;
Worth noting that Arm has typically pushed us away from using FADDV, as it does not perform well on some hardware.
Instead, they had us use a sequence of FADDP (AddPairwise) instructions, which tend to have better perf/throughput: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L25190-L25210
Thanks for sharing the information, @tannergooding. @jkurdek Feel free to create an issue to address it in a future PR.
Implements #91394.