[LLVMCPU][AArch64] New heuristic for matmul vector tile sizes#22932
hanhanW merged 6 commits into iree-org:main from …
Conversation
Thank you for working on this, @momchil-velikov! Approving as-is - we discussed this offline and the heuristics make sense to me. The experiments you ran (various matmul kernels + GPT-2) show clear benefit. IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn't appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary. I appreciate that there might be other (more generic?) approaches to this, but I don't have any specific suggestions myself. Given the observed performance improvements on the tested workloads, I am in favor. It would be great to extend the PR summary with a bit more detail.
Thanks again! @hanhanW, what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.
hanhanW
left a comment
> IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn't appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary.
+1 for making the summary better.
> Thanks again! @hanhanW, what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.
I haven't touched this code for a while, either. IMO, some of the tile sizes were driven by old hardware or workloads, so some of them look legacy to me. I don't have a strong opinion about changing this code as long as it does not significantly impact benchmarks, although we don't have any benchmarks at the moment.
```cpp
scalableSizeFlags.append({false,
                          isScalableVectorizationEnabled() &&
                              hasAnySVEFeature(targetAttr.getConfiguration()),
                          false});
```
If you are passing a linalgOp (not a MatmulOp), it is better to infer the M/N/K dims. The other approach would be bailing out at the entry if the op is not a matmul op.
I prefer returning the result instead of passing by reference, though. In any case, can you split it into two main statements for readability? E.g.:

- Initialize with three `false` values.
- Set the N dim to `true` if it meets the condition.

```cpp
vec.resize(3, false);
if (...) {
  vec[1] = true;
}
```
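To make the suggestion concrete, here is a minimal sketch of the return-by-value shape; the function and parameter names are hypothetical, not the actual IREE API:

```cpp
#include <array>
#include <cassert>

// Hypothetical sketch of the suggested control flow: start with all-false
// scalability flags for (M, N, K) and enable only the N dimension when
// scalable vectorization applies. Names are illustrative, not IREE's.
std::array<bool, 3> getScalableTileFlags(bool scalableEnabled, bool hasSVE) {
  std::array<bool, 3> flags{};  // (1) Initialize with three `false` values.
  if (scalableEnabled && hasSVE)
    flags[1] = true;            // (2) Set the N dim (index 1) to `true`.
  return flags;                 // Returned by value, not via a reference arg.
}
```

Returning by value also makes the (M, N, K) element convention visible at every call site instead of hiding it behind an out-parameter.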
> infer M/N/K dims.
I'm afraid I don't quite understand what this means.
Looking at it from the point of view of the function I'm replacing, there's no change in semantics per se, just choosing different values, with the convention that in the returned vectors element 0 is M, element 1 is N, and element 2 is K (M, N, and K themselves having the conventional meaning). That looks common across most (all?) other functions in the file that suggest tile sizes.
> The other approach can be bailing out in the entry if the op is not a matmul op.
We know we have a contraction (ContractionOpInterface) with a single reduction dimension; is the choice going to be bad for all such contraction ops that are not a matmul?
Maybe for some, but then I'd rather enhance the heuristic if we stumble upon such cases.
> I'm afraid I don't quite understand what does this mean.
> Looking from the point of view of the function I'm replacing, there's no change in semantics, per se, just choosing different values with the convention that in the returned vectors the element 0 is M, the element 1 is N, and element 2 is K (M, N, and K themselves having conventional meaning). That looks common across most (all?) other functions in the file that suggest tile sizes.
Thanks @banach-space for the pointer! I was not clear about checking MatmulOp because I forgot that a MatmulOp now takes indexing maps: llvm/llvm-project@d152808
Some of this code is pretty old and has not evolved because we're understaffed for some backends. IMO, we should bail out if we rely on such an assumption; otherwise, people may run into issues easily. E.g., you'll generate the same lowering config if you replace a matmul with the same matmul that has different indexing maps:
```mlir
// The shapes are MxK, NxK, MxN.
%5 = linalg.matmul
       indexing_maps = [
         affine_map<(d0, d1, d2) -> (d0, d2)>,
         affine_map<(d0, d1, d2) -> (d1, d2)>,
         affine_map<(d0, d1, d2) -> (d0, d1)>
       ]
       ins(%1, %2 : tensor<512x128xi8>, tensor<512x128xi8>) outs(%fill : tensor<512x512xi32>) -> tensor<512x512xi32>
```
We could add a utility function that checks whether a linalg::LinalgOp is in the common matmul form and use it here; that would address my concern.
I'll take a look at the code today or tomorrow, but I quickly gave this a try on my M5 (latencies without the PR vs. with the PR):

Dynamic Matmul: (benchmark table not captured)

Static Matmul: (benchmark table not captured)

(used the flags …)
force-pushed 4abfd86 to 9511ec1
I have also observed that no single set of tile parameters is best for all matrix sizes. Also, regressions appear only(?) in SVE code; Neon universally improves.
hanhanW
left a comment
I have a final nit. Please fix DCO issue, thanks.
```cpp
if (failed(cDims) || cDims->m.size() != 1 || cDims->n.size() != 1 ||
    cDims->k.size() != 1) {
  return;
}
```
You also need to check that mDim == 0, nDim == 1, kDim == 2.
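A hedged sketch of what that extra check could look like, using a minimal stand-in for the inferred-dims struct (the real one is MLIR's ContractionDimensions; the names here are assumptions):

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for the inferred contraction dimensions; the real
// struct lives in MLIR's Linalg utilities (inferContractionDims).
struct ContractionDims {
  std::vector<unsigned> m, n, k;
};

// True only for the canonical matmul form: exactly one M, N and K
// dimension each, appearing as loop dimensions 0, 1 and 2 respectively,
// so transposed/permuted indexing-map variants are rejected.
bool isCanonicalMatmul(const ContractionDims &dims) {
  return dims.m.size() == 1 && dims.n.size() == 1 && dims.k.size() == 1 &&
         dims.m[0] == 0 && dims.n[0] == 1 && dims.k[0] == 2;
}
```

With this shape of check, the transposed-matmul example discussed earlier (both operands indexed with K innermost on d2 but N mapped to d1 on the LHS position) falls out of the fast path instead of silently reusing the same lowering config.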
Compute vector tile sizes using a heuristic that aims to keep the entire ACC/OUT tile in registers, leave a few registers for LHS/RHS columns or rows, and all that while not exceeding the number of available registers. The rationale is that a matrix multiplication typically lowers to a loop nest in which the ACC/OUT tile remains live across all iterations of the innermost loop, whereas the LHS and RHS operands live for a single iteration and do not require the entire tiles to be simultaneously resident in registers.

The base element type used is the element type of the output vector, under the assumption that operand types with smaller bitwidths will be promoted to the output type and thus require more registers for the same number of elements.

We have observed performance improvements on AArch64 targets (Neoverse-V1 and Neoverse-V2 cores) for Neon and SVE configurations without data tiling and with the peeling vector preprocessing strategy, for example:

* Neon, `batch.matmul`, f32: ~46% improvement (less time)
* Neon, GPT2, f32: ~31% improvement

Signed-off-by: Momchil Velikov <[email protected]>
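The register-budget argument above can be illustrated with a small brute-force sketch (purely illustrative, not the actual IREE heuristic): for a given number of vector registers, keep an MxN accumulator tile resident, plus one LHS column (M registers) and one RHS row (N registers) per innermost iteration.

```cpp
#include <cassert>

// Illustrative sketch of the register-budget reasoning, not IREE's code:
// pick the largest M x N accumulator tile (in vector-register units) such
// that ACC (m*n regs, live across the whole K loop) plus one LHS column
// (m regs) and one RHS row (n regs) fit in the register file.
struct Tile { int m, n; };

Tile pickAccTile(int numVecRegs) {
  Tile best{1, 1};
  for (int m = 1; m <= numVecRegs; ++m)
    for (int n = 1; n <= numVecRegs; ++n) {
      int used = m * n + m + n;  // ACC tile + LHS column + RHS row
      if (used <= numVecRegs && m * n > best.m * best.n)
        best = {m, n};
    }
  return best;
}
```

For a 32-register file (AArch64 Neon/SVE), this toy search settles on a 3x7 accumulator tile (21 + 3 + 7 = 31 registers); the real heuristic additionally accounts for the promoted output element type, as described above.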
force-pushed c19317a to 346381c
hanhanW
left a comment
LGTM, just a final nit about the assertion. I don't know why I did not point it out in the first place, sorry about that. Thanks for your patch!
```cpp
// Find the output element type of the matmul.
assert(op->getResultTypes().size() == 1 &&
       "Expected single output type for matmul op");
```
inferContractionDims guarantees that the op has a single result, because that is part of the op interface definition. The implementation also reflects this requirement, so we can remove the assertion.
…2/3, respectively. Signed-off-by: Momchil Velikov <[email protected]>
Hi @momchil-velikov, I'm not sure if you have permission to merge the PR. I'm happy to help land it if you can clear the lint issue. Thanks!
force-pushed 7d40be5 to abffb99