
[LLVMCPU][AArch64] New heuristic for matmul vector tile sizes #22932

Merged
hanhanW merged 6 commits into iree-org:main from momchil-velikov:tile-size-fill-reg-file-heuristic
Feb 11, 2026

Conversation

@momchil-velikov
Contributor

@momchil-velikov momchil-velikov commented Dec 17, 2025

Compute vector tile sizes using a heuristic that aims to keep the entire
ACC/OUT tile in registers, leave a few registers for LHS/RHS columns
or rows, and all that while not exceeding the number of available
registers. The rationale is that a matrix multiplication typically
lowers to a loop nest in which the ACC/OUT tile remains live across all
iterations of the innermost loop, whereas the LHS and RHS operands live
for a single iteration and do not require the entire tiles to be
simultaneously resident in registers.

The base element type used is the element type of the output vector
under the assumption the operand types with smaller bitwidths
will be promoted to the output type and thus require more registers
for the same number of elements.

We have observed performance improvements on AArch64 targets
(Neoverse-V1 and Neoverse-V2 cores) for Neon and SVE configurations
without data tiling and with the peeling vector preprocessing strategy,
for example:

  • Neon, batch.matmul, f32: ~46% improvement (less time)
  • Neon, GPT2, f32: ~31% improvement
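
The register-budget reasoning above can be sketched numerically. This is an illustration only, not the actual IREE implementation: the helper names are made up, and it assumes AArch64 Neon with 32 vector registers of 128 bits each.

```python
# Illustrative sketch of the register budget described above; the helpers
# are hypothetical, not IREE code. Assumes AArch64 Neon: 32 vector
# registers, 128 bits each.
NUM_VECTOR_REGS = 32
VECTOR_BITS = 128

def acc_tile_regs(m, n, out_elem_bits):
    """Registers needed to keep an MxN accumulator tile fully resident.

    The output element type is used as the base type, since narrower
    operands get promoted to it and thus need as many registers per
    element."""
    elems_per_reg = VECTOR_BITS // out_elem_bits
    return m * -(-n // elems_per_reg)  # m rows * ceil(n / elems_per_reg)

def fits_budget(m, n, out_elem_bits, operand_regs=4):
    """The ACC tile stays live across the whole innermost loop; LHS/RHS
    only need a few registers per iteration (operand_regs is a guess)."""
    return acc_tile_regs(m, n, out_elem_bits) + operand_regs <= NUM_VECTOR_REGS

# f32 example: a 6x16 ACC tile takes 6 * (16/4) = 24 registers, leaving
# 8 for LHS/RHS columns/rows; an 8x16 tile would already fill the file.
print(acc_tile_regs(6, 16, 32))  # 24
print(fits_budget(6, 16, 32))    # True
print(fits_budget(8, 16, 32))    # False
```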

Signed-off-by: Momchil Velikov [email protected]

@banach-space
Collaborator

Thank you for working on this, @momchil-velikov !

Approving as-is - we discussed this offline and the heuristics make sense to me. The experiments you ran (various matmul kernels + GPT-2) show clear benefit.

IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn’t appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary.

I appreciate that there might be some other (more generic?) approaches to this, but I don't have any specific suggestions myself. Given the observed performance improvements on the tested workloads, I am in favor.

It would be great to extend the PR summary with a bit more detail:

  • Observed performance improvements,
  • Which configurations are affected (e.g. DT, peeling),
  • Hardware used for evaluation.

Thanks again! @hanhanW , what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.

Contributor

@hanhanW hanhanW left a comment


IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn’t appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary.

+1 for making the summary better.

Thanks again! @hanhanW , what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.

I haven't touched them for a while, either. IMO, some of the tile sizes were driven by old hardware or workloads, so some of them look legacy to me. I don't have a strong opinion about changing this code as long as it does not significantly impact benchmarks, although we don't have any benchmarks at the moment.

Comment on lines +1601 to +1604
scalableSizeFlags.append({false,
                          isScalableVectorizationEnabled() &&
                              hasAnySVEFeature(targetAttr.getConfiguration()),
                          false});
Contributor


If you are passing a linalgOp (not a MatmulOp), it is better to infer the M/N/K dims. The other approach would be bailing out at the entry if the op is not a matmul op.

I prefer returning the result instead of passing by reference, though. In any case, can you split it into two main statements for readability? E.g.,

  • Initialize with three false.
  • Set N dim to true if it meets the condition.
vec.resize(3, false);
if (...) {
  vec[1] = true;
}

Contributor Author


infer M/N/K dims.

I'm afraid I don't quite understand what this means.

Looking at it from the point of view of the function I'm replacing, there's no change in semantics, per se, just choosing different values, with the convention that in the returned vectors element 0 is M, element 1 is N, and element 2 is K (M, N, and K themselves having their conventional meaning). That convention looks common across most (all?) other functions in the file that suggest tile sizes.

The other approach can be bailing out in the entry if the op is not a matmul op.

We know we have a contraction (ContractionOpInterface) with a single reduction dimension, is the choice going to be bad for all such contraction ops that are not a matmul?
Maybe for some, but then I'd rather enhance the heuristic if we stumble upon such cases.

Contributor


Looking at it from the point of view of the function I'm replacing, there's no change in semantics, per se, just choosing different values, with the convention that in the returned vectors element 0 is M, element 1 is N, and element 2 is K (M, N, and K themselves having their conventional meaning). That convention looks common across most (all?) other functions in the file that suggest tile sizes.

Thanks @banach-space for the pointer! I was not clear about checking MatmulOp because I forgot that a MatmulOp now takes indexing maps: llvm/llvm-project@d152808

Some of this code is pretty old and has not evolved because we're understaffed on some backends. IMO, we should bail out if we make such an assumption. Otherwise, people may easily run into issues. E.g., you'll generate the same lowering config if you replace a matmul with an equivalent matmul that has different indexing maps.

  // The shapes are MxK, NxK, MxN.
  %5 = linalg.matmul
    indexing_maps = [
      affine_map<(d0, d1, d2) -> (d0, d2)>,
      affine_map<(d0, d1, d2) -> (d1, d2)>,
      affine_map<(d0, d1, d2) -> (d0, d1)>
    ]
    ins(%1, %2 : tensor<512x128xi8>, tensor<512x128xi8>) outs(%fill : tensor<512x512xi32>) -> tensor<512x512xi32>

We may have a utility function to check if a linalg::LinalgOp is a common matmul form and use it here; it addresses my concern.
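
As an illustration of what such a utility might check (a toy sketch — the helper name and the tuple-based modeling of the affine maps are made up, not existing IREE/MLIR API):

```python
# Model each indexing map by the tuple of iteration dims it selects,
# with the canonical (M, N, K) = (d0, d1, d2). Hypothetical helper, for
# illustration only.
M, N, K = 0, 1, 2

def is_common_matmul_form(lhs_map, rhs_map, out_map):
    """True only for the conventional maps LHS=(M,K), RHS=(K,N), OUT=(M,N)."""
    return lhs_map == (M, K) and rhs_map == (K, N) and out_map == (M, N)

# The conventional matmul passes; the transposed-RHS example above
# (shapes MxK, NxK, MxN) is not the common form, so the heuristic
# should bail out on it.
print(is_common_matmul_form((M, K), (K, N), (M, N)))  # True
print(is_common_matmul_form((M, K), (N, K), (M, N)))  # False
```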

Contributor Author


Done

@egebeysel
Contributor

egebeysel commented Jan 5, 2026

I'll take a look at the code today or tomorrow, but I quickly gave this a try on my M5:

(latencies without the PR vs with the PR)

Dynamic Matmul

  • 1024x1024x1024: 9.06 vs 7.63 ms
  • (wide k) 1024x1024x4096: 52.0 vs 64.6 ms
  • (wide n) 1024x4096x1024: 39.8 vs 45.9 ms
  • (wide m) 4096x1024x1024: 25.6 vs 21.8 ms

Static Matmul

  • 1024x1024x1024: 11.6 vs 7.45 ms
  • (wide k) 1024x1024x4096: 55.5 vs 68.1 ms
  • (wide n) 1024x4096x1024: 42.1 vs 40.4 ms
  • (wide m) 4096x1024x1024: 33.9 vs 22.8 ms

(used the flags --iree-llvmcpu-enable-ukernels=none --iree-opt-data-tiling=false --iree-dispatch-creation-data-tiling=false --iree-llvmcpu-target-cpu=host --iree-hal-target-backends=llvm-cpu to compile)
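
The before/after ratios implied by the pairs above can be computed directly (values > 1.0 mean the PR is faster; this is just arithmetic on the reported numbers):

```python
# Speedup = latency_before / latency_after, from the numbers reported
# above (ms). Values > 1.0 mean the PR is faster, < 1.0 a regression.
pairs = {
    "dynamic 1024x1024x1024": (9.06, 7.63),
    "dynamic 1024x1024x4096": (52.0, 64.6),
    "dynamic 1024x4096x1024": (39.8, 45.9),
    "dynamic 4096x1024x1024": (25.6, 21.8),
    "static 1024x1024x1024": (11.6, 7.45),
    "static 1024x1024x4096": (55.5, 68.1),
    "static 1024x4096x1024": (42.1, 40.4),
    "static 4096x1024x1024": (33.9, 22.8),
}
for name, (before, after) in pairs.items():
    print(f"{name}: {before / after:.2f}x")
```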

@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from 4abfd86 to 9511ec1 on January 6, 2026 11:08
@momchil-velikov
Contributor Author

I'll take a look at the code today or tomorrow, but I quickly gave this a try on my M5:

(latencies without the PR vs with the PR)

Dynamic Matmul

  • 1024x1024x1024: 9.06 vs 7.63 ms
  • (wide k) 1024x1024x4096: 52.0 vs 64.6 ms
  • (wide n) 1024x4096x1024: 39.8 vs 45.9 ms
  • (wide m) 4096x1024x1024: 25.6 vs 21.8 ms

Static Matmul

  • 1024x1024x1024: 11.6 vs 7.45 ms
  • (wide k) 1024x1024x4096: 55.5 vs 68.1 ms
  • (wide n) 1024x4096x1024: 42.1 vs 40.4 ms
  • (wide m) 4096x1024x1024: 33.9 vs 22.8 ms

(used the flags --iree-llvmcpu-enable-ukernels=none --iree-opt-data-tiling=false --iree-dispatch-creation-data-tiling=false --iree-llvmcpu-target-cpu=host --iree-hal-target-backends=llvm-cpu to compile)

I have also observed that no single set of tile parameters is best for all matrix sizes. Also, the regressions are only(?) in SVE code; Neon universally improves.
One issue is an increase in cache misses, mostly when loading the RHS. The accesses to the RHS use a wide stride (with both the original and the patched compiler), but the original RHS tile rows are twice as big (4 registers) as the modified ones (2 registers), which is one possible explanation.
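
A toy model of this cache-line-utilization hypothesis (an illustration, not a measurement): with 64-byte cache lines and f32 data, a 4-register (64-byte) row uses a fetched line fully, while a 2-register (32-byte) row pulls in a full line but wastes half of it.

```python
# Toy model of cache-line utilization for wide-strided RHS rows;
# assumes line-aligned rows and a stride larger than a cache line.
CACHE_LINE_BYTES = 64

def bytes_used_per_line(row_elems, elem_bytes=4):
    """Fraction of each fetched cache line actually consumed by one
    strided RHS row."""
    row_bytes = row_elems * elem_bytes
    lines = -(-row_bytes // CACHE_LINE_BYTES)  # ceil division
    return row_bytes / (lines * CACHE_LINE_BYTES)

print(bytes_used_per_line(16))  # 1.0 -> 4-register f32 Neon row
print(bytes_used_per_line(8))   # 0.5 -> 2-register row wastes half a line
```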
I have also experimented with tile parameters that increase N while still staying within register budget, but then the SVE register allocation is quite suboptimal, with unnecessary spills.
There are other deficiencies with SVE codegen (for some of which we have initial downstream patches):

  • it does not use indexed FMLA (which potentially saves an explicit broadcast)
  • there are missed opportunities to use ld1rqw for loading the LHS
  • the address calculations for loading the strided rows (where the same constant is added to a base address a few times) are performed quite badly (sometimes they may even result in spilling GPRs in the inner loop).

Contributor

@hanhanW hanhanW left a comment


I have a final nit. Please fix DCO issue, thanks.

Comment on lines +1533 to 1536
if (failed(cDims) || cDims->m.size() != 1 || cDims->n.size() != 1 ||
    cDims->k.size() != 1) {
  return;
}
Contributor


You also need to check that mDim == 0, nDim == 1, kDim == 2.
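
For illustration, the combined bail-out condition discussed here (exactly one M, N and K dim, in the canonical d0/d1/d2 positions) could look like this toy check — the helper name and list-based representation are hypothetical, not the actual code:

```python
# Toy version of the bail-out: require exactly one M, one N and one K
# dim, and require the canonical positions mDim == 0, nDim == 1,
# kDim == 2. Illustrative only.
def is_canonical_contraction(m_dims, n_dims, k_dims):
    return (len(m_dims) == 1 and len(n_dims) == 1 and len(k_dims) == 1
            and (m_dims[0], n_dims[0], k_dims[0]) == (0, 1, 2))

print(is_canonical_contraction([0], [1], [2]))     # True
print(is_canonical_contraction([0], [2], [1]))     # False (N/K swapped)
print(is_canonical_contraction([0, 1], [2], [3]))  # False (two M dims)
```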

Contributor Author


Done

@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from c19317a to 346381c on February 2, 2026 10:19
Contributor

@hanhanW hanhanW left a comment


LGTM, just a final nit about assertion. I don't know why I did not point it out in the first place, sorry about that. Thanks for your patch!

Comment on lines +1561 to +1563
// Find the output element type of the matmul.
assert(op->getResultTypes().size() == 1 &&
       "Expected single output type for matmul op");
Contributor


inferContractionDims guarantees that the op has a single result, because that is part of the op interface definition. The implementation also reflects the requirement, so we can remove this assertion.

https://github.com/llvm/llvm-project/blob/91c4decc01522c0130d47f8270330194259f4207/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp#L483-L489

Contributor Author


Done

@hanhanW
Contributor

hanhanW commented Feb 10, 2026

Hi @momchil-velikov I'm not sure if you have permission to merge the PR. I'm happy to help land it if you can clear the lint issue. Thanks!

Signed-off-by: Momchil Velikov <[email protected]>
@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from 7d40be5 to abffb99 on February 11, 2026 10:37
@hanhanW hanhanW enabled auto-merge (squash) February 11, 2026 17:53
@hanhanW hanhanW merged commit 3af8a10 into iree-org:main Feb 11, 2026
55 of 56 checks passed
MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this pull request Feb 24, 2026