
[LLVMCPU][AArch64] New heuristic for matmul vector tile sizes #22932

Merged
hanhanW merged 6 commits into iree-org:main from momchil-velikov:tile-size-fill-reg-file-heuristic
Feb 11, 2026

Conversation

@momchil-velikov
Contributor

@momchil-velikov momchil-velikov commented Dec 17, 2025

Compute vector tile sizes using a heuristic that aims to keep the entire
ACC/OUT tile in registers, leave a few registers for LHS/RHS columns
or rows, and all that while not exceeding the number of available
registers. The rationale is that a matrix multiplication typically
lowers to a loop nest in which the ACC/OUT tile remains live across all
iterations of the innermost loop, whereas the LHS and RHS operands live
for a single iteration and do not require the entire tiles to be
simultaneously resident in registers.

The base element type used is the element type of the output vector
under the assumption the operand types with smaller bitwidths
will be promoted to the output type and thus require more registers
for the same number of elements.

We have observed performance improvements on AArch64 targets
(Neoverse-V1 and Neoverse-V2 cores) for Neon and SVE configurations
without data tiling and with the peeling vector preprocessing strategy,
for example:

  • Neon, batch.matmul, f32: ~46% improvement (less time)
  • Neon, GPT2, f32: ~31% improvement
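
The register-budget reasoning above can be sketched numerically. This is an illustration only, not the actual IREE implementation: the helper names are made up, and it assumes AArch64 Neon with 32 vector registers of 128 bits each.

```python
# Illustrative sketch of the register budget described above; the helpers
# are hypothetical, not IREE code. Assumes AArch64 Neon: 32 vector
# registers, 128 bits each.
NUM_VECTOR_REGS = 32
VECTOR_BITS = 128

def acc_tile_regs(m, n, out_elem_bits):
    """Registers needed to keep an MxN accumulator tile fully resident.

    The output element type is used as the base type, since narrower
    operands get promoted to it and thus need as many registers per
    element."""
    elems_per_reg = VECTOR_BITS // out_elem_bits
    return m * -(-n // elems_per_reg)  # m rows * ceil(n / elems_per_reg)

def fits_budget(m, n, out_elem_bits, operand_regs=4):
    """The ACC tile stays live across the whole innermost loop; LHS/RHS
    only need a few registers per iteration (operand_regs is a guess)."""
    return acc_tile_regs(m, n, out_elem_bits) + operand_regs <= NUM_VECTOR_REGS

# f32 example: a 6x16 ACC tile takes 6 * (16/4) = 24 registers, leaving
# 8 for LHS/RHS columns/rows; an 8x16 tile would already fill the file.
print(acc_tile_regs(6, 16, 32))  # 24
print(fits_budget(6, 16, 32))    # True
print(fits_budget(8, 16, 32))    # False
```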

Signed-off-by: Momchil Velikov [email protected]

@banach-space
Collaborator

Thank you for working on this, @momchil-velikov !

Approving as-is - we discussed this offline and the heuristics make sense to me. The experiments you ran (various matmul kernels + GPT-2) show clear benefit.

IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn’t appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary.

I appreciate that there might be some other (more generic?) approaches to this, but I don't have any specific suggestions myself. Given the observed performance improvements on the tested workloads, I am in favor.

It would be great to extend the PR summary with a bit more detail:

  • Observed performance improvements,
  • Which configurations are affected (e.g. DT, peeling),
  • Hardware used for evaluation.

Thanks again! @hanhanW , what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.

Contributor

@hanhanW hanhanW left a comment


IIRC, the evaluation was done on Neoverse V2, right? The implementation itself doesn’t appear microarchitecture-specific, but it would be good to mention the evaluation hardware explicitly in the summary.

+1 for making the summary better.

Thanks again! @hanhanW , what are your thoughts re the implementation? I've not really touched this code for a while, so don't have a strong opinion.

I haven't touched them for a while, either. IMO, some of the tile sizes were driven by old hardware or workloads, so some of them look legacy to me. I don't have a strong opinion about changing this code as long as it does not significantly impact benchmarks, although we don't have any benchmarks at the moment.

Comment on lines +1601 to +1604
scalableSizeFlags.append({false,
                          isScalableVectorizationEnabled() &&
                              hasAnySVEFeature(targetAttr.getConfiguration()),
                          false});
Contributor


If you are passing a linalgOp (not a MatmulOp), it is better to infer the M/N/K dims. The other approach would be bailing out at the entry if the op is not a matmul op.

I prefer returning the result instead of passing by reference, though. In any case, can you split it into two main statements for readability? E.g.,

  • Initialize with three false.
  • Set N dim to true if it meets the condition.
vec.resize(3, false);
if (...) {
  vec[1] = true;
}

Contributor Author


infer M/N/K dims.

I'm afraid I don't quite understand what this means.

Looking at it from the point of view of the function I'm replacing, there's no change in semantics, per se, just choosing different values, with the convention that in the returned vectors element 0 is M, element 1 is N, and element 2 is K (M, N, and K themselves having their conventional meaning). That convention looks common across most (all?) other functions in the file that suggest tile sizes.

The other approach can be bailing out in the entry if the op is not a matmul op.

We know we have a contraction (ContractionOpInterface) with a single reduction dimension, is the choice going to be bad for all such contraction ops that are not a matmul?
Maybe for some, but then I'd rather enhance the heuristic if we stumble upon such cases.

Contributor


Looking at it from the point of view of the function I'm replacing, there's no change in semantics, per se, just choosing different values, with the convention that in the returned vectors element 0 is M, element 1 is N, and element 2 is K (M, N, and K themselves having their conventional meaning). That convention looks common across most (all?) other functions in the file that suggest tile sizes.

Thanks @banach-space for the pointer! I was not clear about checking MatmulOp because I forgot that a MatmulOp now takes indexing maps: llvm/llvm-project@d152808

Some of this code is pretty old and has not evolved because we're understaffed on some backends. IMO, we should bail out if we make such an assumption. Otherwise, people may easily run into issues. E.g., you'll generate the same lowering config if you replace a matmul with an equivalent matmul that has different indexing maps.

  // The shapes are MxK, NxK, MxN.
  %5 = linalg.matmul
    indexing_maps = [
      affine_map<(d0, d1, d2) -> (d0, d2)>,
      affine_map<(d0, d1, d2) -> (d1, d2)>,
      affine_map<(d0, d1, d2) -> (d0, d1)>
    ]
    ins(%1, %2 : tensor<512x128xi8>, tensor<512x128xi8>) outs(%fill : tensor<512x512xi32>) -> tensor<512x512xi32>

We may have a utility function to check if a linalg::LinalgOp is a common matmul form and use it here; it addresses my concern.
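
As an illustration of what such a utility might check (a toy sketch — the helper name and the tuple-based modeling of the affine maps are made up, not existing IREE/MLIR API):

```python
# Model each indexing map by the tuple of iteration dims it selects,
# with the canonical (M, N, K) = (d0, d1, d2). Hypothetical helper, for
# illustration only.
M, N, K = 0, 1, 2

def is_common_matmul_form(lhs_map, rhs_map, out_map):
    """True only for the conventional maps LHS=(M,K), RHS=(K,N), OUT=(M,N)."""
    return lhs_map == (M, K) and rhs_map == (K, N) and out_map == (M, N)

# The conventional matmul passes; the transposed-RHS example above
# (shapes MxK, NxK, MxN) is not the common form, so the heuristic
# should bail out on it.
print(is_common_matmul_form((M, K), (K, N), (M, N)))  # True
print(is_common_matmul_form((M, K), (N, K), (M, N)))  # False
```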

Contributor Author


Done

@egebeysel
Contributor

egebeysel commented Jan 5, 2026

I'll take a look at the code today or tomorrow, but I quickly gave this a try on my M5:

(latencies without the PR vs with the PR)

Dynamic Matmul

  • 1024x1024x1024: 9.06 vs 7.63 ms
  • (wide k) 1024x1024x4096: 52.0 vs 64.6 ms
  • (wide n) 1024x4096x1024: 39.8 vs 45.9 ms
  • (wide m) 4096x1024x1024: 25.6 vs 21.8 ms

Static Matmul

  • 1024x1024x1024: 11.6 vs 7.45 ms
  • (wide k) 1024x1024x4096: 55.5 vs 68.1 ms
  • (wide n) 1024x4096x1024: 42.1 vs 40.4 ms
  • (wide m) 4096x1024x1024: 33.9 vs 22.8 ms

(used the flags --iree-llvmcpu-enable-ukernels=none --iree-opt-data-tiling=false --iree-dispatch-creation-data-tiling=false --iree-llvmcpu-target-cpu=host --iree-hal-target-backends=llvm-cpu to compile)
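
The before/after ratios implied by the pairs above can be computed directly (values > 1.0 mean the PR is faster; this is just arithmetic on the reported numbers):

```python
# Speedup = latency_before / latency_after, from the numbers reported
# above (ms). Values > 1.0 mean the PR is faster, < 1.0 a regression.
pairs = {
    "dynamic 1024x1024x1024": (9.06, 7.63),
    "dynamic 1024x1024x4096": (52.0, 64.6),
    "dynamic 1024x4096x1024": (39.8, 45.9),
    "dynamic 4096x1024x1024": (25.6, 21.8),
    "static 1024x1024x1024": (11.6, 7.45),
    "static 1024x1024x4096": (55.5, 68.1),
    "static 1024x4096x1024": (42.1, 40.4),
    "static 4096x1024x1024": (33.9, 22.8),
}
for name, (before, after) in pairs.items():
    print(f"{name}: {before / after:.2f}x")
```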

@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from 4abfd86 to 9511ec1 on January 6, 2026 11:08
@momchil-velikov
Contributor Author

I'll take a look at the code today or tomorrow, but I quickly gave this a try on my M5:

(latencies without the PR vs with the PR)

Dynamic Matmul

  • 1024x1024x1024: 9.06 vs 7.63 ms
  • (wide k) 1024x1024x4096: 52.0 vs 64.6 ms
  • (wide n) 1024x4096x1024: 39.8 vs 45.9 ms
  • (wide m) 4096x1024x1024: 25.6 vs 21.8 ms

Static Matmul

  • 1024x1024x1024: 11.6 vs 7.45 ms
  • (wide k) 1024x1024x4096: 55.5 vs 68.1 ms
  • (wide n) 1024x4096x1024: 42.1 vs 40.4 ms
  • (wide m) 4096x1024x1024: 33.9 vs 22.8 ms

(used the flags --iree-llvmcpu-enable-ukernels=none --iree-opt-data-tiling=false --iree-dispatch-creation-data-tiling=false --iree-llvmcpu-target-cpu=host --iree-hal-target-backends=llvm-cpu to compile)

I have also observed that no single set of tile parameters is best for all matrix sizes. Also, the regressions are only(?) in SVE code; Neon universally improves.
One issue is an increase in cache misses, mostly when loading the RHS. The accesses to the RHS use a wide stride (with both the original and the patched compiler), but the original RHS tile rows are twice as big (4 registers) as the modified ones (2 registers), which is one possible explanation.
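
A toy model of this cache-line-utilization hypothesis (an illustration, not a measurement): with 64-byte cache lines and f32 data, a 4-register (64-byte) row uses a fetched line fully, while a 2-register (32-byte) row pulls in a full line but wastes half of it.

```python
# Toy model of cache-line utilization for wide-strided RHS rows;
# assumes line-aligned rows and a stride larger than a cache line.
CACHE_LINE_BYTES = 64

def bytes_used_per_line(row_elems, elem_bytes=4):
    """Fraction of each fetched cache line actually consumed by one
    strided RHS row."""
    row_bytes = row_elems * elem_bytes
    lines = -(-row_bytes // CACHE_LINE_BYTES)  # ceil division
    return row_bytes / (lines * CACHE_LINE_BYTES)

print(bytes_used_per_line(16))  # 1.0 -> 4-register f32 Neon row
print(bytes_used_per_line(8))   # 0.5 -> 2-register row wastes half a line
```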
I have also experimented with tile parameters that increase N while still staying within register budget, but then the SVE register allocation is quite suboptimal, with unnecessary spills.
There are other deficiencies with SVE codegen (for some of which we have initial downstream patches):

  • it does not use indexed FMLA (which potentially saves an explicit broadcast)
  • there are missed opportunities to use ld1rqw for loading the LHS
  • the address calculations for loading the strided rows (where the same constant is added to a base address a few times) are performed quite badly (sometimes they may even result in spilling GPRs in the inner loop).

Contributor

@hanhanW hanhanW left a comment


I have a final nit. Please fix DCO issue, thanks.

Comment on lines +1533 to 1536
if (failed(cDims) || cDims->m.size() != 1 || cDims->n.size() != 1 ||
    cDims->k.size() != 1) {
  return;
}
Contributor


You also need to check that mDim == 0, nDim == 1, kDim == 2.
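
For illustration, the combined bail-out condition discussed here (exactly one M, N and K dim, in the canonical d0/d1/d2 positions) could look like this toy check — the helper name and list-based representation are hypothetical, not the actual code:

```python
# Toy version of the bail-out: require exactly one M, one N and one K
# dim, and require the canonical positions mDim == 0, nDim == 1,
# kDim == 2. Illustrative only.
def is_canonical_contraction(m_dims, n_dims, k_dims):
    return (len(m_dims) == 1 and len(n_dims) == 1 and len(k_dims) == 1
            and (m_dims[0], n_dims[0], k_dims[0]) == (0, 1, 2))

print(is_canonical_contraction([0], [1], [2]))     # True
print(is_canonical_contraction([0], [2], [1]))     # False (N/K swapped)
print(is_canonical_contraction([0, 1], [2], [3]))  # False (two M dims)
```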

Contributor Author


Done

@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from c19317a to 346381c on February 2, 2026 10:19
Contributor

@hanhanW hanhanW left a comment


LGTM, just a final nit about assertion. I don't know why I did not point it out in the first place, sorry about that. Thanks for your patch!

Comment on lines +1561 to +1563
// Find the output element type of the matmul.
assert(op->getResultTypes().size() == 1 &&
       "Expected single output type for matmul op");
Contributor


inferContractionDims guarantees that the op has a single result, because that is part of the op interface definition. The implementation also reflects the requirement, so we can remove this assertion.

https://github.com/llvm/llvm-project/blob/91c4decc01522c0130d47f8270330194259f4207/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp#L483-L489

Contributor Author


Done

@hanhanW
Contributor

hanhanW commented Feb 10, 2026

Hi @momchil-velikov I'm not sure if you have permission to merge the PR. I'm happy to help land it if you can clear the lint issue. Thanks!

Signed-off-by: Momchil Velikov <[email protected]>
@momchil-velikov momchil-velikov force-pushed the tile-size-fill-reg-file-heuristic branch from 7d40be5 to abffb99 on February 11, 2026 10:37
@hanhanW hanhanW enabled auto-merge (squash) February 11, 2026 17:53
@hanhanW hanhanW merged commit 3af8a10 into iree-org:main Feb 11, 2026
55 of 56 checks passed
MaheshRavishankar pushed a commit to MaheshRavishankar/iree that referenced this pull request Feb 24, 2026