
[GPUHeuristics] Prefer larger MMA intrinsics for very large compute-bound GEMMs #23641

Merged
lialan merged 4 commits into main from users/lialan/direct-load-tiling-heuristics
Mar 4, 2026
Conversation


@lialan lialan commented Mar 3, 2026

  • Add a new category, VeryLargeGemms, to the heuristic.
  • For VeryLargeGemms, prefer intrinsics with higher compute throughput and lower VGPR pressure from operand staging.
  • The cutoff is set at 22x the compute-to-memory ratio (perfTflops / memoryBandwidthTbps). On mi355x, this gives a compute intensity of ~7000, corresponding to square matmuls around 10k-11k. This is an empirical number determined by experiments on mi355x.
  • This change starts to beat the baseline around ~10.5k square matmuls and does ~14% better for 16k square matmuls with minimal code changes. The gain continues to grow with size.
  • Reuse the heuristic seeds from LargeGemms, as experiments did not show improvements over a dedicated seed.
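The cutoff arithmetic above can be sketched as follows. This is a minimal illustration, not the actual IREE implementation: the intensity formula (flops per operand/result element) and the mi355x-like hardware figures are assumptions chosen to reproduce the numbers quoted in this PR.

```python
# Sketch of the VeryLargeGemms cutoff described above. The intensity
# formula and the mi355x-like figures are illustrative assumptions.

def compute_intensity(m: int, n: int, k: int) -> float:
    """Arithmetic intensity of an MxNxK GEMM: 2*M*N*K flops over
    M*N + M*K + N*K operand/result elements."""
    return 2.0 * m * n * k / (m * n + m * k + n * k)

def very_large_cutoff(perf_tflops: float, bandwidth_tbps: float) -> float:
    """Empirical 22x compute-to-memory ratio from the PR description."""
    return 22.0 * perf_tflops / bandwidth_tbps

# Illustrative mi355x-like numbers: ~2500 TFLOPS f16, ~8 TB/s HBM.
print(round(very_large_cutoff(2500.0, 8.0)))          # 6875
print(round(compute_intensity(8192, 8192, 8192)))     # 5461: below cutoff
print(round(compute_intensity(11264, 11264, 11264)))  # 7509: above cutoff
```

With these assumed figures, an 8k square matmul sits below the cutoff and an ~11k one above it, matching the "~5.5k vs. ~7k to break even" discussion later in the thread.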

lialan and others added 3 commits March 3, 2026 10:20
For compute-bound GEMMs (LargeGemm), reverse the intrinsic area
preference in compareIntrinsics: maximize compute throughput (M*N*K)
first, then among equal-compute intrinsics prefer smaller operand area
(M+N)*K to reduce VGPR pressure from operand staging.

E.g., on gfx950 for a large f16 matmul, 32x32x16 (compute=16384) is
now preferred over 16x16x32 (compute=8192), and among same-compute
intrinsics the one with smaller K wins. Memory-bound GEMMs retain the
existing preference for larger area.
Only prefer low-VGPR-pressure intrinsics (e.g., 32x32x16 over 16x16x32)
for very large GEMMs where the larger tiles provide enough data reuse to
overcome the overhead. Benchmarking on mi355x shows the crossover is at
compute intensity ~6875 (22x compute-memory cutoff), corresponding to
square matmul sizes around 10k-11k.
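The reversed ordering described in the two commit messages above can be sketched as a sort key. Names and types here are hypothetical, not IREE's actual compareIntrinsics API; only the ordering criteria come from the commit text.

```python
# Hypothetical sketch of the intrinsic ordering for very large GEMMs;
# field and function names are illustrative, not IREE's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Intrinsic:
    m: int
    n: int
    k: int

def sort_key_very_large_gemm(i: Intrinsic):
    # Maximize compute throughput (M*N*K) first; among equal-compute
    # intrinsics, prefer smaller operand area (M+N)*K to reduce VGPR
    # pressure from operand staging.
    return (-(i.m * i.n * i.k), (i.m + i.n) * i.k)

candidates = [Intrinsic(16, 16, 32), Intrinsic(32, 32, 16)]
best = min(candidates, key=sort_key_very_large_gemm)
# 32x32x16 (compute=16384) beats 16x16x32 (compute=8192).
print(best)  # Intrinsic(m=32, n=32, k=16)
```

Memory-bound GEMMs would keep the existing preference, i.e., a different key; only the VeryLargeGemms path uses this ordering.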

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@lialan lialan changed the title [Codegen] Prefer larger MMA intrinsics for very large compute-bound GEMMs [GPUHeuristics] Prefer larger MMA intrinsics for very large compute-bound GEMMs Mar 4, 2026
@lialan lialan marked this pull request as ready for review March 4, 2026 01:35

lialan commented Mar 4, 2026

I think with DMA + pipelining, we will be able to push the limit down to much lower compute-intensity scenarios.


yzhang93 commented Mar 4, 2026

FYI, I am working on a similar heuristic that chooses 32x32x16 instead of 16x16x32 intrinsics for large M/N sizes. However, what blocks me is regressions on some convolution shapes in the tracking sheet. Maybe the condition I set was too aggressive, but I would suggest a similar full run of benchmarks on all GEMM and conv shapes before this PR is merged.


lialan commented Mar 4, 2026

> FYI, I am working on a similar heuristic that chooses 32x32x16 instead of 16x16x32 intrinsics for large M/N sizes. However, what blocks me is regressions on some convolution shapes in the tracking sheet. Maybe the condition I set was too aggressive, but I would suggest a similar full run of benchmarks on all GEMM and conv shapes before this PR is merged.

@yzhang93 True, I will get some numbers to make sure it does not regress other workloads.

My experiments show that the cutoff should be very large. Take MNK=8k,8k,8k for example: I could not find a configuration using 32x32x16 that beats the baseline using 16x16x32.

The compute intensity of an 8k square matmul is ~5.5k, and we need ~7k to break even.


lialan commented Mar 4, 2026

Okay, I can confirm no impact on the benchmark numbers ... because no entries are categorized as VeryLargeGemms.


yzhang93 commented Mar 4, 2026

> Okay I confirm no impact on the benchmark numbers ... because no entries are categorized as VeryLargeGemms.

That's good, but I feel the cutoff is set too high. During my experiments, I found that a bunch of shapes in the tracking sheet actually benefit from larger intrinsics, e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.

@yzhang93 yzhang93 left a comment


I'm approving this PR because it sets a very strict condition on VeryLargeGemm and shows improvements on large square-shaped GEMMs. My concern above can be addressed separately as a next step. I will continue looking into how to optimize imbalanced GEMM shapes.


lialan commented Mar 4, 2026

> e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.

@yzhang93 I see. We should tune the bounds a bit; its compute intensity (6413) is very close to the cutoff (7000).

Let's plan on making progress on this front; in particular, I want to see whether we can lower the criteria so large intrinsics apply more broadly.


yzhang93 commented Mar 4, 2026

> e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.
>
> @yzhang93 I see. We should tune the bounds a bit; its compute intensity (6413) is very close to the cutoff (7000).
>
> Let's plan on making progress on this front; in particular, I want to see whether we can lower the criteria so large intrinsics apply more broadly.

SG. There are a bunch of m=150000 shapes in the tracking sheet that could be improved. Could you take those shapes and see whether simply tuning the bounds improves them all?

@nirvedhmeshram nirvedhmeshram left a comment


LGTM, also +1 to using this more effectively as @yzhang93 pointed out.

@lialan lialan merged commit e002a09 into main Mar 4, 2026
59 of 60 checks passed
@lialan lialan deleted the users/lialan/direct-load-tiling-heuristics branch March 4, 2026 20:14