
[GPUHeuristics] Prefer larger MMA intrinsics for very large compute-bound GEMMs #23641

Merged
lialan merged 4 commits into main from users/lialan/direct-load-tiling-heuristics
Mar 4, 2026
Conversation


@lialan lialan commented Mar 3, 2026

  • Add a new category, VeryLargeGemms, to the heuristic.
  • For VeryLargeGemms, prefer intrinsics with higher compute throughput and lower VGPR pressure from operand staging.
  • The cutoff is set at 22x the compute-to-memory ratio (perfTflops / memoryBandwidthTbps). On mi355x, this gives a compute intensity of ~7000, corresponding to square matmuls around 10k-11k. This is an empirical number determined by experiments on mi355x.
  • This change starts to beat the baseline around ~10.5k square matmuls and does ~14% better for 16k square matmuls with minimal code changes. The gain continues to grow with size.
  • Reuse the heuristic seeds from LargeGemms, as experiments did not show improvements over a dedicated seed.
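The cutoff arithmetic above can be sketched as follows. This is a minimal illustration, not the actual IREE implementation: the intensity formula (flops per operand/result element) and the mi355x-like hardware figures are assumptions chosen to reproduce the numbers quoted in this PR.

```python
# Sketch of the VeryLargeGemms cutoff described above. The intensity
# formula and the mi355x-like figures are illustrative assumptions.

def compute_intensity(m: int, n: int, k: int) -> float:
    """Arithmetic intensity of an MxNxK GEMM: 2*M*N*K flops over
    M*N + M*K + N*K operand/result elements."""
    return 2.0 * m * n * k / (m * n + m * k + n * k)

def very_large_cutoff(perf_tflops: float, bandwidth_tbps: float) -> float:
    """Empirical 22x compute-to-memory ratio from the PR description."""
    return 22.0 * perf_tflops / bandwidth_tbps

# Illustrative mi355x-like numbers: ~2500 TFLOPS f16, ~8 TB/s HBM.
print(round(very_large_cutoff(2500.0, 8.0)))          # 6875
print(round(compute_intensity(8192, 8192, 8192)))     # 5461: below cutoff
print(round(compute_intensity(11264, 11264, 11264)))  # 7509: above cutoff
```

With these assumed figures, an 8k square matmul sits below the cutoff and an ~11k one above it, matching the "~5.5k vs. ~7k to break even" discussion later in the thread.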

lialan and others added 3 commits March 3, 2026 10:20
For compute-bound GEMMs (LargeGemm), reverse the intrinsic area
preference in compareIntrinsics: maximize compute throughput (M*N*K)
first, then among equal-compute intrinsics prefer smaller operand area
(M+N)*K to reduce VGPR pressure from operand staging.

E.g., on gfx950 for a large f16 matmul, 32x32x16 (compute=16384) is
now preferred over 16x16x32 (compute=8192), and among same-compute
intrinsics the one with smaller K wins. Memory-bound GEMMs retain the
existing preference for larger area.
Only prefer low-VGPR-pressure intrinsics (e.g., 32x32x16 over 16x16x32)
for very large GEMMs where the larger tiles provide enough data reuse to
overcome the overhead. Benchmarking on mi355x shows the crossover is at
compute intensity ~6875 (22x compute-memory cutoff), corresponding to
square matmul sizes around 10k-11k.
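The reversed ordering described in the two commit messages above can be sketched as a sort key. Names and types here are hypothetical, not IREE's actual compareIntrinsics API; only the ordering criteria come from the commit text.

```python
# Hypothetical sketch of the intrinsic ordering for very large GEMMs;
# field and function names are illustrative, not IREE's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Intrinsic:
    m: int
    n: int
    k: int

def sort_key_very_large_gemm(i: Intrinsic):
    # Maximize compute throughput (M*N*K) first; among equal-compute
    # intrinsics, prefer smaller operand area (M+N)*K to reduce VGPR
    # pressure from operand staging.
    return (-(i.m * i.n * i.k), (i.m + i.n) * i.k)

candidates = [Intrinsic(16, 16, 32), Intrinsic(32, 32, 16)]
best = min(candidates, key=sort_key_very_large_gemm)
# 32x32x16 (compute=16384) beats 16x16x32 (compute=8192).
print(best)  # Intrinsic(m=32, n=32, k=16)
```

Memory-bound GEMMs would keep the existing preference, i.e., a different key; only the VeryLargeGemms path uses this ordering.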

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@lialan lialan changed the title [Codegen] Prefer larger MMA intrinsics for very large compute-bound GEMMs [GPUHeuristics] Prefer larger MMA intrinsics for very large compute-bound GEMMs Mar 4, 2026
@lialan lialan marked this pull request as ready for review March 4, 2026 01:35

lialan commented Mar 4, 2026

I think with DMA + pipelining, we will be able to push the limit down to much lower compute-intensity scenarios.


yzhang93 commented Mar 4, 2026

FYI, I am working on a similar heuristic that chooses 32x32x16 instead of 16x16x32 intrinsics for large M/N sizes. However, what blocks me is regressions on some convolution shapes in the tracking sheet. Maybe the condition I set was too aggressive, but I would suggest a similar full run of benchmarks on all GEMM and conv shapes before this PR is merged.


lialan commented Mar 4, 2026

> FYI, I am working on a similar heuristic that chooses 32x32x16 instead of 16x16x32 intrinsics for large M/N sizes. However, what blocks me is regressions on some convolution shapes in the tracking sheet. Maybe the condition I set was too aggressive, but I would suggest a similar full run of benchmarks on all GEMM and conv shapes before this PR is merged.

@yzhang93 True, I will get some numbers to make sure it does not regress other workloads.

My experiments show that the cutoff should be very large. Take MNK=8k,8k,8k for example: I could not find a configuration using 32x32x16 that beats the baseline using 16x16x32.

The compute intensity of an 8k square matmul is ~5.5k, and we need ~7k to break even.


lialan commented Mar 4, 2026

Okay, I can confirm no impact on the benchmark numbers ... because no entries are categorized as VeryLargeGemms.


yzhang93 commented Mar 4, 2026

> Okay I confirm no impact on the benchmark numbers ... because no entries are categorized as VeryLargeGemms.

That's good, but I feel the cutoff is set too high. During my experiments, I found that a bunch of shapes in the tracking sheet actually benefit from larger intrinsics, e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.

@yzhang93 yzhang93 left a comment


I'm approving this PR because it sets a very strict condition on VeryLargeGemm and shows improvements on large square-shaped GEMMs. My concern above can be addressed separately as a next step. I will continue looking into how to optimize imbalanced GEMM shapes.


lialan commented Mar 4, 2026

> e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.

@yzhang93 I see. We should tune the bounds a bit; its compute intensity (6413) is very close to the cutoff (7000).

Let's plan on making progress on this front; in particular, I want to see whether we can lower the criteria so large intrinsics apply more broadly.


yzhang93 commented Mar 4, 2026

> e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use 16x16x32 intrinsics, which is suboptimal.
>
> @yzhang93 I see. We should tune the bounds a bit; its compute intensity (6413) is very close to the cutoff (7000).
>
> Let's plan on making progress on this front; in particular, I want to see whether we can lower the criteria so large intrinsics apply more broadly.

SG. There are a bunch of m=150000 shapes in the tracking sheet that could be improved. Could you take those shapes and see whether simply tuning the bounds improves them all?

@nirvedhmeshram nirvedhmeshram left a comment


LGTM, also +1 to using this more effectively as @yzhang93 pointed out.

@lialan lialan merged commit e002a09 into main Mar 4, 2026
59 of 60 checks passed
@lialan lialan deleted the users/lialan/direct-load-tiling-heuristics branch March 4, 2026 20:14