[GPUHeuristics] Prefer larger MMA intrinsics for very large compute-bound GEMMs#23641
Conversation
For compute-bound GEMMs (LargeGemm), reverse the intrinsic area preference in compareIntrinsics: maximize compute throughput (M*N*K) first, then among equal-compute intrinsics prefer smaller operand area (M+N)*K to reduce VGPR pressure from operand staging. E.g., on gfx950 for a large f16 matmul, 32x32x16 (compute=16384) is now preferred over 16x16x32 (compute=8192), and among same-compute intrinsics the one with smaller K wins. Memory-bound GEMMs retain the existing preference for larger area.
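The reversed ordering can be sketched as a sort key (a hypothetical standalone sketch, not the actual `compareIntrinsics` implementation; the `(M, N, K)` tuple layout, the `intrinsic_sort_key` helper, and the `compute_bound` flag are assumptions for illustration):

```python
# Hypothetical sketch of the preference order described above; the real logic
# lives in IREE's compareIntrinsics, which this does not reproduce.

def intrinsic_sort_key(intrinsic, compute_bound):
    """intrinsic is an (M, N, K) shape tuple, e.g. (32, 32, 16)."""
    m, n, k = intrinsic
    compute = m * n * k    # compute throughput per intrinsic issue
    area = (m + n) * k     # operand elements staged in VGPRs
    if compute_bound:
        # Compute-bound GEMMs: maximize compute first, then break ties
        # with the smaller operand area (lower VGPR pressure, smaller K).
        return (-compute, area)
    # Memory-bound GEMMs keep the existing preference for larger area.
    return (-area,)

# gfx950 f16 candidates from the description:
candidates = [(16, 16, 32), (32, 32, 16)]
best = min(candidates, key=lambda i: intrinsic_sort_key(i, compute_bound=True))
print(best)  # (32, 32, 16): compute 16384 beats 8192
```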
Only prefer low-VGPR-pressure intrinsics (e.g., 32x32x16 over 16x16x32) for very large GEMMs where the larger tiles provide enough data reuse to overcome the overhead. Benchmarking on mi355x shows the crossover is at compute intensity ~6875 (22x compute-memory cutoff), corresponding to square matmul sizes around 10k-11k. Co-Authored-By: Claude Opus 4.6 <[email protected]>
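The quoted figures can be sanity-checked if compute intensity is taken to mean FLOPs per matrix element touched, i.e. 2·M·N·K / (M·K + K·N + M·N) — that definition is an assumption, but it is consistent with the numbers cited in this thread (~5.5k for an 8k square matmul, crossover near 10k-11k):

```python
def compute_intensity(m, n, k):
    # Assumed definition: 2 FLOPs per multiply-accumulate, divided by the
    # total number of A, B, and C elements touched. Consistent with the
    # figures quoted in this PR, but not taken from the IREE source.
    return 2 * m * n * k / (m * k + k * n + m * n)

# For a square matmul this reduces to 2N/3, so the ~6875 cutoff
# corresponds to N around 10.3k ("10k-11k" in the description).
print(round(compute_intensity(8192, 8192, 8192)))  # 5461, i.e. ~5.5k
print(int(3 * 6875 / 2))                           # 10312
```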
I think with DMA + pipelining, we will be able to push the limit down to much lower compute-intensity scenarios.
FYI, I am working on a similar heuristic that chooses 32x32x16 instead of 16x16x32 intrinsics for large M/N sizes. However, what blocks me is regressions on some convolution shapes in the tracking sheet. Maybe the condition I set was too aggressive, but I would suggest a similar full run of benchmarks on all GEMM and conv shapes before this PR is merged.
@yzhang93 True, I will get some numbers to make sure it does not regress other cases. My experiments show that the cutoff should be very large. Take MNK=8k,8k,8k for example: I could not find a configuration using 32x32x16 that beats the baseline using 16x16x32. The compute intensity of an 8k square matmul is ~5.5k, and we need ~7k to break even.
Okay, I confirm there is no impact on the benchmark numbers ... because no entries are categorized as
That's good, but I feel like the cutoff is set too high. During my experiments, I found that a bunch of shapes in the tracking sheet actually benefit from the larger intrinsics, e.g., the m = 150000 shapes ([150000, 16384] x [16384, 4096] or [150000, 4096] x [4096, 16384]). But with your cutoff, these shapes still use the 16x16x32 intrinsics, which are suboptimal.
yzhang93
left a comment
I'm approving this PR because it sets a very strict condition on VeryLargeGemm and shows improvements on large square-shaped GEMMs. My concern above can be addressed separately as a next step. I will continue looking into how to optimize imbalanced GEMM shapes.
compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_tile_and_fuse_gfx950.mlir
@yzhang93 I see. We should tune the bounds a bit; its compute intensity is very close to the cutoff (6413 vs. 7000). Let's plan on making progress on this front; in particular, I want to see if we can find a lower criterion that lets the large intrinsics kick in.
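The 6413 figure checks out under the same assumed intensity definition (FLOPs per matrix element touched; the formula is an assumption, but it matches the numbers in this thread):

```python
def compute_intensity(m, n, k):
    # Assumed: 2 FLOPs per multiply-accumulate divided by the total
    # A/B/C elements touched; not taken from the IREE source.
    return 2 * m * n * k / (m * k + k * n + m * n)

# One of the m=150000 shapes: [150000, 16384] x [16384, 4096].
ci = compute_intensity(150000, 4096, 16384)
print(round(ci))  # 6413, just below the ~7000 cutoff
```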
SG. There are a bunch of m=150000 shapes in the tracking sheet that can be improved. Could you take those shapes and see if simply tuning the bounds can improve them all?
nirvedhmeshram
left a comment
LGTM, also +1 to using this more effectively as @yzhang93 pointed out.
Add `VeryLargeGemm`s for the heuristic. For `VeryLargeGemm`s, prefer intrinsics with higher compute throughput and lower VGPR pressure from operand staging. This is not applied to `LargeGemm`s, as the experiments did not show improvements over a dedicated seed.