[Codegen][GPU] Fix shared memory estimation for multi-buffering by Yu-Zhewen · Pull Request #23736 · iree-org/iree

Yu-Zhewen · 2026-03-11T14:15:52Z

Pass useDirectLoad and prefetchNumStages through to calculateOperandsSharedMemoryUsedInBytes so that it can account
for multi-buffering when direct loads are used with prefetching.
For scaled matmuls, direct load is forced off (Scaled matmul fails to compile when using global load DMA promotion #22119). To avoid inconsistency, guard the flag properly by overriding it and
emitting a warning.
As a result, for regular matmul the shared memory usage scales with prefetchNumStages when useDirectLoad is enabled. For scaled matmul, shared memory usage is unchanged because direct load is forced off.
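
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `resolveUseDirectLoad` and `getNumBuffers` are hypothetical helper names, and the warning text is made up. Only the buffer-count condition is taken from the diff in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical helper (not an IREE function): scaled matmuls cannot use
// direct load yet (#22119), so a requested flag is overridden with a warning.
bool resolveUseDirectLoad(bool requested, bool isScaledMatmul) {
  if (requested && isScaledMatmul) {
    std::cerr << "warning: direct load is unsupported for scaled matmul; "
                 "forcing it off\n";
    return false;
  }
  return requested;
}

// Number of shared memory copies kept per operand tile. The condition is the
// one used in this PR: only direct load with prefetching multi-buffers.
int64_t getNumBuffers(bool useDirectLoad, int64_t prefetchNumStages) {
  assert(prefetchNumStages >= 0 && "stage count must be non-negative");
  return (useDirectLoad && prefetchNumStages > 0) ? prefetchNumStages : 1;
}
```

With these helpers, a regular matmul that requests direct load with N stages gets N buffers, while a scaled matmul with the same request falls back to a single buffer.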

Signed-off-by: Yu-Zhewen <[email protected]>

kuhar · 2026-03-11T15:25:39Z

@bangtianliu @RattataKing we should mirror this in the tuner (eventually)

lialan · 2026-03-11T15:43:01Z

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_tile_and_fuse_gfx950.mlir

+
+// CHECK-REMARKS-DIRECT-LOAD-3: [Analysis] SharedMemoryUsage
+// CHECK-REMARKS-DIRECT-LOAD-3-SAME: Category:deduceMMASchedule
+// CHECK-REMARKS-DIRECT-LOAD-3-SAME: Remark=34816


Same here: check whether we only need 1, or whether something is not right.

jerryyin · 2026-03-11T17:06:07Z

compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.cpp

+  // Account for multi-buffering when using direct loads.
+  int64_t numBuffers =
+      (useDirectLoad && prefetchNumStages > 0) ? prefetchNumStages : 1;
+
+  int64_t lhsSharedMemoryUsed = numBuffers * tileM * tileK * lhsBitwidth;
+  int64_t rhsSharedMemoryUsed =
+      numBuffers * numRhs * tileN * tileK * rhsBitwidth;
+  int64_t aScaleSharedMemoryUsed =
+      numBuffers * tileM * tileKo * lhsScaleBitwidth;
+  int64_t bScaleSharedMemoryUsed =
+      numBuffers * numRhs * tileN * tileKo * rhsScaleBitwidth;


I think it'd be cleaner to multiply by numBuffers once at the very end, just before the return, since every term is scaled by numBuffers uniformly.

Also, add an explicit comment referencing ROCDLPrefetchSharedMemoryPass, noting that in direct load mode the number of buffers currently equals numStages.

Please also help me check another scenario: at this point, is useDirectLoad a done deal, or can it still be overridden later? If it can, I don't want our estimate to be overly aggressive.

The only place useDirectLoad gets overridden is the scaled matmul case discussed above (@lialan is looking into this as an orthogonal issue). By the time we reach calculateOperandsSharedMemoryUsedInBytes, the flag reflects the final decision, so there's no risk of over-estimation.
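
For reference, the "multiply once at the end" shape suggested in this thread might look like the sketch below. The `OperandTiles` struct and `sharedMemoryBits` function are hypothetical and exist only for illustration; the per-operand terms follow the diff above. The result is in bits because the terms multiply by bitwidths, and the conversion to bytes is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle. Field names mirror the variables in the
// diff, but the struct itself is not part of IREE.
struct OperandTiles {
  int64_t tileM, tileN, tileK, tileKo;
  int64_t numRhs;
  int64_t lhsBitwidth, rhsBitwidth;
  int64_t lhsScaleBitwidth, rhsScaleBitwidth; // 0 when there are no scales
};

// Sums the single-buffer footprint of every operand, then applies the
// buffer count once, since all terms scale by it uniformly.
int64_t sharedMemoryBits(const OperandTiles &t, int64_t numBuffers) {
  assert(numBuffers >= 1 && "at least one buffer is always needed");
  int64_t lhs = t.tileM * t.tileK * t.lhsBitwidth;
  int64_t rhs = t.numRhs * t.tileN * t.tileK * t.rhsBitwidth;
  int64_t aScale = t.tileM * t.tileKo * t.lhsScaleBitwidth;
  int64_t bScale = t.numRhs * t.tileN * t.tileKo * t.rhsScaleBitwidth;
  return numBuffers * (lhs + rhs + aScale + bScale);
}
```

Because multiplication distributes over the sum, this gives the same value as scaling each term individually.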

jerryyin

LGTM

lialan

+1

Max191

Sorry, I just missed the merge, but could you address my comment as a followup?

Max191 · 2026-03-11T18:50:36Z

compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.h

+    bool canUpcastAcc = false, bool useDirectLoad = false,
+    int64_t prefetchNumStages = 0, bool mustBeAligned = true,


Please update the function docs to explain what the new params do.

Also, I think the naming here is a bit confusing. The decision to use direct load is independent of the decision to use multi-buffering AFAIU. It just happens that the two decisions are linked today. Perhaps instead of these params we could just have a single int64_t param called numMultiBufferingStages (defaulting to 1), and then we just multiply by that value in the shared mem computation.

> Please update the function docs to explain what the new params do.

Thanks, will do.

> Also, I think the naming here is a bit confusing. The decision to use direct load is independent of the decision to use multi-buffering AFAIU. It just happens that the two decisions are linked today. Perhaps instead of these params we could just have a single int64_t param called numMultiBufferingStages (defaulting to 1), and then we just multiply by that value in the shared mem computation.

I'd prefer to keep the two separate parameters, as they correspond to two user-facing flags. Also, prefetchNumStages has different resource implications (multi-buffering via LDS or VGPRs) depending on the load mode.
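
One way to read the trade-off in this exchange is sketched below. The enum and both functions are hypothetical. The mapping of direct load to LDS and non-direct load to VGPRs is an inference from the reply above and from the PR description, not something confirmed by the code shown here.

```cpp
#include <cstdint>

// Hypothetical enum, for illustration only: where prefetch staging buffers
// are assumed to live for a given load mode.
enum class StagingResource { None, LDS, VGPR };

// Assumption: with prefetching enabled, direct load stages data in LDS
// (shared memory), while the non-direct path stages it in registers.
StagingResource stagingResource(bool useDirectLoad,
                                int64_t prefetchNumStages) {
  if (prefetchNumStages <= 0)
    return StagingResource::None;
  return useDirectLoad ? StagingResource::LDS : StagingResource::VGPR;
}

// The single multiplier proposed in the review, derived from the two flags.
// Only LDS staging increases the shared memory estimate.
int64_t numMultiBufferingStages(bool useDirectLoad,
                                int64_t prefetchNumStages) {
  return stagingResource(useDirectLoad, prefetchNumStages) ==
                 StagingResource::LDS
             ? prefetchNumStages
             : 1;
}
```

Under this reading, a single multiplier is enough for the shared memory estimate, while keeping both flags preserves the information about which resource the extra stages consume.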

init commit

209e778

Signed-off-by: Yu-Zhewen <[email protected]>

Yu-Zhewen marked this pull request as ready for review March 11, 2026 14:59

Yu-Zhewen requested review from Groverkss, Max191, krzysz00, kuhar, nirvedhmeshram and qedawkins as code owners March 11, 2026 14:59

Yu-Zhewen requested review from jerryyin and lialan March 11, 2026 14:59

kuhar approved these changes Mar 11, 2026

lialan reviewed Mar 11, 2026

compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp

lialan reviewed Mar 11, 2026

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_tile_and_fuse_gfx950.mlir

lialan reviewed Mar 11, 2026

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_tile_and_fuse_gfx950.mlir

jerryyin reviewed Mar 11, 2026

address reviews

f835ca8

Signed-off-by: Yu-Zhewen <[email protected]>

jerryyin self-requested a review March 11, 2026 18:11

jerryyin approved these changes Mar 11, 2026

lialan approved these changes Mar 11, 2026

Yu-Zhewen merged commit 8206e32 into iree-org:main Mar 11, 2026
56 checks passed

Max191 reviewed Mar 11, 2026

Yu-Zhewen mentioned this pull request Mar 12, 2026

[GPU][NFC] Document useDirectLoad and prefetchNumStages in deduceMMASchedule #23754

Open

Yu-Zhewen merged 2 commits into iree-org:main from Yu-Zhewen:num_stages_shared_mem

5 participants