[GPU] Add iree_gpu.global_subgroup_barrier op #23451
Conversation
Add a synchronization-only barrier op that has no memory fence semantics. Unlike gpu.barrier, this op preserves consecutive instances (no canonicalizer) which is critical for the pingpong double-buffer schedule. Fences are handled separately. Includes lowerings for both ROCDL and NVVM. Co-Authored-By: Claude Opus 4.6 <[email protected]>
@krzysz00 I figure you're the best to take a look at this. Basically we need to distinguish between "The" barrier and "a" barrier. gpu.barrier (mainly per the canonicalization that drops adjacent ones) semantically requires all threads to reach that specific barrier to proceed, while we need the semantics "all threads must reach any instance of the barrier to proceed" for wave specialization on targets that don't have named barriers. If you have ideas other than a new op I'd be interested to hear your thoughts.
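To illustrate the distinction, here is a rough sketch of a wave-specialized schedule (hypothetical IR; the condition name and op bodies are illustrative, not taken from the PR). Each wave sits at a textually different instance of the barrier, so a wave must be released when the other waves reach *any* instance, not that specific one:

```mlir
// Hypothetical wave-specialized pingpong skeleton.
scf.if %is_producer_wave {
  // ... fill buffer A ...
  iree_gpu.global_subgroup_barrier  // producer's instance
  // ... fill buffer B ...
} else {
  // ... consume buffer B ...
  iree_gpu.global_subgroup_barrier  // consumer's instance
  // ... consume buffer A ...
}
```

With gpu.barrier's per-instance semantics, barriers in divergent branches like this would not be well-defined; the global-instance semantics is what makes the specialization legal.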
compiler/src/iree/compiler/Codegen/LLVMGPU/test/convert_to_rocdl_gfx1200.mlir
const char *asmStr = ";;;WARNING: BREAKS DEBUG WATCHES\ns_barrier";
rewriter.replaceOpWithNewOp<LLVM::InlineAsmOp>(
Would it be worth moving to rocdl?
WDYM moving to rocdl? Moving the op to rocdl?
I'm generally leaning towards not having inline asm in iree and using rocdl for these low-level intrinsics
I'll note that the standing position of the compiler folks is "if you have felt the need to use inline asm, please don't, but if you must, tell us about it so we can see if your usecase can be eliminated"
This is just a direct copy-paste of this. If you guys want to tell me what to put here I'd be happy to put whatever.
On re-reading, it's probably fine to keep the MI-100 workaround in here, and that comment's even correct.
@qedawkins This is `gpu.barrier memfence []`
// Pre-gfx90a: use inline asm.
auto asmDialectAttr = LLVM::AsmDialectAttr::get(rewriter.getContext(),
                                                LLVM::AsmDialect::AD_ATT);
const char *asmStr = ";;;WARNING: BREAKS DEBUG WATCHES\ns_barrier";
- No, it doesn't break debug watches
- Use the intrinsic, on pain of getting a stern talking-to from the compiler team, unless you know a good reason not to
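For concreteness, a sketch of the suggested change at the IR level, assuming the upstream ROCDL dialect's rocdl.s.barrier op (the exact op spelling here is an assumption on my part):

```mlir
// Before: lowering through inline asm.
llvm.inline_asm asm_dialect = att
    ";;;WARNING: BREAKS DEBUG WATCHES\ns_barrier", "" : () -> ()

// After: the dedicated intrinsic op, which lowers to
// llvm.amdgcn.s.barrier without any inline asm.
rocdl.s.barrier
```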
Not quite, this folds because the barriers are unique, meaning we can assume the second barrier is redundant. The same folder is invalid for global barriers since other workers can have a critical region between the two.
In which case, can you go upstream and patch the folder to special-case the no-memfence case?
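The fold under discussion can be sketched as follows (illustrative IR, not the verbatim upstream pattern):

```mlir
// Valid for gpu.barrier: every thread that passed the first barrier is
// already synchronized, so the second is redundant and can be dropped.
gpu.barrier
gpu.barrier  // foldable

// Invalid for the global barrier: under wave specialization another
// subgroup may execute a critical region between the two instances, so
// consecutive instances must be preserved.
iree_gpu.global_subgroup_barrier
// <- other waves' critical region conceptually lands here
iree_gpu.global_subgroup_barrier  // must NOT fold
```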
Re-reading, I can see why you'd want this - it's to make sure we have correctness around pingpong. I'd still want to re-use upstream implementations here, and see whether we'd rather fiddle with the folders up there than add a new op that'll awkwardly diverge.
The memory fence is a separate discussion. Whether or not
@qedawkins Ok, I can see what you're getting at now, though I'm at least going to argue for wandering upstream and either explicitly adding a carve-out to the uniqueness thing when you specify
I'm not interested in adding a flag to gpu.barrier for these semantics just to kill a folder. And justifying the semantics I'm adding here in a vacuum upstream does not sound appealing to me. Re: named barriers, I'm expecting we'll need something shaped fairly differently, since typically you need a signal and an await for those, no?
SPIR-V has, IIRC, named barriers but not a split signal/wait - but that's off-topic |
krzysz00
left a comment
Approved now that we've got more context on this one
Add a synchronization-only barrier op that has no memory fence semantics. The key distinction between this and gpu.barrier is that it's semantically global: a subgroup is let through the barrier once all subgroups have reached any instance of the barrier, not just a specific one. Note that this is a dramatically more restrictive condition for optimizing the barriers themselves and should only be preferred in situations where it is expressly required. This op also does not fence memory and expects that to be handled separately.