Skip to content

[GPU] Add iree_gpu.global_subgroup_barrier op#23451

Merged
qedawkins merged 3 commits intoiree-org:mainfrom
qedawkins:global-subgroup-barrier-pr
Mar 9, 2026
Merged

[GPU] Add iree_gpu.global_subgroup_barrier op#23451
qedawkins merged 3 commits intoiree-org:mainfrom
qedawkins:global-subgroup-barrier-pr

Conversation

@qedawkins
Copy link
Contributor

Add a synchronization-only barrier op that has no memory fence semantics. The key distinction between this and gpu.barrier is that it's semantically global. That is, a subgroup is let through a barrier once all subgroups have reached any instance of the barrier, not just a specific one. Note that this is a dramatically more restrictive condition for optimizing the barriers themselves and should only be preferred in situations where it is expressly required. This op also does not fence memory and expects that to be handled separately.

Add a synchronization-only barrier op that has no memory fence
semantics. Unlike gpu.barrier, this op preserves consecutive
instances (no canonicalizer) which is critical for the pingpong
double-buffer schedule. Fences are handled separately. Includes
lowerings for both ROCDL and NVVM.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@qedawkins qedawkins marked this pull request as ready for review February 11, 2026 02:46
@qedawkins
Copy link
Contributor Author

@krzysz00 I figure you're the best to take a look at this. Basically we need to distinguish between "The" barrier and "a" barrier. gpu.barrier (mainly per the canonicalization that drops adjacent ones) semantically requires all threads to reach that specific barrier to proceed, while we need the semantics "all threads must reach any instance of the barrier to proceed" for wave specialization on targets that don't have named barriers.

If you have ideas other than a new op I'd be interested to hear your thoughts.

Comment on lines +87 to +88
const char *asmStr = ";;;WARNING: BREAKS DEBUG WATCHES\ns_barrier";
rewriter.replaceOpWithNewOp<LLVM::InlineAsmOp>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth moving to rocdl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM moving to rocdl? Moving the op to rocdl?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally leaning towards not having inline asm in iree and using rocdl for these low-level intrinsics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll note that the standing position of the compiler folks is "if you have felt the need to use inline asm, please don't, but if you must, tell us about it so we can see if your usecase can be eliminated"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/llvm/llvm-project/blob/f9bca14a33888e70e2f9f1aa1f4a4edd3d7e5dfd/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp#L601

This is just a direct copy-paste of this. If you guys want to tell me what to put here I'd be happy to put whatever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On re-reading, it's probably fine to keep the MI-100 workaround in here, and that comment's even correct.

@krzysz00
Copy link
Contributor

@qedawkins This is `gpu.barrier memfence []'

Copy link
Contributor

@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding

// Pre-gfx90a: use inline asm.
auto asmDialectAttr = LLVM::AsmDialectAttr::get(rewriter.getContext(),
LLVM::AsmDialect::AD_ATT);
const char *asmStr = ";;;WARNING: BREAKS DEBUG WATCHES\ns_barrier";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No it doesn't break debug watches
  2. Use the intrinsic on pain of getting a stern talking to from the compiler team unless you know a good reason not to

@qedawkins
Copy link
Contributor Author

Not quite,

gpu.barrier memfence []
gpu.barrier memfence []

this folds because the barriers are unique, meaning we can assume the second barrier is redundant. The same folder is invalid for global barriers since other workers can have a critical region between the two.

@krzysz00
Copy link
Contributor

Not quite,

gpu.barrier memfence []
gpu.barrier memfence []

this folds because the barriers are unique, meaning we can assume the second barrier is redundant. The same folder is invalid for global barriers since other workers can have a critical region between the two.

In which case, can you go upstream and patch the folder to special-case the no-memfence case?

@qedawkins
Copy link
Contributor Author

amdgpu.lds_barrier folds for the same reason (as does any version of this op with memfences on it). The issue is one of semantics, unless we want to change that.

@krzysz00
Copy link
Contributor

Re-reading, I can see why you'd want this - it's to make sure we have correctness around pingpong

I'd still want to re-use upstream implementations here and see if maybe we don't want to fiddle with the folders up there instead of adding a new op that'l awkwardly diverge

@krzysz00
Copy link
Contributor

gpu.barrier memfence [] is specifically "a barrier with no memory fencing semantics" and so its folders should probably be updated accordingly

@qedawkins
Copy link
Contributor Author

gpu.barrier memfence [] is specifically "a barrier with no memory fencing semantics" and so its folders should probably be updated accordingly

The memory fence is a separate discussion. Whether or not gpu.barrier fences memory is independent of the folder, so it really is a semantics change to the upstream barrier (either the folder never applies or always applies to sequential identical barriers). If we want to go with changing the semantics of gpu.barrier and dropping the folder I'm good with that. Though I do prefer not having to engage with discussions about it upstream because the current semantics + folder are also completely reasonable as a use case.

@krzysz00
Copy link
Contributor

@qedawkins Ok, I can see what you're getting at now, though I'm at least going to argue for wandering upstream and either explicitly adding a carve-out to the uniqueness thing when you specify memfence [] or doing something interesting like allowing for named barriers that don't allow the folding (barriers can have an optional id for example - or we'd do a named barrier object sort of thing and just only support the trivial usage on gfx9)

@qedawkins
Copy link
Contributor Author

@qedawkins Ok, I can see what you're getting at now, though I'm at least going to argue for wandering upstream and either explicitly adding a carve-out to the uniqueness thing when you specify memfence [] or doing something interesting like allowing for named barriers that don't allow the folding (barriers can have an optional id for example - or we'd do a named barrier object sort of thing and just only support the trivial usage on gfx9)

I'm not interested in adding a flag to gpu.barrier for these semantics just to kill a folder. And justifying the semantics I'm adding here in a vacuum upstream does not sound appealing to me. Re: named barriers, I'm expecting we'll need something shaped fairly different since typically you need a signal and an await for those no?

@krzysz00
Copy link
Contributor

SPIR-V has, IIRC, named barriers but not a split signal/wait - but that's off-topic

Copy link
Contributor

@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved now that we've got more context on this one

@qedawkins qedawkins merged commit fa5580c into iree-org:main Mar 9, 2026
57 checks passed
@qedawkins qedawkins deleted the global-subgroup-barrier-pr branch March 9, 2026 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants