
Fixes to memory model for barriers and atomics#2297

Merged
kdashg merged 2 commits into gpuweb:main from alan-baker:memory-model-fixes
Nov 16, 2021

Conversation

@alan-baker
Contributor

@alan-baker alan-baker commented Nov 12, 2021

  • All barriers and atomics only have a workgroup memory scope
    • This means translations of storageBarrier to HLSL will
      over-synchronize the memory
  • Fix links between atomics and barriers and the memory model
  • Fix description of compare exchange atomic
  • Update remaining uses of Device to use QueueFamily

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of #2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms.

* All barriers and atomics only have a workgroup memory scope
  * This means translations of storageBarrier to HLSL will
    over-synchronize the memory
* Fix links between atomics and barriers and the memory model
* Fix description of compare exchange atomic
@alan-baker alan-baker added the wgsl WebGPU Shading Language Issues label Nov 12, 2021
@github-actions
Contributor

Previews, as seen when this build job started (84bdbc7):
WebGPU | IDL
WGSL
Explainer

@dneto0
Contributor

dneto0 commented Nov 12, 2021

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of #2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms.

Really? From the MSL manual v2.4, 6.9.1 "Threadgroup and SIMD-group synchronization functions"

The threadgroup_barrier (or simdgroup_barrier) function can also queue a memory fence
(reads and writes) to ensure the correct ordering of memory operations to threadgroup or
device memory.

And table 6.12 (mem fence flags): has

mem_device: Ensure correct ordering of memory operations to device
memory.

@alan-baker
Contributor Author

The OpenCL docs provide a bit better explanation, but there are a few things going on with barriers (and atomics).

For barrier and the work_group_barrier variant that does not take a memory scope, the scope is memory_scope_work_group.

That's from the unified OpenCL C 3.0 spec. And from the OpenCL SPIR-V environment specification:

In an OpenCL 1.2 environment, for the Barrier Instructions OpControlBarrier and OpMemoryBarrier, the Scope for Memory must be Workgroup, and the memory-order constraint in Memory Semantics must be SequentiallyConsistent.

For barriers, I think SPIR-V breaks it down reasonably well, though the semantics operand is a bit muddled:

  • Execution scope: which invocations participate in the control aspect
  • Memory scope: the extent (in terms of invocations) of cache flushing
  • Memory semantics: both memory ordering and the affected memory types

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope. That's the intent of this PR.

The important part is to update the mapping, as that really describes the memory model aspect. We can do a follow-up PR or two to try to clarify what these four (five?) things really are for barriers and atomics (no execution scope for atomics, of course).

@dneto0
Contributor

dneto0 commented Nov 12, 2021

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope. That's the intent of this PR.

Got it.

If I'm reading it right, it means Metal's threadgroup_barrier doesn't reliably support message passing between one threadgroup (workgroup) and another via device memory. Hence the first complaint in #2229.

A similar conclusion applies to vanilla barrier() in GLSL on Vulkan or OpenGL. You'd need to use a more capable barrier enabled by GL_KHR_memory_scope_semantics.

I think your analysis of HLSL's barriers concluded that it does support what's needed. (And we made sure the SPIR-V backend on DXC translates appropriately when targeting Vulkan at least.)

* Storage atomics should remain device scope
@github-actions
Contributor

Previews, as seen when this build job started (ee7b2ae):
WebGPU | IDL
WGSL
Explainer

@raphlinus

To address the questions raised by @dneto0 about what's needed for message passing across workgroups (where the message can't be packed into a 32-bit word along with the flags): no, from what I can tell it can't be done in Metal at all (confirmed by testing and discussions with experts). On GLSL in compatibility mode, memoryBarrierBuffer should be sufficient, as the spec defines it to have queue family scope (which is fine; all I care about is within a dispatch, and coordination between GPU and CPU and suchlike is a whole other ball of wax). And on HLSL I am fairly certain DeviceMemoryBarrier has device scope as well.

Contributor

@dneto0 dneto0 left a comment


Please remove the now-extraneous note about Device scope, at (new) line 7401


The access mode `A` in all atomic built-in functions must be [=access/read_write=].

TODO: Add links to the eventual memory model descriptions.
Contributor

good, thanks.

@dneto0
Contributor

dneto0 commented Nov 12, 2021

On GLSL in compatibility mode, memoryBarrierBuffer should be sufficient, as the spec defines it to have queue family scope (which is fine; all I care about is within a dispatch, and coordination between GPU and CPU and suchlike is a whole other ball of wax).

'memoryBarrierBuffer' was added in the Vulkan memory model extension for GLSL (that's what you pointed at).

Prior to that extension, there are only barrier() and memoryBarrier().
barrier() alone is not sufficient:

https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_memory_scope_semantics.txt#L461

In compute shaders, barrier() is equivalent to controlBarrier() with
execution and memory scope equal to gl_ScopeWorkgroup, storage semantics
equal to gl_StorageSemanticsShared, and sem equal to
gl_SemanticsAcquireRelease. Informally, this means it synchronizes
accesses to shared memory between invocations in the same workgroup.

For memoryBarrier() I may stand corrected:

https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_memory_scope_semantics.txt#L492

    void memoryBarrier()
    // equivalent to:
    // memoryBarrier(gl_ScopeQueueFamily,
    //               gl_StorageSemanticsBuffer |
    //               gl_StorageSemanticsShared |
    //               gl_StorageSemanticsImage,
    //               gl_SemanticsAcquireRelease)

That's a lot stronger than I expected. But there's a caveat: in GLSL before the Vulkan memory model, we're relying on the assumption that chaining a barrier() with a memoryBarrier() effectively forms a synchronization chain. I'm not clear that we can guarantee that, but it's what we're shooting for in the WGSL memory model, and we will test it. We do guarantee that in the Vulkan memory model case.

@raphlinus

Ah, thanks. I thought memoryBarrierBuffer was in the compatibility set; I might switch it to memoryBarrier if I come across a device that has trouble with it.

I'm finding a lot of failures on actual GPUs as I exercise this stuff, as in I haven't yet come across a single card that executes all the tests in my suite without any issues. I think this is going to be a two-part effort, first spec'ing the memory model, then doing a considerable amount of validation work.

@dneto0
Contributor

dneto0 commented Nov 12, 2021

I'm finding a lot of failures on actual GPUs as I exercise this stuff, as in I haven't yet come across a single card that executes all the tests in my suite without any issues.

Thank you!

I think this is going to be a two-part effort, first spec'ing the memory model, then doing a considerable amount of validation work.

Definitely. We have to find the common set of functionality that actually works. The draft as landed, and amended here, is our best understanding at this point.

But Metal and D3D memory models are unspecified, and Vulkan is really only specified once we opt into "vulkan memory model". So we're relying on "lore", commonly-understood-but-never-really-well-specified. One way to look at it is that GPU-land is where CPU-land was prior to C++11.

As for testing, @alan-baker engaged with UCSC researchers. They have been writing memory model litmus tests for WebGPU. They've done a great job. See https://users.soe.ucsc.edu/~reeselevine/webgpu/ (requires Chrome canary with a flag). They plan to present an overview in an upcoming WGSL meeting (December 14).

There are lots of moving parts here:

  • what's the actual behaviour of underlying platforms
  • spec'ing that well
  • testing that well
  • ensuring browsers map correctly to underlying platforms
  • ensuring platform compilers and drivers do what they promise (or don't promise, as the case may be)

Thanks for pushing the envelope. It helps us all.

Contributor

@dneto0 dneto0 left a comment


Approve without caveats or nits.


@reeselevine

Minor update from the UCSC side, we now have an official subdomain for the memory model tests at https://gpuharbor.ucsc.edu/ (and it's in the origin trial so no need for Chrome Canary!).

I've been talking to @raphlinus about these issues as well, there are some ideas that I'm hoping to work into the tests there. Also, there's a version of his prefix sum code at https://gpuharbor.ucsc.edu/prefix-sum/, which we can hopefully use to reproduce some of the failures and look for more testing ideas!

@raphlinus

I agree this represents the current state of understanding, and thanks. I would argue that the name storageBarrier is confusing for the new understanding (I'd much prefer the name be reserved for a barrier with device scope), and that the mismatch between scope and space will be difficult to document, but don't at this point have a better proposal.

A bit of a follow-up. I have indeed begun collaborating with @reeselevine on more rigorously testing the memory model, especially in light of the failures I've been seeing (which, by the way, are not quite as bad as in my last comment; some of the failures turned out to be a write-after-read hazard in my code). I should also clarify that the decoupled look-back test is now not expected to pass in WebGPU, but the program may still be useful in tracking bugs (it absolutely would show failures on a current AMD 5700 XT on Windows Vulkan if the barrier were upgraded to device scope), and of course in pointing the way to what might be added post-1.0 to make such workloads viable.

So far we haven't come up with a test that is valid WebGPU and unambiguously catches new failures, but we have one in the pipeline which may pan out, relating to coherence of relaxed atomic loads. We'll keep iterating on that and certainly file an issue if anything needs fixing in the spec.

@kdashg kdashg merged commit 559a883 into gpuweb:main Nov 16, 2021
github-actions bot added a commit that referenced this pull request Nov 16, 2021
SHA: 559a883
Reason: push, by @jdashg

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@kdashg
Contributor

kdashg commented Nov 16, 2021

WGSL meeting minutes 2021-11-16
  • Updates:
    • Had to weaken barriers because in Metal threadgroup_barrier only provides a memory ordering between invocations in the same threadgroup. (That’s not clear at all from the docs; OpenCL 1.x has the same problem.)
    • Other updates were editorial.
    • (David was tempted to land this directly…)
  • AB: Effectively this is the only way it can work. Documents a limitation in an underlying platform.
  • RM: Agree, should land it unless someone sees an issue with it.
  • TR: There’s a comment at the start saying translations to HLSL will over-synchronize buffers. Don’t think that’s true? Or something else.
  • AB: Maybe over-synchronize is the wrong word. HLSL memory scope is larger than what WGSL will offer.
  • TR: So because UAV is device memory shared across a whole device, it may eventually flush to other workgroups, but WGSL doesn’t guarantee as much as that.
  • AB: Difference is that in Metal you won’t get that flush across workgroups, but in HLSL you will get that flush across workgroups.
  • TR: So you’re saying you can write code that can access locations in the same global buffer, but will be independent?
  • AB: If they are non-atomic accesses, and at least one is a write, then it’s a data race. And that’s undefined behaviour.
  • AB: In my mind a control barrier has 4 essential parameters
    • Control scope: workgroup: which invocations wait for each other.
    • Memory scope: what set of invocations will see the flushing of memory operations. Could be workgroup, or could be device.
    • Storage semantics: what kinds of memory does the cache invalidation/flushing apply to? Workgroup memory or device memory or image memory.
    • Ordering: e.g. Relaxed (for atomics), or acquire-release. (used in barriers, and C++11 atomics.)
  • TR: There are different scopes for UAVs; there are “withSync” variants.
  • AB: My understanding of the “withSync” variant is that these are control barriers in addition to memory fences.
  • TR: There should be a barrier intrinsic that does UAV barrier at workgroup memory scope, except it’s not as granular for buffer types.
    • AllMemoryBarrierWithGroupSync
      • AB: That’s shared and device
      • TR: Yes
    • GroupMemoryBarrierWithGroupSync.
      • AB: I missed those in review.
    • TR: DXIL may have more options than HLSL does. The HLSL level doesn’t expose the UAV with workgroup memory scope.
      • AB: Ok, so that’s why we didn’t notice it. We’re generating HLSL in Tint, and have too-big memory scope. We use DeviceMemoryBarrierWithGroupSync (?)
      • AB: In future we’ll use the more narrow one.
  • JG: Sounds like we’re **Approved to Merge**
  • MM: After this, how many variations are we going to have in WGSL
  • AB: Two, though internal feedback says one is poorly named.
  • MM: What’s the difference between them?
  • AB: Workgroup vs device memory

@raphlinus raphlinus mentioned this pull request Nov 18, 2021
alan-baker added a commit to alan-baker/gpuweb that referenced this pull request Nov 29, 2021
* Fix the memory scope for non-private reads and writes missed in gpuweb#2297
  * only atomics have device memory scope
* Define and explain memory and execution scopes
  * add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on
  derivatives
* Remove todos from collective operations
  * fill out barrier, defining control barrier
dneto0 pushed a commit that referenced this pull request Nov 29, 2021
* Fill out scope explanations and collective operations

* Fix the memory scope for non-private reads and writes missed in #2297
  * only atomics have device memory scope
* Define and explain memory and execution scopes
  * add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on
  derivatives
* Remove todos from collective operations
  * fill out barrier, defining control barrier

* Changes for review

* move definitions to more appropriate locations
* attempt to further clarify memory scope

Labels

wgsl WebGPU Shading Language Issues


6 participants