
Fixes to memory model for barriers and atomics#2297

Merged
kdashg merged 2 commits into gpuweb:main from alan-baker:memory-model-fixes
Nov 16, 2021

Conversation

@alan-baker
Contributor

@alan-baker alan-baker commented Nov 12, 2021

  • All barriers and atomics only have a workgroup memory scope
    • This means translations of storageBarrier to HLSL will
      over-synchronize the memory
  • Fix links between atomics and barriers and the memory model
  • Fix description of compare exchange atomic
  • Update remaining uses of Device to use QueueFamily

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of #2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms.

* All barriers and atomics only have a workgroup memory scope
  * This means translations of storageBarrier to HLSL will
    over-synchronize the memory
* Fix links between atomics and barriers and the memory model
* Fix description of compare exchange atomic
@alan-baker alan-baker added the wgsl WebGPU Shading Language Issues label Nov 12, 2021
@github-actions
Contributor

Previews, as seen when this build job started (84bdbc7):
WebGPU | IDL
WGSL
Explainer

@dneto0
Contributor

dneto0 commented Nov 12, 2021

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of #2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms.

Really? From the MSL manual v2.4, 6.9.1 "Threadgroup and SIMD-group synchronization functions"

The threadgroup_barrier (or simdgroup_barrier) function can also queue a memory fence
(reads and writes) to ensure the correct ordering of memory operations to threadgroup or
device memory.

And table 6.12 (mem fence flags): has

mem_device: Ensure correct ordering of memory operations to device
memory.

@alan-baker
Contributor Author

The OpenCL docs provide a bit better explanation, but there are a few things going on with barriers (and atomics).

For barrier and the work_group_barrier variant that does not take a memory scope, the scope is memory_scope_work_group.

That's from the unified OpenCL C 3.0 spec. And from the OpenCL SPIR-V environment specification:

In an OpenCL 1.2 environment, for the Barrier Instructions OpControlBarrier and OpMemoryBarrier, the Scope for Memory must be Workgroup, and the memory-order constraint in Memory Semantics must be SequentiallyConsistent.

For barriers, I think SPIR-V breaks it down reasonably well, though the semantics operand is a bit muddled:

  • Execution scope: which invocations participate in the control aspect
  • Memory scope: the extent (in terms of invocations) of cache flushing
  • Memory semantics: both memory ordering and the affected memory types

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope. That's the intent of this PR.

The important part is to update the mapping, as that really describes the memory model aspect. We can do a follow-up PR or two to try to clarify what these four (five?) things really are for barriers and atomics (no execution scope for atomics, of course).

@dneto0
Contributor

dneto0 commented Nov 12, 2021

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope. That's the intent of this PR.

Got it.

If I'm reading it right, it means Metal's threadgroup_barrier doesn't reliably support message passing between one threadgroup (workgroup) and another via device memory. Hence the first complaint in #2229.

A similar conclusion applies to vanilla barrier() in GLSL on Vulkan or OpenGL. You'd need to use a more capable barrier enabled by GL_KHR_memory_scope_semantics.

I think your analysis of HLSL's barriers concluded that it does support what's needed. (And we made sure the SPIR-V backend on DXC translates appropriately when targeting Vulkan at least.)

* Storage atomics should remain device scope
@github-actions
Contributor

Previews, as seen when this build job started (ee7b2ae):
WebGPU | IDL
WGSL
Explainer

@raphlinus

To address the questions raised by @dneto0 about what's needed for message passing across workgroups (where the message can't be packed into a 32-bit word along with the flags): no, from what I can tell it can't be done in Metal at all (confirmed by testing and discussions with experts). On GLSL in compatibility mode, memoryBarrierBuffer should be sufficient, as the spec defines it to have queue family scope (which is fine; all I care about is within a dispatch, and coordination between GPU and CPU and suchlike is a whole other ball of wax). And on HLSL I am fairly certain DeviceMemoryBarrier has device scope as well.

Contributor

@dneto0 dneto0 left a comment


Please remove the now-extraneous note about Device scope, at (new) line 7401


The access mode `A` in all atomic built-in functions must be [=access/read_write=].

TODO: Add links to the eventual memory model descriptions.
Contributor

good, thanks.

@dneto0
Contributor

dneto0 commented Nov 12, 2021

On GLSL in compatibility mode, memoryBarrierBuffer should be sufficient, as the spec defines it to have queue family scope (which is fine; all I care about is within a dispatch, and coordination between GPU and CPU and suchlike is a whole other ball of wax).

'memoryBarrierBuffer' was added in the Vulkan memory model extension for GLSL (that's what you pointed at).

Prior to that extension, there are only barrier() and memoryBarrier().
barrier() alone is not sufficient:

https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_memory_scope_semantics.txt#L461

In compute shaders, barrier() is equivalent to controlBarrier() with
execution and memory scope equal to gl_ScopeWorkgroup, storage semantics
equal to gl_StorageSemanticsShared, and sem equal to
gl_SemanticsAcquireRelease. Informally, this means it synchronizes
accesses to shared memory between invocations in the same workgroup.

For memoryBarrier() I may stand corrected:

https://github.com/KhronosGroup/GLSL/blob/master/extensions/khr/GL_KHR_memory_scope_semantics.txt#L492

    void memoryBarrier()
    // equivalent to:
    // memoryBarrier(gl_ScopeQueueFamily,
    //               gl_StorageSemanticsBuffer |
    //               gl_StorageSemanticsShared |
    //               gl_StorageSemanticsImage,
    //               gl_SemanticsAcquireRelease)

That's a lot stronger than I expected. But there's a caveat: in GLSL before the Vulkan memory model, we're relying on the assumption that chaining a barrier() with a memoryBarrier() effectively forms a synchronization chain. I'm not clear that we can guarantee that, but it's what we're shooting for in the WGSL memory model, and we will test it. We do guarantee that in the Vulkan memory model case.

@raphlinus

Ah, thanks. I thought memoryBarrierBuffer was in the compatibility set; I might switch it to memoryBarrier if I come across a device that has trouble with it.

I'm finding a lot of failures on actual GPUs as I exercise this stuff, as in I haven't yet come across a single card that executes all the tests in my suite without any issues. I think this is going to be a two-part effort, first spec'ing the memory model, then doing a considerable amount of validation work.

@dneto0
Contributor

dneto0 commented Nov 12, 2021

I'm finding a lot of failures on actual GPUs as I exercise this stuff, as in I haven't yet come across a single card that executes all the tests in my suite without any issues.

Thank you!

I think this is going to be a two-part effort, first spec'ing the memory model, then doing a considerable amount of validation work.

Definitely. We have to find the common set of functionality that actually works. The draft as landed, and amended here, is our best understanding at this point.

But Metal and D3D memory models are unspecified, and Vulkan is really only specified once we opt into "vulkan memory model". So we're relying on "lore", commonly-understood-but-never-really-well-specified. One way to look at it is that GPU-land is where CPU-land was prior to C++11.

As for testing, @alan-baker engaged with UCSC researchers. They have been writing memory model litmus tests for WebGPU. They've done a great job. See https://users.soe.ucsc.edu/~reeselevine/webgpu/ (requires Chrome canary with a flag). They plan to present an overview in an upcoming WGSL meeting (December 14).

There are lots of moving parts here:

  • what's the actual behaviour of underlying platforms
  • spec'ing that well
  • testing that well
  • ensuring browsers map correctly to underlying platforms
  • ensuring platform compilers and drivers do what they promise (or don't promise, as the case may be)

Thanks for pushing the envelope. It helps us all.

Contributor

@dneto0 dneto0 left a comment


Approve without caveats or nits.


@reeselevine

Minor update from the UCSC side, we now have an official subdomain for the memory model tests at https://gpuharbor.ucsc.edu/ (and it's in the origin trial so no need for Chrome Canary!).

I've been talking to @raphlinus about these issues as well, there are some ideas that I'm hoping to work into the tests there. Also, there's a version of his prefix sum code at https://gpuharbor.ucsc.edu/prefix-sum/, which we can hopefully use to reproduce some of the failures and look for more testing ideas!

@raphlinus

I agree this represents the current state of understanding, and thanks. I would argue that the name storageBarrier is confusing for the new understanding (I'd much prefer the name be reserved for a barrier with device scope), and that the mismatch between scope and space will be difficult to document, but don't at this point have a better proposal.

A bit of a follow-up. I have indeed begun collaborating with @reeselevine on more rigorously testing the memory model, especially in light of the failures I've been seeing (which, by the way, are not quite as bad as in my last comment; some of the failures turned out to be a write-after-read hazard in my code). I should also clarify that the decoupled look-back test is now not expected to pass in WebGPU, but the program may still be useful in tracking bugs (it absolutely would show failures on a current AMD 5700 XT on Windows Vulkan if the barrier were upgraded to device scope), and of course in pointing the way to what might be added post-1.0 to make such workloads viable.

So far we haven't come up with a test that is valid WebGPU and unambiguously catches new failures, but we have one in the pipeline which may pan out, relating to coherence of relaxed atomic loads. We'll keep iterating on that and certainly file an issue if anything needs fixing in the spec.

@kdashg kdashg merged commit 559a883 into gpuweb:main Nov 16, 2021
github-actions bot added a commit that referenced this pull request Nov 16, 2021
SHA: 559a883
Reason: push, by @jdashg

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@kdashg
Contributor

kdashg commented Nov 16, 2021

WGSL meeting minutes 2021-11-16
  • Updates:
    • Had to weaken barriers because in Metal threadgroup_barrier only provides a memory ordering between invocations in the same threadgroup. (That’s not clear at all from the docs; OpenCL 1.x has the same problem.)
    • Other updates were editorial.
    • (David was tempted to land this directly…)
  • AB: Effectively this is the only way it can work. Documents a limitation in an underlying platform.
  • RM: Agree, should land it unless someone sees an issue with it.
  • TR: There’s a comment at the start saying translations to HLSL will over-synchronize buffers. Don’t think that’s true? Or something else.
  • AB: Maybe over-synchronize is the wrong word. HLSL memory scope is larger than what WGSL will offer.
  • TR: So because UAV is device memory shared across a whole device, it may eventually flush to other workgroups, but WGSL doesn’t guarantee as much as that.
  • AB: Difference is that in Metal you won’t get that flush across workgroups, but in HLSL you will get that flush across workgroups.
  • TR: So you’re saying you can write code that can access locations in the same global buffer, but will be independent?
  • AB: If they are non-atomic accesses, and at least one is a write, then it’s a data race. And that’s undefined behaviour.
  • AB: In my mind a control barrier has 4 essential parameters
    • Control scope: workgroup: which invocations wait for each other.
    • Memory scope: what set of invocations will see the flushing of memory operations. Could be workgroup, or could be device.
    • Storage semantics: what kinds of memory does the cache invalidation/flushing apply to? Workgroup memory or device memory or image memory.
    • Ordering: e.g. Relaxed (for atomics), or acquire-release. (used in barriers, and C++11 atomics.)
  • TR: There are different scopes for UAVs; there are “withSync” variants.
  • AB: My understanding of the “withSync” variant is that these are control barriers in addition to memory fences.
  • TR: There should be a barrier intrinsic that does UAV barrier at workgroup memory scope, except it’s not as granular for buffer types.
    • AllMemoryBarrierWithGroupSync
      • AB: That’s shared and device
      • TR: Yes
    • GroupMemoryBarrierWithGroupSync.
      • AB: I missed those in review.
    • TR: DXIL may have more options than HLSL does. The HLSL level doesn’t expose the UAV with workgroup memory scope.
      • AB: Ok, so that’s why we didn’t notice it. We’re generating HLSL in Tint, and have too-big memory scope. We use DeviceMemoryBarrierWithGroupSync (?)
      • AB: In future we’ll use the more narrow one.
  • JG: Sounds like we’re **Approved to Merge**
  • MM: After this, how many variations are we going to have in WGSL
  • AB: Two, though internal feedback says one is poorly named.
  • MM: What’s the difference between them?
  • AB: Workgroup vs device memory

@raphlinus raphlinus mentioned this pull request Nov 18, 2021
alan-baker added a commit to alan-baker/gpuweb that referenced this pull request Nov 29, 2021
* Fix the memory scope for non-private reads and writes missed in gpuweb#2297
  * only atomics have device memory scope
* Define and explain memory and execution scopes
  * add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on
  derivatives
* Remove todos from collective operations
  * fill out barrier, defining control barrier
dneto0 pushed a commit that referenced this pull request Nov 29, 2021
* Fill out scope explanations and collective operations

* Fix the memory scope for non-private reads and writes missed in #2297
  * only atomics have device memory scope
* Define and explain memory and execution scopes
  * add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on
  derivatives
* Remove todos from collective operations
  * fill out barrier, defining control barrier

* Changes for review

* move definitions to more appropriate locations
* attempt to further clarify memory scope

Labels

wgsl WebGPU Shading Language Issues


6 participants