Fixes to memory model for barriers and atomics #2297
Conversation
* All barriers and atomics only have a workgroup memory scope
* This means translations of `storageBarrier` to HLSL will over-synchronize the memory
* Fix links between atomics and barriers and the memory model
* Fix description of the compare-exchange atomic
Really? From the MSL manual v2.4, §6.9.1 "Threadgroup and SIMD-group synchronization functions", and table 6.12 (mem fence flags).
The OpenCL docs provide a bit better explanation, but there are a few things going on with barriers (and atomics).
That's from the unified OpenCL C 3.0 spec. From OpenCL's SPIR environment:
For barriers, I think SPIR breaks it down reasonably well, though the semantics operand is a bit muddled:
The important part is to update the mapping, as that really describes the memory model aspect. We can do a follow-up PR or two to try and clarify what these four (five?) things really are for barriers and atomics (no execution scope there, of course).
Got it. If I'm reading it right, it means Metal's `threadgroup_barrier` doesn't reliably support message passing between one threadgroup (workgroup) and another via device memory. Hence the first complaint in #2229. A similar conclusion applies to vanilla GLSL. I think your analysis of HLSL's barriers concluded that it does support what's needed. (And we made sure the SPIR-V backend on DXC translates appropriately, when targeting Vulkan at least.)
* Storage atomics should remain device scope
To address the questions raised by @dneto0 about what's needed for message passing across workgroups (where the message can't be packed into a 32-bit word along with the flags): no, from what I can tell it can't be done in Metal at all (confirmed by testing and discussions with experts). On GLSL in compatibility mode, `memoryBarrierBuffer` should be sufficient, as it's defined in the spec to have queue-family scope (which is fine; all I care about is within a dispatch; coordination between GPU and CPU and suchlike is a whole other ball of wax). And on HLSL I am fairly certain `DeviceMemoryBarrier` has device scope as well.
dneto0 left a comment
Please remove the now-extraneous note about Device scope, at (new) line 7401
> The access mode `A` in all atomic built-in functions must be [=access/read_write=].
> TODO: Add links to the eventual memory model descriptions.
`memoryBarrierBuffer` was added in the Vulkan memory model extension for GLSL (that's what you pointed at). Prior to that extension, there is only:
That's a lot stronger than I expected. But then there's a caveat: we're relying on the assumption that, in GLSL before the Vulkan memory model, chaining a `barrier()` with a `memoryBarrier()` will effectively form a synchronizing chain. I'm not clear that we can guarantee that, but it's what we're shooting for in the WGSL memory model, and we will test. We do guarantee that in the Vulkan memory model case.
Ah, thanks. I thought `memoryBarrierBuffer` was in the compatibility set; I might switch it to `memoryBarrier` if I come across a device that has trouble with it. I'm finding a lot of failures on actual GPUs as I exercise this stuff, as in I haven't yet come across a single card that executes all the tests in my suite without any issues. I think this is going to be a two-part effort: first spec'ing the memory model, then doing a considerable amount of validation work.
Thank you!
Definitely. We have to find the common set of functionality that actually works. The draft as landed, and amended here, is our best understanding at this point. But the Metal and D3D memory models are unspecified, and Vulkan's is really only specified once we opt into the "Vulkan memory model". So we're relying on "lore": commonly understood but never really well specified. One way to look at it is that GPU-land is where CPU-land was prior to C++11.

As for testing, @alan-baker engaged with UCSC researchers. They have been writing memory model litmus tests for WebGPU, and they've done a great job. See https://users.soe.ucsc.edu/~reeselevine/webgpu/ (requires Chrome Canary with a flag). They plan to present an overview in an upcoming WGSL meeting (December 14). There are lots of moving parts here.
Thanks for pushing the envelope. It helps us all.
dneto0 left a comment
Approve without caveats or nits.
Minor update from the UCSC side: we now have an official subdomain for the memory model tests at https://gpuharbor.ucsc.edu/ (and it's in the origin trial, so no need for Chrome Canary!). I've been talking to @raphlinus about these issues as well; there are some ideas that I'm hoping to work into the tests there. Also, there's a version of his prefix sum code at https://gpuharbor.ucsc.edu/prefix-sum/, which we can hopefully use to reproduce some of the failures and look for more testing ideas!
I agree this represents the current state of understanding, and thanks.

A bit of a follow-up: I have indeed begun collaborating with @reeselevine on more rigorously testing the memory model, especially in light of the failures I've been seeing (which, by the way, are not quite as bad as in my last comment; some of the failures turned out to be a write-after-read hazard in my code). I should also clarify: the decoupled look-back test is now not expected to pass in WebGPU, but the program may still be useful in tracking bugs (it absolutely would show failures on a current AMD 5700 XT with Windows Vulkan if the barrier were upgraded to device scope), and of course in pointing the way to what might be added post-1.0 to make such workloads viable. So far we haven't come up with a test that is valid WebGPU and unambiguously catches new failures, but we have one in the pipeline which may pan out, relating to coherence of relaxed atomic loads. We'll keep iterating on that, and will certainly file an issue if anything needs fixing in the spec.
WGSL meeting minutes 2021-11-16
* Fix the memory scope for non-private reads and writes missed in gpuweb#2297
* only atomics have device memory scope
* Define and explain memory and execution scopes
* add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on derivatives
* Remove todos from collective operations
* fill out barrier, defining control barrier
* Fill out scope explanations and collective operations
* Fix the memory scope for non-private reads and writes missed in #2297
* only atomics have device memory scope
* Define and explain memory and execution scopes
* add links in atomic and sync built-in function sections
* Define a quad for the purpose of an implicit execution scope on derivatives
* Remove todos from collective operations
* fill out barrier, defining control barrier
* Changes for review
* move definitions to more appropriate locations
* attempt to further clarify memory scope
I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed.
`threadgroup_barrier` in MSL is effectively OpenCL C 1.2's `barrier` intrinsic, which only has a memory scope of `Workgroup`. This means (in the context of #2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms.