Add multi-draw-indirect feature. #1949

Closed
mrshannon wants to merge 2 commits into gpuweb:main from mrshannon:add-multi-draw-indirect-feature

@mrshannon
Contributor

@mrshannon mrshannon commented Jul 15, 2021

Indirect drawing with multiple indirect drawing commands is a common technique for drawing complex scenes that would otherwise be infeasible due to either an excessive number of CPU issued draw calls or scene complexity that cannot be built by the CPU alone. This is done by:

  • Executing multiple draws with a single API call.
  • Allowing the GPU to generate both geometry and the draws necessary to render it.
  • Culling out unnecessary draw calls on the GPU in more complex scenes than CPU culling could achieve.

This PR adds a multi-draw-indirect feature. In particular, it adds:

  • multiDrawIndirect and multiDrawIndexedIndirect methods on GPURenderEncoderBase.
    • Allows submitting multiple draws with a single API call (multi-draw).
    • Allows the GPU to determine the number of draw calls (draw count).
    • Use cases:
      • GPU derived scene data
      • GPU based culling
      • GPU based LOD
      • Efficient execution of complex scenes with a large number of draws
  • Non-zero firstInstance for drawIndirect, drawIndexedIndirect, multiDrawIndirect, and multiDrawIndexedIndirect.
    • This is the only available per draw input, without rebinds, that is readable in the shader.
    • Use cases:
      • Select instance stride vertex data
      • Index into per object or per draw data in storage buffers
      • Multi material, single API call, rendering

Compatibility

The required backend features to implement multi-draw-indirect are available on:

  • Newer Apple devices (~2016+)
  • All DX12 devices
  • All Vulkan capable desktops (with up to date drivers)
  • 30% of Android devices

See the sections below for details.

Vulkan

Multi-Draw

Requires the 0 or 1 restriction on the drawCount argument of vkCmdDrawIndirect and vkCmdDrawIndexedIndirect to be relaxed to any non-negative integer. This requires the multiDrawIndirect feature, which is supported on:

  • 99% of desktop GPUs
  • 63% of Android devices

NOTE: The stride argument will always be set for tight packing, in order to maintain compatibility with DX12.

Draw Count

Requires the vkCmdDrawIndirectCount and vkCmdDrawIndexedIndirectCount functions which are provided by either the drawIndirectCount feature of Vulkan 1.2 or one of the following extensions:

  • VK_AMD_draw_indirect_count
  • VK_KHR_draw_indirect_count

Because drawIndirectCount was introduced in driver updates, the statistics at https://vulkan.gpuinfo.org cannot be relied upon. The following is based on the oldest card from each manufacturer that supports drawIndirectCount; if newer cards dropped support for drawIndirectCount, that is not captured here.

  • Intel integrated cards (that support Vulkan) support drawIndirectCount.
  • NVIDIA cards going back to Kepler support drawIndirectCount.
  • AMD cards going back to the HD 8000 series support drawIndirectCount.

For Android:

  • drawIndirectCount is supported on 100% of devices that support Vulkan 1.2.
  • drawIndirectCount is supported, as an extension, on 28% of devices that do not support Vulkan 1.2.

Non-zero firstInstance

Requires the firstInstance property of the VkDrawIndirectCommand and VkDrawIndexedIndirectCommand to be non-zero. This requires the drawIndirectFirstInstance feature which is supported on:

  • 99% of desktop GPUs
  • 64% of Android devices

DX12

All required features are core to DX12.

Multi-Draw

Uses ExecuteIndirect where the MaxCommandCount argument is greater than 1 and the pArgumentBuffer argument points to a GPU buffer containing an array of D3D12_DRAW_ARGUMENTS or D3D12_DRAW_INDEXED_ARGUMENTS.

NOTE: The binary layout of these structs is compatible with Vulkan.

Draw Count

Uses ExecuteIndirect where the pCountBuffer argument is not NULL.

Non-zero firstInstance

This is the StartInstanceLocation of the D3D12_DRAW_ARGUMENTS or D3D12_DRAW_INDEXED_ARGUMENTS structures. Has native support for values greater than 0.

Metal

Multi-Draw

Can be emulated with Indirect Command Buffers (ICBs) and an extra compute shader invocation to translate from the Vulkan-like indirect draw buffer to an ICB.

Requires

  • iOS 12.0+
  • macOS 10.14+
  • MTLGPUFamilyMac2

Non-zero firstInstance

Natively supported with the baseInstance argument.

Draw Count

Don't record commands past this count in the ICB, and use optimizeIndirectCommandBuffer.

Requires

  • iOS 12.0+
  • macOS 10.14+
  • MTLGPUFamilyMac2


Contributor

@kvark kvark left a comment

This looks very detailed, thank you for the proposal!

We'll probably not expose this on Metal for some time, but having the ICB backing it is an interesting concept.

I think, just polyfilling it entirely on base WebGPU takes a similar amount of effort:

  1. copy the indirect arguments into a temporary buffer
  2. run a compute shader, which will read the count value from the counts buffer, and then it will zero out the instance count on all of the indirect argument entries (in the temp copies) that are behind the read count value.
  3. record the maxDrawCount consecutive drawIndirect invocations, advancing the offset in each, and pointing to the temporary indirect buffer.

Given that it's roughly the same complexity as Metal's ICB workaround, I'm not seeing the latter to be feasible. We might as well polyfill this on all the platforms (that don't support it natively), or not involve any compute-based polyfill at all (leaving Apple platforms behind, and asking users to implement this polyfill on their side).
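
The three polyfill steps above can be sketched on the CPU as follows. This is only an illustration of the data flow: in the polyfill described, step 2 would run in a compute shader, and the helper names `prepareIndirectArgs` and `indirectOffsets` are hypothetical. It assumes the four-word non-indexed layout (vertexCount, instanceCount, firstVertex, firstInstance).

```typescript
const WORDS_PER_ENTRY = 4;                   // words in one indirect draw entry
const INSTANCE_COUNT_WORD = 1;               // position of instanceCount in an entry
const BYTES_PER_ENTRY = WORDS_PER_ENTRY * 4;

// Steps 1 and 2: copy the arguments into a temporary array, then zero the
// instance count of every entry at or past drawCount so those draws become
// no-ops. The source array is left untouched.
function prepareIndirectArgs(
  args: Uint32Array,
  drawCount: number,
  maxDrawCount: number,
): Uint32Array {
  const temp = args.slice(0, maxDrawCount * WORDS_PER_ENTRY);
  for (let i = Math.min(drawCount, maxDrawCount); i < maxDrawCount; i++) {
    temp[i * WORDS_PER_ENTRY + INSTANCE_COUNT_WORD] = 0;
  }
  return temp;
}

// Step 3: byte offsets for the maxDrawCount consecutive drawIndirect calls
// into the temporary buffer.
function indirectOffsets(baseOffset: number, maxDrawCount: number): number[] {
  return Array.from(
    { length: maxDrawCount },
    (_, i) => baseOffset + i * BYTES_PER_ENTRY,
  );
}

// Three draws recorded, but only the first one is live.
const sourceArgs = new Uint32Array([3, 1, 0, 0, 3, 1, 3, 1, 3, 1, 6, 2]);
const tempArgs = prepareIndirectArgs(sourceArgs, 1, 3);
const offsets = indirectOffsets(16, 3);
```

With a real render pass, each entry of `offsets` would be passed to one drawIndirect call on the temporary buffer, so the number of recorded draws is always maxDrawCount regardless of the GPU-side count.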

spec/index.bs Outdated

undefined drawIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset);
undefined drawIndexedIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset);
undefined drawIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset,
Contributor

Did you consider adding the extra methods (e.g. vkCmdDrawIndirectCount) instead of overloading the existing ones?

Contributor Author

@mrshannon mrshannon Jul 16, 2021

I considered adding multiDrawIndirect and multiDrawIndexedIndirect. No other feature added methods so I was not sure what would be preferred. I can modify the PR if that is what the committee would prefer. I see 3 options:

  • Add the drawCount argument to the existing methods and make new methods with drawCountBuffer and drawCountOffset.
  • Put all multi-draw features in a new set of methods, multiDrawIndirect and multiDrawIndexedIndirect.
  • Add both multiDrawIndirect and multiDrawIndexedIndirect, and multiDrawIndirectCount and multiDrawIndexedIndirectCount.

DX12 does not separate the GPU derived count vs CPU derived count, but Vulkan does.
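
For comparison, the third option might look roughly like the interface below, with the CPU-supplied count and the GPU-supplied count as separate methods, mirroring Vulkan's split. Every name and parameter order here is an assumption made for illustration, not a signature from the spec or this PR; `BufferLike` stands in for GPUBuffer so the sketch compiles on its own, and `RecordingEncoder` is a mock that only logs calls.

```typescript
// Opaque stand-in for GPUBuffer (assumption, keeps the sketch self-contained).
interface BufferLike {
  readonly label: string;
}

// Hypothetical shape of option 3: one method takes the draw count from the
// CPU, the other reads it from a buffer and is capped by maxDrawCount.
interface MultiDrawEncoder {
  multiDrawIndirect(
    indirectBuffer: BufferLike, indirectOffset: number, drawCount: number): void;
  multiDrawIndirectCount(
    indirectBuffer: BufferLike, indirectOffset: number, maxDrawCount: number,
    drawCountBuffer: BufferLike, drawCountOffset: number): void;
}

// Mock encoder that records each call as a string instead of drawing.
class RecordingEncoder implements MultiDrawEncoder {
  readonly calls: string[] = [];

  multiDrawIndirect(
    indirectBuffer: BufferLike, indirectOffset: number, drawCount: number): void {
    this.calls.push(
      `multiDrawIndirect(${indirectBuffer.label}, ${indirectOffset}, ${drawCount})`);
  }

  multiDrawIndirectCount(
    indirectBuffer: BufferLike, indirectOffset: number, maxDrawCount: number,
    drawCountBuffer: BufferLike, drawCountOffset: number): void {
    this.calls.push(
      `multiDrawIndirectCount(${indirectBuffer.label}, ${indirectOffset}, ` +
      `${maxDrawCount}, ${drawCountBuffer.label}, ${drawCountOffset})`);
  }
}

const encoder = new RecordingEncoder();
const argsBuffer: BufferLike = { label: "args" };
const countBuffer: BufferLike = { label: "count" };
encoder.multiDrawIndirect(argsBuffer, 0, 8);
encoder.multiDrawIndirectCount(argsBuffer, 0, 1024, countBuffer, 0);
```

The second option would instead fold both cases into one method with an optional count buffer, which is closer to how DX12's ExecuteIndirect behaves.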

spec/index.bs Outdated
undefined drawIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset);
undefined drawIndexedIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset);
undefined drawIndirect(GPUBuffer indirectBuffer, GPUSize64 indirectOffset,
optional GPUSize32 maxDrawCount = 1,
Contributor

can this number be anything, or do we need an extra limit?
Glancing at gpuinfo, some Android devices report it as 1.

Contributor Author

Yes, I think we would need another limit (but maybe not for Android). maxDrawIndirectCount seems to be 1 only for devices that do not support multiDrawIndirect.

spec/index.bs Outdated
- |drawCountBuffer| is `null`, unless the {{GPUFeatureName/"multi-draw-indirect"}} [=feature=] is enabled.
- |drawCountBuffer| is [$valid to use with$] |this|.
- |drawCountBuffer|.{{GPUBuffer/[[usage]]}} contains {{GPUBufferUsage/INDIRECT}}.
- |drawCountOffset| + sizeof([=indirect draw count=]) ≤
Contributor

Are we taking sizeof of the number here?

Contributor Author

It will always be 32-bit, I could just specify 4 bytes.

@Kangz
Contributor

Kangz commented Jul 16, 2021

+1 on this PR being nicely detailed, and the functionality already being polyfillable with compute shaders that copy the draw argument data in a buffer with maxCount drawIndirect (and zero-out unused draws).

Because of the compute shader validation that needs to happen I don't think we'll be able to implement this any time soon (that's why during OT Chromium won't have drawIndexedIndirect and dispatchIndirect enabled), but we could roll it out to more and more hardware gradually.

@mrshannon
Contributor Author

mrshannon commented Jul 16, 2021

> I think, just polyfilling it entirely on base WebGPU takes a similar amount of effort:
>
>   1. copy the indirect arguments into a temporary buffer
>   2. run a compute shader, which will read the count value from the counts buffer, and then it will zero out the instance count on all of the indirect argument entries (in the temp copies) that are behind the read count value.
>   3. record the maxDrawCount consecutive drawIndirect invocations, advancing the offset in each, and pointing to the temporary indirect buffer.
>
> Given that it's roughly the same complexity as Metal's ICB workaround, I'm not seeing the latter to be feasible. We might as well polyfill this on all the platforms (that don't support it natively), or not involve any compute-based polyfill at all (leaving Apple platforms behind, and asking users to implement this polyfill on their side).

On Metal this works because firstInstance can be non-zero. On Vulkan, everywhere firstInstance can be non-zero you also have multiDrawIndirect, though not necessarily drawIndirectCount. Though you can get around drawIndirectCount by zeroing out the instanceCount, either for the user in a compute shader or having the user do it themselves.

So this raises the question: should

  • multiDrawIndirect
  • drawIndirectFirstInstance
  • drawIndirectCount

all be separate features, letting the user zero out instanceCount and/or polyfill themselves?

The issue with this is that it does not fit well with our current model of asking for what you need and either getting it or getting nothing. At first a user might ask for all of it, then fall back on zeroing out instanceCount (less overhead if the user does this) and ask only for multiDrawIndirect and drawIndirectFirstInstance. On Metal that fails as well, so they fall back to the polyfill and ask only for drawIndirectFirstInstance. This does not even consider that the requestAdapter call could have failed for other reasons.

@kvark
Contributor

kvark commented Jul 16, 2021

It sounds to me that we can split the drawIndirectFirstInstance out of this feature, so that the user can polyfill the rest (and first instance semantic is truly orthogonal to the rest).

@mrshannon
Contributor Author

> It sounds to me that we can split the drawIndirectFirstInstance out of this feature, so that the user can polyfill the rest (and first instance semantic is truly orthogonal to the rest).

I will split that off, but what are the thoughts on overloading the existing draws vs adding new ones.

@kainino0x
Contributor

Editors chatted and we think the multidraw calls are too different in shape and functionality - so preference for using a different name instead of overloading with optional arguments.

@kvark
Contributor

kvark commented Jul 19, 2021

> I will split that off, but what are the thoughts on overloading the existing draws vs adding new ones.

We talked about it some more with the editors, and we think it would really help to know how much the firstInstance feature is correlated with "multi-draw" in the native APIs.
The users can polyfill the "multi-draw", while they can't polyfill "firstInstance" efficiently, but their code can expose an API with its own constraints; it doesn't have to be exactly the WebGPU API.

@mrshannon
Contributor Author

mrshannon commented Jul 19, 2021

> how much the firstInstance feature is correlated with "multi-draw" in the native APIs.

On Vulkan, firstInstance is available everywhere multi-draw is (less than 1% difference), though they are two separate features. Only drawIndirectCount has less support. On Metal (without some sort of translation to ICBs), firstInstance is always available but multi-draw is never available. On DX12, everything is always available. So separating firstInstance is only for Metal.

@kvark
Contributor

kvark commented Jul 21, 2021

> less than 1% difference

just to confirm, are we sure that the set of hardware is different by a small margin, or is it just the total percent/number that is different?

@mrshannon
Contributor Author

> just to confirm, are we sure that the set of hardware is different by a small margin, or is it just the total percent/number that is different?

multiDrawIndirect (63%) is mostly useless without drawIndirectFirstInstance (64%) which leads me to conclude that there are 1% of devices that have drawIndirectFirstInstance but not multiDrawIndirect. I guess there could be feature disparity but I don't know what the point would be as each of the draws would render the same instance.

@litherum
Contributor

Why are we considering adding a new feature ostensibly for performance without any performance data?

WebGL and WebGPU have very different CPU overhead characteristics. We can't use the existence of this extension in WebGL as motivation for it in WebGPU.

@mrshannon
Contributor Author

mrshannon commented Jul 26, 2021

> Why are we considering adding a new feature ostensibly for performance without any performance data?
>
> WebGL and WebGPU have very different CPU overhead characteristics. We can't use the existence of this extension in WebGL as motivation for it in WebGPU.

I just did some benchmarking using Vulkan. I have primarily been an OpenGL user, and my assumption about the performance increase was wrong: I had expected at least an order of magnitude based on my OpenGL experience. By reusing the command buffer and using non-zero firstInstance, I was able to achieve a 0% to 6% performance increase (depending on GPU) with drawCount > 1 compared to drawCount = 1. This was with a scene containing ~25,000 draws.

Next I simulated culling out 20,000 of those calls on the GPU by setting instanceCount to 0. This resulted in a 5% to 48% performance improvement with drawCount = 24,576 over drawCount = 1. Adding in drawIndirectCount (instead of zeroing instanceCount) gained another 11% to 13% improvement. NOTE: The weaker the GPU, the smaller the improvement.
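
To make the two culling strategies concrete, here is a small CPU-side sketch of what each one does to the indirect argument array. It is not the benchmark code: the benchmark performed this work on the GPU, and the function names and the boolean visibility list are assumptions for illustration. Entries use the four-word layout (vertexCount, instanceCount, firstVertex, firstInstance).

```typescript
const WORDS = 4; // vertexCount, instanceCount, firstVertex, firstInstance

// Strategy A: keep every entry in place and set instanceCount to 0 for
// culled draws. The draw count stays at the full number of entries.
function cullByZeroing(args: Uint32Array, visible: readonly boolean[]): Uint32Array {
  const out = args.slice();
  visible.forEach((isVisible, i) => {
    if (!isVisible) out[i * WORDS + 1] = 0;
  });
  return out;
}

// Strategy B: pack the surviving entries to the front and return the count
// that would be written to the draw count buffer (the drawIndirectCount path).
function cullByCompaction(
  args: Uint32Array,
  visible: readonly boolean[],
): { args: Uint32Array; drawCount: number } {
  const out = new Uint32Array(args.length);
  let drawCount = 0;
  visible.forEach((isVisible, i) => {
    if (isVisible) {
      out.set(args.subarray(i * WORDS, (i + 1) * WORDS), drawCount * WORDS);
      drawCount++;
    }
  });
  return { args: out, drawCount };
}

// Three draws, with the middle one culled.
const sceneArgs = new Uint32Array([3, 1, 0, 0, 3, 1, 3, 1, 3, 1, 6, 2]);
const visibility = [true, false, true];
const zeroed = cullByZeroing(sceneArgs, visibility);
const compacted = cullByCompaction(sceneArgs, visibility);
```

Strategy A still makes the GPU walk all entries, while strategy B lets it stop after `drawCount` entries, which is consistent with the additional gain reported for drawIndirectCount.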

Therefore, in cases where pipelines and bind groups are consistent between frames (so render bundles can be used) and the number of draws is fairly consistent, a user could polyfill using firstInstance > 0 and render bundles with minimal performance loss. But with varied final draw numbers the performance of true multi-draw goes up significantly. However, this is only when the command buffer can be reused (render bundles). If the command buffer must be built every frame there is certainly CPU overhead to calling drawIndirect 25,000 times vs. calling multiDrawIndirect once.

I do not know what the performance loss would be with dynamic uniform/storage offset changes between indirect draws in the render bundle (which would be required without firstInstance > 0). I can't benchmark this because it is somewhat implementation dependent (argument buffers etc) and is no longer in WebGPU so I can't benchmark it that way.

@mrshannon
Contributor Author

mrshannon commented Jul 26, 2021

Now that multi-draw-indirect is extracted into new methods I think first-instance-indirect should probably be a separate feature. This is because it seems strange that multi-draw-indirect changes the behavior of methods it does not add.

@github-actions
Contributor

Previews, as seen when this build job started (b80d5d9):
WebGPU | IDL
WGSL
Explainer

@kainino0x
Contributor

> We can't use the existence of this extension in WebGL as motivation for it in WebGPU.

This extension doesn't exist in WebGL, which doesn't have indirect draws at all. WebGL has non-indirect multi-draw as an extension.

However the presence of this feature in Vulkan is some evidence of its value.

@kainino0x
Contributor

> Now that multi-draw-indirect is extracted into new methods I think first-instance-indirect should probably be a separate feature. This is because it seems strange that multi-draw-indirect changes the behavior of methods it does not add.

This makes sense, though if we find that it's unnecessary to expose them separately, we could give it a more generic name like "draw-indirect2" or something.

@litherum
Contributor

litherum commented Aug 2, 2021

> Supported on all versions which are still maintained.

The Metal Feature Set Tables indicates that support on Mac is limited to MTLGPUFamilyMac2 and is unavailable on MTLGPUFamilyMac1. Are you indicating that you're considering MTLGPUFamilyMac1 to be unmaintained?

@kainino0x
Contributor

kainino0x commented Aug 2, 2021

Resolution:

  • Tentatively two features: multi-draw-indirect, and first-instance-indirect.
  • Think about browsers emulating feature(s) in the future, but ~assume no emulation for now.
  • Make sure the out-of-bounds behavior is specified for this.

@kainino0x
Contributor

> > Supported on all versions which are still maintained.
>
> The Metal Feature Set Tables indicates that support on Mac is limited to MTLGPUFamilyMac2 and is unavailable on MTLGPUFamilyMac1. Are you indicating that you're considering MTLGPUFamilyMac1 to be unmaintained?

Just realized this was a confusing snip. From context, it's clear that "versions which are still maintained" was referring to OS releases, not hardware. The hardware requirement may have been overlooked.

@mrshannon
Contributor Author

> The Metal Feature Set Tables indicates that support on Mac is limited to MTLGPUFamilyMac2 and is unavailable on MTLGPUFamilyMac1. Are you indicating that you're considering MTLGPUFamilyMac1 to be unmaintained?

Corrected in PR comment, was unfamiliar with how Apple lists capabilities. Metal docs only list OS version.

@kainino0x
Contributor

Resolution: accepted, merge after #2022 has landed and this is rebased over it.

@kainino0x added the "for webgpu editors meeting" and "copyediting" labels and removed the "for webgpu editors meeting" label on Aug 25, 2022
@kainino0x
Contributor

kainino0x commented Aug 30, 2022

superseded by #2315

@kainino0x closed this on Aug 30, 2022
@kainino0x linked an issue on Oct 24, 2023 that may be closed by this pull request
@gpuweb deleted a comment from fwadnjar on Oct 26, 2023
@kainino0x mentioned this pull request on Mar 26, 2024