The Brain Dump

The experimental Sokol Vulkan backend

Mon, 01 Dec 2025 00:00:00 +0000

Update: merge happened on 02-Dec-2025.

In a couple of days I will merge the first implementation of a sokol-gfx Vulkan backend. Please consider this backend as ‘experimental’, it has only received limited testing, has limited platform coverage and some known shortcomings and feature gaps which I will address in followup updates.

The related PRs are here:

sokol/#1350 - this one also has all the embedded shaders for the sokol ‘utility headers’, so it looks much bigger than it actually is (the Vulkan backend is around the same size as the GL backend, a bit over 3 kloc)
sokol-tools/#196 - this is the update for the shader compiler which is already merged

The currently known limitations are:

the entire code expects a ‘desktop GPU feature set’ and doesn’t implement fallback paths for mobile or generally ancient GPUs
the window system glue in sokol_app.h is only implemented for Linux/X11 - and before the question comes up again: it works just fine on Wayland-only distros
only tested on an Intel Meteor Lake integrated GPU (which also means that some buffer types may be allocated in memory types that are not optimal on GPUs without unified memory)
barriers for CPU => GPU updates are currently quite conservative (e.g. more barriers might be inserted than needed, or at a too early point in a frame)
there’s currently no GPU memory allocator, nor a way to inject an external GPU memory allocator like VMA (at least the latter is planned)
rendering is currently only supported to a single swapchain (not a problem when used with sokol_app.h because that also only supports a single window)
it’s currently not possible to inject native Vulkan buffers and images into sokol-gfx (that’s a somewhat esoteric feature supported by the other backends)
I couldn’t get RenderDoc to work, but it’s unclear why

On the upside:

no sokol-gfx API or shader-authoring changes are required (there are some minor breaking API changes because of some code cleanup work I had planned already and which are not directly related to Vulkan, but most code should work without or only minimal changes)
the Vulkan validation layer is silent on all sokol-samples (which try to cover most sokol-gfx features and their combined usage), and this includes the tricky optional synchronization2 validations (I’m pretty proud of that considering that most Vulkan samples I tried have sync-validation errors)
performance on my Intel Meteor Lake laptop in the drawcallperf-sample is already slightly better than the OpenGL backend (on a vanilla Kubuntu system)

It’s also important to understand what actually motivated the Vulkan backend (e.g. why now, and not earlier or much later):

It’s not mainly about performance, but about ‘future potential’ and OpenGL rot. Essentially, the Vulkan backend is the first step towards deprecating the OpenGL backend: first, an alternative to WebGL2 had to happen (which exists now with WebGPU), and next an alternative for OpenGL on Linux (and less important: Android) had to be implemented (which is the Vulkan backend). So far Linux and Android were the only sokol-gfx target platforms limited to a single backend: OpenGL. All other target platforms already have a more modern alternative (Windows with D3D11 and macOS/iOS with Metal). Deprecating the OpenGL backend won’t happen for a while, but personally I can’t wait to free sokol-gfx from the ‘shackles of OpenGL’ ;)

Also another reason why I felt that now is the right time to tackle Vulkan support is that the Vulkan API has improved quite a bit since 1.0 in ways that make it a much better fit for sokol-gfx. In a nutshell (if you already know Vulkan concepts), the sokol-gfx backend makes use of the following ‘modern’ Vulkan features:

‘dynamic rendering’ (e.g. render passes are enclosed by begin/end calls instead of being baked into render-pass objects) - e.g. pretty much a copy of the Metal render pass model. This is a perfect match for sokol-gfx sg_begin_pass()/sg_end_pass()
EXT_descriptor_buffer - this is a controversial choice, but it’s a perfect match for the sokol-gfx resource binding model and I really did not want to deal with the traditional rigid Vulkan descriptor API (which is an overengineered boondoggle if I’ve ever seen one). This is also the main reason why mobile GPUs had to be left out for now, and apparently descriptor buffers are also a poor match for NVIDIA GPUs. The plan here is to wait until Khronos completes work on a descriptor pool replacement which AFAIK will be a mix of descriptor buffers and D3D12-style descriptor heaps and then port the EXT_descriptor_buffer code over to that new resource binding API
‘synchronization2’ (not a drastic change from the original barrier model, I’m just listing it here for completeness)

Work on the Vulkan backend spans three sub-projects:

sokol-shdc: added Vulkan-flavoured SPIRV output
sokol_app.h: device creation, swapchain management and frame loop
sokol_gfx.h: rendering and compute features

sokol-shdc changes

From the outside, the shader compiler changes are minimal (so minimal that the update is actually already live for a little while).

The only change is that a new output shader format has been added: spirv_vk for ‘Vulkan-flavoured SPIRV’. To compile a GLSL input shader to SPIRV:

sokol-shdc -i bla.glsl -o bla.h -l spirv_vk

Internally the changes are also fairly small since sokol-shdc input shaders are already authored in ‘Vulkan-flavoured GLSL’, the only missing information is the descriptor set for resource bindings.

Sokol-shdc shaders only declare a bindslot on resource bindings with different ‘bind spaces’ for uniform blocks, samplers and anything else, for instance:

layout(binding=0) uniform fs_params { ... };
layout(binding=0) uniform texture2D tex;
layout(binding=0) uniform sampler smp;

Sokol-shdc performs a backend-specific bindslot allocation which for SPIRV output assigns descriptor sets (uniform blocks live in descriptor set 0 and everything else in descriptor set 1), and remaps sampler bindings to resolve bindslot collisions with textures, storage buffers and storage images, so the above code snippet essentially becomes:

layout(set=0, binding=0) uniform fs_params { ... };
layout(set=1, binding=0) uniform texture2D tex;
layout(set=1, binding=32) uniform sampler smp;
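The remapping rule can be illustrated with a small C sketch. This is not sokol-shdc code: the enum, struct and function names are hypothetical, and the assumption that every sampler slot is shifted by a fixed offset of 32 is extrapolated from the single example above.

```c
#include <assert.h>

// Hypothetical resource categories, loosely following the 'bind spaces' above.
typedef enum {
    RES_UNIFORM_BLOCK,
    RES_TEXTURE,
    RES_STORAGE_BUFFER,
    RES_STORAGE_IMAGE,
    RES_SAMPLER
} res_kind_t;

typedef struct {
    int set;      // Vulkan descriptor set index
    int binding;  // binding within that descriptor set
} vk_binding_t;

// ASSUMPTION: samplers are moved up by a fixed offset (32 in the example above).
#define SAMPLER_BINDING_OFFSET (32)

// Map a sokol-shdc style bindslot to a (set, binding) pair.
static vk_binding_t remap_binding(res_kind_t kind, int slot) {
    assert(slot >= 0);
    vk_binding_t res;
    if (kind == RES_UNIFORM_BLOCK) {
        // uniform blocks live in descriptor set 0
        res.set = 0;
        res.binding = slot;
    } else {
        // everything else lives in descriptor set 1, samplers are shifted
        // so that they cannot collide with textures, storage buffers or storage images
        res.set = 1;
        res.binding = (kind == RES_SAMPLER) ? (slot + SAMPLER_BINDING_OFFSET) : slot;
    }
    return res;
}
```

With this rule, a texture and a sampler which both declare binding=0 in the input shader end up on different bindings of descriptor set 1.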

The one thing that’s not straightforward is that sokol-shdc does a ‘double-tap’ for SPIRV-output:

the input shader code is compiled from GLSL to SPIRV
SPIRVTools optimizer passes are applied to the SPIRV
bindings are remapped (in this case: simply add descriptor set decorators but keep the bindslots intact)
the SPIRV is translated back to GLSL via SPIRVCross
finally the SPIRVCross output is compiled again to SPIRV

The weird double compilation is a compromise to avoid large structural changes to the sokol-shdc code base and make the Vulkan shader pipeline less of a special case. Essentially, SPIRV is used as an intermediate format in the first compile pass, and then as output bytecode format in the second pass.

sokol_app.h changes

Apart from the actual Vulkan-related update I took the opportunity to do some public API cleanup which was rolling around in my head for a while.

First, the backend-specific config options in the sapp_desc struct are now grouped into per-backend-nested structs, e.g. from this:

sapp_desc sokol_main(int argc, char* argv[]) {
    return (sapp_desc){
        // ...
        .win32_console_utf8 = true,
        .win32_console_attach = true,
        .html5_bubble_mouse_events = true,
        .html5_use_emsc_set_main_loop = true,
    };
}

…to this:

sapp_desc sokol_main(int argc, char* argv[]) {
    return (sapp_desc){
        // ...
        .win32 = {
          .console_utf8 = true,
          .console_attach = true,
        },
        .html5 = {
          .bubble_mouse_events = true,
          .use_emsc_set_main_loop = true,
        }
    };
}

A new enum sapp_pixel_format has been introduced which will play a bigger role in the future to allow more configuration options for the sokol-app swapchain.

A ton of backend-specific functions to query backend-specific objects have been merged to better harmonize with sokol-gfx:

const void* sapp_metal_get_device(void);
const void* sapp_metal_get_current_drawable(void);
const void* sapp_metal_get_depth_stencil_texture(void);
const void* sapp_metal_get_msaa_color_texture(void);
const void* sapp_d3d11_get_device(void);
const void* sapp_d3d11_get_device_context(void);
const void* sapp_d3d11_get_render_view(void);
const void* sapp_d3d11_get_resolve_view(void);
const void* sapp_d3d11_get_depth_stencil_view(void);
const void* sapp_wgpu_get_device(void);
const void* sapp_wgpu_get_render_view(void);
const void* sapp_wgpu_get_resolve_view(void);
const void* sapp_wgpu_get_depth_stencil_view(void);
uint32_t sapp_gl_get_framebuffer(void);

…those have been merged into:

sapp_environment sapp_get_environment(void);
sapp_swapchain sapp_get_swapchain(void);

The new structs sapp_environment and sapp_swapchain conceptually plug into the sokol-gfx structs sg_environment and sg_swapchain (with the emphasis on conceptually: you still need a mapping from the sokol-app structs and enums to the sokol-gfx structs and enums, and this mapping is still performed by the sokol_glue.h header).

That’s it for the public API changes in sokol_app.h, now on to the Vulkan specific parts:

The new struct sapp_environment contains a nested struct sapp_vulkan_environment vulkan; with Vulkan object pointers (as type-erased void-pointers so that they can be tunneled through backend-agnostic code):

typedef struct sapp_vulkan_environment {
    const void* physical_device;  // VkPhysicalDevice
    const void* device;           // VkDevice
    const void* queue;            // VkQueue
    uint32_t queue_family_index;
} sapp_vulkan_environment;

…and likewise the new struct sapp_swapchain contains a nested struct sapp_vulkan_swapchain vulkan; with Vulkan object pointers which are needed for a sokol-gfx swapchain render pass:

typedef struct sapp_vulkan_swapchain {
    const void* render_image;           // VkImage
    const void* render_view;            // VkImageView
    const void* resolve_image;          // VkImage
    const void* resolve_view;           // VkImageView
    const void* depth_stencil_image;    // VkImage
    const void* depth_stencil_view;     // VkImageView
    const void* render_finished_semaphore;  // VkSemaphore
    const void* present_complete_semaphore; // VkSemaphore
} sapp_vulkan_swapchain;

The Vulkan-specific startup code path looks like this (the usual boilerplate-heavy initialization dance):

A VkInstance object is created.
A platform- and window-system-specific VkSurfaceKHR object is created, this is essentially the glue between a Vulkan swapchain and a specific window system. In the first release this window system glue code is only implemented for X11 via vkCreateXlibSurfaceKHR.
A VkPhysicalDevice is picked, this is the first time where the sokol-app backend takes a couple of shortcuts, initialization will fail if:
- EXT_descriptor_buffer is not supported (this currently rules out most mobile devices)
- the supported Vulkan API version is not at least 1.3
- no ‘queue family’ exists which supports graphics, compute, transfer and presentation commands all on the same queue
Next a logical VkDevice object is created with the following required features and extensions (with the exception of compressed texture formats which are optional):
- a single queue for all commands
- EXT_descriptor_buffer
- extendedDynamicState
- bufferDeviceAddress
- dynamicRendering
- synchronization2
- samplerAnisotropy
- optional:
  - textureCompressionBC
  - textureCompressionETC2
  - textureCompressionASTC_LDR
The swapchain is initialized:
- a VkSwapchainKHR object is created:
  - pixel format currently either RGBA8 or BGRA8 (no sRGB)
  - present-mode hardwired to VK_PRESENT_MODE_FIFO_KHR
  - composite-alpha hardwired to VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR
- VkImage and VkImageView objects are obtained or created for the swapchain images, depth-stencil-buffer and optional MSAA surface
Finally a couple of VkSemaphore objects are created for each swapchain image (the number of swapchain images is essentially dictated by the Vulkan driver):
- one render_finished_semaphore which signals that the GPU has finished rendering to a swapchain surface
- one present_complete_semaphore which signals that presenting a swapchain image has completed and the image is ready for reuse

At this point, the Vulkan specific code in sokol_app.h is at about 600 lines of code, which is a lot of boilerplate, but OTOH is a lot less messy than the combined OpenGL window system code for GLX, EGL, WGL or NSOpenGL (yet still a lot more than the window system glue for the other backends).

The actually interesting stuff happens in the last two Vulkan backend functions:

The internal function _sapp_vk_swapchain_next() is a wrapper around vkAcquireNextImageKHR() and obtains the next free swapchain image. The function will also signal the associated present_complete_semaphore.

The last function in the sokol-app Vulkan backend is _sapp_vk_present(), this is a wrapper for vkQueuePresentKHR(). The present operation uses the render_finished_semaphore to make sure that presentation happens after the GPU has finished rendering to the swapchain image. When the vkQueuePresentKHR() function returns with VK_ERROR_OUT_OF_DATE_KHR or VK_SUBOPTIMAL_KHR, the swapchain resources are recreated (this happens for instance when the window is resized).

There’s a couple of open todo points in the sokol-app Vulkan backend which I’ll take care of later:

Any non-success return values from vkAcquireNextImageKHR() are currently only logged but not handled. Normally the application is either supposed to re-create the swapchain resources or skip rendering and presentation. Since I couldn’t coerce my Kubuntu laptop to ever return a non-success value from vkAcquireNextImageKHR() I would have to implement behaviour I couldn’t test, so I had to skip this part for now. Maybe when moving the code over to my Windows/NVIDIA PC I’ll be able to handle that situation properly.
Currently the swapchain image size must match the window client rectangle size (same as OpenGL via GLX). The Vulkan swapchain API has an optional scaling feature, but I couldn’t get this to work on my Kubuntu laptop. Window-system scaling is mainly useful when the system has a high-dpi display but lower-end GPU, and all other sokol-app backends depend on the system to scale a smaller framebuffer to the window client rectangle when needed.

The main area I struggled with in the sokol-app Vulkan backend was swapchain resizing. Most sokol-app backends kick off any swapchain resize operation from the window system’s resize event, e.g.:

window is resized by user
window system resize event fires giving the new window size
sokol-app listens for the window system resize event and initiates a swapchain resize with the new size coming from the window system event, then stores the new size for sapp_width/height() and finally fires an SAPP_EVENTTYPE_RESIZED event

This doesn’t work on the Vulkan backend, the validation layer would sometimes complain that there’s a difference between actual and expected swapchain surface dimensions (I forgot the exact error circumstances, forgivable since implementing a Vulkan backend is basically crawling from one validation layer error to the next).

Long story short: I got it to work by leaving the host window system entirely out of the loop and letting the Vulkan swapchain take full control of the resize process:

window is resized by user
window system resize event fires, but is now ignored by sokol-app
the next time vkQueuePresentKHR() is called it returns with an error code and this triggers a swapchain-resource resize, with the size coming from the Vulkan surface object instead of the window system, finally an SAPP_EVENTTYPE_RESIZED event is fired
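The present-driven resize flow can be sketched in plain C with a mocked present call. All names below are hypothetical stand-ins and none of this is actual sokol_app.h code: the present_result_t values only model VK_SUCCESS, VK_SUBOPTIMAL_KHR and VK_ERROR_OUT_OF_DATE_KHR, and the mock simply reports ‘out of date’ whenever the swapchain and surface sizes disagree.

```c
#include <assert.h>
#include <stdbool.h>

// Stand-ins for VK_SUCCESS, VK_SUBOPTIMAL_KHR and VK_ERROR_OUT_OF_DATE_KHR.
typedef enum { PRESENT_OK, PRESENT_SUBOPTIMAL, PRESENT_OUT_OF_DATE } present_result_t;

typedef struct { int width, height; } extent_t;

typedef struct {
    extent_t surface;     // size reported by the (mock) Vulkan surface
    extent_t swapchain;   // size of the current swapchain images
    int resized_events;   // how often a 'resized' event has been fired
} app_t;

// The window system resize only changes what the surface reports,
// the swapchain is deliberately left alone at this point.
static void on_window_resized(app_t* app, int w, int h) {
    assert((w > 0) && (h > 0));
    app->surface.width = w;
    app->surface.height = h;
}

// Mock of the present call: 'out of date' when swapchain and surface disagree.
static present_result_t mock_queue_present(const app_t* app) {
    const bool same_size = (app->swapchain.width == app->surface.width)
                        && (app->swapchain.height == app->surface.height);
    return same_size ? PRESENT_OK : PRESENT_OUT_OF_DATE;
}

// Present one frame, returns true when the swapchain had to be recreated.
static bool present_frame(app_t* app) {
    const present_result_t res = mock_queue_present(app);
    if (res == PRESENT_OK) {
        return false;
    }
    // the new size comes from the surface, not from the window system event
    app->swapchain = app->surface;
    app->resized_events++;
    return true;
}
```

The point of the sketch is the ordering: the window system event changes nothing on its own, and the swapchain only catches up during the next present.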

This fixes any validation layer warnings and is in the end a cleaner implementation compared to letting the window system dictate the swapchain size.

There are downsides though: At least on my Kubuntu laptop it looks like the window system and Vulkan swapchain code don’t run in lock step. Instead the Vulkan swapchain seems to lag behind the window system a bit and this results in minor artefacts during resizing: sometimes there’s a visible gap between the Vulkan surface and window border, and the frame rate gets slightly out of whack during resize. In comparison, on macOS rendering with Metal during window resize is buttery smooth and without resize-jitter or border-gaps (although tbf, removing the resize-jitter on macOS had to be explicitly implemented by anchoring the NSView object to a window border).

That’s all there is to the Vulkan backend in sokol_app.h, on to sokol_gfx.h!

sokol_gfx.h changes

For the most part, the actual mapping of the sokol-gfx functions to Vulkan API functions is very straightforward, often the mapping is 1:1. This is mainly thanks to using a couple of modern Vulkan features and extensions:

Dynamic rendering (e.g. vkBeginRendering()/vkEndRendering()) is a perfect match for sokol-gfx sg_begin_pass()/sg_end_pass(), this is not very surprising though because the dynamic rendering Vulkan API is basically a ‘de-OOP-ed’ version of the Metal render pass API.
EXT_descriptor_buffer is an absolutely perfect match for sokol-gfx’s sg_apply_bindings() call, and a ‘pretty good’ match for sg_apply_uniforms()

The main areas for future improvements are the barrier system and the staging system, but let’s not get ahead of ourselves.

A 10000 foot view

Apart from the straight mapping of sokol-gfx API calls to Vulkan-API calls, the Vulkan backend has to implement a couple of low-level subsystems. This isn’t all that unusual, other backends also have such subsystems, but the Vulkan backend definitely is the most ‘subsystem heavy’.

OTOH some concepts of modern Vulkan are quite similar to WebGPU, Metal and even D3D11 - and this conceptual overlap significantly simplified the Vulkan backend implementation.

In some areas the Vulkan backend has even more straightforward implementations than some of the other backends, for instance the implementation of the resource binding call sg_apply_bindings in the Vulkan backend is one of the most straightforward of all backends and especially compared to the WebGPU backend. In Vulkan it’s literally just a bunch of memcpy’s followed by a single Vulkan API call to record an offset into the descriptor buffer (ok, it’s actually a bit more complicated because of the barrier system). Compared to that, the WebGPU backend needs to use a ‘hash-and-cache’ approach for baked BindGroup objects, e.g. calling sg_apply_bindings() may involve creating and destroying WebGPU objects.

The low-level subsystems in the sokol-gfx Vulkan backend are:

a ‘delete queue’ system for delayed Vulkan object destruction
the GPU memory allocation system (very rudimentary at the moment)
the frame-sync system (e.g. ensuring that the CPU and GPU can work in parallel in typical render frames)
the uniform update system
the bindings update system
two ‘staging systems’ for copying CPU-side data into GPU-side resources:
- a ‘copy’ staging system
- a ‘stream’ staging system
the resource barrier system

Let’s look at those one by one:

The Delete Queue System

Vulkan doesn’t have any automatic lifetime management like some other 3D APIs (e.g. no D3D-style reference counting). When you call a destroy function on an object, it’s gone. When you do that while the object is still in flight (e.g. referenced in a queue and waiting to be consumed by the GPU), hilarity ensues.

IMHO this is much better than any automatic lifetime management system, because it avoids any confusion about reference counts (e.g. questions like: when I call this function to get an object reference, will that bump the refcount or not?), but this means that a Vulkan backend needs to implement some sort of garbage collection on its own.

Sokol-gfx uses a double-buffered delete-queue system for this. Each ‘double-buffer-frame-context’ owns a delete queue which is a simple fixed-size array of pointer-pairs. Each queue item consists of:

one type-erased Vulkan object pointer (e.g. a void-pointer)
a function pointer for a destructor function which takes a void* as argument and knows how to destroy that Vulkan object

Vulkan objects which may be referenced in command buffers are not destroyed directly via their vkDestroy*() functions, instead they are added to the delete-queue that’s associated with the currently recorded command buffer. At the start of a new frame (what ‘new frame’ actually means is explained further down in the ‘frame-sync system’ section), the delete-queue for that frame-context is drained by calling each queue item’s destructor function with that item’s Vulkan object pointer. This makes sure that any Vulkan objects are kept alive until the GPU has finished processing any command buffers which might hold references to those objects.
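A minimal C sketch of such a double-buffered delete queue might look like the following. The type and function names are hypothetical (the actual sokol_gfx.h internals are not shown in this post), the queue capacity is an arbitrary assumption, and demo_destroy() merely stands in for a function that would wrap a vkDestroy*() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_FRAME_CONTEXTS (2)
#define DELETE_QUEUE_CAPACITY (64)  // ASSUMPTION: arbitrary fixed size

typedef void (*destructor_t)(void* obj);

// one queue item: a type-erased object pointer plus its destructor function
typedef struct {
    void* obj;
    destructor_t destroy;
} delete_item_t;

typedef struct {
    delete_item_t items[DELETE_QUEUE_CAPACITY];
    size_t num;
} delete_queue_t;

static delete_queue_t delete_queues[NUM_FRAME_CONTEXTS];
static size_t cur_frame_context;

// Instead of destroying 'obj' right away, remember it in the delete queue
// of the frame context that is currently being recorded.
static bool delete_queue_add(void* obj, destructor_t destroy) {
    assert(obj && destroy);
    delete_queue_t* q = &delete_queues[cur_frame_context];
    if (q->num >= DELETE_QUEUE_CAPACITY) {
        return false;   // queue is full
    }
    q->items[q->num].obj = obj;
    q->items[q->num].destroy = destroy;
    q->num++;
    return true;
}

// Switch to the next frame context and drain its delete queue. In a real
// backend this may only happen after the GPU is known to be done with that
// frame context (e.g. after waiting on its fence).
static void delete_queue_next_frame(void) {
    cur_frame_context = (cur_frame_context + 1) % NUM_FRAME_CONTEXTS;
    delete_queue_t* q = &delete_queues[cur_frame_context];
    for (size_t i = 0; i < q->num; i++) {
        q->items[i].destroy(q->items[i].obj);
    }
    q->num = 0;
}

// Demo destructor: 'obj' points to an int counter which is incremented.
static void demo_destroy(void* obj) {
    (*(int*)obj)++;
}
```

With two frame contexts, an object queued in one frame is only destroyed two delete_queue_next_frame() calls later, when its frame context comes around again.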

The GPU Memory Allocation System

Currently GPU allocations do not go through a custom allocator, instead all granular allocations directly call into vkAllocateMemory(). Originally I had intended to use SebAaltonen’s OffsetAllocator as the default GPU allocator, but also expose an allocator interface to allow users to inject more complex allocators like VMA.

Historically a custom allocator was pretty much required because some Vulkan drivers only allowed 4096 unique GPU allocations. Today though it looks like pretty much all (desktop) Vulkan drivers allow 4 billion allocations (at least according to the Vulkan hardware database).

The plan is still to at least allow injecting a custom GPU allocator via an allocator interface, and also maybe to integrate OffsetAllocator as default allocator, but without knowing the memory allocation strategy of Vulkan drivers this may be redundant. E.g. if a Vulkan driver essentially integrates something like VMA anyway there’s not much point stacking another allocator on top of it, at least for a fairly high level API wrapper like sokol-gfx.

In any case, the current GPU memory allocation implementation is prepared for a bit more abstraction in the future. All GPU allocations go through a single internal function _sg_vk_mem_alloc_device_memory() which takes a ‘memory type’ enum and a VkMemoryRequirements pointer as input. The memory type enum is sokol-gfx specific and includes:

storage buffer (an sg_buffer object with storage buffer usage)
generic buffer (all other sg_buffer types)
image (all usages)
internal staging buffer for the ‘copy-staging system’
internal staging buffer for the ‘stream-staging system’
internal uniform buffer
internal descriptor buffer

Currently all resources are either in ‘device-local’ memory, or in ‘host-visible + host-coherent’ memory. Having the mapping from sokol-specific memory type to Vulkan memory flags in one place makes it easier to tweak those flags in the future (or delegate that decision to an external memory allocator).
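As an illustration, such a central mapping could look like the C sketch below. The enum and function names are hypothetical, the flag constants only mirror the bit values of VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, and which memory type ends up with which flags is an assumption (only the host-visible uniform buffer is stated explicitly in this post).

```c
#include <assert.h>
#include <stdint.h>

// Mirror the VK_MEMORY_PROPERTY_* bit values so that the sketch compiles
// without the Vulkan headers.
#define MEM_DEVICE_LOCAL  (0x1u)
#define MEM_HOST_VISIBLE  (0x2u)
#define MEM_HOST_COHERENT (0x4u)

// Hypothetical names for the sokol-gfx specific memory types listed above.
typedef enum {
    MEMTYPE_STORAGE_BUFFER,
    MEMTYPE_GENERIC_BUFFER,
    MEMTYPE_IMAGE,
    MEMTYPE_STAGING_COPY,
    MEMTYPE_STAGING_STREAM,
    MEMTYPE_UNIFORMS,
    MEMTYPE_DESCRIPTORS
} mem_type_t;

// Single place which decides the memory property flags per memory type.
// ASSUMPTION: user-visible resources are device-local, everything the CPU
// writes directly is host-visible + host-coherent.
static uint32_t mem_property_flags(mem_type_t type) {
    switch (type) {
        case MEMTYPE_STORAGE_BUFFER:
        case MEMTYPE_GENERIC_BUFFER:
        case MEMTYPE_IMAGE:
            return MEM_DEVICE_LOCAL;
        case MEMTYPE_STAGING_COPY:
        case MEMTYPE_STAGING_STREAM:
        case MEMTYPE_UNIFORMS:
        case MEMTYPE_DESCRIPTORS:
            return MEM_HOST_VISIBLE | MEM_HOST_COHERENT;
        default:
            assert(0 && "invalid memory type");
            return 0;
    }
}
```

Tweaking the placement of a resource category (or delegating the decision to an external allocator) then only touches this one function.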

The Frame Sync System

The frame sync system is mainly concerned about letting the CPU and GPU work in parallel without stepping on each other’s feet. This basically comes down to double-buffering all resources which are written by the CPU and read by the GPU, and to have one sync-point in a sokol-gfx frame where the CPU needs to wait for the oldest ‘frame-context’ to become available (e.g. is no longer ‘in flight’).

This single CPU <=> GPU sync point is implemented in a function _sg_vk_acquire_frame_command_buffers(). The name indicates the main feature of that function: it acquires command buffers to record the Vulkan commands of the current frame. Command buffers are reused, so this involves waiting for the command buffers to become available (e.g. they are no longer read from by the GPU). “Command buffers” is plural because there are two command buffers per frame: one which records all staging-commands, and one for the actual compute/render commands - more on that later in the staging system section.

For this CPU <=> GPU synchronization, each double-buffered frame-context owns a VkFence which is signalled when the GPU is done processing a ‘queue submit’.

So the first and most important thing the _sg_vk_acquire_frame_command_buffers() function does is to wait for the fence of the oldest frame-context with a call to vkWaitForFences().

This potential-wait-operation is the reason why sokol-gfx applications should move sokol-gfx calls towards the end of the frame callback and try to do all heavy non-rendering-related CPU work at the start of the frame callback. More specifically calls to:

sg_begin_pass()
sg_update_buffer()
sg_update_image()
sg_append_buffer()

…these are basically the ‘potential new-frame entry points’ of the sokol-gfx API which may require the CPU to wait for the GPU.

The _sg_vk_acquire_frame_command_buffers() function does a couple more things after vkWaitForFences() returns:

first (actually before the vkWaitForFences() call) it checks if the function had already been called in the current frame, if yes it returns immediately
vkResetFences() is called on the fence we just waited on
the delete-queue is drained (e.g. all resources which were recorded for destruction in the frame-context we just waited on are finally destroyed)
any command buffers associated with the new frame are reset via vkResetCommandBuffer()
…and recording into those command buffers is started via vkBeginCommandBuffer()
additionally the other subsystems are notified because they might want to do their own thing:
- _sg_vk_uniform_after_acquire()
- _sg_vk_bind_after_acquire()
- _sg_vk_staging_stream_after_acquire()

The other internal function of the frame-sync system is _sg_vk_submit_frame_command_buffers(). This is called at the end of a ‘sokol-gfx frame’ in the sg_commit() call. The main job of this function is to submit the recorded command buffers for the current frame via vkQueueSubmit(). This submit operation uses the two semaphores we got handed from the outside world (e.g. sokol-app) as part of the swapchain information in sg_begin_pass():

the present_complete_semaphore is used as the wait-semaphore of the vkQueueSubmit() call (the GPU basically needs to wait for the swapchain image of the render pass to become available for reuse)
the render_finished_semaphore is used as the signal-semaphore to be signalled when the GPU is done processing the submit payload

Before the vkQueueSubmit() call there’s a bit more housekeeping happening:

the other subsystems are notified about the submit via:
- _sg_vk_staging_stream_before_submit()
- _sg_vk_bind_before_submit()
- _sg_vk_uniform_before_submit()
recording into the command buffers which are associated with the current frame context is finished via vkEndCommandBuffer()
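The interplay of the acquire and submit functions can be sketched with a mocked GPU in plain C. Everything here is a hypothetical stand-in: a bool replaces the VkFence, the mock GPU finishes each submit immediately, and the subsystem notifications, the delete queue and the semaphores are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_FRAME_CONTEXTS (2)

typedef struct {
    bool fence_signaled;  // stand-in for a VkFence
    bool recording;       // stand-in for the command buffer state
    int submits;          // how often this frame context was submitted
} frame_context_t;

typedef struct {
    frame_context_t contexts[NUM_FRAME_CONTEXTS];
    int cur;              // index of the current frame context
    bool acquired;        // true after the first acquire in a frame
    int waits;            // number of (mock) fence waits
} frame_sync_t;

static void frame_sync_init(frame_sync_t* fs) {
    *fs = (frame_sync_t){0};
    for (int i = 0; i < NUM_FRAME_CONTEXTS; i++) {
        // fences start out signaled so that the first wait doesn't block
        fs->contexts[i].fence_signaled = true;
    }
}

// Called from all 'potential new-frame entry points'.
static void acquire_frame(frame_sync_t* fs) {
    if (fs->acquired) {
        return;     // already called in this frame
    }
    fs->acquired = true;
    frame_context_t* ctx = &fs->contexts[fs->cur];
    // a real backend would block in vkWaitForFences() here
    assert(ctx->fence_signaled);
    fs->waits++;
    ctx->fence_signaled = false;    // models vkResetFences()
    ctx->recording = true;          // models reset + begin of the command buffers
}

// Called once at the end of the frame.
static void submit_frame(frame_sync_t* fs) {
    assert(fs->acquired);
    frame_context_t* ctx = &fs->contexts[fs->cur];
    ctx->recording = false;         // models vkEndCommandBuffer()
    ctx->submits++;                 // models vkQueueSubmit()
    ctx->fence_signaled = true;     // the mock GPU finishes immediately
    fs->cur = (fs->cur + 1) % NUM_FRAME_CONTEXTS;
    fs->acquired = false;
}
```

Calling acquire_frame() several times within one frame only ‘waits’ once, which is the early-out behaviour described above.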

It’s also important to note that there is one other potential CPU <=> GPU sync-point in a frame, and that’s in the first sg_begin_pass() for a swapchain render pass: the swapchain-info struct that’s passed into sg_begin_pass() contains a swapchain image which must be acquired via vkAcquireNextImageKHR() (when using sokol_app.h this happens in the sapp_get_swapchain() call - usually indirectly via sglue_swapchain()).

That is all for the frame-sync system in sokol-gfx, all in all quite similar to Metal or WebGPU, just with more code bloat (as is the Vulkan way).

Resource binding via EXT_descriptor_buffer

…a little detour into Vulkan descriptors and how the sokol-gfx resource binding model maps to Vulkan.

Conceptually and somewhat simplified, a Vulkan descriptor is an abstract reference to a Vulkan buffer, image or sampler which needs to be accessible in a shader. Basically what shows up on the shader side whenever you see a layout(binding=x) .... In sokol-gfx lingo this is called a ‘binding’.

In an ideal world, such a binding would simply be a ‘GPU pointer’ to some opaque struct living in GPU memory which describes to shader code how to access bytes in a storage buffer, pixels in a storage image, or how to perform a texture-sampling operation.

In the real world it’s not that simple because this is exactly the one main area where GPU architectures still differ dramatically: on some GPUs this information might be hardwired into register tables and/or involves fixed-function features instead of being just ‘structs in GPU memory’ - and unfortunately those differences are not limited to shitty mobile GPUs, but are also still present in desktop GPUs. Intel, AMD and NVIDIA all have different opinions on how this whole resource binding thing should work - and I’m not sure anything has changed in the last decade since Vulkan promised us a more-or-less direct mapping to the underlying hardware.

So in the real world 3D APIs still need to come up with some sort of abstraction layer to get all those different hardware resource binding models under a common programming model (and yes, even the apparently ‘low-level’ Vulkan API had to come up with a highlevel abstraction for resource binding - and this went quite poorly… but I digress).

(side note: traditional vertex- and index-buffer-bindings are not performed through Vulkan descriptors, but through regular ‘bindslot-setter’ calls like in any other 3D API - go figure).

A Vulkan descriptor-set is a group of such concrete bindings which can be applied as an atomic unit instead of applying each binding individually. In the end the traditional Vulkan descriptor model isn’t all that different from the ‘old’ bindslot model used in Metal V1 or D3D11, the one big and important difference is that bindings are not applied individually but as groups.

The downside of such a ‘bind group model’ is of course that specific binding combinations may be unpredictable - which is the one big recurring topic in Vulkan’s (very slow) API evolution.

In ‘old Vulkan’ pretty much all state-combinations in all areas of the API need to be known upfront in order to move as much work as possible into the init-phase and out of the render-phase. Theoretically a pretty sensible plan, but unfortunately only theoretically. In practice there are a lot of use cases where pre-baking everything is simply not possible, especially outside the game engine world, and even in gaming it doesn’t quite work - whenever you see stuttering when something new appears on screen in modern games built on top of state-of-the-art engines calling into modern 3D APIs - that’s most likely the core design philosophy of Vulkan and D3D12 crashing and burning after colliding with reality. Thankfully - but unfortunately very slowly - this is changing. Most of Vulkan’s progress in the last decade was about rolling the core API back to a more ‘dynamic’ programming model.

Ok, back to Vulkan’s resource binding lingo:

A Vulkan descriptor-set-layout is the shape of a descriptor-set. It basically says ‘there will be a sampled texture at binding 0, a buffer at binding 1 and a sampler at binding 2’, but not the concrete texture, buffer or sampler objects (those are referenced in the concrete descriptor-sets).

And finally a Vulkan pipeline-layout groups all descriptor-set-layouts required by the shader stages of a Vulkan pipeline-state-object.

When coming from WebGPU this should all sound quite familiar since the WebGPU bindgroups model is essentially the Vulkan 1.0 descriptor model (for better or worse):

WebGPU BindGroupEntry maps to Vulkan descriptors
WebGPU BindGroup maps to Vulkan descriptor sets
WebGPU BindGroupLayout maps to Vulkan descriptor set layouts
WebGPU PipelineLayout maps to Vulkan pipeline layouts

‘Old Vulkan’ then adds descriptor pools on top of that but tbh I didn’t even bother to deal with those and skipped right to EXT_descriptor_buffer.

With the descriptor buffer extension, descriptors and descriptor sets are ‘just memory’ with opaque memory layouts for each descriptor type which are specific to the Vulkan driver (depending on the driver and descriptor type, such opaque memory blobs seem to be between 16 and 256 bytes per descriptor).

Binding resources with EXT_descriptor_buffer essentially looks like this:

In the init-phase:

create a descriptor buffer big enough to hold all descriptors needed in a worst-case frame
for each item in a descriptor-set-layout, ask Vulkan for the descriptor size and relative offset to the start of the descriptor-set data in the descriptor buffer
similar for all concrete descriptors, ask Vulkan to copy their opaque memory representation into some private memory location and keep those around for the render phase (of course it’s also possible to move this step into the render phase)

In the render-phase:

memcpy the concrete descriptor blobs we stored upfront into the descriptor buffer to compose an adhoc descriptor set, using the offsets we also stored upfront
finally record the start offset in the descriptor buffer into a Vulkan command buffer via a Vulkan API call, and that’s it!

This is pretty much the same way uniform data updates are performed in the sokol-gfx Metal and WebGPU backends, now just extended to resource bindings.

TL;DR: both uniform data snippets and resource bindings are ‘just frame-transient data snippets’ which are memcpy’ed into per-frame buffers and the buffer offsets recorded before the next draw- or dispatch-call.
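To make the ‘descriptors are just memory’ idea a bit more tangible, here is a minimal CPU-only sketch of the render-phase step. It is not sokol-gfx code and makes no Vulkan calls: the struct and function names, the fixed sizes and the byte-array ‘descriptor buffer’ are all assumptions for illustration. In a real backend the sizes and offsets would be queried from Vulkan in the init-phase and the blobs would be the opaque descriptor data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// NOTE: hypothetical CPU-side model, not sokol-gfx code
#define MAX_BINDINGS (4)
#define DESC_BUF_SIZE (1024)

// init-phase result: per-binding size and offset (queried from Vulkan in real code)
typedef struct {
    size_t num_bindings;
    size_t size[MAX_BINDINGS];    // opaque descriptor size per binding
    size_t offset[MAX_BINDINGS];  // offset relative to the start of the set
    size_t set_size;              // total size of one descriptor set
} set_layout_t;

// stands in for the per-frame descriptor buffer, with a simple bump cursor
typedef struct {
    uint8_t data[DESC_BUF_SIZE];
    size_t cursor;
} desc_buffer_t;

// render-phase: compose an adhoc descriptor set from blobs stored upfront,
// returns the start offset of the set, or SIZE_MAX if the buffer is full
size_t compose_descriptor_set(desc_buffer_t* buf, const set_layout_t* layout, const uint8_t* const blobs[]) {
    assert(layout->num_bindings <= MAX_BINDINGS);
    if ((buf->cursor + layout->set_size) > DESC_BUF_SIZE) {
        return SIZE_MAX;
    }
    const size_t start = buf->cursor;
    for (size_t i = 0; i < layout->num_bindings; i++) {
        memcpy(&buf->data[start + layout->offset[i]], blobs[i], layout->size[i]);
    }
    buf->cursor += layout->set_size;
    // real code would now record 'start' into the command buffer
    return start;
}
```

The returned start offset is the only thing that would need to go through a Vulkan call, everything else is plain memory traffic.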

In sokol-gfx, the VkDescriptorSetLayout and VkPipelineLayout objects are created in sg_make_shader() using the shader interface reflection information provided in the sg_shader_desc arg (which is usually code-generated by the sokol-shdc shader compiler).

the first descriptor set layout (set 0) describes all uniform block bindings used by the shader across all shader stages
the second descriptor set layout (set 1) describes all texture, storage buffer, storage image and sampler bindings

…additionally, sg_make_shader() queries the descriptor sizes and offsets within their descriptor set.

The uniform update system:

Conceptually uniform updates in the Vulkan backend are similar to the Metal backend:

a double-buffered uniform buffer big enough to hold all uniform updates for a worst-case frame, allocated in host-visible memory (so that the memory is directly writable by the CPU and directly readable by the GPU)
a call to sg_apply_uniforms() memcpy’s the uniform data snippet into the next free uniform buffer location (taking alignment requirements into account), this happens individually for the up to 8 ‘uniform block slots’
before the next draw- or dispatch-call, the offsets into the uniform buffer for the up to 8 uniform block slots are recorded into the current command buffer

The last step of recording the uniform-buffer offsets is delayed into the next draw- or dispatch-call to avoid redundant work. This is because sg_apply_uniforms() works on a single uniform block slot, but in Vulkan all uniform block slots are grouped into one descriptor set, and we only want to apply that descriptor-set at most once per draw/dispatch call.

The actual sg_apply_uniforms() call is extremely cheap since no Vulkan API calls are performed:

a simple memcpy of the uniform data snippet into the per-frame uniform buffer
writing the ‘GPU buffer address’ and snippet size into a cached array of VkDescriptorAddressInfoEXT structs
setting a ‘uniforms dirty flag’.

…then later in the next draw- or dispatch-calls if the ‘uniforms dirty flag’ is set the actual uniform block descriptor set binding happens:

for each uniform block used in the current pipeline/shader, an opaque descriptor memory blob is directly written into the frame’s descriptor buffer via a call to vkGetDescriptorEXT()
the start offset of the descriptor-set in the descriptor buffer is recorded into the current frame command buffer via vkCmdSetDescriptorBufferOffsetsEXT()
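The split between the cheap ‘apply’ step and the deferred ‘bind’ step can be sketched roughly like this. This is a hypothetical model with no Vulkan calls: the names, the buffer size and the 256-byte alignment are assumptions, and the counter merely stands in for the point where the real backend would write descriptors and record the descriptor buffer offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// NOTE: hypothetical model; names, sizes and the alignment value are assumptions
#define MAX_UB_SLOTS (8)
#define UNIFORM_BUF_SIZE (4096)
#define UB_ALIGN (256)

typedef struct {
    uint8_t data[UNIFORM_BUF_SIZE];   // stands in for the per-frame uniform buffer
    size_t cursor;                    // next free location
    size_t slot_offset[MAX_UB_SLOTS]; // cached per-slot offsets
    size_t slot_size[MAX_UB_SLOTS];   // cached per-slot sizes
    bool dirty;                       // the 'uniforms dirty flag'
    int num_set_bindings;             // counts simulated descriptor-set bindings
} uniform_system_t;

static size_t align_up(size_t val, size_t align) {
    return (val + (align - 1)) & ~(align - 1);
}

// cheap part: memcpy plus bookkeeping, no (simulated) API call
bool apply_uniforms(uniform_system_t* us, int slot, const void* ptr, size_t size) {
    assert((slot >= 0) && (slot < MAX_UB_SLOTS));
    const size_t offset = align_up(us->cursor, UB_ALIGN);
    if ((offset + size) > UNIFORM_BUF_SIZE) {
        return false;  // worst-case frame size exceeded
    }
    memcpy(&us->data[offset], ptr, size);
    us->slot_offset[slot] = offset;
    us->slot_size[slot] = size;
    us->cursor = offset + size;
    us->dirty = true;
    return true;
}

// called at the start of each draw or dispatch
void flush_uniforms_before_draw(uniform_system_t* us) {
    if (us->dirty) {
        us->dirty = false;
        // real code would write the descriptors and record the set offset here
        us->num_set_bindings++;
    }
}
```

With this structure, updating two uniform block slots followed by two draw calls results in a single simulated descriptor-set binding instead of two or four.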

…delaying the operation to record the uniform buffer offsets into the draw- or dispatch-call to avoid redundant API calls is actually something that I will also need to implement in the WebGPU backend (I was taking notes while implementing the Vulkan backend which improvements could be back-ported to the WebGPU backend, and I’ll take care of those right after the Vulkan backend is merged).

The resource binding system

Updating resource bindings via sg_apply_bindings() is very similar to the uniform update system, but actually even simpler because no extra uniform buffer is involved, and some more initialization can be moved into the init-phase when creating view objects:

When creating a texture-, storage-buffer- or storage-image-view object via sg_make_view() or a sampler object via sg_make_sampler(), the concrete descriptor data (those little 16..256 byte opaque memory blobs) is copied into the sokol-gfx view or sampler object via vkGetDescriptorEXT().

Then sg_apply_bindings() is just a couple of memcpy’s and a Vulkan call:

for each view and sampler in the sg_bindings argument, the descriptor memory blob which was stored in the sokol-gfx object is memcpy’ed into the current frame’s descriptor buffer - i.e. no Vulkan calls for that…
finally a single call to vkCmdSetDescriptorBufferOffsetsEXT() records the descriptor buffer offset into the current frame’s command buffer

Vertex- and index-buffer bindings happen via traditional bindslot calls (vkCmdBindVertexBuffers and vkCmdBindIndexBuffer). Additionally, barriers may be inserted inside sg_apply_bindings() but that will be explained further down in the barrier system.

The two staging systems

Sokol-gfx currently has two separate staging systems for uploading CPU-side data into GPU-memory with the rather arbitrary names ‘copy-staging-system’ and ‘stream-staging-system’. Both can upload data into buffers and images, but with different compromises:

the ‘copy-staging-system’ can upload large amounts of data through a single small staging buffer (default size: 4 MB), with the downside that the Vulkan queue needs to be flushed (i.e. a vkQueueWaitIdle() is involved)
the ‘stream-staging-system’ can upload a limited amount of data per-frame through a fixed-size double-buffered staging buffer (default size: 16 MB - but this can be tweaked in the sg_setup() call of course), this doesn’t cause any frame-pacing ‘disruptions’ like the copy-staging-system does

The copy-staging-system is currently used:

to upload initial content into immutable buffers and images within sg_make_buffer() and sg_make_image()
to upload data into usage.dynamic_update images and buffers in the sg_update_buffer(), sg_append_buffer() and sg_update_image() calls

The stream-staging system is only used for usage.stream_update resources when calling sg_update_buffer(), sg_append_buffer() and sg_update_image().

This means that the correct choice of usage.dynamic_update and usage.stream_update for buffers and images is much more important in the Vulkan backend than in other backends.

In general:

creating an immutable buffer or image with initial content in the render-phase will ‘disrupt’ rendering (how bad this disruption actually is remains to be seen though)
the same disruption happens for updating a buffer or image with usage.dynamic_update,
make sure to use usage.stream_update for buffers and images that need to be updated each frame, but be aware that those uploads go through a single per-frame staging buffer which needs to be big enough to hold all stream-uploads in a single frame (staging buffer sizes can be adjusted in the sg_setup() call)

The strategy for updating usage.dynamic_update resources may change in the future. For instance I was considering treating dynamic-updates exactly the same as stream-updates (e.g. going through the per-frame staging buffer to avoid the vkQueueWaitIdle()), and when the staging buffer would overflow fall back to the copy-staging system (also for stream-updates). This felt too unpredictable to me, so I didn’t go that way for now.

Note that the staging system is the most likely system to drastically change in the future (together with the barrier system). One of the important planned changes in my mental sokol-gfx roadmap is a rewrite of the resource update API, and this rewrite will most likely ‘favour’ modern 3D APIs and not worry about OpenGL as much as the current very restrictive resource update API does.

The common part in both staging systems is how the actual upload happens:

staging buffers are allocated in CPU-visible + cache-coherent memory (the copy-staging system uses a single small buffer, while the stream-staging system uses double-buffering)
a staging operation first memcpy’s a chunk of memory into the staging buffer and then records a Vulkan command to copy that data from the staging buffer into a Vulkan buffer or image (via vkCmdCopyBuffer() or vkCmdCopyBufferToImage2())
in the stream-staging system each buffer update is always a single call to vkCmdCopyBuffer() and each image update is always one call to vkCmdCopyBufferToImage2() per mipmap
in the copy-staging-system, staging operations which are bigger than the staging buffer size will be split into multiple copy operations, each copy-step involving a vkQueueWaitIdle
overflowing the stream-staging buffer is a ‘soft error’, e.g. an error will be logged but otherwise this is a no-op
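The different overflow behaviour of the two staging systems can be modelled in a few lines. This is only a sketch under assumptions: the function and struct names are made up, sizes are plain byte counts, and the error counter stands in for the logged ‘soft error’.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// NOTE: hypothetical model, the function and struct names are assumptions

// copy-staging: number of copy steps (each one followed by a queue-wait-idle)
// needed to push 'data_size' bytes through a staging buffer of 'staging_size' bytes
size_t copy_staging_num_steps(size_t data_size, size_t staging_size) {
    assert(staging_size > 0);
    return (data_size + (staging_size - 1)) / staging_size;
}

// stream-staging: a per-frame bump allocator over a fixed-size buffer
typedef struct {
    size_t size;
    size_t cursor;
    int num_overflow_errors;
} stream_staging_t;

// returns true and the staging offset if the upload fits into the current
// frame's staging buffer, otherwise counts a 'soft error' and skips the upload
bool stream_staging_alloc(stream_staging_t* st, size_t data_size, size_t* out_offset) {
    if ((st->cursor + data_size) > st->size) {
        st->num_overflow_errors++;  // real code would log an error here
        return false;
    }
    *out_offset = st->cursor;
    st->cursor += data_size;
    return true;
}

// at the start of a new frame the (other) staging buffer starts out empty again
void stream_staging_next_frame(stream_staging_t* st) {
    st->cursor = 0;
}
```

With the default sizes mentioned above, a 10 MB upload through the 4 MB copy-staging buffer would take three copy steps (and three queue flushes), while two 10 MB stream-updates in the same frame would overflow the 16 MB stream-staging buffer on the second update.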

There is another notable implementation detail in the stream-staging system which is related to the barrier system:

All stream-staging copy commands are recorded into a separate Vulkan command buffer object so that they are not interleaved with the compute/render commands which are recorded into the regular per-frame command buffer.

This is done to move any staging commands out of render passes which is pretty much required for barrier management (I don’t quite remember though if the Vulkan validation layer only complained about issuing barriers inside vkBeginRendering/vkEndRendering or if copy commands were also prohibited during the render phase).

Long story short: all Vulkan commands used for staging operations are recorded into a separate command buffer so that all CPU => GPU copies can be moved in front of any compute/render commands because of various Vulkan API usage restrictions. This was necessary because sokol-gfx allows calling the resource update functions at any point in a frame, most importantly within render passes.
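The reordering effect of the separate staging command buffer can be illustrated with a toy model. Everything here is an assumption for illustration (the command enum, the fixed-size arrays and the ‘submit by concatenation’ step); the real backend records Vulkan commands and submits real command buffers.

```c
#include <assert.h>
#include <stddef.h>

// NOTE: hypothetical model of command recording, not the sokol-gfx implementation
typedef enum { CMD_COPY, CMD_BEGIN_PASS, CMD_DRAW, CMD_END_PASS } cmd_t;

#define MAX_CMDS (16)

typedef struct {
    cmd_t cmds[MAX_CMDS];
    size_t num;
} cmd_buffer_t;

void record(cmd_buffer_t* cb, cmd_t cmd) {
    assert(cb->num < MAX_CMDS);
    cb->cmds[cb->num++] = cmd;
}

typedef struct {
    cmd_buffer_t staging;  // all stream-staging copies are recorded here
    cmd_buffer_t render;   // all compute/render commands are recorded here
} frame_t;

// 'submit' by concatenation: staging commands first, then render commands,
// returns the number of commands written to 'out'
size_t submit(const frame_t* frame, cmd_t* out, size_t max_out) {
    size_t n = 0;
    for (size_t i = 0; (i < frame->staging.num) && (n < max_out); i++) {
        out[n++] = frame->staging.cmds[i];
    }
    for (size_t i = 0; (i < frame->render.num) && (n < max_out); i++) {
        out[n++] = frame->render.cmds[i];
    }
    return n;
}
```

Even when an update is recorded in the middle of a pass (between two draws), the copy ends up in front of the pass in the submitted command stream, which is the ‘time warp’ described above.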

The resource barrier system

This was by far the biggest hassle and took a long time to get right, involving several rewrites (and there’s still quite a lot of room for improvement).

The first implementation phase was basically to come up with a general barrier insertion strategy which isn’t completely dumb yet still satisfies the Vulkan default validation layer, the second and much harder step was then to also satisfy the optional synchronization2 validation layer (which even most ‘official’ Vulkan samples don’t seem to get right - go figure).

I won’t bore you with what Vulkan barriers are or why they are necessary, just that barriers are usually needed when a Vulkan buffer or image changes the way it is accessed by the GPU (for instance when a resource changes from being a staging-upload target to being accessed by a shader, or when an image object changes from being used as a pass attachment to being sampled as a texture).

In sokol-gfx I tried as much as possible to use a ‘lazy barrier system’, i.e. a barrier is inserted at the latest possible moment before a resource is used.

The basic idea is that sokol-gfx buffers and images keep track of their current ‘access state’, this may be a combination of:

staging upload target
vertex buffer binding
index buffer binding
read-only storage buffer binding
read-write storage buffer binding
texture binding
storage image binding (always read-write)
a pass attachment (in the flavours color, resolve, depth or stencil)
a special ‘discard’ access modifier for pass attachments (used with SG_LOADACTION_DONTCARE)
swapchain presentation

Implicitly those access states carry additional information which may be needed for picking the right barrier type, like whether shader accesses are read-only, read-write or write-only, and whether the access may happen exclusively in compute passes, render passes, or both.
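One way to picture such a combinable access state is a set of bit flags. The sketch below is hypothetical: only the list of states comes from the text, while the names, bit assignments and the simplified ‘is a barrier needed’ rule (buffers only, ignoring image layout transitions and pipeline stages) are assumptions and not how sokol-gfx necessarily does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// NOTE: hypothetical names and bit assignments
typedef enum {
    ACCESS_NONE               = 0,
    ACCESS_STAGING            = (1<<0),   // staging upload target
    ACCESS_VERTEXBUFFER       = (1<<1),
    ACCESS_INDEXBUFFER        = (1<<2),
    ACCESS_STORAGEBUFFER_RO   = (1<<3),
    ACCESS_STORAGEBUFFER_RW   = (1<<4),
    ACCESS_TEXTURE            = (1<<5),
    ACCESS_STORAGEIMAGE       = (1<<6),   // always read-write
    ACCESS_COLOR_ATTACHMENT   = (1<<7),
    ACCESS_RESOLVE_ATTACHMENT = (1<<8),
    ACCESS_DEPTH_ATTACHMENT   = (1<<9),
    ACCESS_STENCIL_ATTACHMENT = (1<<10),
    ACCESS_DISCARD            = (1<<11),  // modifier for pass attachments
    ACCESS_PRESENT            = (1<<12),
} access_bits_t;

typedef uint32_t access_t;

// access states which imply that the resource content is written
#define ACCESS_WRITE_MASK (ACCESS_STAGING | ACCESS_STORAGEBUFFER_RW | ACCESS_STORAGEIMAGE | \
    ACCESS_COLOR_ATTACHMENT | ACCESS_RESOLVE_ATTACHMENT | \
    ACCESS_DEPTH_ATTACHMENT | ACCESS_STENCIL_ATTACHMENT)

bool access_is_write(access_t a) {
    return 0 != (a & ACCESS_WRITE_MASK);
}

// simplified rule for buffers (an assumption): read-after-read needs no barrier,
// anything involving a write on either side does
bool buffer_needs_barrier(access_t old_state, access_t new_state) {
    return access_is_write(old_state) || access_is_write(new_state);
}
```

For example, a buffer going from staging upload target to vertex buffer binding needs a barrier under this rule, while a buffer that is bound as vertex buffer and then additionally as read-only storage buffer does not.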

Ideally barriers would always be inserted right at the point before a resource is bound (because only at that point it’s clear what the new access state is).

Unfortunately it’s not that simple: there’s a metric shitton of arbitrary restrictions in Vulkan on where exactly barriers may be inserted. The main limitation is that no barriers can be inserted between vkBeginRendering and vkEndRendering (which is hella weird: it would be obvious to disallow barriers that involve the current pass attachments, but not barriers for any other resources used in the pass).

This limitation is currently the main reason why the sokol-gfx barrier system is not optimal in some cases, because it requires moving any barriers that would be inserted inside render passes to before the start of the render pass. However sokol-gfx can’t predict what resources will actually be used in the render pass (spoiler: there’s a surprisingly simple solution to this problem which I should have thought of myself much earlier - but that will be for a later Vulkan backend update).

Currently, barrier insertion points are in the following sokol-gfx functions:

sg_begin_pass()
sg_apply_bindings()
sg_end_pass()
sg_update/append_*()

The obvious barriers in begin- and end-pass are for image objects transitioning in and out of attachment state.

In sg_apply_bindings() barriers are only inserted inside compute passes (because of the above mentioned ‘no barriers inside render passes’ rule).

In staging operations, barriers are issued at the start and end of the staging operation, the ‘after-barrier’ is not great and eventually needs to be moved elsewhere.

Now the tricky part: moving barriers out of render passes… there is one situation where this is relevant: a compute pass writes to a buffer or image, and that buffer or image is then read by a shader in a render pass. Ideally the barrier for this would happen inside the render pass in sg_apply_bindings(), but Vulkan validation layer says “no”.

What happens instead is that any resource that’s (potentially) written in a compute pass is tracked as ‘dirty’, and then in the sg_end_pass() of the compute pass, very conservative barriers are inserted for all those dirty resources. ‘Conservative’ means that I cannot predict how the resource will be used next, so buffers are generally transitioned into ‘vertex+index+storage-buffer access state’ and images are generally transitioned into ‘texture access state’.

This generally appears to work but is not optimal. We’d like to delay those barriers to when the resources are actually used, and also tighten the scope of the barriers to their actual usage.

The solution for this is surprisingly simple: use the same ‘time warp’ that is used for recording staging operations by recording barrier commands that would need to be issued from within sokol-gfx render passes into a separate command buffer which can then be enqueued before another command buffer which holds all render/compute commands for the pass.

This is a perfect solution but requires a couple of changes which I didn’t want to do in the first Vulkan backend release to not push that out even further:

instead of a single command buffer per frame to hold all render/compute commands, one command buffer per sokol-gfx pass is needed
for render passes, a separate command buffer per pass is needed to record barrier commands so that the barriers can be moved out of Vulkan’s vkBeginRendering/vkEndRendering

…inside sg_apply_bindings() and sg_end_pass() we’re now doing some serious time-travelling-shit:

Each resource that’s used in a render pass will keep track of all the ‘access states’ it’s used as in the sg_apply_bindings call (for buffers that may be vertex-, index- or read-only-storage-buffer-binding and for images it can only be texture-binding), additionally the resource is uniquely-added to a tracking array.

In sg_end_pass() we now have a list of all bound resources and their binding types, and this information can be used to record ‘just the right’ barriers into the separate command buffer that’s been set aside for render pass barriers. This barrier command buffer is then enqueued before the command buffer which holds the render commands for that pass and voila: perfectly scoped render pass barriers. But as I said, this will need to wait until a followup update.
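The planned tracking scheme could look roughly like the following sketch. Since this part isn’t implemented yet, everything here is an assumption: the names, the flag values, the fixed-size tracking array and the ‘barrier list’ which stands in for the separate barrier command buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// NOTE: hypothetical model of the planned approach, all names are assumptions
#define ACCESS_VERTEXBUFFER     (1u<<0)
#define ACCESS_INDEXBUFFER      (1u<<1)
#define ACCESS_STORAGEBUFFER_RO (1u<<2)
#define ACCESS_TEXTURE          (1u<<3)
#define MAX_TRACKED (64)

typedef struct {
    uint32_t id;
    uint32_t pass_access;  // access states accumulated in the current render pass
} resource_t;

typedef struct {
    resource_t* items[MAX_TRACKED];
    size_t num;
} pass_tracker_t;

// called from the (simulated) apply-bindings for each bound resource
void track_binding(pass_tracker_t* t, resource_t* res, uint32_t access) {
    assert(access != 0);
    if (0 == res->pass_access) {
        // first use in this pass: unique-add to the tracking array
        assert(t->num < MAX_TRACKED);
        t->items[t->num++] = res;
    }
    res->pass_access |= access;
}

typedef struct {
    uint32_t id;
    uint32_t access;
} barrier_t;

// called from the (simulated) end-pass: write one barrier per tracked resource,
// scoped to the accesses that actually happened, then reset the tracker
size_t emit_pass_barriers(pass_tracker_t* t, barrier_t* out, size_t max_out) {
    size_t n = 0;
    for (size_t i = 0; i < t->num; i++) {
        if (n < max_out) {
            out[n].id = t->items[i]->id;
            out[n].access = t->items[i]->pass_access;
            n++;
        }
        t->items[i]->pass_access = 0;
    }
    t->num = 0;
    return n;
}
```

A buffer that was bound both as vertex buffer and as read-only storage buffer during the pass ends up with exactly one barrier covering those two access types, instead of the current catch-all ‘vertex+index+storage-buffer’ transition.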

Everything else…

The rest of the Vulkan backend is so straightforward that it’s not worth writing about, essentially 1:1 mappings from sokol-gfx API functions to Vulkan API functions (the blog post is long enough as it is).

Apart from the resource update system (which is overly restrictive and conservative in sokol-gfx, mainly because of OpenGL/WebGL), the sokol-gfx API actually is a really good match for Vulkan. There are no expensive operations (like creating and discarding Vulkan objects) happening in the ‘hot-path’. The use of EXT_descriptor_buffer is not a great choice for some GPU architectures, but as I said at the start: I’m waiting for Khronos to finish their new resource binding API which apparently will be a mix of D3D12-style descriptor heaps and EXT_descriptor_buffer.

The next steps will most likely be:

port the backend to Windows (still limited to Intel GPU though)
port the backend to NVIDIA (will have to wait until around January because I’ll be away from my NVIDIA PC for the rest of the year)
expose a GPU memory allocator interface, and add a sample which hooks up VMA
…maaaybe integrate SebAaltonen’s OffsetAllocator as default allocator (still not clear if I need that when all modern Vulkan drivers no longer seem to have that infamous 4096 unique allocations limit)
tinker around with GPU memory heap types for uniform- and descriptor-buffers on GPUs without unified memory (e.g. host-visible + device-local)
figure out why exactly RenderDoc doesn’t work (apparently it’s because of EXT_descriptor_buffer, but RenderDoc claims to support the extension since 1.41)
add support for debug labels (not much point to implement this before RenderDoc works)
implement the improved resource barrier system outlined above
add support for multiple swapchain passes (not needed when used with sokol_app.h, but required for any ‘multi-window-scenario’)
improve interoperability with Vulkan code that exists outside sokol-gfx (injecting Vulkan buffers and images into sg_make_buffer/sg_make_image and add the missing sg_vk_query_*() functions to expose internal Vulkan object handles)

Originally I also had a long rant about the Vulkan API design in this blog post, maybe I’ll put that into a separate post and also change the style from rant into ‘constructive criticism’ (as hard as that will be lol).

My verdict about Vulkan so far is basically: Not great, not terrible.

It’s better than OpenGL but not as good (from an API user’s perspective) as pretty much any other 3D API. In many places Vulkan is already the same mess as OpenGL: sediment layers of outdated, deprecated or competing features and extensions which are incredibly hard to make sense of when not closely following Vulkan’s development since its initial release in 2016 (which is the exact same problem that ruined OpenGL).

At the very least, please, please, PLEASE aggressively remove cruft and reduce the ‘optional-features creep’ in minor Vulkan API versions (which I think should actually be major versions - 4 breaking versions in 10 years sounds just about right).

For instance when I’m working against the Vulkan 1.3 API I really don’t care about any legacy features which have been replaced by newer systems (like synchronization2 replacing the old synchronization API). Don’t expose the extensions that have been incorporated into core up to 1.3, and also let me filter out all those outdated declarations from the Vulkan headers so that code-completion doesn’t suggest outdated API types and functions. Don’t require me to explicitly enable every little feature (like anisotropic filtering) when creating a Vulkan device. If some shitty old-school GPU doesn’t have anisotropic filtering, then just silently ignore it instead of polluting the 3D API for all eternity just for this one GPU model which probably wasn’t even produced anymore even back in 2016.

Vulkan profiles are a good idea in theory, but please move them into the core API instead of implementing them as a Vulkan SDK feature. Give me a vkCreateSystemDefaultDevice(VK_PROFILE_*) function to get rid of those 500 lines of boilerplate that every single Vulkan programmer needs to duplicate line by line (people who need more control over the setup process can still use that traditional initialization dance).

And PLEASE get somebody into Khronos who has the power to inject at least a minimal amount of taste and elegance into Vulkan and who has a clear idea what should and shouldn’t go into the core API, because just promoting random vendor extensions into core is really not a good way to build an API (and that was clear since OpenGL - and the one thing that Vulkan should have done better).

Also, a low-level and explicit API DOES NOT HAVE TO BE a hassle to use.

Somehow modern software systems always seem to be built around the ‘no pain, no gain’ philosophy (see Rust, Vulkan, Wayland, …), this sort of self-inflicted suffering for the sake of purity is such a weird Christian flex that I’m starting to wonder if ‘religious memes’ surviving under the surface in even the most rational and atheist developer brains is actually a thing…

Maybe we should return to the ‘Californian hippie attitude’ for building computer systems and software - apparently that had worked pretty great in the 70’s and 80’s ;)

…ok I’m getting into old-man-yells-at-cloud-mode again, so I’ll better stop here :D

The sokol-gfx resource view update.

Sun, 17 Aug 2025 00:00:00 +0000

Update: merge happened on 23-Aug-2025.

In a couple of days I will merge the next big (and breaking) sokol-gfx update which adds resource view objects and in turn removes pre-baked pass-attachment objects.

The update also requires updating sokol-shdc and recompiling shaders.

The root PR is here: https://github.com/floooh/sokol/pull/1287

After merging the update I will spend a couple of weeks to take care of pending issues and PRs before moving on to a followup resource views update 2.

What are resource view objects?

If you’re familiar with D3D10 and later you’ll feel right at home since resource views are a fundamental concept in D3D, and sokol-gfx’s concept of resource views is closest to D3D11. Other 3D APIs either don’t have view objects at all (WebGL2 and GL before version 4.3), or only associate resource views with texture data but not buffer data (GL >= 4.3, Metal and WebGPU).

Typically resource views have a number of different purposes in the various 3D-APIs:

they specialize a parent resource object for a specific usage in shaders (for instance sampling an image object as a texture versus using the same image object as render target)
they can reinterpret the data in a resource object (for instance to a different pixel format or image type)
they can define a subset of the data in the resource object (for instance selecting a specific mipmap or range of mipmaps in a texture)

In sokol-gfx you can think of view objects mainly as specializations of an sg_image or sg_buffer object for how the image or buffer is going to be accessed in shaders:

sampling a texture in a shader requires a texture view
writing to a storage image in a compute shader requires a storage image view
accessing a storage buffer in a shader requires a storage buffer view
each render pass attachment type requires its own view object type:
- color-attachment views
- resolve-attachment views
- depth-stencil-attachment views

Alternatively you can think of view objects as specializations of a resource object for a specific bindings type (I was actually considering calling this new object type sg_binding, but since ‘view’ is the more established term I went with sg_view instead).

In sokol-gfx, resource view types are ‘runtime flavours’ of the same handle type sg_view. This means that setting the wrong resource type on a bindslot won’t be a compilation error, but a runtime error in the sokol-gfx validation layer, so please make sure to test your code in debug build mode from time to time.
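Why a mismatch can only be caught at runtime becomes clearer with a small model: a single handle type whose flavour is a runtime value, plus a check against what the shader expects on each bindslot. The types and the check below are hypothetical and much simpler than the actual sokol-gfx validation layer.

```c
#include <assert.h>
#include <stdbool.h>

// NOTE: hypothetical model, sokol-gfx's internal types and checks look different
typedef enum {
    VIEWTYPE_INVALID,
    VIEWTYPE_TEXTURE,
    VIEWTYPE_STORAGEBUFFER,
    VIEWTYPE_STORAGEIMAGE,
    VIEWTYPE_COLORATTACHMENT,
    VIEWTYPE_RESOLVEATTACHMENT,
    VIEWTYPE_DEPTHSTENCILATTACHMENT,
} view_type_t;

// one handle type for all view flavours, the flavour is only known at runtime
typedef struct {
    view_type_t type;
} view_t;

#define MAX_VIEW_BINDSLOTS (4)

// check that each bound view matches what the shader expects on that slot
// (VIEWTYPE_INVALID in 'expected' means the shader doesn't use the slot)
bool validate_view_bindings(const view_type_t expected[MAX_VIEW_BINDSLOTS], const view_t views[MAX_VIEW_BINDSLOTS]) {
    for (int i = 0; i < MAX_VIEW_BINDSLOTS; i++) {
        if ((expected[i] != VIEWTYPE_INVALID) && (views[i].type != expected[i])) {
            return false;
        }
    }
    return true;
}
```

Since both the texture view and the storage image view in this model are plain `view_t` values, the C compiler has no way to reject putting one in the other’s slot. Only the runtime check can.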

New unlocked features

This first sokol-gfx resource view update unlocks the following features:

Storage buffer bindings can now have an offset. Binding storage buffers with offsets is mainly useful when the same buffer contains different types of items in different sections of the buffer, and processing those items in separate compute shaders - or if you only need to access a section of a buffer with a compute shader.
Texture views can define a subset of the parent image by defining their own mipmap- and slice-ranges (not on WebGL, GLES3 or GL4.1 - e.g. macOS)
Storage images are no longer ‘compute pass attachments’, but instead bound like regular textures in the sg_apply_bindings() call. This allows writing to many different storage images in the same compute pass (the number of simultaneously bound storage images is still very restricted though)
Combinations of render pass attachment images are no longer ‘pre-baked’ into sg_attachments objects. Instead sg_attachments is now a transient struct like sg_bindings. This relaxes another ‘combinatorial explosion scenario’ because rendering code no longer needs to predict all possible render-pass attachment combinations upfront.
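For the storage-buffer-offset feature, the CPU side mostly boils down to computing where each section of the shared buffer starts. Here is a minimal sketch for a buffer holding two item types. The struct names are made up and the 256-byte offset alignment is an assumption (the actual requirement depends on backend and device).

```c
#include <assert.h>
#include <stddef.h>

// NOTE: hypothetical helper; the alignment value is an assumption
#define STORAGEBUFFER_OFFSET_ALIGN (256)

typedef struct { float pos[4]; float vel[4]; } particle_t;  // 32 bytes
typedef struct { float color[4]; } material_t;              // 16 bytes

typedef struct {
    size_t particles_offset;
    size_t materials_offset;
    size_t total_size;
} buffer_sections_t;

static size_t align_up(size_t val, size_t align) {
    return (val + (align - 1)) & ~(align - 1);
}

// place the particle section at the start and the material section at the
// next aligned offset behind it
buffer_sections_t layout_sections(size_t num_particles, size_t num_materials) {
    buffer_sections_t s;
    s.particles_offset = 0;
    s.materials_offset = align_up(num_particles * sizeof(particle_t), STORAGEBUFFER_OFFSET_ALIGN);
    s.total_size = s.materials_offset + num_materials * sizeof(material_t);
    return s;
}
```

Each section would then get its own storage-buffer view on the same parent buffer, with the section’s start passed as the view’s offset in sg_make_view().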

Current restrictions and planned features

The following resource view features are planned for a followup ‘resource view update 2’:

Reinterpret the pixel format and image type of image objects in a view object.
Change the max number of per-shader-stage resource bindings of the same type from hardwired conservative limits to dynamic device limits exposed in the sg_limits struct (e.g. more than 4 storage image, 8 storage buffer or 16 texture bindings - instead try to push those limits closer to 32)

For more details about planned ‘update 2’ features see:

https://github.com/floooh/sokol/issues/1302

High level overview of public API changes

the sg_attachments object type and related functions have been removed
a new object type sg_view has been added along with related functions
sg_features gained a new flag .gl_texture_views, when this is false the GL backend doesn’t have full texture view support (e.g. it’s not possible to limit a view to a miplevel or slices subset)

the sg_attachments name has been repurposed for a transient struct of render pass attachment views:

  typedef struct sg_attachments {
      sg_view colors[SG_MAX_COLOR_ATTACHMENTS];
      sg_view resolves[SG_MAX_COLOR_ATTACHMENTS];
      sg_view depth_stencil;
  } sg_attachments;

the sg_bindings struct now has a unified array for views instead of separate arrays for each ‘shader resource type’ (textures, storage images and storage buffers):
```
  typedef struct sg_bindings {
      // ...
      sg_view views[SG_MAX_VIEW_BINDSLOTS];
      // ...
  } sg_bindings;
```

the sg_image_usage struct now has more detailed usage flags for render pass attachments, and the .storage_attachment usage flag has been renamed to .storage_image:

  typedef struct sg_image_usage {
      bool storage_image;
      bool color_attachment;
      bool resolve_attachment;
      bool depth_stencil_attachment;
      // ...
  } sg_image_usage;

in sg_image_desc the items to directly inject backend-specific view objects have been removed:
- d3d11_shader_resource_view
- wgpu_texture_view
in sg_shader_desc:
- the internals of the sg_shader_desc struct which describe the shader binding interface have been changed to a unified array of sg_shader_view structs:
```
  typedef struct sg_shader_desc {
      // ...
      sg_shader_view views[SG_MAX_VIEW_BINDSLOTS];
      // ...
  } sg_shader_desc;
```
- some renaming to better differentiate between ‘(storage) image and texture bindings’, for instance ‘image-sampler-pairs’ are now called ‘texture-sampler-pairs’, since only texture bindings are ‘sampled’, but not storage-image bindings
many new items in the sg_frame_stats struct, mostly not directly related to resource views, but filling some gaps

Shader Authoring Changes

TL;DR: When recompiling existing shaders you might get new errors about bindslot collisions which need to be resolved by changing the layout(binding=N) decorations.

When using sokol-shdc, the only change on the shader side is that textures, storage buffers and storage images now share a common bindslot range, previously each binding type had its own slot range:

@cs cs
layout(binding=0) uniform texture2D cs_inp_tex;
layout(binding=0, rgba8) uniform writeonly image2D cs_outp_tex;
// ...
@end

Note how in this (old) code-snippet the texture- and storage-image bindings use the same bindslot 0 because previously textures and storage images had their own bindslot space.

This code will now produce a ‘bindslot collision error’ when compiled with sokol-shdc, because texture- and storage-image bindings now use the same bindslot space, so bindings for texture-, storage-buffer- and storage-image-bindings across all shader stages need to be fixed to not collide:

@cs cs
layout(binding=0) uniform texture2D cs_inp_tex;
layout(binding=1, rgba8) uniform writeonly image2D cs_outp_tex;
// ...
@end

This bindslot fixup is the only change required on the shader side.
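The collision rule itself is simple enough to model in a few lines. The following is a hypothetical sketch and not how sokol-shdc implements the check: it assumes each resource of a shader is listed exactly once, and the binding type is carried along only to show that it no longer matters for the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// NOTE: hypothetical model, sokol-shdc's actual implementation is different
typedef enum { BIND_TEXTURE, BIND_STORAGEBUFFER, BIND_STORAGEIMAGE } bind_type_t;

typedef struct {
    const char* name;
    bind_type_t type;  // not used by the check: all types share one slot space
    int slot;
} binding_t;

// returns true if any two bindings share a slot in the common bindslot space
bool has_bindslot_collision(const binding_t* bindings, size_t num) {
    for (size_t i = 0; i < num; i++) {
        for (size_t j = i + 1; j < num; j++) {
            if (bindings[i].slot == bindings[j].slot) {
                return true;
            }
        }
    }
    return false;
}
```

Fed with the two snippets above, the old bindings (texture and storage image both on slot 0) are reported as a collision, while the fixed-up bindings (slots 0 and 1) pass.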

Working with Texture Views

Sample code:

texcube-sapp (simple textured rendering): C code, GLSL code, WebGPU sample
dyntex-sapp (CPU-update dynamic texture): C code, GLSL code, WebGPU sample

Let’s say a shader defines a texture binding at slot 3:

layout(binding=3) uniform texture2D tex;

To ‘populate’ this bindslot on the CPU side you need two objects now: an image object, and a texture view on the image object:

sg_image img = sg_make_image(&(sg_image_desc){
    .width = 4,
    .height = 4,
    .data.subimage[0][0] = ...,
});
sg_view tex_view = sg_make_view(&(sg_view_desc){
    .texture = { .image = img },
});

Since this is C you can also chain the designated initializers which looks a bit more compact (unfortunately this isn’t supported in most other languages):

sg_view tex_view = sg_make_view(&(sg_view_desc){ .texture.image = img });

The sg_apply_bindings() call now has an array of sg_view handles instead of separate arrays for images and storage buffers:

sg_apply_bindings(&(sg_bindings){
    .vertex_buffers[0] = ...,
    .views[VIEW_tex] = tex_view,
    .samplers[SMP_smp] = ...,
});

Since the texture binding was defined as layout(binding=3) it’s also safe to just use the bind slot index directly instead of the code-generated constant:

sg_apply_bindings(&(sg_bindings){
    .vertex_buffers[0] = ...,
    .views[3] = tex_view,
    .samplers[SMP_smp] = ...,
});

In many situations you only need the view handle and don’t need the separate image handle, this means you can nest the sg_make_image() inside the sg_make_view() call:

sg_view tex_view = sg_make_view(&(sg_view_desc){
    .texture.image = sg_make_image(&(sg_image_desc){
        .width = 4,
        .height = 4,
        .data.subimage[0][0] = ...,
    }),
});

If you need the image handle later you can extract it from the view object via sg_query_view_image():

sg_image img = sg_query_view_image(tex_view);

Texture views can select a subrange of mipmaps and slices of their parent image (not supported on WebGL2, GLES3 or GL4.1):

sg_view tex_view = sg_make_view(&(sg_view_desc){
    .texture = {
        .image = img,
        .mip_levels = { .base = 1, .count = 3 },
        .slices = { .base = 5, .count = 2 },
    },
});

If .count is left at default-zero it means ‘all remaining mipmaps or slices’. For instance this will only skip the most detailed mipmap but keep the remaining mipmap chain in place:

sg_view tex_view = sg_make_view(&(sg_view_desc){
    .texture = {
        .image = img,
        .mip_levels = { .base = 1 },
    },
});
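To make the 'default-zero means all remaining' rule concrete, here is a minimal sketch in plain C. The view_range struct and the resolve_range() helper are hypothetical stand-ins and not part of sokol_gfx.h. They only model the rule described above, under the assumption that the base index lies inside the parent image:

```c
#include <assert.h>

// hypothetical stand-in for a { .base, .count } range in a view desc
typedef struct { int base; int count; } view_range;

// Resolve a range against the parent's total number of mip levels
// or slices: a zero count selects everything from base to the end.
// Assumes 0 <= base < total.
static view_range resolve_range(view_range r, int total) {
    assert((r.base >= 0) && (r.base < total));
    if (r.count == 0) {
        r.count = total - r.base;
    }
    assert((r.base + r.count) <= total);
    return r;
}
```

For an image with 5 mip levels, resolve_range((view_range){ .base = 1 }, 5) yields a count of 4, i.e. only the most detailed mipmap is skipped.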

View vs parent resource lifetime considerations

Before moving on to the other view types, a little interlude about lifetimes and resource states:

If you’re coming from 3D APIs with ref-counted lifetime management like D3D, WebGPU or Metal you might be tempted to ‘release’ a view’s parent resource object right after creating its view object if the image object handle isn’t needed anymore:

sg_image img = sg_make_image(&(sg_image_desc){
    .width = 4,
    .height = 4,
    .data.subimage[0][0] = ...,
});
sg_view tex_view = sg_make_view(&(sg_view_desc){ .texture.image = img });
sg_destroy_image(img);

In sokol-gfx lifetimes are explicit. If you pull the rug out from under a view like this, nothing catastrophic will happen (e.g. no crashes or hard validation layer errors), but rendering operations involving such ‘dangling views’ will be silently skipped (this is basically the same behavior as before when trying to render with images or buffers in a non-valid resource state).

Another slightly counter-intuitive behavior might be that a view object remains in valid resource state despite its parent resource being destroyed, e.g. following the above example code:

// get the destroyed image's resource state
if (sg_query_image_state(img) == SG_RESOURCESTATE_INVALID) {
    // if-branch taken, since the image had been destroyed
    // ...
}
// get the image's texture view resource state
if (sg_query_view_state(tex_view) == SG_RESOURCESTATE_VALID) {
    // if-branch *also* taken!
    // ...
}

I went a bit back and forth on this decision but I think the behavior makes sense from the perspective that all resource state changes in sokol-gfx are explicit (e.g. there are no ‘automatic’ state changes as a side effect of a ‘remote’ state change of another object, instead all resource state changes are directly caused by a function call on that resource object). The same has always been true for pipelines and their shader object, just not specifically documented.

If you want to check whether a view is ‘renderable’ you can use the following shortcut:

if (sg_query_image_state(sg_query_view_image(tex_view)) == SG_RESOURCESTATE_VALID) {
    // the view is 'renderable'
}

// or for storage buffer views:
if (sg_query_buffer_state(sg_query_view_buffer(sbuf_view)) == SG_RESOURCESTATE_VALID) {
    // the view is 'renderable'
}

This works because no matter what state the view object is in (or whether it even exists), sg_query_view_image() will return either an image handle or the invalid handle, and both can be passed into sg_query_image_state(). An invalid image handle will return SG_RESOURCESTATE_INVALID while a valid image handle will return the actual SG_RESOURCESTATE_* of the image object.
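The behavior described above can be illustrated with a small toy model. This is NOT sokol_gfx.h code: the handle types, the state tables and all function names are hypothetical, and the model only mimics the documented rules (a view stays valid when its parent is destroyed, and querying the parent of a non-existing view returns the invalid handle):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// toy model, not sokol_gfx.h: handles are small ids, 0 is the invalid handle
typedef enum { RES_INVALID = 0, RES_VALID } res_state;
typedef struct { uint32_t id; } image_h;
typedef struct { uint32_t id; } view_h;

#define MAX_OBJS 8
static res_state image_states[MAX_OBJS];  // slot 0 always stays RES_INVALID
static res_state view_states[MAX_OBJS];
static image_h view_parents[MAX_OBJS];

static image_h make_image(uint32_t slot) {
    assert((slot > 0) && (slot < MAX_OBJS));
    image_states[slot] = RES_VALID;
    return (image_h){ slot };
}
static view_h make_view(uint32_t slot, image_h parent) {
    assert((slot > 0) && (slot < MAX_OBJS));
    view_states[slot] = RES_VALID;
    view_parents[slot] = parent;
    return (view_h){ slot };
}
// destroying an image does not touch the state of its views
static void destroy_image(image_h img) {
    if ((img.id > 0) && (img.id < MAX_OBJS)) {
        image_states[img.id] = RES_INVALID;
    }
}
static res_state query_image_state(image_h img) {
    return (img.id < MAX_OBJS) ? image_states[img.id] : RES_INVALID;
}
// returns the invalid handle if the view doesn't exist
static image_h query_view_image(view_h v) {
    if ((v.id < MAX_OBJS) && (view_states[v.id] == RES_VALID)) {
        return view_parents[v.id];
    }
    return (image_h){ 0 };
}
// the 'renderable' shortcut from the text
static bool view_is_renderable(view_h v) {
    return query_image_state(query_view_image(v)) == RES_VALID;
}
```

In this model, a view whose parent image has been destroyed still reports RES_VALID for itself, but view_is_renderable() returns false, which matches the two if-branches shown earlier.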

Tracking uninit => init cycles

If the parent resource goes through a ‘destroy => make’ or ‘uninit => init’ cycle, all views which had been created from this parent resource must also be re-initialized, otherwise rendering operations involving such ‘dangling views’ will silently be skipped.

A common pattern for this situation is to use the ‘uninit => init’ calls instead of ‘destroy => make’ because the handles will remain valid (e.g. you don’t need to distribute new object handles into all corners of your code base):

// first uninit/init the parent image with new params:
sg_uninit_image(img);
sg_init_image(img, &(sg_image_desc){ ... });
// then 'cycle' the image's view objects
sg_uninit_view(tex_view);
sg_init_view(tex_view, &(sg_view_desc){ .texture.image = img });

I was at first considering to add a ‘managed mode’ for views which would track the state of their parent resource and automatically go through an uninit/init cycle when needed, but this just didn’t fit into the sokol philosophy of explicit lifetimes and resource states, and having this one special case for view objects caused more confusion which wasn’t worth the small gain in convenience (this decision also wasn’t purely based on gut feeling since I actually had implemented the ‘managed mode’ already but then kicked it out again after actually starting to port the sokol sample code over - it just didn’t ‘feel right’).

When porting existing code over to resource view objects, don’t forget that you need to destroy at least two objects now for complete cleanup (views and their parent resource).

The order in which you destroy the views and parent resources doesn’t matter, this:

sg_destroy_view(view);
sg_destroy_image(img);

…works just as well as this:

sg_destroy_image(img);
sg_destroy_view(view);

BUT BE AWARE OF THIS TRAP:

sg_destroy_view(view);
sg_destroy_image(sg_query_view_image(view));

Since the view is already destroyed, sg_query_view_image() will return the invalid handle, and passing the invalid handle into sg_destroy_image() is a silent no-op (e.g. your image will leak).

…this is actually a nice example of how convenience in one situation (calling sg_query_view_image(view) and sg_destroy_image() with an invalid handle being a silent no-op) can cause trouble in other situations. I’ll need to think about whether this should at least be logged as an error instead.
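One way to sidestep the trap is to look up the parent image before the view is destroyed. The following sketch demonstrates both the leak and the fix in a deliberately tiny toy model (one image slot, one view slot). None of these names are sokol_gfx.h functions, they are hypothetical stand-ins that only mimic the behavior described above:

```c
#include <assert.h>
#include <stdint.h>

// toy model, not sokol_gfx.h: a single image and a single view, id 0 = invalid
typedef struct { uint32_t id; } image_h;
typedef struct { uint32_t id; } view_h;

static int live_images;       // number of images which haven't been destroyed
static uint32_t view_parent;  // parent image id of view 1, 0 once the view is gone

static image_h make_image(void) { live_images++; return (image_h){ 1 }; }
static view_h make_view(image_h img) { view_parent = img.id; return (view_h){ 1 }; }
static void destroy_view(view_h v) { if (v.id == 1) { view_parent = 0; } }
static image_h query_view_image(view_h v) {
    return (image_h){ (v.id == 1) ? view_parent : 0 };
}
static void destroy_image(image_h img) {
    if (img.id == 0) {
        return;  // silent no-op for the invalid handle
    }
    live_images--;
}

// safe variant: fetch the parent handle *before* the view goes away
static void destroy_view_and_image(view_h v) {
    const image_h img = query_view_image(v);
    destroy_view(v);
    destroy_image(img);
}
```

Calling destroy_view() followed by destroy_image(query_view_image(view)) leaves live_images at 1 in this model, while destroy_view_and_image() brings it back to 0.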

Working with render pass attachment views

Sample code:

offscreen-sapp (simple offscreen rendering): C code, GLSL code, WebGPU sample
offscreen-msaa-sapp (multi-sampled offscreen rendering): C code, GLSL code, WebGPU sample
mrt-sapp (multiple-render-target, multi-sampled offscreen rendering): C code, GLSL code, WebGPU sample
mrt-pixelformats-sapp (multiple render target rendering with different pixel formats): C code, GLSL code, WebGPU sample
shadows-sapp (shadow-mapping with regular shadow map texture): C code, GLSL code, WebGPU sample
shadows-depthtex-sapp (shadow-mapping with a depth-buffer texture): C code, GLSL code, WebGPU sample
miprender-sapp (render into mipmaps): C code, GLSL code, WebGPU sample
layerrender-sapp (render into array slice): C code, GLSL code, WebGPU sample

In the previous sokol-gfx version, when doing offscreen rendering into an image object a ‘pre-baked’ attachments object had to be created which was then passed into sg_begin_pass():

E.g. old code:

// create a color and depth-buffer image for offscreen rendering
sg_image color_img = sg_make_image(&(sg_image_desc){
    .usage = { .render_attachment = true },
    // ...
});
sg_image depth_img = sg_make_image(&(sg_image_desc){
    .usage = { .render_attachment = true },
    // ...
});

// create an attachments object from those images...
sg_attachments atts = sg_make_attachments(&(sg_attachments_desc){
    .colors[0].image = color_img,
    .depth_stencil.image = depth_img,
});

// ... in the render loop for the offscreen render pass:
sg_begin_pass(&(sg_pass){ .attachments = atts });
// ...
sg_end_pass();

// ... and in the swapchain pass, bind the color image as texture:
sg_apply_bindings(&(sg_bindings){
    // ...
    .images[TEX_tex] = color_img,
    // ...
});

Now, instead of creating a pre-baked attachments object, separate ‘attachment-view’ objects are created upfront, but their combined use for rendering is no longer pre-baked but defined on-the-fly in the sg_begin_pass() call, much like bindings in the sg_apply_bindings() call:

// create color- and depth-buffer images
// NOTE the more detailed usage flags
sg_image color_img = sg_make_image(&(sg_image_desc){
    .usage = { .color_attachment = true },
    // ...
});
sg_image depth_img = sg_make_image(&(sg_image_desc){
    .usage = { .depth_stencil_attachment = true },
    // ...
});

// create color- and depth-stencil attachment views
sg_view color_att_view = sg_make_view(&(sg_view_desc){
    .color_attachment.image = color_img,
});
sg_view depth_att_view = sg_make_view(&(sg_view_desc){
    .depth_stencil_attachment.image = depth_img,
});

// since the color-attachment image is also sampled as texture,
// we'll also need a texture view:
sg_view color_tex_view = sg_make_view(&(sg_view_desc){
    .texture.image = color_img,
});

// later in the offscreen render pass, the attachment views
// are passed directly into sg_begin_pass:
sg_begin_pass(&(sg_pass){
    .attachments = {
        .colors[0] = color_att_view,
        .depth_stencil = depth_att_view,
    },
});
// ...
sg_end_pass();

// and in the swapchain pass, the texture view is bound
// to sample the offscreen-rendered image as texture:
sg_apply_bindings(&(sg_bindings){
    // ...
    .views[VIEW_tex] = color_tex_view,
    // ...
});

Working with storage image views

Samples:

write-storageimage-sapp (write into storage image with compute shader): C code, GLSL code, WebGPU sample
imageblur-sapp (image blurring with compute shaders): C code, GLSL code, WebGPU sample

Storage image bindings are no longer defined as compute-pass attachments in sg_begin_pass(), but instead like regular texture- or storage-buffer-bindings in sg_apply_bindings().

// first create an image object with storage-image usage:
sg_image img = sg_make_image(&(sg_image_desc){
    .usage = { .storage_image = true },
    // ...
});

// to write to the image with a compute shader, a storage image view is needed:
sg_view simg_view = sg_make_view(&(sg_view_desc){
    .storage_image = {
        .image = img,
        .mip_level = ...,   // optional: select a specific miplevel
        .slice = ...,       // optional: select a specific slice
    },
});

// ...and to sample that same image as a texture for rendering, a texture view is needed:
sg_view tex_view = sg_make_view(&(sg_view_desc){
    .texture.image = img,
});

// storage image views are now applied as regular bindings in a compute pass:
sg_begin_pass(&(sg_pass){ .compute = true });
// ...
sg_apply_bindings(&(sg_bindings){
    .views[VIEW_simg] = simg_view,
});
sg_dispatch(...);
sg_end_pass();

// and to use the compute-shader-updated image as a texture in a render pass,
// bind the texture view as usual:
sg_begin_pass(...);
// ...
sg_apply_bindings(&(sg_bindings){
    // ...
    .views[VIEW_tex] = tex_view,
    .samplers[SMP_smp] = smp,
});
sg_draw(...);
sg_end_pass();

Working with storage buffer views

Samples:

vertexpull-sapp (vertex pulling from storage buffer): C code, GLSL code, WebGPU sample
sbuftex-sapp (access storage buffer in fragment shader): C code, GLSL code, WebGPU sample
instancing-compute-sapp (update instancing data with compute shader): C code, GLSL code, WebGPU sample
sbufoffset-sapp (demonstrate storage buffer bindings with offset): C code, GLSL code, WebGPU sample

To bind a buffer object as storage buffer for vertex-pulling or compute-shader access you now need a storage-buffer-view object:

// create a buffer with storage-buffer usage:
sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = { .storage_buffer = true },
    // ...
});

// create a storage buffer view
sg_view sbuf_view = sg_make_view(&(sg_view_desc){
    .storage_buffer = {
        .buffer = buf,
        .offset = ...,  // optional 256-byte aligned offset
    }
});

// ...later in a render- or compute-pass bind the storage buffer view:
sg_apply_bindings(&(sg_bindings){
    .views[VIEW_ssbo] = sbuf_view,
});

The 256-byte-alignment restriction for the offset is a bit unfortunate, since vertex-buffer and index-buffer bind offsets don’t have that restriction. The alignment restriction is coming in via WebGPU which on some Android devices requires this 256 byte alignment, but the only realistic lower choice would be 64 bytes which frankly isn’t that much better (see: https://vulkan.gpuinfo.org/displaydevicelimit.php?platform=android&name=minStorageBufferOffsetAlignment) and would still exclude about 8 percent of Android devices which is quite a lot.
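When several data ranges are packed into one storage buffer, each range has to start at a 256-byte aligned offset. A minimal sketch of how such offsets could be computed is shown below. The helper names are made up for illustration and are not part of sokol_gfx.h, only the 256-byte alignment is taken from the text above:

```c
#include <assert.h>
#include <stddef.h>

// storage buffer view offset alignment as stated above
#define SBUF_OFFSET_ALIGN ((size_t)256)

// round a byte offset up to the next multiple of 256
static size_t align_sbuf_offset(size_t offset) {
    return (offset + (SBUF_OFFSET_ALIGN - 1)) & ~(SBUF_OFFSET_ALIGN - 1);
}

// hypothetical helper: place 'num' ranges of the given byte sizes into one
// buffer so that each range starts at an aligned offset, writes the start
// offsets to out_offsets and returns the required total buffer size
static size_t layout_subranges(const size_t* sizes, size_t* out_offsets, size_t num) {
    assert(sizes && out_offsets);
    size_t cur = 0;
    for (size_t i = 0; i < num; i++) {
        cur = align_sbuf_offset(cur);
        out_offsets[i] = cur;
        cur += sizes[i];
    }
    return cur;
}
```

Each resulting offset could then be used as the .offset of a separate storage buffer view on the same buffer object.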

When not using sokol-shdc…

Samples:

for D3D11
for Metal
for desktop GL
for WebGL2
for WebGPU

Some tweaks on the manually populated sg_shader_desc structs are needed when not using sokol-shdc:

The separate bindslot reflection arrays for images, storage-buffers and storage-images have been unified into a views[] array which mirrors the views[] array in the sg_bindings struct. The actual reflection information in each view bindslot has remained the same though.
The .image_sampler_pairs array has been renamed to .texture_sampler_pairs, and the struct member .image_slot has been renamed to .view_slot.

Example from the wgpu/mrt_wgpu.c sample:

sg_shader fsq_shd = sg_make_shader(&(sg_shader_desc){
    // ...
    .views = {
        [0].texture = { .stage = SG_SHADERSTAGE_FRAGMENT, .wgsl_group1_binding_n = 0 },
        [1].texture = { .stage = SG_SHADERSTAGE_FRAGMENT, .wgsl_group1_binding_n = 1 },
        [2].texture = { .stage = SG_SHADERSTAGE_FRAGMENT, .wgsl_group1_binding_n = 2 },
    },
    .samplers = {
        [0] = { .stage = SG_SHADERSTAGE_FRAGMENT, .wgsl_group1_binding_n = 3 },
    },
    .texture_sampler_pairs = {
        [0] = { .stage = SG_SHADERSTAGE_FRAGMENT, .view_slot = 0, .sampler_slot = 0 },
        [1] = { .stage = SG_SHADERSTAGE_FRAGMENT, .view_slot = 1, .sampler_slot = 0 },
        [2] = { .stage = SG_SHADERSTAGE_FRAGMENT, .view_slot = 2, .sampler_slot = 0 },
    },
});

Shader code changes are only needed on WebGPU when using storage images. Those have moved from @group(2) into @group(1) (this is because storage images are no longer special compute-pass-attachments, but regular bindings just like texture- and storage-buffer bindings).

Q & A

Why no vertex- and index-buffer views

I had actually implemented vertex- and index-buffer views at first because it would have reduced the size of sg_bindings by 36 bytes (32 bytes vertex-buffer-offsets and 4 bytes index-buffer-offset). In the end I rolled that change back since none of the backend 3D APIs require creating view objects for binding vertex- and index-buffers, but some rendering scenarios (like writing a renderer backend for Dear ImGui) heavily depend on dynamic offsets for vertex- and index-data.

I might come back to that idea once additional drawing functions with base-offsets are added (which is planned for the ‘not-too-distant future’). Also adding a D3D12 backend would require adding view objects for vertex- and index-buffers, since D3D12 has removed the ability to bind vertex- and index-buffers directly with a dynamic offset (at least that’s what I’m seeing in the D3D12 docs).

Update: Nvm, I was wrong here, D3D12 just uses the name ‘view’ both for transient structs and for baked objects, and D3D12_VERTEX_BUFFER_VIEW and D3D12_INDEX_BUFFER_VIEW are such a transient struct. Thanks to ‘@[email protected]` for making me aware of my misconception!

Why no ‘texture’ field in sg_image_usage to indicate that texture views may be created for an image object?

Simply because creating a texture view is always supported for image objects, so that flag could be implicitly hardwired to true anyway (with one ‘legacy edge case’: WebGL2 and GL4.1 not supporting binding multi-sampled images as textures). In that edge case, an explicit .usage.texture flag would allow failing already at image object creation instead of failing to create a texture view on a multi-sampled image object, but this is such a minor detail which only affects ‘legacy APIs’ (WebGL2 and GL 4.1) that I didn’t think adding an explicit texture usage flag was worth it.

What’s up with SG_MAX_VIEW_BINDSLOTS being this odd 28 instead of some 2^N value?

That way the sg_bindings struct is a nice round 256 bytes (64 bytes for vertex buffer handles and offsets, 8 bytes for index buffer and offset, 112 bytes for view handles, 64 bytes for sampler handles plus 2*4 bytes for the start and end canaries).

16 separate samplers might be overkill, so I might tweak the number of views vs samplers a bit in the ‘resource view update 2’.
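The byte counts above can be double-checked with a mock struct. Note that this is not the real sg_bindings declaration: the slot counts (8 vertex buffers, 28 views, 16 samplers) are derived from the byte counts in the text under the assumption of 4-byte handles and 4-byte offsets, and all names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

// mock layout, not the real sg_bindings: assumes 4-byte handles and offsets
#define MOCK_MAX_VERTEXBUFFER_BINDSLOTS 8
#define MOCK_MAX_VIEW_BINDSLOTS 28
#define MOCK_MAX_SAMPLER_BINDSLOTS 16

typedef struct { uint32_t id; } mock_handle;

typedef struct {
    uint32_t start_canary;                                            //   4 bytes
    mock_handle vertex_buffers[MOCK_MAX_VERTEXBUFFER_BINDSLOTS];      //  32 bytes
    int32_t vertex_buffer_offsets[MOCK_MAX_VERTEXBUFFER_BINDSLOTS];   //  32 bytes
    mock_handle index_buffer;                                         //   4 bytes
    int32_t index_buffer_offset;                                      //   4 bytes
    mock_handle views[MOCK_MAX_VIEW_BINDSLOTS];                       // 112 bytes
    mock_handle samplers[MOCK_MAX_SAMPLER_BINDSLOTS];                 //  64 bytes
    uint32_t end_canary;                                              //   4 bytes
} mock_bindings;

// 4 + 64 + 8 + 112 + 64 + 4 = 256
_Static_assert(sizeof(mock_bindings) == 256, "mock_bindings should be 256 bytes");
```

With 32 view bindslots instead of 28, the same layout would grow to 272 bytes.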

The sokol-gfx 'compute milestone 2' update

Mon, 19 May 2025 00:00:00 +0000

Update: merge happened on 24-May-2025

In a couple of days I will merge the next breaking sokol_gfx.h update (aka the compute-ms2 update) which makes working with buffer objects a bit more flexible and will allow compute shaders to write to sg_image objects via ‘compute pass attachments’.

The update also comes with a matching sokol-shdc update which writes additional reflection information for storage images used in compute shaders into the code-generated sg_shader_desc struct.

NOTE: all WASM sample URLs in the blog post require a WebGPU capable browser and will only be valid after the merge.

The implementation ticket is here, and this also has links to all related PRs: https://github.com/floooh/sokol/issues/1244

Updated documentation sections

in sokol_gfx.h, re-read the updated section ON COMPUTE PASSES
if you’re not using sokol-shdc for shader compilation, also re-read the updated section ON SHADER CREATION (most of that information is only needed when not using sokol-shdc though)
read the new doc section ON STORAGE IMAGES

An important behaviour change for immutable buffer objects

The initial ‘compute shader’ update made it possible to create immutable buffers without initial data and guaranteed that the buffer content would be zero-initialized. On some backend APIs this required a temporary memory allocation of the buffer size which obviously wasn’t great.

This guaranteed zero-initialization has been rolled back now and the rules for creating immutable buffer objects have been changed like this:

when creating an immutable non-storage-buffer object (e.g. the buffer cannot be written to with a compute shader), initial data must be provided
when creating an immutable storage-buffer object, no initial data needs to be provided, but in that case the buffer content will be ‘undefined’

In practice this means that when you use a compute shader to initialize storage buffer content you can no longer rely on the initial buffer content being zero-initialized, instead write all buffer items in the compute shader, even when they are supposed to be zero.
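The two rules can be written down as a small decision function. This is only a sketch of the rules as stated above, not the actual sokol-gfx validation code. The mock_buffer_desc struct and the init_content enum are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// hypothetical mirror of the relevant parts of a buffer description
typedef struct {
    bool storage_buffer;
    bool immutable;
    const void* data_ptr;  // NULL means 'no initial data provided'
} mock_buffer_desc;

typedef enum {
    INIT_INVALID,    // creation would be rejected
    INIT_FROM_DATA,  // content is defined by the initial data
    INIT_UNDEFINED,  // content is undefined until written on the GPU
} init_content;

// classify the initial content of an immutable buffer
static init_content classify_immutable_buffer(const mock_buffer_desc* desc) {
    assert(desc && desc->immutable);
    if (desc->data_ptr != NULL) {
        return INIT_FROM_DATA;
    }
    // no initial data: only ok if a compute shader can fill the buffer
    return desc->storage_buffer ? INIT_UNDEFINED : INIT_INVALID;
}
```

The INIT_UNDEFINED case is the one where a compute shader has to write every buffer item, including the ones that are supposed to be zero.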

Multi-purpose buffer objects

It’s now possible to bind the same buffer object to different bind points (e.g. bind the same buffer as vertex buffer, index buffer and/or storage buffer). This means the following scenarios are now enabled:

It’s possible to stash vertices and indices into the same buffer (with the exception of WebGL2 where this is explicitly disallowed)
It’s now possible to use a compute shader to write data to a buffer, and then bind this buffer as vertex- or index-buffer.

To achieve this, the sg_buffer_desc struct has been changed to merge the previous buffer type and buffer usage enum items into a new sg_buffer_usage struct which is a boolean flag group:

typedef struct sg_buffer_usage {
    bool vertex_buffer;
    bool index_buffer;
    bool storage_buffer;
    bool immutable;
    bool dynamic_update;
    bool stream_update;
} sg_buffer_usage;

The default setup configures an immutable vertex buffer (just as before), e.g. creating a buffer object like this:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .data = SG_RANGE(vertices),
});

…is identical with:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .vertex_buffer = true,
        .immutable = true,
    },
    .data = SG_RANGE(vertices),
});

…to create an immutable index buffer:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .index_buffer = true,
    },
    .data = SG_RANGE(indices),
});

…to create an index buffer with stream-update hint:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .index_buffer = true,
        .stream_update = true,
    },
    .size = ...,
});

…to create a buffer that can be written by a compute shader and then bound to a vertex buffer bindpoint:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .vertex_buffer = true,
        .storage_buffer = true,
    },
    .size = ...,
});

…and the same as index buffer:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .index_buffer = true,
        .storage_buffer = true,
    },
    .size = ...,
});

To stash both vertices and indices into the same buffer object:

const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
    .usage = {
        .vertex_buffer = true,
        .index_buffer = true,
    },
    .data = SG_RANGE(vertices_and_indices),
});
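The examples above suggest how zero-initialized usage flags are resolved: a missing 'buffer type' flag falls back to vertex buffer, and a missing 'update strategy' flag falls back to immutable. The following sketch spells this out with a mock struct. It is inferred from the examples and not copied from sokol_gfx.h, so treat the exact rules as an assumption:

```c
#include <assert.h>
#include <stdbool.h>

// mock of the usage flag group, same fields as sg_buffer_usage above
typedef struct {
    bool vertex_buffer;
    bool index_buffer;
    bool storage_buffer;
    bool immutable;
    bool dynamic_update;
    bool stream_update;
} mock_buffer_usage;

// assumed default rules, inferred from the examples in the text
static mock_buffer_usage apply_usage_defaults(mock_buffer_usage u) {
    // no buffer type flag set: default to vertex buffer
    if (!(u.vertex_buffer || u.index_buffer || u.storage_buffer)) {
        u.vertex_buffer = true;
    }
    // no update strategy flag set: default to immutable
    if (!(u.immutable || u.dynamic_update || u.stream_update)) {
        u.immutable = true;
    }
    assert(u.vertex_buffer || u.index_buffer || u.storage_buffer);
    return u;
}
```

Under these rules an all-zero usage struct resolves to an immutable vertex buffer, while { .index_buffer = true } resolves to an immutable index buffer without the vertex buffer flag.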

Note that ‘multi-purpose buffer usage’ is explicitly disallowed on WebGL2 (which is only relevant for using a single buffer to hold vertex- and index-data, since storage buffers are not available on WebGL2 anyway). To check for this restriction use the new sg_features.separate_buffer_types boolean:

if (!sg_query_features().separate_buffer_types) {
    const sg_buffer buf = sg_make_buffer(&(sg_buffer_desc){
        .usage = {
            .vertex_buffer = true,
            .index_buffer = true,
        },
        .data = SG_RANGE(vertices_and_indices),
    });
}

Any invalid combination of usage flags will also be checked in the sokol-gfx validation layer.

The following new sample uses a combined vertex/index buffer:

The instancing-compute-sapp sample has been updated to bind the compute-shader-updated storage buffer as vertex buffer with hardware instancing:

There is no sample yet which uses a compute shader to write index data.

Breaking changes when creating image objects

Similar to the above sg_buffer_desc change, usage hints in the sg_image_desc struct are now provided through a new sg_image_usage struct looking like this:

typedef struct sg_image_usage {
    bool render_attachment;
    bool storage_attachment;
    bool immutable;
    bool dynamic_update;
    bool stream_update;
} sg_image_usage;

E.g. creating a ‘render-target texture’ for offscreen rendering now looks like this:

const sg_image img = sg_make_image(&(sg_image_desc){
    .usage = {
        .render_attachment = true,
    },
    ...
});

…and creating an image updated dynamically with CPU data with stream-update behaviour:

const sg_image img = sg_make_image(&(sg_image_desc){
    .usage = {
        .stream_update = true,
    },
    ...
});

As with sg_buffer_usage, invalid usage flag combinations are caught in the sokol-gfx validation layer.

Compute pass attachments (aka storage images)

It’s now possible to use compute shaders to write to sg_image objects. The way this is currently implemented is very similar to offscreen rendering (but will change in a future ‘resource view update’, more info on that at the end of the blog post).

Let’s first write a simple compute shader in the sokol-shdc GLSL flavour which writes some animated color gradient to a storage image:

@cs cs
layout(binding=0) uniform cs_params {
    float offset;
};
layout(binding=0, rgba8) uniform writeonly image2D cs_out_tex;
layout(local_size_x=16, local_size_y=16) in;

void main() {
    ivec2 size = imageSize(cs_out_tex);
    ivec2 pos = ivec2(mod(vec2(gl_GlobalInvocationID.xy) + vec2(size) * offset, size));
    vec4 color = vec4(vec2(gl_GlobalInvocationID.xy) / float(size), 0, 1);
    imageStore(cs_out_tex, pos, color);
}
@end
@program compute cs

On the CPU side, create an sg_image object with ‘storage attachment usage’:

const sg_image img = sg_make_image(&(sg_image_desc){
    .usage = {
        .storage_attachment = true,
    },
    .width = WIDTH,
    .height = HEIGHT,
    .pixel_format = SG_PIXELFORMAT_RGBA8,
});

Next the image must be wrapped in an sg_attachments object. This allows picking a specific image surface (mip-level and/or slice) for the compute shader to access. Up to 4 (or SG_MAX_STORAGE_ATTACHMENTS) images can be defined in a single attachments object:

const sg_attachments atts = sg_make_attachments(&(sg_attachments_desc){
    .storages[SIMG_cs_out_tex] = {
        .image = img,
        // optionally pick a mip level and slice:
        .mip_level = 0,
        .slice = 0,
    },
});

…next a compute pipeline object which wraps the above compute shader:

const sg_pipeline pip = sg_make_pipeline(&(sg_pipeline_desc){
    .compute = true,
    .shader = sg_make_shader(compute_shader_desc(sg_query_backend())),
});

In the frame loop, run a compute pass and provide the attachments object, apply the compute pipeline and uniform data, and finally call sg_dispatch() to kick off the compute shader:

sg_begin_pass(&(sg_pass){ .compute = true, .attachments = atts });
sg_apply_pipeline(pip);
sg_apply_uniforms(UB_cs_params, &SG_RANGE(cs_params));
sg_dispatch(WIDTH / 16, HEIGHT / 16, 1);
sg_end_pass();
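The WIDTH / 16 and HEIGHT / 16 arguments match the local_size_x=16 and local_size_y=16 declaration in the shader and cover the whole image only when both dimensions are multiples of 16. For other sizes, a common approach is to round the workgroup count up. The helper below is a hypothetical sketch and not part of sokol_gfx.h:

```c
#include <assert.h>

// number of workgroups needed to cover 'size' items when each
// workgroup processes 'local_size' items (rounds up)
static int num_workgroups(int size, int local_size) {
    assert((size > 0) && (local_size > 0));
    return (size + local_size - 1) / local_size;
}
```

For instance num_workgroups(250, 16) is 16, which launches 256 invocations along that axis. The 6 surplus invocations then need to be handled in the shader, for instance with a bounds check against imageSize().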

…after the compute pass the image object can then be used as a texture binding in a regular render pass.

Find the complete sample here:

…and a more advanced example which has been ported from WebGPU:

Detailed change list

sokol_app.h:

The D3D11/DXGI backend now creates a D3D_FEATURE_LEVEL_11_1 device (with a fallback to D3D_FEATURE_LEVEL_11_0). Feature Level 11.1 is needed to allow more than 8 UAV (Unordered Access View) bindings. D3D11.1 was released around 2012 with Windows 8, so this is only an issue if support for Windows 7 is still required or on very old GPUs (Win7 is now at 0.12% on Steam Hardware Survey, but even if this turns out to be a problem, only the bindslot allocation strategy in sokol-shdc for HLSL5 UAV bindslots needs to be changed).

sokol_gfx.h:

A new constant SG_MAX_STORAGE_ATTACHMENTS = 4 has been added (most likely bumped to at least 8 in the future)
The struct sg_pixelformat_info has gained two new flags:
- bool read: true if the pixel format supports compute shader read access
- bool write: true if the pixel format supports compute shader write access
Currently the list of compute shader accessible pixel formats is hardwired to the following list which is safe to use across all GPUs and backend APIs (all those formats support read+write access):
- SG_PIXELFORMAT_RGBA8
- SG_PIXELFORMAT_RGBA8SN/UI/SI
- SG_PIXELFORMAT_RGBA16UI/SI/F
- SG_PIXELFORMAT_R32UI/SI/F
- SG_PIXELFORMAT_RG32UI/SI/F
- SG_PIXELFORMAT_RGBA32UI/SI/F
A new feature flag sg_features.separate_buffer_types has been added, this is only true on WebGL2. The only effect of that flag is that the same buffer object cannot be used as vertex- and index-buffer bindings.
The enums sg_usage and sg_buffer_type have been removed.
The struct sg_buffer_usage has been added.
The enum field sg_buffer_desc.type has been removed and replaced by boolean flags in sg_buffer_usage.
The enum field sg_buffer_desc.usage has been repurposed as nested struct item of type sg_buffer_usage.
The struct sg_image_usage has been added.
The boolean sg_image_desc.render_target has been removed and replaced by sg_image_usage.render_attachment
The enum field sg_image_desc.usage has been repurposed as nested struct item of type sg_image_usage.
A new struct sg_shader_storage_image has been added, this is nested in sg_shader_desc and holds reflection information about storage image bindings in compute shaders.
A new array sg_shader_desc.storage_images[] has been added to communicate reflection information about storage image usage in compute shaders to sokol_gfx.h
A new array sg_attachments_desc.storages[] has been added to describe ‘storage image attachments’ for compute passes.
The function sg_query_buffer_usage() now returns a struct sg_buffer_usage.
The function sg_query_image_usage() now returns a struct sg_image_usage.

What’s next

Long story short: while working on the storage image update it became clear that sokol_gfx.h needs resource-view objects.

This will allow more flexible resource bindings without creating temporary 3D-backend objects in the ‘hot path’ while keeping the sokol_gfx.h backend implementations simple (e.g. I want to avoid a dynamic ‘hash-and-cache’ approach for 3D-backend resource objects as much as possible, it’s already bad enough that this is needed with WebGPU BindGroups).

Currently resource view objects are managed under the hood, for instance in the D3D11 backend:

sg_buffer objects with storage buffer usage generally create a Shader Resource View for readonly-access in vertex-, fragment- and compute-shaders, and if the buffer is immutable, also an Unordered Access View for write-access in compute shaders. Notably, any starting offsets are hardwired to zero in both view objects.
sg_image objects generally create a Shader Resource View object, but without allowing to specify a mip-level range, array-slice range or different pixel format.
sg_attachments objects create:
- one Render Target View object per color attachment
- an optional Depth Stencil View object for the depth-stencil attachment
- one Unordered Access View object per storage attachment

The reason why storage images are currently treated as pass attachments instead of regular bindings applied via sg_apply_bindings() is because storage image bindings need to pick a mip-level and/or slice, and at least on D3D11 this requires a baked UAV object. Likewise, binding the same storage buffer with different offsets would require one SRV or UAV object per offset.

The current plan for view objects in sokol_gfx.h looks like this:

a single new resource object type is added: sg_view, with matching structs and functions (sg_view_desc, sg_make_view(), sg_destroy_view(), etc…)
in return, the sg_attachments resource object type is removed (along with sg_attachments_desc, sg_make_attachments(), sg_destroy_attachments() etc…)
view objects can be thought of as specialization of a resource object for a specific bindslot type (I actually thought about calling the new resource type sg_binding, but ‘view’ is the established name for this type of thing across backend 3D APIs), e.g. views will come in the following ‘runtime flavours’:
- texture views
- storage buffer views
- storage image views
- color attachment views
- resolve attachment views
- depth-stencil attachment views
…and maybe (but not sure yet):
- vertex buffer views
- index buffer views
…vertex- and index-buffer-views would allow to remove the bind offset for vertex- and index-buffers from sg_bindings, with the downside that one view object would be required per offset, but I can’t think of a situation where a highly dynamic starting offset would be required for vertex- and index-data. To be clear: there is no backend API which requires a view object for vertex- and index-buffer bindings, it would be purely a sokol_gfx.h thing (this also means that it would be very cheap to build and destroy vertex- and index-buffer-view objects on the fly since no calls into backend APIs would happen)

the new sg_bindings struct would then look like this (notably storage images for compute shader access would move from ‘pass attachments’ to regular ‘bindings’)

typedef struct sg_bindings {
    sg_view vertex_buffers[SG_MAX_VERTEXBUFFER_BINDINGS];
    sg_view index_buffer;
    sg_view textures[SG_MAX_TEXTURE_BINDINGS];
    sg_view storage_buffers[SG_MAX_STORAGEBUFFER_BINDINGS];
    sg_view storage_images[SG_MAX_STORAGEIMAGE_BINDINGS];
    sg_sampler samplers[SG_MAX_SAMPLER_BINDINGS];
} sg_bindings;

sg_attachments would become a ‘transient struct’ similar to sg_bindings:

typedef struct sg_attachments {
    sg_view colors[SG_MAX_COLOR_ATTACHMENTS];
    sg_view resolves[SG_MAX_COLOR_ATTACHMENTS];
    sg_view depth_stencil;
} sg_attachments;

This ‘view update’ would have the following advantages:

storage buffer bindings can have a starting offset, which simplifies managing different types of data in the same buffer
texture and storage image bindings can (to some extent) reinterpret the image data (e.g. casting to a different pixel format or selecting a miplevel and slice range - this will have to be behind a feature flag though)
multiple-render-target combinations no longer need to be prebaked

No ETA yet on the ‘view update’ though, first I want to fix a couple of internal things:

the GL texture creation code is currently an unholy combination of glTexStorage and glTexImage functions. I want to cleanly split this into two code paths (unfortunately macOS being stuck at GL 4.1 doesn’t have the glTexStorage functions, although I heard that those functions are implemented but just not present in the core GL headers - which I’ll need to investigate)
I want to improve the internal ‘lifetime tracking’ for referenced resources (e.g. one resource object holding a reference to another object). Currently it’s not possible to detect when such a referenced object has gone through an ‘uninit/init’ cycle because this keeps the same public handle while discarding and recreating backend 3D API objects. Especially for view objects (which need to track their original resource object) it is important that views can detect when their referenced resource object is discarded (and I’m thinking about ‘auto-managed’ view objects which can recreate themselves on the fly when their resource object goes through uninit/init - no promises yet though).
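One possible way to detect such uninit/init cycles is a per-slot counter which is bumped on every uninit and remembered by each reference. The sketch below only illustrates this general idea. It is a hypothetical example with made-up names and not a description of how sokol_gfx.h implements (or will implement) its reference tracking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// hypothetical resource slot, not taken from sokol_gfx.h
typedef struct {
    uint32_t id;            // public handle id, survives uninit/init
    uint32_t uninit_count;  // bumped on every uninit
    bool valid;
} res_slot;

// a reference remembers the slot state at the time it was created
typedef struct {
    const res_slot* slot;
    uint32_t id;
    uint32_t uninit_count;
} res_ref;

static void slot_init(res_slot* s) { s->valid = true; }
static void slot_uninit(res_slot* s) { s->valid = false; s->uninit_count++; }

static res_ref ref_make(const res_slot* s) {
    return (res_ref){ s, s->id, s->uninit_count };
}

// true only if the referenced object is valid and has not gone
// through an uninit/init cycle since the reference was created
static bool ref_alive(res_ref r) {
    return r.slot->valid
        && (r.slot->id == r.id)
        && (r.slot->uninit_count == r.uninit_count);
}
```

After slot_uninit() followed by slot_init() the public id is unchanged, but ref_alive() returns false for references created before the cycle, which is the situation a view would need to detect.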

More info on those planned updates is in the following planning tickets:

resource views: https://github.com/floooh/sokol/issues/1252
better internal reference tracking: https://github.com/floooh/sokol/issues/1260
glTexStorage vs glTexImage: https://github.com/floooh/sokol/issues/1263

…and that is all for today :)

The sokol-gfx compute shader update

Mon, 03 Mar 2025 00:00:00 +0000

Update: merge happened on 08-Mar-2025

In the next couple of days I will merge initial compute shader support for sokol_gfx.h (and sokol-shdc). The update is surprisingly ‘low-profile’ in terms of API changes, the only breaking change is that the runtime feature flag sg_features.storage_buffer has been renamed to sg_features.compute (this is because the same backends that supported storage buffers before now also support compute shaders).

Availability and Restrictions

Compute shader support is available on the following platform/backend combos:

macOS and iOS with Metal
Windows with D3D11 and GL
Linux with GL
Web with WebGPU

…which means that compute shaders are not available on:

macOS with GL
iOS with GLES3
Web with WebGL2
Android with GLES3

The initial compute shader support comes with a couple of restrictions which will most likely be lifted in later updates (in about that order):

storage buffers cannot be bound as vertex- or index-buffers
no storage textures, e.g. compute shaders can only write buffer data but not texture data
there’s no way to read data from GPU resources back to the CPU side (or copy data between GPU resources)

Right now compute shaders are mostly useful for replacing dynamic- and streaming-buffer update scenarios, where dynamic render data is computed on the CPU and uploaded to buffers via sg_update_buffer().

New compute shader samples

To get an idea how compute shaders work in sokol-gfx, it’s best to read the new sample code:

This is an evolution of the instancing-sapp sample, and moves all particle computations into compute shaders.

The other compute shader sample is a straight port of the WebGPU compute boids sample to sokol-gfx:

Those two samples use ‘cross-backend’ GLSL shader code compiled to the underlying shading languages via sokol-shdc.

For authoring compute shaders with sokol-shdc it might make sense to read up on GLSL compute shaders in the GL Wiki - note though that not all features have been properly tested yet (like sampling textures in compute shaders, or accessing shared memory).

For using sokol-gfx compute shaders without sokol-shdc, check out the following backend specific versions of the instancing-compute sample:

Also check out the updated documentation of sokol-shdc, and the new documentation comment section on compute shaders in the sokol_gfx.h header (search for: ON COMPUTE PASSES and re-read the updated section ON SHADER CREATION).

Shader Authoring Changes

The sokol-gfx update comes with a matching sokol-shdc update for authoring compute shaders.

A new tag @cs [name] (similar to the existing @vs [name] and @fs [name]) is used to identify a compute shader snippet, e.g. everything inside @cs / @end will be compiled as a GLSL compute shader.

NOTE that the distinction between readonly and read/write storage buffer bindings is important, e.g.:

layout(binding=0) readonly buffer cs_ssbo_in { particle prt_in[]; };
layout(binding=1) buffer cs_ssbo_out { particle prt_out[]; };

If your compute shader only reads (but doesn’t write) storage buffer content, its binding declaration should be marked as readonly. This information will be extracted by sokol-shdc and used by sokol-gfx for hazard-tracking needed in some 3D-APIs.

The other notable shader specialty is the ‘workgroup size’, which in GLSL is defined as:

layout(local_size_x=X, local_size_y=Y, local_size_z=Z) in;

…if you’re used to HLSL, this is the same as [numthreads(X,Y,Z)], or in WGSL @workgroup_size(X,Y,Z). On Metal this is called threadsPerThreadGroup and is not defined in the shader code, but on the CPU side when issuing a dispatch call (this is another case where sokol-shdc comes in handy, since it extracts the workgroup size from the GLSL shader and passes it into sokol-gfx as sg_shader_desc.mtl_threads_per_threadgroup).

Other than that you mainly need to be aware that your compute shader code must be thread safe because compute shaders allow random write access into storage buffers and the GPU is spawning many invocations of your shader running in parallel.

On the CPU side

The sg_setup() call gets a new config item sg_desc.max_dispatch_calls_per_pass (default: 1024). This is used to allocate an internal array to keep track of written storage buffers in a compute pass for hazard tracking purposes.

There’s a minor change when creating buffers: It’s now allowed to create immutable buffers without initial content, and such buffers will be zero-initialized (note though that dynamic- and streaming-buffers may still have undefined buffer content after creation). Zero-initialization is useful when using a compute shader to write the initial buffer content instead of providing the data from the CPU side during the sg_make_buffer() call.

Shaders, pipelines and passes now come in two runtime flavours: ‘render’ vs ‘compute’, where the ‘render flavours’ are fully compatible with existing code.

For shaders, nothing changes either when using sokol-shdc for shader authoring. In that case you just write a compute shader and sokol-shdc will code-generate a matching sg_shader_desc struct which can be plugged directly into the sg_make_shader() call.

A compute pipeline is a regular pipeline object without any render state, but with a compute shader attached:

sg_pipeline pip = sg_make_pipeline(&(sg_pipeline_desc){
  .compute = true,
  .shader = a_compute_shader,
});

Finally, kicking off ‘compute workloads’ happens with a new function sg_dispatch() inside ‘compute passes’:

sg_begin_pass(&(sg_pass){ .compute = true });
sg_apply_pipeline(pip);
sg_apply_bindings(...);
sg_apply_uniforms(...);
sg_dispatch(x, y, z);
sg_end_pass();

The sg_dispatch() call takes the number of ‘workgroups’ as arguments (same convention as GL, D3D11 and WebGPU, but different from Metal’s dispatchThreads method).
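
Since sg_dispatch() wants workgroup counts and not item counts, the caller has to divide the number of items by the workgroup size declared in the shader and round up. A minimal sketch of that arithmetic, assuming a one-dimensional workload (num_workgroups() is a hypothetical helper, not part of sokol_gfx.h):

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper (not part of sokol_gfx.h): number of workgroups needed
// to cover num_items when each workgroup processes local_size items,
// where local_size corresponds to local_size_x in the GLSL shader.
static uint32_t num_workgroups(uint32_t num_items, uint32_t local_size) {
    assert(local_size > 0);
    // rounding-up division, written so that it cannot overflow
    return (num_items / local_size) + ((num_items % local_size) != 0 ? 1u : 0u);
}
```

With local_size_x=64 and 1000 particles this yields 16 workgroups, which would be passed as sg_dispatch(16, 1, 1). That spawns 16 * 64 = 1024 invocations, so the shader has to skip the invocations whose index lies beyond the particle count.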

Compute- vs render-passes now impose a couple of restrictions (checked by the validation layer):

the following functions must only be called in render passes:
- sg_apply_viewport[f]()
- sg_apply_scissor_rect[f]()
- sg_draw()
sg_dispatch() must only be called in a compute pass
sg_apply_bindings() in a compute pass must not attempt to bind vertex- or index-buffers
the sg_apply_pipeline() pipeline type must match the pass type (e.g. render pipeline objects can only be applied in render passes, and compute pipeline objects only in compute passes)

When not using sokol-shdc

If you don’t use sokol-shdc for shader authoring you’ll need to populate the all-important sg_shader_desc struct passed into sg_make_shader() yourself with information that matches your shader code:

A nested struct compute_func has been added (similar to existing vertex_func and fragment_func) to pass a compute shader function as backend-specific source code or bytecode blob
A Metal-specific mtl_threads_per_threadgroup nested struct which defines the ‘workgroup size’ to the Metal API (this is in sg_shader_desc because those values are normally extracted from shader code via reflection)
The readonly boolean in the storage buffer bindslot declaration is now allowed to be false, but only in compute shaders. This flag is now used by sokol-gfx as hint for ‘resource hazard tracking’ in some backend APIs.
A new HLSL/D3D11 specific item uint8_t register_u_n has been added to the nested storage_buffers[] declarations (struct sg_shader_storage_buffer), this is used to communicate the HLSL bindslot for writable storage buffer bindings (which are bound as D3D11 ‘unordered access views’, while readonly storage buffers continue to be bound as ‘shader resource views’).

Also please carefully review the backend-specific compute shader samples which directly pass backend-specific shader code into sokol-gfx:

Under the hood

Most of the new code in sokol_gfx.h is just a straight-forward mapping from sokol-gfx types and functions into backend 3D-API types and functions.

Only two details are worth mentioning:

On Metal, and only on systems without unified memory, GPU-written managed storage buffers are ‘synchronized’ at the end of a compute pass inside sg_end_pass(). This synchronization basically updates the CPU-side shadow copy of the buffer with the new data that’s been written by a compute shader. This requires keeping track of all read/write storage buffer bindings inside a compute pass (this is what the new sg_desc.max_dispatch_calls_per_pass config item is used for).
On GL, glMemoryBarrier() calls are issued (at most once per sg_apply_bindings() call) when a storage buffer was previously bound as read/write (which sets an internal ‘gpu_dirty’ flag).
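
The dirty-flag idea from the GL item can be modelled in a few lines of C. This is only a toy sketch under my own assumptions and not the actual sokol_gfx.h code: all type and function names are hypothetical, and where the real backend would issue a glMemoryBarrier() call the sketch just returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Toy model, all names are hypothetical.
typedef struct {
    bool gpu_dirty;     // set while the buffer may contain pending GPU writes
} toy_buffer_t;

typedef struct {
    toy_buffer_t* buf;  // may be NULL for an empty bindslot
    bool readonly;      // mirrors the 'readonly' flag of the shader binding
} toy_binding_t;

// Returns true if a memory barrier would be needed before the bindings are
// used. At most one barrier is reported per call, no matter how many of the
// bound buffers are dirty.
static bool toy_apply_bindings(toy_binding_t* bindings, size_t num) {
    assert(bindings || (num == 0));
    bool need_barrier = false;
    // first pass: any buffer that was written earlier requires a barrier
    for (size_t i = 0; i < num; i++) {
        toy_buffer_t* buf = bindings[i].buf;
        if (buf && buf->gpu_dirty) {
            need_barrier = true;
            buf->gpu_dirty = false;
        }
    }
    // second pass: buffers bound as read/write may be written from now on
    for (size_t i = 0; i < num; i++) {
        toy_buffer_t* buf = bindings[i].buf;
        if (buf && !bindings[i].readonly) {
            buf->gpu_dirty = true;
        }
    }
    return need_barrier;
}
```

Binding a buffer as read/write marks it dirty, and the next bind of the same buffer (readonly or not) reports that a barrier is needed and clears the flag.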

What’s next

…mainly patching remaining feature gaps in a couple of minor updates:

allow storage buffers to be bound as vertex- and index-buffers
introducing storage textures which can be written by compute shaders
more ‘feature coverage’ by writing a handful more interesting compute samples

…and what will most likely be a bigger update: figure out a proper sub-API for CPU => GPU, GPU => CPU and GPU => GPU copies.

Upcoming Sokol header API changes (Nov 2024)

Mon, 04 Nov 2024 00:00:00 +0000

Update: the ‘bindings cleanup’ update has been merged on 07-Nov-2024

In a couple of days I will merge the next breaking sokol_gfx.h update (aka the “Bindings Cleanup”). The update also affects sokol-shdc, so if you’re using sokol-shdc for shader compilation make sure to update that as well.

Overview
Updated documentation and example code
- When using sokol-shdc:
- When not using sokol-shdc
Change Recipes
- When using sokol-shdc:
- When not using sokol-shdc:

Overview

In general, the update makes the relationship between the shader resource interface and the sokol-gfx resource binding model more explicit, but also more flexible. Another motivation for the change was to prepare the sokol-gfx API for compute shader support.

The root PR is here: https://github.com/floooh/sokol/pull/1111.

The TL;DR is:

When using sokol-shdc for shader compilation, the input GLSL source now requires explicit binding annotations via layout(binding=N), where N directly maps to bindslot indices in the sokol-gfx resource binding API.
The concept of ‘shader stages’ mostly disappears from the sokol-gfx API, shader stages are now only a minor detail of the shader interface reflection information in the sg_shader_desc struct passed into the sg_make_shader() function.
When not using sokol-shdc there’s now an explicit mapping from sokol-gfx bindslots to 3D backend-specific bindslots. This reduces the sokol-gfx internal magic for mapping the backend-agnostic sokol-gfx binding model to the specific binding models of the backend 3D APIs (there are still some restrictions but only when they allow a more efficient resource binding implementation in sokol-gfx).

In general, all changes result in compile errors, and cleaning up the compile errors by following the ‘change recipes’ below should be enough to make your existing code work.

The following parts of the public sokol_gfx.h API have changed:

In the sg_bindings struct, the nested vertex- and fragment-stage structs for the image-, sampler- and storage-buffer-bindings have been removed, and the bindings arrays have moved up into the root struct.
In the sg_apply_uniforms() call, the shader stage parameter has been removed
The interior of the sg_shader_desc struct and the typename of nested structs have changed completely (but if you are using sokol-shdc for shader authoring you don’t need to worry about that, since sokol-shdc will code-generate the sg_shader_desc struct).
A number of public API constants have been removed or renamed (but those should rarely show up in user code).
The enum items in sg_shader_stage have been renamed, and those are now only used in the sg_shader_desc struct and nowhere else:
- SG_SHADERSTAGE_VS => SG_SHADERSTAGE_VERTEX
- SG_SHADERSTAGE_FS => SG_SHADERSTAGE_FRAGMENT

The update also has some minor behaviour changes:

Resource bindings can now have gaps, and validation for sg_apply_bindings() has been relaxed to allow bindslots in the sg_bindings struct to be occupied even when the current shader doesn’t use those bindings. This makes it possible to use the same sg_bindings struct for different but related shader variants.
Likewise, uniform block bindslots can now be explicitly defined in the shaders, which makes it possible to ‘share’ bindslot indices across shaders. Trying to call sg_apply_uniforms() for a bindslot that isn’t used by the current shader is still an error though (not sure yet if this makes sense, could probably be relaxed in a later update)
There’s now a new (debug-mode only) error check in sg_draw() to make sure that sg_apply_bindings() and/or sg_apply_uniforms() had been called since the last sg_apply_pipeline() when required.

Updated documentation and example code

NOTE: these links will only be up to date after PR #1111 has been merged.

When using sokol-shdc:

Please re-read the sokol-shdc documentation:

https://github.com/floooh/sokol-tools/blob/master/docs/sokol-shdc.md

Especially the section Shader Authoring Considerations.

In the sokol_gfx.h header, re-read the documentation header above the sg_bindings struct.

Check the updated sokol samples here:

https://github.com/floooh/sokol-samples/tree/master/sapp

When not using sokol-shdc

In the sokol_gfx.h header, re-read the updated documentation section ON SHADER CREATION.

Next read the updated documentation above the sg_shader_desc and sg_bindings structs.

Finally check the updated backend-specific samples:

for Metal: https://github.com/floooh/sokol-samples/tree/master/metal
for D3D11: https://github.com/floooh/sokol-samples/tree/master/d3d11
for desktop GL: https://github.com/floooh/sokol-samples/tree/master/glfw
for WebGL/GLES3: https://github.com/floooh/sokol-samples/tree/master/html5
for WebGPU: https://github.com/floooh/sokol-samples/tree/master/wgpu

Especially note the sg_shader_desc struct interiors in the sg_make_shader() calls.

Change Recipes

General rule of thumb: fix all places that throw compile errors and you should be good.

When using sokol-shdc:

First you’ll need to fix your shaders and add explicit binding annotations. When running sokol-shdc over your current shader code you’ll get errors looking like this:

error: 'binding' : uniform/buffer blocks require layout(binding=X)

…or this:

error: 'binding' : sampler/texture/image requires layout(binding=X)

To fix those errors for the different resource types add layout(binding=N) annotations:

layout(binding=0) uniform vs_params { ... };
layout(binding=0) uniform texture2D tex;
layout(binding=0) uniform sampler smp;
layout(binding=0) readonly buffer ssbo { ... };

Note that each resource type (uniform blocks, textures, samplers and storage buffers) has its own bindslot space which is shared across shader stages. Trying to use bindslot indices outside the ranges listed below, or using the same bindslot for a resource type in different shader stages will cause a compilation error.

The binding ranges per resource type are:

uniform blocks: 0..7
textures: 0..15
samplers: 0..15
storage buffers: 0..7

…these are also the maximum number of resources of that type that can be bound on a shader across all shader stages.
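
To make the two rules concrete (per-type ranges, and one bindslot space per resource type shared across stages), here is a small C sketch of such a check. It is not how sokol-shdc implements its validation, all names are made up for the example, and it simply rejects any second use of an occupied bindslot:

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper types, not generated by sokol-shdc.
typedef enum { RES_UNIFORM_BLOCK, RES_TEXTURE, RES_SAMPLER, RES_STORAGE_BUFFER, RES_NUM } res_type_t;
typedef enum { STAGE_NONE, STAGE_VERTEX, STAGE_FRAGMENT } stage_t;

// number of bindslots per resource type (0..7, 0..15, 0..15, 0..7)
static const int max_slots[RES_NUM] = { 8, 16, 16, 8 };

typedef struct {
    // which stage occupies each bindslot, STAGE_NONE (zero) means free
    stage_t used_by[RES_NUM][16];
} slot_table_t;

// Tries to occupy a bindslot, returns false if the slot index is out of
// range for the resource type or if the slot is already taken.
static bool claim_slot(slot_table_t* t, res_type_t type, int slot, stage_t stage) {
    assert(t && (type < RES_NUM) && (stage != STAGE_NONE));
    if ((slot < 0) || (slot >= max_slots[type])) {
        return false;   // outside the range of this resource type
    }
    if (t->used_by[type][slot] != STAGE_NONE) {
        return false;   // bindslot collision
    }
    t->used_by[type][slot] = stage;
    return true;
}
```

A vertex-stage uniform block and a vertex-stage texture can both sit on bindslot 0 because they live in different bindslot spaces, but a fragment-stage uniform block on bindslot 0 would then be rejected.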

Next fix the compile errors on the CPU side, you should see errors when initializing an sg_bindings struct, when calling sg_apply_uniforms() and possibly when setting up vertex attributes in the sg_pipeline_desc struct:

in the sg_bindings struct, the nested structs for the vertex and fragment shader stage have been removed, and the former per-stage binding arrays have moved up into the root
in the sg_apply_uniforms() call, the shader stage argument has been removed
all code-generated slot constants have new naming schemes (also the vertex attribute slot constants)

For instance if your shader resource interface looks like this:

@vs
// a vertex shader uniform block
layout(binding=0) uniform vs_params { ... };
// a vertex shader texture and sampler
layout(binding=0) uniform texture2D vs_tex;
layout(binding=0) uniform sampler vs_smp;
// a vertex shader storage buffer
layout(binding=0) readonly buffer vs_ssbo { ... };
@end

@fs
// a fragment shader uniform block
layout(binding=1) uniform fs_params { ... };
// diffuse, normal and specular textures
layout(binding=1) uniform texture2D diffuse_tex;
layout(binding=2) uniform texture2D specular_tex;
layout(binding=3) uniform texture2D normal_tex;
// a common sampler for the above textures
layout(binding=1) uniform sampler smp;
@end

…the matching sg_bindings struct on the CPU side needs to look like this - note how the array indices match the shader layout(binding=N):

const sg_bindings bnd = {
    .vertex_buffers[0] = ...,
    .index_buffer = ...,
    .images = {
        [0] = vs_tex,
        [1] = diffuse_tex,
        [2] = specular_tex,
        [3] = normal_tex,
    },
    .samplers = {
        [0] = vs_smp,
        [1] = smp,
    },
    .storage_buffers = {
        [0] = vs_ssbo,
    },
};

…and the sg_apply_uniforms() calls to write the uniform data for the vs_params and fs_params uniform blocks now look like this:

sg_apply_uniforms(0, &SG_RANGE(vs_params));
sg_apply_uniforms(1, &SG_RANGE(fs_params));

…instead of hardwired numeric indices you can also use code-generated constants (note that those have been renamed from a generic SLOT_* to a per-resource-type naming scheme):

const sg_bindings bnd = {
    .vertex_buffers[0] = ...,
    .index_buffer = ...,
    .images = {
        [IMG_vs_tex] = vs_tex,
        [IMG_diffuse_tex] = diffuse_tex,
        [IMG_specular_tex] = specular_tex,
        [IMG_normal_tex] = normal_tex,
    },
    .samplers = {
        [SMP_vs_smp] = vs_smp,
        [SMP_smp] = smp,
    },
    .storage_buffers = {
        [SBUF_vs_ssbo] = vs_ssbo,
    },
};

…or for the uniform block updates:

sg_apply_uniforms(UB_vs_params, &SG_RANGE(vs_params));
sg_apply_uniforms(UB_fs_params, &SG_RANGE(fs_params));

…using the code-generated constants has the advantage that changing the bindslots in the shader code doesn’t require updating the CPU-side code, but other than that it’s totally fine to use numeric indices.

The naming scheme for the code-generated vertex attribute slots has changed to use the shader program name for ‘namespacing’ instead of the vertex shader snippet name.

For instance with the following shader fragment:

@vs vs
in vec4 position;
in vec4 color0;
...
@end

@fs fs
...
@end

@program cube vs fs

The generated vertex attribute slot constants ATTR_* previously looked like this (in the sg_pipeline_desc struct):

const sg_pipeline_desc desc = {
    .layout = {
        .attrs = {
            [ATTR_vs_position].format = ...,
            [ATTR_vs_color0].format = ...,
        },
    },
    ...
};

…now the ATTR_* names look like this (e.g. ATTR_vs_* to ATTR_cube_*):

const sg_pipeline_desc desc = {
    .layout = {
        .attrs = {
            [ATTR_cube_position].format = ...,
            [ATTR_cube_color0].format = ...,
        },
    },
    ...
};

…it’s also possible to use explicit attribute locations and ignore the code-generated constants, for instance:

@vs vs
layout(location=0) in vec4 position;
layout(location=1) in vec4 color0;
...
@end

const sg_pipeline_desc desc = {
    .layout = {
        .attrs = {
            [0].format = ...,
            [1].format = ...,
        },
    },
    ...
};

…note though that it’s still not allowed to have gaps in the vertex attribute slots (this may be supported at a later time).

When not using sokol-shdc:

The interior of sg_shader_desc has changed to match the new ‘shader-stage-agnostic’ sokol-gfx binding model. The toplevel-structure now looks like this:

const sg_shader_desc desc = {
    .vertex_func = { ... },         // vertex shader source or bytecode
    .fragment_func = { ... },       // fragment shader source or bytecode
    .attrs = { ... },               // vertex attribute reflection info
    .uniform_blocks = { ... },      // reflection info for uniform block bindings
    .storage_buffers = { ... },     // reflection info for storage buffer bindings
    .images = { ... },              // reflection info for texture bindings
    .samplers = { ... },            // reflection info for sampler bindings
    .image_sampler_pairs = { ... }, // how images and samplers are used together in the shader
};

The array indices in the uniform_blocks[] array match the ub_slot parameter in the sg_apply_uniforms() call:

sg_shader_desc.uniform_blocks[N] => sg_apply_uniforms(N, ...)

The array indices in the storage_buffers[], images[] and samplers[] arrays match the respective indices in the sg_bindings struct:

sg_shader_desc.images[N] => sg_bindings.images[N]
sg_shader_desc.samplers[N] => sg_bindings.samplers[N]
sg_shader_desc.storage_buffers[N] => sg_bindings.storage_buffers[N]

Fields that are only required for a specific 3D backend now have consistent prefixes:

D3D11/HLSL: hlsl_*
GL/GLSL: glsl_*
Metal/MSL: msl_*
WebGPU/WGSL: wgsl_*

The resource binding slots now require two new types of information:

the shader stage this resource binding appears on
a 3D backend specific bindslot

The backend specific bindslot struct members need to be filled with the shader language specific resource bindslot numbers which also need to lie within specific ranges:

for uniform block items:
- .hlsl_register_b_n = N; <= HLSL register(bN) where (N >= 0) && (N < 8)
- .msl_buffer_n = N; <= MSL [[buffer(N)]] where (N >= 0) && (N < 8)
- .wgsl_group0_binding_n = N; <= WGSL @group(0) @binding(N) where (N >= 0) && (N < 8)
for images:
- .hlsl_register_t_n = N; <= HLSL register(tN) where (N >= 0) && (N < 24)
- .msl_texture_n = N; <= MSL [[texture(N)]] where (N >= 0) && (N < 16)
- .wgsl_group1_binding_n = N; <= WGSL @group(1) @binding(N) where (N >= 0) && (N < 128)
for samplers:
- .hlsl_register_s_n = N; <= HLSL register(sN) where (N >= 0) && (N < 16)
- .msl_sampler_n = N; <= MSL [[sampler(N)]] where (N >= 0) && (N < 16)
- .wgsl_group1_binding_n = N; <= WGSL @group(1) @binding(N) where (N >= 0) && (N < 128)
for storage buffers:
- .hlsl_register_t_n = N; <= HLSL register(tN) where (N >= 0) && (N < 24)
- .msl_buffer_n = N; <= MSL [[buffer(N)]] where (N >= 8) && (N < 16)
- .wgsl_group1_binding_n = N; <= WGSL @group(1) @binding(N) where (N >= 0) && (N < 128)
- .glsl_binding_n = N; <= GLSL layout(binding=N) where (N >= 0) && (N < 16)

These backend-specific bindslots allow a more flexible mapping from the sokol-gfx resource binding model to the backend 3D-API binding models, but there are still some restrictions (which typically exist to allow a more efficient resource binding implementation in sokol_gfx.h):

in WebGPU/WGSL, all uniform blocks must be in @group(0) and all other resource types in @group(1)
in Metal/MSL, the [[buffer(N)]] slots 0..7 are reserved for uniform blocks, and [[buffer(N)]] slots 8..15 are reserved for storage buffers
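
As an illustration of the Metal restriction, this is one way the split of the [[buffer(N)]] slots could be expressed in C. The helper names are hypothetical, and mapping the i-th storage buffer to slot 8+i is just an assumption for this sketch (any slot in 8..15 is allowed for storage buffers):

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical names, not part of sokol_gfx.h.
typedef enum { BUF_UNIFORM_BLOCK, BUF_STORAGE_BUFFER } buf_kind_t;

// Checks an MSL [[buffer(N)]] index against the reserved ranges:
// 0..7 for uniform blocks, 8..15 for storage buffers.
static bool msl_buffer_slot_valid(buf_kind_t kind, int n) {
    if (kind == BUF_UNIFORM_BLOCK) {
        return (n >= 0) && (n < 8);
    }
    return (n >= 8) && (n < 16);
}

// Returns an MSL [[buffer(N)]] index for the i-th uniform block or storage
// buffer, or -1 if i is out of range. Offsetting storage buffers by 8 is an
// assumption made for this sketch.
static int msl_buffer_slot(buf_kind_t kind, int i) {
    if ((i < 0) || (i >= 8)) {
        return -1;
    }
    const int n = (kind == BUF_UNIFORM_BLOCK) ? i : 8 + i;
    assert(msl_buffer_slot_valid(kind, n));
    return n;
}
```

The resulting number is what would go into the msl_buffer_n member of the respective uniform block or storage buffer item.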

For code examples, check out the backend-specific samples:

for Metal: https://github.com/floooh/sokol-samples/tree/master/metal
for D3D11: https://github.com/floooh/sokol-samples/tree/master/d3d11
for desktop GL: https://github.com/floooh/sokol-samples/tree/master/glfw
for WebGL/GLES3: https://github.com/floooh/sokol-samples/tree/master/html5
for WebGPU: https://github.com/floooh/sokol-samples/tree/master/wgpu

…and that should be it! Next big thing on the roadmap: compute shader support :)

Zig and Emulators

Sat, 24 Aug 2024 00:00:00 +0000

Some quick Zig feedback in the context of a new 8-bit emulator project I started a little while ago:

https://github.com/floooh/chipz

Currently the project consists of:

a cycle-stepped Z80 CPU emulator (similar to the emulator described here: https://floooh.github.io/2021/12/17/cycle-stepped-z80.html)
chip emulators for Z80 PIO, Z80 CTC and three variants of the AY-3-8910 sound chip
system emulators for Bombjack, Pengo and Pacman arcade machines, and the East German KC85/2../4 home computer series
a code generation tool to create the Z80 instruction decoder code block
various tests to check Z80 emulation correctness

With the exception of an external C dependency for ‘host system glue’ (the cross-platform sokol headers used for wrapping the platform-specific windowing, input, rendering and audio-output code), the project is around 16 kloc of pure Zig code.

I’m not yet sure how this new project will evolve in relation to the original C/C++ ‘chips’ emulator project, but I expect that the Zig project will overtake the C/C++ project at some point in the future.

Dev Environment

I’m coding on an M1 Mac in VSCode with the Zig Language Extension, and CodeLLDB for step-debugging.

The Zig and ZLS (Zig Language Server) installation is managed with ZVM.

For the most part this setup works pretty well, with a few tweaks:

I’m doing ‘build-on-save’ to get more complete error information as described here: Improving Your Zig Language Server Experience (I’m not bothering with creating separate non-install build targets though)
With the default Zig VSCode extension settings I was seeing that in long coding sessions (5..6 hours or so) saving would take longer and longer until it would eventually get stuck. After asking around on the Zig Discord this could be solved by explicitly setting the Zig Language Server as ‘VSCode Formatting Provider’ in the Zig Extension settings.
When debugging, there’s a somewhat annoying issue that the debug line information seems to be off in some places, the debugger appears to step into the last line of an inactive if-else block for instance. Again, Discord to the rescue, this seems to be a known issue.

All in all, not yet perfect, but good enough to get shit done.

Zig Comptime and Generics

Before diving into language details, I’ll need to provide some minimal background information of how the chipz emulators work:

Microchips of the 70s and 80s were very much like ‘software libraries, but implemented in hardware’, they followed a minimal standard for interoperability so that chips from different manufacturers could be combined into computer systems without requiring too much custom glue logic between them. I think it’s fair to say that this ‘competition through interoperability’ was the main driver for the Cambrian Explosion of cheap 8-bit computer systems in the 70s and 80s.

Microchips communicate with the outside world via input/output pins, and a typical 8-bit home computer system is essentially just a handful of microchips talking to each other through their ‘pin API’.

The chipz project follows that same idea: The basic building blocks are self-contained chip emulators which communicate with other chip emulators via virtual input/output pins which are mapped to bits in an integer.

Chips of that era typically had up to 40 pins which makes them a good fit for 64-bit integers used in today’s CPUs.

The API of such a chip emulator only has one important function:

pub fn tick(pins: u64) u64

This tick function executes exactly one clock cycle, it takes an integer as input where the bits represent input/output pins, and returns that same integer with modified bits.

Fitting a CPU emulator into such a ‘cycle-stepped model’ can be a bit of a challenge and is described in these blog posts (for the 6502 and Z80):

A whole computer system is then emulated by writing a ‘system tick function’ which emulates a single clock cycle for the whole system by calling the tick functions of each chip emulator and passing pin-state integers from one chip emulator to the next.
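
To illustrate the idea in C (the language of the original chips project), here is a toy sketch of two made-up chips and a system tick function. The pin layout, the chips and all names are invented for this example and don’t correspond to any real chip emulator:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical pin layout: both toy chips share one 64-bit pin-state integer.
#define PIN_D0    (0)                         // data bus occupies bits 0..7
#define DBUS_MASK (UINT64_C(0xFF) << PIN_D0)
#define PIN_WR    (UINT64_C(1) << 8)          // 'write' control pin

typedef struct { uint8_t count; } counter_t;  // toy chip: 8-bit counter
typedef struct { uint8_t value; } latch_t;    // toy chip: 8-bit latch

// counter: increments once per tick, puts the new count on the data bus
// and activates the WR pin
static uint64_t counter_tick(counter_t* c, uint64_t pins) {
    assert(c);
    c->count++;
    pins = (pins & ~DBUS_MASK) | ((uint64_t)c->count << PIN_D0);
    return pins | PIN_WR;
}

// latch: stores the data bus value whenever the WR pin is active
static uint64_t latch_tick(latch_t* l, uint64_t pins) {
    assert(l);
    if (pins & PIN_WR) {
        l->value = (uint8_t)((pins & DBUS_MASK) >> PIN_D0);
    }
    return pins;
}

typedef struct {
    counter_t counter;
    latch_t latch;
    uint64_t pins;
} system_t;

// system tick: one clock cycle for the whole system, the pin-state integer
// is handed from one chip emulator to the next
static void system_tick(system_t* sys) {
    assert(sys);
    uint64_t pins = sys->pins & ~PIN_WR;   // clear WR from the previous cycle
    pins = counter_tick(&sys->counter, pins);
    pins = latch_tick(&sys->latch, pins);
    sys->pins = pins;
}
```

Because both toy chips agree on the bit positions of the data bus and the WR pin, the system tick function doesn’t need to move any bits around between the two tick calls.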

There are two related problems to solve with the above approach:

There aren’t enough bits in a 64-bit integer to assign one bit for each inter-chip connection of a complete computer system. This means a system tick function will need to maintain one pin-state integer for each chip, and shuffle bits around before each chip’s tick function is called.
For direct pin-to-pin connections it makes sense to assign the same bit position in different chip emulators to avoid ‘runtime bit shuffling’ from an output pin position of one chip to a different input pin position of another chip. Those direct pin-to-pin connections are different in each emulated computer system, so to make this idea work a specialized chip emulator needs to be ‘stamped out’ for each computer system.

Both problems can be solved quite elegantly in Zig:

Instead of 64-bit integers for the pin-state we can switch to wide integers (u128, u192, u256, …) with enough bits to assign each chip in a system its own reserved bit range instead of juggling with multiple 64-bit integers.
With Zig’s comptime generics it’s possible to stamp out chip emulators which are specialized by a specific mapping of pins to bit positions in the shared wide integer.

This means a chip emulator is specialized by two comptime configuration values:

a Bus type which is an unsigned integer type with enough bits for all pin-to-pin connections in a system
a Pins structure which defines a bit position for each input/output pin of a chip emulator

For the Z80 CPU emulator this pin definition struct looks like this:

pub const Pins = struct {
    DBUS: [8]comptime_int,
    ABUS: [16]comptime_int,
    M1: comptime_int,
    MREQ: comptime_int,
    IORQ: comptime_int,
    // ...more pins...
};

…which is used as nested struct in a TypeConfig struct which holds all generic parameters to stamp out a specialized Z80 emulator:

pub const TypeConfig = struct {
    pins: Pins,
    bus: type,
};

This TypeConfig struct is used as parameter for a comptime Zig function which returns a specialized type (this is how Zig does generics):

pub fn Type(comptime cfg: TypeConfig) type {
  return struct {
    // the returned struct is a new type which is comptime-configured
    // by the 'cfg' type configuration parameter
  };
}

…now we can stamp out a Z80 CPU emulator that’s specialized for a specific computer system by the system bus integer type and the Z80 pins mapped to specific bit positions of this integer type:

const z80 = @import("z80");

const Z80 = z80.Type(.{
  .bus = u128,
  .pins = .{
    .DBUS = .{ 0, 1, 2, 3, 4, 5, 6, 7 },
    .ABUS = .{
      8, 9, // ...
    },
    // ...
  }
});

This specific Z80 type uses a 128-bit pin-state integer and maps its own pins to bit positions starting at bit 0, with the first 8 bits being the data bus (most other chips in any computer system will also map their data bus pins to the same bit range, since the data bus is usually shared between all chips in a system).

Note that Z80 is just a type, not a runtime object. To get a default-initialized Z80 CPU object:

var cpu = Z80{};

This example doesn’t look like much, it’s “just Zig code” after all, but this is exactly what makes generic programming in Zig so elegant and powerful.

Arbitrarily complex comptime config options can be ‘baked’ into types, and dynamic runtime configuration options can be passed in a ‘construction’ function on that type, and all is just regular Zig code from top to bottom:

var obj = Type(.{
  // comptime options...
  .bus = u128,
  .pins = .{ ... },
}).init(.{
  // additional runtime options...
});

…and this is just scratching the surface. There’s a couple of really interesting side effects of this 2-step approach (first build the type, then build an object from that type):

Can use designated-init-syntax for configuring the type which is just *chef’s kiss* because it makes the code very readable (no guessing what a generic parameter actually does because the name is right there in the code).
TypeConfig structs can be composed by nesting other TypeConfig structs, or generic parameters in general, which then can be used to build types inside types (Yo Dawg…).
It’s possible to build different struct interiors based on comptime parameters (for instance the different KC85 models have different runtime-config struct interiors for configuring model-specific features, which makes ‘accidental misconfiguration’ an immediate compile error).

In conclusion, the idea to use Zig’s comptime features to stamp out specialized per-system chip and system emulators works exceptionally well and is (IMHO) much more enjoyable than C++ or Rust generic programming (I’m sure C++ and Rust can do the same things with sufficient template magic, but this code definitely won’t look as straightforward as the Zig version).

Bit Twiddling and Integer Math can be awkward

This section is hard to write because it’s criticizing without offering an obviously better solution, please read it as ‘constructive criticism’. Hopefully Zig will be able to fix some of those things on the road towards 1.0.

Zig’s integer handling is quite different from C:

arbitrary bit-width integers are the norm, not the exception
there is no concept of integer promotion in math expressions (not that I noticed at least)
implicit conversion between different integer types is only allowed when no data loss can happen (e.g. an u8 can be assigned to an u16, but assigning an u16 to an u8 requires an explicit cast)
mixing signed and unsigned values in expressions isn’t allowed
overflow is checked in Debug and ReleaseSafe mode, and there are separate operators for ‘intended wraparound’

At first glance these features look pretty nice because they fix some obvious footguns in C and C++. Arbitrary width integer types are especially useful for emulator code, because hardware chips are full of ‘odd-width’ counters and registers (3, 5, 20 bits etc…). Directly mapping such registers to types like u3, u5 or u20 should potentially allow for more readable and ‘expressive’ code.

Unfortunately, in reality it’s not so clear cut. While C is definitely too sloppy when it comes to integer math, Zig might swing the pendulum a bit too far into the other direction by requiring too much explicit casting.

The most extreme example I stumbled over was implementing the Z80’s indexed addressing mode (e.g. those instructions involving (IX+d) or (IY+d)). This takes the byte d and adds it as a signed quantity and with wraparound to a 16-bit address (i.e. the byte is sign-extended to a 16-bit value before the addition).

In C this is quite straightforward:

uint16_t addi8(uint16_t addr, uint8_t offset) {
  return addr + (int8_t)offset;
}

The simplest way I could come up with to do the same in Zig is:

fn addi8(addr: u16, offset: u8) u16 {
  return addr +% @as(u16, @bitCast(@as(i16, @as(i8, @bitCast(offset)))));
}

Note how the integer conversion gets totally drowned in ‘@-litter’.
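To see what each builtin in that cast chain contributes, here is a C sketch which spells out the same steps one by one. The function name addi8_steps is made up for this illustration, and the first cast assumes the usual two's complement behaviour when converting a byte above 127 to int8_t:

```c
#include <stdint.h>

// Hypothetical step-by-step variant of addi8(), each line mirrors one
// builtin of the Zig cast chain.
uint16_t addi8_steps(uint16_t addr, uint8_t offset) {
    // reinterpret the byte as signed, like @bitCast to i8
    // (assumes two's complement conversion)
    int8_t s8 = (int8_t)offset;
    // sign-extend to 16 bits, like @as(i16, ...)
    int16_t s16 = s8;
    // reinterpret as unsigned, like @bitCast to u16
    uint16_t u16 = (uint16_t)s16;
    // wraparound add, like +% (the sum is computed in int, the cast
    // truncates it back to 16 bits)
    return (uint16_t)(addr + u16);
}
```

With an offset of 0xFF (which is -1 as a signed byte) the address 0x1000 becomes 0x0FFF, and 0x0000 wraps around to 0xFFFF.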

Both functions result in the same x86 and ARM assembly output (with -O3 for C and any of the Release modes in Zig):

addi8:
  movsx eax, sil  ; move low byte of esi into eax with sign-extension
  add eax, edi    ; eax += edi
  ret

For ARM (looks like ARM handles the sign-extension right in the add instruction, not very RISC-y but neat!):

addi8:
  add w0, w0, w1, sxtb
  ret

IMHO when the assembly output of a compiler looks so much more straightforward than the high level compiler input, it becomes a bit hard to justify why high level programming languages were invented in the first place ;)

Apart from that extreme case (which only exists once in the whole code base), narrowing conversions are much more common when writing code that mixes different integer widths, and those narrowing conversions require explicit casts, and those explicit casts may reduce readability quite a bit.

The basic idea to only allow implicit conversions that can’t lose data is definitely a good one, but very often a cast is required even though the compiler has all the information it needs at compile time to prove that no information is lost.

For instance this Zig code currently is an error:

fn trunc4(val: u8) u4 {
  return val & 0xF;
}

The expression result would fit into an u4, yet an @intCast or @truncate is required to make it work:

fn trunc4(val: u8) u4 {
  return @intCast(val & 0xF);
}

Similar situation with a right-shift:

fn broken(val: u8) u4 {
  return val >> 4;
}

fn works(val: u8) u4 {
  return @truncate(val >> 4);
}

Somewhat surprisingly, this works fine though:

  const a: u8 = 0xFF;
  const b: u4 = a & 0xF;
  const c: u4 = a >> 4;

A similar problem exists with loop variables, which are always of type usize and which need to be explicitly narrowed even if the loop count is guaranteed to fit into a smaller type:

for (0..16) |_i| {
  const i: u4 = @intCast(_i);
}

There’s also surprising cases like this:

Assuming that:

a: u16 = 0xF000
b: u16 = 0x1000
c: u32 = 0x10000

This expression creates an overflow error:

  const d = a + b + c;

…but this doesn’t:

  const e = c + a + b;

The types of d and e are both u32 btw (which I also find a bit surprising: it means that Zig already picks the widest input type as the result type, but it doesn’t promote the other inputs to this widest type).
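For comparison, here is a small C sketch of the same two expressions (the function names are made up for this example). In C the two u16 inputs are promoted to int before the first addition, so the operand order doesn't matter. This assumes that int is at least 32 bits wide:

```c
#include <stdint.h>

// a and b are promoted to int before the first addition, so a + b
// cannot wrap around at 16 bits (assumes int has at least 32 bits)
uint32_t sum_abc(uint16_t a, uint16_t b, uint32_t c) {
    return a + b + c;
}

// same inputs in a different order, same result
uint32_t sum_cab(uint16_t a, uint16_t b, uint32_t c) {
    return c + a + b;
}
```

With a = 0xF000, b = 0x1000 and c = 0x10000 both functions return 0x20000.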

And here’s another surprising behaviour I stumbled over:

// self.sprite_coords[] is an array of bytes
const px: usize = 272 - self.sprite_coords[sprite_index * 2 + 1];

This produces the error error: type 'u8' cannot represent integer value '272'. Why Zig tries to fit the constant 272 into an u8 instead of picking a wider type is a bit of a mystery tbh.

One solution is to widen the value read from the array:

const px: usize = 272 - @as(usize, self.sprite_coords[sprite_index * 2 + 1]);

But this works too:

const px: usize = @as(u9, 272) - self.sprite_coords[sprite_index * 2 + 1];
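The C version of that expression needs no widening at all, because the byte is promoted to int before the subtraction. A minimal sketch (sprite_px is a hypothetical helper, not code from the emulator):

```c
#include <stddef.h>
#include <stdint.h>

// coord is promoted to int before the subtraction, so 272 - coord is
// computed in int and always lands in the range 17..272 for a byte input
size_t sprite_px(uint8_t coord) {
    return (size_t)(272 - coord);
}
```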

In conclusion, I only understood that C’s integer promotion actually has an important purpose after missing it so badly in Zig :D

I think C’s main problem with integer promotion is that it promotes to int, and that int is stuck at 32 bits even on 64-bit CPUs (not moving the int type to 64 bits during the transition from 32- to 64-bit CPUs was a pretty stupid decision in hindsight).

TBF though, just extending to the natural word size (e.g. 64 bits) wouldn’t help much in Zig when using wide integers like u128.

In any case, I hope that the current status quo isn’t what ends up in Zig 1.0 and that a way can be found to reduce ‘@-litter’ in mixed-width integer expressions without going back entirely to C’s admittedly too sloppy integer promotion and implicit conversion rules.

Asking around on the Zig Discord there seems to be a proposal which lets operators narrow the result type for comptime known values (which if I understand it right would make the result type of the expression a & 0xF a u4 instead of whatever wider type a is).

Another idea that might make sense is to promote integers to the widest input type. Currently the compiler already seems to use the widest input type in an expression as the result type, so promoting the other inputs to this widest type looks like a logical step to me.

I would keep the strict separation of signed and unsigned integer types though, e.g. mixed-sign expressions are not allowed, and any theoretical integer promotion should never happen ‘across signedness’.

From my own experience in C (where I don’t allow implicit sign-conversion via -Wsign-conversion warnings) I can tell that this will feel painful in the beginning for C and C++ coders, but it makes for better code and API design in the long run.

This experience (of transitioning to more restrictive but also more correct C code by enabling certain warnings) is also why I’m giving Zig some slack about its integer conversion strictness. After all, maybe I’m just not used to it yet. But OTOH, I have by now written enough Zig code that I should slowly be getting used to it, and it still feels bumpy. All in all I think this is an area where ‘strict design purity’ can harm the language in the long run though, and a better balance should be found between strictness, coding convenience and readability.

Using wide integers with bit twiddling code is fast

Using a 128 bit integer variable for the emulator system bus works nicely and doesn’t have a relevant performance impact. In fact, with a bit of care (by not using bit twiddling operations that cross a 64-bit boundary) the produced assembly code is identical to doing the same operation on a simple 64-bit variable.

For instance extracting an 8-bit value from the upper half of a 128-bit integer:

fn getu8(val: u128) u8 {
  return @truncate(val >> 64);
}

…is just moving the register which holds the upper 64 bits into the return value register:

getu8:
  mov rax, rsi
  ret

…which is the same cost as extracting an 8-bit value from a 64-bit variable:

fn getu8(val: u64) u8 {
  return @truncate(val);
}

getu8:
  mov rax, rdi
  ret

…just make sure that the operation doesn’t cross 64-bit boundaries:

fn getu8(val: u128) u8 {
  return @truncate(val >> 60);
}

…because this now involves actual bit twiddling:

getu8:
  shl esi, 4
  shr rdi, 60
  lea eax, [rdi + rsi]
  ret
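The difference between the two cases is easier to see when the 128-bit value is modelled as two 64-bit halves, which is roughly what the compiler has to do on a 64-bit CPU. The following C sketch uses a made-up bus128_t struct for this, it is not how Zig represents u128 internally:

```c
#include <stdint.h>

// Hypothetical model of a 128-bit value as two 64-bit halves
typedef struct {
    uint64_t lo;    // bits 0..63
    uint64_t hi;    // bits 64..127
} bus128_t;

// like @truncate(val >> 64): only the upper half is involved
uint8_t getu8_at64(bus128_t val) {
    return (uint8_t)val.hi;
}

// like @truncate(val >> 60): bits 60..63 come from the lower half and
// bits 64..67 from the upper half, so both halves need to be combined
uint8_t getu8_at60(bus128_t val) {
    return (uint8_t)((val.lo >> 60) | (val.hi << 4));
}
```

The second function has the same shape as the assembly output above: one shift per half, then a combine.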

Debug Performance

Release performance of my C emulator code (with -O3) and my Zig code (with -ReleaseFast) is roughly in the same ballpark, but I’m seeing a pretty big difference in Debug performance:

in C, debug performance is roughly 2x slower than -O3
in Zig, debug performance is roughly 3..4x slower than ReleaseFast

I haven’t figured out why yet, but it’s not the most obvious candidate (range and overflow checks) since ReleaseSafe performance is nearly identical with ReleaseFast (interestingly ReleaseSmall is the slowest Release build config, it’s about 40% slower than both ReleaseFast and ReleaseSafe).

One important difference between my C and Zig code is that in C I’m using tons of small preprocessor macros to make bit twiddling expressions more readable. In Zig these are replaced with inline functions (inline in Zig isn’t just an optimization hint, it causes the function body to be inlined also in debug mode).

At first glance Zig’s inline functions seem to be a good replacement for C preprocessor macros, but when looking at the generated code in debug mode, the compiler still pushes and pops function arguments through the stack even though the function body is inlined.

Consider this Zig code:

inline fn add(a: u8, b: u8) u8 {
    return a +% b;
}

fn add_1(a: u8, b: u8) u8 {
    return add(a, b);
}

fn add_2(a: u8, b: u8) u8 {
    return a +% b;
}

…in release mode, both functions produce the same code as expected:

add_1:
  lea eax, [rsi + rdi]
  ret

add_2:
  lea eax, [rsi + rdi]
  ret

But in debug mode, the function which calls the inline function has a slightly higher overhead because of additional stack traffic:

add_1:
  push rbp
  mov rbp, rsp
  sub rsp, 5
  mov cl, sil
  mov al, dil
  mov byte ptr [rbp - 4], al
  mov byte ptr [rbp - 3], cl
  mov byte ptr [rbp - 2], al
  mov byte ptr [rbp - 1], cl
  add al, cl
  mov byte ptr [rbp - 5], al
  mov al, byte ptr [rbp - 5]
  movzx eax, al
  add rsp, 5
  pop rbp
  ret

add_2:
  push rbp
  mov rbp, rsp
  sub rsp, 2
  mov cl, sil
  mov al, dil
  mov byte ptr [rbp - 2], al
  mov byte ptr [rbp - 1], cl
  add al, cl
  movzx eax, al
  add rsp, 2
  pop rbp
  ret

TBH though it’s unlikely that inline function overhead is the only contributor to the slower debug performance, but it could be many such small papercuts combined.
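For reference, the kind of small C bit twiddling macro mentioned above might look like the following sketch. The macro names and the pin layout (data bus on bits 0..7, address bus on bits 8..23) are assumptions made up for this example. Since the preprocessor expands such macros textually, there are no function arguments which the compiler would need to pass around:

```c
#include <stdint.h>

// Hypothetical pin layout: data bus on bits 0..7, address bus on bits 8..23

// extract the data bus byte from a 64-bit pin mask
#define PINS_GET_DATA(p)    ((uint8_t)((p) & 0xFFULL))
// return a new pin mask with the data bus byte replaced
#define PINS_SET_DATA(p, d) (((p) & ~0xFFULL) | ((uint64_t)(d) & 0xFFULL))
// extract the 16-bit address from a 64-bit pin mask
#define PINS_GET_ADDR(p)    ((uint16_t)(((p) >> 8) & 0xFFFFULL))
```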

Conclusion

I enjoy working with Zig immensely despite the few warts I encountered, for the most part the code just ‘flows out of the hand’ which IMHO is an important property of a programming language. It’s encouraging to see how areas which were a bumpy ride during the 0.10 to 0.11 versions have improved and stabilized (most importantly the build and package management system).

It’s also interesting how the ‘most popular design fault’ that comes up in every single Zig discussion (currently that’s ‘unused variables are errors’) is a complete non-issue (for me at least, not once in that 16-kloc project was that an annoyance), while the issue that actually mildly annoyed me in real world code (the @-litter in mixed-width integer expressions) is still very much under the radar. Maybe also because mixed-width and bit twiddling code might not be all that common in typical Zig projects, most integer code is probably about computing array indices or data offsets and happens in usize.

I also completely left out a whole chapter about code generation with Zig (which would have been mostly about string processing and memory management), simply because the blog post would have become too big, and it is probably an interesting enough topic for its own blog post. This is also an area where Zig is different enough from C, mid-level languages like C++ or Rust, and high level memory-managed languages that I don’t feel quite confident enough yet to have found the right solution to questions like ‘who owns the underlying memory of a slice returned from a function’ - I have solutions of course, but I’m not entirely happy with them because it feels like a throwback to my first forays into C and C++.

In short, I don’t want to burden myself (too much) with memory ownership questions, even in low level systems programming languages. Typically in C I avoid such problems with a ‘mostly value-driven approach’: instead of returning references to data, I return a copy of the data (unless of course it’s about bulk data like images, 3d meshes, file content etc., but those are special cases which are easy to deal with using manual memory management).

Zig is leaning in heavily on slices though, which are just pointer/size pairs without any concept of ownership. It would be nice if Zig had some syntax sugar to make working with arrays just as flexible as with slices, because arrays are value types and avoid all the ownership footguns of slices. I think mostly this comes down to implementing a handful of ‘missing features’ from C99 designated initialization (like #6068) or maybe even looking at languages like JS and TS (…shock and gasps from the audience!!! I know but bear with me) for a couple of features which make working with struct and array values more convenient (like destructuring and spreading).

…but I’m already halfway into that other blog post which I wanted to avoid, so let’s end it here lol.

Upcoming Sokol header API changes (May 2024)

Mon, 06 May 2024 00:00:00 +0000

Aka: “the storage buffer update”

In a couple of days I will merge the next sokol-gfx feature update which adds initial storage buffer support. The update also affects other headers and tools (most notably sokol_app.h, all headers with embedded shaders, and sokol-shdc - the cross-backend shader compiler).

The bad news first:

This is ‘gpu-readonly’ support, e.g. it’s not possible (yet) to write to storage buffers from shader code, gpu-write support will come in a future ‘compute shaders’ update.
The following platform/backend combos don’t get storage buffer support:
- all GLES3 backends (WebGL2, iOS+GLES3, Android): for WebGL2 and iOS there is no other choice since they are stuck with GLES 3.0, for Android, storage buffer support may be added later
- macOS+GL: macOS is stuck at GL 4.1, while storage buffers require at least GL 4.3
This leaves the following platform/backend combos which support storage buffers:
- macOS + Metal
- iOS + Metal
- Windows + D3D11
- Windows + GL
- Linux + GL
- Web + WebGPU

Storage buffers provide a convenient way to communicate large array-like data to shaders (the minimum guaranteed size for storage buffers is 128 MBytes), for instance:

for ‘vertex pulling’ to load per-vertex and/or per-instance data from storage buffers instead of relying on the fixed function vertex input stage
as a more convenient and flexible way to load random access data in shaders compared to the old-school way of using ‘data textures’.

…and as a ‘drive-by’ feature: sokol-gfx now finally allows kicking off a draw call without any resource bindings and instead synthesizing vertices ‘out of thin air’ in the vertex shader.
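To illustrate what ‘out of thin air’ means here: a vertex shader can compute its output position from nothing but the vertex index. The following C sketch shows the arithmetic for a commonly used fullscreen-triangle trick. It is not taken from the sokol samples, and the type and function names are made up for this example:

```c
#include <stdint.h>

typedef struct { float x, y; } vec2_t;

// Derive a clip-space position purely from the vertex index (0, 1, 2),
// similar to what a vertex shader could do with gl_VertexIndex.
vec2_t fullscreen_triangle_pos(uint32_t vertex_index) {
    float u = (float)((vertex_index << 1) & 2);  // 0, 2, 0
    float v = (float)(vertex_index & 2);         // 0, 0, 2
    // map 0..2 to -1..3, the resulting triangle covers the whole
    // -1..+1 clip-space square
    return (vec2_t){ u * 2.0f - 1.0f, v * 2.0f - 1.0f };
}
```

The three vertex indices produce the positions (-1,-1), (3,-1) and (-1,3).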

The root PR for the update is here: #1007.

New sample code

The following backend-agnostic samples have been added (those use sokol_app.h and sokol-shdc).

NOTE: You’ll need a recent Chrome for the WebGPU sample links to work, also expect some general breakage and rendering artifacts depending on the platform (for instance Chrome on Android straight up crashes the tab on most samples). Also please note that the source code links in those samples will not be valid until all the update PRs have been merged.

triangle-bufferless-sapp: this demonstrates rendering without buffers (and is the only new sample that also works on backends without storage buffer support):
- WebGPU: triangle-bufferless-sapp.html
- C code: sapp/triangle-bufferless-sapp.c
- GLSL code: sapp/triangle-bufferless-sapp.glsl
vertexpull-sapp: the cube-sapp sample ported to vertex pulling:
- WebGPU: vertexpull-sapp.html
- C code: sapp/vertexpull-sapp.c
- GLSL code: sapp/vertexpull-sapp.glsl
sbuftex-sapp: a sample which uses a storage buffer in the fragment shader stage:
- WebGPU: sbuftex-sapp.html
- C code: sapp/sbuftex-sapp.c
- GLSL code: sapp/sbuftex-sapp.glsl
instancing-pull-sapp: vertex pulling and instancing via storage buffers:
- WebGPU: instancing-pull-sapp.html
- C code: sapp/instancing-pull-sapp.c
- GLSL code: sapp/instancing-pull-sapp.glsl
ozz-storagebuffer-sapp: the ozz-skin sample rewritten to pull vertices, instance- and skinning-matrices from storage buffers:
- WebGPU: ozz-storagebuffer-sapp.html
- C code: sapp/ozz-storagebuffer-sapp.cc
- GLSL code: sapp/ozz-storagebuffer-sapp.glsl

The following backend-specific samples demonstrate how to use storage buffers without the sokol-shdc shader compiler:

D3D11: d3d11/vertexpulling-d3d11.c
Metal: metal/vertexpulling-metal.c
WebGPU: wgpu/vertexpulling-wgpu.c
desktop GL: glfw/vertexpulling-glfw.c

How to check for storage buffer support

To check for storage buffer support at runtime, call sg_query_features() and check the storage_buffer boolean in the result:

if (sg_query_features().storage_buffer) {
    // storage buffers are supported...
} else {
    // storage buffers are *NOT* supported...
}

Desktop GL version caveats (and a minor breaking change)

The sokol_gfx.h desktop-GL backend will now query what GL version it runs on to decide whether storage buffers are supported (storage buffers were added in GL 4.3).

The expected minimal version has been bumped to 4.1 on macOS and 4.3 on other platforms, this also means that sokol_app.h will now by default create a 4.1 context on macOS, and 4.3 context on other platforms.

Since the GL version is now flexible, the configuration define SOKOL_GLCORE33 doesn’t make much sense anymore and has been renamed to SOKOL_GLCORE. You’ll get a proper compile error when trying to build with the old SOKOL_GLCORE33 define.

Apart from rebuilding your shaders via an updated sokol-shdc, this is the only required change for existing code.

In sokol-shdc, the target language glsl330 has been removed and replaced with glsl410 and glsl430. When targeting the macOS GL backend, use glsl410, otherwise glsl430.

A simple vertex pulling example

First let’s rewrite the cube-sapp.glsl shader to pull vertices from a storage buffer instead of the fixed function vertex input.

The original shader declares the vertex input with vertex attributes:

in vec4 position;
in vec4 color0;

NOTE: the cube-sapp.glsl shader makes use of a fixed function vertex input feature which extends float[3] vertex data on the CPU side to vec4 with a w-component 1.0 on the GPU side. Magic like this isn’t supported when reading from storage buffers (as far as I’m aware at least).

For vertex pulling the input vertex attributes are replaced with a flexible-array struct inside a buffer interface block.

struct sb_vertex {
    vec3 pos;
    vec4 color;
};

readonly buffer ssbo {
    sb_vertex vtx[];
};

NOTE: I’m using sb_vertex for the struct name here because vertex is a reserved keyword in the Metal Shading Language and would cause a compile error when outputting MSL.

Do not use an attribute like layout(std430, binding=0) for the buffer interface block, sokol-shdc will take care of those details.

The original vertex shader looks like this:

void main() {
    gl_Position = mvp * position;
    color = color0;
}

Converted to vertex pulling it looks like this:

void main() {
    vec4 position = vec4(vtx[gl_VertexIndex].pos, 1.0);
    gl_Position = mvp * position;
    color = vtx[gl_VertexIndex].color;
}

Note how gl_VertexIndex (not gl_VertexID!) is used to index into the storage buffer, this is because sokol-shdc shaders are written in ‘Vulkan style’, not ‘GL style’.

We also need to expand the vec3 input pos manually to a vec4 with w-component = 1.0.

That’s all the changes needed on the shader side. Next compile the modified shader with:

sokol-shdc -i shader.glsl -o shader.h -l metal_macos:hlsl5:glsl430:wgsl -f sokol

Apart from the ‘traditional’ code-generation output, sokol-shdc will create two new declarations:

A define #define SLOT_ssbo (0), this is the bind slot index to be used in the sg_bindings struct

A C struct sb_vertex_t which maps the GLSL struct sb_vertex to the C side looking like this:

  SOKOL_SHDC_ALIGN(16) typedef struct sb_vertex_t {
      float pos[3];
      uint8_t _pad_12[4];
      float color[4];
  } sb_vertex_t;

NOTE: with the right @ctype tags at the top of the shader we could also map the struct members to C or C++ types, for instance with HandmadeMath.h types:

SOKOL_SHDC_ALIGN(16) typedef struct sb_vertex_t {
    hmm_vec3 pos;
    uint8_t _pad_12[4];
    hmm_vec4 color;
} sb_vertex_t;
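The memory layout of that struct can be double-checked with a couple of compile-time assertions. The sketch below uses a stand-in struct (sb_vertex_sketch_t is a made-up name) with C11 _Alignas in place of the SOKOL_SHDC_ALIGN macro from the generated header, so that it compiles without any sokol headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in for the generated struct (hypothetical name), C11 _Alignas on
// the first member takes the place of SOKOL_SHDC_ALIGN(16)
typedef struct sb_vertex_sketch_t {
    _Alignas(16) float pos[3];
    uint8_t _pad_12[4];
    float color[4];
} sb_vertex_sketch_t;

// the vec3 pos occupies bytes 0..11, the padding fills bytes 12..15,
// so the vec4 color starts on a 16-byte boundary
static_assert(offsetof(sb_vertex_sketch_t, pos) == 0, "pos at offset 0");
static_assert(offsetof(sb_vertex_sketch_t, _pad_12) == 12, "padding at offset 12");
static_assert(offsetof(sb_vertex_sketch_t, color) == 16, "color at offset 16");
static_assert(sizeof(sb_vertex_sketch_t) == 32, "array stride is 32 bytes");
```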

Next let’s see how the cube-sapp C code needs to be changed:

The original code creates a vertex buffer like this:

float vertices[] = {
    -1.0, -1.0, -1.0,   1.0, 0.0, 0.0, 1.0,
     1.0, -1.0, -1.0,   1.0, 0.0, 0.0, 1.0,
    ...
};
sg_buffer vbuf = sg_make_buffer(&(sg_buffer_desc){
    .data = SG_RANGE(vertices),
    .label = "cube-vertices"
});

By default sg_make_buffer() creates a vertex buffer, so the above is identical with a more explicit:

sg_buffer vbuf = sg_make_buffer(&(sg_buffer_desc){
    .type = SG_BUFFERTYPE_VERTEXBUFFER,
    .data = SG_RANGE(vertices),
    .label = "cube-vertices"
});

…when changing the code to use storage buffers we can use the code-generated sb_vertex_t struct to initialize the vertex data. This has the advantage that we don’t need to care about the obscure std430 memory layout rules:

sb_vertex_t vertices[] = {
    { .pos = { -1.0, -1.0, -1.0 }, .color = { 1.0, 0.0, 0.0, 1.0 } },
    { .pos = {  1.0, -1.0, -1.0 }, .color = { 1.0, 0.0, 0.0, 1.0 } },
    ...
};
sg_buffer sbuf = sg_make_buffer(&(sg_buffer_desc){
    .type = SG_BUFFERTYPE_STORAGEBUFFER,
    .data = SG_RANGE(vertices),
    .label = "cube-vertices",
});

…note how the buffer type has changed to SG_BUFFERTYPE_STORAGEBUFFER.

On to the sg_pipeline object. In the original code, a vertex layout must be defined in the sg_pipeline_desc struct to configure the fixed function vertex input stage:

state.pip = sg_make_pipeline(&(sg_pipeline_desc){
    .layout = {
        .attrs = {
            [ATTR_vs_position].format = SG_VERTEXFORMAT_FLOAT3,
            [ATTR_vs_color0].format   = SG_VERTEXFORMAT_FLOAT4
        }
    },
    .shader = shd,
    .index_type = SG_INDEXTYPE_UINT16,
    .cull_mode = SG_CULLMODE_BACK,
    .depth = {
        .write_enabled = true,
        .compare = SG_COMPAREFUNC_LESS_EQUAL,
    },
    .label = "cube-pipeline"
});

When pulling vertex data from storage buffers such a vertex layout description isn’t needed, so the pipeline creation can be simplified to this:

state.pip = sg_make_pipeline(&(sg_pipeline_desc){
    .shader = shd,
    .index_type = SG_INDEXTYPE_UINT16,
    .cull_mode = SG_CULLMODE_BACK,
    .depth = {
        .write_enabled = true,
        .compare = SG_COMPAREFUNC_LESS_EQUAL,
    },
    .label = "cube-pipeline"
});

…the original sg_bindings struct that’s passed into sg_apply_bindings():

state.bind = (sg_bindings) {
    .vertex_buffers[0] = vbuf,
    .index_buffer = ibuf
};

…is changed like this (e.g. replace the vertex buffer binding with a storage buffer binding on the vertex shader stage):

state.bind = (sg_bindings) {
    .index_buffer = ibuf,
    .vs.storage_buffers[SLOT_ssbo] = sbuf,
};

…and that’s it! On the CPU side, storage buffers actually simplify a lot of code because you don’t need a vertex layout in the sg_pipeline_desc struct, and you get a properly aligned and padded C struct for the storage buffer content from sokol-shdc.

NOTE: A ‘proper’ cross-backend sample should also check whether storage buffers are actually supported via sg_query_features().storage_buffer and render some sort of fallback.

Shader Authoring Caveats

Shader authoring via sokol-shdc is a bit more restricted than vanilla GLSL:

A storage buffer interface block must contain exactly one item, and this item must be a flexible struct array member. In vanilla GLSL you can have additional ‘header items’ in front of the flexible array member, but this turned out tricky to map to CPU-side non-C languages that don’t allow flexible array members (I actually need to research the various target languages a bit more, maybe this rule can be relaxed in the future for some of the target languages).
Currently the following types are valid inside a storage buffer struct:
- bool, bvec2..4: mapped to int32_t, and int32_t[2..4]
- int, ivec2..4: mapped to int32_t, and int32_t[2..4]
- uint, uvec2..4: mapped to uint32_t, and uint32_t[2..4]
- float, vec2..4: mapped to float and float[2..4]
- matNxM where N=2..4 and M=1..4 mapped to float[2..64]
- nested structs
- arrays of the above

Please note that only a few of those combinations are tested, especially when it comes to correct array item padding and alignment. If you stumble over any problems please write a ticket at https://github.com/floooh/sokol-tools/issues.

To load packed vertex components from storage buffers, use the following GLSL builtins:

vec2 unpackUnorm2x16(uint p)
vec2 unpackSnorm2x16(uint p)
vec4 unpackUnorm4x8(uint p)
vec4 unpackSnorm4x8(uint p)
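On the CPU side the data needs to be packed in the matching bit layout. Here is a minimal C sketch of a packing function for unpackUnorm4x8(). The function names are made up for this example and are not part of any sokol header. The first vector component goes into the least significant byte:

```c
#include <stdint.h>

// Convert one float to an 8-bit unsigned-normalized value,
// inputs outside 0..1 are clamped
static uint32_t unorm8(float c) {
    if (c < 0.0f) { c = 0.0f; }
    if (c > 1.0f) { c = 1.0f; }
    return (uint32_t)(c * 255.0f + 0.5f);
}

// Pack four floats so that GLSL unpackUnorm4x8() returns them as
// (x, y, z, w), x goes into the least significant byte
uint32_t pack_unorm4x8(float x, float y, float z, float w) {
    return unorm8(x) | (unorm8(y) << 8) | (unorm8(z) << 16) | (unorm8(w) << 24);
}
```

For instance an opaque red color (1, 0, 0, 1) is packed into 0xFF0000FF.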

Under the hood

NOTE: the following information about shader bind slots is only relevant if you do not use the sokol shader compiler (sokol-shdc), but instead pass ‘raw’ HLSL, MSL, GLSL or WGSL shaders into sokol_gfx.h. Also, this information will become obsolete/irrelevant with another future update I have in mind which will allow more flexibility when mapping sokol-gfx bind slots to backend 3D API bind slots (see this planning ticket for more info: #1037).

Metal

On Metal there is no ‘buffer zoo’ like in other 3D APIs, uniform-, vertex-, index- and storage-buffers are all the same thing. The vertex- and fragment-shader stages have their own buffer bind slot spaces though.

The following bind slot ranges are used for the various sokol-gfx buffer types:

on the vertex shader stage:
- slots 0..3 for uniform buffer bindings (sokol-gfx internally manages a uniform buffer which might be bound at up to four different offsets)
- slots 4..11 for vertex buffer bindings
- slots 12..19 for storage buffer bindings
on the fragment shader stage:
- slots 0..3 for uniform buffer bindings
- slots 4..11 for storage buffer bindings

When authoring Metal shaders directly you’ll need to use the above bind slots (also see the low-level Metal backend samples).
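As a sketch, the mapping from a sokol-gfx storage buffer bind slot to the Metal buffer slot in the list above can be written as a tiny C helper. The enum and function names are hypothetical and not part of sokol_gfx.h:

```c
#include <assert.h>
#include <stdint.h>

// hypothetical names, not part of sokol_gfx.h
typedef enum { SKETCH_STAGE_VS, SKETCH_STAGE_FS } sketch_stage_t;

// Map a sokol-gfx storage buffer bind slot (0..7) to the Metal buffer slot
uint32_t mtl_storage_buffer_slot(sketch_stage_t stage, uint32_t index) {
    assert(index < 8);
    // vertex stage: 4 uniform buffer slots and 8 vertex buffer slots come first
    // fragment stage: only the 4 uniform buffer slots come first
    return (stage == SKETCH_STAGE_VS) ? (12 + index) : (4 + index);
}
```

The result is the number that goes into the [[buffer(n)]] attribute of the matching shader function argument in MSL.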

D3D11

On D3D11, so called Byte Address Buffers are used for storage buffers which makes their direct usage in manually written HLSL a bit awkward (but is not an issue when using sokol-shdc).

If this turns out to be a problem I might add D3D11-specific creation flags to sg_buffer_desc to allow using different D3D11 buffer and buffer-view types under the hood, details like this might also change again once compute shader support is added.

On D3D11 and HLSL storage buffers share a bind slot range with texture bindings, that’s why sokol-gfx defines the following bind ranges for textures and storage buffers in HLSL:

register(t0..t15): reserved for texture bindings
register(t16..t23): reserved for storage buffer bindings

Also see the low-level D3D11 backend samples for details.

WebGPU

Storage buffers are created with WGPUBufferUsage_Storage. WebGPU uses a common bind slot space across all shader resource types and shader stages. Sokol-gfx reserves the following bind slot ranges for the different shader stages and resource types, use those when feeding manually written WGSL shaders into sokol-gfx:

vertex shader stage:
- textures: @group(1) @binding(0..15)
- samplers: @group(1) @binding(16..31)
- storage buffers: @group(1) @binding(32..47)
fragment shader stage:
- textures: @group(1) @binding(48..63)
- samplers: @group(1) @binding(64..79)
- storage buffers: @group(1) @binding(80..95)

Also see the low-level WebGPU backend samples for details.
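The ranges above follow a regular pattern (16 slots per resource type, 48 slots per shader stage), so the @binding value for manually written WGSL can be computed like in this C sketch. The enum and function names are made up for this example:

```c
#include <assert.h>
#include <stdint.h>

// hypothetical names, not part of sokol_gfx.h
typedef enum { SKETCH_VS = 0, SKETCH_FS = 1 } sketch_stage_t;
typedef enum {
    SKETCH_TEXTURE = 0,
    SKETCH_SAMPLER = 1,
    SKETCH_STORAGE_BUFFER = 2
} sketch_res_t;

// Compute the @binding(n) value inside @group(1) for a sokol-gfx bind slot
uint32_t wgsl_binding(sketch_stage_t stage, sketch_res_t res, uint32_t index) {
    assert(index < 16);
    // each stage owns 48 consecutive bindings, split into three blocks
    // of 16 (textures, samplers, storage buffers)
    return (uint32_t)stage * 48u + (uint32_t)res * 16u + index;
}
```

For instance the first storage buffer on the fragment shader stage ends up at @group(1) @binding(80).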

GL

In GL, storage buffers are bound to the GL_SHADER_STORAGE_BUFFER target. Sokol-gfx does not lookup GLSL storage buffer interface blocks by name, but instead expects that the GLSL code that’s passed into sg_make_shader() uses a layout(std430, binding=N) annotation to define the bind slot.

The vertex- and fragment-shader stage use a common bind space:

on the vertex shader stage, use binding 0..7
on the fragment shader stage, use binding 8..15

Also see the low-level desktop GL backend samples for details.

sokol-shdc updates

Sokol-shdc has been massively refactored, mainly with the goal to have a more robust base for extracting reflection information from shaders and a more ‘structured’ approach to code generation so that supporting additional CPU-side languages will be easier in the future (I’m not yet sure if that last goal was actually achieved though, but time will tell).

Unfortunately this massive refactoring also means that there’s a possibility that new bugs have sneaked in. If you notice anything weird, please write tickets here:

https://github.com/floooh/sokol-tools/issues.

A couple of unrelated lingering bugs have been fixed as well:

C++ exceptions are now enabled and exceptions coming out of SPIRVCross are now caught and turned into proper error messages. Previously sokol-shdc would simply appear to crash if SPIRVCross emitted an error (because without C++ exceptions enabled, those errors would be turned into a panic which looks like a segfault).
Error and warning line numbers had been off by a couple of lines recently. This has been fixed and error messages now point to the correct line again.
A couple of somewhat esoteric code generation bugs in non-C code generators were fixed (but as I said, it’s also quite likely that I have introduced new bugs in that area, since code generators were completely rewritten)

What’s next:

In short:

A resource binding cleanup (see #1037), the main motivation for this is that the sg_bindings struct is growing quite large and would grow even larger if a new compute shader stage is added. Furthermore, the artificial separation of shader stages when binding resources also doesn’t map particularly well to some modern 3D APIs.
After that it’s finally time to tackle compute shaders. For this I need to come up with a resource synchronization strategy, but I will most likely just copy what WebGPU does.

But first I will probably take a little break and dabble a bit with Zig and emulator coding :)

Upcoming Sokol header API changes (Feb 2024)

Mon, 26 Feb 2024 00:00:00 +0000

In a couple of days I will merge the first big API update of 2024 for sokol_gfx.h (with some related changes in sokol_app.h, sokol_glue.h and sokol_gfx_imgui.h).

NOTE: most links to code examples will only point to the right code after PR #985 has been merged!

The API update in sokol_gfx.h is a BREAKING CHANGE for all code, but for most use cases the required changes are fairly minimal.

Apologies for the broken syntax highlighting, apparently Rouge doesn’t understand C99.

Table of Contents
Overview and Motivation
Detailed change list
Link collection with example code changes
Detailed Change Recipes
Q: Why still have a baked pass attachments object?

Overview and Motivation

The general topic of this update is a cleanup of the sokol-gfx render pass functions and how external swapchain information is passed into sokol-gfx.

Previously there was a special ‘default render pass’ into a ‘default framebuffer’, and the concept of ‘contexts’ to allow switching between different rendering contexts and their default framebuffers (very similar to traditional OpenGL contexts, and in fact this old behavior only ever matched OpenGL, but not the other backend APIs).

This setup was needlessly complicated for people who want to use sokol-gfx to render into multiple windows, leading to planning ticket #904, and then to PR #985.

The gist is:

There is now only a single ‘unified’ sg_begin_pass() function which covers both rendering into sokol-gfx render target textures (aka ‘offscreen passes’) and externally managed ‘swapchains’ (aka ‘swapchain passes’).
The entire concept of contexts has been removed from sokol_gfx.h.
External swapchain properties are now passed directly into sg_begin_pass() in a transient structure.

Instead of having a special and unique ‘default-render-pass’ per frame and context, an application can now simply call sg_begin_pass() multiple times per frame, each time with properties for a different swapchain, and all that without having to create ‘context objects’ upfront or ‘switching contexts’.

Most simple applications that don’t render into offscreen passes and use sokol_gfx.h together with sokol_app.h and sokol_glue.h only need to change two calls: sg_setup() and sg_begin_default_pass(), for other situations please check the ‘Change Recipes’ section further down.

In addition to this blog post, please also re-read the documentation headers in sokol_gfx.h and sokol_app.h, and specifically the struct documentation for the new sokol-gfx structs sg_environment and sg_swapchain.

Detailed change list

sokol_gfx.h

The following public API structs and functions have been removed:

sg_begin_default_pass()
sg_begin_default_passf()
struct sg_context_desc
struct sg_context
sg_setup_context()
sg_activate_context()
sg_discard_context()

The following top-level structs have been added:

struct sg_environment: this is passed as a nested struct of sg_desc into the sg_setup() call to provide information about the environment sokol-gfx runs in (most importantly 3D API device pointers).
struct sg_swapchain: this is passed into sg_begin_pass() for render passes which should render into an externally managed swapchain. The struct contains the following information:
- the pixel format of the swapchain’s rendering surface
- the pixel format of the optional depth/stencil surface
- an MSAA sample count
- 3D backend specific resource handles, like D3D11/WebGPU texture views, Metal drawables, or GL framebuffers

The resource handle type sg_pass has been renamed to sg_attachments (to free the name for another purpose), this also causes related renames:

sg_pass => sg_attachments
sg_pass_desc => sg_attachments_desc
sg_pass_info => sg_attachments_info
sg_make_pass() => sg_make_attachments()
sg_destroy_pass() => sg_destroy_attachments()
sg_query_pass_state() => sg_query_attachments_state()
sg_query_pass_info() => sg_query_attachments_info()
sg_query_pass_desc() => sg_query_attachments_desc()
sg_alloc_pass() => sg_alloc_attachments()
sg_dealloc_pass() => sg_dealloc_attachments()
sg_init_pass() => sg_init_attachments()
sg_fail_pass() => sg_fail_attachments()
sg_[*]_pass_info() => sg_[*]_attachments_info() (where ‘*’ is one of ‘d3d11’, ‘gl’, ‘metal’ or ‘wgpu’)

Inside the sg_attachments_desc struct there has been some renaming to reduce redundancy:

.color_attachments[] => .colors[]
.resolve_attachments[] => .resolves[]
.depth_stencil_attachment => .depth_stencil

The typename sg_pass has been repurposed to serve as the sg_begin_pass() parameter, e.g. the begin-pass function signature now looks like this:

void sg_begin_pass(const sg_pass* pass);

With the struct sg_pass now looking like this (with omitted start/end canaries):

typedef struct sg_pass {
    sg_pass_action action;
    sg_attachments attachments;
    sg_swapchain swapchain;
    const char* label;
} sg_pass;

For an ‘offscreen-render-pass’, an .attachments item must be provided, but no .swapchain:

sg_begin_pass(&(sg_pass){
    .action = pass_action,
    .attachments = attachments,
});

…and for a ‘swapchain-render-pass’, a .swapchain item must be provided, but no .attachments:

sg_begin_pass(&(sg_pass){
    .action = pass_action,
    .swapchain = sglue_swapchain(),
});
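The difference between the two pass types can be sketched as a small classifier. This is only an illustration with hypothetical stand-in types (my_pass, my_attachments and my_swapchain are not part of sokol-gfx and are much simpler than the real structs). It assumes that a zero handle id means ‘not provided’ (SG_INVALID_ID is zero in sokol-gfx) and that a provided swapchain has a non-zero size:

```c
#include <stdint.h>

// Hypothetical stand-ins for sg_attachments, sg_swapchain and sg_pass,
// reduced to the fields needed for this sketch.
typedef struct { uint32_t id; } my_attachments;   // id 0 means "not provided"
typedef struct { int width, height, sample_count; } my_swapchain;
typedef struct {
    my_attachments attachments;
    my_swapchain swapchain;
} my_pass;

typedef enum {
    MY_PASS_OFFSCREEN,
    MY_PASS_SWAPCHAIN,
    MY_PASS_INVALID,
} my_pass_kind;

// An offscreen pass provides .attachments and leaves .swapchain zeroed,
// a swapchain pass leaves .attachments zeroed and provides .swapchain.
static my_pass_kind my_classify_pass(const my_pass* pass) {
    const int has_attachments = (pass->attachments.id != 0);
    const int has_swapchain = (pass->swapchain.width > 0) && (pass->swapchain.height > 0);
    if (has_attachments && !has_swapchain) {
        return MY_PASS_OFFSCREEN;
    }
    if (!has_attachments && has_swapchain) {
        return MY_PASS_SWAPCHAIN;
    }
    return MY_PASS_INVALID;   // both or neither were provided
}
```

The actual validation in sokol_gfx.h is more detailed than this, so treat the sketch as a mental model, not as a description of the implementation.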

Other unrelated ‘drive-by-changes’ in sokol_gfx.h:

sg_limits.gl_max_vertex_uniform_vectors has been replaced with sg_limits.gl_max_vertex_uniform_components (see #714)
the start and end canaries in sg_pass_action have been removed (since sg_pass_action is now a nested struct of sg_pass, the canaries are redundant)
a new initialization config item sg_desc.mtl_use_command_buffer_with_retained_references has been added (see: #981)

sokol_app.h

The following public API function has been removed:

sapp_metal_get_renderpass_descriptor()

The following functions have been renamed:

sapp_metal_get_drawable() => sapp_metal_get_current_drawable()
sapp_d3d11_get_render_target_view() => sapp_d3d11_get_render_view()

…and the following functions are new:

sapp_metal_get_depth_stencil_texture()
sapp_metal_get_msaa_color_texture()
sapp_d3d11_get_resolve_view()
sapp_gl_get_framebuffer()

…These functions directly plug into the new sg_swapchain struct in sokol_gfx.h.

sokol_glue.h

sokol_glue.h is now a regular library header without the ‘preprocessor magic’ which created a different API depending on what other sokol headers had been included before sokol_glue.h (this was an ‘interesting’ but ultimately pretty stupid idea).

The API prefix has changed from a somewhat confusing sapp_ to the expected sglue_.

The old function sapp_sgcontext() has been split into two new functions:

sglue_environment() which plugs directly into sg_desc.environment, and…
sglue_swapchain() which plugs into sg_pass.swapchain

Note that sglue_swapchain() may return different values each frame depending on the 3D API backend.

sokol_gfx_imgui.h

In a similar vein, the public API prefix of sokol_gfx_imgui.h has been changed from the weird ‘double prefix’ sg_imgui_ to a more conventional sgimgui_.

Apart from this publicly visible change, all the internals have been updated to reflect the sokol-gfx API changes.

Link collection with example code changes

If you use sokol_gfx.h + sokol_app.h + sokol_glue.h, check out the updated samples here (first click on a sample, and then on the ‘src’ link at the bottom):

sokol samples

Specifically look at clear-sapp for the simple case of only rendering to a default framebuffer, and offscreen-sapp for rendering to an offscreen render target.

If you use sokol_gfx.h with your own window system glue, or a library like GLFW or SDL, check out the updated backend specific examples:

for D3D11: https://github.com/floooh/sokol-samples/tree/master/d3d11
for Metal: https://github.com/floooh/sokol-samples/tree/master/metal
for GL with GLFW: https://github.com/floooh/sokol-samples/tree/master/glfw
for WebGL2: https://github.com/floooh/sokol-samples/tree/master/html5
for WebGPU: https://github.com/floooh/sokol-samples/tree/master/wgpu

The GLFW subdirectory also contains an updated multiwindow-glfw sample, and a metal-glfw sample which demonstrates how to use GLFW in NO_API mode together with the sokol_gfx.h Metal backend.

Also please be aware of the following behaviour and expectation changes if you are using your own window system glue:

For D3D11/DXGI the MSAA resolve operation is now performed in sg_end_pass(), previously this was expected to be performed in the window system glue before presentation.
For Metal it is now expected that the window system glue provides a CAMetalDrawable and optional MTLTexture objects instead of an MTLRenderPassDescriptor. This was also done to better ‘harmonize’ with the other backends (it’s just as easy getting those individual objects from an MTKView as the MTLRenderPassDescriptor).
For GL, sokol-gfx now expects that all rendering goes through a single GL context. This may require changes to existing code which renders into multiple windows (for instance in GLFW, every window has its own GL context). Refer to the new multiwindow-glfw.c example for a possible solution.

Additionally, check out the following PRs for required changes in my toy projects:

When using the language bindings, check out the following PRs:

Detailed Change Recipes

…for sokol_gfx.h + sokol_app.h + sokol_glue.h

When using sokol_gfx.h together with sokol_app.h and sokol_glue.h…

…change your sg_setup() call from this:

sg_setup(&(sg_desc){
    .context = sapp_sgcontext(),
    .logger.func = slog_func,
});

…to this:

sg_setup(&(sg_desc){
    .environment = sglue_environment(),
    .logger.func = slog_func,
});

Change the sg_begin_default_pass() call from this:

sg_begin_default_pass(&pass_action, sapp_width(), sapp_height());

…to this:

sg_begin_pass(&(sg_pass){
    .action = pass_action,
    .swapchain = sglue_swapchain()
});

…for offscreen render passes

Change sg_make_pass() calls from this:

sg_pass pass = sg_make_pass(&(sg_pass_desc){
    .color_attachments[0].image = color_img,
    .resolve_attachments[0].image = resolve_img,
    .depth_stencil_attachment.image = depth_img,
});

…to this:

sg_attachments attachments = sg_make_attachments(&(sg_attachments_desc){
    .colors[0].image = color_img,
    .resolves[0].image = resolve_img,
    .depth_stencil.image = depth_img,
});

Change sg_begin_pass() calls from this:

sg_begin_pass(pass, &pass_action);

…to this:

sg_begin_pass(&(sg_pass){
    .action = pass_action,
    .attachments = attachments,
});

…for custom window system glue

Create two helper functions, one which returns an initialized sg_environment struct and one which returns an initialized sg_swapchain struct. The following examples show what these functions might look like for different backend 3D APIs.

…using D3D11

Example implementations:

sg_environment d3d11_environment(void) {
    return (sg_environment){
        .defaults = {
            .color_format = SG_PIXELFORMAT_BGRA8,
            .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
            .sample_count = 4,
        },
        .d3d11 = {
            .device = d3d11_device, // ID3D11Device*
            .device_context = d3d11_device_context, // ID3D11DeviceContext*
        }
    };
}

.defaults.color_format, .defaults.depth_format and .defaults.sample_count should match the ‘most common’ swapchain surface properties. These values will be used to fill in zero-initialized items in various sokol-gfx calls. .defaults.depth_format can also be SG_PIXELFORMAT_NONE if no depth-buffer exists, or SG_PIXELFORMAT_DEPTH if no stencil buffer is used.
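How such a ‘fill in the defaults’ step works can be sketched roughly like this. All names (my_pixel_format, my_defaults, my_resolve_defaults) are hypothetical stand-ins and not the real sokol-gfx code. The one assumption taken from sokol-gfx is the enum layout: zero means ‘use the default’, while ‘no surface’ is a separate non-zero value, which is why an explicit SG_PIXELFORMAT_NONE survives the default resolution:

```c
// Hypothetical stand-in for a pixel format enum where zero means
// "use the default" and "no surface" is a separate explicit value.
typedef enum {
    MY_PIXELFORMAT_DEFAULT = 0,   // zero-initialized: take the environment default
    MY_PIXELFORMAT_NONE,          // explicitly no depth buffer
    MY_PIXELFORMAT_BGRA8,
    MY_PIXELFORMAT_DEPTH,
    MY_PIXELFORMAT_DEPTH_STENCIL,
} my_pixel_format;

typedef struct {
    my_pixel_format color_format;
    my_pixel_format depth_format;
    int sample_count;
} my_defaults;

// Replace zero-initialized items in 'desc' with the environment defaults,
// explicitly provided values (including MY_PIXELFORMAT_NONE) are kept.
static my_defaults my_resolve_defaults(my_defaults desc, const my_defaults* env) {
    if (desc.color_format == MY_PIXELFORMAT_DEFAULT) {
        desc.color_format = env->color_format;
    }
    if (desc.depth_format == MY_PIXELFORMAT_DEFAULT) {
        desc.depth_format = env->depth_format;
    }
    if (desc.sample_count == 0) {
        desc.sample_count = env->sample_count;
    }
    return desc;
}
```

With environment defaults of BGRA8, DEPTH_STENCIL and a sample count of 4, a desc which only sets the depth format to NONE resolves to BGRA8, NONE and 4.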

The associated DXGI depth-stencil-view pixel formats are:

SG_PIXELFORMAT_DEPTH_STENCIL => DXGI_FORMAT_D24_UNORM_S8_UINT
SG_PIXELFORMAT_DEPTH => DXGI_FORMAT_D32_FLOAT

The helper function to obtain an sg_swapchain struct might look like this:

sg_swapchain d3d11_swapchain(void) {
    return (sg_swapchain){
        .width = state.width,
        .height = state.height,
        .sample_count = state.sample_count,
        .color_format = SG_PIXELFORMAT_BGRA8,
        .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
        .d3d11 = {
            .render_view = (state.sample_count == 1) ? state.rt_view : state.msaa_view,
            .resolve_view = (state.sample_count == 1) ? 0 : state.rt_view,
            .depth_stencil_view = state.ds_view,
        }
    };
}

state.rt_view and state.msaa_view are of type ID3D11RenderTargetView and state.ds_view is of type ID3D11DepthStencilView.

Note how a different .d3d11.render_view is selected depending on whether multisampled rendering is used or not. For non-multisampled rendering, sokol-gfx renders into the same view that’s presented. For multisampled rendering, sokol-gfx will render into an intermediate MSAA texture view (state.msaa_view) which is then resolved into the d3d11.resolve_view inside sg_end_pass().

Also check out the example D3D11 window system glue code here:

https://github.com/floooh/sokol-samples/blob/master/d3d11/d3d11entry.c

…using Metal

Example function which returns an initialized sg_environment struct:

sg_environment osx_environment(void) {
    return (sg_environment) {
        .defaults = {
            .sample_count = sample_count,
            .color_format = SG_PIXELFORMAT_BGRA8,
            .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
        },
        .metal = {
            .device = (__bridge const void*) mtl_device,
        }
    };
}

The ObjC type of mtl_device is id<MTLDevice>. Note the special __bridge cast to a void pointer for tunneling through the sokol_app.h and sokol_gfx.h C APIs.

…and the function which returns an sg_swapchain struct (in this case using an MTKView to manage the swapchain surfaces):

sg_swapchain osx_swapchain(void) {
    return (sg_swapchain) {
        .width = (int) [mtk_view drawableSize].width,
        .height = (int) [mtk_view drawableSize].height,
        .sample_count = sample_count,
        .color_format = SG_PIXELFORMAT_BGRA8,
        .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
        .metal = {
            .current_drawable = (__bridge const void*) [mtk_view currentDrawable],
            .depth_stencil_texture = (__bridge const void*) [mtk_view depthStencilTexture],
            .msaa_color_texture = (__bridge const void*) [mtk_view multisampleColorTexture],
        }
    };
}

Also check out the Metal window system glue code here:

https://github.com/floooh/sokol-samples/blob/master/metal/osxentry.m

…alternatively check out the GLFW+Metal example here which doesn’t use an MTKView (but also doesn’t support a depth-buffer or MSAA rendering):

https://github.com/floooh/sokol-samples/blob/master/glfw/metal-glfw.m

…using WebGPU

The environment- and swapchain-helper-functions look very similar to D3D11:

sg_environment wgpu_environment(void) {
    return (sg_environment) {
        .defaults = {
            .color_format = SG_PIXELFORMAT_...,
            .depth_format = SG_PIXELFORMAT_...,
            .sample_count = state.desc.sample_count,
        },
        .wgpu = {
            .device = (const void*) state.device,
        }
    };
}

For .defaults.color_format you should use the result of wgpuSurfaceGetPreferredFormat() translated to a sokol-gfx pixel format (either SG_PIXELFORMAT_BGRA8 or SG_PIXELFORMAT_RGBA8).

For the depth format use either SG_PIXELFORMAT_DEPTH_STENCIL, SG_PIXELFORMAT_DEPTH or SG_PIXELFORMAT_NONE, which translate to WebGPU pixel formats as follows:

SG_PIXELFORMAT_DEPTH_STENCIL => WGPUTextureFormat_Depth32FloatStencil8
SG_PIXELFORMAT_DEPTH => WGPUTextureFormat_Depth32Float

The type of state.device is WGPUDevice.

The WebGPU swapchain helper function might look like this:

sg_swapchain wgpu_swapchain(void) {
    return (sg_swapchain) {
        .width = state.width,
        .height = state.height,
        .sample_count = state.sample_count,
        .color_format = SG_PIXELFORMAT_...,
        .depth_format = SG_PIXELFORMAT_...,
        .wgpu = {
            .render_view = (state.sample_count == 1) ? state.rt_view : state.msaa_view,
            .resolve_view = (state.sample_count == 1) ? 0 : state.rt_view,
            .depth_stencil_view = state.ds_view,
        }
    };
}

…note the selection for .wgpu.render_view and .wgpu.resolve_view based on the MSAA sample count, which works the same as in the d3d11_swapchain() function.

The types for all view objects are WGPUTextureView.

Also check out the WebGPU system glue code here:

https://github.com/floooh/sokol-samples/blob/master/wgpu/wgpu_entry.c

…GL with GLFW

The environment-helper-function only returns default pixel formats and sample count:

sg_environment glfw_environment(void) {
    return (sg_environment) {
        .defaults = {
            .color_format = SG_PIXELFORMAT_RGBA8,
            .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
            .sample_count = 4,
        },
    };
}

…the swapchain function also returns a GL framebuffer object, for the default framebuffer this is always zero, otherwise this is a handle created with glGenFramebuffers().

sg_swapchain glfw_swapchain(void) {
    int width, height;
    glfwGetFramebufferSize(_window, &width, &height);
    return (sg_swapchain) {
        .width = width,
        .height = height,
        .sample_count = _sample_count,
        .color_format = SG_PIXELFORMAT_RGBA8,
        .depth_format = SG_PIXELFORMAT_DEPTH_STENCIL,
        .gl = {
            .framebuffer = 0,
        }
    };
}

Also see https://github.com/floooh/sokol-samples/blob/master/glfw/glfw_glue.c

Q: Why still have a baked pass attachments object?

I briefly considered getting rid of pre-baked pass-attachments objects altogether (e.g. what were formerly sg_pass objects and are now sg_attachments objects), and instead passing a transient struct with the same information that’s in sg_attachments_desc into the sg_begin_pass() function, similar to how sg_apply_bindings() takes a transient sg_bindings struct with all the resource bindings.

I didn’t follow through with that idea because this would mean creating temporary objects inside sg_begin_pass() and discarding them again in sg_end_pass() (or alternatively use a ‘hash-and-cache’ approach).

In D3D11 and WebGPU, one temporary texture view object would need to be created per pass-attachment (which may add up to 9 temporary objects: 4 color, 4 resolve and 1 depth-stencil attachment), and in the GL backend, a GL framebuffer object must be created, configured and checked for completeness. All this work currently only happens once in sg_make_attachments(), but would need to happen inside sg_begin_pass() without baked attachments objects.

While these backend API objects should be ‘reasonably cheap’ to create, I still decided against it.

Currently the only other place where such temporary objects are created and discarded on the fly is the sg_apply_bindings() call in the WebGPU backend, where temporary BindGroup objects are created and discarded dynamically via a ‘hash-and-cache’ approach and I hate it :) I don’t want that type of code to creep into other places.

Now, sg_begin_pass() and sg_end_pass() are by far not as high-frequency-calls as sg_apply_bindings(), and creating view- and framebuffer-objects should be cheap enough, but it still feels ‘wrong’ to create and discard backend API objects willy-nilly during the frame.
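For readers unfamiliar with the term, a ‘hash-and-cache’ approach roughly looks like the following sketch. This is not the sokol-gfx WebGPU code, just a minimal direct-mapped cache under simplifying assumptions: all names are made up, the key is a fixed array of resource ids, and the ‘expensive backend object’ is faked with a counter:

```c
#include <stdint.h>
#include <string.h>

#define MY_CACHE_SIZE 64   // number of slots, must be a power of two
#define MY_KEY_ITEMS 4     // e.g. the resource ids which are bound together

// The key is a plain array (no struct padding), so it can be hashed bytewise.
typedef struct { uint32_t ids[MY_KEY_ITEMS]; } my_key;
typedef struct { my_key key; uint32_t object; int valid; } my_slot;

typedef struct {
    my_slot slots[MY_CACHE_SIZE];
    uint32_t num_created;   // how many backend objects were created so far
} my_cache;

// 64-bit FNV-1a hash over the key bytes
static uint64_t my_hash(const my_key* key) {
    const uint8_t* bytes = (const uint8_t*)key->ids;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < sizeof(key->ids); i++) {
        h ^= bytes[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// stand-in for an expensive backend call (e.g. creating a bind group)
static uint32_t my_create_object(my_cache* cache) {
    return ++cache->num_created;
}

// Return the cached object for a key, or create one on a cache miss.
static uint32_t my_cache_get(my_cache* cache, const my_key* key) {
    my_slot* slot = &cache->slots[my_hash(key) & (MY_CACHE_SIZE - 1)];
    if (slot->valid && (0 == memcmp(&slot->key, key, sizeof(my_key)))) {
        return slot->object;   // cache hit, nothing is created
    }
    // cache miss: a real backend would destroy the evicted object here
    slot->key = *key;
    slot->object = my_create_object(cache);
    slot->valid = 1;
    return slot->object;
}
```

Even this toy version needs a hash function, a collision and eviction policy, and lifetime handling for evicted objects, which is exactly the kind of machinery that baked attachments objects avoid.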

VSCode, WASM, WASI

Sun, 31 Dec 2023 00:00:00 +0000

I did a neat little thing during my year-end vacation: A VSCode extension for retro-assembly coding with the assembler and home computer emulator integrated right into VSCode via WASM and WASI.

The extension is here (careful: it must be installed as pre-release, otherwise installing a dependency extension won’t work, more on that later):

https://marketplace.visualstudio.com/items?itemName=floooh.vscode-kcide

This is what it looks like in action when debugging a KC85/4 demo I wrote for dog-fooding the extension:

The VSCode extension project is here:

https://github.com/floooh/vscode-kcide

…and the samples for KC85/4, C64 and Amstrad CPC are here:

https://github.com/floooh/kcide-sample

The extension also integrates the following projects:

a fork of the ASMX multi-cpu assembler
the KC85/4, C64 and CPC emulators from my chips project

Creating a simple VSCode extension is fairly straightforward (see: Your First Extension), so I won’t go into too many details there. What’s interesting is the use of WASM and WASI to integrate projects written in other languages than JS/TS into a VSCode extension.

This makes it possible to bundle the assembler (written in C89) and the emulator (C99 and C++11) directly with the extension as WASM blobs. Similar extensions without WASM components would either need to port the assembler and emulator to JS/TS, ask the user to install and run native tools (most other retro-dev extensions seem to use that approach), or automatically download and install separate platform-specific native tools (the approach used by the Microsoft C/C++ extension), which asks for a lot of trust from the extension user.

WASM fixes all those issues:

it’s completely hassle-free for the user because the WASM blobs can be bundled with the extension and everything works out of the box
it’s less hassle for the extension developer, because a single WASM blob automatically works on all platforms supported by VSCode (including the VSCode web version)
…and unlike native binaries, WASM and WASI don’t add any more security concerns over regular VSCode extensions written in TS/JS

Also, how cool is it that I can take an assembler written in C89 in the 90’s and safely run that without code changes in the VSCode web version?

(I did actually consider writing my own assembler in Typescript a long time ago just for the purpose of running it in VSCode but quickly abandoned that idea, here are the ruins of that folly: https://github.com/floooh/hcasm)

Paths not taken

I considered various approaches:

a native IDE via Qt similar to Goran Devic’s Z80 Explorer
integrate the IDE features right into the emulator via Dear ImGui (the emulators already have an extensive Dear ImGui debugging UI)
create a VSCode extension which calls into an assembler and emulator written in Typescript
create a VSCode extension which calls into native assembler and emulator binaries
create a VSCode extension which uses WASM for the assembler and emulator

The final decision to use VSCode with WASM comes down to a couple of central problems:

dealing with native tools in a cross-platform scenario is a massive PITA these days:
- running the same binary across different Linux distros is still pretty much an unsolved problem
- on Windows and macOS you’ll get all sorts of scare popups when trying to run an executable downloaded from the internet
porting a code base to TS/JS just so that it can be hooked up into a VSCode extension is almost always a massive waste of time

In the end it was a decision between (2: extend the existing Dear ImGui emulator UI with IDE features), and (5: figure out how to integrate the assembler and emulator as WASM blobs into a VSCode extension).

While I enjoy writing Dear ImGui UIs immensely, a robust text editing experience which can rival a dedicated text editor like VSCode would be a massive project on its own.

…which leaves (5) as the one option which enables the most robust result for the least amount of work (important, since this is a ‘vacation side project’ which shouldn’t increase my spare time software maintenance burden even more).

All in all the extension was finished in about 3 weeks of focused work (spread over 6 real-world weeks, with 2 weeks spent dog-fooding on a little KC85/4 assembly demo).

Of the 3 weeks working on the VSCode extension, about 2 weeks were spent on the Debug Adapter alone (a lot more effort than I initially expected).

The boring parts

I’ll run very quickly over the parts of the extension that are not all that interesting (since all of that is just reading the VSCode extension documentation about what features can be provided by extensions and how to implement them).

The KC IDE extension implements:

a handful of Commands which can be invoked via the Ctrl-Shift-P command palette:
- KCIDE: Build: assembles the source code into a binary file compatible with the current emulator
- KCIDE: Debug: builds the source and starts a debugging session
- KCIDE: Open Emulator: (re-)opens the emulator tab
- KCIDE: Reboot Emulator: cold-boots the emulator and stops active debug session
- KCIDE: Reset Emulator: resets the emulator and stops active debug session (on some home computers, a reset preserves the memory content)
two Key Bindings: F5 to start a debug session and F7 to build the project source code into a binary file
a JSON Schema for a kcide.project.json file which defines the target computer system, assembly dialect, file paths and output binary file format loadable by the emulator
a Language Grammar for regex-based syntax highlighting (Z80 and 6502 assembly statements, plus ASMX-specific keywords)
a Debug Adapter to connect the VSCode debugging UI with the (already existing) debugger that’s integrated into the emulator

Some notable VSCode extension features which are not implemented:

No Language Server (to provide error squiggles and code completion while typing). A full LSP implementation is a bit of overkill for low level languages like assembly. It would have been a ‘nice to have’ feature, but it wasn’t doable in the available time, and similar features can most likely also be implemented without a full LSP implementation (VSCode has a couple of other language features like semantic highlighting, snippets or programmatic language features). In the end I simply ran out of time, maybe in the next round of updates…
No Task Providers (e.g. proper integration with tasks.json and launch.json). This also seemed like overkill. Just adding two key bindings while the extension is active (F5 for debugging and F7 for building) achieves the same thing with less hassle for the user.

Finally, a VSCode extension may run in 3 environments, which has some subtle consequences for what APIs can be used in the extension code:

desktop: the extension only works in ‘desktop VSCode’ and can use the full set of node.js APIs
web: the extension works in ‘VSCode for the web’, which means only the VSCode extension API and browser APIs can be called
universal: the extension can run both in desktop and web VSCode

The KC IDE is a universal extension, but still has some issues when running in the web version of VSCode (which comes down to a mix of VSCode issues and some file-IO related issues I will most likely need to fix on my side).

Integrating the assembler via WASI

This turned out a lot easier than expected, because the VSCode WASI extension does all the hard work.

What this extension basically does is to allow any POSIX commandline tool to run inside VSCode without requiring changes to the source (most notably, no changes are required for blocking file IO code via fopen/fread/fwrite/fclose).

The only thing I had to fix in the ASMX assembler was a separately provided root path for the assembler’s include statement (which is supposed to work with relative paths). WASI currently doesn’t have the concept of a ‘current working directory’, so all filesystem paths must be resolved to absolute paths within the WASI container’s virtual filesystem (a WASI environment doesn’t use direct filesystem paths of the host system, but instead defines its own virtual filesystem with mount points mapped to host system directories).
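The kind of path resolution this requires can be sketched in a few lines of C. This is not the actual ASMX patch, only an illustration: the function name my_resolve_path and the /workspace mount point are assumptions, the sketch doesn’t normalize ‘..’ components, and it uses snprintf() which is C99, not C89:

```c
#include <stdio.h>
#include <string.h>

// Resolve 'path' against a separately provided root directory, since there
// is no current working directory to resolve relative paths against.
// Returns 0 on success and -1 if the result doesn't fit into 'out'.
static int my_resolve_path(const char* root, const char* path, char* out, size_t out_size) {
    int n;
    if (path[0] == '/') {
        // already absolute inside the virtual filesystem, use as is
        n = snprintf(out, out_size, "%s", path);
    } else {
        // join root and relative path with exactly one slash
        const size_t root_len = strlen(root);
        const int has_slash = (root_len > 0) && (root[root_len - 1] == '/');
        n = snprintf(out, out_size, "%s%s%s", root, has_slash ? "" : "/", path);
    }
    return ((n >= 0) && ((size_t)n < out_size)) ? 0 : -1;
}
```

For instance, resolving the include path src/macros.inc against a root of /workspace yields /workspace/src/macros.inc, which the WASI runtime then maps to a file in the host workspace directory.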

The basic procedure to get the assembler working inside VSCode is:

compile the assembler to a WASI blob using the WASI SDK Clang toolchain, this happens manually outside the extension project, the resulting .wasm blob is then simply committed into the extension’s git repo and bundled with the published extension. The size of the WASM blob is about 200 KBytes.

in the VSCode extension code: initialize the WASI runtime, setup a virtual filesystem, and load and compile the assembler WASM blob, this happens only once during the extension’s life cycle:

  export async function requireWasiEnv(ext: ExtensionContext): Promise<WasiEnv> {
      if (wasiEnv === null) {
          const wasm = await Wasm.load();
          const fs = await wasm.createRootFileSystem([ { kind: 'workspaceFolder' } ]);
          const bits = await workspace.fs.readFile(Uri.joinPath(ext.extensionUri, 'media/asmx.wasm'));
          const asmx = await WebAssembly.compile(bits);
          wasiEnv = { wasm, fs, asmx };
      }
      return wasiEnv;
  }

run the assembler WASM blob, capture stdout and stderr and check the exit code, this is quite similar to how a native tool would be launched:

  export async function runAsmx(ext: ExtensionContext, args: string[]): Promise<RunAsmxResult> {
      const wasiEnv = await requireWasiEnv(ext);
      const process = await wasiEnv.wasm.createProcess('asmx', wasiEnv.asmx, {
          rootFileSystem: wasiEnv.fs,
          stdio: {
              out: { kind: 'pipeOut' },
              err: { kind: 'pipeOut' },
          },
          args,
      });
      const decoder = new TextDecoder('utf-8');
      let stderr = '';
      let stdout = '';
      process.stderr!.onData((data) => {
          stderr += decoder.decode(data);
      });
      process.stdout!.onData((data) => {
          stdout += decoder.decode(data);
      });
      const exitCode = await process.run();
      return { exitCode, stdout, stderr };
  }

the KC IDE extension will then parse the assembler error messages in stderr and convert the error messages into VSCode Diagnostic objects, which then show up in the Problems panel and as error squiggles in the text editor
the actual assembler output files are written directly into the host filesystem via the virtual filesystem mapping that was provided when initializing the WASI runtime

Integrating the emulator

The embedded home computer emulators are taken from the chips project, those are implemented in C/C++, use the sokol headers for abstracting platform details and run both as natively compiled executables and in the browser via WASM and WebGL, compiled with the Emscripten SDK.

One emulator WASM blob is about 700..800 KBytes (most of that is the Dear ImGui debugging UI which costs about 450 KBytes).

Currently the KC IDE extension contains 4 emulators (KC85/3, KC85/4, C64 and CPC) which adds up to about 3 MBytes (if there will be drastically more supported systems in the future I’ll need to come up with a solution to reduce the size of the embedded emulators, either downloading them on demand, merging them into a single ‘multi-system-emulator’ binary, or maybe moving the UI into a shared WASM module that’s loaded like a DLL).

The emulator is running inside a VSCode webview panel. For the most part this is quite straightforward for an Emscripten WebGL application: take an index.html like this (note the placeholders {{{shell}}} and {{{emu}}}, those must be replaced with runtime-generated URLs), and set up a webview panel object like this.

There’s a couple of interesting details in that code:

The webview panel cannot simply load resources from anywhere in the host file system, instead a localResourceRoot must be provided in the window.createWebviewPanel() call which points to the extension subdirectory media/ (e.g. anything that’s loaded in the webview panel needs to be located in that media/ subdirectory):

    const rootUri = Uri.joinPath(getExtensionUri(), 'media');
    const panel = window.createWebviewPanel(
      // ...
      {
        localResourceRoots: [ rootUri ],
      }
    );

…next, all URLs referenced in the webview panel’s HTML content must be generated via the webview panel API, I’m doing that by loading an HTML template file and then replacing the placeholders inside {{{...}}} with generated URLs (and while at it, I also select the correct emulator to load):

    let emuFilename;
    switch (project.emulator.system) {
        case System.KC853:      emuFilename = 'kc853-ui.js'; break;
        case System.C64:        emuFilename = 'c64-ui.js'; break;
        case System.CPC6128:    emuFilename = 'cpc-ui.js'; break;
        default:                emuFilename = 'kc854-ui.js'; break;
    }
    const emuUri = panel.webview.asWebviewUri(Uri.joinPath(rootUri, emuFilename));
    const shellUri = panel.webview.asWebviewUri(Uri.joinPath(rootUri, 'shell.js'));
    const templ = await readTextFile(Uri.joinPath(rootUri, 'shell.html'));
    const html = templ.replace('{{{emu}}}', emuUri.toString()).replace('{{{shell}}}', shellUri.toString());
    panel.webview.html = html;

Communication between VSCode and the WebView panel content works via bi-directional message passing, this means the VSCode extension needs to register a listener function which dispatches received messages to their handler functions:

    panel.webview.onDidReceiveMessage((msg) => {
        if (msg.command === 'emu_cpustate') {
            cpuStateResolved(msg.state as CPUState);
        } else if (msg.command === 'emu_disassembly') {
            disassemblyResolved(msg.result as DisasmLine[]);
        } else if (msg.command === 'emu_memory') {
            readMemoryResolved(msg.result as ReadMemoryResult);
        } else if (msg.command === 'emu_ready') {
            if (state) {
                state.ready = msg.isReady;
            }
        } else {
            KCIDEDebugSession.onEmulatorMessage(msg);
        }
    });
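
As more message types get added, the if/else chain could also be written as a lookup table. A minimal sketch in plain Javascript: makeDispatcher, the fallback parameter and the 'some_other_event' command are hypothetical, and the real extension additionally casts the payloads to their Typescript types:

```javascript
// Hypothetical table-driven variant of the dispatcher above.
// 'handlers' maps a msg.command string to a function, 'fallback'
// receives every message without a matching entry.
function makeDispatcher(handlers, fallback) {
    return (msg) => {
        const handler = Object.prototype.hasOwnProperty.call(handlers, msg.command)
            ? handlers[msg.command]
            : fallback;
        return handler(msg);
    };
}

// usage: record which handler was picked for each message
const seen = [];
const dispatch = makeDispatcher({
    emu_cpustate: (msg) => seen.push(['cpustate', msg.state]),
    emu_ready: (msg) => seen.push(['ready', msg.isReady]),
}, (msg) => seen.push(['other', msg.command]));

dispatch({ command: 'emu_ready', isReady: true });
dispatch({ command: 'some_other_event' });
console.log(seen);
```

The resulting function has the same shape as the listener passed to onDidReceiveMessage(), so it could be registered in its place.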

…sending a message in the opposite direction (from the debug session to the webview panel) simply looks like this:

  await state.panel.webview.postMessage({ cmd: 'boot' });

…the message structure is entirely custom (and I’m just noticing that I’m using command in one direction, but cmd in the other direction… but anyway…).

There is one missing step in the communication between VSCode debug session on one side, and the emulator on the other. There’s a Javascript shim running in the context of the webpage which translates between the JSON-like message objects which are sent and received by the VSCode debug session, and a lower level WASM function call interface implemented by the emulator.

When a message is received from the VSCode debug session in the emulator’s HTML page, it’s dispatched to a Javascript function via an event listener added to the window object (note that this code is plain Javascript, not Typescript):

    window.addEventListener('message', ev => {
        const msg = ev.data;
        switch (msg.cmd) {
            case 'boot': kcide_boot(); break;
            case 'reset': kcide_reset(); break;
            case 'ready': kcide_ready(); break;
            case 'load': kcide_load(msg.data); break;
            // ...
            case 'disassemble': kcide_dbgDisassemble(msg.addr, msg.offsetLines, msg.numLines); break;
            case 'readMemory': kcide_dbgReadMemory(msg.addr, msg.numBytes); break;
            default: console.log(`unknown cmd called: ${msg.cmd}`); break;
        }
    });

Such a handler function looks like this:

function kcide_boot() {
    Module._webapi_boot();
}

This is an ‘Emscripten-ism’. The easiest way to export a C function from WASM to Javascript is via the EMSCRIPTEN_KEEPALIVE attribute in the C source, like this:

EMSCRIPTEN_KEEPALIVE void webapi_boot(void) {
    if (state.inited && state.funcs.boot) {
        state.funcs.boot();
    }
}

When Emscripten builds the project, it keeps track of all EMSCRIPTEN_KEEPALIVE C functions and makes them available as Javascript functions on a global Module object created by the Emscripten entry stub. Calling such an EMSCRIPTEN_KEEPALIVE C function from the Javascript side then looks like this:

    Module._webapi_boot();
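
Since the exported functions only show up on Module under their underscore-prefixed name, a small guard can help to catch typos in the function name. This is just a sketch: callExport is a made-up helper, and the stand-in object below only imitates the shape of the Emscripten Module object:

```javascript
// Hypothetical helper: call an EMSCRIPTEN_KEEPALIVE function by its C name.
// Emscripten exposes such functions with a leading underscore.
function callExport(mod, cName, ...args) {
    const fn = mod['_' + cName];
    if (typeof fn !== 'function') {
        throw new Error(`WASM export not found: ${cName}`);
    }
    return fn(...args);
}

// stand-in for the real Module object, for illustration only
const fakeModule = {
    booted: false,
    _webapi_boot() { fakeModule.booted = true; },
};
callExport(fakeModule, 'webapi_boot');
console.log(fakeModule.booted);
```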

…and that’s essentially how the communication between VSCode and the WASM emulator works. For instance, when the VSCode palette command KCIDE: Reboot Emulator is executed, eventually the C function webapi_boot() in the WASM emulator will be called, which reboots the emulator.

Currently the emulators implement the following ‘web API’ functions callable from Javascript:

void webapi_dbg_connect(void);          // a VSCode debug session has started
void webapi_dbg_disconnect(void);       // a VSCode debug session has ended
void* webapi_alloc(int size);           // helper function to allocate on the WASM heap from Javascript
void webapi_free(void*);                // helper function to free memory allocated via webapi_alloc()
void webapi_boot(void);                 // reboot the emulator (e.g. switch off and on)
void webapi_reset(void);                // reset the emulator (e.g. press the reset button)
bool webapi_ready(void);                // returns true when the emulator is ready to start a debug session after rebooting
bool webapi_load(void* ptr, int size);  // load binary data into the emulator
void webapi_dbg_add_breakpoint(uint16_t addr);    // add a debug breakpoint at a 16-bit address
void webapi_dbg_remove_breakpoint(uint16_t addr); // delete a debug breakpoint at a 16-bit address
void webapi_dbg_break(void);            // break into the debugger
void webapi_dbg_continue(void);         // continue execution when stopped in debugger
void webapi_dbg_step_next(void);        // execute a 'step over' in the debugger
void webapi_dbg_step_into(void);        // execute a 'step into' in the debugger
uint16_t* webapi_dbg_cpu_state(void);   // request a raw 'CPU state' dump (current register values)
webapi_dasm_line_t* webapi_dbg_request_disassembly(/*...*/); // request a disassembly dump over a range of addresses
uint8_t* webapi_dbg_read_memory(uint16_t addr, int num_bytes); // request a memory dump over a range of addresses
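
webapi_alloc() and webapi_free() exist so that Javascript can hand binary data to webapi_load(). The general pattern is to allocate a block on the WASM heap, copy the bytes in through a heap view and pass the pointer along. A sketch of that pattern, not the actual shim code: loadBytes is a made-up name, it assumes that the module object exposes the HEAPU8 view (which may need to be exported explicitly in recent Emscripten versions), and it assumes that webapi_load() is done with the data once it returns:

```javascript
// Sketch: copy a Uint8Array into the WASM heap and pass it to webapi_load().
// 'mod' is expected to look like the Emscripten Module object.
function loadBytes(mod, bytes) {
    const ptr = mod._webapi_alloc(bytes.length);
    try {
        // HEAPU8 is a byte view on the WASM memory, ptr is an offset into it;
        // the view is looked up after the allocation in case the heap has grown
        mod.HEAPU8.set(bytes, ptr);
        return Boolean(mod._webapi_load(ptr, bytes.length));
    } finally {
        // always release the temporary heap block again
        mod._webapi_free(ptr);
    }
}

// stand-in module for illustration: a 64-byte 'heap' and a trivial allocator
const fake = {
    HEAPU8: new Uint8Array(64),
    loaded: null,
    freed: false,
    _webapi_alloc: (size) => 16,
    _webapi_free: (ptr) => { fake.freed = true; },
    _webapi_load: (ptr, size) => {
        fake.loaded = Array.from(fake.HEAPU8.subarray(ptr, ptr + size));
        return 1;
    },
};
console.log(loadBytes(fake, new Uint8Array([1, 2, 3])), fake.loaded);
```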

In the opposite direction (from the emulator to the VSCode debug session), the emulator calls into the following C callback functions, which in turn call into Javascript to create a JSON-like message object to send back into the VSCode debug session:

void webapi_event_stopped(int stop_reason, uint16_t addr);    // debugger has stopped at addr for a specific reason
void webapi_event_continued(void);                            // the debugger has continued execution
void webapi_event_reboot(void);                               // the emulator has been rebooted
void webapi_event_reset(void);                                // the emulator has been reset
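
On the Javascript side, such an event callback ends up as a small function that packs its arguments into a message object and posts it to the extension. In a VSCode webview the posting is done through the object returned by acquireVsCodeApi(). The following is only a sketch: the function name and the message fields are guesses for illustration and not the actual KC IDE message format:

```javascript
// Hypothetical JS counterpart of webapi_event_stopped(): builds a message
// object and hands it to a 'post' function. In a real webview 'post' would
// wrap something like: const vscode = acquireVsCodeApi(); vscode.postMessage(msg);
function makeStoppedEvent(post) {
    return (stopReason, addr) => {
        // addresses are 16-bit in these emulators, so mask off anything above
        post({ command: 'emu_stopped', stopReason: stopReason, addr: addr & 0xFFFF });
    };
}

// usage with an array standing in for the message channel
const outbox = [];
const onStopped = makeStoppedEvent((msg) => outbox.push(msg));
onStopped(1, 0x0200);
console.log(outbox);
```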

…in a nutshell, this is the minimal ‘virtual machine’ interface required to implement a somewhat feature-complete VSCode Debug Adapter.

One downside of the Debug Adapter Protocol is that it is clearly designed towards high level languages, and the protocol feature set has little overlap with debugging features that are desired in an emulator virtual machine.

But thankfully, the Debug Adapter Protocol is also flexible enough that it can work side by side with the much more powerful debugger that’s already integrated in the chips-emulators via Dear ImGui:

…for instance, the embedded Dear ImGui debugger allows stepping the emulator forward in single clock cycles, while the VSCode debugger only steps at instruction or source line granularity.

Known Issues and future updates

There’s a couple of issues which are currently worked around or don’t work at all, and which I want to fix in future updates (most of those are only an issue in the VSCode web version, so not exactly show stoppers):

Hopefully the VSCode WASI extension will go out of pre-release-only mode sooner rather than later, at that point I can also move the KC IDE extension out of pre-release. The problem is that installing a VSCode extension which depends on a pre-release-only extension fails to install the dependency with a cryptic error message. Worst case is that I need to implement my own VSCode WASI runtime, or figure out another way to run the assembler inside VSCode (maybe as a regular WASM blob which replaces the C stdlib IO calls with asynchronous functions with completion callbacks, delegated to Javascript).
Currently, any binary-blob data that needs to be transferred from VSCode into the emulator needs to go through a base64-encoded string which is expensive to encode and decode. The reason for that hack is that transferring Uint8Array objects doesn’t work when VSCode is running in the web (it’s supposed to work, but the data gets corrupted).
Working directly on Github repositories in the VSCode web version doesn’t work (weird virtual filesystem issues).
…and of course some sort of Language-Server-like editing experience (proper code completion and error squiggles while typing), but without implementing a full-blown language server.
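
The base64 workaround from the list above boils down to a pair of conversion functions, one on each side of the message channel. A minimal sketch using the standard btoa()/atob() functions (the function names are made up, and for large blobs a chunked or native encoder would probably be preferable):

```javascript
// Uint8Array -> base64 string (sender side)
function bytesToBase64(bytes) {
    let binary = '';
    for (const b of bytes) {
        binary += String.fromCharCode(b);
    }
    return btoa(binary);
}

// base64 string -> Uint8Array (receiver side)
function base64ToBytes(str) {
    const binary = atob(str);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return bytes;
}

const encoded = bytesToBase64(new Uint8Array([0x00, 0x7F, 0xFF]));
const decoded = base64ToBytes(encoded);
console.log(encoded, decoded);
```

Besides the encoding and decoding work, the base64 string is also about a third larger than the raw data, since every 3 bytes turn into 4 characters.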

WASM Debugging with Emscripten and VSCode

Sat, 11 Nov 2023 00:00:00 +0000

TL;DR: glueing together VSCode, CMake and the Emscripten SDK to enable an IDE-like workflow (including debugging).

17-Nov-2024: looks like the problem that ‘early breakpoints’ are not caught is fixed, woohoo!

09-Oct-2024: updated for the latest sokol_gfx.h and VSCode extension versions.

This is written from the perspective of a UNIX-like OS (macOS or Linux), but should also work on Windows with some minor tweaks.

Prerequisites

First make sure that the following tools are in the path:

git
cmake
ninja

You’ll also need VSCode and Chrome installed.

On macOS I’d recommend using Homebrew and on Windows Scoop to install those. On Linux of course, your system’s standard package manager.

Emscripten Hello World

Let’s start from scratch. On the command line:

mkdir hello
cd hello
git init

Add a .gitignore file:

.gitignore

build/
emsdk/

Install the Emscripten SDK, we’ll do so in a way that it doesn’t leave a trace on your system when deleted so don’t worry. Still inside the hello directory:

git clone --depth=1 https://github.com/emscripten-core/emsdk
cd emsdk
./emsdk install latest
./emsdk activate --embedded latest
cd ..

Don’t forget the ./emsdk activate --embedded latest step! (happens to me all the time)

…let’s check if that worked. Create a hello.c source file in the hello project directory:

hello.c

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

…compile that into a .wasm/.js pair runnable with node.js:

emsdk/upstream/emscripten/emcc hello.c -o hello.js

…there should be a hello.js and hello.wasm file now:

ls
emsdk      hello.c    hello.js   hello.wasm

…run the hello.js file via node.js (depending on the emsdk version the path may differ):

emsdk/node/18.20.3_64bit/bin/node hello.js

…you should see a Hello World! printed to the terminal.

Delete the compiler output, we don’t need that anymore:

rm hello.js hello.wasm

CMake + Emscripten

Let’s bake the build process into a cmake file. Create a CMakeLists.txt file in the hello project directory:

CMakeLists.txt

cmake_minimum_required(VERSION 3.21)
project(hello)
add_executable(hello hello.c)
if (CMAKE_SYSTEM_NAME STREQUAL Emscripten)
    set(CMAKE_EXECUTABLE_SUFFIX .js)
endif()

…and since this is a cross-compilation scenario, let’s also create a CMakeUserPresets.json file. This simplifies calling cmake with the right arguments for cross-compilation, and will help us later when integrating with VSCode:

CMakeUserPresets.json

{
    "version": 3,
    "cmakeMinimumRequired": {
        "major": 3,
        "minor": 21,
        "patch": 0
    },
    "configurePresets": [
        {
            "name": "default",
            "displayName": "Emscripten",
            "binaryDir": "build",
            "generator": "Ninja Multi-Config",
            "toolchainFile": "emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake"
        }
    ],
    "buildPresets": [
        {
            "name": "Debug",
            "configurePreset": "default",
            "configuration": "Debug"
        },
        {
            "name": "Release",
            "configurePreset": "default",
            "configuration": "Release"
        }
    ]
}

…let’s configure and build with cmake:

cmake --preset default -B build
cmake --build build --preset Debug

…and run with node.js:

emsdk/node/18.20.3_64bit/bin/node build/Debug/hello.js

…this should again print Hello World!.

VSCode + CMake + Emscripten

Let’s integrate what we have so far with VSCode!

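You’ll need the following VSCode extensions (these are the ones referred to in the rest of this post):

CMake Tools
C/C++
WASM DWARF Debugging
Live Preview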

…with those installed, start VSCode from within the hello project directory:

code .

You should see something like this, pay attention to the status bar at the bottom (underlined in red), these items are used to control the cmake build config and target:

(NOTE 09-Oct-2024: the underlined items in the bottom bar have moved into the CMake Tools sidepanel in recent versions).

Clicking those allows you to select a Configure- and Build-Preset, and a build target.

Change those so that it looks like this:

Here we also encounter the first wart, the CMake Tools extension isn’t able to communicate the correct Emscripten sysroot include paths over to the C/C++ extension. You’ll see an error squiggle under the stdio.h include path:

I haven’t found a solution to this problem, it looks like a bug in the CMake Tools extension. Annoying for sure, but not a showstopper, because only Intellisense is affected, building should work fine.

You can test that by pressing F7, or run the palette command CMake: Build. You should see something like this in the VSCode Output panel:

[main] Building folder: hello
[build] Starting build
[proc] Executing command: /opt/homebrew/bin/cmake --build /Users/floh/scratch/hello/build --config Debug --target hello
[build] [1/2] Building C object CMakeFiles/hello.dir/Debug/hello.c.o
[build] [2/2] Linking C executable Debug/hello.js
[driver] Build completed: 00:00:00.361
[build] Build finished with exit code

Debugging

…next let’s make debugging work!

Create a launch.json file in the .vscode subdirectory:

.vscode/launch.json

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "node",
            "request": "launch",
            "program": "build/Debug/${command:cmake.launchTargetFilename}",
        }
    ]
}

Pressing F5 should now work. You should see a Hello World! in the VSCode Debug Panel.

But when trying to debug there’s the next wart. Try to set a breakpoint in the C source code:

Now hit F5. We’d expect that the execution stops at the breakpoint, but that doesn’t happen.

This is a known issue in the DWARF debugging extension. From the documentation:

Breakpoints in WebAssembly code are resolved asynchronously, so breakpoints hit early on in a program’s lifecycle may be missed. There are plans to fix this in the future. If you’re debugging in a browser, you can refresh the page for your breakpoint to be hit. If you’re in Node.js, you can add an artificial delay, or set another breakpoint, after your WebAssembly module is loaded but before your desired breakpoint is hit.

Hopefully this problem will be fixed soon-ish, since it’s currently the most annoying.

One workaround is to first set a breakpoint in the Javascript launch file at a point where the WASM blob has been loaded.

Load the file build/Debug/hello.js into the editor, search the function callMain, and set a breakpoint there:

Press F5 and execution should stop at that breakpoint. Now press F5 again and execution should stop in the C code’s main() function (assuming that breakpoint is still set):

Yay. This is how debugging works for a Node.js Emscripten application.

Moving into the web browser

Let’s extend our hello.c to render something in WebGL2.

Clone the sokol headers into the hello project directory and copy some headers up into the project:

git clone --depth=1 https://github.com/floooh/sokol
cp sokol/sokol_gfx.h sokol/sokol_app.h sokol/sokol_log.h sokol/sokol_glue.h .

…delete the sokol directory since we don’t need it anymore:

rm -rf sokol

Replace the hello.c file with the following code which just clears the canvas with a dynamically changing color:

hello.c

#define SOKOL_IMPL
#define SOKOL_GLES3
#include "sokol_gfx.h"
#include "sokol_app.h"
#include "sokol_log.h"
#include "sokol_glue.h"

static sg_pass_action pass_action;

static void init(void) {
    sg_setup(&(sg_desc){
        .environment = sglue_environment(),
        .logger.func = slog_func,
    });
    pass_action = (sg_pass_action) {
        .colors[0] = {
            .load_action = SG_LOADACTION_CLEAR,
            .clear_value = { 1.0f, 0.0f, 0.0f, 1.0f }
        }
    };
}

static void frame(void) {
    float g = pass_action.colors[0].clear_value.g + 0.01f;
    pass_action.colors[0].clear_value.g = (g > 1.0f) ? 0.0f : g;
    sg_begin_pass(&(sg_pass){ .action = pass_action, .swapchain = sglue_swapchain() });
    sg_end_pass();
    sg_commit();
}

static void cleanup(void) {
    sg_shutdown();
}

sapp_desc sokol_main(int argc, char* argv[]) {
    (void)argc; (void)argv;
    return (sapp_desc){
        .init_cb = init,
        .frame_cb = frame,
        .cleanup_cb = cleanup,
        .window_title = "clear",
        .icon.sokol_default = true,
        .logger.func = slog_func,
    };
}

…we’ll also need to make a few changes to our CMakeLists.txt file. Emscripten needs to know that we want a program that runs in the browser. To do that we’ll simply change the executable file extension to .html. Next we need to tell Emscripten to link with WebGL2.

Open the CMakeLists.txt file and change the Emscripten if-block like this:

CMakeLists.txt

if (CMAKE_SYSTEM_NAME STREQUAL Emscripten)
    set(CMAKE_EXECUTABLE_SUFFIX .html)
    target_link_options(hello PUBLIC -sUSE_WEBGL2=1)
endif()

In VSCode press F7 to rebuild the program. This should generate three output files in the build/Debug directory:

hello.html
hello.js
hello.wasm

Let’s try to run that in the browser. On the command line in the project directory:

emsdk/upstream/emscripten/emrun build/Debug/hello.html

This should open the system’s default web browser and you should see something like this, with the orange rectangle cycling between yellow and red:

…let’s get rid of the ‘window chrome’ by injecting our own minimal shell.html file.

In the project directory, create a file shell.html looking like this:

shell.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta charset="UTF-8"/>
<title>Clear</title>
<style type="text/css">
.game {
    position: absolute;
    top: 0px;
    left: 0px;
    margin: 0px;
    border: 0;
    width: 100%;
    height: 100%;
    overflow: hidden;
    display: block;
    image-rendering: optimizeSpeed;
    image-rendering: -moz-crisp-edges;
    image-rendering: -o-crisp-edges;
    image-rendering: -webkit-optimize-contrast;
    image-rendering: optimize-contrast;
    image-rendering: crisp-edges;
    image-rendering: pixelated;
    -ms-interpolation-mode: nearest-neighbor;
}
</style>
</head>
<body style="background:black">
  <canvas class="game" id="canvas" oncontextmenu="event.preventDefault()"></canvas>
  <script type="text/javascript">
    var Module = {
        preRun: [],
        postRun: [],
        print: (function() {
            return function(text) {
                text = Array.prototype.slice.call(arguments).join(' ');
                console.log(text);
            };
        })(),
        printErr: function(text) {
            text = Array.prototype.slice.call(arguments).join(' ');
            console.error(text);
        },
        canvas: (function() {
            var canvas = document.getElementById('canvas');
            canvas.addEventListener("webglcontextlost", function(e) { alert('FIXME: WebGL context lost, please reload the page'); e.preventDefault(); }, false);
            return canvas;
        })(),
        setStatus: function(text) { },
        monitorRunDependencies: function(left) { },
    };
    // window.onerror is called with the error message string as first argument
    window.onerror = function(message) {
        console.log("onerror: " + message);
    };
  </script>
  {{{ SCRIPT }}}
</body>
</html>

…and in the CMakeLists.txt file, change the linker options like this:

CMakeLists.txt

    target_link_options(hello PUBLIC -sUSE_WEBGL2=1 --shell-file=../shell.html)

…build the project again by pressing F7 and try opening the result in the browser:

emsdk/upstream/emscripten/emrun build/Debug/hello.html

…the WebGL canvas should now stretch over the entire window client area:

Browser Remote Debugging

Now on to the last step: making remote debugging work!

First, .vscode/launch.json needs to be changed to start a Chrome remote debug session and a local web server:

.vscode/launch.json

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "chrome",
            "request": "launch",
            "url": "http://localhost:3000/build/Debug/${command:cmake.launchTargetFilename}",
            "preLaunchTask": "StartServer",
        }
    ]
}

…note the preLaunchTask, this will start a web server using the Live Preview VSCode extension.

To define the StartServer task, create a file .vscode/tasks.json and populate it like this:

.vscode/tasks.json

{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "StartServer",
            "type": "process",
            "command": "${input:startServer}"
        }
    ],
    "inputs": [
        {
            "id": "startServer",
            "type": "command",
            "command": "livePreview.runServerLoggingTask"
        }
    ]
}

…and that’s it!

When pressing F5, Chrome should now open and load our program:

…while the program is running in the browser, set a breakpoint in hello.c at the start of function void frame(void). The debugger should now stop at the function and you can step through the code:

And that’s it! You can now make changes to your code and then compile and run/debug with F5. The only downside is the known issue that early breakpoints are not caught. There are two workarounds. The first is the one already mentioned: set a breakpoint on the JS side in the callMain function in build/Debug/hello.js and stop on that first. This seems to catch any early breakpoints on the C side too.

The second option for programs with a render loop is to simply restart the debug session by pressing the ‘Refresh’ button in the VSCode debugger controls:

This will also catch early breakpoints on the C side, but will pop up a warning that the ‘Live Preview…’ task is already running. This can simply be ignored.

You can also find the project described in this blog post on Github:

https://github.com/floooh/vscode-emscripten-debugging

Known Issues

The list of issues I stumbled over, hopefully those will be fixed in the future:

The CMake Tools extension doesn’t properly communicate the Emscripten system include path to the C/C++ extension so that Intellisense doesn’t work for system headers.
The WASM DWARF debugging extension doesn’t catch early breakpoints on the C side (known issue).
The Live Preview extension pops up a warning when refreshing a debug session.
