
In production game engines, scenes are far more complicated, with gameplay logic, physics, and more that all need to be loaded. It's not uncommon to see editor level loads take more than 30 minutes, at which point you can't afford to restart your application every time you need to tweak a shader. Enter shader hot reloading!
For this post we'll focus on compiling our shaders with the DirectX Shader Compiler on DX12. There is an older shader compiler called FXC, and you can ALSO do hot shader reloading through its D3DCompileFromFile API. However, if you want support for raytracing and other modern shader compiler features on DX12, you'll need to use the DirectX Shader Compiler (also called DXC).
Doing this is relatively straightforward:
The end result is something like this, where you can live edit shaders in your text editor of choice and then hit a recompile button that reloads the shader live, without needing to restart your app or load up the level:

One quick note: if you want to live edit bindings in the shader, you'll also need to update your root signature and binding code, which takes extra work not covered here.
In order to call DXC at runtime you’ll need to do a few things:
Technically you can build your own copy of DXC from its GitHub page. Unless you have very specific needs I don't recommend it, as you'll bump into issues with shader signing.
Luckily the code is actually quite small; here's a direct excerpt from my code. For context, my project is a path tracer where everything is done from a single shader that's stored in m_pRayTracingPSO. In all likelihood your project is more complicated, but it does mean we can reference an extremely isolated piece of code that's simple enough you can probably copy-paste it and get it working for you with some tweaks. Hopefully the comments speak for themselves.
#include "dxcapi.h"
#define HANDLE_FAILURE() assert(false);
#define VERIFY(x) if(!(x)) HANDLE_FAILURE();
#define VERIFY_HRESULT(x) VERIFY(SUCCEEDED(x))
void TracerBoy::RecompileShaders()
{
    LPCWSTR ShaderFile = L"..\\..\\TracerBoy\\RaytraceCS.hlsl";

    // Load up the compiler
    ComPtr<IDxcCompiler3> pCompiler;
    VERIFY_HRESULT(DxcCreateInstance(CLSID_DxcCompiler, IID_PPV_ARGS(pCompiler.GetAddressOf())));

    // Load up DxcUtils, not strictly necessary but has lots of useful helper functions
    ComPtr<IDxcUtils> pUtils;
    VERIFY_HRESULT(DxcCreateInstance(CLSID_DxcUtils, IID_PPV_ARGS(pUtils.GetAddressOf())));

    // Load the HLSL file into memory
    UINT32 CodePage = DXC_CP_UTF8;
    ComPtr<IDxcBlobEncoding> pSourceBlob;
    VERIFY_HRESULT(pUtils->LoadFile(ShaderFile, &CodePage, pSourceBlob.GetAddressOf()));

    // The include handler will handle the #include's in your HLSL file so that
    // you don't need to handle bundling them all into a single blob. Technically
    // the include handler is optional, but without one the compile will fail on
    // any file that uses #include.
    ComPtr<IDxcIncludeHandler> pDefaultIncludeHandler;
    VERIFY_HRESULT(pUtils->CreateDefaultIncludeHandler(pDefaultIncludeHandler.GetAddressOf()));

    DxcBuffer SourceBuffer;
    BOOL unused;
    SourceBuffer.Ptr = pSourceBlob->GetBufferPointer();
    SourceBuffer.Size = pSourceBlob->GetBufferSize();
    pSourceBlob->GetEncoding(&unused, &SourceBuffer.Encoding);

    // Compiler args where you specify the entry point of your shader and what
    // kind of shader it is (i.e. compute/pixel/vertex/etc)
    ComPtr<IDxcCompilerArgs> pCompilerArgs;
    VERIFY_HRESULT(pUtils->BuildArguments(ShaderFile, L"main", L"cs_6_5", nullptr, 0, nullptr, 0, pCompilerArgs.GetAddressOf()));

    // Finally, time to compile!
    ComPtr<IDxcResult> pCompiledResult;
    HRESULT hr = pCompiler->Compile(&SourceBuffer, pCompilerArgs->GetArguments(), pCompilerArgs->GetCount(), pDefaultIncludeHandler.Get(), IID_PPV_ARGS(pCompiledResult.GetAddressOf()));
    if (SUCCEEDED(hr))
    {
        // Compile errors are all put in a buffer. Warnings will also be output here,
        // so don't assume that the existence of an error buffer means the shader
        // compile failed
        ComPtr<IDxcBlobEncoding> pErrors;
        if (SUCCEEDED(pCompiledResult->GetErrorBuffer(pErrors.GetAddressOf())))
        {
            std::string ErrorMessage = std::string((const char*)pErrors->GetBufferPointer());
            OutputDebugStringA(ErrorMessage.c_str());
        }

        ComPtr<IDxcBlob> pOutput;
        if (SUCCEEDED(pCompiledResult->GetResult(pOutput.GetAddressOf())))
        {
            // Take the compiled shader code and compile it into a PSO
            D3D12_COMPUTE_PIPELINE_STATE_DESC psoDesc = {};
            psoDesc.pRootSignature = m_pRayTracingRootSignature.Get();
            psoDesc.CS = CD3DX12_SHADER_BYTECODE(pOutput->GetBufferPointer(), pOutput->GetBufferSize());

            // Check that the PSO compile worked, failure here could mean issues with
            // root signature mismatches or you're using features that the driver
            // doesn't support
            ComPtr<ID3D12PipelineState> pNewRaytracingPSO;
            if (SUCCEEDED(m_pDevice->CreateComputePipelineState(&psoDesc, IID_GRAPHICS_PPV_ARGS(pNewRaytracingPSO.ReleaseAndGetAddressOf()))))
            {
                // Overwrite the old PSO. MAKE SURE THE OLD PSO ISN'T BEING REFERENCED ANYMORE!!
                // Particularly ensure the GPU isn't running any shaders that are referencing it.
                m_pRayTracingPSO = pNewRaytracingPSO;
            }
        }
    }
}

After that you're done. Hook that up to some UI (might I suggest ImGui?) and reload away!
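If you'd rather have the recompile trigger automatically on save instead of (or in addition to) a UI button, a minimal polling watcher on the shader file's timestamp works fine. This is just a sketch: `ShaderFileWatcher` and its usage are my own illustrative names, not part of the excerpt above.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <utility>

// Minimal polling watcher: call HasChanged() once per frame (or from a UI
// handler) and kick off a recompile when it returns true.
class ShaderFileWatcher
{
public:
    explicit ShaderFileWatcher(std::filesystem::path path)
        : m_path(std::move(path))
        , m_lastWrite(std::filesystem::last_write_time(m_path))
    {
    }

    // Returns true exactly once per modification of the watched file.
    bool HasChanged()
    {
        auto writeTime = std::filesystem::last_write_time(m_path);
        if (writeTime == m_lastWrite)
            return false;
        m_lastWrite = writeTime;
        return true;
    }

private:
    std::filesystem::path m_path;
    std::filesystem::file_time_type m_lastWrite;
};
```

In a real engine you'd likely use an OS file-change notification (e.g. directory watch APIs) instead of polling, but polling once per frame is plenty for a hobby renderer.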
For more interesting DXC info:
Raw path traced output at 8 samples per pixel on the left and same output but denoised using Machine Learning on the right
Machine Learning (ML) is everywhere! Even on the real-time/gaming side of things, ML is becoming an increasingly important part of the future, with existing techniques like DLSS beating out top non-ML techniques in visual quality while still being fast enough to run at 60 FPS (or even framegen up to 120!). Despite that, most folks, even those with a deep understanding of how GPUs work, don't have much intuition for what these ML-based GPU workloads are actually doing.
This post aims to provide just enough knowledge to make machine learning feel a little less like black magic for rendering engineers, and to begin developing a vague understanding of what is actually going on when we talk about tech like DLSS, Open Image Denoise, XeSS, etc. To do that, I will narrow in on a single case study: Open Image Denoise. I am far from an expert; in fact, I'd consider myself a beginner when it comes to machine learning. Almost all of my knowledge stems from some recent work I did to port Open Image Denoise to DirectML in my hobby path tracer.
Let me break down the two things I just mentioned, since they'll be the focus of the rest of this post: Open Image Denoise and DirectML.
Open Image Denoise is an open-source, ML-based denoiser from Intel that's used by a lot of the big offline renderers like RenderMan. I'll refer to Open Image Denoise interchangeably with its acronym, OIDN. It is not intended for real-time use in games the way DLSS or XeSS are. That said, it has been optimized to the point that on extremely high-end GPUs it can be quite fast, which allows it to be used for quick interactive preview renders. While very little has been disclosed about XeSS/DLSS, we can take some of the inner workings of OIDN and assume some of it carries over to how those techniques work, despite their very different goals.
DirectML is the DirectX 12 API for abstracting machine learning operations, akin to how DirectX Raytracing abstracts the hardware implementation of raytracing. There's one important distinction here, however: DirectML allows these ML operators to run even on hardware that doesn't have ML acceleration by emulating the operation with a compute shader. This means everything I talk about in this post can run on basically anything that can read/write a texture and do some multiplies, whether that's the CPU or a potato GPU. In fact, I can even run my DirectML port on my laptop with a 5-year-old integrated GPU:

Finally, one last thing before we get into the tech details. We can break machine learning down into 2 big categories: learning and inferencing. This post will only cover inferencing (specifically inferencing on the GPU).
"Learning" is an extremely expensive operation where we feed the ML model loads of input data and try to get it to "learn" to do something. In the case of Open Image Denoise, we feed in a bunch of path traced images with an extremely low sample-per-pixel count (let's say 1-4 rays per pixel) along with matching images that are nearly converged (imagine >1024 samples per pixel).

Stairway rendered with 8 spp

Stairway rendered with 1024 spp
We then ask it to essentially try a bunch of random things to see if it can transform the low sample per pixel images to the converged images. With each attempt it can grade itself by comparing what it output to the converged image. There’s a ton of linear algebra that can help it iteratively learn in a way that’s not just randomly trying things. When it’s done, you’ll get a big list of weights. Later on I’ll talk about how that comes into play.
Now that the learning is done, we can do inferencing, which is MUCH faster: it's basically just feeding data through the model that was created by the "learning" process. Again using Open Image Denoise as an example, we'd feed in a noisy path traced image (likely one it's never seen before), and it would use its learned results to produce a denoised image. This inferencing process is generally what runs on consumer hardware, so optimizing it is important, which is why GPU inferencing comes into play. This process of taking the noisy image and denoising it is what we're going to dig into.

Stairway rendered with 8 spp but denoised with OIDN
Because I've ported Open Image Denoise to DirectML, which uses DX12 under the hood, we can break this down the same way folks break down a frame in video games: with a GPU capture! I've captured a frame in Pix, let's take a look:

Pix capture of the PBRT kitchen scene ran on an RTX 3070 at 2560x1440
At first glance, there's a LOT going on here. Most non-ML-based denoisers, like SVGF, would only have around 5-6 operations. And there are a whole lot of these "DML_OPERATOR" things. But if you take a step back, you'll notice there are only 6 unique things happening here:
What I hope to show is that each one of these steps is actually very simple and understandable in terms of what the GPU is doing. Each of these "operators" is really just a compute shader, and as you'll see, a fairly simple one. Once you understand all 6 of them, you'll understand all the building blocks of a production-grade ML denoiser.
Let's start with the conversion from render targets to a tensor (and vice versa). What is a tensor? In general it's a broad N-dimensional math object, but in the context of Open Image Denoise we can constrain the definition a little bit: essentially, tensors are textures that can have an arbitrary number of channels. By default when we think of a texture we're talking 3 channels: red, green, and blue. However, there are plenty of scenarios in games where we might render out a LOT of channels. Think of something like a G-buffer with normals, albedo, roughness, metalness, etc. Normally to support this you use 3-4 individual render targets, each with 3-4 channels.

Example of GBuffers from Gears 5: Hivebusters. Left to right: Normals, Albedo, and Material Properties
However, imagine you could represent all of that with a single texture with 12+ channels. That would be neat! Let me immediately burst your bubble: modern GPUs don't support anything like this. Why do we care so much about a large channel count? Well, later on we'll see passes in OIDN that require up to 112 channels in their output. Because there's no built-in hardware texture format that allows this, the solution in DirectML is to lay these textures out linearly in a buffer and track the channel layout ourselves. We lose the opportunity for hardware/drivers to swizzle and arrange channels in a more cache-friendly manner, but in exchange we get the flexibility of storing as many channels as we want.
And so the resulting compute shader to convert from a 3 channel 2D texture to a tensor is about the most boring shader you can imagine:
// The constant buffer declaration is inferred from usage; in my code it also
// carries other values not needed here.
struct TextureToTensorConstants
{
    uint2 OutputResolution;
};
ConstantBuffer<TextureToTensorConstants> Constants : register(b0);

Texture2D<float4> inputImage : register(t0);
RWBuffer<half> outputTensor : register(u0);

[numthreads(DIRECTML_THREAD_GROUP_WIDTH, DIRECTML_THREAD_GROUP_HEIGHT, 1)]
void TextureToTensor( uint3 DTid : SV_DispatchThreadID )
{
    if (DTid.x >= Constants.OutputResolution.x || DTid.y >= Constants.OutputResolution.y)
        return;

    uint index = DTid.y * Constants.OutputResolution.x + DTid.x;
    float3 val = inputImage[DTid.xy].xyz;
    outputTensor[index * 3] = val.x;
    outputTensor[index * 3 + 1] = val.y;
    outputTensor[index * 3 + 2] = val.z;
}
I've omitted one big topic from the shader: layouts. Because the driver no longer knows about the texture layout, it has lost the ability to intelligently pick how to pack the channels. For example, you could pack each channel in its own slice, or place all of a pixel's channels next to each other. Depending on many factors, one layout may be better than the others. You'll hear terms like NCHW or NHWC, which refer to exactly this. Just know that this exists and that it can have a big impact on performance (I saw a ~3x difference trying different layouts).
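To make the layout difference concrete, here's a sketch (my own helper functions, CPU-side) of how the linear buffer index is computed for the two common layouts, for a single image (batch size 1):

```cpp
#include <cassert>
#include <cstddef>

// Linear index of element (channel c, row y, column x) in a flat buffer
// holding a C-channel, H-by-W tensor.

// NCHW: planar layout -- all of channel 0's pixels, then all of channel 1's, etc.
size_t IndexNCHW(size_t c, size_t y, size_t x, size_t C, size_t H, size_t W)
{
    (void)C; // channel count doesn't affect the stride in this layout
    return c * H * W + y * W + x;
}

// NHWC: interleaved layout -- all channels of pixel (0,0), then pixel (0,1), etc.
// (This is what the TextureToTensor shader above effectively writes.)
size_t IndexNHWC(size_t c, size_t y, size_t x, size_t C, size_t H, size_t W)
{
    (void)H; // height doesn't affect the stride in this layout
    return (y * W + x) * C + c;
}
```

The same element lands in very different places, which is why one layout can be far more cache-friendly than the other for a given access pattern.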
Okay! So we have a rough idea of what tensors are now. Every single operation in DirectML results in a single output tensor that is used as input to the next DirectML operation.
To convert from a tensor back to a texture is equally simple. However, I've snuck in some extra functionality we'll find useful here. If we were only ever converting from a 3-channel tensor, this shader would be just as simple as the previous one, and indeed the final output of OIDN is a 3-channel tensor that we just need to convert back into a texture. But we may want to visualize some of the intermediate tensor outputs, and, going further, visualize different channels within a tensor. To handle that, I've added a "SliceToOutput" constant that lets you specify which 3 channels to output.
// Constant buffer declaration inferred from usage, as before.
struct TensorToTextureConstants
{
    uint2 OutputResolution;
    uint2 InputResolution;
    uint InputChannelDepth;
    uint SliceToOutput;
};
ConstantBuffer<TensorToTextureConstants> Constants : register(b0);

Buffer<half> InputTensor : register(t0);
RWTexture2D<float4> OutputTexture : register(u0);

[numthreads(DIRECTML_THREAD_GROUP_WIDTH, DIRECTML_THREAD_GROUP_HEIGHT, 1)]
void main( uint3 DTid : SV_DispatchThreadID )
{
    if (DTid.x >= Constants.OutputResolution.x || DTid.y >= Constants.OutputResolution.y)
        return;

    // Not handling different size inputs for simplicity
    uint index = DTid.y * Constants.InputResolution.x + DTid.x;
    uint channelOffset = Constants.SliceToOutput * 3;

    float4 color;
    color.r = InputTensor[index * Constants.InputChannelDepth + channelOffset + 0];
    color.g = InputTensor[index * Constants.InputChannelDepth + channelOffset + 1];
    color.b = InputTensor[index * Constants.InputChannelDepth + channelOffset + 2];
    color.a = 1.0f;
    OutputTexture[DTid.xy] = color;
}
Okay, now with the background on tensors out of the way, as well as a way to visualize any output tensor in OIDN, we have all the tools we need to visually step through each OIDN operation. Let's take a look:
Gallery containing the output of each pass along with its corresponding resolution and channel count.
It's worth briefly clicking through the gallery above just to get a sense of what's going on. I can only show 3 channels at a time, so this is not an exhaustive visual of the whole operation. Also, this only shows the outputs of convolution passes; I'll break it down later, but the outputs of a pooling/upsampling/join pass aren't visually interesting so I've skipped those.
A couple of observations we can make just going through all the passes:
Okay, let's go through the operators now.
This one is easy: it's simply a downsample. When downsampling to half resolution, it takes 4 pixels and combines them into a single pixel. Normally in rendering you would take the average of the 4, but Open Image Denoise instead takes the highest value in each channel (i.e. just the max).
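As a CPU-side sketch (my own helper, single channel for clarity), the pooling operator boils down to this:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 2x2 max-pool of a single-channel, row-major image: halves each dimension by
// taking the max of every 2x2 block (assumes even width/height).
std::vector<float> MaxPool2x2(const std::vector<float>& input, int width, int height)
{
    std::vector<float> output((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
    {
        for (int x = 0; x < width / 2; ++x)
        {
            float m = input[(2 * y) * width + (2 * x)];
            m = std::max(m, input[(2 * y) * width + (2 * x + 1)]);
            m = std::max(m, input[(2 * y + 1) * width + (2 * x)]);
            m = std::max(m, input[(2 * y + 1) * width + (2 * x + 1)]);
            output[y * (width / 2) + x] = m;
        }
    }
    return output;
}
```

The real operator does this independently for every channel of the tensor; the max just runs per channel.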

This simply upscales the image using nearest-neighbor sampling (i.e. not even bilinear!). In OIDN this just doubles the resolution. I'll say it again: this uses nearest neighbor, so the upsample taken by itself will look quite poor visually. Don't be fooled into thinking there's magic here! Easy enough I think, let's move on.
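Again as a CPU-side sketch (single channel, my own helper), nearest-neighbor doubling is just copying each source pixel into a 2x2 block:

```cpp
#include <cassert>
#include <vector>

// 2x nearest-neighbor upsample of a single-channel, row-major image: each
// output pixel just reads the source pixel at half its coordinates.
std::vector<float> UpsampleNearest2x(const std::vector<float>& input, int width, int height)
{
    std::vector<float> output(width * 2 * height * 2);
    for (int y = 0; y < height * 2; ++y)
        for (int x = 0; x < width * 2; ++x)
            output[y * (width * 2) + x] = input[(y / 2) * width + (x / 2)];
    return output;
}
```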

This appends the channels from one tensor to another. So if we have a tensor with 3 channels and another tensor with 6 channels, the resulting tensor has 9 channels. Some sources also call this a concatenation.
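For an interleaved-channel layout, the join is just a per-pixel copy of both inputs' channels (a sketch with my own helper; both tensors must share the same resolution):

```cpp
#include <cassert>
#include <vector>

// "Join" (concatenation) of two interleaved-channel tensors with identical
// resolution: per pixel, write A's channels followed by B's channels.
std::vector<float> JoinChannels(const std::vector<float>& a, int channelsA,
                                const std::vector<float>& b, int channelsB,
                                int pixelCount)
{
    const int channelsOut = channelsA + channelsB;
    std::vector<float> out(pixelCount * channelsOut);
    for (int p = 0; p < pixelCount; ++p)
    {
        for (int c = 0; c < channelsA; ++c)
            out[p * channelsOut + c] = a[p * channelsA + c];
        for (int c = 0; c < channelsB; ++c)
            out[p * channelsOut + channelsA + c] = b[p * channelsB + c];
    }
    return out;
}
```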

Okay, the final and most important piece to understand. If we go back and look at the frame times of OIDN, convolution essentially dominates the GPU time. Luckily, the idea behind convolutions is fairly simple.
A convolution operator takes a filter matrix and applies it to all neighboring pixels. In the context of Open Image Denoise, the filter is always a 3D matrix. In terms of width and height, all convolution filters (except the last one) are 3x3, and the depth is the number of channels in the input tensor.
Let's take a look at a concrete example. Say you want to detect whether there's a vertical edge in the image. The standard way to do this is with a Sobel filter. Here's the Sobel matrix:
-1, 0, 1
-2, 0, 2
-1, 0, 1
The way this works is fairly straightforward. At a high level, you look at the pixel to your left and the pixel to your right. If they're roughly the same value, they cancel out and you get a value close to 0. If they're different, you get something non-zero, and the further apart they are, the larger the magnitude. We also account for the diagonals, but weight them slightly less.
However, our input is RGB, so we have 3 channels! So what we want is not just a 3x3 matrix, but a 3x3 matrix per channel, i.e. a 3x3x3 matrix. Let's say we want to weight the channels to take human perception into account (the eye tends to be more sensitive to green than blue). The usual weighting looks something like this:
0.3, 0.59, 0.11
We can combine those weights with our Sobel matrix to get our 3x3x3 matrix that we’ll use for our convolution to determine a vertical edge:
-0.3, 0, 0.3 -0.59, 0, 0.59 -0.11, 0, 0.11
-0.6, 0, 0.6 -1.18, 0, 1.18 -0.22, 0, 0.22
-0.3, 0, 0.3 -0.59, 0, 0.59 -0.11, 0, 0.11
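The matrix above is just the 3x3 Sobel matrix scaled per channel slice by the luma weights. A quick sketch of building it in code (my own helper name):

```cpp
#include <cassert>
#include <cmath>

// Build the 3x3x3 vertical-edge filter from the post: each channel's 3x3
// slice is the Sobel matrix scaled by that channel's luma weight (R, G, B).
// Indexing is filter[channel][row][column].
void BuildSobelLumaFilter(float filter[3][3][3])
{
    const float sobel[3][3] = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };
    const float luma[3] = { 0.3f, 0.59f, 0.11f };
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < 3; ++y)
            for (int x = 0; x < 3; ++x)
                filter[c][y][x] = sobel[y][x] * luma[c];
}
```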
Just this simple filter gets us something like this:

Outlines on the left and input image on the right
Okay great, we've got a convolution matrix that can output one feature! However, maybe we want to output another feature… like, say, a horizontal edge? Convolution operators allow a list of these 3x3x3 matrices, each one adding a whole new channel to your output. Maybe you also want a filter that detects whether the pixel is orange-ish, or whether it's a bit blurry. Each of these is an extra 3x3x3 matrix.
Now keep in mind, we’re currently focusing on this very simple case with 3 channels. If your input is 112 channels, then you need a 112x3x3 matrix for each output feature.
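Putting the full operator together, here's a naive CPU sketch (my own helper, planar layout, zero-padded borders) of what a convolution pass computes: for each output filter and each pixel, sum the 3x3 neighborhood across every input channel.

```cpp
#include <cassert>
#include <vector>

// Naive 3x3 convolution. input/output are planar ([channel][y][x], flattened);
// filters has shape [outChannels][inChannels][3][3], flattened. Out-of-bounds
// taps are treated as zero (zero padding).
std::vector<float> Convolve3x3(const std::vector<float>& input, int width, int height,
                               int inChannels, const std::vector<float>& filters,
                               int outChannels)
{
    std::vector<float> output(outChannels * width * height, 0.0f);
    for (int o = 0; o < outChannels; ++o)
    for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
    {
        float sum = 0.0f;
        for (int c = 0; c < inChannels; ++c)
        for (int fy = -1; fy <= 1; ++fy)
        for (int fx = -1; fx <= 1; ++fx)
        {
            int sx = x + fx, sy = y + fy;
            if (sx < 0 || sy < 0 || sx >= width || sy >= height)
                continue; // zero-padded border
            float weight = filters[((o * inChannels + c) * 3 + (fy + 1)) * 3 + (fx + 1)];
            sum += weight * input[(c * height + sy) * width + sx];
        }
        output[(o * height + y) * width + x] = sum;
    }
    return output;
}
```

Real implementations (DirectML included) restructure this heavily for memory access and tensor-core friendliness, but the arithmetic is exactly this.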
Okay, that's all there is to convolution! You now know all there is to know about a single convolution operation in isolation. By itself this might not seem very powerful, but this is where the "deep" part of deep convolutional neural nets comes in: we have many, many layers of these convolution operations, each one building on the previous layer. We were looking at a simple convolution layer that takes in RGB, but the next layer takes in inputs like "is a vertical edge" or "is a horizontal edge". The layer after that, you can imagine, starts to get more vague, like "has lots of edges" or "is blurry". It gets more and more abstract as OIDN goes down 6 layers of convolution. It's hard to say what things "mean" at the bottom of the net, but I'd have to imagine it gets to things like "is skin" or "is caustics-ish".
It's fairly easy to imagine how this plays to a GPU's strengths: you're doing one 3x3 filter per (pixel x input channel x output filter). If we look at the most expensive convolution operation, dec_conv2b, we're running 64 filters on a 1280x720 texture with 64 input channels: about 3.8 billion 3x3 filter applications (or roughly 34 billion multiplies).
So who writes these filters? Well, remember those "weights" I mentioned when I briefly talked about "learning"? You can start to imagine how they fit in. Machine learning engineers design the arrangement of ML passes and the number of outputs from each pass, and then the learning process keeps trying filters, outputting a final image that gets scored against a golden reference (for path tracing this is easy, because we can generate the golden reference using a very large SPP). There's a ton of really interesting linear algebra used to make the learning process more efficient than just trying random filters, but suffice it to say that's out of the scope of this post.
Okay, we have now covered all the building blocks. What I've covered is, exhaustively, all the operations that run on the GPU when running OIDN. This may surprise you, as none of these operations in isolation seem powerful enough to denoise to the degree that OIDN manages. Especially if you've ever written a TAA or denoiser before; that code tends to be filled with tons of special-case handling (responsive masks, anti-ghosting, etc). Here, however, all we're doing is a bunch of convolution filters over and over, mixed in with a few downscales and upscales.
I've clumsily sketched out a big-picture graph of how all these operators come together:

You’ll notice that the flow forms a U-shape. This is because OIDN is based on the concept of a U-Net, which by-and-large means that we can think of the ML execution in 2 steps: an encoding phase and a decoding phase.
In the encoding phase, represented by the left half of the OIDN graph, we reduce the resolution in multiple steps while encoding more and more abstract data into deeper channels. This is done by a series of convolution operators followed by downsample operators. Each convolution operator generates more and more channels (i.e. each has an increasing number of filters). By the time we get to the bottom, we're at 1/16th resolution with 112 channels.
In the decoding phase we do the opposite: we decode these "deep" layers of information while stepping the resolution back up via upsample operators until we're back where we started. Each convolution has fewer and fewer filters, outputting a shrinking set of channels until we're finally back down to our 3 channels (RGB). Notice that each time we increase resolution, we also tack on information from an earlier step via a join operator. This step matters: recall that we dropped all the way to 1/16th resolution at the end of the encoding phase, and that the upsamples are just naive nearest-neighbor upsamples. If we simply upsampled all the way back to full resolution, we'd get a blurry mess. By using the join, the following convolution also gets data at twice the resolution of the thing we just upsampled, and can combine the two to intelligently upsample based on the vast channels of information. I say "intelligently upsample", but on the GPU it's just a ton of multiplies; the smart logic all comes from the weights produced by the learning process.
Okay, that's a wrap. If you were hoping to become a machine learning expert after reading this, uh… this is not the post you were looking for. But if you came in with no idea what machine learning is, hopefully you now have some sense of what's running on your GPU and what work is involved. I narrowed in on a small slice of machine learning workloads, convolutional neural nets, and there's so much more to machine learning, so this is just a small case study. If you're like me, you learn best from this kind of depth-first approach, in contrast to breadth-first, which in machine learning is insanely broad.
If you want to learn more about machine learning in a broad sense, here are a few resources I've found extremely helpful:
Something that comes up often in real-time rendering engines is the need to run a full-screen pixel shader. It's pretty common for screen-space effects; to name the big ones: screen-space reflections and screen-space ambient occlusion. However, in order to render a full-screen pixel shader pass, you need to fill the screen with geometry. And so the big question I want to talk about is…
Should you render with 2 triangles that fit the viewport perfectly?

Or 1 massive triangle that spills over the sides of the viewport?
The blue rectangle represents the viewport. Red is the bounds of the geometry
So before jumping into any theories, let's just do some profiling with Pix!
I'll be testing with the D3D12 Hello World sample from the DirectX Samples. By default it just splats a triangle on the screen, so I made some tweaks so the triangle fills the screen, plus a toggle so I can flip between the 1-triangle version and the 2-triangle version. I'm testing on an RTX 2070 at 512x512 (I'll explain the odd choice of resolution later), but it's worth pointing out that the results we find occur on AMD as well.
With a pixel shader that just outputs a solid color, it's insanely fast: in both cases around 8 microseconds. We can infer a couple of things from this: there's no visible overhead from the clipping happening in the 1-triangle case, and no real overhead from the extra geometry in the 2-triangle case.
So I changed the pixel shader to this silly thing just to "emulate" a heavyweight pixel shader:
float4 PSMain(PSInput input) : SV_TARGET
{
    float r = input.position.x;
    for (uint i = 0; i < 10000; i++)
    {
        r = sin(r);
    }
    return float4(r, 0, 0, 1);
}

And this is where things get interesting, here's the performance (in milliseconds):
1 triangle: 3.33 ms
2 triangles: 3.43 ms
So the 2-triangle case is about 0.1 ms slower. That's a small enough difference that it could be noise, but after re-running it several times, the results are actually pretty consistent. This might seem a little surprising, since we've only tweaked the pixel shader and both cases should result in the same number of pixels being drawn. So what's going on??
In order to answer that, we need to talk about GPU quads. Specifically, there's a quirk of fixed-function rasterization where things always render in batches of 2x2 pixel quads. Even if your triangle is only the size of a single pixel, the GPU will still shade a full 2x2 quad, even though 3 of those pixels don't really exist (those 3 pixels are called "helper pixels"). The primary reason for this is so that the GPU can calculate derivatives to "auto-magically" pick the correct texture mip. Going fully deep on why quads are a thing warrants its own post, but there are some good resources for that, A Trip through the Graphics Pipeline being one of the best.
This silly-seeming quad behavior has all kinds of implications for GPU performance. It's one of the big reasons GPUs handle sub-pixel triangles so poorly (and why things like compute-shader-based rasterization can be so interesting).
So what does all this have to do with rendering 2 full-screen triangles? The problem comes down to the diagonal edge formed between the triangles.
Let's make this even simpler and look at how these 2 triangles rasterize out on a 6x6 render target (warning: MSPaint art incoming):

The top triangle is rendering blue and the bottom triangle is rendering green. It's worth pointing out that the diagonal line of pixels sits exactly between the triangles, so it should be 50:50 green and blue, but the rasterizer has to pick one (unless you're using MSAA). To break this tie there's actually a rule for this situation, called the "top-left" rule, which is what causes the green triangle to "win": essentially, the triangle whose top or left edge covers the pixel centroid gets priority.
So let's first talk about how the blue triangle gets turned into quads. I've added a magenta overlay showing how I'm guessing this might get packed into 2x2 pixel quads:

First, to be clear, this is just a guess at how the GPU might turn this into quads; it could be subtly different or "smarter", perhaps, but I actually doubt it (and there's some math/counter information later on that backs this up). The main point is that you'll see quads that are forced to bleed onto the green triangle. What's important to know is that the overlapped quads will still run 4 pixel shader invocations even if only 1 pixel of the quad is within the triangle. And the 3 "wasted" helper pixels can't try to be smart and write out the green triangle's information, because they only have access to the blue triangle's interpolated vertex data.
Now let's see how the green triangle gets turned into quads:

And so NOW we can finally talk about the crux of the problem: all of these quads essentially get run twice:

So let me sum that all up: the diagonal edge between the triangles causes overlapping quads that result in the pixel shader getting invoked twice for every pixel along that edge.
At this point you probably have a lot of questions. I'll try to anticipate a couple of them and answer them to the best of my ability, though to be honest this gets into murky hardware territory that isn't my expertise. The big one is probably "Why does it feel like it's processing the triangles one at a time? Can't it just rasterize them all together?" It certainly is doing them all in parallel, but it's important to keep in mind that pipelining is a big part of what makes a GPU fast. GPUs these days are insanely wide, and it's quite challenging to keep the whole thing busy. The best way is to start rasterizing and running pixel shader invocations as soon as possible, and that means turning your vertex shader work into pixels ASAP, even if it means inefficiencies like this quad overlap mess.
Another question is "Are helper pixels even needed in this case?". With this simple pixel shader, the helper pixels are absolutely useless; there's no need for derivatives since I'm not doing a texture read. But the fixed-function hardware is just, well, "fixed" to work this way, and there's no way around it (other than writing a compute shader rasterizer).
Yes we can! Let's bring out our ol' friend Pix again…
So what we want to do is find out how many quads actually got run for a draw. Unfortunately Pix doesn’t have a vendor-agnostic way of querying this (edit: Dr Pix can display this in both a vendor-agnostic AND more informative way, details at the bottom of this post). However, it DOES have some handshakes with the driver that allows it to get hardware-specific counters with more low-level insights. I’m using an NVidia GPU so I’ll be working with their specific counter, but other vendors will have something similar. The counter I’m interested in is “sm__ps_quads_launched.sum”:

The counter names can sometimes be pretty overwhelming due to acronyms; this one isn't too bad, but I'll break it down for clarity. You can ignore the sm ("Streaming Multiprocessor"); it's basically just the category this counter falls into. The ps is of course Pixel Shader, we just talked about quads launched, and the sum says this counter will sum up all the quads launched for the selected draw. Hey, that's exactly what we want, nice!
So let's look at the counter for the 1-triangle case:

65,536 quads huh? With some quick math, we can verify that makes sense. The render target is 512x512 and a 2x2 pixel quad contains 4 pixels so:
512 pixels * 512 pixels / (4 pixels per quad) = 65,536 quads

Math checks out! Let's move on to the 2-triangle case:
65,792 quads, more quads than the single-triangle case, which lines up with my quad theory! In fact, that's exactly 256 more quads than the 1-triangle case, which is exactly what you'd expect if you think about it: the width of the render target is 512 pixels, and you'd expect an extra quad every 2 pixels along the diagonal edge, so 512px / 2px = 256 extra quads! Note that the math would need to be tweaked based on aspect ratio if the height and width weren't the same, but I rigged the dimensions for some easy math.
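That back-of-the-envelope math is easy to encode (my own helper names; square render target assumed for the diagonal term, matching the setup above):

```cpp
#include <cassert>
#include <cstdint>

// A full-screen pass shades one 2x2 quad per 2x2 block of the render target.
uint64_t FullscreenQuads(uint64_t width, uint64_t height)
{
    return (width * height) / 4;
}

// The diagonal edge between 2 full-screen triangles re-runs one extra quad for
// every 2 pixels of width on a square render target.
uint64_t TwoTriangleQuads(uint64_t width, uint64_t height)
{
    return FullscreenQuads(width, height) + width / 2;
}
```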
So hopefully that’s convinced you to use 1 triangle over 2 triangles! It’s worth noting I’ve set things up in a way that exaggerates the problem slightly. As your resolution goes up, this matters less and less, because the ratio of quads along the diagonal to quads from the rest of the triangle gets smaller and smaller. In an optimized game running at 4k on a modern GPU, is this a meaningful difference? My guess is you’re not going to get more than 0.1 millisecond savings out of it (unless maybe you’re using inline raytracing in a fullscreen pass). But let me tell you, if you’re a rendering engineer optimizing for 60 FPS, an easy 0.1ms savings is like hitting the lottery!
And if none of that is enough to convince you, the single triangle case is also just easier to set up. You don’t even need a vertex buffer, you can procedurally generate the points in the vertex shader using SV_VertexID:
float4 main( uint id : SV_VertexID ) : SV_POSITION
{
    float2 uv = float2((id << 1) & 2, id & 2);
    return float4(uv * float2(2, -2) + float2(-1, 1), 0, 1);
}

Austin Kinross kindly pointed out there’s actually another VERY cool way to query quad profiling information using Dr Pix:

This isn’t my discovery or anything, this is largely inspired by a StackOverflow answer from Derhass. His answer is pretty comprehensive but I thought it was so fascinating that it was worth digging up performance and counter information to really break it down more concretely.
A combination of this post from Yining Karl Li and a general itch to do something more artsy made me attempt something a little crazy: the RenderMan Art Challenge! The idea of the challenge is to take an untextured scene and turn it into a nice final render in 90 days. You’re allowed to add your own VFX and modify the scene, but the storytelling must revolve around the provided assets. This challenge was extra interesting because they also provided a fully rigged character model to work with. Here’s what the provided scene looked like for this challenge:

And here’s what my final entry, titled “Midnight Meltdown” looked like (click for the full 4k image):
I still find it a little hard to believe, but my entry ended up getting first place! I highly recommend checking out the other entries; I actually liked a lot of them more than my own. Honestly, when I submitted my entry, I had zero expectation of even placing!
This post went on longer than I expected, but hopefully it’ll give you a sense of what it took to get to the final render. If you don’t like words, I won’t be offended if you want to scroll through and see some cool pictures, though my hope is it inspires some folks to give the challenge a try themselves (especially those from an engineering background). The challenge is free for anyone to enter and you’ll get feedback from some of the Pixar folks as well as other people participating, which was an incredible experience that I can’t recommend enough.
For the concept, I wanted to aim for something I could strongly relate to. Professionally, I spend an almost embarrassing amount of time banging my head against my keyboard trying to understand why pixels look like one color and not another. Sometimes, issues I’m trying to debug take days, even weeks! The worst of those bugs can drive you insane, with whiteboards full of crazy theories that feel reminiscent of some kind of detective movie. That fun little mix of insanity and frustration is what I wanted to revolve my concept around.
For the setting I had this vision of the Magic Library Trope meets a messy college dorm room. Meaning books everywhere, as if she’d been researching in a frenzy with little regard for cleanliness. But I also wanted an atmosphere of coziness and solitude, like sitting alone by a fire on a cold winter night. That led me to experiment with a combination of cool and warm lights. Here’s some of my original concept shots with just the scene assets re-organized and some test lights:

Concept Shot 1. Originally I had this idea of the character working on a magical computer, the pink light on the right is supposed to be a monitor.
Concept Shot 2. Another idea was that she’d be trying to be decrypt some kind of magical artifact.
For the challenge, they provided a Mathilda model created by Xiong Lin and rigged by Leon Sooi. The rig could be controlled via Anim School Picker which was incredibly intuitive and generally just fun to play with.

Anim School Picker UI for the Mathilda Rig
For the posing, I wanted something Pixar/Disney-esque. I gathered a bunch of my “Art of …” books (I’ve gathered a small collection from both games and CG movies) and started poring over a bunch of the gesture drawings. I could have used real-life photos as a reference, but I was worried about the pose looking too stiff if I followed those too closely. Animators are masters at capturing the essence and emotion of a pose, and you can see that best in gesture drawings. My biggest inspiration was the work of Jin Kim, one of the many talented animators and character designers at Disney. If you haven’t seen his work, stop what you’re doing right now and google him immediately. I’d put something here but I’m not sure what permissions are required, so you’ll have to find it yourself, I’ll wait.
My favorites are some of his character drawings in the “The Art of Big Hero 6” and “The Art of Moana”. It was also a good excuse to rewatch Big Hero 6, there’s a part where Tadashi is very tired looking (if you’ve seen it, you’ll probably know what I’m talking about) that was also a helpful resource.
A couple of the nuances that I think really helped:

Viewport with the untextured model that better shows some of the rigging
Some starting texture assets were already provided for the skin. I largely stuck with the provided textures, though I had to come up with the subsurface material parameters myself. One of the tricks I took from the skin in the RenderMan Material Pack was to convert some Worley noise into a bump map to help add natural variation to the skin’s texture. I also painted in some darkness under the eyes to suggest some sleep deprivation.
The eyeballs took quite a bit of tweaking to get them where I wanted. The provided eyeball geometry consists of a sphere that represents the cornea/sclera and a separate interior disc geometry for the iris. It makes it a smidge more complicated than just slapping a texture on a sphere, but it also gives much more depth to the iris, which I liked. There is a PHENOMENAL tutorial on making a realistic human face in RenderMan by Leif Pederson that I heavily referenced for the eyes. The short of it is to use subsurface scattering for the sclera (the white part) of the eye and then a glass material for the cornea to allow light to hit the iris geometry. The specular needed to be cranked up so you get those bright highlights you expect on character eyes. I also had to add rod filters (essentially capsules you can place that either block or intensify light within the capsule) to selectively block specular light around the eye, because the eye was so reflective that it was picking up a bunch of distracting hot spots from supplemental lights. I also added a white beam light specifically to add the white highlight on the left eye. And finally I tinted the sclera reddish-yellow to help convey some fatigue.

For the witch hat I wanted to go for something that looked like a bit of a hand-me-down. Going back to the idea of a messy college dorm vibe, I wanted her clothes to seem worn in. The hat provided in the RenderMan challenge was a great start but the geometry was too smooth. Normally relying purely on texturing to show wear would have been sufficient, but because the hat was such a prominent silhouette in my scene, I wanted to make sure the underlying geometry looked worn also.

Zoomed in shot of the hat before adding in any wrinkles
I went through several failed attempts to get some wrinkles in. The first trial was just using the nCloth sim built into Maya. The problem was the hat would always collapse into itself and no reasonable settings seemed to help without making the hat overly rigid. My second thought was to give Blender’s new cloth brushes a try. They still run a cloth sim but you can control where they get applied, so there’s still a bit of art direction allowed. I had better luck with this, but I started to realize that I was a little too attached to an idea of how the hat should look in my head, and getting a cloth sim to match up to my vision just wasn’t going to work. So I threw the idea of running a cloth sim out the window and figured I’d try to sculpt the details in myself. I grabbed a trial of ZBrush and hammered away at the thing. It took a bit to get used to ZBrush but once I got started, it was REALLY fun. I’d only modelled in Maya/Blender before so it really flipped my perspective on how organic 3D modelling could feel. My end result wasn’t terribly great, but it was still far better than what I was getting from the cloth sim and I thought it looked compelling enough.
Viewport of the hat in ZBrush
Afterwards I took the sculpted hat and threw it into Substance Painter for some texturing. I had a pretty standard looking leather material with some scratches, but I also added some patches. Substance Painter has this cool built-in stitch alpha that made it really straightforward to add some nice stitching around the edges of the patch. Substance Painter also does a pretty nice job of changing height maps into a convincing normal map, so I didn’t end up needing to do any kind of displacement for the patches. Here’s the end result:

I wanted a cozy feel along with the candle lighting and I thought having a nice warm looking blanket around the character would help add to the atmosphere. I subdivided a plane and used Maya’s nCloth sim to get it to drape over the character. It made an okay starting point but I still had to do a whole bunch of manual vertex pulling to get it to look like it was convincingly wrapped around the character.
Wireframe of the blanket geometry
Okay, now to texture the thing. I wanted something warm looking so I thought flannel would be a decent idea. So I followed this great Substance Painter tutorial on making a flannel blanket. And the end result was, well, flannel. Something about it really didn’t look right to me:
Render of the scene when I was approximately half way through the challenge with the flannel blanket
Nicoler, who was also participating in the challenge, had a fantastic suggestion of both extruding the blanket to make it look less flat and then using XGen grooming to give it some texture. I hadn’t used XGen before but I was pleasantly surprised with how easy it was to work with. Essentially you can add XGen grooming onto a mesh and it will just spam a whole bunch of splines everywhere on the mesh. There’s then a bunch of brush tools that let you “groom” the splines, the most useful one being the noise brush that lets you just randomize the spline direction/length/etc to make it more naturally chaotic. And best of all, it works great with RenderMan’s Marschner Hair material. I liked it so much that I flipped my idea into making it more of a big fluffy fur blanket! I really agonized over the grooming and literally continued to groom the blanket in different ways up until the very last day of the challenge. But overall, I liked how it ended up!

I knew early on I wanted to have huge piles of books everywhere, as in over a thousand. This had two challenges:
For the first question, “how do you texture over a thousand books?”, I was very excited to try a procedural book material. I’d been doing loads of pure art work up to this point and I was eager to see if I could sneak some programming into the challenge. I had a huge ambitious vision for what this material would be, with procedural tears and dirt and varying text. Needless to say, time didn’t allow for this. But I was able to fulfill a small slice of what I set out to do.
The final material ended up being relatively small (especially if you compare it to the amazing things you see in Nodevember). Up close, the books weren’t too compelling, but since they were all in the dark and covered up by depth of field, it was good enough for my purposes. I added some hero books that had custom nicer materials in the foreground to help cover for my not-so-amazing procedural material. The procedural material supported:

The PrxLayeredSurface Material graph for the book material

Displacement is used on the pages to make the silhouette look more like ruffled pages
The main thing I leveraged was this cool node called PxrRamp. It lets you define a completely tunable color ramp, and then a random color gets pulled from that ramp per object. It was quite cool because I could just duplicate books around and they would automatically look different without any work thanks to the PxrRamp!

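To make that per-object randomization concrete, here’s a rough Python sketch of the idea. The ramp colors and the interpolation here are my own stand-ins for illustration, not PxrRamp’s actual implementation:

```python
import random

# A tunable ramp of book-spine colors (made-up values for illustration).
RAMP = [(0.36, 0.22, 0.15),  # brown leather
        (0.55, 0.12, 0.10),  # red
        (0.15, 0.25, 0.40),  # navy
        (0.20, 0.35, 0.20)]  # green

def book_color(object_id):
    # Seed the RNG with the object id so each duplicated book gets a
    # stable, unique color with zero manual work.
    rng = random.Random(object_id)
    t = rng.random() * (len(RAMP) - 1)
    i = int(t)
    f = t - i
    # Linearly interpolate between neighboring ramp entries
    a, b = RAMP[i], RAMP[min(i + 1, len(RAMP) - 1)]
    return tuple(x + (y - x) * f for x, y in zip(a, b))

# The same id always produces the same color; different ids vary freely.
assert book_color(7) == book_color(7)
print(book_color(7))
```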
Okay, so thousands of books textured. Great. Now where do we put all those books? I turned to the built-in Bullet sim included in Maya for doing a simple rigid body simulation. Since the books were basically just blocks, it was more than sufficient and really fast! I just stacked books vertically until I had essentially created a Jenga tower of 1000 books and then let the simulation do its thing. Here’s how that looks:

Some of the books still ended up in an awkward position so I just went through and manually hid any books I didn’t like after the sim. All the books on the shelves were manually placed by me and a whole lot of copy-paste. I also did a whole bunch of manual rotation just to give the sense that some of the books were falling over and leaning on each other.
Here’s how it looks putting that all together (excuse some floating books, I missed a few):

I also wanted to have the character surrounded by a bunch of opened books. The look I was aiming for was a messy desk filled with textbooks and journals, as if you’re deep into some sort of research project. The main challenge here was getting some nice looking pages. I could have done some modelling (and in hindsight, maybe I should have), but my modelling skills are real poor so I turned to sims instead. I cloned a whole bunch of planes and then imported them into Houdini to sim using the vellum solver. It was quite challenging to get the pages to fall naturally; I had to do some real silly things, including adding some fan forces to literally blow the pages away from the spine of the book. Here’s how that silliness looks in action:

I lost the actual sim I used, so this is just a recreation. I also didn’t crank up the collision iterations so it’s a little flatter than a more robust sim, but hopefully you get the idea.
The final actual mesh looked like this:

The final sim looked okay; I had to delete some of the top pages because they crumpled in some unnatural ways. From there I took the alpha from a page texture from Textures.com to make the planes look less perfect. For most of the books, I just copy-pasted a bunch of text from one of my other blog posts and slammed it on top of the texture. For the book next to the character’s arm, however, I wanted it to look like a journal. I’ve always loved Da Vinci’s journals, and his mix of art and engineering has always been a huge inspiration for me, so I wanted something reminiscent of his journal. I drew up a journal page in Autodesk Sketchbook; it’s hastily thrown together, but I think that’s part of the appeal of a journal. Something you also expect from pages is that they’re thin and some light should pass through the page. Normally you would turn to subsurface scattering to solve this, but because the pages were modelled as flat planes, they’re infinitely thin and there’s no “space” for the light rays to bounce around, making subsurface scattering useless in this context. Oops! That’s okay, RenderMan has this covered: wanting subsurface scattering on an infinitely flat object is actually pretty common (it’s used for the leaves on foliage as well), and there’s a transmission parameter in PxrSurface for handling this. It’s essentially a knob that lets you specify how much light should pass through the plane. Perfect! Here’s how that looks:
Yes the page texture is mirrored, yes it’s ugly. But the other side of the journal isn’t visible from the camera in a way that makes this noticeable, and I’m lazy.
I also used a lot of the material assets from the journal on the potion challenge flyer. However, instead of relying on a cloth sim, I modelled this one because I had a very specific idea of how it needed to be folded in a way that made the text readable from the camera. I exported a procedural “crumples” height map from Substance Painter that I plugged into the displacement to help make it look extra crumpled:
I wanted to have the ol’ classic witch brewing up a potion in a cauldron kind of scene. Texturing the pot was straight-forward but getting the smoke right was harder. I actually ended up generating 2 different smoke volumes that I overlay, both out of Houdini.
The first is just a simple candle flame swirling around. It’s basically just an out-of-the-box candle flame sim from Houdini, but it had enough interesting shape and texture that I stuck with it:

The second volume was much trickier to get right. I imported the cauldron model into Houdini and added a smoke sim that starts in the shape of a sphere, with some outward velocity so it would billow out of the cauldron. It worked decently but I bumped into a visual issue. With it billowing uniformly out of the cauldron, it started to obscure the cauldron and just looked like a big ball of smoke:

There’s a cauldron in there…somewhere.
What I wanted instead was little tendrils of smoke, so that you’d see thick streams of smoke but also some holes in the smoke that let you see the cauldron underneath. I played with the sim a LOT, trying to muck with the velocity forces to do what I wanted. No matter what, the smoke would diffuse out and I couldn’t get the gaps in the smoke the way I wanted. So I rolled up my sleeves and started to think hacky. The solution I ended up actually using was adding a whole bunch of sphere blockers to force the smoke down certain gaps and leave other areas empty.

I was pretty happy with the final result using the spheres. Getting these volumes out of Houdini and rendered in RenderMan is pretty straightforward using RenderMan’s OpenVDB support. Luckily exporting VDBs works even in the free version of Houdini (Houdini Apprentice), sweet! I did little else in the material other than a global multiplier to fudge the density values and a uniform purple albedo color for the volume.
The cauldron itself is a relatively simple material whipped together in Substance Painter. I started with a base of iron and added a bunch of rust and scratches to make it look well used. Around the rim I added some shiny purple liquid stains to look like some of the potion mixture had spilled out. Here’s how that looks in a render:
Volumes are visibly low resolution, never got around to doing it at a higher resolution unfortunately…
The foliage isn’t a center piece in the final render but I still added a fair amount to the scene. Going back to the idea of a magical library, I wanted to give a feeling of life to the room as well as a bit of neglect.
SpeedTree generously gave everyone in the competition a 30 day trial to help generate some assets. I didn’t do anything advanced, but I did create several vines using the “trellis growth” template provided and imported some of the important scene geometry so that the vines would wrap convincingly in the scene.
SpeedTree viewport with a modified version of the trellis growth template and the Magic Shop scene geometry
Getting the mesh and material into Maya with RenderMan went without a hitch using SpeedTree’s RenderMan plugin. I didn’t need to do any material wiring; it all worked out of the box with the leaf transmittance done for you, nice!
I also liked XGen grooming for the blanket so much that I decided to use it for adding moss onto the wood shelves as well. Almost all the moss is blurred out by the depth of field so I didn’t need to spend too much time on detail, but it’s definitely noticeable and you can tell it’s adding to the shelves’ silhouette, making it look more compelling than just painting moss onto the shelf as a texture.
Combining all of that together, here’s an example of how the book shelves looked:
Actually this has more XGen “hair” than the one I used in the actual scene, not sure why the hair grew, but you get the idea…
So take a minute to look at the final render again, how many light sources do you think are in this scene?
Most people would probably say five. Four candle lights and one for the moonlight. Six is also reasonable if you consider the emissive spell circle a light source. The real answer?
cue drum roll
13 lights! Actually it’s more if you count the composited fog layers. If you do lighting a lot, this is probably no surprise. But to anyone else, I’d imagine this is a little surprising! It certainly surprised me! As a real-time rendering engineer, I like to dream of the lofty days when we have true real-time path tracing and all the lighting will just look good. And when I started the challenge, I had the expectation that I’d just put light sources where they make sense and the GI would make everything look great out of the box. But the goal here isn’t just to make something look real, we also need to be thinking about questions like:
All of the above are questions that can’t be solved via tech, because it’s no longer about physics but artistic intent. It’s actually quite challenging to evaluate these questions, however there are a few tricks to try and get a sense of it. The first is looking at the image in black and white. This removes color from the equation and allows you to evaluate the image purely in values. Here’s a couple of the things that I was continually checking:

One other trick I picked up from Leif Pederson was to look at a blurred version of the black and white image. This is a good way of making sure the overall composition of values still looks good without getting distracted by individual details. This is a little harder to quantify, but the eye should still naturally be drawn to the focus just due to the layout of the values. If the focal point is getting lost because the values get muddled after a blur, it’s a good sign that there’s not enough contrast between different regions of the image.

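As a rough illustration of the value-checking idea, converting a color to its relative luminance (the standard Rec. 709 weighting) is how a black-and-white check collapses color down to pure value. The colors below are made-up examples, not samples from the render:

```python
# Relative luminance with Rec. 709 weights: green contributes the most,
# blue the least, which is why a cool blue scene reads darker than it feels.
def luminance(r, g, b):
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

candle_glow = luminance(1.00, 0.60, 0.20)   # warm focal highlight (example)
moonlit_wall = luminance(0.15, 0.20, 0.35)  # cool background (example)
print(round(candle_glow, 3), round(moonlit_wall, 3))

# The focal point should stay clearly brighter than its surroundings even
# after collapsing to values; if not, there's not enough contrast.
assert candle_glow > moonlit_wall
```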
So what about those 13 lights? What are they? I could walk through them all but I figured it’d be easier to just visually show them:

In addition to my primary render, I also rendered out 2 different renders adding light rays from the windows and some extra volumetric lighting around the candles. The reason these need to be rendered out on separate layers is because of some of the settings required to get nice volumetric lighting. In order to get nice light rays, you need to crank up the amount of volumetric density (essentially emulating how “dusty” the room is). However, this also has the side effect of reducing visibility of the background and causing a spread of lighting that can kill your contrast. As a result, it’s nice to restrict these volumes into their own pass so that you can limit their effects to a single light source.
Here’s what the render of the light ray pass looks like:

And here’s what the volumetric lighting pass for candles looks like:

After adding the passes together, I just do a smidge of color correction to tweak some of the brightness values. I actually spent a lot of time agonizing over some very minor brightness tweaks. The main challenge I found with having a night-time scene is that a majority of your scene ends up with darker values, and trying to keep details visible while also not overbrightening the scene is really hard. To make matters worse, I found that my image looked drastically different depending on whether I looked at it on my phone, my laptop, or a desktop monitor. It was quite frustrating because it would look far too dark on one screen and far too bright on another. So which screen do you calibrate for?
Ultimately I tried to hit a sweet spot that looked average across all screens, though inevitably it looked better on some screens than others. A trick I used to sanity check was Photoshop’s histogram of luminosity. I grabbed a whole bunch of reference art/photos of night scenes from professional artists who actually know what they’re doing and compared the luminosity histogram to my render just as a way of knowing I was in the right ballpark.
Luminosity histogram in Photoshop
You’ve made it to the end, thanks for reading! Ideally this might encourage some of you to give something like this a shot. Here’s my takeaways distilled into several points:
All the renders are of course done in RenderMan. The Mathilda model is by Xiong Lin and rig by Leon Sooi. Pixar models by Eman Abdul-Razzaq, Grace Chang, Ethan Crossno, Siobhán Ensley, Derrick Forkel, Felege Gebru, Damian Kwiatkowski, Jeremy Paton, Leif Pedersen, Kylie Wijsmuller, and Miguel Zozaya © Disney/Pixar - RenderMan “Magic Shop” Art Challenge. For the isolated object renders, the environment map uses an HDRI from HDRI Haven.
I always loved the water rendering in Moana so I set out to make a ShaderToy that would recreate some semblance of it. It took a whole bunch of tricks to get it where I wanted; this post will break down some of the major ones. The ShaderToy source is available here.
I’ll first give a brief intro to the physics behind some key elements of water: absorption, refraction, and reflection. I’ll go into the distinction between ray marching and ray tracing. Both were necessary to get the whole effect but there’s some key performance and visual differences.
The second half will go over some of the secondary environment effects including the sand and the sky and a couple of tricks to add believability while staying real-time-ish.
In order to render water there’s 3 important things that need to be nailed down:
To show how that all looks in this particular scene, here’s a diagram sloppily drawn in, of all things, Microsoft Word (sorry):

In total, that means we have 5 rays per pixel: 1 primary camera ray, 2 reflected rays, and 2 refracted rays. And some of that involves ray marching to handle absorption correctly. Needless to say, the brute-force implementation is going to be too slow. So here’s a couple of very scene-specific optimizations I made to make this affordable:
At this point, I need to delve into something that was important for performance: the distinction between ray tracing and ray marching. I use both in the shader, relying on ray marching for the water volume and ray tracing for the rest of the opaque objects (coral and sand). If you’re already aware of the difference, you can skip all of this, but if not, read on! The TL;DR of it is that ray tracing is faster but only works with boring shapes (triangles/spheres/etc), while ray marching works on all sorts of fun SDFs but is iterative and slower. Keep in mind this is in the context of my ShaderToy; in different contexts there are different trade-offs.

Let’s talk about ray tracing. Ray tracing is a bit of an overloaded term, but in this context I use it to mean analytically determining the intersection between a ray and some geometry. What does that mean? Well, let’s take the example of a sphere and a ray. Using math you can calculate exactly whether the ray will hit that sphere and exactly where. The great part is that the math doesn’t involve any iteration. So what that means is that regardless of where your ray is, or where or how far away the sphere is, you will ALWAYS be able to calculate the intersection position (or lack thereof) with a few math operations. And this applies not just to spheres but several other shapes including boxes, triangles, etc. In terms of performance, this is GREAT because we know exactly what we’re paying for. If you’re coming from a rasterization background, you might be puzzled to hear that ray tracing is “fast”. Keep in mind that we’re just talking about a single primitive; once you start talking about things like triangle meshes where there are thousands upon thousands of triangles, things get different. Let’s not go there today. In my ShaderToy, I had 6 primitives total (5 spheres and 1 plane).
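Here’s a minimal sketch of what “analytic” means in practice: a ray-vs-sphere test solved in closed form, with no iteration. It’s written in Python for readability rather than shader code:

```python
import math

# Analytic ray-sphere intersection: solve |origin + t*dir - center|^2 = r^2,
# which is a quadratic in t. No loops, fixed cost, exact answer.
def ray_sphere(origin, direction, center, radius):
    oc = [o - c for o, c in zip(origin, center)]
    # direction is assumed normalized, so the quadratic's 'a' term is 1
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None  # the ray misses the sphere entirely
    t = (-b - math.sqrt(disc)) / 2.0  # nearest of the two roots
    return t if t >= 0.0 else None

# A ray fired down +z toward a unit sphere centered 5 units away hits at t = 4.
print(ray_sphere((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0))  # 4.0
```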
So what about ray marching? Ray marching is a little different, particularly with signed distance fields (SDFs). The marching is an iterative process based on the SDF. What that means is that performance WILL change based on your SDF and the position of your sphere. So why do ray marching if it’s so much slower? The main reason you’ll see it on something like ShaderToy is because it’s extremely malleable and allows you to create interesting shapes programmatically. Anything from Inigo Quilez is a great example of this. Another example of this in something that’s actually shipped is Dreams; there’s a very cool talk on their architecture here.
There’s also a second reason to ray march. For something volumetric like water, you need to ray march anyways to get proper volumetric lighting. If we have to march ANYWAYS, we might as well use an SDF where we can get some fun shapes out of it.
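For contrast with the analytic test, here’s a sketch of sphere tracing, the common SDF ray-marching scheme. This is a toy Python version of the idea, not the actual ShaderToy code:

```python
import math

# A simple SDF: distance to a unit sphere centered 5 units down +z.
def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    return math.dist(p, center) - radius

# Sphere tracing: repeatedly step forward by the distance the SDF reports.
# The SDF guarantees no surface is closer than that, so the step is safe.
def ray_march(origin, direction, sdf, max_steps=64, epsilon=1e-4):
    t = 0.0
    for _ in range(max_steps):  # iterative: cost depends on the scene
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < epsilon:
            return t  # close enough to the surface to call it a hit
        t += dist
    return None  # ran out of steps; treat as a miss

print(ray_march((0, 0, 0), (0, 0, 1), sphere_sdf))  # converges to ~4.0
```

Note how the same hit the analytic test computes in one shot takes a variable number of steps here; that variability is exactly the performance trade-off described above.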
I modelled all of the opaque geometry (anything that isn’t water) using primitive shapes. Not SDFs but just spheres for the coral and a single plane for the sand. This was important for performance because intersection testing could be done fast using ray tracing. It looks terrible right now but bear with me, it will look better by the end!

The actual water was created by bashing a bunch of SDFs together. I won’t go into detail on the technical side but if you’re interested, you can check out my post on making a cloud volume with SDFs. The quick and dirty of the water is a few sine waves, some noise to add faked turbulence, and then I subtract out a sphere where the camera is to make the parting of the ocean. There’s nothing fancy happening there, just techniques from IQ’s SDF page.

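As a rough sketch of that composition (my own toy approximation in Python, not the actual ShaderToy code, and with made-up wave constants), the ingredients are a sine-based surface and a subtracted sphere:

```python
import math

# Toy water SDF: sine waves for the surface (noise would be layered on in
# the real thing), minus a sphere near the camera to part the ocean.
def water_sdf(p, t=0.0):
    x, y, z = p
    # Heightfield-style surface: negative below the waves, positive above
    surface = y - (0.15 * math.sin(x * 2.0 + t) + 0.08 * math.sin(z * 3.0 - t))
    # A sphere of empty space carved out around the camera at the origin
    camera_pocket = math.dist(p, (0.0, 0.0, 0.0)) - 2.0
    # SDF subtraction, IQ's max(a, -b) trick: keep water, remove the pocket
    return max(surface, -camera_pocket)

print(water_sdf((0.0, -0.5, 5.0)) < 0.0)  # True: this point is inside the water
```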
So our water is looking quite bad because it’s opaque! So let’s light it as a volumetric, not an opaque. This is done by marching along the water and, at each point, marching another ray towards the light to calculate how much light reaches that point. As I mentioned in the modelling section, we’re already marching ANYWAYS because the water is an SDF, so we’re only adding the cost of marching towards the light at each step. This is using the exact same logic I described in an earlier post on volumetric rendering, so if you’re interested, check that out for more details. I additionally have it absorb red and green faster than blue, which causes things to look more blue the further a ray goes into the volume. I exaggerated the amount of blue tinting the absorption does to more closely match Moana’s art style rather than keeping to what’s “physically based”.

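The per-channel absorption boils down to Beer-Lambert falloff with a different coefficient for each channel. Here’s a sketch with illustrative coefficients (not the shader’s actual values):

```python
import math

# Beer-Lambert absorption, per channel. Red and green have larger
# coefficients than blue, so blue survives the longest underwater path.
ABSORPTION = (0.35, 0.15, 0.05)  # per unit distance, (R, G, B); illustrative

def transmittance(distance):
    # Fraction of each channel's light surviving after 'distance' of water
    return tuple(math.exp(-a * distance) for a in ABSORPTION)

for d in (0.5, 2.0, 8.0):
    r, g, b = transmittance(d)
    print(f"depth {d}: R={r:.2f} G={g:.2f} B={b:.2f}")
# The deeper the ray travels, the bluer the surviving light.
```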
Calculating refraction is very easy because both HLSL/GLSL offer an intrinsic that calculates the refraction direction for you. You just need 4 things:
rayDirection = refract(rayDirection, volumeNormal, AIR_IOR / WaterIor)

And remember to flip the IOR values and normals when you’re exiting the volume! With that, the turbulent SDF surface causes a nice distortion of rays that already is starting to look like water!

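If you’re curious what refract is doing under the hood, here’s the GLSL spec’s formula transcribed to Python, where eta is the ratio of source to destination IOR (e.g. 1.0 / 1.33 going from air into water):

```python
import math

# GLSL refract(): k = 1 - eta^2 * (1 - dot(N, I)^2);
# R = eta*I - (eta*dot(N, I) + sqrt(k)) * N, or zero if k < 0.
def refract(incident, normal, eta):
    d = sum(n * i for n, i in zip(normal, incident))
    k = 1.0 - eta * eta * (1.0 - d * d)
    if k < 0.0:
        return (0.0, 0.0, 0.0)  # total internal reflection: no refracted ray
    f = eta * d + math.sqrt(k)
    return tuple(eta * i - f * n for i, n in zip(incident, normal))

# A ray going straight down into a flat water surface passes through unbent:
print(refract((0.0, -1.0, 0.0), (0.0, 1.0, 0.0), 1.0 / 1.33))  # (0.0, -1.0, 0.0)
```

The total-internal-reflection branch is exactly why flipping the IOR ratio on exit matters: with eta greater than 1, grazing rays inside the water can fail the k test and reflect instead.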
You can’t have water without doing nice reflections! For reflection there’s similarly already a reflect intrinsic in HLSL/GLSL that we can use:
vec3 reflection = reflect(rayDirection, volumeNormal);

The reflections add some nice highlights that help bring out the shape of the water, particularly at the top where the water is much closer to a grazing angle. Here’s how that looks:

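reflect is even simpler than refract; it just mirrors the incident direction about the normal, R = I - 2*dot(N, I)*N. A quick Python transcription of the intrinsic:

```python
# GLSL reflect(): mirror the incident direction about the surface normal.
def reflect(incident, normal):
    d = sum(n * i for n, i in zip(normal, incident))
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))

# A ray heading down at 45 degrees bounces back up at 45 degrees:
print(reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0)))  # (1.0, 1.0, 0.0)
```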
Caustics are really critical to the look of water. They’re also challenging to calculate “properly” and generally out of the realm of real-time. So instead let’s think about how to fake them. The intuition I have behind what causes caustics is to think about the hot spot of light you get when using a magnifying glass in the hot sun. The same kind of thing is happening with water due to refraction, only imagine somewhat randomly shaped magnifying glasses due to the water turbulence/waves.
The way I faked it was to take some smooth voronoi noise and distort it. I then use the xz position to read into the procedural caustic texture. It ends up looking okay; there are some tiling artifacts, but I was feeling lazy, so it was good enough for my purposes:

One other approximation to keep in mind with the magnifying glass analogy: we should only have water caustics where light has passed through water. If the light didn’t penetrate any water, you shouldn’t have concentrated points of light since nothing is bending light around. So when I do my opaque lighting, I only tack on caustics if the light went through water (and I also weight the amount of caustics applied based on how much water the light went through, so you get a smooth falloff of the caustics). Putting that all together gets you this:
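That weighting can be sketched as below. This is only an illustration of the idea: `SampleCausticTexture` (the distorted voronoi lookup), the `waterDistance` accumulated while marching, and the falloff constant are all hypothetical names/values of mine.

```glsl
// Weight caustics by how much water the light traversed, so the effect
// fades in smoothly instead of popping at the water's edge.
vec3 ApplyCaustics(vec3 lighting, vec3 position, float waterDistance)
{
    // No water traversed means no focused light.
    if (waterDistance <= 0.0) return lighting;

    // Smooth ramp: more water, stronger caustics, saturating at 1.
    float causticWeight = 1.0 - exp(-waterDistance * 2.0);
    float caustic = SampleCausticTexture(position.xz);
    return lighting * (1.0 + causticWeight * caustic);
}
```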

I didn’t spend much time making nice white water so I won’t talk about this too much, but I did want to call it out as it’s usually important to making water look good.
My hacky white water works like this: when the first ray hit on the water is close to the ground, I slap on some noise with a white albedo. It’s not particularly compelling, but it looked better than nothing. Here’s how it looks (it’s hard to see in the water shadow; it’s more apparent in the lighting on the right).

For the sky I wanted something fast but still dynamic. So I took a really simple approach of just using fBM noise as my density function, where the position fed into the noise is just the ray direction. From there I take a mere 4 ray marched steps towards the directional sun light to calculate some rudimentary self shadowing. With just that it looks pretty decent!
vec4 GetCloudColor(vec3 position)
{
    vec3 cloudAlbedo = vec3(1, 1, 1);
    float cloudAbsorption = 0.6;
    float marchSize = 0.25;
    vec3 lightFactor = vec3(1, 1, 1);
    vec3 marchPosition = position;
    int selfShadowSteps = 4;
    for(int i = 0; i < selfShadowSteps; i++)
    {
        marchPosition += GetSunLightDirection() * marchSize;
        float density = cloudAbsorption * GetCloudDensity(marchPosition);
        lightFactor *= BeerLambert(vec3(density, density, density), marchSize);
    }
    return vec4(cloudAlbedo * (GetSunLightColor() * lightFactor + GetAmbientSkyColor()), 1.0);
}
However the clouds look pretty monotone. I took a look at some of the shots from Moana and there’s generally this really nice blue backsplash on the clouds that comes from the sky. So to fake that GI, instead of just darkening to gray where light has been self-shadowed by the cloud, I lerp to a dark blue. Here’s how that looks:
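A sketch of that tweak inside the cloud shading above: instead of letting `lightFactor` drag the sun term to gray, lerp the occluded portion towards a dark blue. The `skyBounceColor` value here is my guess at “a dark blue”, not the shader’s exact constant.

```glsl
// Fake sky-bounce GI: self-shadowed regions tint towards dark blue
// rather than falling to gray/black.
vec3 skyBounceColor = vec3(0.05, 0.1, 0.25);
vec3 shadowedSunLight = GetSunLightColor() * mix(skyBounceColor, vec3(1.0), lightFactor);
vec3 cloudColor = cloudAlbedo * (shadowedSunLight + GetAmbientSkyColor());
```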

The refractive rippling water already makes the underlying spheres have a semi-interesting silhouette so I just aimed at adding some texture. I ended up just slapping on a procedural normal map based on fBM noise and voilà! We have some cheap, non-terrible looking coral!
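One common way to build such a procedural normal map is to treat the noise as a height field and bend the surface normal with its gradient via finite differences. A sketch under that assumption; the single-argument `fbm` call and `bumpStrength` are illustrative, not the shader’s exact interface:

```glsl
// Perturb a surface normal using an fBM noise height field.
// Central differences approximate the height gradient.
vec3 ApplyProceduralBump(vec3 normal, vec3 position, float bumpStrength)
{
    const float e = 0.01;
    float hX = fbm(position + vec3(e, 0, 0)) - fbm(position - vec3(e, 0, 0));
    float hY = fbm(position + vec3(0, e, 0)) - fbm(position - vec3(0, e, 0));
    float hZ = fbm(position + vec3(0, 0, e)) - fbm(position - vec3(0, 0, e));
    vec3 gradient = vec3(hX, hY, hZ) / (2.0 * e);
    // Keep only the component of the gradient tangent to the surface,
    // then tilt the normal against it.
    vec3 tangentGradient = gradient - normal * dot(gradient, normal);
    return normalize(normal - bumpStrength * tangentGradient);
}
```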

So the “sand” is looking a little flat. Let’s see what we can do to add some texture to it! I first attempted to see what we could get from just normal mapping. I started with a height map: high frequency fBM noise to simulate the grain texture, plus some noise fed into a sine wave to form the dunes. The code was pretty simple:
float SandHeightMap(vec3 position)
{
    float sandGrainNoise = 0.1 * fbm(position * 10.0, 2);
    float sandDuneDisplacement = 0.7 * sin(10.0 * fbm_4(10.0 + position / 40.0));
    return sandGrainNoise + sandDuneDisplacement;
}

And here’s how that looks:

That looks…better? But it’s still pretty flat. Hmm, we could try to enhance the geometry to be more interesting, but that either involves moving my opaque geometry to SDFs (which means going down a slower ray marching path) OR trying to make something more compelling using my limited building blocks of spheres and planes.
A third alternative is to use something very similar to ray marching: parallax occlusion mapping. Parallax occlusion mapping is a technique commonly used in games for faking geometry with a height map on things like low-poly planes that gives the illusion of a much higher fidelity surface.
At this point, you might be asking, isn’t this the same thing as just representing your ground plane as an SDF plane that’s displaced by a height map? In truth, it’s very similar, but the key difference is that parallax occlusion mapping is deferred until lighting and doesn’t affect the geometry. This has 2 benefits:
Since it’s a height map AND it’s on a plane aligned to the xz-plane, the code is really simple:

vec3 SandParallaxOcclusionMapping(vec3 position, vec3 view)
{
    int pomCount = 6;
    float marchSize = 0.3;
    for(int i = 0; i < pomCount; i++)
    {
        if(position.y < GROUND_LEVEL - SandHeightMap(position)) break;
        position += view * marchSize;
    }
    return position;
}

And here’s how that looks:

Finally, the last touch is to make the sand look wet when it’s close to the water. Because the water is modelled as an SDF, it’s really easy to query whether water is nearby and, if so, make the albedo a little darker. I also make the ground reflective here to mimic water built up on the sand, but to keep the reflection calculation cheap, I just assume the reflected ray hits the sky (i.e. I don’t intersect it with the water). One other trick is to generate the reflection ray with a normal pointing straight up (i.e. assuming a flat plane) rather than the normal from the sand dunes. This emulates a “puddle” look where water has built up on top of bumpy sand.
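The nearby-water darkening can be sketched as below. `QueryWaterSDF`, `wetThreshold`, and the 0.5 darkening factor are hypothetical names/values for illustration; the real shader queries its own water SDF.

```glsl
// Darken the sand albedo when the water SDF says the surface is nearby.
vec3 GetSandAlbedo(vec3 position, vec3 dryAlbedo)
{
    const float wetThreshold = 1.0;
    float waterDistance = QueryWaterSDF(position);
    // 0 when touching water, 1 once at least wetThreshold away,
    // giving a smooth wet-to-dry transition.
    float dryness = clamp(waterDistance / wetThreshold, 0.0, 1.0);
    return mix(dryAlbedo * 0.5, dryAlbedo, dryness);
}
```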

And that wraps things up! If you’ve read this far, thanks for bearing with me! I added a bunch of knobs, configurable by holding down certain keys, that let you play with physical properties such as absorption/turbulence/color/etc. What’s cool is that when you’ve built things on top of mostly physically-based principles, things like this generally “just work” and you can get some cool results. So to wrap up, here’s some radioactive jello using the exact same ShaderToy, just with different water properties:

I recently wrote a small ShaderToy that does some simple volumetric rendering. I decided to follow up with a post on how the ShaderToy works. It ended up a little longer than I expected so I’ve broken this into 2 parts: the first part will talk about modelling a volume using SDFs. Part 2 will go into using ray marching to render the volume. I highly recommend you view the interactive ShaderToy yourself here. If you’re on a phone or laptop, I suggest viewing the fast version here. I’ve included some code snippets, which should help get a high-level understanding of how the ShaderToy works but aren’t all-inclusive. If you want to understand things at a deeper level, I’d suggest cross-referencing this with the actual ShaderToy code.
I had 3 main goals for my ShaderToy:
I’ll be starting from this scene with some starter code. I’m not going to go deep into the implementation as it’s not too interesting, but just to give a sense of where we’re starting from:
Here’s what that looks like:

We’ll be rendering the volume as a separate pass that gets blended with the opaque scene, similar to how any real-time rendering engine would handle opaque surfaces vs translucents.
But first, before we can even do any volumetric rendering, we need a volume to render! I decided to use signed distance functions (SDFs) to model my volume. Why signed distance functions? Because I’m not an artist, and they’re really great for making organic shapes in a few lines of code. I’m not going to go deep into signed distance functions because Inigo Quilez has done a really great job of that already. If you’re interested, here’s a great list of different signed distance functions and modifiers here. And another on ray marching those SDFs.
So let’s start simple and throw a sphere in there:
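For reference, the two primitive SDFs the scene uses can be written as follows. These are the standard formulations from Inigo Quilez’s list; the exact signatures are my assumption, chosen to match how they’re called later in the post:

```glsl
// Signed distance from point p to a sphere with the given center and radius.
float sdSphere(vec3 p, vec3 center, float radius)
{
    return length(p - center) - radius;
}

// Signed distance to the y = 0 plane (positive above, negative below).
float sdPlane(vec3 p)
{
    return p.y;
}
```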

Now we’ll add an extra sphere and use a smooth union to merge the sphere distance functions together. This is taken straight from Inigo’s page, but I’m pasting it here for clarity:
// Taken from https://iquilezles.org/www/articles/distfunctions/distfunctions.htm
float sdSmoothUnion( float d1, float d2, float k )
{
    float h = clamp( 0.5 + 0.5*(d2-d1)/k, 0.0, 1.0 );
    return mix( d2, d1, h ) - k*h*(1.0-h);
}

Smooth union is extremely powerful: you can get something quite interesting by just combining it with a handful of simple shapes. Here’s what my set of smooth-unioned spheres looks like:

Okay, we have something blobby looking, but we really want something that’s more of a cloud than a blob. The really cool thing about SDFs is how easy it is to distort the surface by just adding some noise to the SDF. So let’s slap some fractal brownian motion (fBM) noise on top, using the position to index into the noise function. Inigo Quilez has us covered again with a really great article on fBM noise if you’re interested. But here’s how it looks with some fBM noise tossed on top:

Sweet! This thing suddenly looks a lot more interesting with the fBM noise! Finally, we want to give the illusion that the volume is interacting with the ground plane. To do this, I added a plane signed distance function just slightly under the actual ground plane and re-used that smooth union merge with a really aggressive union value (the k parameter). And after that you get this:

And then a final touch is to adjust the xz index into the fBM noise with time so that the volume has a kind of rolling fog look. In motion it looks pretty good!

Woohoo, we have something that looks like a cloudy thing! The code for calculating the SDF is pretty compact too:
float QueryVolumetricDistanceField( in vec3 pos)
{
    vec3 fbmCoord = (pos + 2.0 * vec3(iTime, 0.0, iTime)) / 1.5f;
    float sdfValue = sdSphere(pos, vec3(-8.0, 2.0 + 20.0 * sin(iTime), -1), 5.6);
    sdfValue = sdSmoothUnion(sdfValue, sdSphere(pos, vec3(8.0, 8.0 + 12.0 * cos(iTime), 3), 5.6), 3.0f);
    sdfValue = sdSmoothUnion(sdfValue, sdSphere(pos, vec3(5.0 * sin(iTime), 3.0, 0), 8.0), 3.0) + 7.0 * fbm_4(fbmCoord / 3.2);
    sdfValue = sdSmoothUnion(sdfValue, sdPlane(pos + vec3(0, 0.4, 0)), 22.0);
    return sdfValue;
}

But this is just rendering it as an opaque. We want something nice and fluffy! Which leads to part 2!
In part 1 we talked about how to generate a cloud-like volume using SDFs. We left off with this:

So how do we render this as a volume rather than an opaque? Let’s first talk about the physics we’re simulating. A volume represents a large number of particles within some span of space. And when I say large, I mean a LARGE amount. Enough that representing those particles individually is infeasible today, even for offline rendering. Things like fire, fog, and clouds are great examples of this. In fact, everything is technically volumetric, but for performance reasons it’s easier to turn a blind eye and assume things aren’t. We represent the aggregation of those particles as density values, generally in some 3D grid (or something more sophisticated like OpenVDB).
When light is going through a volume, a couple of things can happen when it hits a particle: it can get scattered and go off in another direction, or some of it can get absorbed by the particle and diffused out. In order to stay within the constraints of real-time, we will be doing what’s called single scattering. This means we assume light only scatters once: when the light hits a particle, it flies towards the camera. As a result we won’t be able to simulate multi-scattering effects, such as things in the distance getting blurrier in fog. But for our purposes that’s okay. Here’s a visual of what single scattering looks like with ray marching:

The pseudo code for this looks something like this:
for n steps along the camera ray:
    Calculate what % of your ray hit particles (i.e. was absorbed) and needs lighting
    for m lights:
        for k steps towards the light:
            Calculate % of light that was absorbed in this step
        Calculate lighting based on how much light is visible
Blend results on top of the opaque pass based on % of your ray that made it through the volume

So we’re talking about something with O(n * m * k) complexity. So buckle up, your GPU’s about to get a little toasty.
So first let’s just tackle absorption of light in a volume along the camera ray (i.e. let’s not march towards lights…yet). To do that, we need to do 2 things:
To calculate how much light gets absorbed at each point, we use the Beer-Lambert law, which describes light attenuation through a material. The math for this is surprisingly simple:

float BeerLambert(float absorptionCoefficient, float distanceTraveled)
{
    return exp(-absorptionCoefficient * distanceTraveled);
}

The absorptionCoefficient is a parameter of the material. For example, in a clear volume like water this value would be low, but for something thicker like milk you’d have a higher coefficient.
To ray march the volume, we just take fixed-size steps along the ray and accumulate absorption at each step. It might not be clear yet why we need fixed steps rather than something faster like sphere tracing, but once density isn’t uniform within the volume, that will become clearer. Here’s what the code for ray marching and accumulating absorption looks like. Some variables are defined outside of this snippet, but feel free to refer to the ShaderToy for the complete implementation:
float opaqueVisiblity = 1.0f;
const float marchSize = 0.6f;
for(int i = 0; i < MAX_VOLUME_MARCH_STEPS; i++) {
    volumeDepth += marchSize;
    if(volumeDepth > opaqueDepth) break;

    vec3 position = rayOrigin + volumeDepth*rayDirection;
    bool isInVolume = QueryVolumetricDistanceField(position) < 0.0f;
    if(isInVolume) {
        float previousOpaqueVisiblity = opaqueVisiblity;
        opaqueVisiblity *= BeerLambert(ABSORPTION_COEFFICIENT, marchSize);
        float absorptionFromMarch = previousOpaqueVisiblity - opaqueVisiblity;
        for(int lightIndex = 0; lightIndex < NUM_LIGHTS; lightIndex++) {
            float lightDistance = length(GetLight(lightIndex).Position - position);
            vec3 lightColor = GetLight(lightIndex).LightColor * GetLightAttenuation(lightDistance);
            volumetricColor += absorptionFromMarch * volumeAlbedo * lightColor;
        }
        volumetricColor += absorptionFromMarch * volumeAlbedo * GetAmbientLight();
    }
}

And this is what that gets you:

It looks like cotton candy! And perhaps for some effects this actually could be good enough! But what’s missing here is self shadowing. Light is reaching all parts of the volume equally, but that’s not physically correct: depending on how much volume sits between the point being rendered and the light, different amounts of light will arrive.
At this point, we’ve actually already done all the hard work. We need to do the same thing we did to calculate absorption along the camera ray, just along the light ray. The code for calculating how much light reaches each point is basically duplicated code, but duplicating is easier than hacking GLSL to get the kind of recursion we’d want. So here’s what that looks like:
float GetLightVisiblity(in vec3 rayOrigin, in vec3 rayDirection, in float maxT, in int maxSteps, in float marchSize) {
    float t = 0.0f;
    float lightVisiblity = 1.0f;
    for(int i = 0; i < maxSteps; i++) {
        t += marchSize;
        if(t > maxT) break;

        vec3 position = rayOrigin + t*rayDirection;
        if(QueryVolumetricDistanceField(position) < 0.0) {
            lightVisiblity *= BeerLambert(ABSORPTION_COEFFICIENT, marchSize);
        }
    }
    return lightVisiblity;
}

And adding self shadowing gets us this:

At this point, I was actually feeling pretty good about my volume. I showed it off to our talented VFX lead at The Coalition, James Sharpe, for some feedback. He immediately caught that the edges of the volume looked way too hard. Which is completely true: things like clouds are constantly diffusing into the space around them, so the edges blend with the empty space around the volume, which should lead to really soft edges. James suggested the great idea of lowering density based on how close you are to the edge. Which, because we’re working with signed distance functions, is really easy to do! So let’s add a function we can use to query density at any point in the volume:
float GetFogDensity(vec3 position)
{
    float sdfDistance = QueryVolumetricDistanceField(position);
    const float maxSDFMultiplier = 1.0;
    bool insideSDF = sdfDistance < 0.0;
    float sdfMultiplier = insideSDF ? min(abs(sdfDistance), maxSDFMultiplier) : 0.0;
    return sdfMultiplier;
}

And then we just fold this into our absorption value:

opaqueVisiblity *= BeerLambert(ABSORPTION_COEFFICIENT * GetFogDensity(position), marchSize);

And here’s how that looks:

And now that we have a density function, it becomes easy to add some noise into the volume to give it a little extra detail and fluff. In this case, I just re-use the fBM function we used for tweaking the volume’s shape:
float GetFogDensity(vec3 position)
{
    float sdfDistance = QueryVolumetricDistanceField(position);
    const float maxSDFMultiplier = 1.0;
    bool insideSDF = sdfDistance < 0.0;
    float sdfMultiplier = insideSDF ? min(abs(sdfDistance), maxSDFMultiplier) : 0.0;
    return sdfMultiplier * abs(fbm_4(position / 6.0) + 0.5);
}

And with that, we’re here:

The volume is looking pretty good at this point! One thing is that there’s still some light leaking through the volume. Here we see green light leaking into an area that should definitely be occluded by the volume:
This is because opaque objects are rendered before the volume, so their lighting doesn’t take into account shadowing from the volume. This is pretty easy to fix: we already have a GetLightVisiblity function we can use to calculate shadowing, so we just need to call it from our opaque object lighting. With that we get this:
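A sketch of that fix inside the opaque lighting loop, using the GetLightVisiblity function from above. The surrounding variable names (`opaqueColor`, `opaqueAlbedo`, `MAX_VOLUME_LIGHT_MARCH_STEPS`, the 0.65 march size) are my assumptions about the opaque pass, not an exact excerpt:

```glsl
// Attenuate each light's contribution by how much volume sits between
// the opaque surface point and the light.
for(int lightIndex = 0; lightIndex < NUM_LIGHTS; lightIndex++) {
    vec3 toLight = GetLight(lightIndex).Position - position;
    float lightDistance = length(toLight);
    vec3 lightDirection = toLight / lightDistance;

    float lightVisiblity = GetLightVisiblity(position, lightDirection,
        lightDistance, MAX_VOLUME_LIGHT_MARCH_STEPS, 0.65);

    vec3 lightColor = GetLight(lightIndex).LightColor * GetLightAttenuation(lightDistance);
    opaqueColor += lightVisiblity * opaqueAlbedo * lightColor
        * max(dot(normal, lightDirection), 0.0);
}
```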

In addition to getting some really nice colored shadows, it also really helps ground the shadow and sell the volume as part of the scene. We also get soft shadows even though we’re technically working with point lights, thanks to the soft edges in the volume. And that’s it! There’s certainly more I could have done, but this felt like it hit the visual bar I was aiming for while keeping the sample relatively simple.
At this point this blog post is running a little long, but I’ll make quick mention of some optimizations:
That’s about it! Personally I was surprised how you can get something fairly physically-based with a relatively small amount of code (~500 lines). Thanks for reading, hopefully it was mildly interesting. If you have questions, feel free to reach out to me on Twitter.
Oh and one more thing, here’s a fun tweak where I add emissive based on the SDF distance to make an explosion effect. Because we could always use more explosions.
