<h1>Realtime Raytracing in Bevy 0.18 (Solari)</h1> <p>JMS55's Blog · 2025-12-27 · <a href="https://jms55.github.io/posts/2025-12-27-solari-bevy-0-18/">https://jms55.github.io/posts/2025-12-27-solari-bevy-0-18/</a></p> <h2 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h2> <p>It's been approximately three months since <a href="https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/">my last post</a>, which means it's time to talk about all the work I've been doing for the upcoming release of Bevy 0.18!</p> <p>Like last time, this cycle has seen me focused entirely on Solari - Bevy's next-gen, fully dynamic raytraced lighting system, which lets artists and developers get high quality lighting without having to spend any time on static baking.</p> <figure> <img src="headline.png" > <figcaption><p>PICA PICA using Solari in Bevy 0.18</p> </figcaption> </figure> <p>Before getting into what's changed in this release, let's take a quick look back at where Solari was in Bevy 0.17.</p> <h2 id="recap-of-0-17">Recap of 0.17<a class="zola-anchor" href="#recap-of-0-17" aria-label="Anchor link for: recap-of-0-17" style="visibility: hidden;"></a> </h2> <p>Bevy 0.17 saw the initial release of Solari, with the following components:</p> <ul> <li>Direct diffuse lighting via <strong>ReSTIR DI</strong></li> <li>Indirect diffuse lighting final gather via <strong>ReSTIR GI</strong></li> <li>Multi-bounce indirect diffuse lighting via a world-space <strong>irradiance cache</strong> (world cache)</li> <li>Denoising, anti-aliasing, and upscaling via <strong>DLSS-RR</strong></li> </ul> <p>ReSTIR DI handles the first bounce of lighting, ReSTIR GI handles the second bounce, and the world cache handles all subsequent bounces.</p> <p>Summed together and denoised, we get full pathtraced lighting, close to the quality of an offline 
movie-quality pathtracer - but running much, much faster due to heavy temporal and spatial amortization.</p> <p>Or at least, that's the theory.</p> <p>In practice, all the amortization and shortcuts give up some accuracy (making the result biased) in order to improve performance.</p> <p>My goal with Solari is to get <em>as close as possible</em> to the offline reference (zero bias), while getting "good enough" performance for realtime. Going into the 0.18 dev cycle, improving quality was my main priority.</p> <p>To that end, Bevy 0.18 brings many quality (and some performance!) improvements to Solari:</p> <ul> <li>Specular material support</li> <li>Fixed the loss of brightness in the scene compared to the reference</li> <li>Reduced correlations and bias from ReSTIR resampling</li> <li>Reduced GI lag</li> <li>Improved performance on larger scenes</li> </ul> <h2 id="specular-materials">Specular Materials<a class="zola-anchor" href="#specular-materials" aria-label="Anchor link for: specular-materials" style="visibility: hidden;"></a> </h2> <p>In Bevy 0.17, Solari only supported diffuse materials. Diffuse materials were easier to get started with, as they don't depend on the incident light direction - they scatter the same no matter what direction the light is coming from.</p> <p>Of course, games want more than just purely diffuse materials. 
Most PBR materials combine a diffuse lobe (Burley in Bevy's standard renderer, Lambert in Solari) with a specular lobe (usually GGX).</p> <p>In Bevy 0.18, Solari now supports specular materials using a multiscattering GGX lobe, which gets added to the diffuse lobe.</p> <figure> <img src="dragons_realtime.png" > <figcaption><p>Metallic and non-metallic meshes of varying roughness</p> </figcaption> </figure> <h3 id="brdf-evaluation">BRDF Evaluation<a class="zola-anchor" href="#brdf-evaluation" aria-label="Anchor link for: brdf-evaluation" style="visibility: hidden;"></a> </h3> <p>First, let's take a look at the material BRDF itself:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">#</span><span>import bevy_pbr::lighting::</span><span style="color:#657b83;">{</span><span style="color:#cb4b16;">F_AB</span><span>, </span><span style="color:#cb4b16;">D_GGX</span><span>, V_SmithGGXCorrelated, fresnel, specular_multiscatter</span><span style="color:#657b83;">} </span><span style="color:#859900;">#</span><span>import bevy_pbr::pbr_functions::</span><span style="color:#657b83;">{</span><span>calculate_diffuse_color, calculate_F0</span><span style="color:#657b83;">} </span><span style="color:#859900;">#</span><span>import bevy_render::maths::</span><span style="color:#cb4b16;">PI </span><span style="color:#859900;">#</span><span>import bevy_solari::scene_bindings::ResolvedMaterial </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">evaluate_brdf</span><span style="color:#657b83;">( </span><span> </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">wo</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">wi</span><span>: 
vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">material</span><span>: ResolvedMaterial, </span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> diffuse_brdf </span><span style="color:#657b83;">= </span><span style="color:#859900;">evaluate_diffuse_brdf</span><span style="color:#657b83;">(</span><span>material.base_color, material.metallic</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> specular_brdf </span><span style="color:#657b83;">= </span><span style="color:#859900;">evaluate_specular_brdf</span><span style="color:#657b83;">( </span><span> world_normal, </span><span> wo, </span><span> wi, </span><span> material.base_color, </span><span> material.metallic, </span><span> material.reflectance, </span><span> material.perceptual_roughness, </span><span> material.roughness, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return</span><span> diffuse_brdf </span><span style="color:#657b83;">+</span><span> specular_brdf; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">evaluate_diffuse_brdf</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">base_color</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">metallic</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> diffuse_color </span><span style="color:#657b83;">= 
</span><span style="color:#859900;">calculate_diffuse_color</span><span style="color:#657b83;">(</span><span>base_color, metallic, </span><span style="color:#6c71c4;">0.0</span><span>, </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return</span><span> diffuse_color </span><span style="color:#657b83;">/ </span><span style="color:#cb4b16;">PI</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">evaluate_specular_brdf</span><span style="color:#657b83;">( </span><span> </span><span style="color:#268bd2;">N</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">V</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">L</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">base_color</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">metallic</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#268bd2;">reflectance</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">perceptual_roughness</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#268bd2;">roughness</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> H </span><span style="color:#657b83;">= </span><span 
style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span>L </span><span style="color:#657b83;">+</span><span> V</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> NdotL </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>N, L</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> NdotH </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>N, H</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> LdotH </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>L, H</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> NdotV </span><span style="color:#657b83;">= </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>N, V</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">0.0001</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let </span><span style="color:#cb4b16;">F0 </span><span style="color:#657b83;">=</span><span> calculate_F0</span><span style="color:#657b83;">(</span><span>base_color, metallic, reflectance</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> 
F_ab </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">F_AB</span><span style="color:#657b83;">(</span><span>perceptual_roughness, NdotV</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> D </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">D_GGX</span><span style="color:#657b83;">(</span><span>roughness, NdotH</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> Vs </span><span style="color:#657b83;">=</span><span> V_SmithGGXCorrelated</span><span style="color:#657b83;">(</span><span>roughness, NdotV, NdotL</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> F </span><span style="color:#657b83;">= </span><span style="color:#859900;">fresnel</span><span style="color:#657b83;">(</span><span style="color:#cb4b16;">F0</span><span>, LdotH</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return specular_multiscatter</span><span style="color:#657b83;">(</span><span>D, Vs, F, </span><span style="color:#cb4b16;">F0</span><span>, F_ab, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Diffuse is nearly the same as in Solari 0.17, except the diffuse BRDF was changed so that it returns 0 for metallic materials, as metallic materials have no diffuse lobe.</p> <p>For specular, a lot of the code can be reused from <code>bevy_pbr</code>, so the BRDF evaluation is only a couple of lines of function calls.</p> <p>One thing though that tripped me up is that special care must be taken to avoid NaNs.</p> <p>In addition to clamping <code>NdotV</code> in the BRDF, we also limit roughness to 0.001 when loading materials, as zero roughness materials cause NaNs in the visibility 
function.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Clamp roughness to prevent NaNs </span><span>m.perceptual_roughness </span><span style="color:#657b83;">= </span><span style="color:#859900;">clamp</span><span style="color:#657b83;">(</span><span>m.perceptual_roughness, </span><span style="color:#6c71c4;">0.0316227766</span><span>, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#586e75;">// Clamp roughness to 0.001 </span><span>m.roughness </span><span style="color:#657b83;">=</span><span> m.perceptual_roughness </span><span style="color:#657b83;">*</span><span> m.perceptual_roughness; </span></code></pre> <h3 id="brdf-sampling">BRDF Sampling<a class="zola-anchor" href="#brdf-sampling" aria-label="Anchor link for: brdf-sampling" style="visibility: hidden;"></a> </h3> <p>Given this is a pathtracer, we don't just want to evaluate the BRDF; we also want to importance sample it to choose directions that would contribute a lot of outgoing light.</p> <p>There are a couple of different methods to sample the overall BRDF for non-metallic materials that have both a diffuse and specular lobe, but let's skip that for now and just discuss sampling each individually.</p> <p>Sampling the diffuse (Lambert) BRDF is pretty simple - it's just a cosine-weighted hemisphere (code from Solari 0.17):</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://www.realtimerendering.com/raytracinggems/unofficial_RayTracingGems_v1.9.pdf#0004286901.INDD%3ASec28%3A303 </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">sample_cosine_hemisphere</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">normal</span><span>: 
vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">rng</span><span>: ptr&lt;function, </span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> cos_theta </span><span style="color:#657b83;">= </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">- </span><span style="color:#6c71c4;">2.0 </span><span style="color:#657b83;">* </span><span style="color:#859900;">rand_f</span><span style="color:#657b83;">(</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> phi </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">PI_2 </span><span style="color:#657b83;">* </span><span style="color:#859900;">rand_f</span><span style="color:#657b83;">(</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> sin_theta </span><span style="color:#657b83;">= </span><span style="color:#859900;">sqrt</span><span style="color:#657b83;">(</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">-</span><span> cos_theta </span><span style="color:#657b83;">*</span><span> cos_theta, </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> x </span><span style="color:#657b83;">=</span><span> normal.x </span><span style="color:#657b83;">+</span><span> sin_theta </span><span style="color:#657b83;">* </span><span style="color:#859900;">cos</span><span style="color:#657b83;">(</span><span>phi</span><span 
style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> y </span><span style="color:#657b83;">=</span><span> normal.y </span><span style="color:#657b83;">+</span><span> sin_theta </span><span style="color:#657b83;">* </span><span style="color:#859900;">sin</span><span style="color:#657b83;">(</span><span>phi</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> z </span><span style="color:#657b83;">=</span><span> normal.z </span><span style="color:#657b83;">+</span><span> cos_theta; </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(</span><span>x, y, z</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>With the pdf being <code>cos_theta / PI</code>.</p> <p>Sampling the specular (GGX) BRDF, however, is much more complicated.</p> <p>The current state of the art paper for sampling a GGX distribution is "Bounded VNDF Sampling for Smith–GGX Reflections" by Kenta Eto and Yusuke Tokuyoshi:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://gpuopen.com/download/Bounded_VNDF_Sampling_for_Smith-GGX_Reflections.pdf (Listing 1) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">sample_ggx_vndf</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">wi_tangent</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">roughness</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#268bd2;">rng</span><span>: ptr&lt;function, </span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span 
style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">if</span><span> roughness </span><span style="color:#657b83;">&lt;= </span><span style="color:#6c71c4;">0.001 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(-</span><span>wi_tangent.xy, wi_tangent.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> i </span><span style="color:#657b83;">=</span><span> wi_tangent; </span><span> </span><span style="color:#268bd2;">let</span><span> rand </span><span style="color:#657b83;">= </span><span style="color:#859900;">rand_vec2f</span><span style="color:#657b83;">(</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> i_std </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>i.xy </span><span style="color:#657b83;">*</span><span> roughness, i.z</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> phi </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">PI_2 </span><span style="color:#657b83;">*</span><span> rand.x; </span><span> </span><span style="color:#268bd2;">let</span><span> a </span><span style="color:#657b83;">=</span><span> roughness; </span><span> </span><span style="color:#268bd2;">let</span><span> s </span><span style="color:#657b83;">= </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">+ </span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span style="color:#859900;">vec2</span><span 
style="color:#657b83;">(</span><span>i.xy</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> a2 </span><span style="color:#657b83;">=</span><span> a </span><span style="color:#657b83;">*</span><span> a; </span><span> </span><span style="color:#268bd2;">let</span><span> s2 </span><span style="color:#657b83;">=</span><span> s </span><span style="color:#657b83;">*</span><span> s; </span><span> </span><span style="color:#268bd2;">let</span><span> k </span><span style="color:#657b83;">= (</span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">-</span><span> a2</span><span style="color:#657b83;">) *</span><span> s2 </span><span style="color:#657b83;">/ (</span><span>s2 </span><span style="color:#657b83;">+</span><span> a2 </span><span style="color:#657b83;">*</span><span> i.z </span><span style="color:#657b83;">*</span><span> i.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> b </span><span style="color:#657b83;">= </span><span style="color:#859900;">select</span><span style="color:#657b83;">(</span><span>i_std.z, k </span><span style="color:#657b83;">*</span><span> i_std.z, i.z </span><span style="color:#657b83;">&gt; </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> z </span><span style="color:#657b83;">= </span><span style="color:#859900;">fma</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">-</span><span> rand.y, </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">+</span><span> b, </span><span style="color:#657b83;">-</span><span>b</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> sin_theta </span><span style="color:#657b83;">= </span><span 
style="color:#859900;">sqrt</span><span style="color:#657b83;">(</span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">-</span><span> z </span><span style="color:#657b83;">*</span><span> z</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> o_std </span><span style="color:#657b83;">= </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>sin_theta </span><span style="color:#657b83;">* </span><span style="color:#859900;">cos</span><span style="color:#657b83;">(</span><span>phi</span><span style="color:#657b83;">)</span><span>, sin_theta </span><span style="color:#657b83;">* </span><span style="color:#859900;">sin</span><span style="color:#657b83;">(</span><span>phi</span><span style="color:#657b83;">)</span><span>, z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> m_std </span><span style="color:#657b83;">=</span><span> i_std </span><span style="color:#657b83;">+</span><span> o_std; </span><span> </span><span style="color:#268bd2;">let</span><span> m </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>m_std.xy </span><span style="color:#657b83;">*</span><span> roughness, m_std.z</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#859900;">return </span><span style="color:#6c71c4;">2.0 </span><span style="color:#657b83;">* </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>i, m</span><span style="color:#657b83;">) *</span><span> m </span><span style="color:#657b83;">-</span><span> i; </span><span style="color:#657b83;">} </span><span> </span><span 
style="color:#586e75;">// https://gpuopen.com/download/Bounded_VNDF_Sampling_for_Smith-GGX_Reflections.pdf (Listing 2) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">ggx_vndf_pdf</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">wi_tangent</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">wo_tangent</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">roughness</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> i </span><span style="color:#657b83;">=</span><span> wi_tangent; </span><span> </span><span style="color:#268bd2;">let</span><span> o </span><span style="color:#657b83;">=</span><span> wo_tangent; </span><span> </span><span style="color:#268bd2;">let</span><span> m </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span>i </span><span style="color:#657b83;">+</span><span> o</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> ndf </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">D_GGX</span><span style="color:#657b83;">(</span><span>roughness, </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span>m.z</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> ai </span><span style="color:#657b83;">=</span><span> roughness </span><span style="color:#657b83;">*</span><span> i.xy; </span><span> </span><span style="color:#268bd2;">let</span><span> len2 </span><span style="color:#657b83;">= </span><span 
style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>ai, ai</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> t </span><span style="color:#657b83;">= </span><span style="color:#859900;">sqrt</span><span style="color:#657b83;">(</span><span>len2 </span><span style="color:#657b83;">+</span><span> i.z </span><span style="color:#657b83;">*</span><span> i.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">if</span><span> i.z </span><span style="color:#657b83;">&gt;= </span><span style="color:#6c71c4;">0.0 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> a </span><span style="color:#657b83;">=</span><span> roughness; </span><span> </span><span style="color:#268bd2;">let</span><span> s </span><span style="color:#657b83;">= </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">+ </span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>i.xy</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> a2 </span><span style="color:#657b83;">=</span><span> a </span><span style="color:#657b83;">*</span><span> a; </span><span> </span><span style="color:#268bd2;">let</span><span> s2 </span><span style="color:#657b83;">=</span><span> s </span><span style="color:#657b83;">*</span><span> s; </span><span> </span><span style="color:#268bd2;">let</span><span> k </span><span style="color:#657b83;">= (</span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">-</span><span> a2</span><span style="color:#657b83;">) *</span><span> s2 </span><span style="color:#657b83;">/ (</span><span>s2 </span><span style="color:#657b83;">+</span><span> a2 </span><span style="color:#657b83;">*</span><span> i.z </span><span style="color:#657b83;">*</span><span> i.z</span><span 
style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return</span><span> ndf </span><span style="color:#657b83;">/ (</span><span style="color:#6c71c4;">2.0 </span><span style="color:#657b83;">* (</span><span>k </span><span style="color:#657b83;">*</span><span> i.z </span><span style="color:#657b83;">+</span><span> t</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#859900;">return</span><span> ndf </span><span style="color:#657b83;">* (</span><span>t </span><span style="color:#657b83;">-</span><span> i.z</span><span style="color:#657b83;">) / (</span><span style="color:#6c71c4;">2.0 </span><span style="color:#657b83;">*</span><span> len2</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>There are two tricky things to note with these functions:</p> <ul> <li>Inputs and outputs are in tangent space, and not world space</li> <li><code>wo</code> and <code>wi</code> are defined from the BRDF's perspective, which is typically opposite to how you think about it in a pathtracer</li> </ul> <p>So in practice you call them like so:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://jcgt.org/published/0006/01/01/paper.pdf </span><span style="color:#268bd2;">let </span><span style="color:#cb4b16;">TBN </span><span style="color:#657b83;">= </span><span style="color:#859900;">orthonormalize</span><span style="color:#657b83;">(</span><span>surface.world_normal</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> T </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">TBN</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">]</span><span>; 
</span><span style="color:#268bd2;">let</span><span> B </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">TBN</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> N </span><span style="color:#657b83;">= </span><span style="color:#cb4b16;">TBN</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#586e75;">// Convert input from world space to tangent space </span><span style="color:#268bd2;">let</span><span> wo_tangent </span><span style="color:#657b83;">= </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>wo, T</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>wo, B</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>wo, N</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#586e75;">// Swapped wo and wi </span><span style="color:#268bd2;">let</span><span> wi_tangent </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_ggx_vndf</span><span style="color:#657b83;">(</span><span>wo_tangent, surface.material.roughness, </span><span style="color:#859900;">&amp;</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#586e75;">// Convert output from tangent space to world space </span><span style="color:#268bd2;">let</span><span> wi </span><span style="color:#657b83;">=</span><span> wi_tangent.x </span><span style="color:#657b83;">*</span><span> T </span><span style="color:#657b83;">+</span><span> wi_tangent.y </span><span 
style="color:#657b83;">*</span><span> B </span><span style="color:#657b83;">+</span><span> wi_tangent.z </span><span style="color:#657b83;">*</span><span> N; </span><span> </span><span style="color:#586e75;">// Swapped wo and wi </span><span style="color:#268bd2;">let</span><span> pdf </span><span style="color:#657b83;">= </span><span style="color:#859900;">ggx_vndf_pdf</span><span style="color:#657b83;">(</span><span>wo_tangent, wi_tangent, surface.material.roughness</span><span style="color:#657b83;">)</span><span>; </span></code></pre> <p>One final thing to note is this line of code I added to <code>sample_ggx_vndf</code>, which doesn't appear in the paper:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">if</span><span> roughness </span><span style="color:#657b83;">&lt;= </span><span style="color:#6c71c4;">0.001 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(-</span><span>wi_tangent.xy, wi_tangent.z</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Remember how earlier we clamped roughness to 0.001? 
Well that means we can no longer render perfect mirrors.</p> <p>To get around this, when importance sampling the specular BRDF for a material with a roughness of 0.001, we just treat it like a perfect mirror and reflect the incident light direction around the Z axis.</p> <p>This restores mirror-like behavior, while still preventing NaNs in BRDF evaluation.</p> <p><figure> <img src="without_mirror_fix.png" > <figcaption><p>Without the fix - a little blurry</p> </figcaption> </figure> <figure> <img src="with_mirror_fix.png" > <figcaption><p>With the fix - perfect mirror!</p> </figcaption> </figure> </p> <h3 id="specular-di">Specular DI<a class="zola-anchor" href="#specular-di" aria-label="Anchor link for: specular-di" style="visibility: hidden;"></a> </h3> <p>Now that we've covered Solari's updated material BRDF, let's talk about how lighting has changed.</p> <p>First up - specular direct lighting.</p> <h4 id="status-quo">Status Quo<a class="zola-anchor" href="#status-quo" aria-label="Anchor link for: status-quo" style="visibility: hidden;"></a> </h4> <p>To recap: For direct lighting, Solari is using ReSTIR DI.</p> <p>We take a series of random initial samples from light sources, and use RIS to choose the best one. 
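</p>
<p>To make that step concrete, here's a minimal weighted reservoir sampling (RIS) sketch in plain Rust - illustrative types and names only, not Solari's actual WGSL:</p>

```rust
// Minimal weighted reservoir sampling: pick 1 sample out of N candidates,
// each weighted by (target function / source pdf). Illustrative only.
struct Reservoir {
    selected: Option<usize>, // index of the chosen light sample
    weight_sum: f32,         // running sum of resampling weights
}

impl Reservoir {
    fn new() -> Self {
        Reservoir { selected: None, weight_sum: 0.0 }
    }

    /// `w` is the candidate's resampling weight, `u` is a uniform random
    /// number in [0, 1).
    fn update(&mut self, candidate: usize, w: f32, u: f32) {
        self.weight_sum += w;
        // Keep the new candidate with probability w / weight_sum; over all
        // updates, each candidate wins proportionally to its weight.
        if self.weight_sum > 0.0 && u < w / self.weight_sum {
            self.selected = Some(candidate);
        }
    }
}

fn main() {
    let mut reservoir = Reservoir::new();
    // Three candidate light samples; a fixed "random" sequence keeps the
    // example reproducible. The strong middle candidate should win.
    let weights = [0.1, 5.0, 0.2];
    let randoms = [0.5, 0.5, 0.99];
    for ((i, &w), &u) in weights.iter().enumerate().zip(randoms.iter()) {
        reservoir.update(i, w, u);
    }
    assert_eq!(reservoir.selected, Some(1));
    println!("selected candidate: {:?}", reservoir.selected);
}
```

<p>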
This is essentially fancy next event estimation (NEE).</p> <p>We then do some temporal and spatial resampling to share good samples between frames/pixels.</p> <p>Finally, we shade using the final selected sample (which in Bevy 0.17 used only the diffuse BRDF).</p> <h4 id="changes">Changes<a class="zola-anchor" href="#changes" aria-label="Anchor link for: changes" style="visibility: hidden;"></a> </h4> <p>To add support for specular materials, there's a couple of different places that we should modify:</p> <ol> <li>Account for the specular BRDF in the target function during initial resampling</li> <li>Account for the specular BRDF in the target functions during temporal and spatial resampling</li> <li>Trace a BRDF ray during initial sampling and combine it with the NEE samples using multiple importance sampling (MIS) <ul> <li>This is the only way to sample DI for zero-roughness mirror surfaces</li> <li>Improves quality for glossy (mid-roughness) surfaces</li> <li>Improves quality for area lights that are very close to the surface</li> </ul> </li> <li>Account for the specular BRDF during shading of the final selected sample</li> </ol> <p>For Bevy 0.18, I ended up spending most of my time on GI, so for DI I only did #4.</p> <p>#1 and #2 are tricky because the whole point of ReSTIR is to share samples across pixels. 
But for specular, samples are not (easily) shareable, as unlike the diffuse lobe, a strong source of light for pixel A might be outside the specular lobe of pixel B and have zero contribution.</p> <p>Maybe in practice it's not a big deal, or maybe using a second set of reservoirs for specular would help, but for now I've chosen to skip these, and treat all surfaces (including metallic ones) as purely diffuse during resampling.</p> <p>#3 requires an extra raytrace, which costs a lot of performance, and so again I've skipped it.</p> <p>When I get more time to experiment, I'll play around with these and see if any of them work well.</p> <p>So to sum it up, for DI, all I did was swap <code>albedo / PI</code> with a call to <code>evaluate_brdf()</code> during the final shading step.</p> <h3 id="diffuse-gi-changes">Diffuse GI Changes<a class="zola-anchor" href="#diffuse-gi-changes" aria-label="Anchor link for: diffuse-gi-changes" style="visibility: hidden;"></a> </h3> <p>Indirect lighting is where specular gets much more interesting.</p> <p>First off, as far as the world cache is concerned, all surfaces are diffuse only, with no specular lobe. This means that when you query the cache, you treat the query point as a diffuse surface. When updating cache entries, you also treat the cache point as a diffuse surface.</p> <p>For per-pixel GI, Solari splits the lighting calculations into two separate passes - one for the diffuse lobe, and one for the specular lobe.</p> <p>The diffuse lobe is handled by the existing ReSTIR GI pass. 
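</p>
<p>Conceptually, the two passes sum into the final per-pixel indirect lighting. A simplified Rust sketch (illustrative, not Solari's shader code - the exact weighting in Solari may differ) shows why fully metallic pixels get nothing from the diffuse pass: standard metallic-roughness PBR scales the diffuse lobe by <code>1 - metallic</code>:</p>

```rust
// Simplified sketch of the two-lobe split. Assumption for illustration:
// the diffuse lobe is weighted by base_color * (1 - metallic), as in
// standard metallic-roughness PBR.
fn indirect_lighting(
    base_color: [f32; 3],
    metallic: f32,
    diffuse_gi: [f32; 3],  // output of the diffuse (ReSTIR GI) pass
    specular_gi: [f32; 3], // output of the dedicated specular GI pass
) -> [f32; 3] {
    let diffuse_weight = 1.0 - metallic;
    let mut out = [0.0f32; 3];
    for i in 0..3 {
        out[i] = diffuse_gi[i] * base_color[i] * diffuse_weight + specular_gi[i];
    }
    out
}

fn main() {
    // A fully metallic pixel: only the specular pass contributes.
    let metal = indirect_lighting([1.0, 0.8, 0.3], 1.0, [0.5; 3], [0.2; 3]);
    assert_eq!(metal, [0.2, 0.2, 0.2]);
    println!("{:?}", metal);
}
```

<p>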
ReSTIR GI resampling is exactly the same as in Bevy 0.17 - like DI, only the final shading changes.</p> <p>For the ReSTIR GI final shading step, we're still shading using only the diffuse lobe, but now we need to skip shading metallic pixels that don't have a diffuse lobe.</p> <figure> <img src="diffuse_gi.png" > <figcaption><p>Diffuse GI only - metallic surfaces are black because they don't have any diffuse contribution</p> </figcaption> </figure> <h3 id="specular-gi">Specular GI<a class="zola-anchor" href="#specular-gi" aria-label="Anchor link for: specular-gi" style="visibility: hidden;"></a> </h3> <p>The specular lobe, on the other hand, is handled by an entirely new dedicated specular GI pass.</p> <p>The basic structure of the pass looks like this (simplified):</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> surface </span><span style="color:#657b83;">= </span><span style="color:#859900;">load_from_gbuffer</span><span style="color:#657b83;">(</span><span>pixel_id</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> wo </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span>view.world_position </span><span style="color:#657b83;">-</span><span> surface.world_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span>var radiance: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;; </span><span>var wi: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;; </span><span style="color:#859900;">if</span><span> surface.material.roughness </span><span style="color:#657b83;">&gt; </span><span style="color:#6c71c4;">0.4 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Surface is very rough, reuse the ReSTIR GI 
reservoir </span><span> </span><span style="color:#268bd2;">let</span><span> gi_reservoir </span><span style="color:#657b83;">=</span><span> gi_reservoirs_a</span><span style="color:#657b83;">[</span><span>pixel_index</span><span style="color:#657b83;">]</span><span>; </span><span> wi </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span>gi_reservoir.sample_point_world_position </span><span style="color:#657b83;">-</span><span> surface.world_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> radiance </span><span style="color:#657b83;">=</span><span> gi_reservoir.radiance </span><span style="color:#657b83;">*</span><span> gi_reservoir.unbiased_contribution_weight; </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Surface is glossy or mirror-like, trace a new path </span><span> </span><span style="color:#268bd2;">let</span><span> wi_tangent </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_ggx_vndf</span><span style="color:#657b83;">(</span><span>wo_tangent, surface.material.roughness, </span><span style="color:#859900;">&amp;</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> wi </span><span style="color:#657b83;">=</span><span> wi_tangent.x </span><span style="color:#657b83;">*</span><span> T </span><span style="color:#657b83;">+</span><span> wi_tangent.y </span><span style="color:#657b83;">*</span><span> B </span><span style="color:#657b83;">+</span><span> wi_tangent.z </span><span style="color:#657b83;">*</span><span> N; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> pdf </span><span style="color:#657b83;">= </span><span style="color:#859900;">ggx_vndf_pdf</span><span style="color:#657b83;">(</span><span>wo_tangent, wi_tangent, 
surface.material.roughness</span><span style="color:#657b83;">)</span><span>; </span><span> radiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">trace_glossy_path</span><span style="color:#657b83;">(</span><span>surface.world_position, wi, </span><span style="color:#859900;">&amp;</span><span>rng</span><span style="color:#657b83;">) /</span><span> pdf; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#586e75;">// Final shading </span><span style="color:#268bd2;">let</span><span> brdf </span><span style="color:#657b83;">= </span><span style="color:#859900;">evaluate_specular_brdf</span><span style="color:#657b83;">(</span><span>surface.world_normal, wo, wi, surface.material</span><span style="color:#859900;">...</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> cos_theta </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>wi, surface.world_normal</span><span style="color:#657b83;">))</span><span>; </span><span>radiance </span><span style="color:#657b83;">*=</span><span> brdf </span><span style="color:#657b83;">*</span><span> cos_theta </span><span style="color:#657b83;">*</span><span> view.exposure; </span></code></pre> <p>For rough surfaces, the specular lobe is wide enough to approximate the diffuse lobe. We can just skip tracing any new rays, and reuse the ReSTIR GI sample directly. 
This saves a lot of performance, with minimal quality loss.</p> <p>For glossy or mirror surfaces, we need to trace a new path, following the best direction from importance sampling the GGX distribution.</p> <p>The full code for <code>trace_glossy_path</code> is a bit long, so I'm just going to link to the <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/blob/64c7bec4068aa063bfaa2cddcb90733f0e081cf8/crates/bevy_solari/src/realtime/specular_gi.wgsl#L71-L150">source on GitHub</a>.</p> <figure> <img src="specular_gi.png" > <figcaption><p>Specular GI only</p> </figcaption> </figure> <p>The basic outline is:</p> <ul> <li>We trace up to three bounces (after three bounces, the quality loss from skipping further bounces is minimal)</li> <li>Lighting comes from any of: hitting an emissive surface, NEE, or terminating in the world cache</li> <li>Each bounce samples the GGX distribution to find the next bounce direction (if the surface was rough enough, we would have terminated in the world cache - more on this in a second)</li> </ul> <p>It's essentially just a standard pathtracer, except with theoretically higher coherence from always following the specular lobe.</p> <p>However, there are many subtle details that took me some time to figure out:</p> <ul> <li>Emissive contributions are skipped on the first bounce, as ReSTIR DI handles those paths</li> <li>We only query the world cache when hitting a rough surface (otherwise reflections would show the grid-like world cache)</li> <li>We skip NEE for mirror surfaces</li> <li>We apply MIS between the emissive contribution and the NEE contribution</li> </ul> <p>And there are still some large remaining issues:</p> <ul> <li>NEE is using entirely random samples, which leads to noisy reflections</li> <li>Glossy surfaces don't have any sort of path guiding to choose good directions, which also leads to noisy reflections</li> <li>The lack of specular motion vectors to aid the denoiser leads to ghosting when objects in 
reflections move around</li> <li>Terminating in the world cache still leads to quality issues sometimes, especially on curved surfaces</li> </ul> <p>Specular motion vectors are something I plan to work on. I just need to spend some more time understanding the theory.</p> <p>As for improving sampling during the path trace, this is technically what ReSTIR PT was invented to solve. However, ReSTIR PT is also very performance intensive. I'm not convinced it's the path we should go down for Solari.</p> <p>I have some other ideas in mind for improving sampling, which I'll talk about at the end of this post.</p> <h2 id="energy-loss-bug">Energy Loss Bug<a class="zola-anchor" href="#energy-loss-bug" aria-label="Anchor link for: energy-loss-bug" style="visibility: hidden;"></a> </h2> <p>One of the big problems Solari had in 0.17 was overall energy loss compared to a pathtraced reference.</p> <p>At the time, I chalked it up to an inherent limitation of the world cache and moved on.</p> <p>However, while experimenting with various things this cycle, I realized that not only was the world cache not to blame, but DI was losing energy too, not just GI!</p> <p>After many painful days of narrowing down the issue, I tracked it down to the <a href="https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/#light-tile-presampling">light tile code</a>, which was shared between DI and the world cache.</p> <p>The rgb9e5 packing of the light radiance I was doing did not have enough bits to encode the light, and so energy was being lost.</p> <p>The fix (thanks to @SparkyPotato) was to apply a log2-based encoding to the radiance before packing. 
This allocates more bits towards the values that human perception cares about, and less bits towards the values that we have a harder time seeing.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">pack_resolved_light_sample</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">sample</span><span>: ResolvedLightSample</span><span style="color:#657b83;">) </span><span>-&gt; ResolvedLightSamplePacked </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSamplePacked</span><span style="color:#657b83;">( </span><span> </span><span style="color:#586e75;">// ... </span><span> </span><span style="color:#859900;">vec3_to_rgb9e5_</span><span style="color:#657b83;">(</span><span style="color:#859900;">log2</span><span style="color:#657b83;">(</span><span>sample.radiance </span><span style="color:#657b83;">*</span><span> view.exposure </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">))</span><span>, </span><span> </span><span style="color:#586e75;">// ... 
</span><span> </span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">unpack_resolved_light_sample</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">packed</span><span>: ResolvedLightSamplePacked, </span><span style="color:#268bd2;">exposure</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; ResolvedLightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSample</span><span style="color:#657b83;">( </span><span> </span><span style="color:#586e75;">// ... </span><span> </span><span style="color:#657b83;">(</span><span style="color:#859900;">exp2</span><span style="color:#657b83;">(</span><span style="color:#859900;">rgb9e5_to_vec3_</span><span style="color:#657b83;">(</span><span>packed.radiance</span><span style="color:#657b83;">)) - </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">) /</span><span> exposure, </span><span> </span><span style="color:#586e75;">// ... 
</span><span> </span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>With this fix, we're much closer to matching the reference.</p> <p><figure> <img src="energy_loss.png" > <figcaption><p>Energy loss due to poor encoding of radiance in light tiles</p> </figcaption> </figure> <figure> <img src="energy_loss_fixed.png" > <figcaption><p>Correct energy due to a better encoding</p> </figcaption> </figure> </p> <h2 id="resampling-correlations-bias">Resampling Correlations/Bias<a class="zola-anchor" href="#resampling-correlations-bias" aria-label="Anchor link for: resampling-correlations-bias" style="visibility: hidden;"></a> </h2> <p>One of the problems I wasn't able to solve in Bevy 0.17 was ReSTIR DI correlations introducing artifacts when denoising with DLSS-RR.</p> <figure> <img src="di_correlations.png" > <figcaption><p>Correlations from ReSTIR DI confusing the denoiser</p> </figcaption> </figure> <p>For ReSTIR GI, I was able to solve this with permutation sampling during temporal reuse. 
But for ReSTIR DI, trying to use permutation sampling led to artifacts on shadow penumbras due to the way I was doing visibility reuse.</p> <figure> <img src="di_permutation_artifacts.png" > <figcaption><p>Visibility reuse messing up shadows when using permutation sampling</p> </figcaption> </figure> <p>I played with resampling ordering a bit more this cycle, and was able to come up with a solution.</p> <p>In Bevy 0.17, the whole ReSTIR DI algorithm looked like this:</p> <ol> <li>Initial sampling</li> <li>Test visibility of initial sample</li> <li>Temporal resampling</li> <li>Choose spatial sample</li> <li>Test visibility of spatial sample</li> <li>Spatial resampling</li> <li>Store final reservoir for next frame temporal reuse</li> <li>Final shading</li> </ol> <p>In Bevy 0.18, it now looks like this:</p> <ol> <li>Initial sampling</li> <li>Test visibility of initial sample</li> <li>Temporal resampling</li> <li>Choose spatial sample</li> <li>Spatial resampling</li> <li>Store final reservoir for next frame temporal reuse</li> <li>Test visibility of final reservoir</li> <li>Final shading</li> </ol> <p>The two big differences are:</p> <ul> <li>The second visibility test was moved from the spatial sample to the final sample after all resampling steps</li> <li>The second visibility test is performed for the final shading, but is <em>not</em> fed forward for next frame's temporal resampling</li> </ul> <p>Moving the second visibility test from the spatial sample only to after all resampling was the key change.</p> <p>Before permutation sampling, it was ok to not re-test visibility for the temporal sample. If the light was visible to the pixel last frame, it's probably still visible this frame. Same for if the light was not visible last frame. When this assumption is wrong, e.g. 
for moving objects, it just led to a 1-frame lag in shadows that's almost unnoticeable - an acceptable tradeoff.</p> <p>With permutation sampling, we can no longer trust that the visibility of the temporal sample is correct to reuse. The temporal sample may now come from a neighboring pixel, and at shadow penumbras, the visibility is changing very frequently. It's no longer safe to reuse visibility, even on static scenes - we must retest visibility.</p> <p>The best way to test visibility without using extra ray traces is to move it right before shading of the final sample, where incorrect visibility would show up on screen.</p> <p>The second change (not feeding forward the second visibility test to the next frame) is not strictly necessary, but keeps direct lighting unbiased.</p> <p>If you were to feed forward the second visibility test, the following might happen:</p> <ol> <li>A pixel checks visibility and finds that the light is occluded, setting the reservoir's contribution to 0</li> <li>The reservoir is stored for reuse next frame</li> <li>&lt;Next frame&gt;</li> <li>The reservoir is reused temporally for the same pixel (say that the initial sample happened to also be 0 contribution)</li> <li>The reservoir is reused spatially by a different pixel, which sees that it has zero contribution, and does not choose it via resampling <ul> <li>Except since this is a different pixel, the light is not occluded, and the sample should have had non-zero contribution!</li> </ul> </li> </ol> <p>Reusing visibility like this leads to bias in the form of shadows that "halo" objects, expanding further out than they should.</p> <figure> <img src="di_feed_forward_bad.png" > <figcaption><p>Feeding forward final visibility leads to over-shadowing artifacts</p> </figcaption> </figure> <p>While so far I've only talked about DI resampling, these changes actually apply to GI resampling too. 
Solari's ReSTIR GI pass now uses the same modified ordering as the DI resampling, which fixes indirect shadow artifacts. It's just that incorrect shadow edges are not as obvious with GI as they are for DI, so it's less important.</p> <p>One final note on DI resampling: like we were doing with ReSTIR GI, we now use the balance heuristic for ReSTIR DI resampling, instead of constant MIS weights. This makes a small difference (hence why I never noticed it until now), but it <em>does</em> slightly increase emissive light brightness, matching the pathtraced reference better.</p> <figure> <img src="di_good_018.png" > <figcaption><p>Final ReSTIR DI output after all the changes</p> </figcaption> </figure> <h2 id="world-cache-improvements">World Cache Improvements<a class="zola-anchor" href="#world-cache-improvements" aria-label="Anchor link for: world-cache-improvements" style="visibility: hidden;"></a> </h2> <p>The world cache is the oldest part of Solari - it was copied nearly wholesale from my original prototype three years ago, without any real changes in Bevy 0.17 except for the addition of the LOD system.</p> <p>Because of this, it was also the jankiest part of Solari.</p> <p>As I started testing on more complex scenes, it became clear that there were significant problems:</p> <ul> <li>On the cornell box scene, it worked fine.</li> <li>On the PICA PICA scene, it worked ok when conditions were static, but under dynamic conditions the GI was fairly laggy.</li> <li>On Bistro, performance wasn't good, especially as you started moving around the scene.</li> </ul> <p>In Bevy 0.18, I spent a large amount of time fixing these issues.</p> <h3 id="cache-lag">Cache Lag<a class="zola-anchor" href="#cache-lag" aria-label="Anchor link for: cache-lag" style="visibility: hidden;"></a> </h3> <p>In the PICA PICA scene, if you turn off all the lights, it would take a good while for the light to completely fade. 
The reason being that: A) the world cache samples itself, recursively propagating light around the scene for a while, and B) the exponential blend between new and current radiance samples keeps the old radiance around for a decent amount of time.</p> <video style="max-width: 100%; margin: var(--gap) var(--gap) 0 var(--gap); border-radius: 6px;" controls> <source src="gi_lag.mp4" type="video/mp4"> </video> <center> <p><em>Laggy GI with fixed blend factor</em></p> </center> <p>To combat this, we could increase the blend factor, to keep the lighting responsive. However that would lead to way more noise and instability under static lighting conditions.</p> <p>What we really need is an adaptive blend factor, which <a rel="nofollow noreferrer" href="https://bsky.app/profile/gboisse.bsky.social/post/3m5blga3ftk2a">Guillaume Boissé</a> was kind enough to share with me.</p> <p>We keep track of the change in luminance between frames, and use that to compute an adaptive blend factor.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> old_radiance </span><span style="color:#657b83;">=</span><span> world_cache_radiance</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> new_radiance </span><span style="color:#657b83;">=</span><span> world_cache_active_cells_new_radiance</span><span style="color:#657b83;">[</span><span>active_cell_id.x</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> luminance_delta </span><span style="color:#657b83;">=</span><span> world_cache_luminance_deltas</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#586e75;">// 
https://bsky.app/profile/gboisse.bsky.social/post/3m5blga3ftk2a </span><span style="color:#268bd2;">let</span><span> sample_count </span><span style="color:#657b83;">= </span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>old_radiance.a </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">1.0</span><span>, </span><span style="color:#cb4b16;">WORLD_CACHE_MAX_TEMPORAL_SAMPLES</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> alpha </span><span style="color:#657b83;">= </span><span style="color:#859900;">abs</span><span style="color:#657b83;">(</span><span>luminance_delta</span><span style="color:#657b83;">) / </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">luminance</span><span style="color:#657b83;">(</span><span>old_radiance.rgb</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">0.001</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> max_sample_count </span><span style="color:#657b83;">= </span><span style="color:#859900;">mix</span><span style="color:#657b83;">(</span><span style="color:#cb4b16;">WORLD_CACHE_MAX_TEMPORAL_SAMPLES</span><span>, </span><span style="color:#6c71c4;">1.0</span><span>, </span><span style="color:#859900;">pow</span><span style="color:#657b83;">(</span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span>alpha</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">/ </span><span style="color:#6c71c4;">8.0</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#268bd2;">let</span><span> blend_amount </span><span style="color:#657b83;">= </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">/ </span><span 
style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>sample_count, max_sample_count</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> blended_radiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">mix</span><span style="color:#657b83;">(</span><span>old_radiance.rgb, new_radiance, blend_amount</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> blended_luminance_delta </span><span style="color:#657b83;">= </span><span style="color:#859900;">mix</span><span style="color:#657b83;">(</span><span>luminance_delta, </span><span style="color:#859900;">luminance</span><span style="color:#657b83;">(</span><span>blended_radiance</span><span style="color:#657b83;">) - </span><span style="color:#859900;">luminance</span><span style="color:#657b83;">(</span><span>old_radiance.rgb</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">/ </span><span style="color:#6c71c4;">8.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span>world_cache_radiance</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">] = </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>blended_radiance, sample_count</span><span style="color:#657b83;">)</span><span>; </span><span>world_cache_luminance_deltas</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">] =</span><span> blended_luminance_delta; </span></code></pre> <p>Now GI is stable under static conditions, but reacts pretty fast under dynamic conditions. 
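</p>
<p>To make the behavior easy to poke at, here's the same adaptive blend logic re-created as plain CPU-side Rust, with <code>mix</code> expanded into its lerp form. The max temporal sample count of 30 is an assumed value for illustration, not necessarily Solari's actual constant:</p>

```rust
// CPU-side re-creation of the adaptive blend factor above.
// Assumption: WORLD_CACHE_MAX_TEMPORAL_SAMPLES = 30.0 (illustrative value).
const MAX_TEMPORAL_SAMPLES: f32 = 30.0;

/// Blend amount for a cache cell, given its accumulated sample count, the
/// tracked luminance delta, and the old cell luminance.
fn blend_amount(sample_count: f32, luminance_delta: f32, old_luminance: f32) -> f32 {
    let alpha = luminance_delta.abs() / old_luminance.max(0.001);
    // mix(MAX, 1.0, t) expanded: a large relative change collapses the
    // effective sample count towards 1, so new radiance replaces old.
    let t = alpha.clamp(0.0, 1.0).powf(1.0 / 8.0);
    let max_sample_count = MAX_TEMPORAL_SAMPLES + (1.0 - MAX_TEMPORAL_SAMPLES) * t;
    1.0 / sample_count.min(max_sample_count)
}

fn main() {
    // Static lighting: tiny delta -> slow, stable accumulation (1/30).
    let stable = blend_amount(30.0, 0.0, 1.0);
    // Lights just turned off: huge relative delta -> near-instant response (1.0).
    let dynamic = blend_amount(30.0, -1.0, 1.0);
    assert!(stable < dynamic);
    println!("stable = {stable}, dynamic = {dynamic}");
}
```

<p>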
It's not perfect - we're still heavily relying on temporal accumulation and denoising - but it's a heck of a lot better.</p> <video style="max-width: 100%; margin: var(--gap) var(--gap) 0 var(--gap); border-radius: 6px;" controls> <source src="gi_less_lag.mp4" type="video/mp4"> </video> <center> <p><em>Less-laggy GI with adaptive blend factor</em></p> </center> <p>Once again, thanks a ton to Guillaume Boissé for this code! I was struggling to come up with something myself, and this perfectly solved my problem!</p> <h3 id="cache-lifetimes">Cache Lifetimes<a class="zola-anchor" href="#cache-lifetimes" aria-label="Anchor link for: cache-lifetimes" style="visibility: hidden;"></a> </h3> <p>While Solari was working great on smaller scenes, on larger scenes like Bistro, performance was much worse.</p> <p>The world cache update pass was taking way too long, and the cost kept growing as I moved around the scene.</p> <p>The reason is that since cache entries sample each other (in order to get multibounce lighting), they were keeping each other alive forever. 
So once you stepped into an area, it would forever be present in the world cache, even when you left the area.</p> <p>The solution (thanks to @IsaacSM and @NthTensor) ended up being pretty simple!</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">query_world_cache</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">view_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">cell_lifetime</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span style="color:#268bd2;">rng</span><span>: ptr&lt;function, </span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> cell_size </span><span style="color:#657b83;">= </span><span style="color:#859900;">get_cell_size</span><span style="color:#657b83;">(</span><span>world_position, view_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> world_position_quantized </span><span style="color:#657b83;">= </span><span>bitcast&lt;vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">quantize_position</span><span style="color:#657b83;">(</span><span>world_position, cell_size</span><span 
style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> world_normal_quantized </span><span style="color:#657b83;">= </span><span>bitcast&lt;vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">quantize_normal</span><span style="color:#657b83;">(</span><span>world_normal</span><span style="color:#657b83;">))</span><span>; </span><span> var key </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_key</span><span style="color:#657b83;">(</span><span>world_position_quantized, world_normal_quantized</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> checksum </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_checksum</span><span style="color:#657b83;">(</span><span>world_position_quantized, world_normal_quantized</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var i </span><span style="color:#657b83;">=</span><span> 0u; i </span><span style="color:#657b83;">&lt; </span><span style="color:#cb4b16;">WORLD_CACHE_MAX_SEARCH_STEPS</span><span>; i</span><span style="color:#657b83;">++) { </span><span> </span><span style="color:#268bd2;">let</span><span> existing_checksum </span><span style="color:#657b83;">=</span><span> atomicCompareExchangeWeak</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_checksums</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, </span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL</span><span>, checksum</span><span style="color:#657b83;">)</span><span>.old_value; </span><span> </span><span> </span><span style="color:#586e75;">// Cell already exists or is empty 
- reset lifetime </span><span> </span><span style="color:#859900;">if</span><span> existing_checksum </span><span style="color:#657b83;">==</span><span> checksum </span><span style="color:#859900;">||</span><span> existing_checksum </span><span style="color:#657b83;">== </span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL </span><span style="color:#657b83;">{ </span><span style="color:#859900;">#</span><span>ifndef </span><span style="color:#cb4b16;">WORLD_CACHE_QUERY_ATOMIC_MAX_LIFETIME </span><span> atomicStore</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_life</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, cell_lifetime</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#859900;">#else </span><span> atomicMax</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_life</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, cell_lifetime</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#859900;">#</span><span>endif </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#859900;">if</span><span> existing_checksum </span><span style="color:#657b83;">==</span><span> checksum </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Cache entry already exists - get radiance </span><span> </span><span style="color:#859900;">return</span><span> world_cache_radiance</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.rgb; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else if</span><span> existing_checksum </span><span style="color:#657b83;">== </span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL </span><span 
style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Cell is empty - initialize it </span><span> world_cache_geometry_data</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.world_position </span><span style="color:#657b83;">=</span><span> world_position; </span><span> world_cache_geometry_data</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.world_normal </span><span style="color:#657b83;">=</span><span> world_normal; </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Collision - linear probe to next entry </span><span> key </span><span style="color:#657b83;">+=</span><span> 1u; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>When a ReSTIR GI or specular GI pixel is querying the world cache, nothing has changed. 
We still perform <code>atomicStore(&amp;world_cache_life[key], WORLD_CACHE_CELL_LIFETIME)</code>, resetting the lifetime of the queried cache entry.</p> <p>However, when a world cache entry queries another entry during the world cache update pass, the algorithm changes.</p> <p>Now we're instead doing <code>atomicMax(&amp;world_cache_life[key], cell_lifetime_of_querier)</code>.</p> <p>When the camera is in a given area, ReSTIR GI and specular GI pixels will reset world cache entries to their max lifetime. Then during world cache update the next frame, those world cache entries will copy their max lifetime to other entries nearby.</p> <p>However, once the camera moves away from the area, there will be no more pixels querying the world cache. When world cache entries go to query each other, they'll copy over their current lifetimes (which are decaying each frame). After a couple of frames, all the world cache entries in the area will go dead.</p> <p>No more performance wasted on areas the camera will never see!</p> <h3 id="misc-cache-tweaks">Misc Cache Tweaks<a class="zola-anchor" href="#misc-cache-tweaks" aria-label="Anchor link for: misc-cache-tweaks" style="visibility: hidden;"></a> </h3> <figure> <img src="bistro_trace.png" > <figcaption><p>NSight trace of Bistro showing an expensive and spiky world cache update</p> </figcaption> </figure> <p>Finally, I tweaked a bunch of other things based on my testing in Bistro:</p> <ul> <li>Limited indirect rays sent from cache entries during the world cache update step to a max of 50 meters - This prevents long raytraces from holding up the whole threadgroup, improving performance, and prevents far-away samples from influencing the cache, reducing variance.</li> <li>Switched the world cache update workgroup size from 1024 to 64 threads - Much more appropriate for raytracing workloads.
This fixed some really weird GPU usage traces I was seeing in NSight.</li> <li>Made the world cache transition between LODs faster - In a large scene like Bistro, we had way too many cache entries for far-away areas.</li> </ul> <p>Combined, these changes brought the world cache update step from 1.42ms down to a much more reasonable 0.09ms in Bistro.</p> <h2 id="what-s-next">What's Next<a class="zola-anchor" href="#what-s-next" aria-label="Anchor link for: what-s-next" style="visibility: hidden;"></a> </h2> <p>This blog post just scratched the surface of the past three months. I didn't even cover stuff I tried that didn't work out, but I'm a little sick of writing this and want to get something posted rather than spend a month perfecting it.</p> <p>So with that, let's talk about the future!</p> <p>Solari has improved a ton in these past three months, but of course there's still more work to be done!</p> <h3 id="general-improvements">General Improvements<a class="zola-anchor" href="#general-improvements" aria-label="Anchor link for: general-improvements" style="visibility: hidden;"></a> </h3> <p>First, some general issues carrying over from my last blog post:</p> <ul> <li>Feature parity for things like skinned and morphed meshes, alpha masks, transparent materials, support for more types of light sources, etc. still needs implementing.</li> <li>Solari is still NVIDIA-only in practice due to relying on DLSS-RR (FSR-RR <em>did</em> release since my last blog post, but to my immense sadness is currently DirectX 12 only - no Vulkan support.
AMD employees - please reach out!)</li> <li>Shader execution reordering (blocked on wgpu support) and half-resolution GI (on top of DLSS upscaling) would bring major performance improvements.</li> </ul> <h3 id="sampling-improvements">Sampling Improvements<a class="zola-anchor" href="#sampling-improvements" aria-label="Anchor link for: sampling-improvements" style="visibility: hidden;"></a> </h3> <p>The other major improvements I want to make are to sampling quality. There are a bunch of different papers and techniques I want to experiment with.</p> <h4 id="di-sampling">DI Sampling<a class="zola-anchor" href="#di-sampling" aria-label="Anchor link for: di-sampling" style="visibility: hidden;"></a> </h4> <p>For DI, our initial sampling is totally random, which is pretty terrible. The minimal improvement would be to build a <a rel="nofollow noreferrer" href="https://blog.traverseresearch.nl/fast-cdf-generation-on-the-gpu-for-light-picking-5c50b97c552b">global CDF</a> of lights in the scene. A better, but much more complex and expensive method would be to build the <a rel="nofollow noreferrer" href="https://gpuopen.com/download/Hierarchical_Light_Sampling_with_Accurate_Spherical_Gaussian_Lighting.pdf">spherical gaussian light tree</a> I mentioned in my last post.</p> <p>However, we can get even better results by building and caching a <em>local</em> distribution of lights at discrete points in the scene.</p> <p><a rel="nofollow noreferrer" href="https://www.yiningkarlli.com/projects/cachepoints.html">Disney's Hyperion renderer</a> uses a set of randomly traced candidate paths to pick cache points in the scene, and estimates both the unshadowed light contribution and visibility of a light at each cache point, stored as a small CDF table.</p> <p>Unreal Engine's <a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2025/content/MegaLights_Stochastic_Direct_Lighting_2025.pdf">MegaLights</a> uses a similar idea in screen space. 
@SparkyPotato has been <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/21366">experimenting with a world-space equivalent</a> of the idea in Solari.</p> <p>Unlike Disney's cache points, we already have a good way to discretize the scene - the spatial hashing we use for the GI world cache. We're already sampling DI at each world cache cell - why not additionally build up a CDF, and use that to improve light sampling for ReSTIR DI and Specular GI NEE? Or, we could go the other way, and splat each per-pixel sample from ReSTIR DI and Specular GI NEE back into the world cache. Or... both? Lots of things to experiment with (and not enough time!)</p> <p>Lastly, <a rel="nofollow noreferrer" href="https://ishaanshah.xyz/risltc">linearly-transformed cosines</a> (LTCs) are a promising avenue to explore for improving resampling quality.</p> <h4 id="gi-sampling">GI Sampling<a class="zola-anchor" href="#gi-sampling" aria-label="Anchor link for: gi-sampling" style="visibility: hidden;"></a> </h4> <p>Like DI, we can improve our GI sampling.</p> <p>Recently, realtime pathtracing research has been retracing (pun intended) the steps that offline pathtracing took, and researching path guiding techniques.
There are several promising avenues to explore:</p> <ul> <li><a rel="nofollow noreferrer" href="https://www.lalber.org/2025/04/markov-chain-path-guiding">Markov chain resampling</a></li> <li><a rel="nofollow noreferrer" href="https://research.nvidia.com/labs/rtr/publication/zeng2025restirpg">ReSTIR PT splatting</a></li> <li><a rel="nofollow noreferrer" href="https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/computergrafik/lehrstuhl/veroeffentlichungen/robust-fitting-of-parallax-aware-mixtures-for-path-guiding">Parallax-aware vMF mixtures</a></li> </ul> <p>All three techniques share the same basic idea: build a local distribution of incident light in world-space, stored as a combination of several vMF lobes. The main differences are how samples get fed into the distribution, and the exact steps used to update the distribution.</p> <p>It's the exact same idea as improving DI sampling - discretize to world space, estimate a local distribution, use it for sampling. And again the same questions arise - splat a small set of candidate paths traced from the camera into the cache; build on top of the existing cache update pass; or both?</p> <p>The same questions also apply to the world irradiance cache itself. Currently the cache is updated in a dedicated pass at the start of the frame, sampling from a fixed point for each active cache cell.
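As a rough sketch of that per-cell update style, here's an exponential blend of new samples into a cell (hypothetical names and constants - the real pass is a WGSL compute shader, and the traced lighting sample is elided):

```rust
// Minimal sketch of a dedicated per-cell cache update. Names and constants
// are hypothetical; the actual radiance sample would come from a ray trace.
struct CacheCell {
    radiance: [f32; 3],
    sample_count: f32,
}

// Capping the effective history length keeps the cache responsive to change.
const MAX_SAMPLE_COUNT: f32 = 32.0;

fn blend_new_sample(cell: &mut CacheCell, new_sample: [f32; 3]) {
    cell.sample_count = (cell.sample_count + 1.0).min(MAX_SAMPLE_COUNT);
    let blend = 1.0 / cell.sample_count;
    for i in 0..3 {
        cell.radiance[i] = cell.radiance[i] * (1.0 - blend) + new_sample[i] * blend;
    }
}

fn main() {
    let mut cell = CacheCell { radiance: [0.0; 3], sample_count: 0.0 };
    blend_new_sample(&mut cell, [1.0, 2.0, 3.0]);
    // The first sample fully replaces the (empty) history.
    assert_eq!(cell.radiance, [1.0, 2.0, 3.0]);
}
```

The first sample replaces the empty history outright; later samples converge toward a running average whose blend weight floors at 1/32.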
Other caches like NVIDIA's <a rel="nofollow noreferrer" href="https://github.com/NVIDIA-RTX/SHARC">SHARC</a>, NVIDIA's <a rel="nofollow noreferrer" href="https://research.nvidia.com/publication/2021-06_real-time-neural-radiance-caching-path-tracing">NRC</a> and AMD's <a rel="nofollow noreferrer" href="https://gpuopen.com/manuals/fsr_sdk/techniques/radiance-cache/#training-the-cache">FSR Radiance Caching</a> all splat candidate paths traced from the camera.</p> <p>Lots of room for experimentation.</p> <p>Additionally, as a final note on GI quality, one of Solari's worst forms of artifacts is currently GI light leaks on the edges of objects. While hashing the surface normal helps, on curved surfaces and corners, it's not a perfect solution.</p> <p>And it's actually very easy to identify the cases where this happens. Light leaks tend to occur when the length of the ray querying the cache is less than the size of the cache cell.</p> <p>I tried both sampling last frame's screen-space texture, and <a rel="nofollow noreferrer" href="https://gboisse.github.io/posts/this-is-us">hashing <code>ray_t &lt; cell_size</code></a>, but unfortunately neither helped. More experimentation is needed.</p> <h4 id="specular">Specular<a class="zola-anchor" href="#specular" aria-label="Anchor link for: specular" style="visibility: hidden;"></a> </h4> <p>Specular DI and GI in Solari 0.18 are a fairly minimal initial implementation, and as I've mentioned throughout the post, there's a lot that could be improved.</p> <ul> <li>Specular motion vectors are not implemented, so mirror and glossy indirect reflections can have ghosting.
<ul> <li>I need to implement either <a rel="nofollow noreferrer" href="https://developer.nvidia.com/blog/rendering-perfect-reflections-and-refractions-in-path-traced-games">"Rendering Perfect Reflections and Refractions in Path-Traced Games"</a> or <a rel="nofollow noreferrer" href="https://zheng95z.github.io/publications/trmv21">"Temporally Reliable Motion Vectors for Real-time Ray Tracing"</a>.</li> </ul> </li> <li>Local light sampling would greatly help NEE quality for specular GI. Currently, NEE is heavily undersampled. DLSS-RR does its best, but you can see some cross-stitch patterns on glossy reflections where the denoiser is struggling.</li> <li>Path guiding for GI would help with tracing glossy paths.</li> <li>Experimenting with ReSTIR for both specular DI and specular GI.</li> <li>Potentially terminating a specular GI path into the world cache sooner based on total roughness/cone-spread of the path.</li> </ul> <h2 id="results">Results<a class="zola-anchor" href="#results" aria-label="Anchor link for: results" style="visibility: hidden;"></a> </h2> <p>All results were captured on an RTX 3080 locked to base clocks in NSight, at 1600x900 upscaled to 3200x1800 via DLSS-RR.</p> <h3 id="pica-pica">PICA PICA<a class="zola-anchor" href="#pica-pica" aria-label="Anchor link for: pica-pica" style="visibility: hidden;"></a> </h3> <p><figure> <img src="pica_pica_realtime.png" > <figcaption><p>PICA PICA - Solari realtime</p> </figcaption> </figure> <figure> <img src="pica_pica_reference.png" > <figcaption><p>PICA PICA - Pathtraced reference (ignore the black noise - it's a bug)</p> </figcaption> </figure> </p> <h3 id="bistro">Bistro<a class="zola-anchor" href="#bistro" aria-label="Anchor link for: bistro" style="visibility: hidden;"></a> </h3> <p><figure> <img src="bistro_realtime.png" > <figcaption><p>Bistro - Solari realtime (ignore the foliage - Solari doesn't support alpha masks yet)</p> </figcaption> </figure> <figure> <img src="bistro_reference.png" >
<figcaption><p>Bistro - Pathtraced reference</p> </figcaption> </figure> </p> <h3 id="dragons">Dragons<a class="zola-anchor" href="#dragons" aria-label="Anchor link for: dragons" style="visibility: hidden;"></a> </h3> <p><figure> <img src="dragons_realtime.png" > <figcaption><p>Dragons - Solari realtime</p> </figcaption> </figure> <figure> <img src="dragons_reference.png" > <figcaption><p>Dragons - Pathtraced reference</p> </figcaption> </figure> </p> <h3 id="cornell-box">Cornell Box<a class="zola-anchor" href="#cornell-box" aria-label="Anchor link for: cornell-box" style="visibility: hidden;"></a> </h3> <p><figure> <img src="cornell_box_realtime.png" > <figcaption><p>Cornell Box - Solari realtime</p> </figcaption> </figure> <figure> <img src="cornell_box_reference.png" > <figcaption><p>Cornell Box - Pathtraced reference</p> </figcaption> </figure> </p> <h3 id="performance">Performance<a class="zola-anchor" href="#performance" aria-label="Anchor link for: performance" style="visibility: hidden;"></a> </h3> <table><thead><tr><th style="text-align: center">Pass</th><th style="text-align: center">PICA PICA (ms)</th><th style="text-align: center">Bistro (ms)</th><th style="text-align: center">Dragons (ms)</th><th style="text-align: center">Cornell Box (ms)</th></tr></thead><tbody> <tr><td style="text-align: center">Presample Light Tiles</td><td style="text-align: center">0.03</td><td style="text-align: center">0.09</td><td style="text-align: center">0.02</td><td style="text-align: center">0.02</td></tr> <tr><td style="text-align: center">World Cache: Decay Cells</td><td style="text-align: center">0.01</td><td style="text-align: center">0.02</td><td style="text-align: center">0.02</td><td style="text-align: center">0.01</td></tr> <tr><td style="text-align: center">World Cache: Compaction P1</td><td style="text-align: center">0.04</td><td style="text-align: center">0.04</td><td style="text-align: center">0.04</td><td style="text-align: center">0.04</td></tr> <tr><td
style="text-align: center">World Cache: Compaction P2</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td></tr> <tr><td style="text-align: center">World Cache: Write Active Cells</td><td style="text-align: center">0.01</td><td style="text-align: center">0.02</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td></tr> <tr><td style="text-align: center">World Cache: Sample Lighting</td><td style="text-align: center">0.03</td><td style="text-align: center">0.66</td><td style="text-align: center">0.05</td><td style="text-align: center">0.03</td></tr> <tr><td style="text-align: center">World Cache: Blend New Samples</td><td style="text-align: center">0.01</td><td style="text-align: center">0.03</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td></tr> <tr><td style="text-align: center">ReSTIR DI: Initial + Temporal</td><td style="text-align: center">0.28</td><td style="text-align: center">1.89</td><td style="text-align: center">0.39</td><td style="text-align: center">0.22</td></tr> <tr><td style="text-align: center">ReSTIR DI: Spatial + Shade</td><td style="text-align: center">0.18</td><td style="text-align: center">1.06</td><td style="text-align: center">0.23</td><td style="text-align: center">0.16</td></tr> <tr><td style="text-align: center">ReSTIR GI: Initial + Temporal</td><td style="text-align: center">0.30</td><td style="text-align: center">2.28</td><td style="text-align: center">0.80</td><td style="text-align: center">0.29</td></tr> <tr><td style="text-align: center">ReSTIR GI: Spatial + Shade</td><td style="text-align: center">0.31</td><td style="text-align: center">1.37</td><td style="text-align: center">0.56</td><td style="text-align: center">0.27</td></tr> <tr><td style="text-align: center">Specular GI</td><td style="text-align: center">0.61</td><td style="text-align: center">0.35</td><td 
style="text-align: center">0.31</td><td style="text-align: center">0.09</td></tr> <tr><td style="text-align: center">DLSS-RR: Copy Inputs From GBuffer</td><td style="text-align: center">0.04</td><td style="text-align: center">0.08</td><td style="text-align: center">0.05</td><td style="text-align: center">0.04</td></tr> <tr><td style="text-align: center">DLSS-RR</td><td style="text-align: center">6.10</td><td style="text-align: center">6.16</td><td style="text-align: center">6.08</td><td style="text-align: center">6.07</td></tr> <tr><td style="text-align: center"><strong>Total</strong></td><td style="text-align: center"><strong>7.96</strong></td><td style="text-align: center"><strong>14.06</strong></td><td style="text-align: center"><strong>8.58</strong></td><td style="text-align: center"><strong>7.27</strong></td></tr> </tbody></table> Realtime Raytracing in Bevy 0.17 (Solari) 2025-09-20T00:00:00+00:00 2025-09-20T00:00:00+00:00 Unknown https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/ <p>Lighting a scene is hard! 
Anyone who's tried to make a 3D scene look good knows the frustration of placing light probes, tweaking shadow cascades, and trying to figure out why their materials don't look quite right.</p> <p>Over the past few years, real-time raytracing has gone from a research curiosity to a shipping feature in major game engines, promising to solve many of these problems by simulating how light actually behaves.</p> <p>With the release of v0.17, <a rel="nofollow noreferrer" href="https://bevy.org">Bevy</a> now joins the club with experimental support for hardware raytracing!</p> <video style="max-width: 100%; margin: var(--gap) var(--gap) 0 var(--gap); border-radius: 6px;" controls> <source src="solari_recording.mp4" type="video/mp4"> </video> <center> <p><em><a rel="nofollow noreferrer" href="https://github.com/SEED-EA/pica-pica-assets">PICA PICA scene by SEED</a></em></p> </center> <p>Try it out yourself:</p> <pre data-lang="bash" style="background-color:#002b36;color:#839496;" class="language-bash "><code class="language-bash" data-lang="bash"><span style="color:#b58900;">git</span><span> clone https://github.com/bevyengine/bevy </span><span style="color:#859900;">&amp;&amp; cd</span><span> bevy </span><span style="color:#b58900;">git</span><span> checkout release-0.17.0 </span><span style="color:#b58900;">cargo</span><span> run</span><span style="color:#268bd2;"> --release --example</span><span> solari</span><span style="color:#268bd2;"> --features</span><span> bevy_solari,https </span><span style="color:#586e75;"># Optionally setup DLSS support for NVIDIA GPUs following https://github.com/bevyengine/dlss_wgpu?tab=readme-ov-file#downloading-the-dlss-sdk </span><span style="color:#b58900;">cargo</span><span> run</span><span style="color:#268bd2;"> --release --example</span><span> solari</span><span style="color:#268bd2;"> --features</span><span> bevy_solari,https,dlss </span></code></pre> <h2 id="introduction">Introduction<a class="zola-anchor" href="#introduction" 
aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h2> <p>Back in 2023, I <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/10000">started</a> a project I called Solari to integrate hardware raytracing into Bevy's rendering pipeline. I was experimenting with <a rel="nofollow noreferrer" href="https://youtu.be/2GYXuM10riw">Lumen</a>-style screen space probes for global illumination, and later extended it to use <a rel="nofollow noreferrer" href="https://radiance-cascades.com">radiance cascades</a>.</p> <p>These techniques, while theoretically sound, proved challenging to use in practice. Screen space probes were tricky to get good quality out of (reusing and reprojecting the same probe across multiple pixels is hard!), and radiance cascades brought its own set of artifacts and performance costs.</p> <p>On top of the algorithmic challenges, the ecosystem simply wasn't ready. Wgpu's raytracing support existed only as a work-in-progress PR that never got merged upstream. Maintaining a fork of wgpu (and by extension, Bevy) was time-consuming and unsustainable. After months of dealing with these challenges, I shelved the project.</p> <p>In the 2 years since, I've learned a bunch more, raytracing has been upstreamed into wgpu, and raytracing algorithms have gotten much more developed. 
I've restarted the project with a new approach (ReSTIR, DLSS-RR), and soon it will be released as an official Bevy plugin!</p> <p>In this post, I'll be doing a frame breakdown of how Solari works in Bevy 0.17, why I made certain choices, some of the challenges I faced, and some of the issues I've yet to solve.</p> <h2 id="why-raytracing-for-bevy">Why Raytracing for Bevy?<a class="zola-anchor" href="#why-raytracing-for-bevy" aria-label="Anchor link for: why-raytracing-for-bevy" style="visibility: hidden;"></a> </h2> <p>Before we start, I think it's fair to ask why an "indie" game engine needs high-end raytracing features that require an expensive graphics card. The answer comes from my own experience learning 3D graphics.</p> <p>Back when I was a teenager experimenting with small 3D games in Godot, I had a really hard time figuring out why my lighting looked so bad. Metallic objects didn't look reflective, scenes felt flat, and everything just looked wrong compared to the games I was playing.</p> <p>I didn't understand that I was missing indirect light, proper reflections, and accurate shadows - I had no idea I was supposed to bake lighting.</p> <p>This is the core problem that raytracing solves for indie developers. Even if not all players have hardware capable of running ray-traced effects, having a reference implementation of what lighting is <em>supposed</em> to look like is incredibly valuable.</p> <p>With fully dynamic global illumination, reflections, shadows, and direct lighting, developers can see how their scenes should be lit. Then they can work backwards to replicate those results with baked lighting, screen-space techniques, and other less performance-intensive approximations.</p> <p>Without that reference, it's really hard to know what you're missing or how to improve your lighting setup. Raytracing provides the ground truth that other techniques are trying to approximate.</p> <p>Additionally, hardware is advancing all the time.
Five years ago, raytracing was much less widespread than it is today. If you start developing a new game today with a 3-4 year lead time, raytracing is probably going to be even more common by the time you're ready to release it. Solari was in large part designed as a forward-looking rendering system.</p> <p>There's also the practical consideration that if Bevy ever wants to attract AAA game developers, we need these kinds of systems. Recent AAA games like <a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2025/content/SOUSA_SIGGRAPH_2025_Final.pdf">DOOM: The Dark Ages</a> and <a rel="nofollow noreferrer" href="https://intro-to-restir.cwyman.org/presentations/2023ReSTIR_Course_Cyberpunk_2077_Integration.pdf">Cyberpunk 2077</a> rely heavily on raytracing, and artists working on these types of projects expect their tools to support similar techniques.</p> <p>And honestly? It's just cool, and something I love working on :)</p> <h2 id="frame-breakdown">Frame Breakdown<a class="zola-anchor" href="#frame-breakdown" aria-label="Anchor link for: frame-breakdown" style="visibility: hidden;"></a> </h2> <p>In its initial release, Solari supports raytraced diffuse direct (DI) and indirect lighting (GI). Light can come from either <a rel="nofollow noreferrer" href="https://docs.rs/bevy/0.16.1/bevy/prelude/struct.StandardMaterial.html#structfield.emissive">emissive</a> triangle meshes, or analytic <a rel="nofollow noreferrer" href="https://docs.rs/bevy/0.16.1/bevy/pbr/struct.DirectionalLight.html">directional lights</a>. Everything is fully realtime and dynamic, with no baking required.</p> <p>Direct lighting is handled via ReSTIR DI, while indirect lighting is handled by a combination of ReSTIR GI and a world-space irradiance cache. Denoising is handled by DLSS Ray Reconstruction.</p> <p>As opposed to coarse screen-space probes, per-pixel ReSTIR brings much better detail, along with being <em>considerably</em> easier to get started with.
I had my first prototype working in a weekend.</p> <p>While I won't be covering ReSTIR from first principles (that could be its own entire blog post), <a rel="nofollow noreferrer" href="https://intro-to-restir.cwyman.org">A Gentle Introduction to ReSTIR: Path Reuse in Real-time</a> and <a rel="nofollow noreferrer" href="https://interplayoflight.wordpress.com/2023/12/17/a-gentler-introduction-to-restir">A gentler introduction to ReSTIR</a> are both really great resources. If you haven't played with ReSTIR before, I suggest giving them a skim before continuing with this post. Or continue anyways, and just admire the pretty pixels :)</p> <p>Onto the frame breakdown!</p> <h3 id="gbuffer-raster">GBuffer Raster<a class="zola-anchor" href="#gbuffer-raster" aria-label="Anchor link for: gbuffer-raster" style="visibility: hidden;"></a> </h3> <p>The first step of Solari is also the most boring: rasterize a standard GBuffer.</p> <p><figure> <img src="gbuffer_base_color.png" > <figcaption><p>Base color</p> </figcaption> </figure> <figure> <img src="gbuffer_normals.png" > <figcaption><p>Normals</p> </figcaption> </figure> <figure> <img src="gbuffer_position.png" > <figcaption><p>Position reconstructed from depth buffer</p> </figcaption> </figure> </p> <h4 id="why-raster">Why Raster?<a class="zola-anchor" href="#why-raster" aria-label="Anchor link for: why-raster" style="visibility: hidden;"></a> </h4> <p>The GBuffer pass remains completely unchanged from standard Bevy (it's the same plugin). This might seem like a missed opportunity - after all, I could have used raytracing for primary visibility instead of rasterization - but I decided to stick with rasterization here.</p> <p>By using raster for primary visibility, I maintain the option for people to use low-resolution proxy meshes in the raytracing scene, while still getting high quality meshes and textures in the primary view. 
The raster meshes can be full resolution with all their geometric detail, while the raytracing acceleration structure contains simplified versions that are cheaper to trace against.</p> <p>Rasterization also works better with other Bevy features like <a rel="nofollow noreferrer" href="https://jms55.github.io/tags/virtual-geometry">Virtual Geometry</a>.</p> <h4 id="attachments">Attachments<a class="zola-anchor" href="#attachments" aria-label="Anchor link for: attachments" style="visibility: hidden;"></a> </h4> <p>Bevy's GBuffer uses quite a bit of packing. The main attachment is a <code>Rgba32Uint</code> texture with each channel storing multiple values:</p> <ul> <li><strong>First channel</strong>: sRGB base color and perceptual roughness packed as 4x8unorm</li> <li><strong>Second channel</strong>: Emissive color stored as pre-exposed Rgb9e5</li> <li><strong>Third channel</strong>: Reflectance, metallic, baked diffuse occlusion (unused by Solari), and an unused slot, again packed as 4x8unorm</li> <li><strong>Fourth channel</strong>: World-space normal encoded into 24 bits via <a rel="nofollow noreferrer" href="https://www.jcgt.org/published/0003/02/01">octahedral encoding</a>, plus 8 bits of flags meant for Bevy's default deferred shading (unused by Solari)</li> </ul> <p>There's also a second <code>Rg16Float</code> attachment for motion vectors, and of course the depth attachment.</p> <h4 id="drawing">Drawing<a class="zola-anchor" href="#drawing" aria-label="Anchor link for: drawing" style="visibility: hidden;"></a> </h4> <p>The GBuffer rendering itself uses <code>multi_draw_indirect</code> to draw several meshes at once, using <a rel="nofollow noreferrer" href="https://crates.io/crates/offset-allocator">sub-allocated</a> buffers. Culling is done on the GPU using <a href="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/#culling-first-pass">two-pass occlusion culling</a> against a hierarchical depth buffer.
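As an aside, the 4x8unorm packing described in the attachments section above can be sketched like this (helper names are mine, not Bevy's - the engine does this in WGSL):

```rust
// Sketch of 4x8unorm packing: four [0, 1] floats quantized to 8 bits each
// and stored in one u32 (e.g. base color + perceptual roughness).
// Helper names are illustrative, not Bevy's actual API.
fn pack_4x8unorm(values: [f32; 4]) -> u32 {
    values.iter().enumerate().fold(0u32, |packed, (i, &v)| {
        // Quantize each value to 8 bits and place it in its byte lane.
        let byte = (v.clamp(0.0, 1.0) * 255.0).round() as u32;
        packed | (byte << (i * 8))
    })
}

fn unpack_4x8unorm(packed: u32) -> [f32; 4] {
    std::array::from_fn(|i| ((packed >> (i * 8)) & 0xFF) as f32 / 255.0)
}

fn main() {
    let packed = pack_4x8unorm([1.0, 0.5, 0.0, 0.25]);
    let unpacked = unpack_4x8unorm(packed);
    assert_eq!(unpacked[0], 1.0); // 255/255 round-trips exactly
    // Other values round-trip to within one 8-bit quantization step.
    assert!((unpacked[1] - 0.5).abs() <= 1.0 / 255.0);
}
```

Each channel trades precision for space: a full RGBA value plus roughness fits in a single `u32` lane of the `Rgba32Uint` attachment.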
Textures are handled bindlessly, and we try to minimize overall pipeline permutations.</p> <p>These combined techniques keep draw call overhead and per-pixel overdraw fairly low, even for complex scenes.</p> <h3 id="restir-di">ReSTIR DI<a class="zola-anchor" href="#restir-di" aria-label="Anchor link for: restir-di" style="visibility: hidden;"></a> </h3> <p>In order to calculate direct lighting (light emitted by a light source, bouncing off a surface, and then hitting the camera), for each pixel, we would need to loop over every light, and every point on those lights, and then calculate each light's contribution, as well as whether or not the light is visible.</p> <p>This is very expensive, so realtime applications tend to approximate it by averaging many individual light samples. If you choose those samples well, you can get an approximate result that's very close to the real thing, without tons of expensive calculations.</p> <p>To quickly estimate direct lighting, Solari uses a pretty standard ReSTIR DI setup.</p> <p>ReSTIR DI randomly selects points on lights, and then shares the random samples between pixels in order to choose the best light (most contribution to the image) for a given pixel.</p> <h4 id="di-structure">DI Structure<a class="zola-anchor" href="#di-structure" aria-label="Anchor link for: di-structure" style="visibility: hidden;"></a> </h4> <p>Reservoirs store the light sample, confidence weight, and unbiased contribution weight (acting as the sample's PDF).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">Reservoir </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">sample</span><span>: LightSample, </span><span> </span><span style="color:#268bd2;">confidence_weight</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span 
style="color:#268bd2;">unbiased_contribution_weight</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#657b83;">} </span></code></pre> <p>Direct lighting is handled in two compute dispatches. The first pass does initial and temporal resampling, while the second pass does spatial resampling and shading.</p> <h4 id="di-initial-resampling">DI Initial Resampling<a class="zola-anchor" href="#di-initial-resampling" aria-label="Anchor link for: di-initial-resampling" style="visibility: hidden;"></a> </h4> <p>Initial sampling uses 32 samples from a light tile (more on this later), and chooses the brightest one via resampling importance sampling (RIS), using constant MIS weights.</p> <p>32 samples per pixel is often overkill for scenes with a small number of lights. As this is one of the most expensive parts of Solari, I'm planning on letting users control this number in a future release.</p> <p>After choosing the best sample, we trace a ray to test visibility, setting the unbiased contribution weight to 0 in the case of occlusion.</p> <blockquote class="callout note no-title"> <div class="icon"> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="18" height="18"><path d="M12 22C6.47715 22 2 17.5228 2 12C2 6.47715 6.47715 2 12 2C17.5228 2 22 6.47715 22 12C22 17.5228 17.5228 22 12 22ZM12 20C16.4183 20 20 16.4183 20 12C20 7.58172 16.4183 4 12 4C7.58172 4 4 7.58172 4 12C4 16.4183 7.58172 20 12 20ZM11 7H13V9H11V7ZM11 11H13V17H11V11Z" fill="currentColor"></path></svg> </div> <div class="content"> <p>All raytracing in Solari is handled via inline ray queries. 
Wgpu does not yet support raytracing pipelines, so I haven't gotten a chance to play around with them.</p> </div> </blockquote> <p><figure> <img src="noisy_di_one_sample.png" > <figcaption><p>One candidate sample DI</p> </figcaption> </figure> <figure> <img src="noisy_di_32_samples.png" > <figcaption><p>32 candidate sample DI, one sample chosen via RIS</p> </figcaption> </figure> </p> <h4 id="di-temporal-resampling">DI Temporal Resampling<a class="zola-anchor" href="#di-temporal-resampling" aria-label="Anchor link for: di-temporal-resampling" style="visibility: hidden;"></a> </h4> <p>A temporal reservoir is then obtained via motion vectors and last frame's pixel data. We validate the reprojection using the <code>pixel_dissimilar</code> heuristic. We also need to check that the temporal light sample still exists in the current frame (i.e. the light has not been despawned).</p> <p>Additionally, the chosen light from last frame might no longer be visible this frame, e.g. if an object moved behind a wall. We could trace an additional ray here to validate visibility, but it's cheaper to just assume that the temporal light sample is still visible from the current pixel this frame.</p> <p>Reusing temporal visibility saves one raytrace, at the cost of shadows for moving objects being delayed by 1 frame, and some slightly darker/wider shadows. Overall the artifacts are not very noticeable, so I find that it's well worth reusing visibility for the temporal reservoir resampling.</p> <p>The initial and temporal reservoirs are then merged together using constant MIS weights. 
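</p> <p>As a concrete illustration of that merge, here's a CPU-side Rust sketch of weighted reservoir sampling with constant MIS weights. The names (<code>sample_id</code>, <code>target_fn</code>) are illustrative stand-ins, not Solari's actual shader API:</p>

```rust
// Sketch (not Solari's actual code) of merging two DI reservoirs with
// constant (1/2) MIS weights. `target_fn(sample_id)` evaluates the target
// function p_hat (the unshadowed light contribution) at the current pixel.
#[derive(Clone, Copy)]
struct Reservoir {
    sample_id: u32, // stand-in for the LightSample
    confidence_weight: f32,
    unbiased_contribution_weight: f32,
}

fn merge_reservoirs(
    canonical: Reservoir,
    other: Reservoir,
    target_fn: impl Fn(u32) -> f32,
    rand: f32, // uniform in [0, 1)
) -> Reservoir {
    // Constant MIS weight: each of the two inputs counts equally.
    let mis_weight = 0.5;

    // Resampling weight w_i = m_i * p_hat(x_i) * W_i for each input.
    let w_canonical =
        mis_weight * target_fn(canonical.sample_id) * canonical.unbiased_contribution_weight;
    let w_other = mis_weight * target_fn(other.sample_id) * other.unbiased_contribution_weight;
    let w_sum = w_canonical + w_other;

    // Keep one sample, chosen proportionally to its resampling weight.
    let chosen = if rand < w_other / w_sum.max(1e-8) { other } else { canonical };

    // New unbiased contribution weight: W = w_sum / p_hat(chosen sample).
    Reservoir {
        sample_id: chosen.sample_id,
        confidence_weight: canonical.confidence_weight + other.confidence_weight,
        unbiased_contribution_weight: w_sum / target_fn(chosen.sample_id).max(1e-8),
    }
}
```

<p>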
I tried using the balance heuristic, but didn't notice much difference for DI, and constant MIS weights are much cheaper.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Reject if tangent plane difference is more than 0.3% or angle between normals is more than 25 degrees </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">pixel_dissimilar</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">depth</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">other_world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">other_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">bool </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22699-fast-denoising-with-self-stabilizing-recurrent-blurs.pdf#page=45 </span><span> </span><span style="color:#268bd2;">let</span><span> tangent_plane_distance </span><span style="color:#657b83;">= </span><span style="color:#859900;">abs</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>normal, other_world_position </span><span style="color:#657b83;">-</span><span> world_position</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span 
style="color:#268bd2;">let</span><span> view_z </span><span style="color:#657b83;">= -</span><span style="color:#859900;">depth_ndc_to_view_z</span><span style="color:#657b83;">(</span><span>depth</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> tangent_plane_distance </span><span style="color:#657b83;">/</span><span> view_z </span><span style="color:#657b83;">&gt; </span><span style="color:#6c71c4;">0.003 </span><span style="color:#859900;">|| dot</span><span style="color:#657b83;">(</span><span>normal, other_normal</span><span style="color:#657b83;">) &lt; </span><span style="color:#6c71c4;">0.906</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <h4 id="di-spatial-resampling">DI Spatial Resampling<a class="zola-anchor" href="#di-spatial-resampling" aria-label="Anchor link for: di-spatial-resampling" style="visibility: hidden;"></a> </h4> <p>The second pass handles spatial resampling. We choose one random pixel within a 30 pixel-radius disk, and borrow its reservoir. We use the same <code>pixel_dissimilar</code> heuristic as the temporal pass to validate the spatial reservoir.</p> <p>We must also trace a ray to test visibility, as the reservoir comes from a neighboring pixel, and we cannot assume that the same light sample visible at the neighbor pixel is also visible for the current pixel.</p> <p>Unlike a lot of other ReSTIR implementations, we only ever use 1 spatial sample. Using more than 1 sample does not tend to improve quality, and increases performance costs. We cannot, however, skip spatial resampling entirely. Having a source of new samples from other pixels is crucial to prevent artifacts from temporal resampling.</p> <figure> <img src="spatial_baseline.jpg" > <figcaption><p>1 random spatial sample, 6.4 ms</p> </figcaption> </figure> <p>Spatial sampling is probably the least well-researched part of ReSTIR. 
I tried a couple of other schemes, including trying to reuse reservoirs across a workgroup/subgroup similar to <a rel="nofollow noreferrer" href="https://iribis.github.io/publication/2025_Stratified_Histogram_Resampling">Histogram Stratification for Spatio-Temporal Reservoir Sampling</a>, but none of them worked out well.</p> <p>Subgroup-level resampling was very cheap, but had tiling artifacts, and was not easily portable to different machines with different numbers of threads per workgroup.</p> <figure> <img src="spatial_subgroup.jpg" > <figcaption><p>Subgroup-level spatial resampling, 7.3 ms</p> </figcaption> </figure> <p>Workgroup-level resampling had much better quality, but was twice as expensive compared to 1 spatial sample, and introduced correlations that broke the denoiser.</p> <figure> <img src="spatial_workgroup.jpg" > <figcaption><p>Workgroup-level spatial resampling, 12 ms</p> </figcaption> </figure> <p>In the end, I stuck with the 1 random spatial sample I described above.</p> <p>The reservoir produced by the first pass and the spatial reservoir are combined with the same routine that we used for merging initial and temporal reservoirs.</p> <h4 id="di-shading">DI Shading<a class="zola-anchor" href="#di-shading" aria-label="Anchor link for: di-shading" style="visibility: hidden;"></a> </h4> <p>Once the final reservoir is produced, we can use its chosen light sample to shade the pixel, producing the final direct lighting.</p> <p>I did try out shading the pixel using all 3 samples (initial, temporal, and spatial), weighted by their resampling probabilities as <a rel="nofollow noreferrer" href="https://cwyman.org/papers/hpg21_rearchitectingReSTIR.pdf">Rearchitecting Spatiotemporal Resampling for Production</a> suggests, but had noisier results compared to shading using the final reservoir only. 
I'm not sure if I messed up the implementation or what.</p> <p>Overall the DI pass uses two raytraces per pixel (1 initial, 1 spatial).</p> <figure> <img src="noisy_di.png" > <figcaption><p>DI with 32 initial candidates, 1 temporal resample, and 1 spatial resample</p> </figcaption> </figure> <h3 id="restir-gi">ReSTIR GI<a class="zola-anchor" href="#restir-gi" aria-label="Anchor link for: restir-gi" style="visibility: hidden;"></a> </h3> <p>Indirect lighting (light emitted by a light source, bouncing off more than 1 surface, and then hitting the camera) is even more expensive to calculate than direct lighting, as you need to trace multiple bounces of each ray to calculate the lighting for a given path.</p> <p>To quickly estimate indirect lighting, Solari uses ReSTIR GI, with a very similar setup to the previous ReSTIR DI.</p> <p>Whereas ReSTIR DI picks the best light, ReSTIR GI randomly selects directions in a hemisphere, and then shares the random samples between pixels in order to choose the best 1-bounce <em>path</em> for a given pixel.</p> <h4 id="gi-structure">GI Structure<a class="zola-anchor" href="#gi-structure" aria-label="Anchor link for: gi-structure" style="visibility: hidden;"></a> </h4> <p>Reservoirs store the cached radiance bouncing off of the sample point, sample point geometry info, confidence weight, and unbiased contribution weight.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">Reservoir </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">radiance</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">sample_point_world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span 
style="color:#268bd2;">sample_point_world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">confidence_weight</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#268bd2;">unbiased_contribution_weight</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#657b83;">} </span></code></pre> <p>I tried some basic packing schemes for the GI reservoir (Rgb9e5 radiance, octahedral-encoded normals), but didn't find that it meaningfully reduced GI costs. Reservoir memory bandwidth is not a big bottleneck compared to raytracing and reading mesh/texture data for ray intersections.</p> <p>I have heard that people had good results storing reservoirs as struct-of-arrays instead of array-of-structs, so I'll likely revist this topic at some point.</p> <p>ReSTIR GI again uses two compute dispatches, with the first pass doing initial and temporal resampling, and the second pass doing spatial resampling and shading.</p> <h4 id="gi-initial-sampling">GI Initial Sampling<a class="zola-anchor" href="#gi-initial-sampling" aria-label="Anchor link for: gi-initial-sampling" style="visibility: hidden;"></a> </h4> <p>GI samples are much more expensive to generate than DI samples (tracing paths is more expensive than looping over a list of light sources), so for initial sampling, we only generate 1 sample.</p> <p>We start by tracing a ray along a random direction chosen from a uniform hemisphere distribution. At some point I also want to try using <a rel="nofollow noreferrer" href="https://github.com/electronicarts/fastnoise">spatiotemporal blue noise</a>. 
Although DLSS-RR recommends white noise, the docs do say that blue noise with a sufficiently long period can also work.</p> <p>At the ray's hit point, we need to obtain an estimate of the incoming irradiance, which becomes the outgoing radiance towards the current pixel, i.e. the path's contribution.</p> <figure> <img src="noisy_gi_one_sample.png" > <figcaption><p>One sample GI</p> </figcaption> </figure> <p>To obtain irradiance, we query the world cache at the hit point (more on this later).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">generate_initial_reservoir</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">rng</span><span>: ptr&lt;function, </span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; Reservoir </span><span style="color:#657b83;">{ </span><span> var reservoir </span><span style="color:#657b83;">= </span><span style="color:#859900;">empty_reservoir</span><span style="color:#657b83;">()</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> ray_direction </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_uniform_hemisphere</span><span style="color:#657b83;">(</span><span>world_normal, rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> ray_hit </span><span style="color:#657b83;">= </span><span style="color:#859900;">trace_ray</span><span style="color:#657b83;">(</span><span>world_position, ray_direction, </span><span 
style="color:#cb4b16;">RAY_T_MIN</span><span>, </span><span style="color:#cb4b16;">RAY_T_MAX</span><span>, </span><span style="color:#cb4b16;">RAY_FLAG_NONE</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">if</span><span> ray_hit.kind </span><span style="color:#657b83;">== </span><span style="color:#cb4b16;">RAY_QUERY_INTERSECTION_NONE </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span> reservoir; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> sample_point </span><span style="color:#657b83;">= </span><span style="color:#859900;">resolve_ray_hit_full</span><span style="color:#657b83;">(</span><span>ray_hit</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Direct lighting is handled by ReSTIR DI </span><span> </span><span style="color:#859900;">if all</span><span style="color:#657b83;">(</span><span>sample_point.material.emissive </span><span style="color:#657b83;">!= </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)) { </span><span> </span><span style="color:#859900;">return</span><span> reservoir; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> reservoir.unbiased_contribution_weight </span><span style="color:#657b83;">= </span><span style="color:#859900;">uniform_hemisphere_inverse_pdf</span><span style="color:#657b83;">()</span><span>; </span><span> reservoir.sample_point_world_position </span><span style="color:#657b83;">=</span><span> sample_point.world_position; </span><span> reservoir.sample_point_world_normal </span><span style="color:#657b83;">=</span><span> sample_point.world_normal; </span><span> reservoir.confidence_weight </span><span 
style="color:#657b83;">= </span><span style="color:#6c71c4;">1.0</span><span>; </span><span> </span><span> reservoir.radiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">query_world_cache</span><span style="color:#657b83;">(</span><span>sample_point.world_position, sample_point.geometric_world_normal, view.world_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> sample_point_diffuse_brdf </span><span style="color:#657b83;">=</span><span> sample_point.material.base_color </span><span style="color:#657b83;">/ </span><span style="color:#cb4b16;">PI</span><span>; </span><span> reservoir.radiance </span><span style="color:#657b83;">*=</span><span> sample_point_diffuse_brdf; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> reservoir; </span><span style="color:#657b83;">} </span></code></pre> <h4 id="gi-temporal-and-spatial-resampling">GI Temporal and Spatial Resampling<a class="zola-anchor" href="#gi-temporal-and-spatial-resampling" aria-label="Anchor link for: gi-temporal-and-spatial-resampling" style="visibility: hidden;"></a> </h4> <p>Temporal reservoir selection for GI is a little different from DI.</p> <p>In addition to reprojecting based on motion vectors, we jitter the reprojected location by a few pixels in either direction using <a rel="nofollow noreferrer" href="https://www.amazon.com/GPU-Zen-Advanced-Rendering-Techniques/dp/B0DNXNM14K">permutation sampling</a>. 
This essentially adds a small spatial component to the temporal resampling, which helps break up temporal correlations.</p> <figure> <img src="no_permutation_sampling.png" > <figcaption><p>No permutation sampling: The denoiser (DLSS-RR) produces blotchy noise</p> </figcaption> </figure> <p>I also tried permutation sampling for ReSTIR DI, and while it did reduce correlation artifacts, it also added even worse artifacts because we reuse visibility, which becomes very obvious under permutation sampling. Tracing an extra ray to validate visibility would fix this, but I'm not quite ready to pay that performance cost.</p> <figure> <img src="di_permutation_sampling.png" > <figcaption><p>DI: Permutation sampling and visibility reuse do not work well together</p> </figcaption> </figure> <p>Spatial reservoir selection for GI is identical to DI.</p> <p>Reservoir merging for GI uses the balance heuristic for MIS weights, and includes the BRDF contribution, as I found that unlike for DI, these make a significant quality difference. The balance heuristic is not much more expensive here, as we are only ever merging two reservoirs at a time.</p> <h4 id="gi-jacobian">GI Jacobian<a class="zola-anchor" href="#gi-jacobian" aria-label="Anchor link for: gi-jacobian" style="visibility: hidden;"></a> </h4> <p>Additionally, since both temporal and spatial resampling use neighboring pixels, we need to add a Jacobian determinant to the MIS weights to account for the change in sampling domain.</p> <p>The Jacobian proved to be the absolute hardest part of ReSTIR GI for me. While it makes the GI more correct, it also adds a lot of noise in corners. 
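</p> <p>To make the Jacobian concrete: when a neighbor's sample point is reconnected to the current pixel, the solid angle it subtends (and thus its effective PDF) changes, and the Jacobian is the ratio of the two. Below is a Rust sketch of the standard ReSTIR GI reconnection Jacobian; the function and parameter names are illustrative, not Solari's exact shader code:</p>

```rust
// Simple 3-component dot product over fixed-size arrays.
fn dot(a: [f32; 3], b: [f32; 3]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

// Normalized direction from `from` to `to`, plus the squared distance.
fn dir_and_dist2(from: [f32; 3], to: [f32; 3]) -> ([f32; 3], f32) {
    let d = [to[0] - from[0], to[1] - from[1], to[2] - from[2]];
    let dist2 = dot(d, d);
    let len = dist2.sqrt();
    ([d[0] / len, d[1] / len, d[2] / len], dist2)
}

// Reconnection Jacobian: ratio of solid angles subtended by the new vs. old
// shading points, as seen from the reused sample point.
// J = (|cos at sample point toward new origin| / |cos toward old origin|)
//   * (old distance^2 / new distance^2)
fn reconnection_jacobian(
    new_origin: [f32; 3],
    old_origin: [f32; 3],
    sample_point: [f32; 3],
    sample_normal: [f32; 3],
) -> f32 {
    let (new_dir, new_dist2) = dir_and_dist2(sample_point, new_origin);
    let (old_dir, old_dist2) = dir_and_dist2(sample_point, old_origin);
    let cos_new = dot(sample_normal, new_dir).abs();
    let cos_old = dot(sample_normal, old_dir).abs().max(1e-8);
    (cos_new / cos_old) * (old_dist2 / new_dist2.max(1e-8))
}
```

<p>A pixel much closer to the sample point than the original neighbor gets a Jacobian well above 1, which is exactly where the trouble starts.</p> <p>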
Worse, the Jacobian tends to make the GI calculations "explode" into super high numbers that result in overflow to <code>inf</code>, which then spreads over the entire screen due to resampling and denoising.</p> <p>The best solution I have found to reduce artifacts from the Jacobian is to reject neighbor samples when the Jacobian is greater than 2 (i.e., a neighboring sample reused at the current pixel would have more than 2x the contribution it originally did). While this somewhat works, there are still issues with stability. If I leave Solari running for a couple of minutes in the same spot, it will eventually lead to overflow. I haven't yet figured out how to prevent this.</p> <p>Using the balance heuristic (and factoring in the two Jacobians) for MIS weights when resampling also helped a lot with fighting the noise introduced by the Jacobian.</p> <h4 id="gi-shading">GI Shading<a class="zola-anchor" href="#gi-shading" aria-label="Anchor link for: gi-shading" style="visibility: hidden;"></a> </h4> <p>Once the final reservoir is produced, we can use it to shade the pixel, producing the final indirect lighting.</p> <p>Since we're using DLSS-RR for denoising, we can simply add the GI on top of the existing framebuffer (holding the DI). There's no need to write to a separate buffer for use with a separate denoising process, unlike a lot of other GI implementations.</p> <p>Overall the GI pass uses two raytraces per pixel (1 initial, 1 spatial), same as DI.</p> <figure> <img src="noisy_gi.png" > <figcaption><p>GI with 1 initial candidate, 1 temporal resample, and 1 spatial resample</p> </figcaption> </figure> <h3 id="interlude-what-is-restir-doing">Interlude: What is ReSTIR Doing?<a class="zola-anchor" href="#interlude-what-is-restir-doing" aria-label="Anchor link for: interlude-what-is-restir-doing" style="visibility: hidden;"></a> </h3> <p>I have heard ReSTIR described as a signal <em>amplifier</em>. 
If you feed it decent samples, it's likely to produce a good sample. If you feed it good samples, it's likely to produce a great sample.</p> <p>The better your initial sampling, the better ReSTIR does. The quality of your final result heavily depends on the quality of the initial samples you feed into it.</p> <p>For this reason, it's important that you spend time improving the initial sampling process. This could take the form of generating more initial samples, or improving your sampling strategy.</p> <p>For ReSTIR DI, taking more initial samples is viable, as samples are just random lights, and are fairly cheap to generate.</p> <p>For ReSTIR GI, even 1 initial sample is already expensive, as each sample involves tracing a ray. Instead of increasing initial sample count, we'll have to be smart about <em>how</em> we obtain that 1 sample.</p> <p>In the next two sections of the frame breakdown, we will discuss how I improved initial sampling for ReSTIR DI and GI.</p> <h3 id="light-tile-presampling">Light Tile Presampling<a class="zola-anchor" href="#light-tile-presampling" aria-label="Anchor link for: light-tile-presampling" style="visibility: hidden;"></a> </h3> <p>While generating initial samples for ReSTIR DI is fairly cheap, when we start taking 32 or more samples per pixel, the memory bandwidth costs quickly add up. 
In order to make 32 samples per pixel viable, we'll need a way to improve our cache coherency.</p> <p>In this section, we will generate some light tile buffers, following section 5 of <a rel="nofollow noreferrer" href="https://cwyman.org/papers/hpg21_rearchitectingReSTIR.pdf">Rearchitecting Spatiotemporal Resampling for Production</a>.</p> <h4 id="light-sampling-apis">Light Sampling APIs<a class="zola-anchor" href="#light-sampling-apis" aria-label="Anchor link for: light-sampling-apis" style="visibility: hidden;"></a> </h4> <p>Before I can explain light tiles, we first need to talk about Solari's shader API for working with light sources.</p> <p>Bevy stores light sources as a big list of objects on the GPU. All emissive meshes and directional lights get collected by the CPU, and put in this list.</p> <p>When calculating radiance emitted by a light source, Bevy works with specific light <em>samples</em> - not the whole light at once. A <code>LightSample</code> uniquely identifies a specific subset of the light source, e.g. 
a specific point on an emissive mesh.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">LightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">light_id</span><span>: </span><span style="color:#268bd2;">u16</span><span>, </span><span> </span><span style="color:#268bd2;">triangle_id</span><span>: </span><span style="color:#268bd2;">u16</span><span>, </span><span style="color:#586e75;">// Unused for directional lights </span><span> </span><span style="color:#268bd2;">seed</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">generate_random_light_sample</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">rng</span><span>: ptr&lt;function, </span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; LightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> light_count </span><span style="color:#657b83;">=</span><span> arrayLength</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>light_sources</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> light_id </span><span style="color:#657b83;">= </span><span style="color:#859900;">rand_range_u</span><span style="color:#657b83;">(</span><span>light_count, rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> light_source </span><span style="color:#657b83;">=</span><span> light_sources</span><span style="color:#657b83;">[</span><span>light_id</span><span 
style="color:#657b83;">]</span><span>; </span><span> </span><span> var triangle_id </span><span style="color:#657b83;">=</span><span> 0u; </span><span> </span><span style="color:#859900;">if</span><span> light_source.kind </span><span style="color:#657b83;">!= </span><span style="color:#cb4b16;">LIGHT_SOURCE_KIND_DIRECTIONAL </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_count </span><span style="color:#657b83;">=</span><span> light_source.kind </span><span style="color:#657b83;">&gt;&gt;</span><span> 1u; </span><span> triangle_id </span><span style="color:#657b83;">= </span><span style="color:#859900;">rand_range_u</span><span style="color:#657b83;">(</span><span>triangle_count, rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> seed </span><span style="color:#657b83;">= </span><span style="color:#859900;">rand_u</span><span style="color:#657b83;">(</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> LightSample</span><span style="color:#657b83;">(</span><span>light_id, triangle_id, seed</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>The light ID points to the overall light source object in the big list of lights.</p> <p>The seed is used to initialize a random number generator (RNG). For directional lights, the RNG is used to choose a direction within a cone. 
For emissive meshes, the RNG is used to choose a specific point on the triangle identified by the triangle ID.</p> <p>A <code>LightSample</code> can be resolved, giving some info on its properties:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">ResolvedLightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">world_position</span><span>: vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#586e75;">// w component is 0.0 for directional lights, and 1.0 for emissive meshes </span><span> </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">emitted_radiance</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">inverse_pdf</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">resolve_light_sample</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">light_sample</span><span>: LightSample, </span><span style="color:#268bd2;">light_source</span><span>: LightSource</span><span style="color:#657b83;">) </span><span>-&gt; ResolvedLightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">if</span><span> light_source.kind </span><span style="color:#657b83;">== </span><span style="color:#cb4b16;">LIGHT_SOURCE_KIND_DIRECTIONAL </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> directional_light </span><span style="color:#657b83;">=</span><span> directional_lights</span><span 
style="color:#657b83;">[</span><span>light_source.id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> direction_to_light </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_cone</span><span style="color:#657b83;">(</span><span>directional_light</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSample</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>direction_to_light, </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>, </span><span> </span><span style="color:#657b83;">-</span><span>direction_to_light, </span><span> directional_light.luminance, </span><span> directional_light.inverse_pdf, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_count </span><span style="color:#657b83;">=</span><span> light_source.kind </span><span style="color:#657b83;">&gt;&gt;</span><span> 1u; </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_id </span><span style="color:#657b83;">=</span><span> light_sample.light_id </span><span style="color:#859900;">&amp;</span><span> 0xFFFFu; </span><span> </span><span style="color:#268bd2;">let</span><span> barycentrics </span><span style="color:#657b83;">= </span><span style="color:#859900;">triangle_barycentrics</span><span style="color:#657b83;">(</span><span>light_sample.seed</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Interpolates and transforms vertex positions, UVs, etc, and samples material 
textures </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_data </span><span style="color:#657b83;">= </span><span style="color:#859900;">resolve_triangle_data_full</span><span style="color:#657b83;">(</span><span>light_source.id, triangle_id, barycentrics</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSample</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>triangle_data.world_position, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>, </span><span> triangle_data.world_normal, </span><span> triangle_data.material.emissive.rgb, </span><span> </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>triangle_count</span><span style="color:#657b83;">) *</span><span> triangle_data.triangle_area, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <p>And finally a <code>ResolvedLightSample</code> can be used to calculate the received radiance at a point from the light sample, also known as the unshadowed light contribution:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">LightContribution </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">received_radiance</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">inverse_pdf</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#268bd2;">wi</span><span>: vec3&lt;</span><span 
style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">calculate_resolved_light_contribution</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">resolved_light_sample</span><span>: ResolvedLightSample, </span><span style="color:#268bd2;">ray_origin</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">origin_world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; LightContribution </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> ray </span><span style="color:#657b83;">=</span><span> resolved_light_sample.world_position.xyz </span><span style="color:#657b83;">- (</span><span>resolved_light_sample.world_position.w </span><span style="color:#657b83;">*</span><span> ray_origin</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> light_distance </span><span style="color:#657b83;">= </span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>ray</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> wi </span><span style="color:#657b83;">=</span><span> ray </span><span style="color:#657b83;">/</span><span> light_distance; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> cos_theta_origin </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>wi, origin_world_normal</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> 
cos_theta_light </span><span style="color:#657b83;">= </span><span style="color:#859900;">saturate</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(-</span><span>wi, resolved_light_sample.world_normal</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> light_distance_squared </span><span style="color:#657b83;">=</span><span> light_distance </span><span style="color:#657b83;">*</span><span> light_distance; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> received_radiance </span><span style="color:#657b83;">=</span><span> resolved_light_sample.emitted_radiance </span><span style="color:#657b83;">*</span><span> cos_theta_origin </span><span style="color:#657b83;">* (</span><span>cos_theta_light </span><span style="color:#657b83;">/</span><span> light_distance_squared</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> LightContribution</span><span style="color:#657b83;">(</span><span>received_radiance, resolved_light_sample.inverse_pdf, wi</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Notably, only the first and second steps (generating a <code>LightSample</code>, resolving it into a <code>ResolvedLightSample</code>) involve branching based on the type of light (directional or emissive). 
Calculating the light contribution involves no branching.</p> <h4 id="presampling-lights">Presampling Lights<a class="zola-anchor" href="#presampling-lights" aria-label="Anchor link for: presampling-lights" style="visibility: hidden;"></a> </h4> <p>The straightforward way to implement ReSTIR DI initial sampling is to perform the whole light sampling process (generate -&gt; resolve -&gt; calculate contribution) all in one shader.</p> <p>Indeed, for my first ReSTIR DI prototype, this is what I did - but performance was terrible.</p> <p>By generating the light sample, resolving it, and then calculating its contribution all in the same shader, we introduce a lot of divergent branches and incoherent memory accesses. If there's one thing GPUs hate, it's divergence. GPUs perform best when all threads in a group execute the same branch without masking, and when the threads access similar memory locations that are likely in a nearby cache.</p> <p>Instead, we can separate out the steps. Generating a bunch of random light samples and resolving them can be performed ahead of time by a separate shader.
We can then pack the resolved samples and store them in a buffer.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">pack_resolved_light_sample</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">sample</span><span>: ResolvedLightSample</span><span style="color:#657b83;">) </span><span>-&gt; ResolvedLightSamplePacked </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSamplePacked</span><span style="color:#657b83;">( </span><span> sample.world_position.x, </span><span> sample.world_position.y, </span><span> sample.world_position.z, </span><span> </span><span style="color:#859900;">pack2x16unorm</span><span style="color:#657b83;">(</span><span style="color:#859900;">octahedral_encode</span><span style="color:#657b83;">(</span><span>sample.world_normal</span><span style="color:#657b83;">))</span><span>, </span><span> </span><span style="color:#859900;">vec3_to_rgb9e5_</span><span style="color:#657b83;">(</span><span>sample.radiance </span><span style="color:#657b83;">*</span><span> view.exposure</span><span style="color:#657b83;">)</span><span>, </span><span> sample.inverse_pdf </span><span style="color:#657b83;">* </span><span style="color:#859900;">select</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1.0</span><span>, </span><span style="color:#657b83;">-</span><span style="color:#6c71c4;">1.0</span><span>, sample.world_position.w </span><span style="color:#657b83;">== </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">unpack_resolved_light_sample</span><span 
style="color:#657b83;">(</span><span style="color:#268bd2;">packed</span><span>: ResolvedLightSamplePacked, </span><span style="color:#268bd2;">exposure</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; ResolvedLightSample </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span> ResolvedLightSample</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>packed.world_position_x, packed.world_position_y, packed.world_position_z, </span><span style="color:#859900;">select</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1.0</span><span>, </span><span style="color:#6c71c4;">0.0</span><span>, packed.inverse_pdf </span><span style="color:#657b83;">&lt; </span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">))</span><span>, </span><span> </span><span style="color:#859900;">octahedral_decode</span><span style="color:#657b83;">(</span><span style="color:#859900;">unpack2x16unorm</span><span style="color:#657b83;">(</span><span>packed.world_normal</span><span style="color:#657b83;">))</span><span>, </span><span> </span><span style="color:#859900;">rgb9e5_to_vec3_</span><span style="color:#657b83;">(</span><span>packed.radiance</span><span style="color:#657b83;">) /</span><span> exposure, </span><span> </span><span style="color:#859900;">abs</span><span style="color:#657b83;">(</span><span>packed.inverse_pdf</span><span style="color:#657b83;">)</span><span>, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>We call these presampled sets of lights "light tiles". 
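To make the indexing concrete, here's a CPU-side Rust sketch of how initial sampling might pick from the presampled light tiles - 128 tiles of 1024 packed samples each. The `PackedSample` type and the xorshift RNG are illustrative stand-ins, not Solari's actual code:

```rust
// Sketch of indexing into presampled light tiles. Constants match the
// setup in the text (128 tiles of 1024 samples each); `PackedSample`
// and the RNG are stand-ins for ResolvedLightSamplePacked and the
// shader's RNG, not Solari's actual types.
const TILE_COUNT: u32 = 128;
const SAMPLES_PER_TILE: u32 = 1024;

#[derive(Clone, Copy, Default)]
struct PackedSample; // stand-in for ResolvedLightSamplePacked

// Minimal xorshift RNG so the sketch is self-contained.
fn xorshift(state: &mut u32) -> u32 {
    *state ^= *state << 13;
    *state ^= *state >> 17;
    *state ^= *state << 5;
    *state
}

/// All threads in a workgroup would share the same tile (coherent reads);
/// each thread then picks its own random samples within that tile.
fn pick_samples(
    light_tiles: &[PackedSample],
    workgroup_seed: &mut u32,
    thread_seed: &mut u32,
    n: u32,
) -> Vec<PackedSample> {
    let tile = xorshift(workgroup_seed) % TILE_COUNT;
    let tile_base = (tile * SAMPLES_PER_TILE) as usize;
    (0..n)
        .map(|_| {
            let i = (xorshift(thread_seed) % SAMPLES_PER_TILE) as usize;
            light_tiles[tile_base + i]
        })
        .collect()
}
```

Sharing one tile per workgroup is what buys the cache coherency: neighboring threads read from the same small slice of the buffer instead of scattering across all lights.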
Following the paper, we perform a compute dispatch to generate a fixed set of 128 tiles (these are not screen-space tiles), each with 1024 samples (<code>ResolvedLightSamplePacked</code>).</p> <p>Samples are generated completely randomly, without any information about the scene - there is no spatial heuristic or any way of identifying "good" samples.</p> <p>ReSTIR DI initial sampling can now pick a random tile, and then random samples within the tile, and use <code>calculate_resolved_light_contribution()</code> to calculate their radiance.</p> <p>With light tiles, we have much higher cache hit rates when sampling lights, which greatly improves our performance. In fact, light sampling - not the actual raytracing - is by far the biggest performance bottleneck in Solari.</p> <h3 id="world-cache">World Cache<a class="zola-anchor" href="#world-cache" aria-label="Anchor link for: world-cache" style="visibility: hidden;"></a> </h3> <p>With light tiles accelerating initial sampling for ReSTIR DI, it's time to talk about how we accelerate initial sampling for ReSTIR GI.</p> <p>Unlike DI, where generating more samples is relatively cheap, for GI we can only afford a single sample. However, unlike DI, GI is a lot more forgiving of inaccuracies.
GI just has to be "mostly correct".</p> <p>We can take advantage of that fact by sharing the same work amongst multiple pixels via a world-space irradiance cache.</p> <p><img src="https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/world_cache_close.png" alt="world_cache_close" /></p> <p>The world cache voxelizes the world, storing accumulated irradiance (light hitting the surface) at each voxel.</p> <p>When sampling indirect lighting in ReSTIR GI, rather than having to trace additional rays towards light sources to estimate the irradiance, we can simply look up the irradiance at the given voxel.</p> <p>The world cache both amortizes the cost of the GI pass and reduces variance, especially for newly-disoccluded pixels for which the screen-space ReSTIR GI has no temporal history.</p> <p>Adding the world cache both significantly improved quality and halved the time spent on the initial GI sampling.</p> <h4 id="cache-querying">Cache Querying<a class="zola-anchor" href="#cache-querying" aria-label="Anchor link for: cache-querying" style="visibility: hidden;"></a> </h4> <p>The world cache uses <a rel="nofollow noreferrer" href="https://arxiv.org/pdf/1902.05942v1">spatial hashing</a> to discretize the world. Unlike other options such as <a rel="nofollow noreferrer" href="https://github.com/EmbarkStudios/kajiya/blob/main/docs/gi-overview.md#irradiance-cache-055ms">clipmaps</a>, <a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2022/SIGGRAPH2022-Advances-Lumen-Wright%20et%20al.pdf#page=59">cards</a>, or <a rel="nofollow noreferrer" href="https://gpuopen.com/download/GDC2024_GI_with_AMD_FidelityFX_Brixelizer.pdf">bricks</a>, spatial hashing requires no explicit build step, and automatically adapts to scene geometry while having minimal light leaks.</p> <p>With spatial hashing, a given descriptor (e.g. <code>{position, normal}</code>) hashes to a <code>u32</code> key.
This key corresponds to an index within a fixed-size buffer, which holds whatever values you want to store in the hashmap - in our case, irradiance.</p> <p>Either the entry you're querying matches an existing entry (same checksum) and you can return its value, or the entry does not exist (empty checksum) and you can initialize it by writing your checksum. If the stored checksum matches neither (a hash collision), you probe a different slot instead.</p> <p>The checksum is the same descriptor, hashed to a different key via a different hash function, and is used to detect hash collisions.</p> <p>The <code>query_world_cache()</code> function below is what ReSTIR GI uses to look up irradiance at ray hit points.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">query_world_cache</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">view_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> cell_size </span><span style="color:#657b83;">= </span><span style="color:#859900;">get_cell_size</span><span style="color:#657b83;">(</span><span>world_position, view_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> world_position_quantized </span><span style="color:#657b83;">= </span><span>bitcast&lt;vec3&lt;</span><span
style="color:#268bd2;">u32</span><span>&gt;&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">quantize_position</span><span style="color:#657b83;">(</span><span>world_position, cell_size</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> world_normal_quantized </span><span style="color:#657b83;">= </span><span>bitcast&lt;vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">quantize_normal</span><span style="color:#657b83;">(</span><span>world_normal</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span> var key </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_key</span><span style="color:#657b83;">(</span><span>world_position_quantized, world_normal_quantized</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> checksum </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_checksum</span><span style="color:#657b83;">(</span><span>world_position_quantized, world_normal_quantized</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var i </span><span style="color:#657b83;">=</span><span> 0u; i </span><span style="color:#657b83;">&lt; </span><span style="color:#cb4b16;">WORLD_CACHE_MAX_SEARCH_STEPS</span><span>; i</span><span style="color:#657b83;">++) { </span><span> </span><span style="color:#268bd2;">let</span><span> existing_checksum </span><span style="color:#657b83;">=</span><span> atomicCompareExchangeWeak</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_checksums</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, 
</span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL</span><span>, checksum</span><span style="color:#657b83;">)</span><span>.old_value; </span><span> </span><span style="color:#859900;">if</span><span> existing_checksum </span><span style="color:#657b83;">==</span><span> checksum </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Cache entry already exists - get irradiance and reset cell lifetime </span><span> atomicStore</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_life</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, </span><span style="color:#cb4b16;">WORLD_CACHE_CELL_LIFETIME</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return</span><span> world_cache_irradiance</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.rgb; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else if</span><span> existing_checksum </span><span style="color:#657b83;">== </span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Cell is empty - reset cell lifetime so that it starts getting updated next frame </span><span> atomicStore</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>world_cache_life</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>, </span><span style="color:#cb4b16;">WORLD_CACHE_CELL_LIFETIME</span><span style="color:#657b83;">)</span><span>; </span><span> world_cache_geometry_data</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.world_position </span><span style="color:#657b83;">=</span><span> world_position; </span><span> 
world_cache_geometry_data</span><span style="color:#657b83;">[</span><span>key</span><span style="color:#657b83;">]</span><span>.world_normal </span><span style="color:#657b83;">=</span><span> world_normal; </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Collision - jump to another entry </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">wrap_key</span><span style="color:#657b83;">(</span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#859900;">return vec3</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>In Solari, the descriptor is a combination of the <code>world_position</code> of the query point, the <code>geometric_world_normal</code> (shading normal is too detailed) of the query point, and a LOD factor that's used to reduce cell count for far-away query points.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">quantize_position</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span 
style="color:#268bd2;">quantization_factor</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return floor</span><span style="color:#657b83;">(</span><span>world_position </span><span style="color:#657b83;">/</span><span> quantization_factor </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">0.0001</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">quantize_normal</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return floor</span><span style="color:#657b83;">(</span><span>world_normal </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">0.0001</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">compute_key</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">u32 </span><span style="color:#657b83;">{ </span><span> var key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span 
style="color:#657b83;">(</span><span>world_position.x</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_position.y</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_position.z</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.x</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.y</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">pcg_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return wrap_key</span><span style="color:#657b83;">(</span><span>key</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">compute_checksum</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span 
style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">u32 </span><span style="color:#657b83;">{ </span><span> var key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>world_position.x</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_position.y</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_position.z</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.x</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.y</span><span style="color:#657b83;">)</span><span>; </span><span> key </span><span style="color:#657b83;">= </span><span style="color:#859900;">iqint_hash</span><span style="color:#657b83;">(</span><span>key </span><span style="color:#657b83;">+</span><span> world_normal.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">return</span><span> key; </span><span style="color:#657b83;">} </span></code></pre> <figure> <img 
src="world_cache_far.png" > <figcaption><p>World cache from further away, showing LOD</p> </figcaption> </figure> <h4 id="cache-decay">Cache Decay<a class="zola-anchor" href="#cache-decay" aria-label="Anchor link for: cache-decay" style="visibility: hidden;"></a> </h4> <p>In order to maintain the world cache, we need a series of passes to decay and update active entries.</p> <p>The first compute dispatch checks every entry in the hashmap, decaying its "life" count by 1. Each entry's life is initialized when the entry is created, and is reset when queried.</p> <p>When an entry reaches 0 life, we clear out the entry, freeing up space for future voxels to use.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1024</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">decay_world_cache</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">global_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">global_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) { </span><span> var life </span><span style="color:#657b83;">=</span><span> world_cache_life</span><span style="color:#657b83;">[</span><span>global_id.x</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#859900;">if</span><span> life </span><span style="color:#657b83;">&gt;</span><span> 0u </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// 
Decay and write new life </span><span> life </span><span style="color:#657b83;">-=</span><span> 1u; </span><span> world_cache_life</span><span style="color:#657b83;">[</span><span>global_id.x</span><span style="color:#657b83;">] =</span><span> life; </span><span> </span><span> </span><span style="color:#586e75;">// Clear cells that become dead </span><span> </span><span style="color:#859900;">if</span><span> life </span><span style="color:#657b83;">==</span><span> 0u </span><span style="color:#657b83;">{ </span><span> world_cache_checksums</span><span style="color:#657b83;">[</span><span>global_id.x</span><span style="color:#657b83;">] = </span><span style="color:#cb4b16;">WORLD_CACHE_EMPTY_CELL</span><span>; </span><span> world_cache_irradiance</span><span style="color:#657b83;">[</span><span>global_id.x</span><span style="color:#657b83;">] = </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0.0</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h4 id="cache-compact">Cache Compact<a class="zola-anchor" href="#cache-compact" aria-label="Anchor link for: cache-compact" style="visibility: hidden;"></a> </h4> <p>The next three dispatches compact and count the total number of active entries in the world cache. 
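For intuition, here's a sequential Rust sketch of what the compaction computes (the real version is a parallel prefix sum running on the GPU; buffer names are illustrative):

```rust
// Sequential sketch of the compaction step: gather the indices of all
// live cells into a dense array and count them. On the GPU this is done
// with a parallel prefix sum; a plain loop shows the same end result.
// `life` plays the role of the world_cache_life buffer.
fn compact_active_cells(life: &[u32]) -> (Vec<u32>, u32) {
    let mut active_indices = Vec::new();
    for (i, &l) in life.iter().enumerate() {
        if l > 0 {
            // Dense array of active entry indices.
            active_indices.push(i as u32);
        }
    }
    let count = active_indices.len() as u32;
    // `count` is what sizes the indirect dispatches for the update passes.
    (active_indices, count)
}
```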
This produces a dense array of indices of active entries, as well as indirect dispatch parameters for the next two passes.</p> <p>The code is just a standard parallel prefix-sum, so I'm going to skip showing it.</p> <h4 id="cache-update">Cache Update<a class="zola-anchor" href="#cache-update" aria-label="Anchor link for: cache-update" style="visibility: hidden;"></a> </h4> <p>Now that we know the list of active entries in the world cache (and can perform indirect dispatches to process each active entry), it's time to update the irradiance estimate for each voxel.</p> <p>The first part of the update process is taking new samples of the scene's lighting.</p> <p>Two rays are traced per voxel: a direct light sample, and an indirect light sample.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1024</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">sample_irradiance</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">workgroup_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">workgroup_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">global_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">active_cell_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#859900;">if</span><span> 
active_cell_id.x </span><span style="color:#657b83;">&lt;</span><span> world_cache_active_cells_count </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Get voxel data </span><span> </span><span style="color:#268bd2;">let</span><span> cell_index </span><span style="color:#657b83;">=</span><span> world_cache_active_cell_indices</span><span style="color:#657b83;">[</span><span>active_cell_id.x</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> geometry_data </span><span style="color:#657b83;">=</span><span> world_cache_geometry_data</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">]</span><span>; </span><span> var rng </span><span style="color:#657b83;">=</span><span> cell_index </span><span style="color:#657b83;">+</span><span> constants.frame_index; </span><span> </span><span> </span><span style="color:#586e75;">// Sample direct lighting via RIS (1st ray) </span><span> var new_irradiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_random_light_ris</span><span style="color:#657b83;">(</span><span>geometry_data.world_position, geometry_data.world_normal, workgroup_id.xy, </span><span style="color:#859900;">&amp;</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Sample indirect lighting via BRDF sampling + world cache querying (2nd ray) </span><span> </span><span style="color:#268bd2;">let</span><span> ray_direction </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_cosine_hemisphere</span><span style="color:#657b83;">(</span><span>geometry_data.world_normal, </span><span style="color:#859900;">&amp;</span><span>rng</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> ray_hit </span><span 
style="color:#657b83;">= </span><span style="color:#859900;">trace_ray</span><span style="color:#657b83;">(</span><span>geometry_data.world_position, ray_direction, </span><span style="color:#cb4b16;">RAY_T_MIN</span><span>, </span><span style="color:#cb4b16;">RAY_T_MAX</span><span>, </span><span style="color:#cb4b16;">RAY_FLAG_NONE</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">if</span><span> ray_hit.kind </span><span style="color:#657b83;">!= </span><span style="color:#cb4b16;">RAY_QUERY_INTERSECTION_NONE </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> ray_hit </span><span style="color:#657b83;">= </span><span style="color:#859900;">resolve_ray_hit_full</span><span style="color:#657b83;">(</span><span>ray_hit</span><span style="color:#657b83;">)</span><span>; </span><span> new_irradiance </span><span style="color:#657b83;">+=</span><span> ray_hit.material.base_color </span><span style="color:#657b83;">* </span><span style="color:#859900;">query_world_cache</span><span style="color:#657b83;">(</span><span>ray_hit.world_position, ray_hit.geometric_world_normal, view.world_position</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> world_cache_active_cells_new_irradiance</span><span style="color:#657b83;">[</span><span>active_cell_id.x</span><span style="color:#657b83;">] =</span><span> new_irradiance; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <p>The direct light sample is chosen via RIS, and uses the same presampled light tiles that we're going to use for ReSTIR DI. 
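</p> <p>For readers who haven't seen RIS before, the core loop is small: generate several cheap candidates, weight each one by its target function value over its source PDF, keep one candidate via weighted reservoir sampling, and track an unbiased contribution weight for shading. Here's a CPU sketch in Rust (the candidate struct, target values, and RNG are stand-ins, not Solari's actual code):</p>

```rust
// Sketch of resampled importance sampling (RIS) via weighted reservoir
// sampling. In a real renderer, `target` would be something like the
// light's unshadowed contribution at the shading point, and `source_pdf`
// the PDF of the cheap distribution the candidate was drawn from.
struct Candidate {
    target: f32,     // target function value for this candidate (assumed > 0)
    source_pdf: f32, // PDF the candidate was sampled with (assumed > 0)
}

// Returns the index of the selected candidate and its unbiased
// contribution weight. Assumes `candidates` is nonempty.
fn ris_select(candidates: &[Candidate], mut next_rand: impl FnMut() -> f32) -> (usize, f32) {
    let mut weight_sum = 0.0;
    let mut selected = 0;
    for (i, c) in candidates.iter().enumerate() {
        // RIS weight: target / source PDF.
        let w = c.target / c.source_pdf;
        weight_sum += w;
        // Weighted reservoir sampling: keep candidate i with probability
        // w / weight_sum, so each candidate ends up selected with
        // probability proportional to its weight.
        if next_rand() < w / weight_sum {
            selected = i;
        }
    }
    // Unbiased contribution weight: (1 / M) * weight_sum / target(selected).
    let m = candidates.len() as f32;
    (selected, weight_sum / (m * candidates[selected].target))
}
```

<p>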
It's basically the same process as ReSTIR DI initial candidate sampling.</p> <p>I've thought about using ReSTIR (well, ReTIR, without the spatial resampling part) for the world cache, but it's not something I've tried yet.</p> <p>The indirect light sample is a little more interesting.</p> <p>In order to estimate indirect lighting, we trace a ray using a cosine-hemisphere distribution. At the ray hit point, we query the world cache.</p> <p>You might be thinking "Wait, aren't we <em>updating</em> the cache? But we're also sampling from the same cache in order to... update it?"</p> <p>By having the cache sample from itself, we form a full path tracer, where tracing the path is spread out across multiple frames (for performance).</p> <p>As an example: In frame 5, world cache cell A samples a light source. In frame 6, a different world cache cell B samples cell A. In frame 7, yet another world cache cell C samples cell B. We've now formed a multi-bounce path <code>light source-&gt;A-&gt;B-&gt;C</code>, and once ReSTIR GI gets involved, <code>light source-&gt;A-&gt;B-&gt;C-&gt;primary surface-&gt;camera</code>.</p> <p>By having the cache sample itself, we get full-length multi-bounce paths, instead of just single-bounce paths. In indoor scenes that make heavy use of indirect lighting, the difference is pretty dramatic.</p> <p><figure> <img src="cornell_box_no_multi_bounce.png" > <figcaption><p>Single-bounce lighting</p> </figcaption> </figure> <figure> <img src="cornell_box_multi_bounce.png" > <figcaption><p>Multi-bounce lighting</p> </figcaption> </figure> </p> <h4 id="cache-blend">Cache Blend<a class="zola-anchor" href="#cache-blend" aria-label="Anchor link for: cache-blend" style="visibility: hidden;"></a> </h4> <p>The second and final step of the world cache update process is to blend the new light samples with the existing irradiance samples, giving us an estimate of the overall irradiance via temporal accumulation. 
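</p> <p>To get a feel for how the blending behaves, here's a toy CPU model of the accumulation (the cap constant is illustrative, not Solari's actual value). Early samples move the estimate a lot; once the sample count hits the cap, each new sample contributes a fixed fraction, i.e. an exponential moving average:</p>

```rust
// Toy model of capped temporal accumulation for a single cache cell.
// State is (irradiance estimate, effective sample count).
const MAX_TEMPORAL_SAMPLES: f32 = 8.0; // illustrative cap, not Solari's constant

fn blend(old: (f32, f32), new_sample: f32) -> (f32, f32) {
    let (old_irradiance, old_count) = old;
    // Cap the sample count so the blend factor never goes below 1 / cap.
    let count = (old_count + 1.0).min(MAX_TEMPORAL_SAMPLES);
    // Equivalent to mix(old, new, 1 / count): the first sample is adopted
    // fully, later samples are averaged in with shrinking weight.
    let blended = old_irradiance + (new_sample - old_irradiance) / count;
    (blended, count)
}
```

<p>The first sample is taken at full weight; after the cap is reached, each frame pulls the estimate a fixed fraction of the way toward the new sample, so the cache still converges to a changed lighting signal, just over several frames.</p> <p>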
If you've ever seen code for temporal antialiasing, this should look pretty familiar.</p> <p>The blending factor is based on the total sample count of the voxel, capped at a max value. New voxels without an existing irradiance estimate take more of the new sample's contribution, while voxels that already have an estimate take less.</p> <p>Choosing the max sample count is a tradeoff between having the cache be stable and low-variance, and having the cache be responsive to changes in the scene's lighting.</p> <p>It's also important to note that this is a separate compute dispatch from the previous one we used for sampling lighting. If the passes were combined, we would have data races from voxels writing new irradiance estimates at the same time other voxels were querying them.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1024</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">blend_new_samples</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">global_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">active_cell_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#859900;">if</span><span> active_cell_id.x </span><span style="color:#657b83;">&lt;</span><span> world_cache_active_cells_count </span><span style="color:#657b83;">{ </span><span> </span><span
style="color:#268bd2;">let</span><span> cell_index </span><span style="color:#657b83;">=</span><span> world_cache_active_cell_indices</span><span style="color:#657b83;">[</span><span>active_cell_id.x</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> old_irradiance </span><span style="color:#657b83;">=</span><span> world_cache_irradiance</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> new_irradiance </span><span style="color:#657b83;">=</span><span> world_cache_active_cells_new_irradiance</span><span style="color:#657b83;">[</span><span>active_cell_id.x</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> sample_count </span><span style="color:#657b83;">= </span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>old_irradiance.a </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">1.0</span><span>, </span><span style="color:#cb4b16;">WORLD_CACHE_MAX_TEMPORAL_SAMPLES</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> blended_irradiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">mix</span><span style="color:#657b83;">(</span><span>old_irradiance.rgb, new_irradiance, </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">/</span><span> sample_count</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> world_cache_irradiance</span><span style="color:#657b83;">[</span><span>cell_index</span><span style="color:#657b83;">] = </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>blended_irradiance, sample_count</span><span style="color:#657b83;">)</span><span>; 
</span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h3 id="dlss-ray-reconstruction">DLSS Ray Reconstruction<a class="zola-anchor" href="#dlss-ray-reconstruction" aria-label="Anchor link for: dlss-ray-reconstruction" style="visibility: hidden;"></a> </h3> <p>Once we have our noisy estimate of the scene, we run it through DLSS-RR to upscale, antialias, and denoise it.</p> <p><figure> <img src="noisy_full.png" > <figcaption><p>Noisy and aliased image</p> </figcaption> </figure> <figure> <img src="denoised_full.png" > <figcaption><p>Denoised and antialiased image</p> </figcaption> </figure> <figure> <img src="pathtraced.png" > <figcaption><p>Pathtraced reference</p> </figcaption> </figure> </p> <p>While ideally we would be able to configure DLSS-RR to read directly from our GBuffer, we unfortunately need a small pass to first copy from the GBuffer to some standalone textures. DLSS-RR will read these textures as inputs to help guide the denoising pass.</p> <p>DLSS-RR is called via the <a rel="nofollow noreferrer" href="https://crates.io/crates/dlss_wgpu">dlss_wgpu</a> wrapper I wrote, which is integrated into bevy_anti_alias as a Bevy plugin.</p> <blockquote class="callout note no-title"> <div class="icon"> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="18" height="18"><path d="M12 22C6.47715 22 2 17.5228 2 12C2 6.47715 6.47715 2 12 2C17.5228 2 22 6.47715 22 12C22 17.5228 17.5228 22 12 22ZM12 20C16.4183 20 20 16.4183 20 12C20 7.58172 16.4183 4 12 4C7.58172 4 4 7.58172 4 12C4 16.4183 7.58172 20 12 20ZM11 7H13V9H11V7ZM11 11H13V17H11V11Z" fill="currentColor"></path></svg> </div> <div class="content"> <p>The dlss_wgpu crate is standalone, and can also be used by non-Bevy projects that are using wgpu!</p> </div> </blockquote> <p><figure> <img src="denoised_di.png" > <figcaption><p>Denoised and antialiased image - DI only</p> </figcaption> </figure> <figure> <img src="denoised_gi.png" > 
<figcaption><p>Denoised and antialiased image - GI only</p> </figcaption> </figure> </p> <h2 id="performance">Performance<a class="zola-anchor" href="#performance" aria-label="Anchor link for: performance" style="visibility: hidden;"></a> </h2> <h3 id="numbers">Numbers<a class="zola-anchor" href="#numbers" aria-label="Anchor link for: numbers" style="visibility: hidden;"></a> </h3> <p>Timings for all scenes were measured on an RTX 3080, rendered at 1600x900, and upscaled to 3200x1800 using DLSS-RR performance mode.</p> <p><figure> <img src="pica_pica_perf.png" > <figcaption><p>PICA PICA</p> </figcaption> </figure> <figure> <img src="bistro_perf.png" > <figcaption><p>Bistro</p> </figcaption> </figure> <figure> <img src="cornell_box_perf.png" > <figcaption><p>Cornell Box</p> </figcaption> </figure> </p> <!-- | Pass | PICA PICA Duration (ms) | Bistro Duration (ms) | Cornell Box Duration (ms) | Dependent On | |:---------------------------------:|:-----------------------:|:--------------------:|:-------------------------:|:------------:| | Presample Light Tiles | 0.02761 | 0.08403 | 0.02436 | Negligible | | World Cache: Decay Cells | 0.01508 | 0.02007 | 0.01484 | Negligible | | World Cache: Compaction P1 | 0.03823 | 0.04357 | 0.03776 | Negligible | | World Cache: Compaction P2 | 0.00862 | 0.00903 | 0.00858 | Negligible | | World Cache: Write Active Cells | 0.01451 | 0.01942 | 0.00138 | Negligible | | World Cache: Sample Lighting | 0.06009 | 2.09000 | 0.05367 | World size | | World Cache: Blend New Samples | 0.01286 | 0.06737 | 0.01272 | Negligible | | ReSTIR DI: Initial + Temporal | 1.25000 | 1.85000 | 1.28000 | Pixel count | | ReSTIR DI: Spatial + Shade | 0.18628 | 0.65952 | 0.18127 | Pixel count | | ReSTIR GI: Initial + Temporal | 0.36913 | 2.75000 | 0.32722 | Pixel count | | ReSTIR GI: Spatial + Shade | 0.44301 | 0.59905 | 0.45791 | Pixel count | | DLSS-RR: Copy Inputs From GBuffer | 0.04185 | 0.06789 | 0.03517 | Pixel count | | DLSS-RR | 5.75000 | 6.29000 | 5.82000 
| Pixel count | | Total | 8.21727 | 14.54995 | 8.25488 | N/A | --> <table><thead><tr><th style="text-align: center">Pass</th><th style="text-align: center">PICA PICA Duration (ms)</th><th style="text-align: center">Bistro Duration (ms)</th><th style="text-align: center">Cornell Box Duration (ms)</th><th style="text-align: center">Dependent On</th></tr></thead><tbody> <tr><td style="text-align: center">Presample Light Tiles</td><td style="text-align: center">0.03</td><td style="text-align: center">0.08</td><td style="text-align: center">0.02</td><td style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">World Cache: Decay Cells</td><td style="text-align: center">0.02</td><td style="text-align: center">0.02</td><td style="text-align: center">0.01</td><td style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">World Cache: Compaction P1</td><td style="text-align: center">0.04</td><td style="text-align: center">0.04</td><td style="text-align: center">0.04</td><td style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">World Cache: Compaction P2</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td><td style="text-align: center">0.01</td><td style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">World Cache: Write Active Cells</td><td style="text-align: center">0.01</td><td style="text-align: center">0.02</td><td style="text-align: center">0.01</td><td style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">World Cache: Sample Lighting</td><td style="text-align: center">0.06</td><td style="text-align: center">2.09</td><td style="text-align: center">0.05</td><td style="text-align: center">World size</td></tr> <tr><td style="text-align: center">World Cache: Blend New Samples</td><td style="text-align: center">0.01</td><td style="text-align: center">0.07</td><td style="text-align: center">0.01</td><td 
style="text-align: center">Negligible</td></tr> <tr><td style="text-align: center">ReSTIR DI: Initial + Temporal</td><td style="text-align: center">1.25</td><td style="text-align: center">1.85</td><td style="text-align: center">1.28</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">ReSTIR DI: Spatial + Shade</td><td style="text-align: center">0.19</td><td style="text-align: center">0.66</td><td style="text-align: center">0.18</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">ReSTIR GI: Initial + Temporal</td><td style="text-align: center">0.37</td><td style="text-align: center">2.75</td><td style="text-align: center">0.33</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">ReSTIR GI: Spatial + Shade</td><td style="text-align: center">0.44</td><td style="text-align: center">0.60</td><td style="text-align: center">0.46</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">DLSS-RR: Copy Inputs From GBuffer</td><td style="text-align: center">0.04</td><td style="text-align: center">0.07</td><td style="text-align: center">0.04</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">DLSS-RR</td><td style="text-align: center">5.75</td><td style="text-align: center">6.29</td><td style="text-align: center">5.82</td><td style="text-align: center">Pixel count</td></tr> <tr><td style="text-align: center">Total</td><td style="text-align: center">8.22</td><td style="text-align: center">14.55</td><td style="text-align: center">8.25</td><td style="text-align: center">N/A</td></tr> </tbody></table> <h3 id="upscaling-benefits">Upscaling Benefits<a class="zola-anchor" href="#upscaling-benefits" aria-label="Anchor link for: upscaling-benefits" style="visibility: hidden;"></a> </h3> <p>While DLSS-RR is quite expensive, it still ends up saving performance overall.</p> <p>Without upscaling, we 
would have 4x as many pixels total, meaning ReSTIR DI and GI would be ~4x as expensive. After that, we would need a separate denoising process (usually two separate processes, one for direct and one for indirect), a separate shading pass to apply the denoised lighting, and then an antialiasing method.</p> <p>Total performance costs would be higher than using the unified upscaling + denoising + antialiasing pipeline that DLSS-RR provides.</p> <p>DLSS-RR also performs much better on the newer Ada and Blackwell GPUs.</p> <p><img src="https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/dlss_rr_perf.png" alt="dlss_rr_perf" /></p> <h3 id="nsight-trace">NSight Trace<a class="zola-anchor" href="#nsight-trace" aria-label="Anchor link for: nsight-trace" style="visibility: hidden;"></a> </h3> <p>Looking at a GPU trace, our main ReSTIR DI/GI passes are primarily memory bound.</p> <p>The ReSTIR DI initial and temporal pass is mainly limited by loads from global memory (blue bar), which source-code level profiling reveals to come from loading <code>ResolvedLightSamplePacked</code> samples from light tiles during initial sampling.</p> <p>The ReSTIR DI spatial and shade pass, and both ReSTIR GI passes, are limited by raytracing throughput (yellow bar).</p> <figure> <img src="nsight_trace.png" > <figcaption><p>NSight Graphics GPU Trace</p> </figcaption> </figure> <p>There are typically three ways to improve memory-bound shaders:</p> <ol> <li>Loading less data</li> <li>Improving cache hit rate</li> <li>Hiding the latency</li> </ol> <p>For ReSTIR DI initial sampling, this would correspond to:</p> <ol> <li>Taking less than 32 initial samples (viable, depending on the scene)</li> <li>Can't do this - we're already hitting 95% L2 cache throughput</li> <li>Would need to increase <a rel="nofollow noreferrer" href="https://gpuopen.com/learn/occupancy-explained">occupancy</a></li> </ol> <p>Unfortunately, the only real optimization I think we could do is hiding the latency by improving 
the occupancy. More threads for the GPU to swap between while waiting for memory loads to finish = finishing the overall workload faster.</p> <p>NSight shows that we have a mediocre 32 out of a hardware maximum of 48 warps occupied, limited by the "registers per thread limiter". I.e. our shader code uses too many registers per thread, and the GPU does not have enough register space to allocate additional warps.</p> <p>Source-code level profiling shows that the majority of live registers are consumed by the <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/blob/8b36cca28c4ea00425e1414fd88c8b82297e2b96/crates/bevy_solari/src/scene/raytracing_scene_bindings.wgsl#L177-L215">triangle resolve function</a>, which maps a point on a mesh to surface data like position, normal, material properties, etc. I'm not really sure how to reduce register usage here.</p> <p>For the other three passes limited by raytracing throughput, we have the same issue. Not a ton we can do besides hiding the latency, which runs into the same issue with register count and occupancy.</p> <p>For GI specifically though, there <em>is</em> a way I've thought of to do less work, again at the cost of worse quality depending on the scene.</p> <p>For the world cache, rather than trace rays for every active cell, we could do it for a random subset of cells each frame (up to some maximum), to help limit the cost of updating many cache entries.</p> <p>For the ReSTIR GI passes, we could perform them at quarter resolution (half the pixels along each axis). GI doesn't particularly need exact per-pixel data, so we can calculate it at a lower resolution, and then <a rel="nofollow noreferrer" href="https://www.nvidia.com/en-us/on-demand/session/gdc25-gdc1002">upscale</a> (timestamp 17:22).
This upscaling would be in addition to the DLSS-RR upscaling.</p> <h2 id="future-work">Future Work<a class="zola-anchor" href="#future-work" aria-label="Anchor link for: future-work" style="visibility: hidden;"></a> </h2> <p>As always, the first release of a new plugin is just the start. I still have a ton of ideas for future improvements to Solari!</p> <h3 id="feature-parity">Feature Parity<a class="zola-anchor" href="#feature-parity" aria-label="Anchor link for: feature-parity" style="visibility: hidden;"></a> </h3> <p>In terms of feature parity with Bevy's standard renderer, the most important missing feature is support for specular, transparent, and alpha-masked materials.</p> <p>I've been actively prototyping specular material support, and with any luck will be writing about it in a future blog post on Solari changes in Bevy v0.18.</p> <p>Custom material support is another big one, although it's blocked on raytracing pipeline support in wgpu (which would also unlock shader execution reordering!).</p> <p>Support for skinned meshes first needs some work done in Bevy to add GPU-driven skinning, but would be a great feature to add.</p> <p>Finally, Solari is eventually going to want to support more types of lights such as point lights, spot lights, and image-based lighting.</p> <h3 id="light-sampling">Light Sampling<a class="zola-anchor" href="#light-sampling" aria-label="Anchor link for: light-sampling" style="visibility: hidden;"></a> </h3> <p>Light sampling in Solari is currently purely random (not even uniformly random!), and there are big opportunities to improve it.</p> <p>Having a large number of lights in a scene is <em>theoretically</em> viable with ReSTIR, but in practice Solari is not yet there.
We need some sort of spatial/visibility-aware sampling to improve the quality of our initial candidate samples.</p> <p>One approach another Bevy developer is exploring is using <a rel="nofollow noreferrer" href="https://gpuopen.com/download/Hierarchical_Light_Sampling_with_Accurate_Spherical_Gaussian_Lighting.pdf">spherical gaussian light trees</a>.</p> <p>Another promising direction is copying from the recently released <a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2025/content/MegaLights_Stochastic_Direct_Lighting_2025.pdf">MegaLights</a> presentation, and adding visible light lists. I want to experiment with implementing light lists in world space, so that they can also be used to improve our GI.</p> <h3 id="chromatic-restir">Chromatic ReSTIR<a class="zola-anchor" href="#chromatic-restir" aria-label="Anchor link for: chromatic-restir" style="visibility: hidden;"></a> </h3> <p>Overlapping lights of similar brightness but different chromas (R, G, B) pose another problem for ReSTIR. ReSTIR can only select a single sample, but in this case, there are multiple overlapping lights.</p> <p>One approach I've been prototyping to solve this is using <a rel="nofollow noreferrer" href="https://suikasibyl.github.io/files/vvmc/paper.pdf">ratio control variates</a> (RCV). The basic idea (if I understand the paper correctly) is that you apply a vector-valued (R,G,B) weight to your lighting integral, based on the fraction of light a given sample contributes, divided by the overall light in the scene.</p> <p>E.g. if you sample a pure red light, but the scene has a large amount of blue and green light, then you downweight the sample's red contribution, and upweight its blue and green contributions.</p> <p>The paper gives a scheme involving precomputing (offline) the total amount of light in the scene ahead of time, using light trees.
We could easily add RCV support if we go ahead with adding light trees to Solari.</p> <p>But another option I've been testing (without much luck yet) is to learn an <em>online</em> estimate of the total light in the scene. The idea is that each reservoir keeps track of the total amount of light it sees per channel as you do initial sampling and resampling between reservoirs. When it comes time to shade the final selected sample, you can use this estimate with RCV to weight the sample appropriately.</p> <p>We'll see if I can get it working!</p> <h3 id="gi-quality">GI Quality<a class="zola-anchor" href="#gi-quality" aria-label="Anchor link for: gi-quality" style="visibility: hidden;"></a> </h3> <p>While the world cache greatly improves GI quality and performance, it also brings its own set of downsides.</p> <p>The main one is that when we create a cache entry, we set its world-space position and normal. Every frame when the cache entry samples lighting, it uses that position and normal for sampling. The position and normal are fixed, and can never be updated.</p> <p>This means that if a bad position or normal that poorly represents the cache voxel is chosen when initializing the voxel, then it's stuck with that. This leads to weird artifacts that I haven't figured out how to solve, like some screenshots having orange lighting around the robot, and others not.</p> <p>Another unsolved problem is overall loss of energy. 
Compare the below screenshots of the current Solari scheme to a different scheme where instead of terminating in the world cache, the GI system traces an additional ray towards a random light.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Baseline scheme using the world cache </span><span>reservoir.radiance </span><span style="color:#657b83;">= </span><span style="color:#859900;">query_world_cache</span><span style="color:#657b83;">(</span><span>sample_point.world_position, sample_point.geometric_world_normal, view.world_position</span><span style="color:#657b83;">)</span><span>; </span><span>reservoir.unbiased_contribution_weight </span><span style="color:#657b83;">= </span><span style="color:#859900;">uniform_hemisphere_inverse_pdf</span><span style="color:#657b83;">()</span><span>; </span><span> </span><span style="color:#586e75;">// Alternate scheme sampling and tracing a ray towards 1 random light </span><span style="color:#268bd2;">let</span><span> direct_lighting </span><span style="color:#657b83;">= </span><span style="color:#859900;">sample_random_light</span><span style="color:#657b83;">(</span><span>sample_point.world_position, sample_point.world_normal, rng</span><span style="color:#657b83;">)</span><span>; </span><span>reservoir.radiance </span><span style="color:#657b83;">=</span><span> direct_lighting.radiance; </span><span>reservoir.unbiased_contribution_weight </span><span style="color:#657b83;">=</span><span> direct_lighting.inverse_pdf </span><span style="color:#657b83;">* </span><span style="color:#859900;">uniform_hemisphere_inverse_pdf</span><span style="color:#657b83;">()</span><span>; </span></code></pre> <figure> <img src="no_world_cache.png" > <figcaption><p>Alternate GI scheme, without the world cache</p> </figcaption> </figure> <p>Despite the alternate scheme having higher variance and no multibounce 
pathtracing, it's actually <em>brighter</em> than using the world cache. For some reason, the voxelized nature of the world cache leads to a loss of energy.</p> <p>I've been thinking about trying out reprojecting the last frame to get multi-bounce lighting for rays that hit within the camera's view, instead of always relying on the world cache. That might mitigate some of the energy loss.</p> <p>Finally, the biggest problem with GI in general is both the overall lack of stability, and the slow reaction time to scene changes. The voxelized nature of the world cache, combined with how ReSTIR amplifies samples, means that bright outliers (e.g. world cache voxels much brighter than their neighbors) lead to temporal instability as shown below.</p> <p><img src="https://jms55.github.io/posts/2025-09-20-solari-bevy-0-17/gi_outlier.png" alt="gi_outlier" /></p> <p>While we could slow down the temporal accumulation speed to improve stability, that would slow down how fast Solari can react to changes in the scene's lighting. Our goal is realtime, <em>fully</em> dynamic lighting. Not sorta realtime, but actual realtime.</p> <p>Unfortunately the lack of validation rays in the ReSTIR GI temporal pass, combined with the recursive nature of the world cache, means that Solari already takes a decent amount of time to react to changes. Animated and moving light sources in particular leave trails behind in the GI. Slowing down the temporal accumulation speed would make it even worse.</p> <p>Going forward with the project, I'm looking to mitigate all of these problems.</p> <p>While it would be more expensive, one option I've considered is combining the alternate sampling scheme with some kind of world-space feedback mechanism like the MegaLights visible light list I described above. The GI pass could trace an additional ray towards a light instead of sampling the world cache.
If the light is visible, we could add it to a list stored in a world-space voxel, to be fed back into the (GI or DI) light sampling for future frames.</p> <h3 id="denoising-options">Denoising Options<a class="zola-anchor" href="#denoising-options" aria-label="Anchor link for: denoising-options" style="visibility: hidden;"></a> </h3> <p>While Solari currently requires an NVIDIA GPU, the DLSS-RR integration is a separate plugin from Solari. Users can optionally choose to bring their own denoiser.</p> <p>In the future, whenever they release them, I'm hoping to add support for <a rel="nofollow noreferrer" href="https://web.archive.org/web/20250822144949/https://www.amd.com/en/products/graphics/technologies/fidelityfx/super-resolution.html#upcoming">AMD's FSR Ray Regeneration</a>, whatever XeSS extension <a rel="nofollow noreferrer" href="https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Neural-Image-Reconstruction-for-Real-Time-Path-Tracing/post/1688192">Intel</a> eventually releases, and potentially even <a rel="nofollow noreferrer" href="https://developer.apple.com/documentation/metalfx/mtl4fxtemporaldenoisedscaler">Apple's MTL4FXTemporalDenoisedScaler</a>. Even <a rel="nofollow noreferrer" href="https://newsroom.arm.com/news/arm-announces-arm-neural-technology">ARM</a> is working on a neural-network based denoiser!</p> <p>Writing a denoiser from scratch is a lot of work, but it would also be nice to add <a rel="nofollow noreferrer" href="https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s22699-fast-denoising-with-self-stabilizing-recurrent-blurs.pdf">ReBLUR</a> as a fallback for users of other GPUs.</p> <h2 id="thank-you">Thank You<a class="zola-anchor" href="#thank-you" aria-label="Anchor link for: thank-you" style="visibility: hidden;"></a> </h2> <p>If you've read this far, thank you, I hope you've enjoyed it!
(to be fair, I can't imagine you got this far if you didn't enjoy reading it...)</p> <p>Solari represents the culmination of a significant amount of research, development, testing, refining, and more than a few tears over the last three years of my spare time. Not just from me, but from all the research and prior work whose shoulders it stands on. I couldn't be more proud of what I've made.</p> <p>Like the rest of Bevy, Solari is also free and open source, forever.</p> <p>If you find Solari useful, consider <a rel="nofollow noreferrer" href="https://github.com/sponsors/JMS55">donating</a> to help fund future development.</p> <h2 id="further-reading">Further Reading<a class="zola-anchor" href="#further-reading" aria-label="Anchor link for: further-reading" style="visibility: hidden;"></a> </h2> <ul> <li><a rel="nofollow noreferrer" href="https://intro-to-restir.cwyman.org">A Gentle Introduction to ReSTIR: Path Reuse in Real-time</a></li> <li><a rel="nofollow noreferrer" href="https://interplayoflight.wordpress.com/2023/12/17/a-gentler-introduction-to-restir">A gentler introduction to ReSTIR</a></li> <li><a rel="nofollow noreferrer" href="https://research.nvidia.com/labs/rtr/publication/bitterli2020spatiotemporal">Spatiotemporal Reservoir Resampling for Real-time Ray Tracing with Dynamic Direct Lighting</a></li> <li><a rel="nofollow noreferrer" href="https://research.nvidia.com/publication/2021-06_restir-gi-path-resampling-real-time-path-tracing">ReSTIR GI: Path Resampling for Real-Time Path Tracing</a></li> <li><a rel="nofollow noreferrer" href="https://cwyman.org/papers/hpg21_rearchitectingReSTIR.pdf">Rearchitecting Spatiotemporal Resampling for Production</a></li> <li><a rel="nofollow noreferrer" href="https://blog.traverseresearch.nl/dynamic-diffuse-global-illumination-b56dc0525a0a">Dynamic diffuse global illumination</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/EmbarkStudios/kajiya/blob/main/docs/gi-overview.md">Kajiya global illumination
overview</a></li> <li><a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2025/content/SOUSA_SIGGRAPH_2025_Final.pdf">Fast as Hell: idTech8 Global Illumination</a></li> <li><a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2022/SIGGRAPH2022-Advances-Lumen-Wright%20et%20al.pdf">Lumen: Real-time Global Illumination in Unreal Engine 5</a></li> <li><a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2025/content/MegaLights_Stochastic_Direct_Lighting_2025.pdf">MegaLights: Stochastic Direct Lighting in Unreal Engine 5</a></li> <li><a rel="nofollow noreferrer" href="https://gpuopen.com/download/GPUOpen2022_GI1_0.pdf">GI-1.0: A Fast Scalable Two-Level Radiance Caching Scheme for Real-Time Global Illumination</a></li> </ul> Bevy's Fifth Birthday - Progress and Production Readiness 2025-09-03T00:00:00+00:00 2025-09-03T00:00:00+00:00 Unknown https://jms55.github.io/posts/2025-09-03-bevy-fifth-birthday/ <blockquote> <p>Written in response to <a rel="nofollow noreferrer" href="https://bevy.org/news/bevys-fifth-birthday">Bevy's Fifth Birthday</a>.</p> </blockquote> <h3 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h3> <p>Welcome to the review of my third year of Bevy development!</p> <p>After three years, I'm still enjoying Bevy as much as ever. Not only that, but development is the smoothest it's ever been!</p> <p>As is my usual writing style, this post is going to be a bit dry and disjointed, and maybe not the most hype-oriented. It's not exactly what I aimed for when I started writing, but it seems to be how I end up writing things :). I <em>did</em> try using an LLM to help with writing, but it was way too fawning, so in the end I've written this by hand. 
Perfect is the enemy of good and all that.</p> <p>While I <em>am</em> really excited about Bevy, this year and every year, I'll leave hyping Bevy up to others, so go read their blog posts once they're posted to <a rel="nofollow noreferrer" href="https://bevy.org">https://bevy.org</a>! Consider this post more a brain dump of my own experiences, rather than a report on Bevy as a project.</p> <p>Anyways, let's talk about how this year went.</p> <h3 id="my-stuff">My Stuff<a class="zola-anchor" href="#my-stuff" aria-label="Anchor link for: my-stuff" style="visibility: hidden;"></a> </h3> <p>The Bevy community has collectively landed a metric truckload of features and improvements this year. Like, just a mind-blowingly large amount.</p> <p>The Bevy 0.15, 0.16, and 0.17 release notes do a good job of highlighting what's new in each release, so I'm going to give a brief overview of just the (rendering) work I did this year that I'm particularly proud of.</p> <p>The biggest project for me this year has been the massive amount of improvements I (@JMS55), @atlv24, and @SparkyPotato have landed for virtual geometry. When I wrote about Bevy's fourth birthday, we had landed the initial virtual geometry feature in Bevy 0.14. Since then, we've made numerous improvements to asset deserialization, compression, rasterization performance, LOD selection, LOD building, and culling in Bevy 0.15-0.17.</p> <p>I'm optimistic that this year will be the year that we add streaming, improve CPU performance, and fix the remaining culling bugs. With any luck on asset processing (more on this later), we'll finally take virtual geometry out of experimental status later this year!</p> <p><img src="https://jms55.github.io/posts/2025-09-03-bevy-fifth-birthday/meshlet.png" alt="Virtual geometry scene with thousands of dragon meshes" /></p> <blockquote> <p>Side note: Unlike the last few releases, I won't be writing a blog post about virtual geometry for Bevy 0.17.
I didn't work on the BVH-culling PR for Bevy 0.17 (that was all @atlv24 and @SparkyPotato) due to a combination of burnout and life getting in the way. While I've taken a break from virtual geometry, I've started <em>another</em> huge project: Solari.</p> </blockquote> <p>Bevy Solari is a brand new crate for raytraced lighting coming in Bevy 0.17. While most of the work was technically done in Bevy 0.17, its origins trace back to a ~2-year-old project. I've mentioned in past blog posts how I started it, and then later abandoned it (which led to me starting virtual geometry) after poor results, and issues with keeping forks of bevy/wgpu/naga_oil up to date.</p> <p>Now that I'm taking a break from virtual geometry, and due to the introduction of some new algorithms and research papers, along with 2 additional years of learning under my belt and upstreamed raytracing in wgpu, I've restarted the project with a completely new approach. I'm really <em>super</em> excited to share what we have so far, so expect a more detailed blog post about this soon!</p> <p><img src="https://jms55.github.io/posts/2025-09-03-bevy-fifth-birthday/solari.png" alt="Solari demo scene" /></p> <p>Like with Solari, DLSS integration is another abandoned project that I've revived thanks to work done in wgpu to enable interop with underlying graphics APIs like Vulkan. Bevy 0.17 will be shipping support for DLSS (and DLSS-RR), alongside its existing anti-aliasing options in MSAA, FXAA, SMAA, and TAA. NVIDIA users now have a great option for anti-aliasing, and much cheaper rendering via upscaling.</p> <p>I also wanted to add FSR4 support, but sadly FSR4 was released as a DirectX-only SDK, without any Vulkan support. This would have meant redoing a lot of work, and wasn't going to be done in time for Bevy 0.17 (and I don't own an RX 9070 XT).
Still, eventually we could add support for FSR and XeSS (and potentially MetalFX), now that the infrastructure for temporal upscaling is in place.</p> <p><img src="https://jms55.github.io/posts/2025-09-03-bevy-fifth-birthday/dlss.jpg" alt="DLSS demo" /></p> <p>The last major feature I landed this year was hooking up our existing GPU timestamps to Tracy, the profiling tool we use. Now Bevy users can see combined CPU and GPU bottlenecks in one place, which is super useful!</p> <p><img src="https://jms55.github.io/posts/2025-09-03-bevy-fifth-birthday/tracy.png" alt="Tracy screenshot showing GPU timings" /></p> <p>This year I would also like to shout out several contributors (besides @atlv24 and @SparkyPotato) I've been working with: @cart and @alice-i-cecile of course, as well as (in no particular order) @mockersf, @tychedelia, @Elabajaba, @DGriffin91, @IceSentry, @mate-h, @ecoskey, @NthTensor, @viridia, @pcwalton, and @ickshonpe, as well as @cwfitzgerald and @Vecvec for their work on wgpu. Without their help, none of this would have been possible!</p> <h3 id="unexpected-improvements">Unexpected Improvements<a class="zola-anchor" href="#unexpected-improvements" aria-label="Anchor link for: unexpected-improvements" style="visibility: hidden;"></a> </h3> <p>Along with the usual headline features, there's also been a lot of work done by the Bevy community this year that has ended up surprising me.</p> <p>The major one would be required components. I was fairly skeptical of them when they were first introduced, and thought it wasn't really an "ECS" way of doing things. In retrospect, I was totally wrong. As a user, using required components is <em>way</em> more pleasant than the older bundle-based API, and is easier to get started with. It ends up being <em>easier</em> to explain to users to spawn and query an entity with a <code>Camera</code> component, rather than spawning with a <code>CameraBundle</code> and then querying for a <code>Camera</code>.
As a plugin author, required components give me some peace of mind knowing that users can't easily add a component without adding its dependencies (and make the API a little nicer compared to bundles). There <em>are</em> still some rough edges to sort out, mainly making some sort of priority system for required components to override each other (e.g. <code>Camera</code> requiring <code>Msaa::Sample4</code>, and then <code>TemporalAntiAliasing</code> being able to override that with <code>Msaa::Off</code>), but overall it was an unexpectedly nice improvement. Cart absolutely cooked with this change.</p> <p>Retained rendering and the other GPU-driven rendering work have also worked out really well. Again some sharp edges with the APIs (mainly cleaning up render world entities being hard to write and easy to mess up), but these were hugely foundational changes that overall landed really smoothly! I don't think the Bevy of 1-2 years ago would have landed these so easily, and they've <em>drastically</em> improved performance.</p> <p>The introduction of working groups was, to some extent, a big contributor to this. In the past I've complained about review speed for PRs, but this year it's been <em>way</em> better. Working groups give a set of built-in reviewers for larger projects, and Alice's Monday merge train often ends up being the final maintainer-review-once-over before merging, which keeps PRs merging smoothly. It's been working quite well!</p> <p>Similarly, having release notes live in the main Bevy repo, and requiring them as part of submitting PRs (something I've been advocating for!), has made the release process for Bevy 0.17 <em>so much easier</em>. Rather than having to crunch out release notes, changelogs, and showcases at the end of the cycle (when we're all burnt out from writing PRs), writing them incrementally as a part of the PR process has been a huge stress relief.
As an unexpected benefit, it also makes reviewing PRs much easier, as it forces the author to write a good user-facing description of the changes for reviewers.</p> <h3 id="next-year">Next Year?<a class="zola-anchor" href="#next-year" aria-label="Anchor link for: next-year" style="visibility: hidden;"></a> </h3> <p>The end of the birthday post is usually where I take some time to talk about what I'm planning to work on for the next year, but honestly I don't have a ton of plans at the moment.</p> <p>Continuing virtual geometry and Solari is a given, but beyond that I don't have any concrete goals in terms of features.</p> <p>Neural-compressed textures would be cool, but are maybe a little too researchy, and require better asset processing APIs.</p> <p>Writing more blog posts would be great, but I probably don't have it in me to write these more frequently. I <em>have</em> been using a <a rel="nofollow noreferrer" href="https://bsky.app/profile/jms5517.bsky.social">Bluesky page</a> to document short progress snippets as I work on PRs, so maybe follow that if you're interested in my content.</p> <p>I <em>would</em> like to write more documentation this year though - both API docs, and module docs / Bevy book content. As Bevy is getting increasingly mature, docs have become one of the bigger sticking points. Rendering in particular needs a lot more docs, both because it's under-documented, and because it's a fairly arcane subject.</p> <p>When I first started making 3d games, and later when I started working on rendering, I had absolutely no clue what to do.
How to light a scene, how to write a custom material, what's important for rendering performance, how do I write my own rendering feature &lt;FOO&gt;, and more are all questions that would greatly benefit from some longer-form written documents.</p> <p>As we start to run out of major rendering features, putting my energy towards writing more docs seems like a good way to move Bevy closer to being production-ready. I've already <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy-website/pull/2195">started writing some stuff</a>.</p> <h3 id="production-ready-what-s-missing">Production Ready - What's Missing?<a class="zola-anchor" href="#production-ready-what-s-missing" aria-label="Anchor link for: production-ready-what-s-missing" style="visibility: hidden;"></a> </h3> <p>So instead of writing my plans for next year, let's talk about what I think Bevy is missing (besides docs). I don't necessarily plan or not plan on working on any of this myself, but here are the things that I feel make it hard to say "Just use Bevy, duh!"</p> <p><strong>UI</strong> continues to be a weak point. While <code>bevy_ui</code> is a great foundation for rendering UI (in large part due to @ickshonpe's and the taffy team's heroic efforts), no third-party crate (including my own bevy_dioxus) has proven out a good high-level API for declaring and updating UI trees. BSN is coming soon, but it only solves the declarative part of UI, and not the reactivity part. Until we resolve this, it's hard to recommend Bevy for UI-heavy games and apps, and more importantly, we can't build the-</p> <p><strong>Editor</strong> absence continues to be a big, big hole for Bevy. Not just in terms of being production ready, but I think the first release with an official editor is going to get an exponential influx of new users, and eventually new contributors. Working on the Solari demo scene has made me feel the lack of an editor badly.
It was quite frustrating trying to get the materials correct for everything without an editor. I <em>did</em> work on a <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy_editor_prototypes/pull/167">prototype</a> scene tree + inspector using a third-party BSN crate, but it was exceedingly difficult to write and understand, and I gave up on it. I'm really excited to work on the editor, but I'm going to hold off until reactive UI lands.</p> <p><strong>Asset processing</strong> is another big bottleneck. Hard to say Bevy is production ready when the only texture compressor it has is an outdated version of BasisU. While cart added some asset processing APIs with Assets V2, they're clunky and don't support enough features. Trying to write a glTF -&gt; virtual geometry processor <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/13431">proved to be infeasible</a>. Once cart is done with BSN and reactivity, I would like to see him go back to this area. I would also like to move away from recommending that users ship their games with glTF/glb scenes, and instead provide some kind of glTF -&gt; BSN + separate image/mesh assets importer that we can then run further asset processing on. This is another area that would greatly benefit from having an Editor.</p> <p><strong>Animation</strong> isn't something I know a ton about, but after recently trying it out in Bevy, I can definitely say it's lacking. The API is quite clunky, with too many confusingly-named components, too many entities required, and not enough features.</p> <p><strong>Custom Materials</strong> in Bevy are currently serviceable, but not enjoyable. Users have a lot of power, but that's because we don't really provide much in the way of customizable abstractions. Mostly on the shader side, but also partly on the Rust side, with users having to resort to <code>MeshTag</code> and <code>ShaderStorageBuffer</code> to get good performance.
The Material API should be completely redesigned, unified across 3d/2d/UI, and made much easier to use for common use-cases. We've been throwing around ideas in the #rendering-dev channel on Discord, but nothing concrete yet. This is a good area to get involved in!</p> <p>Overall, I do see paths to improving all of these areas over the next year (or likely two for animations and editor). I am a little disappointed with how long it has taken to land BSN, mostly with the opaqueness of the process (which cart has talked about, so I'm not going to repeat here), but I'm hopeful about the next steps!</p> <p>See you next year!</p> Virtual Geometry in Bevy 0.16 2025-03-27T00:00:00+00:00 2025-03-27T00:00:00+00:00 Unknown https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/ <h2 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h2> <p>Bevy 0.16 is releasing soon, and as usual it's time for me to write about the progress I've made on virtual geometry over the last couple of months.</p> <p>Due to a combination of life being busy, and taking an ongoing break from the project to work on other stuff (due to burnout), I haven't gotten as much done as the last two releases. This will be a much shorter blog post than usual.</p> <h2 id="metis-based-triangle-clustering">METIS-based Triangle Clustering<a class="zola-anchor" href="#metis-based-triangle-clustering" aria-label="Anchor link for: metis-based-triangle-clustering" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/16947">#16947</a> improves the DAG quality.</p> <p>I've said it before, and I'll say it again - DAG quality is the most important part of virtual geometry (and the hardest to get right).</p> <p>Before, in order to group triangles into meshlets, I was simply relying on meshoptimizer's <code>meshopt_buildMeshlets()</code> function. 
It works pretty well for the general use case of splitting meshes into meshlets, but for virtual geometry, it's not ideal.</p> <p>Meshoptimizer prioritizes generating nice clusters for culling and vertex reuse, but for virtual geometry, we want to ensure that meshlets share as few vertices as possible. Fewer shared vertices between meshlets means fewer locked vertices when simplifying, which leads to better DAG quality.</p> <p>Minimizing shared vertices between meshlets when clustering triangles is the same problem as minimizing shared vertices between meshlet groups when grouping meshlets. We will once again use METIS to partition a graph, where nodes are triangles, edges connect adjacent triangles, and edge weights are the count of shared vertices between the triangles.</p> <p>From there it was just a lot of experimentation and tweaking parameters in order to get METIS to generate good meshlets. The secret ingredients I discovered for good clustering are:</p> <ol> <li>Set UFactor to 1 in METIS's options (did you know METIS has an options struct?), to ensure as little imbalance between partitions as possible.</li> <li>Undershoot the partition count a little. Otherwise METIS will tend to overshoot and give you too many triangles per meshlet. For 128 max triangles per cluster, I set <code>partition_count = number_of_triangles.div_ceil(126)</code>.</li> </ol> <p>With this, we get a nicer quality DAG. Up until now, I've been plagued by tiny &lt;10 triangle meshlets that tend to get "stuck" and not simplify into higher LOD levels.
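</p> <p>To make the second ingredient concrete, here's the undershot partition count as a small runnable Rust sketch. This is an illustration of the heuristic described above, not Bevy's actual code, and the function name is mine:</p>

```rust
// Sketch of the partition-count heuristic: for a 128-triangle cluster limit,
// target 126 triangles per partition, so that METIS's tendency to overshoot
// the balance constraint doesn't push clusters past the hard limit.
fn partition_count(number_of_triangles: usize) -> usize {
    number_of_triangles.div_ceil(126)
}

fn main() {
    // 126 triangles fit in one partition; one more forces a second.
    assert_eq!(partition_count(126), 1);
    assert_eq!(partition_count(127), 2);
    // A 10,000-triangle mesh gets 80 partitions of roughly 125 triangles each.
    assert_eq!(partition_count(10_000), 80);
}
```

<p>With UFactor set to 1 on top of this, METIS keeps partitions nearly balanced, so the slight undershoot absorbs any overshoot.</p> <p>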
Now we get nice and even meshlets that simplify well as we build the higher LOD levels.</p> <center> <p><img src="https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/old_dag.png" alt="Old DAG" /> <em>Old DAG</em></p> <p><img src="https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/new_dag.png" alt="New DAG" /> <em>New DAG</em></p> </center> <p>I'm still not done working on DAG quality - I haven't considered spatial positions of triangles/meshlets for grouping things yet - but this was a great step forward.</p> <h2 id="texture-atomics">Texture Atomics<a class="zola-anchor" href="#texture-atomics" aria-label="Anchor link for: texture-atomics" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/17765">#17765</a> improves the runtime performance.</p> <p>Thanks once again to @atlv24's work on wgpu/naga, we now have access to atomic operations on u64/u32 storage textures!</p> <p>Instead of using a plain GPU buffer to store our visbuffer, and buffer atomics to rasterize, we'll now use an R64Uint/R32Uint storage texture, and use texture atomics for rasterization.</p> <p>Things get a little bit faster, mostly due to cache behaviors for texture-like access patterns being better with actual textures instead of buffers.</p> <h3 id="faster-depth-resolve">Faster Depth Resolve<a class="zola-anchor" href="#faster-depth-resolve" aria-label="Anchor link for: faster-depth-resolve" style="visibility: hidden;"></a> </h3> <p>The real win, however, was an entirely unrelated change I made in the same PR.</p> <p>After rasterizing to the visbuffer texture (packed depth + cluster ID + triangle ID), there are two fullscreen triangle render passes to read from the visbuffer and write depth to both an actual depth texture, and the "material depth" texture discussed in previous posts.</p> <p>Let's look at the material depth resolve shader:</p> <pre data-lang="rust"
style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">resolve_material_depth</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">in</span><span>: FullscreenVertexOutput</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@builtin</span><span style="color:#657b83;">(</span><span>frag_depth</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> visibility </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>meshlet_visibility_buffer, vec2&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.xy</span><span style="color:#657b83;">))</span><span>.r; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> depth </span><span style="color:#657b83;">=</span><span> visibility </span><span style="color:#657b83;">&gt;&gt;</span><span> 32u; </span><span> </span><span style="color:#859900;">if</span><span> depth </span><span style="color:#657b83;">==</span><span> 0lu </span><span style="color:#657b83;">{</span><span> discard; </span><span style="color:#657b83;">} </span><span style="color:#586e75;">// This line is new </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>visibility</span><span style="color:#657b83;">) &gt;&gt;</span><span> 7u; </span><span> </span><span style="color:#268bd2;">let</span><span> instance_id </span><span 
style="color:#657b83;">=</span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> material_id </span><span style="color:#657b83;">=</span><span> meshlet_instance_material_ids</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#859900;">return </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>material_id</span><span style="color:#657b83;">) / </span><span style="color:#6c71c4;">65535.0</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>For pixels where depth is 0 (i.e. the background, i.e. no meshes covering that pixel), we don't need to write depth out. The textures are already cleared to zero by the render pass setup.</p> <p>Adding this single line to discard the background fragments doubled the performance of the resolve depth/material depth passes in the demo scene.</p> <h3 id="issues-with-clearing">Issues With Clearing<a class="zola-anchor" href="#issues-with-clearing" aria-label="Anchor link for: issues-with-clearing" style="visibility: hidden;"></a> </h3> <p>In Bevy we cache resources between frames, and so at the start of the frame, we need to clear the visbuffer texture back to zero to prepare it for use during the frame.</p> <p>Wgpu has some simple <code>CommandEncoder::clear_buffer()</code> and <code>CommandEncoder::clear_texture()</code> commands.
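</p> <p>Stepping back to the resolve shader for a moment: the bit layout it implies can be sketched in plain Rust. The field widths below are inferred from the shifts in the shader (depth in the high 32 bits, a cluster ID and a 7-bit triangle ID in the low 32, and material IDs normalized by 65535); this is an illustration of the layout, not Bevy's actual packing code:</p>

```rust
// Illustrative sketch of the packed 64-bit visbuffer value the resolve shader
// decodes: high 32 bits = depth, low 32 bits = (cluster_id << 7) | triangle_id.
fn pack_visibility(depth: u32, cluster_id: u32, triangle_id: u32) -> u64 {
    assert!(triangle_id < 128); // 7 bits: up to 128 triangles per meshlet
    ((depth as u64) << 32) | (((cluster_id << 7) | triangle_id) as u64)
}

fn unpack_visibility(visibility: u64) -> (u32, u32, u32) {
    let depth = (visibility >> 32) as u32;
    let cluster_id = (visibility as u32) >> 7;
    let triangle_id = (visibility as u32) & 0x7F;
    (depth, cluster_id, triangle_id)
}

// Material IDs are written out as a normalized "material depth" value.
fn material_depth(material_id: u32) -> f32 {
    material_id as f32 / 65535.0
}

fn main() {
    // Round-trip: packing then unpacking recovers the original fields.
    assert_eq!(unpack_visibility(pack_visibility(123456, 999, 42)), (123456, 999, 42));
    // Background pixels stay all-zero, so the resolve pass sees depth == 0.
    assert_eq!(unpack_visibility(0).0, 0);
}
```

<p>A nice property of this kind of layout (though the post doesn't spell it out) is that with depth in the most significant bits, an atomic max on the packed value keeps the nearest fragment in a reverse-Z setup, which is what makes atomic rasterization work at all.</p> <p>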
But their behavior under the hood might be a little unintuitive if you've never used Vulkan before.</p> <p>When I initially switched the visbuffer from a buffer to a storage texture, and switched the clear from <code>CommandEncoder::clear_buffer()</code> to <code>CommandEncoder::clear_texture()</code>, I profiled and was shocked to see this:</p> <center> <p><img src="https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/slow_clear.png" alt="Slow frame trace" /></p> </center> <p>0.68ms spent on a single vkCmdCopyBufferToImage, just to clear the texture. Before, using buffers, it was a simple vkCmdFillBuffer that took 0.01ms. What's going on?</p> <p>Well, under the hood, <code>CommandEncoder::clear_texture()</code> maps to one of the following operations:</p> <ol> <li>If the texture was created with the <code>TextureUsages::RENDER_ATTACHMENT</code> bit set, create a render pass with no draws and fragment load op = clear in order to clear the texture.</li> <li>Otherwise allocate a big buffer filled with zeros, and then use vkCmdCopyBufferToImage to copy zeros to fill the texture.</li> </ol> <p>Option #1 is out since R64Uint/R32Uint textures don't support the <code>TextureUsages::RENDER_ATTACHMENT</code> bit, and of course as we've found out, option #2 is horribly slow.</p> <p>The <em>best</em> option would be to use vkClearColorImage to clear the texture, which should be a similar fast path in the driver to using vkCmdFillBuffer with zeros, but wgpu neither uses vkClearColorImage internally, nor exposes it to users.</p> <p>So instead I wrote a custom compute pass (and all the CPU-side boilerplate that that entails) to manually zero the texture, like so:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">#</span><span>ifdef </span><span style="color:#cb4b16;">MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT </span><span 
style="color:#859900;">@group</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">) </span><span style="color:#859900;">@binding</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span> var meshlet_visibility_buffer: texture_storage_2d&lt;r64uint, write&gt;; </span><span style="color:#859900;">#else </span><span style="color:#859900;">@group</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">) </span><span style="color:#859900;">@binding</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span> var meshlet_visibility_buffer: texture_storage_2d&lt;r32uint, write&gt;; </span><span style="color:#859900;">#</span><span>endif </span><span>var&lt;push_constant&gt; view_size: vec2&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;; </span><span> </span><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">16</span><span>, </span><span style="color:#6c71c4;">16</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">clear_visibility_buffer</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">global_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">global_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#859900;">if any</span><span style="color:#657b83;">(</span><span>global_id.xy </span><span style="color:#657b83;">&gt;=</span><span> view_size</span><span 
style="color:#657b83;">) { </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#859900;">#</span><span>ifdef </span><span style="color:#cb4b16;">MESHLET_VISIBILITY_BUFFER_RASTER_PASS_OUTPUT </span><span> textureStore</span><span style="color:#657b83;">(</span><span>meshlet_visibility_buffer, global_id.xy, </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>0lu</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#859900;">#else </span><span> textureStore</span><span style="color:#657b83;">(</span><span>meshlet_visibility_buffer, global_id.xy, </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>0u</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#859900;">#</span><span>endif </span><span style="color:#657b83;">} </span></code></pre> <p>Still not as fast as vkClearColorImage likely is, but much faster than 0.68ms.</p> <h3 id="texture-atomic-results">Texture Atomic Results<a class="zola-anchor" href="#texture-atomic-results" aria-label="Anchor link for: texture-atomic-results" style="visibility: hidden;"></a> </h3> <p>Overall, the switch is about 0.42ms faster in a very simple demo scene.</p> <center> <p><img src="https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/buffer_trace.png" alt="Old frame trace" /> <em>Old frame trace</em></p> <p><img src="https://jms55.github.io/posts/2025-03-27-virtual-geometry-bevy-0-16/texture_trace.png" alt="New frame trace" /> <em>New frame trace</em></p> </center> <h2 id="upcoming">Upcoming<a class="zola-anchor" href="#upcoming" aria-label="Anchor link for: upcoming" style="visibility: hidden;"></a> </h2> <p>And that's it for virtual geometry stuff I worked on during Bevy 0.16.</p> <p>In related but non-Bevy news, Nvidia revealed their Blackwell RTX 50 series GPUs, with some exciting <a rel="nofollow
noreferrer" href="https://github.com/nvpro-samples/build_all?tab=readme-ov-file#mega-geometry">new meshlet/virtual geometry stuff</a>!</p> <ul> <li>New raytracing APIs (not rasterization!) for meshlet-based acceleration structures (CLAS) that are cheaper to build <ul> <li>And on Blackwell, CLASes use a compressed (but sadly opaque) memory format</li> </ul> </li> <li>New demos using CLASes for animated geometry, dynamic tessellation, and even full Nanite-style virtual geometry!</li> <li>New libraries for generating raytracing-friendly meshlets (i.e. optimized for bounding-box size), and virtual-geometry-oriented DAGs of meshlets</li> </ul> <p>One of the biggest issues with Nanite (besides aggregate geometry like foliage) is that it came about right when realtime raytracing was starting to pick up. Until now, it hasn't been clear how to integrate virtual geometry with raytracing (beyond rasterizing the geometry to a gbuffer, so at least you get more primary visibility detail). These new APIs resolve that issue.</p> <p>Meshoptimizer v0.23 was also released recently, with some new APIs (<code>meshopt_buildMeshletsFlex</code>, <code>meshopt_partitionClusters</code>, <code>meshopt_computeSphereBounds</code>) that I need to try out for DAG building at some point.</p> <p>Finally, of course, I need to work on BVH-based culling for Bevy's virtual geometry. As I went over in my last post, culling is the biggest bottleneck at the moment. I did start working on it during the 0.16 dev cycle, but burned out before the end.
We'll see what happens this cycle.</p> <p>Enjoy Bevy 0.16!</p> Virtual Geometry in Bevy 0.15 2024-11-14T00:00:00+00:00 2024-11-14T00:00:00+00:00 Unknown https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/ <h2 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h2> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/cover.png" alt="Screenshot of some megascans in Bevy 0.15" /></p> <center> <p><em>Original scene by <a rel="nofollow noreferrer" href="https://discord.com/channels/691052431525675048/1302853333387575340/1302853473997422623">Griffin</a>. Slightly broken due to lack of double-sided material support.</em></p> </center> <p>It's been a little over 5 months <a href="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/">since my last post</a> where I talked about the very early prototype of virtual geometry I wrote for Bevy 0.14.</p> <p>While it's still not production ready, the improved version of virtual geometry that will ship in Bevy 0.15 (which is releasing soon) is a very large step in the right direction!</p> <p>In this blog post I'll be going over all the virtual geometry PRs merged since my last post, in chronological order. 
At the end, I'll do a performance comparison of Bevy 0.15 vs 0.14, and finally discuss my roadmap for what I'm planning to work on in Bevy 0.16 and beyond.</p> <p>Like last time, a lot of the larger architectural changes are copied from Nanite, based on the SIGGRAPH presentation, which you should watch if you want to learn more.</p> <p>It's going to be another super long read, so grab some snacks and strap in!</p> <h2 id="arseny-kapoulkine-s-contributions">Arseny Kapoulkine's Contributions<a class="zola-anchor" href="#arseny-kapoulkine-s-contributions" aria-label="Anchor link for: arseny-kapoulkine-s-contributions" style="visibility: hidden;"></a> </h2> <p>PRs <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/13904">#13904</a>, <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/13913">#13913</a>, and <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/14038">#14038</a> improve the performance of the Mesh to MeshletMesh converter, and make it more deterministic. These were written by Arseny Kapoulkine (author of meshoptimizer, the library I use for mesh simplification and meshlet building).
Thanks for the contributions!</p> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/14042">#14042</a>, also by Kapoulkine, fixed a bug with how we calculate the depth pyramid mip level to sample at for occlusion culling.</p> <p>These PRs were actually shipped in Bevy 0.14, but were opened after I published my last post, which is why I'm covering them now.</p> <h2 id="faster-meshletmesh-loading">Faster MeshletMesh Loading<a class="zola-anchor" href="#faster-meshletmesh-loading" aria-label="Anchor link for: faster-meshletmesh-loading" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/14193">#14193</a> improves performance when loading MeshletMesh assets from disk.</p> <p>Previously I was using the <code>bincode</code> and <code>serde</code> crates to serialize and deserialize MeshletMeshes. All I had to do was slap <code>#[derive(Serialize, Deserialize)]</code> on the type, and then I could use <code>bincode::serialize_into()</code> to turn my asset into a slice of bytes for writing to disk, and <code>bincode::deserialize_from()</code> in order to turn a slice of bytes loaded from disk back into my asset type. Easy.</p> <p>Unfortunately, that ease of use came with a good bit of performance overhead, specifically in the deserializing step, where bytes get turned into the asset type. Deserializing the 5mb Stanford Bunny asset I was using for testing took a depressingly long 77ms on my Ryzen 5 2600 CPU.</p> <p>Thinking about the code flow more, we <em>already</em> have an asset -&gt; bytes step. After the asset is loaded into CPU memory, we serialize it <em>back</em> into bytes so that we can upload it to GPU memory. For this, we use the <code>bytemuck</code> crate, which provides functions for casting slices of data that are <code>Pod</code> (plain-old-data, i.e.
just numbers, which all of our asset data is) to slices of bytes, without any real overhead.</p> <p>Why not simply use bytemuck to cast our asset data to slices of bytes, and write that? Similarly for reading from disk, we can simply cast the slice of bytes back to our asset type.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">write_slice</span><span>&lt;T: Pod&gt;</span><span style="color:#657b83;">( </span><span> </span><span style="color:#268bd2;">field</span><span>: </span><span style="color:#859900;">&amp;</span><span>[T], </span><span> </span><span style="color:#268bd2;">writer</span><span>: </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> dyn Write, </span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">Result</span><span>&lt;</span><span style="color:#657b83;">()</span><span>, MeshletMeshSaveOrLoadError&gt; </span><span style="color:#657b83;">{ </span><span> writer.</span><span style="color:#859900;">write_all</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span style="color:#657b83;">(</span><span>field.</span><span style="color:#859900;">len</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u64</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">to_le_bytes</span><span style="color:#657b83;">())</span><span style="color:#859900;">?</span><span>; </span><span> writer.</span><span style="color:#859900;">write_all</span><span style="color:#657b83;">(</span><span>bytemuck::cast_slice</span><span style="color:#657b83;">(</span><span>field</span><span style="color:#657b83;">))</span><span style="color:#859900;">?</span><span>; </span><span> </span><span 
style="color:#859900;">Ok</span><span style="color:#657b83;">(()) </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">read_slice</span><span>&lt;T: Pod&gt;</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">reader</span><span>: </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> dyn Read</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">Result</span><span>&lt;Arc&lt;</span><span style="color:#657b83;">[</span><span>T</span><span style="color:#657b83;">]</span><span>&gt;, std::io::Error&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> len </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_u64</span><span style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">? as </span><span style="color:#268bd2;">usize</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> data: Arc&lt;</span><span style="color:#657b83;">[</span><span>T</span><span style="color:#657b83;">]</span><span>&gt; </span><span style="color:#657b83;">= </span><span>std::iter::repeat_with</span><span style="color:#657b83;">(</span><span>T::zeroed</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">take</span><span style="color:#657b83;">(</span><span>len</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">collect</span><span style="color:#657b83;">()</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> slice </span><span style="color:#657b83;">= </span><span>Arc::get_mut</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> 
data</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">unwrap</span><span style="color:#657b83;">()</span><span>; </span><span> reader.</span><span style="color:#859900;">read_exact</span><span style="color:#657b83;">(</span><span>bytemuck::cast_slice_mut</span><span style="color:#657b83;">(</span><span>slice</span><span style="color:#657b83;">))</span><span style="color:#859900;">?</span><span>; </span><span> </span><span> </span><span style="color:#859900;">Ok</span><span style="color:#657b83;">(</span><span>data</span><span style="color:#657b83;">) </span><span style="color:#657b83;">} </span></code></pre> <p>These two functions are all we need to read and write asset data. <code>write_slice()</code> takes a slice of asset data, writes the length of the slice, and then casts the slice to bytes and writes it to disk. <code>read_slice()</code> reads the length of the slice from disk, allocates an atomically reference counted buffer of that size, and then reads from disk to fill the buffer, casting it back into the asset data type.</p> <p>Writing the entire asset to disk now looks like this:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">write_slice</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>asset.vertex_data, </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> writer</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#859900;">write_slice</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>asset.vertex_ids, </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> writer</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span 
style="color:#859900;">write_slice</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>asset.indices, </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> writer</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#859900;">write_slice</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>asset.meshlets, </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> writer</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#859900;">write_slice</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>asset.bounding_spheres, </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> writer</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span></code></pre> <p>And reading it back from disk looks like this:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> vertex_data </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_slice</span><span style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#268bd2;">let</span><span> vertex_ids </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_slice</span><span style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#268bd2;">let</span><span> indices </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_slice</span><span 
style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#268bd2;">let</span><span> meshlets </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_slice</span><span style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span style="color:#268bd2;">let</span><span> bounding_spheres </span><span style="color:#657b83;">= </span><span style="color:#859900;">read_slice</span><span style="color:#657b83;">(</span><span>reader</span><span style="color:#657b83;">)</span><span style="color:#859900;">?</span><span>; </span><span> </span><span style="color:#859900;">Ok</span><span style="color:#657b83;">(</span><span>MeshletMesh </span><span style="color:#657b83;">{ </span><span> vertex_data, </span><span> vertex_ids, </span><span> indices, </span><span> meshlets, </span><span> bounding_spheres, </span><span style="color:#657b83;">}) </span></code></pre> <p>Total load time from disk to CPU memory for our 5mb MeshletMesh went from 102ms down to 12ms, an 8.5x speedup.</p> <h2 id="software-rasterization">Software Rasterization<a class="zola-anchor" href="#software-rasterization" aria-label="Anchor link for: software-rasterization" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/14623">#14623</a> improves our visbuffer rasterization performance for clusters that appear small on screen (i.e. almost all of them). I rewrote pretty much the entire virtual geometry codebase in this PR, so this is going to be a really long section.</p> <h3 id="motivation">Motivation<a class="zola-anchor" href="#motivation" aria-label="Anchor link for: motivation" style="visibility: hidden;"></a> </h3> <p>If you remember the frame breakdown from the last post, visbuffer rasterization took the largest chunk of our frame time. 
Writing out a buffer of cluster + triangle IDs to render in the culling pass, and then doing a single indirect draw over the total count of triangles does not scale very well.</p> <p>The buffer used a lot of memory (4 bytes per non-culled triangle). The GPU's primitive assembler can't keep up with the sheer number of vertices we're sending it as we're not using indexed triangles (to save extra memory and time spent writing out an index buffer), and therefore lack a vertex cache. And finally, the GPU's rasterizer just performs poorly with small triangles, and we have a <em>lot</em> of small triangles.</p> <p>Current GPU rasterizers expect comparatively few triangles that each cover many pixels. They have performance optimizations aimed at that kind of workload, like shading 2x2 quads of pixels at a time and tile binning of triangles. Meanwhile, our virtual geometry renderer is aimed at millions of tiny triangles that only cover a pixel each. We need a rasterizer aimed at being efficient over the number of triangles, not the number of covered pixels per triangle.</p> <p>We need a custom rasterizer algorithm, written in a compute shader, that does everything the GPU's hardware rasterizer does, but with the extra optimizations stripped out.</p> <h3 id="preparation">Preparation<a class="zola-anchor" href="#preparation" aria-label="Anchor link for: preparation" style="visibility: hidden;"></a> </h3> <p>Before we get to the actual software rasterizer, there's a bunch of prep work we need to do first. Namely, redoing our entire hardware rasterizer setup.</p> <p>In Bevy 0.14, we were writing out a buffer of triangles from the culling pass, and issuing a single indirect draw to rasterize every triangle in the buffer. We're going to throw all that out, and go with a completely new scheme.</p> <p>First, we need a buffer to store a bunch of cluster IDs (the ones we want to rasterize).
We'll have users give a fixed size for this buffer on startup, based on the maximum number of clusters they expect to have visible in a frame in any given scene (not the amount pre-culling and LOD selection).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>MeshletPlugin </span><span style="color:#657b83;">{</span><span> cluster_buffer_slots: </span><span style="color:#6c71c4;">8192 </span><span style="color:#657b83;">} </span><span> </span><span>render_device.</span><span style="color:#859900;">create_buffer</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>BufferDescriptor </span><span style="color:#657b83;">{ </span><span> label: </span><span style="color:#859900;">Some</span><span style="color:#657b83;">(</span><span>&quot;</span><span style="color:#2aa198;">meshlet_raster_clusters</span><span>&quot;</span><span style="color:#657b83;">)</span><span>, </span><span> size: cluster_buffer_slots </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u64 </span><span style="color:#657b83;">* </span><span>size_of::&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u64</span><span>, </span><span> usage: BufferUsages::</span><span style="color:#cb4b16;">STORAGE</span><span>, </span><span> mapped_at_creation: </span><span style="color:#b58900;">false</span><span>, </span><span style="color:#657b83;">})</span><span>; </span></code></pre> <p>Next, we'll set up two indirect commands in some buffers. One for hardware raster, one for software raster. For hardware raster, we're going to hardcode the vertex count to 64 (the maximum number of triangles per meshlet) times 3 (vertices per triangle) total vertices.
We'll also initialize the instance count to zero.</p> <p>This was a scheme I described in my last post, but purposefully avoided due to the lackluster performance. However, now that we're adding a software rasterizer, I expect that almost all clusters will be software rasterized. Therefore, some performance loss for the hardware raster is acceptable, as it should be rarely used. In return, we'll get to use a nice trick in the next step.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>render_device.</span><span style="color:#859900;">create_buffer_with_data</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>BufferInitDescriptor </span><span style="color:#657b83;">{ </span><span> label: </span><span style="color:#859900;">Some</span><span style="color:#657b83;">(</span><span>&quot;</span><span style="color:#2aa198;">meshlet_hardware_raster_indirect_args</span><span>&quot;</span><span style="color:#657b83;">)</span><span>, </span><span> contents: DrawIndirectArgs </span><span style="color:#657b83;">{ </span><span> vertex_count: </span><span style="color:#6c71c4;">64 </span><span style="color:#657b83;">* </span><span style="color:#6c71c4;">3</span><span>, </span><span> instance_count: </span><span style="color:#6c71c4;">0</span><span>, </span><span> first_vertex: </span><span style="color:#6c71c4;">0</span><span>, </span><span> first_instance: </span><span style="color:#6c71c4;">0</span><span>, </span><span> </span><span style="color:#657b83;">} </span><span> .</span><span style="color:#859900;">as_bytes</span><span style="color:#657b83;">()</span><span>, </span><span> usage: BufferUsages::</span><span style="color:#cb4b16;">STORAGE </span><span style="color:#859900;">| </span><span>BufferUsages::</span><span style="color:#cb4b16;">INDIRECT</span><span>, </span><span style="color:#657b83;">})</span><span>; </span><span>
</span><span>render_device.</span><span style="color:#859900;">create_buffer_with_data</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>BufferInitDescriptor </span><span style="color:#657b83;">{ </span><span> label: </span><span style="color:#859900;">Some</span><span style="color:#657b83;">(</span><span>&quot;</span><span style="color:#2aa198;">meshlet_software_raster_indirect_args</span><span>&quot;</span><span style="color:#657b83;">)</span><span>, </span><span> contents: DispatchIndirectArgs </span><span style="color:#657b83;">{</span><span> x: </span><span style="color:#6c71c4;">0</span><span>, y: </span><span style="color:#6c71c4;">1</span><span>, z: </span><span style="color:#6c71c4;">1 </span><span style="color:#657b83;">}</span><span>.</span><span style="color:#859900;">as_bytes</span><span style="color:#657b83;">()</span><span>, </span><span> usage: BufferUsages::</span><span style="color:#cb4b16;">STORAGE </span><span style="color:#859900;">| </span><span>BufferUsages::</span><span style="color:#cb4b16;">INDIRECT</span><span>, </span><span style="color:#657b83;">})</span><span>; </span></code></pre> <p>In the culling pass, after LOD selection and culling, we're going to replace the triangle buffer writeout code with something new.</p> <p>First, we need to decide whether the cluster is going to be software rasterized or hardware rasterized. For this, my current heuristic is to take the cluster's screen-space AABB size we already calculated for occlusion culling, and check how big it is. If it's small (currently &lt; 64 pixels on both axes), then it should be software rasterized.
If it's large, then it gets hardware rasterized.</p> <p>At some point, when I have some better test scenes set up, I'll need to experiment with this parameter and see if I get better results with a different number.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> cluster_is_small </span><span style="color:#657b83;">= </span><span style="color:#859900;">all</span><span style="color:#657b83;">(</span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span>aabb_width_pixels, aabb_height_pixels</span><span style="color:#657b83;">) &lt; </span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">64.0</span><span style="color:#657b83;">))</span><span>; </span></code></pre> <p>Finally, the culling pass needs to output a list of clusters for both software and hardware rasterization. For this, I'm going to borrow a trick from Unreal's Nanite that I learned from this <a rel="nofollow noreferrer" href="https://www.elopezr.com/a-macro-view-of-nanite">frame breakdown</a>.</p> <p>Instead of allocating two buffers (one for SW raster, one for HW raster), we have the one <code>meshlet_raster_clusters</code> buffer that we'll share between them, saving memory. Software rasterized clusters will be added starting from the left side of the buffer, while hardware rasterized clusters will be added from the right side of the buffer.
As long as the buffer is big enough, they'll never overlap.</p> <p>Software rasterized clusters will increment the previously created indirect dispatch (1 workgroup per cluster), while hardware rasterized clusters will increment the previously created indirect draw (one draw instance per cluster).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>var buffer_slot: </span><span style="color:#268bd2;">u32</span><span>; </span><span style="color:#859900;">if</span><span> cluster_is_small </span><span style="color:#859900;">&amp;&amp;</span><span> not_intersects_near_plane </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Append this cluster to the list for software rasterization </span><span> buffer_slot </span><span style="color:#657b83;">=</span><span> atomicAdd</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_software_raster_indirect_args.x, 1u</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Append this cluster to the list for hardware rasterization </span><span> buffer_slot </span><span style="color:#657b83;">=</span><span> atomicAdd</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_hardware_raster_indirect_args.instance_count, 1u</span><span style="color:#657b83;">)</span><span>; </span><span> buffer_slot </span><span style="color:#657b83;">=</span><span> constants.meshlet_raster_cluster_rightmost_slot </span><span style="color:#657b83;">-</span><span> buffer_slot; </span><span style="color:#657b83;">} </span><span>meshlet_raster_clusters</span><span style="color:#657b83;">[</span><span>buffer_slot</span><span style="color:#657b83;">] =</span><span> cluster_id; 
</span></code></pre> <h3 id="hardware-rasterization-and-atomicmax">Hardware Rasterization and atomicMax<a class="zola-anchor" href="#hardware-rasterization-and-atomicmax" aria-label="Anchor link for: hardware-rasterization-and-atomicmax" style="visibility: hidden;"></a> </h3> <p>We can now perform the indirect draw for hardware rasterization, and an indirect dispatch for software rasterization.</p> <p>In the hardware rasterization pass, since we're now spawning <code>MESHLET_MAX_TRIANGLES * 3</code> vertices per cluster, we need extra vertex shader invocations to write NaN triangle positions to ensure the extra triangles get discarded.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>vertex </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">vertex</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">instance_index</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">instance_index</span><span>: </span><span style="color:#268bd2;">u32</span><span>, @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">vertex_index</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">vertex_index</span><span>: </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">) </span><span>-&gt; VertexOutput </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> meshlet_raster_clusters</span><span style="color:#657b83;">[</span><span>meshlet_raster_cluster_rightmost_slot </span><span style="color:#657b83;">-</span><span> instance_index</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id
</span><span style="color:#657b83;">=</span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> var meshlet </span><span style="color:#657b83;">=</span><span> meshlets</span><span style="color:#657b83;">[</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_id </span><span style="color:#657b83;">=</span><span> vertex_index </span><span style="color:#657b83;">/</span><span> 3u; </span><span> </span><span style="color:#859900;">if</span><span> triangle_id </span><span style="color:#657b83;">&gt;= </span><span style="color:#859900;">get_meshlet_triangle_count</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet</span><span style="color:#657b83;">) { </span><span style="color:#859900;">return dummy_vertex</span><span style="color:#657b83;">()</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#586e75;">// ... </span><span style="color:#657b83;">} </span></code></pre> <p>In the fragment shader, instead of writing to a bound render target, we're now going to do an <code>atomicMax()</code> on a storage buffer to store the rasterized visbuffer result. The reason is that we'll need to do the same for the software rasterization pass (because compute shaders don't have access to render targets), so to keep things simple and reuse the same bind group and underlying texture state between the rasterization passes, we're going to stick to using the atomicMax trick for the hardware rasterization pass as well. 
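</p>
<p>As an aside, here's a small CPU-side Rust sketch (illustrative only, not code from Bevy) of why a single <code>atomicMax</code> doubles as the depth test: with the depth bits in the high half of the packed u64, the max of two packed values is always the fragment with the greater depth, which under a reverse-Z depth buffer is the one nearest the camera, and the cluster + triangle IDs just ride along in the low half.</p>

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative pack: depth bits in the high 32 bits, cluster + triangle IDs
// in the low 32 bits. For non-negative floats, the f32 bit pattern is
// monotonically ordered, so comparing packed u64s compares depth first.
fn pack_visibility(depth: f32, packed_ids: u32) -> u64 {
    ((depth.to_bits() as u64) << 32) | packed_ids as u64
}

fn main() {
    // One visbuffer texel, cleared to "no fragment".
    let texel = AtomicU64::new(0);

    // Two fragments land on the same texel. With reverse-Z,
    // larger depth values are nearer the camera.
    let far_fragment = pack_visibility(0.25, 7);
    let near_fragment = pack_visibility(0.75, 42);

    // Write order doesn't matter: fetch_max keeps the nearest fragment.
    texel.fetch_max(far_fragment, Ordering::Relaxed);
    texel.fetch_max(near_fragment, Ordering::Relaxed);

    let winner = texel.load(Ordering::Relaxed);
    assert_eq!((winner >> 32) as u32, 0.75f32.to_bits()); // nearest depth won
    assert_eq!(winner as u32, 42); // and its IDs came with it
}
```

<p>The fragment shader below does the same thing per pixel, just with the atomic living in a storage buffer.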
The Nanite slides describe this in more detail if you want to learn more.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">fragment</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">vertex_output</span><span>: VertexOutput</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#268bd2;">let</span><span> frag_coord_1d </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>vertex_output.position.y</span><span style="color:#657b83;">) * </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>view.viewport.z</span><span style="color:#657b83;">) + </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>vertex_output.position.x</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> depth </span><span style="color:#657b83;">= </span><span>bitcast&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">(</span><span>vertex_output.position.z</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> visibility </span><span style="color:#657b83;">= (</span><span style="color:#268bd2;">u64</span><span style="color:#657b83;">(</span><span>depth</span><span style="color:#657b83;">) &lt;&lt;</span><span> 32u</span><span style="color:#657b83;">) </span><span style="color:#859900;">| </span><span style="color:#268bd2;">u64</span><span style="color:#657b83;">(</span><span>vertex_output.packed_ids</span><span style="color:#657b83;">)</span><span>; </span><span> atomicMax</span><span 
style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_visibility_buffer</span><span style="color:#657b83;">[</span><span>frag_coord_1d</span><span style="color:#657b83;">]</span><span>, visibility</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Special thanks to <a rel="nofollow noreferrer" href="https://github.com/atlv24">@atlv24</a> for adding 64-bit integers and atomic u64 support in wgpu 22, specifically so that I could use it here.</p> <p>Note that there are a couple of improvements we could still make here, pending support in wgpu and naga for some missing features:</p> <ul> <li>R64Uint texture atomics would both be faster than using buffers, and a bit more ergonomic to sample from and debug. This is hopefully coming in wgpu 24, again thanks to @atlv24.</li> <li>Async compute would let us overlap the hardware and software rasterization passes, which would be safe since they're both writing to the same texture/buffer using atomics - another reason to stick with atomics for hardware raster.</li> <li>Wgpu currently requires us to bind an empty render target for the hardware raster, even though we don't ever write to it, which is a waste of VRAM. Ideally we wouldn't need any bound render target.</li> <li>And of course, if we had mesh shaders, I wouldn't use a regular draw at all.</li> </ul> <h3 id="rewriting-the-indirect-dispatch">Rewriting the Indirect Dispatch<a class="zola-anchor" href="#rewriting-the-indirect-dispatch" aria-label="Anchor link for: rewriting-the-indirect-dispatch" style="visibility: hidden;"></a> </h3> <p>Before we get to software rasterization (soon, I promise!), we first have to deal with one final problem.</p> <p>We're expecting to deal with a <em>lot</em> of visible clusters. For each software rasterized cluster, we're going to increment the X dimension of an indirect dispatch, with 1 workgroup per cluster.
On some GPUs (mainly AMD), you're limited to 65536 workgroups per dispatch dimension, which is too low. We need to do the same trick we've done in the past of turning a 1d dispatch into a higher dimension dispatch (in this case 2d), and then reinterpreting it back as a 1d dispatch ID in the shader.</p> <p>Since this is an indirect dispatch, we'll need to run a single-thread shader after the culling pass and before software rasterization, to do the 1d -&gt; 2d remap of the indirect dispatch arguments on the GPU.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">remap_dispatch</span><span style="color:#657b83;">() { </span><span> meshlet_software_raster_cluster_count </span><span style="color:#657b83;">=</span><span> meshlet_software_raster_indirect_args.x; </span><span> </span><span> </span><span style="color:#859900;">if</span><span> meshlet_software_raster_cluster_count </span><span style="color:#657b83;">&gt;</span><span> max_compute_workgroups_per_dimension </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> n </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">ceil</span><span style="color:#657b83;">(</span><span style="color:#859900;">sqrt</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>meshlet_software_raster_cluster_count</span><span 
style="color:#657b83;">))))</span><span>; </span><span> meshlet_software_raster_indirect_args.x </span><span style="color:#657b83;">=</span><span> n; </span><span> meshlet_software_raster_indirect_args.y </span><span style="color:#657b83;">=</span><span> n; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h3 id="the-software-rasterizer">The Software Rasterizer<a class="zola-anchor" href="#the-software-rasterizer" aria-label="Anchor link for: the-software-rasterizer" style="visibility: hidden;"></a> </h3> <p>Finally, we can do software rasterization.</p> <p>The basic idea is to have a compute shader workgroup with size equal to the max triangles per meshlet.</p> <p>Each thread within the workgroup will load 1 vertex of the meshlet, transform it to screen-space, and then write it to workgroup shared memory and issue a barrier.</p> <p>After the barrier, the workgroup will switch to handling triangles, with one thread per triangle. First, each thread will load the 3 indices for its triangle, and then load the 3 vertices from workgroup shared memory based on those indices.</p> <p>Once each thread has the 3 vertices for its triangle, it can compute the position/depth gradients across the triangle, and the screen-space bounding box around the triangle.</p> <p>Each thread can then iterate the bounding box (like Nanite does, choosing to either iterate each pixel or iterate scanlines, based on the bounding box sizes across the subgroup), writing pixels to the visbuffer as it goes using the same atomicMax() method that we used for hardware rasterization.</p> <p>One notable difference from the Nanite slides is that for the scanline variant, I needed to check whether the pixel center was within the triangle for each pixel in the scanline, which the slides don't show.
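</p>
<p>As a minimal illustration of that coverage test, here's a CPU-side edge-function rasterizer in Rust (a toy sketch with made-up names and a y-up, counter-clockwise winding assumption - it matches neither shader exactly): a pixel is covered only if its <em>center</em> lies on the non-negative side of all three edges.</p>

```rust
/// Twice the signed area of triangle (a, b, c); non-negative when c lies
/// to the left of the edge a -> b (counter-clockwise winding, y-up).
fn edge(a: (f32, f32), b: (f32, f32), c: (f32, f32)) -> f32 {
    (b.0 - a.0) * (c.1 - a.1) - (b.1 - a.1) * (c.0 - a.0)
}

/// Collect pixels whose centers fall inside the CCW triangle v0-v1-v2,
/// iterating the screen-space bounding box like the compute rasterizer.
fn rasterize(v0: (f32, f32), v1: (f32, f32), v2: (f32, f32)) -> Vec<(u32, u32)> {
    let min_x = v0.0.min(v1.0).min(v2.0).floor().max(0.0) as u32;
    let max_x = v0.0.max(v1.0).max(v2.0).ceil() as u32;
    let min_y = v0.1.min(v1.1).min(v2.1).floor().max(0.0) as u32;
    let max_y = v0.1.max(v1.1).max(v2.1).ceil() as u32;

    let mut covered = Vec::new();
    for y in min_y..max_y {
        for x in min_x..max_x {
            // Test the pixel *center*, not the pixel corner.
            let p = (x as f32 + 0.5, y as f32 + 0.5);
            if edge(v0, v1, p) >= 0.0 && edge(v1, v2, p) >= 0.0 && edge(v2, v0, p) >= 0.0 {
                // This is where the shader would atomicMax() the visbuffer.
                covered.push((x, y));
            }
        }
    }
    covered
}

fn main() {
    // A right triangle with legs of length 4 covers 10 pixel centers.
    assert_eq!(rasterize((0.0, 0.0), (4.0, 0.0), (0.0, 4.0)).len(), 10);
}
```
<p>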
Not sure if the slides just omitted it for brevity or what, but I got artifacts if I left the check out.</p> <p>There are also some slight differences between my shader and the GPU rasterizer - I didn't implement absolutely every detail. Notably, I skipped fixed-point math and the top-left rule. I should implement these in the future, but for now I haven't seen any issues from skipping them.</p> <h3 id="material-and-depth-resolve">Material and Depth Resolve<a class="zola-anchor" href="#material-and-depth-resolve" aria-label="Anchor link for: material-and-depth-resolve" style="visibility: hidden;"></a> </h3> <p>In Bevy 0.15, after the visbuffer rasterization, we have two final steps.</p> <p>The resolve depth pass reads the visbuffer (which contains packed depth), and writes the depth to the actual depth texture of the camera.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">/// This pass writes out the depth texture. 
</span><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">resolve_depth</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">in</span><span>: FullscreenVertexOutput</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@builtin</span><span style="color:#657b83;">(</span><span>frag_depth</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> frag_coord_1d </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.y</span><span style="color:#657b83;">) *</span><span> view_width </span><span style="color:#657b83;">+ </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.x</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> visibility </span><span style="color:#657b83;">=</span><span> meshlet_visibility_buffer</span><span style="color:#657b83;">[</span><span>frag_coord_1d</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#859900;">return </span><span>bitcast&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>visibility </span><span style="color:#657b83;">&gt;&gt;</span><span> 32u</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>The resolve material depth pass has the same role in Bevy 0.15 that it did in Bevy 0.14, where it writes the material ID of each pixel to a depth texture, so that we can later abuse depth testing 
to ensure we shade the correct pixels during each material draw in the material shading pass.</p> <p>However, you may have noticed that unlike the rasterization pass in Bevy 0.14, the new rasterization passes write only depth and cluster + triangle IDs, and not material IDs. During the rasterization pass, we want to write only the absolute minimum amount of information per pixel (cluster ID, triangle ID, and depth).</p> <p>Because of this, the resolve material depth pass can no longer read the material ID texture and copy it directly to the material depth texture. There's now a new step at the start to first load the material ID based on the visbuffer.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">/// This pass writes out the material depth texture. </span><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">resolve_material_depth</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">in</span><span>: FullscreenVertexOutput</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@builtin</span><span style="color:#657b83;">(</span><span>frag_depth</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> frag_coord_1d </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.y</span><span style="color:#657b83;">) *</span><span> view_width </span><span style="color:#657b83;">+ </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.x</span><span
style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> visibility </span><span style="color:#657b83;">=</span><span> meshlet_visibility_buffer</span><span style="color:#657b83;">[</span><span>frag_coord_1d</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> depth </span><span style="color:#657b83;">=</span><span> visibility </span><span style="color:#657b83;">&gt;&gt;</span><span> 32u; </span><span> </span><span style="color:#859900;">if</span><span> depth </span><span style="color:#657b83;">==</span><span> 0lu </span><span style="color:#657b83;">{ </span><span style="color:#859900;">return </span><span style="color:#6c71c4;">0.0</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span>visibility</span><span style="color:#657b83;">) &gt;&gt;</span><span> 7u; </span><span> </span><span style="color:#268bd2;">let</span><span> instance_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> material_id </span><span style="color:#657b83;">=</span><span> meshlet_instance_material_ids</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Everything above this line is new - the shader used to just load the </span><span> </span><span style="color:#586e75;">// material_id from another texture </span><span> </span><span> </span><span style="color:#859900;">return </span><span 
style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>material_id</span><span style="color:#657b83;">) / </span><span style="color:#6c71c4;">65535.0</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <h3 id="retrospect">Retrospect<a class="zola-anchor" href="#retrospect" aria-label="Anchor link for: retrospect" style="visibility: hidden;"></a> </h3> <p>Software rasterization is a lot of complexity, learning, and work - I spent a lot of time researching how the GPU rasterizer works, rewrote a <em>lot</em> of code, and just writing the software rasterization shader itself and getting it bug-free took a week or two of effort. As you'll see later, I missed a couple of (severe) bugs, which will need correcting.</p> <p>The upside is that performance is a <em>lot</em> better than my previous method (I'll show some numbers at the end of this post), and we can have thousands of tiny triangles on screen without hurting performance.</p> <p>My advice to others working on virtual geometry is to skip software raster until close to the end. If you have mesh shaders, stick with those. From what I've heard from other projects, software raster is only a 10-20% performance improvement over mesh shaders in most scenes, unless you really crank the tiny triangle count (which is admittedly a goal, but not an immediate priority).</p> <p>If, like me, you don't have mesh shaders, then I would still probably stick with only hardware rasterization until you've exhausted other, more fundamental areas to work on, like culling and DAG building. However, I would learn from my mistakes, and not spend so much time trying to get hardware rasterization to be fast. Just stick to writing out a list of visible cluster IDs in the culling shader and have the vertex shader ignore extra triangles, instead of trying to get clever with writing out a buffer of visible triangles and drawing the minimum number of vertices.
You'll eventually add software rasterization, and then the hardware rasterization performance won't be so important.</p> <p>If you do want to implement a rasterizer in software (for virtual geometry, or otherwise), check out the below resources that were a big help for me in learning rasterization and the related math.</p> <ul> <li><a rel="nofollow noreferrer" href="https://kristoffer-dyrkorn.github.io/triangle-rasterizer">A fast and precise triangle rasterizer, by Kristoffer Dyrkorn</a></li> <li><a rel="nofollow noreferrer" href="https://fgiesen.wordpress.com/2013/02/06/the-barycentric-conspirac">The barycentric conspiracy, by Fabian Giesen</a></li> <li><a rel="nofollow noreferrer" href="https://www.youtube.com/watch?v=k5wtuKWmV48">Triangle Rasterization, by pikuma</a></li> </ul> <h2 id="larger-meshlet-sizes">Larger Meshlet Sizes<a class="zola-anchor" href="#larger-meshlet-sizes" aria-label="Anchor link for: larger-meshlet-sizes" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15023">#15023</a> has a bunch of small improvements to virtual geometry.</p> <p>The main change is switching from a maximum 64 vertices and 64 triangles (<code>64v:64t</code>) to 255 vertices and 128 triangles per meshlet (<code>255v:128t</code>). I found that having a less than or equal <code>v:t</code> ratio leads to most meshlets having less than <code>t</code> triangles, which we don't want. 
Having a <code>2v:t</code> ratio leads to more fully-filled meshlets, and I went with <code>255v:128t</code> (which is nearly the same as Nanite, minus the fact that meshoptimizer only supports meshlets with up to 255 vertices) over <code>128v:64t</code> after some performance testing.</p> <p>Note that this change involved some other work, such as adjusting software and hardware raster to work with more triangles, software rasterization looping if needed to load 2 vertices per thread instead of 1, using another bit per triangle ID when packing cluster + triangle IDs to accommodate triangle IDs up to 127, etc.</p> <p>The other changes I made were:</p> <ul> <li>Setting the target error when simplifying triangles to <code>f32::MAX</code> (no point in capping it for continuous LOD, gives better simplification results)</li> <li>Adjusting the threshold to allow less-simplified meshes to still count as having been simplified enough (gets us closer to <code>log2(lod_0_meshlet_count)</code> total LOD levels)</li> <li>Setting <code>group_error = max(group_error, all_child_errors)</code> instead of <code>group_error += max(all_child_errors)</code> (not really sure if this is more or less correct)</li> </ul> <h2 id="screenspace-derived-tangents">Screenspace-derived Tangents<a class="zola-anchor" href="#screenspace-derived-tangents" aria-label="Anchor link for: screenspace-derived-tangents" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15084">#15084</a> calculates tangents at runtime, instead of precomputing them and storing them as part of the MeshletMesh asset.</p> <p>Virtual geometry isn't just about rasterizing huge amounts of high-poly meshes - asset size is also a <em>big</em> factor.
GPUs only have so much memory, disks only have so much space, and transfer speeds from disk to RAM and RAM to VRAM are only so fast (as we discovered in the last post).</p> <p>Looking at our asset data, right now we're storing 48 bytes per vertex, with a single set of vertices shared across all meshlets in a meshlet mesh.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">struct </span><span style="color:#b58900;">MeshletVertex </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">uv</span><span>: vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">tangent</span><span>: vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#657b83;">} </span></code></pre> <p>An easy way to reduce the amount of data per asset is to just remove the explicitly-stored tangents, and instead calculate them at runtime. In the visbuffer resolve shader function, rather than loading 3 vertex tangents and interpolating across the triangle, we can instead calculate the tangent based on UV derivatives across the triangle.</p> <p>The tangent derivation I used was <a rel="nofollow noreferrer" href="https://jcgt.org/published/0009/03/04">"Surface Gradient–Based Bump Mapping Framework"</a> from Morten S. Mikkelsen (author of the <a rel="nofollow noreferrer" href="http://www.metalliandy.com/mikktspace/tangent_space_normal_maps.html">mikktspace</a> standard).
It's a really cool paper that provides a framework for using normal maps in many more scenarios than just screen-space based tangents. Definitely give it a further read.</p> <p>I used the code from this <a rel="nofollow noreferrer" href="https://www.jeremyong.com/graphics/2023/12/16/surface-gradient-bump-mapping">blog post</a> by Jeremy Ong, which also does a great job motivating and explaining the paper.</p> <p>The only issue I ran into is that the <code>tangent.w</code> always came out with the wrong sign compared to the existing mikktspace-tangents I had as a reference. I double checked my math and coordinate space handiness a couple of times, but could never figure out what was wrong. I ended up just inverting the sign after calculating the tangent. If anyone knows what I did wrong, please open an <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/issues">issue</a>!</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://www.jeremyong.com/graphics/2023/12/16/surface-gradient-bump-mapping/#surface-gradient-from-a-tangent-space-normal-vector-without-an-explicit-tangent-basis </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">calculate_world_tangent</span><span style="color:#657b83;">( </span><span> </span><span style="color:#268bd2;">world_normal</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">ddx_world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">ddy_world_position</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span style="color:#268bd2;">ddx_uv</span><span>: vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span> </span><span 
style="color:#268bd2;">ddy_uv</span><span>: vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#657b83;">) </span><span>-&gt; vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Project the position gradients onto the tangent plane </span><span> </span><span style="color:#268bd2;">let</span><span> ddx_world_position_s </span><span style="color:#657b83;">=</span><span> ddx_world_position </span><span style="color:#657b83;">- </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>ddx_world_position, world_normal</span><span style="color:#657b83;">) *</span><span> world_normal; </span><span> </span><span style="color:#268bd2;">let</span><span> ddy_world_position_s </span><span style="color:#657b83;">=</span><span> ddy_world_position </span><span style="color:#657b83;">- </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>ddy_world_position, world_normal</span><span style="color:#657b83;">) *</span><span> world_normal; </span><span> </span><span> </span><span style="color:#586e75;">// Compute the jacobian matrix to leverage the chain rule </span><span> </span><span style="color:#268bd2;">let</span><span> jacobian_sign </span><span style="color:#657b83;">= </span><span style="color:#859900;">sign</span><span style="color:#657b83;">(</span><span>ddx_uv.x </span><span style="color:#657b83;">*</span><span> ddy_uv.y </span><span style="color:#657b83;">-</span><span> ddx_uv.y </span><span style="color:#657b83;">*</span><span> ddy_uv.x</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> var world_tangent </span><span style="color:#657b83;">=</span><span> jacobian_sign </span><span style="color:#657b83;">* (</span><span>ddy_uv.y </span><span style="color:#657b83;">*</span><span> ddx_world_position_s </span><span 
style="color:#657b83;">-</span><span> ddx_uv.y </span><span style="color:#657b83;">*</span><span> ddy_world_position_s</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// The sign intrinsic returns 0 if the argument is 0 </span><span> </span><span style="color:#859900;">if</span><span> jacobian_sign </span><span style="color:#657b83;">!= </span><span style="color:#6c71c4;">0.0 </span><span style="color:#657b83;">{ </span><span> world_tangent </span><span style="color:#657b83;">= </span><span style="color:#859900;">normalize</span><span style="color:#657b83;">(</span><span>world_tangent</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#586e75;">// The second factor here ensures a consistent handedness between </span><span> </span><span style="color:#586e75;">// the tangent frame and surface basis w.r.t. screenspace. </span><span> </span><span style="color:#268bd2;">let</span><span> w </span><span style="color:#657b83;">=</span><span> jacobian_sign </span><span style="color:#657b83;">* </span><span style="color:#859900;">sign</span><span style="color:#657b83;">(</span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>ddy_world_position, </span><span style="color:#859900;">cross</span><span style="color:#657b83;">(</span><span>world_normal, ddx_world_position</span><span style="color:#657b83;">)))</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return vec4</span><span style="color:#657b83;">(</span><span>world_tangent, </span><span style="color:#657b83;">-</span><span>w</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#586e75;">// TODO: Unclear why we need to negate this to match mikktspace generated tangents </span><span style="color:#657b83;">} </span></code></pre> <p>At the cost of a few extra calculations 
in the material shading pass, and some slight inaccuracies compared to explicit tangents (mostly on curved surfaces), we save 16 bytes per vertex, both on disk (although LZ4 compression means we might be saving less in practice), and in memory.</p> <p>16 bytes might not sound like a lot, but our high-poly meshes have a <em>lot</em> of vertices, so the savings are significant, especially in combination with the next PR.</p> <table><thead><tr><th>Explicit Tangents (0.14)</th><th>Implicit tangents (0.15)</th></tr></thead><tbody> <tr><td><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/explicit_tangents.png" alt="Explicit tangents in Bevy 0.14" /></td><td><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/implicit_tangents.png" alt="Implicit tangents in bevy 0.15" /></td></tr> </tbody></table> <p>Also of note is that while trying to debug the sign issue, I found that The Forge had published an <a rel="nofollow noreferrer" href="https://github.com/ConfettiFX/The-Forge/blob/9d43e69141a9cd0ce2ce2d2db5122234d3a2d5b5/Common_3/Renderer/VisibilityBuffer2/Shaders/FSL/vb_shading_utilities.h.fsl#L90-L150">updated version</a> of their partial derivatives calculations, fixing a small bug. 
I updated my WGSL port to match.</p> <h2 id="compressed-per-meshlet-vertex-data">Compressed Per-Meshlet Vertex Data<a class="zola-anchor" href="#compressed-per-meshlet-vertex-data" aria-label="Anchor link for: compressed-per-meshlet-vertex-data" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15643">#15643</a> stores copies of the overall mesh's vertex attribute data per-meshlet, and then heavily compresses it.</p> <h3 id="motivation-1">Motivation<a class="zola-anchor" href="#motivation-1" aria-label="Anchor link for: motivation-1" style="visibility: hidden;"></a> </h3> <p>The whole idea behind virtual geometry is that you only pay (as much as possible, it's of course not perfect) for the geometry currently needed on screen. Zoomed out? You pay the rasterization cost for only a few triangles at a higher LOD, and not for the entire mesh. Part of the mesh occluded? It gets culled. But continuing on with the theme from the last PR, memory usage is also a big cost. We might be able to render a large scene of high-poly meshes with clever usage of LODs and culling, but can we afford to <em>store</em> all that mesh data to begin with in our GPU's measly 8-12 GB of VRAM? (not even accounting for space taken up by material textures, which will reduce our budget even further).</p> <p>The way we fix this is with streaming. Rather than keep everything in memory all the time, you have the GPU write requests of what data it needs to a buffer, read that back onto the CPU, and then load the requested data from disk into a fixed-size GPU buffer. If the GPU no longer needs a piece of data, you mark that section of the buffer as free space, and can write new data to it as new requests come in.</p> <p>Typical implementations of mesh streaming stream discrete LOD levels, but our goal is to be much more fine-grained.
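</p>
<p>The CPU-side bookkeeping for that fixed-size buffer can be sketched as a simple free list over equal-size slots (purely illustrative Rust with hypothetical names - streaming isn't actually implemented in Bevy 0.15):</p>

```rust
/// A fixed-capacity GPU buffer carved into equal-size slots, with a free
/// list so space vacated by unloaded data can be reused for new requests.
struct StreamingBuffer {
    slot_count: u32,
    free_slots: Vec<u32>,
}

impl StreamingBuffer {
    fn new(slot_count: u32) -> Self {
        // Every slot starts out free.
        Self { slot_count, free_slots: (0..slot_count).rev().collect() }
    }

    /// A request was read back from the GPU: pick a slot to upload the
    /// data into, or None if the buffer is currently full.
    fn allocate(&mut self) -> Option<u32> {
        self.free_slots.pop()
    }

    /// The GPU no longer needs this data: mark its slot as free space.
    fn free(&mut self, slot: u32) {
        debug_assert!(slot < self.slot_count);
        self.free_slots.push(slot);
    }
}

fn main() {
    let mut buffer = StreamingBuffer::new(2);
    assert_eq!(buffer.allocate(), Some(0));
    assert_eq!(buffer.allocate(), Some(1));
    assert_eq!(buffer.allocate(), None); // full: the request must wait
    buffer.free(0);
    assert_eq!(buffer.allocate(), Some(0)); // freed space gets reused
}
```
<p>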
Keeping with the theme of only paying for the cluster data you actually need to render the current frame, we want to stream individual meshlets, not whole LOD levels (in practice, Nanite streams fixed-size pages of meshlet data, and not individual meshlets). This presents a problem with our current implementation: since all meshlets reference the same set of vertex data, we have no simple way of unloading or loading vertex data for a single meshlet. While I'm not going to tackle streaming in Bevy 0.15, in this PR I'll be changing the way we store vertex data to solve this problem and unblock streaming in the future.</p> <p>Up until now, each MeshletMesh has had one set of vertex data shared between all meshlets within the mesh. Each meshlet has a local index buffer, mapping triangles to meshlet-local vertex IDs, and then a global index buffer mapping meshlet-local vertex IDs to actual vertex data from the mesh. E.g. triangle corner X within a meshlet points to vertex ID Y within the meshlet, which points to vertex Z within the mesh.</p> <p>In order to support streaming, we're going to move to a new scheme. We will store a copy of vertex data for each meshlet, concatenated together into one slice. All the vertex data for meshlet 0 will be stored as one contiguous slice, with all the vertex data for meshlet 1 stored contiguously after it, and all the vertex data for meshlet 2 after <em>that</em>, etc.</p> <p>Each meshlet's local index buffer will point directly into vertices within the meshlet's vertex data slice, stored as an offset relative to the starting index of the meshlet's vertex data slice within the overall buffer. E.g.
triangle corner X within a meshlet points to vertex Y within the meshlet directly.</p> <p>Besides unblocking streaming, this scheme is also much simpler to reason about, uses fewer dependent memory reads, and works much more nicely with our software rasterization pass, where each thread in the workgroup is loading a single meshlet vertex into workgroup shared memory.</p> <p>That was a lot of background and explanation for what's really a rather simple change, so let me finally get to the main topic of this PR: the problem with duplicating vertex data per meshlet is that we've just increased the size of our MeshletMesh asset by a thousandfold.</p> <p>The solution is quantization and compression.</p> <h3 id="position-compression">Position Compression<a class="zola-anchor" href="#position-compression" aria-label="Anchor link for: position-compression" style="visibility: hidden;"></a> </h3> <p>Meshlets compress pretty well. Starting with vertex positions, there's no reason we need to store a full <code>vec3&lt;f32&gt;</code> per vertex. Most meshlets tend to enclose a fairly small amount of space. Instead of storing vertex positions as coordinates relative to the mesh's origin, we can instead store them in some coordinate space relative to the meshlet bounds.</p> <p>For each meshlet, we'll iterate over all of its vertex positions, and calculate the min and max values for each of the X/Y/Z axes. Then, we can remap each position relative to those bounds by doing <code>p -= min</code>. The positions initially range from <code>[min, max]</code>, and then range from <code>[0, max - min]</code> after remapping.
We can store the <code>min</code> values for each of the X/Y/Z axes (as a full <code>f32</code> each) in the meshlet metadata, and in the shader reverse the remapping by doing <code>p += min</code>.</p> <p>Our first (albeit small) saving becomes apparent: at the cost of 12 extra bytes in the meshlet metadata, we save 3 bits per vertex position due to no longer needing a bit for the sign for each of the X/Y/Z values, as <code>[0, max - min]</code> is never going to contain any negative numbers. We technically now only need a hypothetical <code>f31</code> per axis.</p> <p>However, there's another trick we can perform. If we take the ceiling of the log2 of a range of floating point values <code>ceil(log2(max - min + 1))</code>, we get the minimum number of bits we need to store any value in that range. Rather than storing meshlet vertex positions as a list of <code>vec3&lt;f32&gt;</code>s, we could instead store them as a packed list of bits (a bitstream).</p> <p>E.g. if we determine that we need 4/7/3 bits for the X/Y/Z ranges of the meshlet, we could store a list of bits where bits 0..4 are for vertex 0 axis X, bits 4..11 are for vertex 0 axis Y, bits 11..14 are for vertex 0 axis Z, bits 14..18 are for vertex 1 axis X, bits 18..25 are for vertex 1 axis Y, etc.</p> <p>Again we can store the bit size (as a <code>u8</code>) for each of the X/Y/Z axes within the meshlet's metadata, at a cost of 3 extra bytes. We'll use this later in our shaders to figure out how many bits to read from the bitstream for each of the meshlet's vertices.</p> <p>In practice, if you try this out as-is, you're probably going to end up with fairly large bit sizes per axis, and not actually save any space vs using <code>vec3&lt;f32&gt;</code>. 
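</p>

<p>Mechanically, the bit counting and bitstream packing described above can be sketched on the CPU like this (hypothetical helpers for illustration; Bevy's actual encoder, shown later, builds the stream with <code>extend_from_bitslice</code>):</p>

```rust
/// Minimum bits needed to store any value in [0, max]: ceil(log2(max + 1)).
fn bits_for_max(max: u32) -> u32 {
    if max == 0 { 0 } else { 32 - max.leading_zeros() }
}

/// Append the low `bits` bits of `value` (which must fit in `bits`) to a
/// bitstream stored as packed u32 words, LSB-first.
fn append_bits(stream: &mut Vec<u32>, len_bits: &mut u32, value: u32, bits: u32) {
    if bits == 0 {
        return;
    }
    let word = (*len_bits / 32) as usize;
    let offset = *len_bits % 32;
    if word == stream.len() {
        stream.push(0);
    }
    stream[word] |= value << offset;
    // The value straddles a word boundary; spill the high bits into a new word
    if offset + bits > 32 {
        stream.push(value >> (32 - offset));
    }
    *len_bits += bits;
}

/// Read `bits` bits starting at `start_bit`, handling word straddling the
/// same way the decode shader will.
fn extract_bits_from_stream(stream: &[u32], start_bit: u32, bits: u32) -> u32 {
    if bits == 0 {
        return 0;
    }
    let word = (start_bit / 32) as usize;
    let offset = start_bit % 32;
    let mut chunk = stream[word] >> offset;
    if offset + bits > 32 {
        chunk |= stream[word + 1] << (32 - offset);
    }
    // Mask to `bits`; the u64 intermediate avoids overflow when bits == 32
    chunk & (((1u64 << bits) - 1) as u32)
}
```

<p>With the 4/7/3-bit example above, each vertex occupies 14 bits, vertex i's X channel starts at bit i * 14, and a single channel read may touch two u32 words.</p>

<p>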
This is due to the large amount of precision we have in our vertex positions (a full <code>f32</code>), which leads to a lot of precision needed in the range, and therefore a large bit size.</p> <p>The final trick up our sleeves is that we don't actually <em>need</em> all this precision. If we know that our meshlet's vertices range from 10.2041313123 to 84.382543538, do we really need to know that a vertex happens to be stored at <em>exactly</em> 57.594392822? We could pick some arbitrary amount of precision to round each of our vertices to, say four decimal places, resulting in 57.5944. Less precision means a less precise range, which means our bit size will be smaller.</p> <center> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/quantize_error.png" alt="Too much quantization" /> <em>Don't quantize too much, or you'll get bugs!</em></p> </center> <p>Better yet, let's pick some factor <code>q = 2^p</code>, where <code>p</code> is some arbitrary <code>u8</code> integer. Now, let's snap each vertex to the nearest point on the grid that's a multiple of <code>1/q</code>, and then store the vertex as the number of "steps" of size <code>1/q</code> that we took from the origin to reach the snapped vertex position (a fixed-point representation). E.g. if we say <code>p = 4</code>, then we're quantizing to a grid with a resolution of <code>1/16</code>, so <code>v = 57.594392822</code> would snap to <code>v = 57.625</code> (throwing away some unnecessary precision), and we would store that as <code>v = round(57.594392822 / (1/16)) = i32(57.594392822 * 16 + 0.5) = 922</code>. This is once again easily reversible in our shader so long as we have our factor <code>p</code>: <code>922 / 2^4 = 57.625</code>.</p> <p>The factor <code>p</code> we choose is not particularly important. 
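</p>

<p>The snapping math above, as a minimal sketch (hypothetical helpers mirroring the worked example, not Bevy's actual code):</p>

```rust
/// Snap `v` to a grid of resolution 1/2^p and return the step count
/// (a fixed-point representation). Note that the `+ 0.5`-then-truncate
/// trick only rounds correctly for values >= 0.0.
fn quantize(v: f32, p: u8) -> i32 {
    let q = (1i32 << p) as f32; // grid resolution of 1/q
    (v * q + 0.5) as i32
}

/// Reverse the quantization, given the same factor `p`.
fn dequantize(steps: i32, p: u8) -> f32 {
    steps as f32 / (1i32 << p) as f32
}
```

<p>E.g. <code>quantize(57.594392822, 4)</code> yields the 922 from the example, and <code>dequantize(922, 4)</code> recovers 57.625.</p>

<p>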
I set it to 4 by default (with an additional factor to convert from Bevy's meters to the more appropriate-for-this-use-case unit of centimeters), but users can choose a good value themselves if 4 is too high (unnecessary precision = larger bit sizes and therefore larger asset sizes), or too low (visible mesh deformity from snapping the vertices too-coarsely). Nanite has an automatic heuristic that I assume is based on some kind of triangle surface area to mesh size ratio, but also lets users choose <code>p</code> manually. The important thing to note is that you should <em>not</em> choose <code>p</code> per-meshlet, i.e. <code>p</code> should be the same for every meshlet within the mesh. Otherwise, you'll end up with cracks between meshlets.</p> <p>Finally, we can combine all three of these tricks. We can quantize our meshlet's vertices, find the per-axis min/max values and remap to a better range, and then store as a packed bitstream using the minimum number of bits for the range. The final code to compress a meshlet's vertex positions is below.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> quantization_factor </span><span style="color:#657b83;">= </span><span> </span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">1 </span><span style="color:#657b83;">&lt;&lt;</span><span> vertex_position_quantization_factor</span><span style="color:#657b83;">) </span><span style="color:#859900;">as </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">* </span><span style="color:#cb4b16;">CENTIMETERS_PER_METER</span><span>; </span><span> </span><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> min_quantized_position_channels </span><span style="color:#657b83;">= </span><span>IVec3::</span><span style="color:#cb4b16;">MAX</span><span>; </span><span 
style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> max_quantized_position_channels </span><span style="color:#657b83;">= </span><span>IVec3::</span><span style="color:#cb4b16;">MIN</span><span>; </span><span> </span><span style="color:#586e75;">// Lossy vertex compression </span><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> quantized_positions </span><span style="color:#657b83;">= [</span><span>IVec3::</span><span style="color:#cb4b16;">ZERO</span><span>; </span><span style="color:#6c71c4;">255</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>i, vertex_id</span><span style="color:#657b83;">) </span><span style="color:#859900;">in</span><span> meshlet_vertex_ids.</span><span style="color:#859900;">iter</span><span style="color:#657b83;">()</span><span>.</span><span style="color:#859900;">enumerate</span><span style="color:#657b83;">() { </span><span> </span><span style="color:#268bd2;">let</span><span> position </span><span style="color:#657b83;">= </span><span style="color:#859900;">...</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Quantize position to a fixed-point IVec3 </span><span> </span><span style="color:#268bd2;">let</span><span> quantized_position </span><span style="color:#657b83;">= (</span><span>position </span><span style="color:#657b83;">*</span><span> quantization_factor </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">0.5</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">as_ivec3</span><span style="color:#657b83;">()</span><span>; </span><span> quantized_positions</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">] =</span><span> quantized_position; </span><span> </span><span> </span><span style="color:#586e75;">// Compute per X/Y/Z-channel quantized position 
min/max for this meshlet </span><span> min_quantized_position_channels </span><span style="color:#657b83;">=</span><span> min_quantized_position_channels.</span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>quantized_position</span><span style="color:#657b83;">)</span><span>; </span><span> max_quantized_position_channels </span><span style="color:#657b83;">=</span><span> max_quantized_position_channels.</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>quantized_position</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span style="color:#586e75;">// Calculate bits needed to encode each quantized vertex position channel based on the range of each channel </span><span style="color:#268bd2;">let</span><span> range </span><span style="color:#657b83;">=</span><span> max_quantized_position_channels </span><span style="color:#657b83;">-</span><span> min_quantized_position_channels </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">1</span><span>; </span><span style="color:#268bd2;">let</span><span> bits_per_vertex_position_channel_x </span><span style="color:#657b83;">= </span><span style="color:#859900;">log2</span><span style="color:#657b83;">(</span><span>range.x </span><span style="color:#859900;">as </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">ceil</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u8</span><span>; </span><span style="color:#268bd2;">let</span><span> bits_per_vertex_position_channel_y </span><span style="color:#657b83;">= </span><span style="color:#859900;">log2</span><span style="color:#657b83;">(</span><span>range.y </span><span style="color:#859900;">as </span><span style="color:#268bd2;">f32</span><span 
style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">ceil</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u8</span><span>; </span><span style="color:#268bd2;">let</span><span> bits_per_vertex_position_channel_z </span><span style="color:#657b83;">= </span><span style="color:#859900;">log2</span><span style="color:#657b83;">(</span><span>range.z </span><span style="color:#859900;">as </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">ceil</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">u8</span><span>; </span><span> </span><span style="color:#586e75;">// Lossless encoding of vertex positions in the minimum number of bits per channel </span><span style="color:#859900;">for</span><span> quantized_position </span><span style="color:#859900;">in</span><span> quantized_positions.</span><span style="color:#859900;">iter</span><span style="color:#657b83;">()</span><span>.</span><span style="color:#859900;">take</span><span style="color:#657b83;">(</span><span>meshlet_vertex_ids.</span><span style="color:#859900;">len</span><span style="color:#657b83;">()) { </span><span> </span><span style="color:#586e75;">// Remap [range_min, range_max] IVec3 to [0, range_max - range_min] UVec3 </span><span> </span><span style="color:#268bd2;">let</span><span> position </span><span style="color:#657b83;">= (</span><span>quantized_position </span><span style="color:#657b83;">-</span><span> min_quantized_position_channels</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">as_uvec3</span><span style="color:#657b83;">()</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Store as a packed bitstream </span><span> vertex_positions.</span><span style="color:#859900;">extend_from_bitslice</span><span 
style="color:#657b83;">( </span><span> </span><span style="color:#859900;">&amp;</span><span>position.x.view_bits::&lt;Lsb0&gt;</span><span style="color:#657b83;">()[</span><span style="color:#859900;">..</span><span>bits_per_vertex_position_channel_x </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> vertex_positions.</span><span style="color:#859900;">extend_from_bitslice</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">&amp;</span><span>position.y.view_bits::&lt;Lsb0&gt;</span><span style="color:#657b83;">()[</span><span style="color:#859900;">..</span><span>bits_per_vertex_position_channel_y </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> vertex_positions.</span><span style="color:#859900;">extend_from_bitslice</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">&amp;</span><span>position.z.view_bits::&lt;Lsb0&gt;</span><span style="color:#657b83;">()[</span><span style="color:#859900;">..</span><span>bits_per_vertex_position_channel_z </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <h3 id="position-decoding">Position Decoding<a class="zola-anchor" href="#position-decoding" aria-label="Anchor link for: position-decoding" style="visibility: hidden;"></a> </h3> <p>Before this PR, our meshlet metadata was this 16-byte type:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span 
style="color:#93a1a1;">pub </span><span style="color:#268bd2;">struct </span><span style="color:#b58900;">Meshlet </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">/// The offset within the parent mesh&#39;s [`MeshletMesh::vertex_ids`] buffer where the vertex IDs for this meshlet begin. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">start_vertex_id</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The offset within the parent mesh&#39;s [`MeshletMesh::indices`] buffer where the indices for this meshlet begin. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">start_index_id</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The amount of vertices in this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">vertex_count</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The amount of triangles in this meshlet. 
</span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">triangle_count</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span style="color:#657b83;">} </span></code></pre> <p>With all the custom compression, we need to store some more info, giving us this carefully-packed 32-byte type (a little bit bigger, but reducing size for vertices is much more important than reducing the size of the meshlet metadata):</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">struct </span><span style="color:#b58900;">Meshlet </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">/// The bit offset within the parent mesh&#39;s [`MeshletMesh::vertex_positions`] buffer where the vertex positions for this meshlet begin. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">start_vertex_position_bit</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The offset within the parent mesh&#39;s [`MeshletMesh::vertex_normals`] and [`MeshletMesh::vertex_uvs`] buffers </span><span> </span><span style="color:#586e75;">/// where non-position vertex attributes for this meshlet begin. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">start_vertex_attribute_id</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The offset within the parent mesh&#39;s [`MeshletMesh::indices`] buffer where the indices for this meshlet begin. 
</span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">start_index_id</span><span>: </span><span style="color:#268bd2;">u32</span><span>, </span><span> </span><span style="color:#586e75;">/// The amount of vertices in this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">vertex_count</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// The amount of triangles in this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">triangle_count</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// Unused (needed to satisfy alignment rules). </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">padding</span><span>: </span><span style="color:#268bd2;">u16</span><span>, </span><span> </span><span style="color:#586e75;">/// Number of bits used to store the X channel of vertex positions within this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">bits_per_vertex_position_channel_x</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// Number of bits used to store the Y channel of vertex positions within this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">bits_per_vertex_position_channel_y</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// Number of bits used to store the Z channel of vertex positions within this meshlet. 
</span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">bits_per_vertex_position_channel_z</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// Power of 2 factor used to quantize vertex positions within this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">vertex_position_quantization_factor</span><span>: </span><span style="color:#268bd2;">u8</span><span>, </span><span> </span><span style="color:#586e75;">/// Minimum quantized X channel value of vertex positions within this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">min_vertex_position_channel_x</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#586e75;">/// Minimum quantized Y channel value of vertex positions within this meshlet. </span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">min_vertex_position_channel_y</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span> </span><span style="color:#586e75;">/// Minimum quantized Z channel value of vertex positions within this meshlet. 
</span><span> </span><span style="color:#93a1a1;">pub </span><span style="color:#268bd2;">min_vertex_position_channel_z</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#657b83;">} </span></code></pre> <p>To fetch a single vertex from the bitstream (which we bind as an array of <code>u32</code>s), we can use this function:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">fn </span><span style="color:#b58900;">get_meshlet_vertex_position</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">meshlet</span><span>: ptr&lt;function, Meshlet&gt;, </span><span style="color:#268bd2;">vertex_id</span><span>: </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">) </span><span>-&gt; vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Get bitstream start for the vertex </span><span> </span><span style="color:#268bd2;">let</span><span> unpacked </span><span style="color:#657b83;">=</span><span> unpack4xU8</span><span style="color:#657b83;">((*</span><span>meshlet</span><span style="color:#657b83;">)</span><span>.packed_b</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> bits_per_channel </span><span style="color:#657b83;">=</span><span> unpacked.xyz; </span><span> </span><span style="color:#268bd2;">let</span><span> bits_per_vertex </span><span style="color:#657b83;">=</span><span> bits_per_channel.x </span><span style="color:#657b83;">+</span><span> bits_per_channel.y </span><span style="color:#657b83;">+</span><span> bits_per_channel.z; </span><span> var start_bit </span><span style="color:#657b83;">= (*</span><span>meshlet</span><span style="color:#657b83;">)</span><span>.start_vertex_position_bit 
</span><span style="color:#657b83;">+ (</span><span>vertex_id </span><span style="color:#657b83;">*</span><span> bits_per_vertex</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Read each vertex channel from the bitstream </span><span> var vertex_position_packed </span><span style="color:#657b83;">= </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>0u</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var i </span><span style="color:#657b83;">=</span><span> 0u; i </span><span style="color:#657b83;">&lt;</span><span> 3u; i</span><span style="color:#657b83;">++) { </span><span> </span><span style="color:#268bd2;">let</span><span> lower_word_index </span><span style="color:#657b83;">=</span><span> start_bit </span><span style="color:#657b83;">/</span><span> 32u; </span><span> </span><span style="color:#268bd2;">let</span><span> lower_word_bit_offset </span><span style="color:#657b83;">=</span><span> start_bit </span><span style="color:#859900;">&amp;</span><span> 31u; </span><span> var next_32_bits </span><span style="color:#657b83;">=</span><span> meshlet_vertex_positions</span><span style="color:#657b83;">[</span><span>lower_word_index</span><span style="color:#657b83;">] &gt;&gt;</span><span> lower_word_bit_offset; </span><span> </span><span style="color:#859900;">if</span><span> lower_word_bit_offset </span><span style="color:#657b83;">+</span><span> bits_per_channel</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">] &gt;</span><span> 32u </span><span style="color:#657b83;">{ </span><span> next_32_bits </span><span style="color:#657b83;">|=</span><span> meshlet_vertex_positions</span><span style="color:#657b83;">[</span><span>lower_word_index </span><span style="color:#657b83;">+</span><span> 1u</span><span 
style="color:#657b83;">] &lt;&lt; (</span><span>32u </span><span style="color:#657b83;">-</span><span> lower_word_bit_offset</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> vertex_position_packed</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">] =</span><span> extractBits</span><span style="color:#657b83;">(</span><span>next_32_bits, 0u, bits_per_channel</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">])</span><span>; </span><span> start_bit </span><span style="color:#657b83;">+=</span><span> bits_per_channel</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#586e75;">// Remap [0, range_max - range_min] vec3&lt;u32&gt; to [range_min, range_max] vec3&lt;f32&gt; </span><span> var vertex_position </span><span style="color:#657b83;">= </span><span>vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">(</span><span>vertex_position_packed</span><span style="color:#657b83;">) + </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">( </span><span> </span><span style="color:#657b83;">(*</span><span>meshlet</span><span style="color:#657b83;">)</span><span>.min_vertex_position_channel_x, </span><span> </span><span style="color:#657b83;">(*</span><span>meshlet</span><span style="color:#657b83;">)</span><span>.min_vertex_position_channel_y, </span><span> </span><span style="color:#657b83;">(*</span><span>meshlet</span><span style="color:#657b83;">)</span><span>.min_vertex_position_channel_z, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Reverse vertex quantization </span><span> </span><span 
style="color:#268bd2;">let</span><span> vertex_position_quantization_factor </span><span style="color:#657b83;">=</span><span> unpacked.w; </span><span> vertex_position </span><span style="color:#657b83;">/= </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>1u </span><span style="color:#657b83;">&lt;&lt;</span><span> vertex_position_quantization_factor</span><span style="color:#657b83;">) * </span><span style="color:#cb4b16;">CENTIMETERS_PER_METER</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> vertex_position; </span><span style="color:#657b83;">} </span></code></pre> <p>This could probably be written better - right now we're doing a minimum of 3 <code>u32</code> reads (1 per channel), but there's a good chance that a single <code>u32</code> read will contain the data for all 3 channels of the vertex. Something to optimize in the future.</p> <h3 id="other-attributes">Other Attributes<a class="zola-anchor" href="#other-attributes" aria-label="Anchor link for: other-attributes" style="visibility: hidden;"></a> </h3> <p>Now that we've done positions, let's talk about how to handle other vertex attributes.</p> <p>Tangents we already removed in the last PR.</p> <p>For UVs, I currently store them uncompressed. I could have maybe used half-precision floating point values, but I am wary of artifacts resulting from the reduced precision, so for right now it's a full <code>vec2&lt;f32&gt;</code>. This is a big opportunity for future improvement.</p> <p>Normals are a bit more interesting. They start as <code>vec3&lt;f32&gt;</code>. I first perform an octahedral encoding on them, bringing them down to a <code>vec2&lt;f32&gt;</code> near-losslessly. I then give up some precision to reduce the size even further by using <code>pack2x16snorm()</code>, bringing it down to two 16-bit snorm values packed into a single <code>u32</code>. 
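</p>

<p>A CPU-side sketch of that encode/decode pair (my own illustration of octahedral mapping plus snorm16 packing, not Bevy's exact code; assumes the input normal is already unit length):</p>

```rust
/// Octahedral-encode a unit normal to two values in [-1, 1], then pack each
/// as a 16-bit snorm into one u32 (emulating WGSL's pack2x16snorm).
fn pack_normal(n: [f32; 3]) -> u32 {
    let inv_l1 = 1.0 / (n[0].abs() + n[1].abs() + n[2].abs());
    let (mut u, mut v) = (n[0] * inv_l1, n[1] * inv_l1);
    if n[2] < 0.0 {
        // Fold the lower hemisphere across the octahedron's diagonals
        let (fu, fv) = ((1.0 - v.abs()) * u.signum(), (1.0 - u.abs()) * v.signum());
        u = fu;
        v = fv;
    }
    // Round each value to a signed 16-bit fraction, then pack both into a u32
    let snorm = |x: f32| ((x.clamp(-1.0, 1.0) * 32767.0).round() as i16) as u16 as u32;
    snorm(u) | (snorm(v) << 16)
}

/// Reverse of `pack_normal`: unpack the two snorm16 values (emulating
/// unpack2x16snorm) and decode the octahedral mapping back to a unit vector.
fn unpack_normal(packed: u32) -> [f32; 3] {
    let unsnorm = |bits: u32| (((bits as u16) as i16) as f32 / 32767.0).clamp(-1.0, 1.0);
    let (u, v) = (unsnorm(packed & 0xFFFF), unsnorm(packed >> 16));
    let z = 1.0 - u.abs() - v.abs();
    let (x, y) = if z < 0.0 {
        ((1.0 - v.abs()) * u.signum(), (1.0 - u.abs()) * v.signum())
    } else {
        (u, v)
    };
    let len = (x * x + y * y + z * z).sqrt();
    [x / len, y / len, z / len]
}
```

<p>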
These operations are easily reversed in the shader using the built-in <code>unpack2x16snorm()</code> function, and then the simple octahedral decode step.</p> <p>I <em>did</em> try a bitstream encoding similar to what I did for positions, but couldn't get any smaller sizes than a simple <code>pack2x16snorm()</code>. I think with more time and motivation (I was getting burnt out by the end of this), I could have probably figured out a good variable-size octahedral encoding for normals as well. Something else to investigate in the future.</p> <h3 id="results">Results<a class="zola-anchor" href="#results" aria-label="Anchor link for: results" style="visibility: hidden;"></a> </h3> <p>After all this, how much memory savings did we get?</p> <p>Disk space is practically unchanged (maybe 2% smaller at best), but memory usage on a test mesh went from <code>110 MB</code> before this PR (without duplicating the vertex data per-meshlet at all), to <code>64 MB</code> after this PR (copying and compressing vertex data per-meshlet). This is a huge savings (<code>42%</code> smaller), with room for future improvements! 
I'll definitely be coming back to this at some point in the future.</p> <p>Additional references:</p> <ul> <li><a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf#page=128">https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf#page=128</a></li> <li><a rel="nofollow noreferrer" href="https://arxiv.org/abs/2404.06359">https://arxiv.org/abs/2404.06359</a> (also compresses the index buffer, not just vertices!)</li> <li><a rel="nofollow noreferrer" href="https://daniilvinn.github.io/2024/05/04/omniforce-vertex-quantization.html">https://daniilvinn.github.io/2024/05/04/omniforce-vertex-quantization.html</a></li> <li><a rel="nofollow noreferrer" href="https://gpuopen.com/download/publications/DGF.pdf">https://gpuopen.com/download/publications/DGF.pdf</a> (more focused on raytracing than rasterization)</li> </ul> <h2 id="improved-lod-selection-heuristic">Improved LOD Selection Heuristic<a class="zola-anchor" href="#improved-lod-selection-heuristic" aria-label="Anchor link for: improved-lod-selection-heuristic" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15846">#15846</a> changes how we select the LOD cut.</p> <p>Previously, I was building a bounding sphere around each group with radius based on the group error, and then projecting that to screen space to get the visible error in pixels.</p> <p>That method worked, but isn't entirely watertight. Where you place the bounding sphere center in the group is kind of arbitrary, right? And how do you ensure that the error projection is perfectly monotonic, if you have these random bounding spheres in each group?</p> <p>Arseny Kapoulkine once again helped me out here. 
As part of meshoptimizer, he started experimenting with his <a rel="nofollow noreferrer" href="https://github.com/zeux/meshoptimizer/blob/d93419ced5956307f41333c500c8037c8b861d59/demo/nanite.cpp">nanite.cpp</a> demo. In this PR, I copied his code for LOD cut selection.</p> <p>To determine the group bounding sphere, you simply build a new bounding sphere enclosing all of the group's children's bounding spheres. The first group you build out of LOD 0 uses the LOD 0 culling bounding spheres around each meshlet. This way, you ensure that both the error (using the existing method of taking the max error among the group and group children) <em>and</em> the bounding sphere are monotonic. Error is no longer stored in the radius of the bounding sphere, and is instead stored as a separate f16 (letting us pack both group and parent group error into a single u32; the lost precision is irrelevant). This also gave me the opportunity to clean up the code now that I understand the theory better, and to better clarify the difference between meshlets and meshlet groups.</p> <p>For projecting the error at runtime, we now use the below function. I can't claim to understand how it works that well (and it's been a few weeks since I last looked at it), but it does work. 
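</p>

<p>As an aside, the group-sphere merge step described above can be sketched as a conservative pairwise union (my paraphrase of the idea, not the actual meshoptimizer or Bevy code):</p>

```rust
/// A bounding sphere, as used for both culling and LOD group bounds.
#[derive(Clone, Copy)]
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

/// Smallest sphere enclosing two spheres, placed on the line between their
/// centers. Folding a group's child spheres through this pairwise is
/// conservative (not minimal), but keeps the parent bounds monotonic up the
/// LOD tree.
fn merge(a: Sphere, b: Sphere) -> Sphere {
    let d = [
        b.center[0] - a.center[0],
        b.center[1] - a.center[1],
        b.center[2] - a.center[2],
    ];
    let dist = (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt();
    // One sphere already encloses the other
    if dist + b.radius <= a.radius {
        return a;
    }
    if dist + a.radius <= b.radius {
        return b;
    }
    let radius = (dist + a.radius + b.radius) * 0.5;
    // The new center sits `radius - a.radius` along the segment from a to b
    let t = (radius - a.radius) / dist;
    Sphere {
        center: [
            a.center[0] + d[0] * t,
            a.center[1] + d[1] * t,
            a.center[2] + d[2] * t,
        ],
        radius,
    }
}
```

<p>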
The end result is that we get more seamless LOD changes, and our mesh to meshlet mesh converter is more robust (it used to crash on larger meshes, due to a limitation in the code for how I calculated group bounding spheres).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://github.com/zeux/meshoptimizer/blob/1e48e96c7e8059321de492865165e9ef071bffba/demo/nanite.cpp#L115 </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">lod_error_is_imperceptible</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">lod_sphere</span><span>: MeshletBoundingSphere, </span><span style="color:#268bd2;">simplification_error</span><span>: </span><span style="color:#268bd2;">f32</span><span>, </span><span style="color:#268bd2;">world_from_local</span><span>: mat4x4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">world_scale</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">bool </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> sphere_world_space </span><span style="color:#657b83;">= (</span><span>world_from_local </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>lod_sphere.center, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">))</span><span>.xyz; </span><span> </span><span style="color:#268bd2;">let</span><span> radius_world_space </span><span style="color:#657b83;">=</span><span> world_scale </span><span style="color:#657b83;">*</span><span> lod_sphere.radius; </span><span> </span><span style="color:#268bd2;">let</span><span> error_world_space </span><span style="color:#657b83;">=</span><span> world_scale 
</span><span style="color:#657b83;">*</span><span> simplification_error; </span><span> </span><span> var projected_error </span><span style="color:#657b83;">=</span><span> error_world_space; </span><span> </span><span style="color:#859900;">if</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">] != </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Perspective </span><span> </span><span style="color:#268bd2;">let</span><span> distance_to_closest_point_on_sphere </span><span style="color:#657b83;">= </span><span style="color:#859900;">distance</span><span style="color:#657b83;">(</span><span>sphere_world_space, view.world_position</span><span style="color:#657b83;">) -</span><span> radius_world_space; </span><span> </span><span style="color:#268bd2;">let</span><span> distance_to_closest_point_on_sphere_clamped_to_znear </span><span style="color:#657b83;">= </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>distance_to_closest_point_on_sphere, view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">])</span><span>; </span><span> projected_error </span><span style="color:#657b83;">/=</span><span> distance_to_closest_point_on_sphere_clamped_to_znear; </span><span> </span><span style="color:#657b83;">} </span><span> projected_error </span><span style="color:#657b83;">*=</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">] * </span><span style="color:#6c71c4;">0.5</span><span>; 
</span><span> projected_error </span><span style="color:#657b83;">*=</span><span> view.viewport.w; </span><span> </span><span> </span><span style="color:#859900;">return</span><span> projected_error </span><span style="color:#657b83;">&lt; </span><span style="color:#6c71c4;">1.0</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>As an interesting side note, finding the minimal bounding sphere around a set of other bounding spheres turns out to be a very difficult problem. Kaspar Fischer's thesis <a rel="nofollow noreferrer" href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;type=pdf&amp;doi=f7688a9174e880437e2f467add73905245f4c88c">"The smallest enclosing balls of balls"</a> covers the math, and it's very complex. I copied Kapoulkine's approximate, much simpler method.</p> <h2 id="improved-mesh-to-meshletmesh-conversion">Improved Mesh to MeshletMesh Conversion<a class="zola-anchor" href="#improved-mesh-to-meshletmesh-conversion" aria-label="Anchor link for: improved-mesh-to-meshletmesh-conversion" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15886">#15886</a> brings more improvements to the mesh to meshlet mesh converter.</p> <p>Following on from the last PR, I again took a bunch of improvements from the meshoptimizer nanite.cpp demo:</p> <ul> <li>Consider only the vertex position (and ignore things like UV seams) when determining meshlet groups</li> <li>Add stuck meshlets (ones that either failed to simplify or failed to group) back to the processing queue to try again at a later LOD. Doesn't seem to be much of an improvement, though.</li> <li>Provide a seed to METIS to make the meshlet mesh conversion fully deterministic. I didn't realize METIS even had options before now.</li> <li>Target groups of 8 meshlets instead of 4. This improved simplification quality a lot! 
Nanite uses groups of 8-32 meshlets, presumably based on some kind of heuristic, which is worth experimenting with in the future.</li> <li>Manually lock only vertices belonging to meshlet group borders, instead of the full topological group border that meshoptimizer's <code>LOCK_BORDER</code> flag does.</li> </ul> <p>With all of these changes combined, we can finally reliably get down to a single meshlet (or at least 1-3 meshlets for larger meshes) at the highest LOD!</p> <p>The last item on the list in particular is a <em>huge</em> improvement. With meshoptimizer's <code>LOCK_BORDER</code> flag, the entire edge of the mesh will be locked. That means that at the most simplified LOD level, the entire border of the original mesh will be preserved. You will pretty much never be able to reduce down to 1 meshlet with this constraint. Using manual vertex locks to only lock vertices belonging to shared edges between meshlets (regardless of whether or not they're on the original mesh border) fixes this issue.</p> <h2 id="faster-fill-cluster-buffers-pass">Faster Fill Cluster Buffers Pass<a class="zola-anchor" href="#faster-fill-cluster-buffers-pass" aria-label="Anchor link for: faster-fill-cluster-buffers-pass" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/15955">#15955</a> improves the speed of the fill cluster buffers pass.</p> <h3 id="speeding-up">Speeding Up<a class="zola-anchor" href="#speeding-up" aria-label="Anchor link for: speeding-up" style="visibility: hidden;"></a> </h3> <p>At this point, I had improved rasterization performance, meshlet mesh building, and asset storage and loading. 
The Bevy 0.15 release was coming up, people were winding down features in favor of testing the release candidates, and I wasn't going to have the time (or the motivation) to do another huge PR.</p> <p>While looking at some small things I could improve, I ended up talking with Kirill Bazhenov about how he manages per-instance (entity) GPU data in his <a rel="nofollow noreferrer" href="https://www.youtube.com/watch?v=8gwPw1fySMU">Esoterica</a> renderer.</p> <p>To recap the problem we had in the last post, uploading 8 bytes (instance ID + meshlet ID) per cluster to the GPU was way too expensive. The solution I came up with was to dispatch a compute shader thread per cluster, have it perform a binary search on an array of per-instance data to find the instance and meshlet it belongs to, and then write out the instance and meshlet IDs. This way, we only had to upload 8 bytes per <em>instance</em> to the GPU, and then the cluster -&gt; instance ID + meshlet ID write-outs would be VRAM -&gt; VRAM writes, which are much faster than RAM -&gt; VRAM uploads. This was the fill cluster buffers pass in Bevy 0.14.</p> <p>It's not <em>super</em> fast, but it's also not the bottleneck, and so for a while I was fine leaving it as-is. Kirill, however, showed me a much better way.</p> <p>Instead of having our compute shader operate on a list of clusters and write out the two IDs per cluster, we can turn the scheme on its head. We can instead have the shader operate on a list of <em>instances</em>, and write out the two IDs for each cluster within the instance. After all, each instance already knows which meshlets it contains, so writing out the cluster (an instance of a meshlet) is easy!</p> <p>Instead of dispatching one thread per cluster, now we're going to dispatch one workgroup per instance, with each workgroup having 1024 threads (the maximum allowed). 
Instead of uploading a prefix-sum of meshlet counts per instance, now we're going to upload just a straight count of meshlets per instance (we're still only uploading 8 bytes per instance total).</p> <p>In the shader, each workgroup can load the 8 bytes of data we uploaded for the instance it's processing.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let</span><span> instance_id </span><span style="color:#657b83;">=</span><span> workgroup_id.x; </span><span style="color:#268bd2;">let</span><span> instance_meshlet_count </span><span style="color:#657b83;">=</span><span> meshlet_instance_meshlet_counts</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> instance_meshlet_slice_start </span><span style="color:#657b83;">=</span><span> meshlet_instance_meshlet_slice_starts</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span></code></pre> <p>Then, the first thread in each workgroup can reserve space in the output buffers for its instance's clusters via an atomic counter, and broadcast the start index to the rest of the workgroup.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>var&lt;workgroup&gt; cluster_slice_start_workgroup: </span><span style="color:#268bd2;">u32</span><span>; </span><span> </span><span style="color:#586e75;">// Reserve cluster slots for the instance and broadcast to the workgroup </span><span style="color:#859900;">if</span><span> local_invocation_index </span><span style="color:#657b83;">==</span><span> 0u </span><span style="color:#657b83;">{ </span><span> cluster_slice_start_workgroup </span><span style="color:#657b83;">=</span><span> 
atomicAdd</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_global_cluster_count, instance_meshlet_count</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span><span style="color:#268bd2;">let</span><span> cluster_slice_start </span><span style="color:#657b83;">=</span><span> workgroupUniformLoad</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>cluster_slice_start_workgroup</span><span style="color:#657b83;">)</span><span>; </span></code></pre> <p>Finally, we can have the workgroup loop over its instance's clusters, and for each one, write out its instance ID (which we already have, since it's just the workgroup ID) and meshlet ID (the instance's first meshlet ID, plus the loop counter). Each thread will handle 1 cluster, and the workgroup as a whole will loop enough times to write out all of the instance's clusters.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Loop enough times to write out all the meshlets for the instance given that each thread writes 1 meshlet in each iteration </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var clusters_written </span><span style="color:#657b83;">=</span><span> 0u; clusters_written </span><span style="color:#657b83;">&lt;</span><span> instance_meshlet_count; clusters_written </span><span style="color:#657b83;">+=</span><span> 1024u</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#586e75;">// Calculate meshlet ID within this instance&#39;s MeshletMesh to process for this thread </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id_local </span><span style="color:#657b83;">=</span><span> clusters_written </span><span style="color:#657b83;">+</span><span> local_invocation_index; 
</span><span> </span><span style="color:#859900;">if</span><span> meshlet_id_local </span><span style="color:#657b83;">&gt;=</span><span> instance_meshlet_count </span><span style="color:#657b83;">{ </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#586e75;">// Find the overall cluster ID in the global cluster buffer </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> cluster_slice_start </span><span style="color:#657b83;">+</span><span> meshlet_id_local; </span><span> </span><span> </span><span style="color:#586e75;">// Find the overall meshlet ID in the global meshlet buffer </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id </span><span style="color:#657b83;">=</span><span> instance_meshlet_slice_start </span><span style="color:#657b83;">+</span><span> meshlet_id_local; </span><span> </span><span> </span><span style="color:#586e75;">// Write results to buffers </span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">] =</span><span> instance_id; </span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">] =</span><span> meshlet_id; </span><span style="color:#657b83;">} </span></code></pre> <p>The shader is now very efficient - the workgroup as a whole, once it reserves space for its clusters, is just repeatedly performing contiguous reads from and writes to global GPU memory.</p> <p>Overall, in a test scene with 1041 instances with 32217 meshlets per instance, we went from 0.55ms to 0.40ms, a small 0.15ms savings. NSight now shows that we're at 95% VRAM throughput, and that we're bound by global memory operations. 
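</p> <p>As a back-of-the-envelope sanity check on those numbers - assuming the RTX 3080 used elsewhere in this post, with a peak bandwidth of roughly 760 GB/s (my assumption, not a measured figure) - the write traffic alone accounts for most of the available bandwidth:</p>

```rust
fn main() {
    // Scene from above: 1041 instances, 32217 meshlets per instance
    let clusters: u64 = 1041 * 32217;
    // The pass writes 8 bytes per cluster (instance ID + meshlet ID, one u32 each)
    let bytes_written = clusters * 8;
    // Measured pass time: 0.40ms
    let seconds = 0.40e-3;
    let gb_per_s = bytes_written as f64 / seconds / 1.0e9;
    println!("{clusters} clusters -> {gb_per_s:.0} GB/s from writes alone");
}
```

<p>That works out to roughly 670 GB/s of write traffic, while the 8-bytes-per-instance reads are negligible next to it - consistent with the pass being bandwidth-bound.</p> <p>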
The speed of this pass is now basically dependent on our GPU's bandwidth - there's not much I could do better, short of reading and writing less data entirely.</p> <h3 id="hitting-a-bump">Hitting a Bump<a class="zola-anchor" href="#hitting-a-bump" aria-label="Anchor link for: hitting-a-bump" style="visibility: hidden;"></a> </h3> <p>In the process of testing this PR, I ran into a rather confusing bug. The new fill cluster buffers pass worked on some smaller test scenes, but spawning 1042 instances with 32217 meshlets per instance (cliff mesh) led to the glitch below. It was really puzzling - only some instances would be affected (concentrated in the same region of space), and the clusters themselves appeared to be glitching and changing each frame.</p> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/cluster_limit.png" alt="Glitched mesh" /></p> <p>Debugging the issue was complicated by the fact that the rewritten fill cluster buffers code is no longer deterministic. Clusters get written in different orders depending on how the scheduler schedules workgroups, and on the order of the atomic writes. That meant that every time I clicked on a pass in RenderDoc to check its output, the output order would completely change as RenderDoc replayed the entire command stream up until that point.</p> <p>Since using a debugger wasn't stable enough to be useful, I tried to think the logic through. My first thought was that my rewritten code was subtly broken, but testing on mainline showed something alarming - the issue persisted. Testing several older PRs showed that it went back quite a while. 
It couldn't have been due to any recent code changes.</p> <p>It took me a week or so of trial and error, and debugging on mainline (which did have a stable output order, since it used the old fill cluster buffers shader), but I eventually made the following observations:</p> <ul> <li>1041 cliffs: rendered correctly</li> <li>1042 cliffs: did <em>not</em> render correctly, with 1 glitched instance</li> <li>1041 + N cliffs: the last N instances spawned glitched out</li> <li>1042+ instances of a different mesh with far fewer meshlets than the cliff: <em>did</em> render correctly</li> <li>1042+ cliffs on the PR before I increased meshlet size to 255v/128t: rendered correctly</li> </ul> <p>The issue turned out to be an overflow of the cluster ID. The output of the culling pass, and the data we store in the visbuffer, is cluster ID + triangle ID packed together in a single u32. After increasing the meshlet size, that was 25 bits for the cluster ID, and 7 bits for the triangle ID (2^7 = 128 triangles max).</p> <p>Doing the math, 1042 instances * 32217 meshlets = 33570114 clusters. 2^25 - 33570114 = -15682. We had overflowed the cluster limit by 15682 clusters. This meant that the cluster IDs we were passing around were garbage values, leading to glitchy rendering on any instances we spawned after the first 1041.</p> <p>Obviously this is a problem - the whole point of virtual geometry is to make rendering independent of scene complexity, yet now we have a rather low limit of 2^25 clusters in the scene.</p> <p>The solution is to never store data per cluster in the scene, and only store data per <em>visible</em> cluster in the scene, i.e. clusters post LOD selection and culling. Not necessarily visible on screen, but visible in the sense that we're going to rasterize them. Doing so would require a large number of architectural changes, however, and is not going to be a simple fix. 
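</p> <p>To make the overflow concrete, here's a small sketch of that packing (the helper is hypothetical, but the 25/7 bit split matches the description above):</p>

```rust
// Hypothetical sketch of the visbuffer packing described above:
// 25 bits of cluster ID, 7 bits of triangle ID, in one u32.
const TRIANGLE_BITS: u32 = 7;
const CLUSTER_LIMIT: u32 = 1 << 25;

fn pack(cluster_id: u32, triangle_id: u32) -> u32 {
    (cluster_id << TRIANGLE_BITS) | triangle_id
}

fn main() {
    // 1042 instances * 32217 meshlets overflows the 2^25 cluster limit
    let clusters = 1042u32 * 32217;
    assert_eq!(clusters - CLUSTER_LIMIT, 15_682);
    // Any cluster ID >= 2^25 loses its top bits when packed...
    let garbage = pack(CLUSTER_LIMIT, 0);
    // ...and decodes back to a completely different cluster
    assert_eq!(garbage >> TRIANGLE_BITS, 0);
}
```

<p>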
For now, I've documented the limitation, and merged this PR confident that it's not a regression.</p> <h2 id="software-rasterization-bugfixes">Software Rasterization Bugfixes<a class="zola-anchor" href="#software-rasterization-bugfixes" aria-label="Anchor link for: software-rasterization-bugfixes" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/16049">#16049</a> fixes some glitches in the software rasterizer.</p> <p>While testing out some scenes to prepare for the release, I discovered some previously-missed bugs with software rasterization. When zooming into the scene, sometimes triangles would randomly glitch and cover the whole screen, leading to massive slowdowns (remember, the software rasterizer is meant to operate on small triangles only). Similarly, when zooming out, sometimes there would be single stray pixels rendered that didn't belong. These issues didn't occur with only hardware rasterization enabled.</p> <center> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/stray_pixels.png" alt="Stray pixels glitch" /> <em>Stray pixels on the tops of the pawns and king.</em></p> </center> <p>The stray pixels turned out to be due to two issues. The first bug was in how I calculated the bounding box around each triangle. I wasn't properly accounting for triangles that would be partially on-screen and partially off-screen. I changed my bounding box calculations to stick to floating point, and clamped negative bounds to 0 to fix it. The second bug was that I didn't perform any backface culling in the software rasterizer, and ignoring it does not lead to valid results. If you want a double-sided mesh, then you need to explicitly check for backfacing triangles and invert them. If you want backface culling (I do), then you need to reject the triangle if it's backfacing. 
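</p> <p>The winding test itself is just a 2D cross product on the screen-space vertices. A minimal sketch (hypothetical names, not Bevy's actual code):</p>

```rust
// With counter-clockwise front faces, a triangle is backfacing when the
// signed area of its screen-space projection is non-positive.
// (Note: if your screen-space y axis points down, the convention flips.)
fn is_backfacing(a: [f32; 2], b: [f32; 2], c: [f32; 2]) -> bool {
    let ab = [b[0] - a[0], b[1] - a[1]];
    let ac = [c[0] - a[0], c[1] - a[1]];
    // 2D cross product = twice the triangle's signed area
    let signed_area = ab[0] * ac[1] - ab[1] * ac[0];
    signed_area <= 0.0
}

fn main() {
    // Counter-clockwise winding: front-facing, keep it
    assert!(!is_backfacing([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]));
    // Clockwise winding: backfacing, reject it
    assert!(is_backfacing([0.0, 0.0], [0.0, 1.0], [1.0, 0.0]));
}
```

<p>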
Ignoring it was not an option - skipping backface culling earlier had come back to bite me :).</p> <center> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/fullscreen_triangle.png" alt="Fullscreen triangle glitch" /> <em>The large green and orange triangles aren't supposed to be there.</em></p> </center> <p>The fullscreen triangles were trickier to figure out, but I ended up narrowing it down to near plane clipping. Rasterization math, specifically the homogeneous divide, has a <a rel="nofollow noreferrer" href="https://en.wikipedia.org/wiki/Singularity_(mathematics)">singularity</a> when z = 0. Normally, the way you solve this is by clipping to the near plane, which is a frustum plane positioned slightly in front of z = 0. As long as you provide the plane, GPU rasterizers handle near plane clipping for you automatically. In my software rasterizer, however, I had of course not accounted for near plane clipping. That meant that we were getting NaN/infinity vertex positions due to the singularity during the homogeneous divide, which led to the garbage triangles we were seeing.</p> <p>Proper near plane clipping is somewhat complicated (slow), and should not be needed for most clusters. Rather than have our software rasterizer handle near plane clipping, we're instead going to have the culling pass detect which clusters intersect the near plane, and put them in the hardware rasterization queue regardless of size. 
The fix for this is just two extra lines.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Before </span><span style="color:#859900;">if</span><span> cluster_is_small </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Software raster </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Hardware raster </span><span style="color:#657b83;">} </span><span> </span><span style="color:#586e75;">// After </span><span style="color:#268bd2;">let</span><span> not_intersects_near_plane </span><span style="color:#657b83;">= </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>view.frustum</span><span style="color:#657b83;">[</span><span>4u</span><span style="color:#657b83;">]</span><span>, culling_bounding_sphere_center</span><span style="color:#657b83;">) &gt;</span><span> culling_bounding_sphere_radius; </span><span style="color:#859900;">if</span><span> cluster_is_small </span><span style="color:#859900;">&amp;&amp;</span><span> not_intersects_near_plane </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Software raster </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Hardware raster </span><span style="color:#657b83;">} </span></code></pre> <p>With these changes, software raster is now free of visible bugs.</p> <h2 id="normal-aware-lod-selection">Normal-aware LOD Selection<a class="zola-anchor" href="#normal-aware-lod-selection" aria-label="Anchor link for: normal-aware-lod-selection" style="visibility: hidden;"></a> </h2> <p>PR <a rel="nofollow noreferrer" 
href="https://github.com/bevyengine/bevy/pull/16111">#16111</a> improves how we calculate the LOD cut to account for vertex normals.</p> <p>At the end of Bevy 0.15's development cycle, meshoptimizer 0.22 was released, bringing some simplification improvements. Crucially, it greatly improves <code>meshopt_simplifyWithAttributes()</code>.</p> <p>I now use this function to pass vertex normals into the simplifier, meaning that the deformation error the simplifier outputs (which we feed directly into the LOD cut selection shader) accounts for not only position deformation, but also normal deformation.</p> <p>Before this PR, visualizing pixel positions was near-seamless as the LOD cut changed when you zoomed in or out. Pixel normals, however, had visible differences between LOD cuts. After this PR, normals are now near-seamless too.</p> <p>There's still work to be done in this area - I'm not currently accounting for UV coordinate deformation, and the weights I chose for position vs normal influence are completely arbitrary. The Nanite presentation talks about this problem a lot - pre-calculating an error amount that perfectly accounts for every aspect of human perception, for meshes with arbitrary materials, is a <em>really</em> hard problem. 
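</p> <p>As a toy illustration of the weighting problem - this is a hypothetical combined metric, not meshoptimizer's actual one - folding normal deformation into the error could look like:</p>

```rust
// Hypothetical combined error metric: blend positional deformation with
// weighted normal deformation. The weight is arbitrary - exactly the kind
// of heuristic knob discussed above.
fn combined_error(position_error: f32, normal_error: f32) -> f32 {
    const NORMAL_WEIGHT: f32 = 0.5; // arbitrary
    (position_error * position_error + NORMAL_WEIGHT * normal_error * normal_error).sqrt()
}

fn main() {
    // A simplification that barely moves positions but heavily distorts
    // normals now reports a larger error, delaying the LOD switch.
    let position_only = combined_error(0.1, 0.0);
    let with_normals = combined_error(0.1, 0.4);
    assert!(with_normals > position_only);
}
```

<p>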
The best we can do is spend time tweaking heuristics, which I'll leave for a future PR.</p> <h2 id="results-bevy-0-14-vs-0-15">Results: Bevy 0.14 vs 0.15<a class="zola-anchor" href="#results-bevy-0-14-vs-0-15" aria-label="Anchor link for: results-bevy-0-14-vs-0-15" style="visibility: hidden;"></a> </h2> <p>Finally, I'd like to compare Bevy v0.14 to (what will soon release as) v0.15.</p> <p>The test scene we'll be comparing is 3375 instances of the Stanford bunny mesh arranged in a 15x15x15 cube, running at a resolution of 2240x1260 on an RTX 3080 locked to base clocks.</p> <p>As an additional test scene, we'll also be looking at 847 instances of the <a rel="nofollow noreferrer" href="https://www.fab.com/listings/e16b2143-5512-4460-bd0c-9270c4c6df51">Huge Icelandic Lava Cliff</a> quixel megascan asset arranged in an 11x11x7 rectangular prism. This asset was too big to process in Bevy v0.14, so for this scene we'll only be looking at data from Bevy v0.15.</p> <center style="display: flex; flex-direction: column;"> <p><img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/0.14.png" alt="Bunny scene screenshot v0.14" /> <em>Bunny scene in Bevy v0.14.</em> <img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/0.15.png" alt="Bunny scene screenshot v0.15" /> <em>Bunny scene in Bevy v0.15.</em> <img src="https://jms55.github.io/posts/2024-11-14-virtual-geometry-bevy-0-15/cliffs.png" alt="Cliff scene screenshot v0.15" /> <em>Cliff scene in Bevy v0.15.</em></p> <h3 id="gpu-timings">GPU Timings<a class="zola-anchor" href="#gpu-timings" aria-label="Anchor link for: gpu-timings" style="visibility: hidden;"></a> </h3> <p>GPU timings to render the visbuffer (so excluding shading, and any CPU work)</p> <table style="border-collapse:collapse;border-color:#ccc;border-spacing:0;border:none" class="tg"><thead> <tr><th style="background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, 
sans-serif;font-size:large;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="font-weight:bold;color:#333;background-color:#F0F0F0">Pass</span></th><th style="background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:large;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="font-weight:bold;color:#333;background-color:#F0F0F0">Bunny v0.14</span></th> <th style="background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:large;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="font-weight:bold;color:#333;background-color:#F0F0F0">Bunny v0.15</span></th><th style="background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:large;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="font-weight:bold;color:#333;background-color:#F0F0F0">Cliff v0.15</span></th></tr> </thead> <tbody> <tr><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">Fill Cluster Buffers</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">0.30</span></td><td 
style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">0.12</span></td> <td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">0.31</span></td></tr> <tr><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">Culling First</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">0.99</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">0.19</span></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">1.27</span></td></tr> <tr><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 
5px;text-align:center;vertical-align:top;word-break:normal">Software Raster First</td><td>N/A</td><td>0.42</td><td>0.34</td></tr>
<tr><td>Hardware Raster First</td><td>3.44</td><td>&lt; 0.01</td><td>0.02</td></tr>
<tr><td>Downsample Depth</td><td>0.03</td><td>0.03</td><td>0.05</td></tr>
<tr><td>Culling Second</td><td>0.14</td><td>0.06</td><td>0.19</td></tr>
<tr><td>Software Raster Second</td><td>N/A</td><td>&lt; 0.01</td><td>&lt; 0.01</td></tr>
<tr><td>Hardware Raster Second</td><td>&lt; 0.01</td><td>&lt; 0.01</td><td>&lt; 0.01</td></tr>
<tr><td>Resolve Depth</td><td>N/A</td><td>0.04</td><td>0.05</td></tr>
<tr><td>Resolve Material Depth</td><td>0.04</td><td>0.04</td><td>0.04</td></tr>
<tr><td>Downsample Depth</td><td>0.03</td><td>0.03</td><td>0.05</td></tr>
<tr><td><strong>Total</strong></td><td><strong>4.97 ms</strong></td><td><strong>0.93 ms</strong></td><td><strong>2.32 ms</strong></td></tr>
</tbody></table>
<h3 id="dag-layout">DAG Layout<a class="zola-anchor" href="#dag-layout" aria-label="Anchor link for: dag-layout" style="visibility: hidden;"></a>
</h3>
<table class="tg"><thead>
<tr><th colspan="3"><strong>Bunny v0.14</strong></th><th colspan="3"><strong>Bunny v0.15</strong></th><th colspan="3"><strong>Cliff v0.15</strong></th></tr>
</thead>
<tbody>
<tr><td><strong>LOD Level</strong></td><td><strong>Meshlets</strong></td><td><strong>Meshlets With 64 Triangles (full)</strong></td><td><strong>LOD Level</strong></td><td><strong>Meshlets</strong></td><td><strong>Meshlets With 128 Triangles (full)</strong></td><td><strong>LOD Level</strong></td><td><strong>Meshlets</strong></td><td><strong>Meshlets With 128 Triangles (full)</strong></td></tr>
<tr><td>0</td><td>2251</td><td>2250</td><td>0</td><td>1126</td><td>1125</td><td>0</td><td>15616</td><td>15615</td></tr>
<tr><td>1</td><td>1320</td><td>931</td><td>1</td><td>608</td><td>517</td><td>1</td><td>7944</td><td>7610</td></tr>
<tr><td>2</td><td>672</td><td>383</td><td>2</td><td>310</td><td>251</td><td>2</td><td>4306</td><td>3535</td></tr>
<tr><td>3</td><td>373</td><td>172</td><td>3</td><td>162</td><td>129</td><td>3</td><td>2200</td><td>1728</td></tr>
<tr><td>4</td><td>173</td><td>47</td><td>4</td><td>80</td><td>61</td><td>4</td><td>1109</td><td>844</td></tr>
<tr><td>5</td><td>74</td><td>15</td><td>5</td><td>38</td><td>29</td><td>5</td><td>552</td><td>425</td></tr>
<tr><td>6</td><td>19</td><td>4</td><td>6</td><td>20</td><td>15</td><td>6</td><td>282</td><td>214</td></tr>
<tr><td></td><td></td><td></td><td>7</td><td>10</td><td>7</td><td>7</td><td>139</td><td>105</td></tr>
<tr><td></td><td></td><td></td><td>8</td><td>5</td><td>3</td><td>8</td><td>69</td><td>51</td></tr>
<tr><td></td><td></td><td></td><td>9</td><td 
style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">3</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">2</span></td> <td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">9</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">35</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">26</span></td></tr> <tr><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td 
style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">10</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">2</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">1</span></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">10</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">18</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">13</span></td></tr> 
<tr><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td> <td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">11</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">1</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">1</span></td> <td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">11</span></td><td 
style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">9</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">6</span></td></tr> <tr><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 
5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">12</span></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">5</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">3</span></td></tr> <tr><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td> <td 
style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">13</span></td> <td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">2</span></td><td style="background-color:#f9f9f9;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#F9F9F9">1</span></td></tr> <tr><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, 
sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">14</span></td> <td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">2</span></td><td style="background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:center;vertical-align:top;word-break:normal"><span style="color:#333;background-color:#FFF">1</span></td></tr> </tbody></table> <h3 id="disk-usage">Disk Usage<a class="zola-anchor" href="#disk-usage" aria-label="Anchor link for: disk-usage" style="visibility: hidden;"></a> </h3> <table style="border-collapse:collapse;border-color:#ccc;border-spacing:0;border:none" class="tg"><thead> <tr><th 
>Bunny v0.14</th><th>Bunny v0.15</th> <th>Cliff v0.15</th></tr> </thead> <tbody><tr><td>5.05 MB</td><td>3.61 MB</td><td>49.83 MB</td></tr></tbody></table> <h3
id="memory-usage">Memory Usage<a class="zola-anchor" href="#memory-usage" aria-label="Anchor link for: memory-usage" style="visibility: hidden;"></a> </h3> <table style="border-collapse:collapse;border-color:#ccc;border-spacing:0;border:none" class="tg"><thead><tr><th colspan="2"><strong>Bunny v0.14</strong></th><th colspan="2"><strong>Bunny v0.15</strong></th><th colspan="2"><strong>Cliff v0.15</strong></th></tr></thead> <tbody> <tr><td><strong>Data Type</strong></td><td><strong>Size (bytes)</strong></td><td><strong>Data Type</strong></td><td><strong>Size (bytes)</strong></td><td><strong>Data Type</strong></td><td><strong>Size (bytes)</strong></td></tr> <tr><td>Vertex Data</td><td>3505296</td><td>Vertex Positions</td><td>590132</td><td>Vertex Positions</td><td>8537220</td></tr> <tr><td>Vertex IDs</td><td>3651840</td><td>Vertex Normals</td><td>788476</td><td>Vertex Normals</td><td>10851996</td></tr> <tr><td></td><td></td><td>Vertex UVs</td><td>1576952</td><td>Vertex UVs</td><td>21703992</td></tr> <tr><td>Indices</td><td>2738880</td><td>Indices</td><td>1374336</td><td>Indices</td><td>19245696</td></tr> <tr><td>Meshlets</td><td>16 * 4882 = 78112</td><td>Meshlets</td><td>32 * 2365 = 75680</td><td>Meshlets</td><td>32 * 32288 = 1033216</td></tr> <tr><td>Bounding Spheres</td><td>234336</td><td>Bounding Spheres</td><td>113520</td><td>Bounding Spheres</td><td>1549824</td></tr> <tr><td></td><td></td><td>Simplification Errors</td><td>9460</td><td>Simplification Errors</td><td>129152</td></tr> <tr><td><strong>Total</strong></td><td><strong>10.2 MB</strong></td><td><strong>Total</strong></td><td><strong>4.5 MB</strong></td><td><strong>Total</strong></td><td><strong>63.0 MB</strong></td></tr> </tbody></table> </center> <h3 id="discussion-asset">Discussion - Asset<a class="zola-anchor" href="#discussion-asset" aria-label="Anchor link for: discussion-asset" style="visibility: 
hidden;"></a> </h3> <p>First, let's compare the DAG layout between the Stanford bunny in Bevy 0.14 and 0.15. In Bevy 0.14, with 64 triangles max per meshlet, we start with 2250 meshlets at LOD 0. In Bevy 0.15, with 128 triangles max per meshlet, we have exactly half as many at 1125.</p> <p>In Bevy 0.14, the DAG has 7 levels, ending with 19 meshlets. In Bevy 0.15, the DAG has 12 levels, ending at a single meshlet! For an ideal DAG, we want half as many meshlets at each LOD level, resulting in half as many triangles at each level. That means that with 1125 meshlets at LOD level 0, we want <code>ceil(log2(1125)) = 11</code> additional levels, for 12 total. In Bevy 0.15, we have 12! Meanwhile in Bevy 0.14, we also want 12 levels, but fall short at only 7 levels. We clearly improved the DAG structure compared to the previous version.</p> <p>Comparing meshlet fill rate (percentage of meshlets with the maximum number of triangles), both versions have an almost 100% fill rate at LOD 0 (the mesh is probably not a perfect multiple of the max triangle count). Meshoptimizer does a great job of equally partitioning triangles for the initial mesh.</p> <p>However, looking at further LOD levels, Bevy 0.14 performs very badly, going down to an abysmal 20% fill rate at the lowest. Bevy 0.15 is a lot better, with the worst fill rate being 76%, and the variance being a lot lower. It's still not perfect - a lot of the time we still have to deal with stuck triangles that never get simplified when processing complex meshes - but it's good progress!</p> <p>Memory and disk size are also much lower in Bevy 0.15 than in Bevy 0.14, although a lot of this (but not all) comes down to the ~half as many overall meshlets in the DAG, meaning that there's less data to store in the first place.
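</p>
<p>As a sanity check on the ideal-depth arithmetic above, the expected DAG level count for a given LOD 0 meshlet count can be computed directly (a standalone sketch; the function name is mine, not part of Bevy's API):</p>

```rust
// Ideal DAG depth: halve the meshlet count at every LOD level until a
// single root meshlet remains. That's ceil(log2(n)) halvings, plus LOD 0
// itself.
fn ideal_dag_levels(lod0_meshlets: u32) -> u32 {
    (lod0_meshlets as f64).log2().ceil() as u32 + 1
}

fn main() {
    // 1125 meshlets at LOD 0 (the Bevy 0.15 bunny) wants 12 levels total,
    // which is exactly what the measured DAG achieves.
    assert_eq!(ideal_dag_levels(1125), 12);
    println!("{}", ideal_dag_levels(1125));
}
```
<p>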
Still, adding up the vertex info for Bevy 0.14 (vertex data + vertex IDs = <code>7.16 MB</code>) and for Bevy 0.15 (vertex positions + normals + UVs = <code>2.956 MB</code>) shows a clear reduction in memory usage for the same number of triangles in the original mesh.</p> <h3 id="discussion-performance">Discussion - Performance<a class="zola-anchor" href="#discussion-performance" aria-label="Anchor link for: discussion-performance" style="visibility: hidden;"></a> </h3> <p>Of course, asset size doesn't matter if performance is worse. After all, we could skip the additional LOD levels entirely to save on the cost of storing them, but we would get much worse runtime performance.</p> <p>The good news is that comparing the bunny scene in Bevy 0.14 to Bevy 0.15, rendering got almost 5x faster!</p> <p>Rasterization is the big immediate win. We were spending 3.44 ms on it in Bevy 0.14, and now only 0.42 ms in Bevy 0.15! Some of this comes down to software raster being faster than our non-mesh shader hardware raster, but a lot of it comes down to our improved DAG creation and LOD selection code. DAG building is really, really important - a huge chunk of your runtime performance comes down to building a good DAG, before you even start rendering!</p> <p>Culling (which is also LOD selection) got a little bit faster as well, going from 0.99 ms to 0.19 ms in the first pass, and 0.14 ms to 0.06 ms in the second pass.
The culling pass no longer has to write out a list of triangles for visible clusters - now it's just writing a single cluster ID for each visible cluster, which is much faster.</p> <p>The other big win for culling is that with ~half as many meshlets to process, we only have to do half the work, as evidenced by the second pass performing a little over twice as well (the second pass here is basically just measuring overhead from spawning threads per cluster, since it's doing a single read + early-out for every single cluster, as occlusion culling is near-perfect in a static scene like this).</p> <p>Looking at the cliff scene, with a much larger number of meshlets and triangles concentrated into much fewer instances, we can see some interesting results. Rasterization is actually <em>faster</em> in this scene than the bunny scene by 0.08 ms, but the first culling pass takes a whopping 1.27 ms, up from only 0.19 ms. Ouch. We ideally want similar timings no matter the type of scene, so that artists don't have to care about things like the number of triangles per mesh, but we're not quite there yet. Culling is the clear bottleneck.</p> <p>Finally, fill cluster buffers got a little bit faster as well, going down from 0.30 ms to 0.12 ms, with a good chunk of the performance again coming from having half as many total clusters in the scene.</p> <h2 id="roadmap">Roadmap<a class="zola-anchor" href="#roadmap" aria-label="Anchor link for: roadmap" style="visibility: hidden;"></a> </h2> <p>I got a lot done in Bevy 0.15, but there's still a <em>ton</em> left to do for Bevy 0.16 and beyond.</p> <p>The major, immediate priority (once I'm rested and ready to work on virtual geometry again) will be improving the culling/LOD selection pass.
While cluster selection (I should rename the pass to that, that's a good name now that I think of it) is an <a rel="nofollow noreferrer" href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> problem in theory, in practice, having to dispatch a thread per cluster in the scene is an enormous waste of time. There can be millions of clusters in the scene, and divergence and register usage on top of the sheer number of threads needed means that this pass is currently the biggest bottleneck.</p> <p>The fix is to (like Nanite does) traverse a BVH (tree) of clusters, where we only need to process clusters up until they would be the wrong LOD, and can then immediately stop processing their children. Doing tree traversal on a GPU is very tricky, and doing it maximally efficiently depends on <a rel="nofollow noreferrer" href="https://arxiv.org/pdf/2109.06132v1">undefined behavior</a> of GPU schedulers that not all GPUs have, so I expect to spend a lot of time tweaking this once I get something working.</p> <p>The second major priority is getting rid of the need for the fill cluster buffers pass entirely. Besides letting us reclaim some more performance, the big win is that we could do away with the need to allocate buffers holding instance ID + cluster ID for every cluster in the scene, instead letting us store this data per <em>visible</em> (post LOD selection/culling) cluster. Besides the obvious memory savings, it also saves us from running into the cluster ID limit issue that was limiting our scene size before.
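</p>
<p>The BVH traversal scheme described above can be sketched on the CPU as a recursive cut through the cluster tree (all names and the error metric here are illustrative, not Bevy's actual data structures; the real version would be a GPU compute pass):</p>

```rust
// Sketch of selecting clusters from a BVH/DAG of simplified clusters:
// descend until a node's simplification error is acceptable for the
// current view, emit that cluster, and skip its entire subtree.
struct ClusterNode {
    error: f32,                  // projected error if this cluster is rendered
    children: Vec<ClusterNode>,  // finer-LOD clusters this node simplifies
}

fn select_clusters<'a>(node: &'a ClusterNode, max_error: f32, out: &mut Vec<&'a ClusterNode>) {
    if node.error <= max_error || node.children.is_empty() {
        // Coarse enough (or a leaf): render this cluster, stop descending.
        out.push(node);
    } else {
        for child in &node.children {
            select_clusters(child, max_error, out);
        }
    }
}

fn main() {
    // Root is too coarse for the view, so both finer children get selected.
    let root = ClusterNode {
        error: 2.0,
        children: vec![
            ClusterNode { error: 0.5, children: vec![] },
            ClusterNode { error: 0.5, children: vec![] },
        ],
    };
    let mut selected = Vec::new();
    select_clusters(&root, 1.0, &mut selected);
    assert_eq!(selected.len(), 2);
}
```
<p>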
We would no longer need a unique ID for each cluster in the scene - just a unique ID for visible clusters only, post culling and LOD selection, which is a much smaller amount.</p> <p>Besides cluster selection improvements, and improving on existing stuff, other big areas I could work on include:</p> <ul> <li>Streaming of meshlet vertex data (memory savings)</li> <li>Disk-oriented asset compression (disk and load time savings)</li> <li>Rendering clusters for all views at once (performance savings for shadow views)</li> <li>Material shader optimizations (I haven't spent any time at all on this yet)</li> <li>Occlusion culling fixes (I plan to port Hans-Kristian Arntzen's Granite renderer's <a rel="nofollow noreferrer" href="https://github.com/Themaister/Granite/blob/7543863d2a101faf45f897d164b72037ae98ff74/assets/shaders/post/hiz.comp">HiZ shader</a> to WGSL)</li> <li>Tooling to make working with MeshletMeshes easier</li> <li>Testing and improving CPU performance for large amounts of instances</li> </ul> <p>With any luck, in another few months I'll be writing about some of these topics in the post for Bevy 0.16. 
See you then!</p> <h2 id="appendix">Appendix<a class="zola-anchor" href="#appendix" aria-label="Anchor link for: appendix" style="visibility: hidden;"></a> </h2> <p>Further resources on Nanite-style virtual geometry:</p> <ul> <li><a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf">https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/jglrxavpok/Carrot">https://github.com/jglrxavpok/Carrot</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/LVSTRI/IrisVk">https://github.com/LVSTRI/IrisVk</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/pettett/multires">https://github.com/pettett/multires</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/Scthe/nanite-webgpu">https://github.com/Scthe/nanite-webgpu</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/ShawnTSH1229/SimNanite">https://github.com/ShawnTSH1229/SimNanite</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/SparkyPotato/radiance">https://github.com/SparkyPotato/radiance</a></li> <li><a rel="nofollow noreferrer" href="https://github.com/zeux/meshoptimizer/blob/master/demo/nanite.cpp">https://github.com/zeux/meshoptimizer/blob/master/demo/nanite.cpp</a></li> </ul> Bevy's Fourth Birthday - A Year of Meshlets 2024-08-30T00:00:00+00:00 2024-08-30T00:00:00+00:00 Unknown https://jms55.github.io/posts/2024-08-30-bevy-fourth-birthday/ <blockquote> <p>Written in response to <a rel="nofollow noreferrer" href="https://bevyengine.org/news/bevys-fourth-birthday">Bevy's Fourth Birthday</a>.</p> </blockquote> <h3 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h3> <p>The subtitle of this post is "Bevy's Fourth Birthday; Already???". 
I feel like I <em>just</em> wrote Bevy's <em>third</em> birthday reflections only a couple of months ago... time flies!</p> <p>It's been an awesome year, with a lot accomplished. Let's talk about that.</p> <h3 id="a-year-11-months-of-meshlets"><del>A Year</del> 11 Months of Meshlets<a class="zola-anchor" href="#a-year-11-months-of-meshlets" aria-label="Anchor link for: a-year-11-months-of-meshlets" style="visibility: hidden;"></a> </h3> <p>What have I been doing in Bevy in the last year? The answer is learning and reimplementing the techniques behind Nanite (virtual geometry). That's mostly it.</p> <p>No really - the first commit I can find related to Bevy's meshlet feature (I need a better name for it...) is dated September 30th 2023. That's a little less than 2 months after I wrote Bevy's third birthday post, and around 11 months before the time of this writing.</p> <p>I <em>did</em> work on some other stuff - PCF being the most notable feature, along with some optimizations like async pipeline compilation to prevent shader stutter, and some experimental work that didn't pan out, like Solari and an improved render graph. But the large majority of my time has been spent on meshlets. In fact, this is going to be the third post on my blog in total - the first being Bevy's third birthday post, and the second being a huge writeup on my initial learnings from implementing meshlets.</p> <p>And I'm going to say it - I'm really proud of my work on this. It's an absolutely <em>massive</em> project spanning so many different concepts. It's been immensely rewarding, but also immensely draining. I've felt like quitting at times, and questioned the value it provides given that it's an AAA-focused feature for a non-AAA-ready engine. But I've stuck with the project, and right now I can say that it's been worth it. Maybe it's not production ready yet (it's definitely not). Maybe there's still a ton of major things left to do, let alone optimize and tweak.
Maybe occlusion culling is broken and I'm really avoiding looking at it because it's going to be painful to debug and fix; who can say?</p> <p>But I've learned a lot (really, a <em>lot</em>). It got referenced during a SIGGRAPH 2024 Advances in Real-Time Rendering in Games presentation. Brian Karis (the author of Nanite) mentioned that they enjoyed my blog post explaining it. Getting recognition and seeing people enjoy it has been awesome! And most of all, I'm immensely proud of myself and the work I put into it. Meshlets has been a journey, but a worthwhile one.</p> <p>Needless to say, in the next year, expect even more meshlet work. A lot has already been done since my last blog post - you'll see some of that when Bevy 0.15 releases. Hopefully I'll continue to be able to avoid burnout.</p> <h3 id="bevy-in-general">Bevy in General<a class="zola-anchor" href="#bevy-in-general" aria-label="Anchor link for: bevy-in-general" style="visibility: hidden;"></a> </h3> <p>Bevy 0.12, 0.13, and 0.14 were all released in the last year, and have brought an absolutely massive amount of improvements. Nice job everyone! Bevy is not just one person, or even 10, and I think that has really shown this year more than ever.</p> <p>Unlike last year, I don't have much I want to discuss in depth here, but there's a few things I want to talk about, in no particular order.</p> <p>Alice was hired (thank you sponsors) as Bevy's project manager. She's done an amazing job helping push forward PRs, coordinate developers, and generally get things done. I think I speak for all the Bevy devs when I say getting things done is nice. I'm looking forwards to more of that next year - thanks Alice!</p> <p>One thing I'd like to see from project management going forwards is closing 90% of our open PRs. We have hundreds of open PRs, some dating back to 2021. There's no way any of that is getting merged, and in my opinion, we should be closing old PRs and converting stuff we still want to open issues. 
The more open PRs we have, the harder it is for maintainers and new contributors to help review and push forward work. A <em>lot</em> of PRs end up rotting, and sadly we're losing contributors. I've started this process for PRs labeled rendering, but there's still a lot left to do, and a bunch of PRs that need SME or maintainer decisions. Gamedev is a bit unique given the extremely wide range of subjects (rendering, physics, assets, game logic, artist tooling, etc), but maybe there are things we can learn from other large open source projects. Blender, Godot, others - any advice for us?</p> <p>One of Bevy's most requested features (including from me) is a GUI program (editor) for modifying, inspecting, and profiling scenes. A couple of months ago I volunteered to help coordinate and push forward editor-related work, and then uhh, pretty much stopped working on it a few weeks after. Turns out, I didn't have the motivation to do both editor work and meshlet work. Meshlet work ended up winning out. Sorry to everyone I let down on that. I <em>am</em> still excited to work on the editor, but unfortunately I've realized I'm not so motivated to work on more foundational areas such as scene editing, asset processing, and especially UI frameworks. These subjects tend to circle around without much real progress, and it turns out I am not the kind of leader able to push forward discussions in these areas.</p> <p>Side note, I <em>also</em> released my own UI framework this year (competing with the tens of other Bevy UI projects). It's called <a rel="nofollow noreferrer" href="https://github.com/JMS55/bevy_dioxus/blob/main/examples/demo.rs">bevy_dioxus</a>, and it builds on top of the excellent Dioxus library to provide reactivity. Rendering is handled by spawning bevy_ui entities. No documentation, but it's a fairly small amount of fairly clean code, and it's usable and integrates well with Bevy's ECS. No reinventing the wheel here.
For a few weeks' work, I'm pretty happy with the process of making it and how it turned out.</p> <p>Rendering is in pretty good shape now. Still lots more to implement or improve, but it's pretty usable! Going forwards, it would be nice to put more focus on documentation, ease of use, and ergonomics. The Material/AsBindGroup API is pretty footgun-y and not always performant, and there's a general lack of documentation for lower-level APIs besides "ask existing developers how to use things". A new render graph that automatically handled resources could help a lot with this, along with more asset-driven material APIs, and there's been some interest and design work in this space that I'm looking forward to.</p> <p>Assets and asset processing need a <em>lot</em> of work. Ignoring the editor (which will need to build on these APIs), Bevy still needs a lot of work on extending the asset processing API, and implementing asset workflows for baking lighting, compressing textures, etc. A real battle-tested end-to-end asset workflow, from artists to developers to built game, really needs developing. I'm hoping that this will be a bigger focus next year, in parallel with the editor.</p> <h3 id="solarn-t">Solarn't<a class="zola-anchor" href="#solarn-t" aria-label="Anchor link for: solarn-t" style="visibility: hidden;"></a> </h3> <p>Last year I demoed a realtime, fully dynamic raytraced GI solution I called Bevy Solari... and now a year later I've written nothing else on it, and a lot about virtual geometry. What gives? Well, I did work on it for a few months more after that blog post, but the project kind of died for a variety of reasons.</p> <p>I was using a custom, somewhat buggy fork of wgpu/naga/naga_oil, and it became very difficult to constantly rebase on top of Bevy's and those projects' upstream branches.
The approach I was using (screen space probes based on Lumen and GI-1.0, and later screen space radiance cascades) started souring on me for complexity and quality reasons. My world space radiance cache was completely broken (I've since learned what I was doing wrong, thanks Darius Bouma!) and I lost motivation to work on it. And finally I ended up starting meshlets, and later transitioned all of my time to it. So, Solari is dead, at least for now.</p> <p>Nowadays I feel like ReSTIR-based techniques (ReSTIR DI and GI plus screen space denoisers) hold much more promise. DDGI is also a great solution that I initially discarded for quality reasons, but it's pretty simple to implement, very easy to scale up or down in cost, and gives fairly decent results all things considered. DDGI is probably worth another consideration, not even necessarily as the main Bevy Solari project, but as an easier project to start with, and a more scalable alternative to ReSTIR. No reason both could not coexist.</p> <p>If raytracing gets upstreamed into wgpu, I would happily pick this project back up, particularly as I start to feel the need for a break from meshlets.</p> <h3 id="writing">Writing<a class="zola-anchor" href="#writing" aria-label="Anchor link for: writing" style="visibility: hidden;"></a> </h3> <p>Last year I finally started a blog... but didn't end up writing much. Or I did, but 80% of it was concentrated into one really long, really time-consuming post. It took something like a month to write.</p> <p>I also wrote a rather long post on reddit's /r/rust about things I disliked in Rust (after using it for so long, and recently using a lot of Java developing an enterprise application) (I do still love Rust, that hasn't changed). Surprisingly to me, a lot of people liked it and it sparked some interesting discussions. Separate from the post's contents, people also asked me if I had a blog where they could read more of my writing.
I, of course, had to tell them that yes I do, but it only has two posts, one of which is extremely niche and technical, so there's not all that much to read.</p> <p>Needless to say, this year, I'd like to try to blog more. I'm going to try to get more writing out, and focus less on quality and spending so much time editing. Starting, of course, with this post.</p> <p>With my new focus on spending less time writing, I'm ending this post now without trying to find a conclusion that flows better. See everyone next year!</p> Virtual Geometry in Bevy 0.14 2024-06-09T00:00:00+00:00 2024-06-09T00:00:00+00:00 Unknown https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/ <h1 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h1> <p>The 0.14 release of the open source <a rel="nofollow noreferrer" href="https://bevyengine.org">Bevy</a> game engine is coming up, and with it, the release of an experimental virtual geometry feature that I've been working on for several months.</p> <p>In this blog post, I'm going to give a technical deep dive into Bevy's new "meshlet" feature, what improvements it brings, techniques I tried that did or did not work out, and what I'm looking to improve on in the future. There's a lot that I've learned (and a <em>lot</em> of code I've written and rewritten multiple times), and I'd like to share what I learned in the hope that it will help others.</p> <p><img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/showcase.png" alt="Example scene for the meshlet renderer" /></p> <p>This post is going to be <em>very</em> long, so I suggest reading over it (and the Nanite slides) a couple of times to get a general overview of the pieces involved, before spending any time analyzing individual steps.
At the time of this writing, my blog theme doesn't have a table of contents sidebar that follows you as you scroll the page. I apologize for that. If you want to go back and reference previous sections as you read this post, I suggest using multiple browser tabs.</p> <p>I'd also like to take a moment to thank <a rel="nofollow noreferrer" href="https://github.com/LVSTRI">LVSTRI</a> and <a rel="nofollow noreferrer" href="https://jglrxavpok.github.io">jglrxavpok</a> for sharing their experiences with virtual geometry, <a rel="nofollow noreferrer" href="https://github.com/atlv24">atlv24</a> for their help in several areas, especially for their work adding some missing features I needed to wgpu/naga, other Bevy developers for testing and reviewing my PRs, Unreal Engine (Brian Karis, Rune Stubbe, Graham Wihlidal) for their <em>excellent</em> and highly detailed <a rel="nofollow noreferrer" href="https://advances.realtimerendering.com/s2021/Karis_Nanite_SIGGRAPH_Advances_2021_final.pdf">SIGGRAPH presentation</a>, and many more people than I can name who provided advice on this project.</p> <p>Code for this feature can be found <a rel="nofollow noreferrer" href="https://github.com/JMS55/bevy/tree/ca2c8e63b9562f88c8cd7e1d88a17a4eea20aaf4/crates/bevy_pbr/src/meshlet">on github</a>.</p> <p>If you're already familiar with Nanite, feel free to skip the next few sections of background info until you get to the Bevy-specific parts.</p> <h2 id="why-virtual-geometry">Why Virtual Geometry?<a class="zola-anchor" href="#why-virtual-geometry" aria-label="Anchor link for: why-virtual-geometry" style="visibility: hidden;"></a> </h2> <p>Before talking about what virtual geometry <em>is</em>, I think it's worth looking at what problems it is trying to <em>solve</em>.</p> <p>Let's go over the high level steps your typical <a rel="nofollow noreferrer" href="https://www.advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf">pre-2015 renderer</a> would
perform to render some basic geometry. I've omitted some steps that aren't relevant to this post such as uploading mesh and texture data, shadow map rendering, lighting, and other shading details.</p> <p>First, on the CPU:</p> <ul> <li>Frustum culling of instances outside of the camera's view</li> <li>Choosing the appropriate level of detail (LOD) for each instance</li> <li>Sorting and batching instances into multiple draw lists</li> <li>Recording draw calls into command buffers for each draw list</li> </ul> <p>Then, on the GPU:</p> <ul> <li>Setting up GPU state according to the command buffers</li> <li>Transforming vertices and rasterizing triangles</li> <li>Depth testing triangle fragments</li> <li>Shading visible fragments</li> </ul> <p>Now let's try taking this renderer, and feeding it a dense cityscape made of 500 million triangles, and 150 thousand instances of different meshes.</p> <p>It's going to be slow. Why? Let's look at some of the problems:</p> <ul> <li>Frustum culling lets us skip preparing or drawing instances that are outside the camera's frustum, but what if you have an instance that's only partially visible? The GPU still needs to transform, clip, and process all vertices in the mesh. Or, what if the entire scene is in the camera's frustum?</li> <li>If one instance is in front of another, it's a complete waste to draw an instance to the screen that will later be completely drawn over by another (overdraw).</li> <li>Sorting, batching, and encoding the command buffers for all those instances are going to be slow. Each instance likely has a different vertex and index buffer, different set of textures to bind, different shader (pipeline) for vertex and fragment processing, etc.</li> <li>The GPU will spend time spinning down and spinning back up as it switches state between each draw call.</li> </ul> <p>Now, it's no longer 2015; there are a variety of techniques (some from before 2015, that I purposefully left out) to alleviate a lot of these issues.
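</p>
<p>To make the first CPU-side step listed above concrete, instance frustum culling boils down to a few plane tests per bounding sphere (a minimal self-contained sketch with illustrative data, not Bevy's implementation):</p>

```rust
// Minimal sphere-vs-frustum test: a sphere is culled if it lies entirely
// on the negative side of any frustum plane. Planes are (normal, d) with
// the normal pointing into the frustum.
struct Plane {
    normal: [f32; 3],
    d: f32,
}

fn signed_distance(p: &Plane, point: [f32; 3]) -> f32 {
    p.normal[0] * point[0] + p.normal[1] * point[1] + p.normal[2] * point[2] + p.d
}

fn sphere_in_frustum(center: [f32; 3], radius: f32, planes: &[Plane]) -> bool {
    // Visible unless some plane puts the whole sphere outside.
    planes.iter().all(|p| signed_distance(p, center) >= -radius)
}

fn main() {
    // A single plane x >= 0 (normal +X through the origin).
    let planes = [Plane { normal: [1.0, 0.0, 0.0], d: 0.0 }];
    assert!(sphere_in_frustum([1.0, 0.0, 0.0], 0.5, &planes)); // inside
    assert!(!sphere_in_frustum([-2.0, 0.0, 0.0], 0.5, &planes)); // culled
}
```
<p>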
Deferred shading or a depth-only prepass means overdraw is less costly, bindless techniques and ubershaders reduce state switching, multi-draw can reduce draw count, etc.</p> <p>However, there are some more subtle issues that come up:</p> <ul> <li>Storing all that mesh data in memory takes too much VRAM. Modern mid-tier desktop GPUs tend to have 8-12 GB of VRAM, which means all your mesh data and 4k textures need to be able to fit in that amount of storage.</li> <li>LODs were one of the steps that were meant to help reduce the amount of geometry you were feeding a GPU. However, they come with some downsides: 1) The transition between LODs tends to be noticeable, even with a crossfade effect, 2) Artists need to spend time producing and tweaking LODs from their initial high-poly meshes, and 3) Like frustum culling, they don't help with the worst case of simply being close to a lot of high-poly geometry, unless you're willing to cap out at a lower resolution than the artist's original mesh.</li> </ul> <p>There's also another issue I've saved for last. Despite all the culling and batching and LODs, we still have <em>too much</em> geometry to draw every frame. We need a better way to deal with it than simple LODs.</p> <h2 id="what-is-virtual-geometry">What is Virtual Geometry?<a class="zola-anchor" href="#what-is-virtual-geometry" aria-label="Anchor link for: what-is-virtual-geometry" style="visibility: hidden;"></a> </h2> <p>With the introduction of Unreal Engine 5 in 2021 came the introduction of a new technique called <a rel="nofollow noreferrer" href="https://dev.epicgames.com/documentation/en-us/unreal-engine/nanite-virtualized-geometry-in-unreal-engine">Nanite</a>.
Nanite is a system where you can preprocess your non-deforming opaque meshes, and at runtime be able to very efficiently render them, largely solving the above problems with draw counts, memory limits, high-poly mesh rasterization, and the deficiencies of traditional LODs.</p> <p>Nanite works by first splitting your base mesh into a series of meshlets - small, independent clusters of triangles. Nanite then takes those clusters, groups clusters together, and simplifies the groups into a smaller set of <em>new</em> clusters. By repeating this process, you get a tree of clusters where the leaves of the tree form the base mesh, and the root of the tree forms a simplified approximation of the base mesh.</p> <p>Now at runtime, we don't just have to render one level (LOD) of the tree. We can choose specific clusters from different levels of the tree so that if you're close to one part of the mesh, it'll render many high resolution clusters. If you're far from a different part of the mesh, however, then that part will use a couple low resolution clusters that are cheaper to render. Unlike traditional LODs, which are all or nothing, part of the mesh can be low resolution, part of the mesh can be high resolution, and a third part can be somewhere in between - all at the same time, all on a very granular level.</p> <p>Additionally, the transitions between LODs can be virtually imperceptible and extremely smooth, without extra rendering work.
Traditional LODs typically have to hide transitions with crossfaded opacity between two levels.</p> <p>Combine this LOD technique with some per-cluster culling, a visibility buffer, streaming in and out of individual cluster data to prevent high memory usage, a custom rasterizer, and a whole bunch of other parts, and you end up with a renderer that <em>can</em> deal with a scene made of 500 million triangles.</p> <p>I mentioned before that meshes have to be opaque, and can't deform or animate (for the initial release of Nanite in Unreal Engine 5.0 this is true, but it's an area Unreal is working to improve). Nanite isn't perfect - there are still limitations. But the ceiling of what's feasible is a lot higher.</p> <h2 id="virtual-geometry-in-bevy">Virtual Geometry in Bevy<a class="zola-anchor" href="#virtual-geometry-in-bevy" aria-label="Anchor link for: virtual-geometry-in-bevy" style="visibility: hidden;"></a> </h2> <p>Now that the background is out of the way, let's talk about Bevy. For Bevy 0.14, I've written an initial implementation that largely copies the basic ideas of how Nanite works, without implementing every single optimization and technique. Currently, the feature is called meshlets (likely to change to virtual_geometry or something else in the future). 
In a minute, I'll get into the actual frame breakdown and code for meshlets, but first let's start with the user-facing API.</p> <p>Users wanting to use meshlets should compile with the <code>meshlet</code> cargo feature for runtime rendering, and the <code>meshlet_processor</code> cargo feature for preprocessing meshes (again, more on how that works later) into the special meshlet-specific format the meshlet renderer uses.</p> <p>Enabling the <code>meshlet</code> feature unlocks a new module: <code>bevy::pbr::experimental::meshlet</code>.</p> <p>First step, add <code>MeshletPlugin</code> to your app:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>app.</span><span style="color:#859900;">add_plugins</span><span style="color:#657b83;">(</span><span>MeshletPlugin</span><span style="color:#657b83;">)</span><span>; </span></code></pre> <p>Next, preprocess your <code>Mesh</code> into a <code>MeshletMesh</code>. Currently, this needs to be done manually via <code>MeshletMesh::from_mesh()</code> (again, you need the <code>meshlet_processor</code> feature enabled). This step is <em>very</em> slow, and should be done once ahead of time, and then saved to an asset file. Note that there are limitations on the types of meshes and materials supported; make sure to read the docs.</p> <p>I'm in the <a rel="nofollow noreferrer" href="https://github.com/bevyengine/bevy/pull/13431">middle of working on</a> an asset processor system to automatically convert entire glTF scenes, but it's not quite ready yet. For now, you'll have to come up with your own asset processing and management system.</p> <p>Now, spawn your entities. 
In the same vein as <code>MaterialMeshBundle</code>, there's a <code>MaterialMeshletMeshBundle</code>, which uses a <code>MeshletMesh</code> instead of the typical <code>Mesh</code>.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span>commands.</span><span style="color:#859900;">spawn</span><span style="color:#657b83;">(</span><span>MaterialMeshletMeshBundle </span><span style="color:#657b83;">{ </span><span> meshlet_mesh: meshlet_mesh_handle.</span><span style="color:#859900;">clone</span><span style="color:#657b83;">()</span><span>, </span><span> material: material_handle.</span><span style="color:#859900;">clone</span><span style="color:#657b83;">()</span><span>, </span><span> transform: Transform::default</span><span style="color:#657b83;">()</span><span>.</span><span style="color:#859900;">with_translation</span><span style="color:#657b83;">(</span><span>Vec3::new</span><span style="color:#657b83;">(</span><span>x </span><span style="color:#859900;">as </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">/ </span><span style="color:#6c71c4;">2.0</span><span>, </span><span style="color:#6c71c4;">0.0</span><span>, </span><span style="color:#6c71c4;">0.3</span><span style="color:#657b83;">))</span><span>, </span><span> </span><span style="color:#859900;">..default</span><span style="color:#657b83;">() </span><span style="color:#657b83;">})</span><span>; </span></code></pre> <p>Lastly, a note on materials. Meshlet entities use the same <code>Material</code> trait as regular mesh entities. There are 3 new methods that meshlet entities use however: <code>meshlet_mesh_fragment_shader</code>, <code>meshlet_mesh_prepass_fragment_shader</code>, and <code>meshlet_mesh_deferred_fragment_shader</code>.</p> <p>Notice that there is no access to vertex shaders. 
Meshlet rendering uses a hardcoded vertex shader that cannot be changed.</p> <p>Fragment shaders for meshlets are mostly the same as fragment shaders for regular mesh entities. The key difference is that instead of this:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">fragment</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">vertex_output</span><span>: VertexOutput</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@location</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">) </span><span>vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// ... </span><span style="color:#657b83;">} </span></code></pre> <p>You should use this:</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">#</span><span>import bevy_pbr::meshlet_visibility_buffer_resolve::resolve_vertex_output </span><span> </span><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">fragment</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">position</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">frag_coord</span><span>: vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@location</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span 
style="color:#657b83;">) </span><span>vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> vertex_output </span><span style="color:#657b83;">= </span><span style="color:#859900;">resolve_vertex_output</span><span style="color:#657b83;">(</span><span>frag_coord</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#586e75;">// ... </span><span style="color:#657b83;">} </span></code></pre> <h1 id="mesh-conversion">Mesh Conversion<a class="zola-anchor" href="#mesh-conversion" aria-label="Anchor link for: mesh-conversion" style="visibility: hidden;"></a> </h1> <p>We're now going to start the portion of this blog post going into how everything is implemented.</p> <p>The first step, before we can render anything, is to convert all meshes to meshlet meshes. I talked about the user-facing API earlier on, but in this section we'll dive into what <code>MeshletMesh::from_mesh()</code> is doing under the hood in <code>from_mesh.rs</code>.</p> <p>This section will be a bit dry, lacking commentary on why I did things, in favor of just describing the algorithm itself. The reason is that I don't have many unique insights into the conversion process. The steps taken are pretty much just copied from Nanite (except Nanite does it better). 
If you're interested in understanding this section in greater detail, definitely check out the original Nanite presentation.</p> <p>Feel free to skip ahead to the frame breakdown section if you are more interested in the runtime portion of the renderer.</p> <p>The high level steps for converting a mesh are as follows:</p> <ol> <li>Build LOD 0 meshlets</li> <li>For each meshlet, find the set of all edges making up the triangles within the meshlet</li> <li>For each meshlet, find the set of connected meshlets (sharing an edge)</li> <li>Divide meshlets into groups of roughly 4</li> <li>For each group of meshlets, build a new list of triangles approximating the original group</li> <li>For each simplified group, break them apart into new meshlets</li> <li>Repeat steps 2-6 using the set of new meshlets, until we run out of meshlets to simplify</li> </ol> <p><img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/build_steps.png" alt="Nanite LOD build steps" /></p> <h2 id="build-lod-0-meshlets">Build LOD 0 Meshlets<a class="zola-anchor" href="#build-lod-0-meshlets" aria-label="Anchor link for: build-lod-0-meshlets" style="visibility: hidden;"></a> </h2> <p>We're starting with a generic triangle mesh, so the first step is to group its triangles into an initial set of meshlets. 
No simplification or modification of the mesh is involved - we're simply splitting up the original mesh into a set of meshlets that would render exactly the same.</p> <p>The crate <code>meshopt-rs</code> provides Rust bindings to the excellent <code>meshoptimizer</code> library, which provides a nice <code>build_meshlets()</code> function for us that I've wrapped into <code>compute_meshlets()</code>.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Split the mesh into an initial list of meshlets (LOD 0) </span><span style="color:#268bd2;">let</span><span> vertex_buffer </span><span style="color:#657b83;">=</span><span> mesh.</span><span style="color:#859900;">get_vertex_buffer_data</span><span style="color:#657b83;">()</span><span>; </span><span style="color:#268bd2;">let</span><span> vertex_stride </span><span style="color:#657b83;">=</span><span> mesh.</span><span style="color:#859900;">get_vertex_size</span><span style="color:#657b83;">() </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span>; </span><span style="color:#268bd2;">let</span><span> vertices </span><span style="color:#657b83;">= </span><span>VertexDataAdapter::new</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>vertex_buffer, vertex_stride, </span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span>.</span><span style="color:#859900;">unwrap</span><span style="color:#657b83;">()</span><span>; </span><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> meshlets </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_meshlets</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>indices, </span><span style="color:#859900;">&amp;</span><span>vertices</span><span 
style="color:#657b83;">)</span><span>; </span></code></pre> <p>We also need some bounding spheres for each meshlet. The culling bounding sphere is straightforward - <code>compute_meshlet_bounds()</code>, again from <code>meshopt-rs</code>, will give us a bounding sphere encompassing the meshlet that we can use for frustum and occlusion culling later on.</p> <p>The <code>self_lod</code> and <code>parent_lod</code> bounding spheres need a lot more explanation.</p> <p>As we simplify each group of meshlets into new meshlets, we will deform the mesh slightly. That deformation adds up over time, eventually giving a very visibly different mesh from the original. However, when viewing the very simplified mesh from far away, due to perspective the difference will be much less noticeable. While we would want to view the original (or close to the original) mesh close-up, at longer distances we can get away with rendering a much simpler version of the mesh without noticeable differences.</p> <p>So, how to choose the right LOD level, or in our case, the right LOD tree cut? The LOD cut will be based on the simplification error of each meshlet along the cut, with the goal being to select a cut that is imperceptibly different from the original mesh at the distance we're viewing the mesh at.</p> <p>For reasons I'll get into later during the runtime section, we're going to treat the error as a bounding sphere around the meshlet, with the radius being the error. We're also going to want two of these: one for the current meshlet itself, and one for the less-simplified group of meshlets that we simplified into the current meshlet (the current meshlet's parents in the LOD tree).</p> <p>LOD 0 meshlets, being the original representation of the mesh, have no error (0.0). 
They also have no set of parent meshlets, which we will represent with an infinite amount of error (f32::MAX), again for reasons I will get into later.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> bounding_spheres </span><span style="color:#657b83;">=</span><span> meshlets </span><span> .</span><span style="color:#859900;">iter</span><span style="color:#657b83;">() </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(|</span><span style="color:#268bd2;">meshlet</span><span style="color:#657b83;">| </span><span style="color:#859900;">compute_meshlet_bounds</span><span style="color:#657b83;">(</span><span>meshlet, </span><span style="color:#859900;">&amp;</span><span>vertices</span><span style="color:#657b83;">)) </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(</span><span>convert_meshlet_bounds</span><span style="color:#657b83;">) </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(|</span><span style="color:#268bd2;">bounding_sphere</span><span style="color:#657b83;">| </span><span>MeshletBoundingSpheres </span><span style="color:#657b83;">{ </span><span> self_culling: bounding_sphere, </span><span> self_lod: MeshletBoundingSphere </span><span style="color:#657b83;">{ </span><span> center: bounding_sphere.center, </span><span> radius: </span><span style="color:#6c71c4;">0.0</span><span>, </span><span> </span><span style="color:#657b83;">}</span><span>, </span><span> parent_lod: MeshletBoundingSphere </span><span style="color:#657b83;">{ </span><span> center: bounding_sphere.center, </span><span> radius: </span><span style="color:#268bd2;">f32</span><span>::</span><span style="color:#cb4b16;">MAX</span><span>, </span><span> </span><span 
style="color:#657b83;">}</span><span>, </span><span> </span><span style="color:#657b83;">}) </span><span> .collect::&lt;</span><span style="color:#859900;">Vec</span><span>&lt;</span><span style="color:#859900;">_</span><span>&gt;&gt;</span><span style="color:#657b83;">()</span><span>; </span></code></pre> <h2 id="find-meshlet-edges">Find Meshlet Edges<a class="zola-anchor" href="#find-meshlet-edges" aria-label="Anchor link for: find-meshlet-edges" style="visibility: hidden;"></a> </h2> <p>Now that we have our initial set of meshlets, we can start simplifying.</p> <p>The first step is to find the set of triangle edges that make up each meshlet. This can be done with a simple loop over triangles, building a hashset of edges where each edge is ordered such that the smaller-numbered vertex comes before the larger-numbered vertex. This ensures that we don't accidentally add both (v1, v2) and (v2, v1), which conceptually are the same edge. Each triangle has 3 vertices and 3 edges.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> meshlet_triangle_edges </span><span style="color:#657b83;">= </span><span>HashSet::new</span><span style="color:#657b83;">()</span><span>; </span><span style="color:#859900;">for</span><span> i </span><span style="color:#859900;">in</span><span> meshlet.triangles.</span><span style="color:#859900;">chunks</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#268bd2;">let</span><span> v0 </span><span style="color:#657b83;">=</span><span> meshlet.vertices</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">] </span><span style="color:#859900;">as </span><span 
style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> v1 </span><span style="color:#657b83;">=</span><span> meshlet.vertices</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">] </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> v2 </span><span style="color:#657b83;">=</span><span> meshlet.vertices</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">] </span><span style="color:#859900;">as </span><span style="color:#268bd2;">usize</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> meshlet_triangle_edges.</span><span style="color:#859900;">insert</span><span style="color:#657b83;">((</span><span>v0.</span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>v1</span><span style="color:#657b83;">)</span><span>, v0.</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>v1</span><span style="color:#657b83;">)))</span><span>; </span><span> meshlet_triangle_edges.</span><span style="color:#859900;">insert</span><span style="color:#657b83;">((</span><span>v0.</span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>v2</span><span style="color:#657b83;">)</span><span>, v0.</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>v2</span><span style="color:#657b83;">)))</span><span>; </span><span> meshlet_triangle_edges.</span><span style="color:#859900;">insert</span><span style="color:#657b83;">((</span><span>v1.</span><span style="color:#859900;">min</span><span 
style="color:#657b83;">(</span><span>v2</span><span style="color:#657b83;">)</span><span>, v1.</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>v2</span><span style="color:#657b83;">)))</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <h2 id="find-connected-meshlets">Find Connected Meshlets<a class="zola-anchor" href="#find-connected-meshlets" aria-label="Anchor link for: find-connected-meshlets" style="visibility: hidden;"></a> </h2> <p>Next, we need to find the meshlets that connect to each other.</p> <p>A meshlet will be considered as connected to another meshlet if both meshlets share at least one edge.</p> <p>In the previous step, we built a set of edges for each meshlet. Finding if two meshlets share any edges can be done by simply taking the intersection of their two edge sets, and checking if the resulting set is not empty.</p> <p>We will also store the <em>amount</em> of shared edges between two meshlets, giving a heuristic for how "connected" each meshlet is to another. 
This is simply the size of the intersection set.</p> <p>Overall, we will build a list per meshlet, containing tuples of (meshlet_id, shared_edge_count) for each meshlet connected to the current meshlet.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>meshlet_id1, meshlet_id2</span><span style="color:#657b83;">) </span><span style="color:#859900;">in</span><span> simplification_queue.</span><span style="color:#859900;">tuple_combinations</span><span style="color:#657b83;">() { </span><span> </span><span style="color:#268bd2;">let</span><span> shared_edge_count </span><span style="color:#657b83;">=</span><span> triangle_edges_per_meshlet</span><span style="color:#657b83;">[</span><span style="color:#859900;">&amp;</span><span>meshlet_id1</span><span style="color:#657b83;">] </span><span> .</span><span style="color:#859900;">intersection</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>triangle_edges_per_meshlet</span><span style="color:#657b83;">[</span><span style="color:#859900;">&amp;</span><span>meshlet_id2</span><span style="color:#657b83;">]) </span><span> .</span><span style="color:#859900;">count</span><span style="color:#657b83;">()</span><span>; </span><span> </span><span> </span><span style="color:#859900;">if</span><span> shared_edge_count </span><span style="color:#657b83;">!= </span><span style="color:#6c71c4;">0 </span><span style="color:#657b83;">{ </span><span> connected_meshlets_per_meshlet </span><span> .</span><span style="color:#859900;">get_mut</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_id1</span><span style="color:#657b83;">) </span><span> .</span><span style="color:#859900;">unwrap</span><span style="color:#657b83;">() </span><span> .</span><span 
style="color:#859900;">push</span><span style="color:#657b83;">((</span><span>meshlet_id2, shared_edge_count</span><span style="color:#657b83;">))</span><span>; </span><span> connected_meshlets_per_meshlet </span><span> .</span><span style="color:#859900;">get_mut</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_id2</span><span style="color:#657b83;">) </span><span> .</span><span style="color:#859900;">unwrap</span><span style="color:#657b83;">() </span><span> .</span><span style="color:#859900;">push</span><span style="color:#657b83;">((</span><span>meshlet_id1, shared_edge_count</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h2 id="partition-meshlets-into-groups">Partition Meshlets Into Groups<a class="zola-anchor" href="#partition-meshlets-into-groups" aria-label="Anchor link for: partition-meshlets-into-groups" style="visibility: hidden;"></a> </h2> <p>Now that we know which meshlets are connected, the next step is to group them together. We're going to aim for 4 meshlets per group, although there's no way of guaranteeing that.</p> <p>How should we determine which meshlets go in which group?</p> <p>You can view the connected meshlet sets as a graph. Each meshlet is a node, and bidirectional edges connect one meshlet to another in the graph if we determined that they were connected earlier. The weight of each edge is the amount of shared edges between the two meshlet nodes.</p> <p>Partitioning the meshlets into groups is now a matter of partitioning the graph. I use the <code>metis-rs</code> crate which provides Rust bindings to the <code>METIS</code> library. 
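</p> <p>As a purely illustrative sketch (the names are mine, and this is not Bevy's actual code), the per-meshlet adjacency lists can be flattened into the compressed sparse row (CSR) arrays that METIS-style partitioners consume - an <code>xadj</code> offset array, an <code>adjncy</code> neighbor array, and an <code>adjwgt</code> edge-weight array:</p>

```rust
/// Flatten per-meshlet adjacency lists of (neighbor_id, shared_edge_count)
/// into CSR-style arrays as consumed by METIS-like graph partitioners:
/// - xadj[i]..xadj[i + 1] is meshlet i's slice of adjncy/adjwgt
/// - adjncy holds neighbor meshlet ids
/// - adjwgt holds edge weights (here, shared edge counts)
fn build_csr(connected: &[Vec<(usize, usize)>]) -> (Vec<i32>, Vec<i32>, Vec<i32>) {
    let mut xadj = vec![0i32];
    let mut adjncy = Vec::new();
    let mut adjwgt = Vec::new();
    for neighbors in connected {
        for &(neighbor_id, shared_edge_count) in neighbors {
            adjncy.push(neighbor_id as i32);
            adjwgt.push(shared_edge_count as i32);
        }
        // One offset per meshlet: where its adjacency slice ends
        xadj.push(adjncy.len() as i32);
    }
    (xadj, adjncy, adjwgt)
}
```

<p>A partitioner is then handed these arrays plus a target partition count (roughly the meshlet count divided by the desired group size of 4), and returns a group index for every meshlet.</p> <p>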
The edge weights will be used so that meshlets with a high shared edge count are more likely to be grouped together.</p> <p>The code to format this data for metis is a bit complicated, but in the end we have a list of groups, where each group is a list of meshlets.</p> <h2 id="simplify-groups">Simplify Groups<a class="zola-anchor" href="#simplify-groups" aria-label="Anchor link for: simplify-groups" style="visibility: hidden;"></a> </h2> <p>Now for an important step, and the trickiest.</p> <p>We take each group, and merge the triangle lists of the underlying meshlets together into one large list of triangles, forming a new mesh.</p> <p>Now, we can simplify this new mesh into a lower-resolution (faster to render) version. Meshopt again provides a helpful <code>simplify()</code> function for us. Finally, fewer triangles to render!</p> <p>In addition to the new mesh, we get an "error" value, describing how much the mesh was deformed by the simplification.</p> <p>The quadratic error metric (QEM) returned from simplifying is a somewhat meaningless value, but we can use <code>simplify_scale()</code> to get an object-space value. This value is <em>still</em> fairly meaningless, but we can treat it as the maximum amount of object-space distance a vertex was displaced by during simplification.</p> <p>The error represents displacement from the meshlets we simplified, but we want the displacement from the original (LOD 0) meshlets. We can add the max error of the meshlets that went into building the current meshlet group (child nodes of the parent node that we're currently building in the LOD tree) to make the error relative to LOD 0.</p> <p>If this all feels handwavy to you, that's because it is. And this is vertex positions only; we haven't even considered UV error during simplification, or how the mesh's eventual material influences perceptual differences between LOD levels. 
Perceptual simplification is very much an unsolved problem in computer graphics, and for now Bevy only uses positions for simplification.</p> <p>You'll have to take my word for it that using the error like this works. You'll see how it gets used to pick the LOD level during runtime in a later section. For now, we'll take the group error and build a bounding sphere out of it, and assign it as the parent LOD bounding sphere for the group's (parent node, higher LOD) underlying meshlets (child nodes, lower LOD).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Simplify the group to ~50% triangle count </span><span style="color:#268bd2;">let </span><span style="color:#859900;">Some</span><span style="color:#657b83;">((</span><span>simplified_group_indices, </span><span style="color:#93a1a1;">mut</span><span> group_error</span><span style="color:#657b83;">)) = </span><span> </span><span style="color:#859900;">simplify_meshlet_groups</span><span style="color:#657b83;">(</span><span>group_meshlets, </span><span style="color:#859900;">&amp;</span><span>meshlets, </span><span style="color:#859900;">&amp;</span><span>vertices, lod_level</span><span style="color:#657b83;">) </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">continue</span><span>; </span><span style="color:#657b83;">}</span><span>; </span><span> </span><span style="color:#586e75;">// Add the maximum child error to the parent error to make parent error cumulative from LOD 0 </span><span style="color:#586e75;">// (we&#39;re currently building the parent from its children) </span><span>group_error </span><span style="color:#657b83;">+=</span><span> group_meshlets.</span><span style="color:#859900;">iter</span><span style="color:#657b83;">()</span><span>.</span><span style="color:#859900;">fold</span><span 
style="color:#657b83;">(</span><span>group_error, </span><span style="color:#657b83;">|</span><span style="color:#268bd2;">acc</span><span>, </span><span style="color:#268bd2;">meshlet_id</span><span style="color:#657b83;">| { </span><span> acc.</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>bounding_spheres</span><span style="color:#657b83;">[*</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>.self_lod.radius</span><span style="color:#657b83;">) </span><span style="color:#657b83;">})</span><span>; </span><span> </span><span style="color:#586e75;">// Build a new LOD bounding sphere for the simplified group as a whole </span><span style="color:#268bd2;">let </span><span style="color:#93a1a1;">mut</span><span> group_bounding_sphere </span><span style="color:#657b83;">= </span><span style="color:#859900;">convert_meshlet_bounds</span><span style="color:#657b83;">(</span><span style="color:#859900;">compute_cluster_bounds</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">&amp;</span><span>simplified_group_indices, </span><span> </span><span style="color:#859900;">&amp;</span><span>vertices, </span><span style="color:#657b83;">))</span><span>; </span><span>group_bounding_sphere.radius </span><span style="color:#657b83;">=</span><span> group_error; </span><span> </span><span style="color:#586e75;">// For each meshlet in the group set their parent LOD bounding sphere to that of the simplified group </span><span style="color:#859900;">for</span><span> meshlet_id </span><span style="color:#859900;">in</span><span> group_meshlets </span><span style="color:#657b83;">{ </span><span> bounding_spheres</span><span style="color:#657b83;">[*</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>.parent_lod </span><span style="color:#657b83;">=</span><span> group_bounding_sphere; </span><span style="color:#657b83;">} </span></code></pre> <h2 id="split-groups">Split 
Groups<a class="zola-anchor" href="#split-groups" aria-label="Anchor link for: split-groups" style="visibility: hidden;"></a> </h2> <p>Finally, the last step is to take the large mesh formed from simplifying the entire meshlet group, and split it into a set of brand new meshlets.</p> <p>This is in fact the same process as splitting the original mesh into meshlets.</p> <p>If everything went optimally, we should have gone from the original 4 meshlets per group, to 2 new meshlets per group with 50% fewer triangles overall.</p> <p>For each new meshlet, we'll calculate a bounding sphere for culling, assign the self_lod bounding sphere as that of the group, and the parent_lod bounding sphere again as uninitialized.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Build new meshlets using the simplified group </span><span style="color:#268bd2;">let</span><span> new_meshlets_count </span><span style="color:#657b83;">= </span><span style="color:#859900;">split_simplified_groups_into_new_meshlets</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">&amp;</span><span>simplified_group_indices, </span><span> </span><span style="color:#859900;">&amp;</span><span>vertices, </span><span> </span><span style="color:#859900;">&amp;</span><span style="color:#93a1a1;">mut</span><span> meshlets, </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#586e75;">// Calculate the culling bounding sphere for the new meshlets and set their LOD bounding spheres </span><span style="color:#268bd2;">let</span><span> new_meshlet_ids </span><span style="color:#657b83;">= (</span><span>meshlets.</span><span style="color:#859900;">len</span><span style="color:#657b83;">() -</span><span> new_meshlets_count</span><span style="color:#657b83;">)</span><span 
style="color:#859900;">..</span><span>meshlets.</span><span style="color:#859900;">len</span><span style="color:#657b83;">()</span><span>; </span><span>bounding_spheres.</span><span style="color:#859900;">extend</span><span style="color:#657b83;">( </span><span> new_meshlet_ids </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(|</span><span style="color:#268bd2;">meshlet_id</span><span style="color:#657b83;">| { </span><span> </span><span style="color:#859900;">compute_meshlet_bounds</span><span style="color:#657b83;">(</span><span>meshlets.</span><span style="color:#859900;">get</span><span style="color:#657b83;">(</span><span>meshlet_id</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#859900;">&amp;</span><span>vertices</span><span style="color:#657b83;">) </span><span> </span><span style="color:#657b83;">}) </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(</span><span>convert_meshlet_bounds</span><span style="color:#657b83;">) </span><span> .</span><span style="color:#859900;">map</span><span style="color:#657b83;">(|</span><span style="color:#268bd2;">bounding_sphere</span><span style="color:#657b83;">| </span><span>MeshletBoundingSpheres </span><span style="color:#657b83;">{ </span><span> self_culling: bounding_sphere, </span><span> self_lod: group_bounding_sphere, </span><span> parent_lod: MeshletBoundingSphere </span><span style="color:#657b83;">{ </span><span> center: group_bounding_sphere.center, </span><span> radius: </span><span style="color:#268bd2;">f32</span><span>::</span><span style="color:#cb4b16;">MAX</span><span>, </span><span> </span><span style="color:#657b83;">}</span><span>, </span><span> </span><span style="color:#657b83;">})</span><span>, </span><span style="color:#657b83;">)</span><span>; </span></code></pre> <p>We can repeat this whole process several times, ideally getting down to a single meshlet forming the root of the LOD 
tree. In practice, my current code can't get to that point for most meshes.</p> <h1 id="frame-breakdown">Frame Breakdown<a class="zola-anchor" href="#frame-breakdown" aria-label="Anchor link for: frame-breakdown" style="visibility: hidden;"></a> </h1> <p>With the asset processing part out of the way, we can finally move on to the more interesting runtime code section.</p> <p>The frame capture we'll be looking at is this scene with 3092 copies of the Stanford Bunny. Five of the bunnies are using unique PBR materials (they're hiding in the top middle), while the rest use the same debug material that visualizes the clusters/triangles of the mesh. Each bunny is made of 144,042 triangles at LOD 0, with 4936 meshlets total in the LOD tree.</p> <p>GPU timings were measured on an RTX 3080 locked to base clock speeds (so not as fast as you would actually get in practice), rendering at 2240x1260, averaged over 10 frames.</p> <blockquote> <p>Clusters visualization <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/clusters.png" alt="Clusters visualization" /> Triangles visualization <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/triangles.png" alt="Triangles visualization" /></p> </blockquote> <blockquote> <p>NSight profile <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/nsight.png" alt="NSight profile" /></p> </blockquote> <p>The frame can be broken down into the following passes:</p> <ol> <li>Fill cluster buffers (0.22ms)</li> <li>Cluster culling first pass (0.49ms)</li> <li>Raster visbuffer first pass (1.85ms +/- 0.33ms)</li> <li>Build depth pyramid for second pass (0.03ms)</li> <li>Cluster culling second pass (0.11ms)</li> <li>Raster visbuffer second pass (&lt; 0.01ms)</li> <li>Copy material depth (0.04ms)</li> <li>Material shading (timings omitted as this is a poor test for materials)</li> <li>Build depth pyramid for next frame (0.03ms)</li> </ol> <p>Total GPU time is ~2.78ms +/-
0.33ms.</p> <p>There's a lot to cover, so I'm going to try and keep it fairly brief in each section. The high level concepts of all of these passes (besides the first pass) are copied from Nanite, so check out their presentation for further details. I'll be trying to focus more on the lower level code and reasons why I implemented things the way I did. My first attempt at a lot of these passes had bugs, and was way slower. The details and data flow are what take the concept from a neat tech demo to an actually usable and scalable renderer.</p> <h2 id="terminology">Terminology<a class="zola-anchor" href="#terminology" aria-label="Anchor link for: terminology" style="visibility: hidden;"></a> </h2> <p>First, some terminology:</p> <ul> <li><code>asset buffers</code> - When a new MeshletMesh asset is loaded, we copy the buffers it's made of into large suballocated buffers. All the vertex data, meshlet data, bounding spheres, etc. for multiple MeshletMesh assets are packed together into one large buffer per data type.</li> <li><code>instance</code> - A single Bevy entity with a MeshletMesh and Material.</li> <li><code>instance uniform</code> - A transform matrix and mesh flags for an instance.</li> <li><code>material</code> - A combination of pipeline and bind group used for shading fragments.</li> <li><code>meshlet</code> - A single meshlet from within a MeshletMesh asset, pointing to data within the asset buffers (more or less).</li> <li><code>cluster</code> - A single renderable piece of an entity. Each cluster is associated with an instance and a meshlet. <ul> <li>All of our shaders will operate on clusters, and <em>not</em> on meshlets. You can think of these like an instance of a meshlet for a specific entity, in the same way you can have an instance of a class in object-oriented programming languages.</li> <li>Up to this point I've been using meshlet and cluster interchangeably.
From now on, they have separate, defined meanings.</li> </ul> </li> <li><code>view</code> - A perspective or orthographic camera with an associated depth buffer and optional color output. The main camera is a view, and additional views can be dynamically generated for e.g. rendering shadowmaps.</li> <li><code>id</code> - A u32 index into a buffer.</li> </ul> <h2 id="fill-cluster-buffers">Fill Cluster Buffers<a class="zola-anchor" href="#fill-cluster-buffers" aria-label="Anchor link for: fill-cluster-buffers" style="visibility: hidden;"></a> </h2> <p>Now the first pass we're going to look at might be surprising.</p> <p>Over the course of the frame, for each cluster we will need its instance (giving us a transform and material), along with its meshlet (giving us vertex data and bounding spheres).</p> <p>While the cluster itself is implicit (each thread or workgroup of a shader will handle one cluster, with the global thread/workgroup ID being the cluster ID), we need some method of telling the GPU what the instance and meshlet for each cluster is.</p> <p>I.e., we need an array of instance IDs and meshlet IDs such that we can do <code>let cluster_instance = instances[cluster_instance_ids[cluster_id]]</code> and <code>let cluster_meshlet = meshlets[cluster_meshlet_ids[cluster_id]]</code>.</p> <p>The naive method would be to simply write out these two buffers from the CPU and transfer them to the GPU. This was how I implemented it initially, and it worked fine for my simple initial test scene with a single bunny, but I very quickly ran into performance problems when trying to scale up to rendering 3000 bunnies.</p> <p>Each ID is a 4-byte u32, and it's two IDs per cluster. That's 8 bytes per cluster.</p> <p>With 3092 bunnies in the scene, and 4936 meshlets per bunny, that's 8 * 3092 * 4936 bytes total = ~122.10 MB total.</p> <p>For dedicated GPUs, uploading data from the system's RAM to the GPU's VRAM is done over PCIe.
PCIe x16 Gen3 max bandwidth is 16 GB/s.</p> <p>Ignoring data copying costs and other overhead, and assuming max PCIe bandwidth, that would mean it would take ~7.63ms to upload cluster data. That's 7.63 / 16.6 = ~46% of our frame budget gone at 60fps, before we've even rendered anything! Obviously, we need a better method.</p> <hr /> <p>Instead of uploading per-cluster data, we're going to stick to uploading only per-instance data. Specifically, two buffers called <code>instance_meshlet_counts_prefix_sum</code> and <code>instance_meshlet_slice_starts</code>. Each buffer will be an array of integers, with an entry per instance.</p> <p>The former will contain a prefix sum (calculated on the CPU while writing out the buffer) of how many meshlets each instance is made of. The latter will contain the index of where in the meshlet asset buffer each instance's list of meshlets begins.</p> <p>Now we're uploading only 8 bytes per <em>instance</em>, and not per <em>cluster</em>, which is much, much cheaper. Looking back at our scene, we're uploading 3092 * 8 bytes total = ~0.025 MB total. This is a <em>huge</em> improvement over the ~122.10 MB from before.</p> <p>Once the GPU has this data, we can have the GPU write out the <code>cluster_instance_ids</code> and <code>cluster_meshlet_ids</code> buffers from a compute shader. Max VRAM bandwidth on my RTX 3080 is a whopping 760.3 GB/s; ~47.5x faster than the 16 GB/s of bandwidth we had over PCIe.</p> <p>Each thread of the compute shader will handle one cluster, and do a binary search over the prefix sum array to find which instance it belongs to.</p> <p>Binary search might seem surprising - it's multiple dependent divergent memory accesses within a thread, and one of the biggest performance metrics for GPU code is cache efficiency. However, it's very coherent <em>across</em> threads within the subgroup, and scales extremely well (O(log n)) with the number of instances in the scene.
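</p>
<p>To make the data flow concrete, here's a small CPU-side sketch in Rust of both halves: building the prefix sum buffer, and the per-cluster lookup that each shader thread performs. The helper names are mine, not Bevy's, and I'm assuming the buffer stores an <em>exclusive</em> prefix sum (the count of meshlets in all instances before this one) - that's what makes the <code>cluster_id - prefix_sum[instance_id]</code> subtraction work out.</p>

```rust
/// Exclusive prefix sum of per-instance meshlet counts, e.g. [3, 1, 4] -> [0, 3, 4].
/// (Hypothetical helper; the real buffer is written out during render world extraction.)
fn meshlet_counts_prefix_sum(counts: &[u32]) -> Vec<u32> {
    let mut running_total = 0;
    counts
        .iter()
        .map(|&count| {
            let before = running_total;
            running_total += count;
            before
        })
        .collect()
}

/// What each shader thread does: binary search for the last entry <= cluster_id,
/// giving the instance, then subtract to get the meshlet index within the instance.
fn cluster_to_instance_and_local_meshlet(prefix_sum: &[u32], cluster_id: u32) -> (u32, u32) {
    let (mut left, mut right) = (0i64, prefix_sum.len() as i64 - 1);
    while left <= right {
        let mid = (left + right) / 2;
        if prefix_sum[mid as usize] <= cluster_id {
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }
    let instance_id = right as u32;
    let meshlet_id_local = cluster_id - prefix_sum[instance_id as usize];
    (instance_id, meshlet_id_local)
}
```

<p>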
In practice, while it could be improved, the performance of this pass has not been a bottleneck.</p> <p>Now that we know what instance the cluster belongs to, it's trivial to calculate the meshlet index of the cluster within the instance's meshlet mesh asset. Adding that to the instance's meshlet_slice_start using the other buffer we uploaded gives us the global meshlet index within the overall meshlet asset buffer. The thread can then write out the two calculated IDs for the cluster.</p> <p>This is the only pass that runs once per-frame. The rest of the passes all run once per-view.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">/// Writes out instance_id and meshlet_id to the global buffers for each cluster in the scene. </span><span> </span><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">128</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#586e75;">// 128 threads per workgroup, 1 cluster per thread </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">fill_cluster_buffers</span><span style="color:#657b83;">( </span><span> @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">workgroup_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">workgroup_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span> @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">num_workgroups</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">num_workgroups</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span> 
@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">local_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">local_invocation_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt; </span><span style="color:#657b83;">) { </span><span> </span><span style="color:#586e75;">// Calculate the cluster ID for this thread </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> local_invocation_id.x </span><span style="color:#657b83;">+</span><span> 128u </span><span style="color:#657b83;">* </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>workgroup_id, </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>num_workgroups.x </span><span style="color:#657b83;">*</span><span> num_workgroups.x, num_workgroups.x, 1u</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#859900;">if</span><span> cluster_id </span><span style="color:#657b83;">&gt;=</span><span> cluster_count </span><span style="color:#657b83;">{ </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} </span><span> </span><span> </span><span style="color:#586e75;">// Binary search to find the instance this cluster belongs to </span><span> var left </span><span style="color:#657b83;">=</span><span> 0u; </span><span> var right </span><span style="color:#657b83;">=</span><span> arrayLength</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_instance_meshlet_counts_prefix_sum</span><span style="color:#657b83;">) -</span><span> 1u; </span><span> </span><span style="color:#859900;">while</span><span> left </span><span style="color:#657b83;">&lt;=</span><span> right </span><span style="color:#657b83;">{ </span><span> </span><span 
style="color:#268bd2;">let</span><span> mid </span><span style="color:#657b83;">= (</span><span>left </span><span style="color:#657b83;">+</span><span> right</span><span style="color:#657b83;">) /</span><span> 2u; </span><span> </span><span style="color:#859900;">if</span><span> meshlet_instance_meshlet_counts_prefix_sum</span><span style="color:#657b83;">[</span><span>mid</span><span style="color:#657b83;">] &lt;=</span><span> cluster_id </span><span style="color:#657b83;">{ </span><span> left </span><span style="color:#657b83;">=</span><span> mid </span><span style="color:#657b83;">+</span><span> 1u; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> right </span><span style="color:#657b83;">=</span><span> mid </span><span style="color:#657b83;">-</span><span> 1u; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#268bd2;">let</span><span> instance_id </span><span style="color:#657b83;">=</span><span> right; </span><span> </span><span> </span><span style="color:#586e75;">// Find the meshlet ID for this cluster within the instance&#39;s MeshletMesh </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id_local </span><span style="color:#657b83;">=</span><span> cluster_id </span><span style="color:#657b83;">-</span><span> meshlet_instance_meshlet_counts_prefix_sum</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Find the overall meshlet ID in the global meshlet buffer </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id </span><span style="color:#657b83;">=</span><span> meshlet_id_local </span><span style="color:#657b83;">+</span><span> meshlet_instance_meshlet_slice_starts</span><span 
style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// Write results to buffers </span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">] =</span><span> instance_id; </span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">] =</span><span> meshlet_id; </span><span style="color:#657b83;">} </span></code></pre> <h2 id="culling-first-pass">Culling (First Pass)<a class="zola-anchor" href="#culling-first-pass" aria-label="Anchor link for: culling-first-pass" style="visibility: hidden;"></a> </h2> <p>I mentioned earlier that frustum culling is not sufficient for complex scenes. With meshlets, we're going to have a <em>lot</em> of geometry in view at once. Rendering all of that is way too expensive, and unnecessary. It's a complete waste to spend time rendering a bunch of detailed rocks and trees, only to draw a wall in front of them later on (overdraw).</p> <p>Two pass occlusion culling is the method that we're going to use to reduce overdraw. We're going to start by drawing all the clusters that actually contributed to the rendered image last frame, under the assumption that those are a good approximation of what will contribute to the rendered image <em>this</em> frame. That's the first pass. Then, we can build a depth pyramid, and use that to cull all the clusters that we didn't look at in the first pass, i.e. that didn't render last frame. The clusters that survive the culling get drawn. That's the second pass.</p> <p>In the earlier example of the wall with rocks and trees behind it, we could see that last frame the wall clusters contributed pixels to the final image, but none of the rock or tree clusters did.
Therefore in the first pass, we would draw only the wall, and then build a depth pyramid from the resulting depth. In the second pass, we would test the remaining clusters (all the trees and rocks) against the depth pyramid, and see that they would still be occluded by the wall, and therefore we can skip drawing them. If there were some new rocks that came into view as we peeked around the corner, they'd be drawn here. The second pass functions as a cleanup pass, for rendering the objects that we missed in the first pass.</p> <p>Done correctly, two pass occlusion culling reduces the number of clusters we draw in an average frame, saving rendering time without any visible artifacts.</p> <h3 id="initial-cluster-processing">Initial Cluster Processing<a class="zola-anchor" href="#initial-cluster-processing" aria-label="Anchor link for: initial-cluster-processing" style="visibility: hidden;"></a> </h3> <p>Before we start looking at the algorithm steps and code, I'd like to note that this shader is very performance and bug sensitive. I've written and rewritten it several times. While the concepts are simple, it's easy to break the culling, and the choices in data management that we make here affect the rest of the rendering pipeline quite significantly.</p> <p>This is going to be a long and complicated shader, so let's dive into it.</p> <p>The first pass of occlusion culling is another compute shader dispatch with one thread per cluster. A minor detail that I didn't mention last time we saw this pattern is that with millions of clusters in a scene, you would quickly hit the limit of the maximum number of workgroups you can spawn per dispatch dimension if you did a 1d dispatch over all clusters. To work around this, we instead do a 3d dispatch with each dimension of size <code>ceil(cbrt(workgroup_count))</code>.
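</p>
<p>As a rough CPU-side sketch (helper names are mine): with a per-dimension limit of e.g. 65,535 workgroups, a cubic dispatch can cover up to 65,535<sup>3</sup> workgroups, and the shader flattens workgroup <code>(x, y, z)</code> back to a 1d index as <code>x * dim<sup>2</sup> + y * dim + z</code>.</p>

```rust
// Hypothetical CPU-side helper: choose the per-dimension size of a cubic
// dispatch, and mirror the shader's swizzle back to a 1d cluster ID.

const WORKGROUP_SIZE: u32 = 128;

/// ceil(cbrt(workgroup_count)): workgroups per dispatch dimension.
fn dispatch_dim(cluster_count: u32) -> u32 {
    let workgroup_count = cluster_count.div_ceil(WORKGROUP_SIZE);
    (workgroup_count as f64).cbrt().ceil() as u32
}

/// The shader-side flattening: workgroup (x, y, z) plus thread index t -> cluster ID.
/// Matches `local_invocation_id.x + 128u * dot(workgroup_id, vec3(dim * dim, dim, 1u))`.
fn swizzle_to_1d(dim: u32, (x, y, z): (u32, u32, u32), t: u32) -> u32 {
    t + WORKGROUP_SIZE * (x * dim * dim + y * dim + z)
}
```

<p>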
We can then swizzle the workgroup and thread indices back to 1d in the shader.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>compute </span><span style="color:#859900;">@workgroup_size</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">128</span><span>, </span><span style="color:#6c71c4;">1</span><span>, </span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">) </span><span style="color:#586e75;">// 128 threads per workgroup, 1 cluster per thread </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">cull_meshlets</span><span style="color:#657b83;">( </span><span> @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">workgroup_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">workgroup_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span> @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">num_workgroups</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">num_workgroups</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span> @builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">local_invocation_id</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">local_invocation_id</span><span>: vec3&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;, </span><span style="color:#657b83;">) { </span><span style="color:#586e75;">// Calculate the cluster ID for this thread </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> local_invocation_id.x </span><span style="color:#657b83;">+</span><span> 128u </span><span style="color:#657b83;">* </span><span style="color:#859900;">dot</span><span 
style="color:#657b83;">(</span><span>workgroup_id, </span><span style="color:#859900;">vec3</span><span style="color:#657b83;">(</span><span>num_workgroups.x </span><span style="color:#657b83;">*</span><span> num_workgroups.x, num_workgroups.x, 1u</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#859900;">if</span><span> cluster_id </span><span style="color:#657b83;">&gt;=</span><span> arrayLength</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">) { </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Once we know what cluster this thread should process, the next step is to check instance culling. Bevy has the concept of render layers, where certain entities only render for certain views. Before rendering, we uploaded a bitmask of whether each instance was visible for the current view or not. 
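</p>
<p>The CPU-side packing of that bitmask is simple enough to sketch in Rust (a hypothetical helper, not Bevy's actual code; note that, following the shader's convention, a <em>set</em> bit means the instance should be culled for this view):</p>

```rust
/// Pack per-instance cull flags into u32 words, one bit per instance, matching
/// the shader's `extractBits(packed[i / 32u], i % 32u, 1u)` lookup.
fn pack_instance_cull_bits(should_cull: &[bool]) -> Vec<u32> {
    let mut packed = vec![0u32; should_cull.len().div_ceil(32)];
    for (i, &cull) in should_cull.iter().enumerate() {
        packed[i / 32] |= (cull as u32) << (i % 32);
    }
    packed
}

/// The shader-side test for instance i: extract its bit from the packed words.
fn should_cull_instance(packed: &[u32], i: usize) -> bool {
    (packed[i / 32] >> (i % 32)) & 1 == 1
}
```

<p>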
In the shader, we'll just check that bitmask, and early-out if the cluster belongs to an instance that should be culled.</p> <p>The instance ID can be found via indexing into the per-cluster data buffer that we computed in the previous pass (fill cluster buffers).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Check for instance culling </span><span style="color:#268bd2;">let</span><span> instance_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> bit_offset </span><span style="color:#657b83;">=</span><span> instance_id </span><span style="color:#657b83;">%</span><span> 32u; </span><span style="color:#268bd2;">let</span><span> packed_visibility </span><span style="color:#657b83;">=</span><span> meshlet_view_instance_visibility</span><span style="color:#657b83;">[</span><span>instance_id </span><span style="color:#657b83;">/</span><span> 32u</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> should_cull_instance </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">bool</span><span style="color:#657b83;">(</span><span>extractBits</span><span style="color:#657b83;">(</span><span>packed_visibility, bit_offset, 1u</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#859900;">if</span><span> should_cull_instance </span><span style="color:#657b83;">{ </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Assuming the cluster's instance was not culled, we can now start fetching the rest of the cluster's data for culling. 
The instance ID we found also gives us access to the instance uniform, and we can fetch the meshlet ID the same way we did the instance ID. With these two indices, we can also fetch the culling bounding sphere for the cluster's meshlet, and convert it from local to world-space.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Calculate world-space culling bounding sphere for the cluster </span><span style="color:#268bd2;">let</span><span> instance_uniform </span><span style="color:#657b83;">=</span><span> meshlet_instance_uniforms</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> meshlet_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span style="color:#268bd2;">let</span><span> world_from_local </span><span style="color:#657b83;">= </span><span style="color:#859900;">affine3_to_square</span><span style="color:#657b83;">(</span><span>instance_uniform.world_from_local</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> world_scale </span><span style="color:#657b83;">= </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>world_from_local</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">])</span><span>, </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>world_from_local</span><span style="color:#657b83;">[</span><span 
style="color:#6c71c4;">1</span><span style="color:#657b83;">])</span><span>, </span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>world_from_local</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">])))</span><span>; </span><span style="color:#268bd2;">let</span><span> bounding_spheres </span><span style="color:#657b83;">=</span><span> meshlet_bounding_spheres</span><span style="color:#657b83;">[</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>; </span><span>var culling_bounding_sphere_center </span><span style="color:#657b83;">=</span><span> world_from_local </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>bounding_spheres.self_culling.center, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span>var culling_bounding_sphere_radius </span><span style="color:#657b83;">=</span><span> world_scale </span><span style="color:#657b83;">*</span><span> bounding_spheres.self_culling.radius; </span></code></pre> <p>A simple frustum test lets us cull out of view clusters (an early return means the cluster is culled).</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Frustum culling </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var i </span><span style="color:#657b83;">=</span><span> 0u; i </span><span style="color:#657b83;">&lt;</span><span> 6u; i</span><span style="color:#657b83;">++) { </span><span> </span><span style="color:#859900;">if dot</span><span style="color:#657b83;">(</span><span>view.frustum</span><span style="color:#657b83;">[</span><span>i</span><span style="color:#657b83;">]</span><span>, culling_bounding_sphere_center</span><span 
style="color:#657b83;">) +</span><span> culling_bounding_sphere_radius </span><span style="color:#657b83;">&lt;= </span><span style="color:#6c71c4;">0.0 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h3 id="lod-selection">LOD Selection<a class="zola-anchor" href="#lod-selection" aria-label="Anchor link for: lod-selection" style="visibility: hidden;"></a> </h3> <p>Now that we know if a cluster is in view, the next question we need to ask is "Is this cluster's meshlet part of the right cut of the LOD tree?"</p> <p>The goal is to select the set of simplified meshlets such that at the distance we're viewing them from, they have less than 1 pixel of geometric difference from the original set of meshlets at LOD 0 (the base mesh). Note that we're accounting <em>only</em> for geometric differences, and not taking into account material or lighting differences. Doing so is a <em>much</em> harder problem.</p> <p>So, the question is then "how do we determine if the group this meshlet belongs to has less than 1 pixel of geometric error?"</p> <p>When building the meshlet groups during asset preprocessing, we stored the group error relative to the base mesh as the radius of the bounding sphere. We can convert this bounding sphere from local to world-space, project it to view-space, and then check how many pixels on the screen it takes up. If it's less than 1 pixel, then the cluster is imperceptibly different. 
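</p>
<p>Plugging in concrete numbers helps build intuition for that threshold. Here's a quick Rust transliteration of the projection test (my own sketch; <code>proj_00</code> stands in for <code>clip_from_view[0][0]</code>, and I'm assuming a symmetric perspective projection with <code>proj_00 = 1.0</code>, roughly a 90° field of view): a 1mm error sphere viewed from 10m away projects to well under a pixel at 1080p, while a 10cm error at 2m covers tens of pixels.</p>

```rust
/// Project a view-space error sphere to screen pixels and test the 1-pixel threshold.
/// proj_00 = clip_from_view[0][0]; view_size = max(view width, height) in pixels.
fn lod_error_is_imperceptible(
    center_view: [f32; 3], // sphere center in view-space
    radius: f32,           // simplification error, in world units
    proj_00: f32,
    view_size: f32,
) -> bool {
    let d2 = center_view.iter().map(|c| c * c).sum::<f32>();
    let r2 = radius * radius;
    // Projected sphere diameter in UV units, then scaled to pixels.
    let sphere_diameter_uv = proj_00 * radius / (d2 - r2).sqrt();
    sphere_diameter_uv * view_size < 1.0
}
```

<p>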
We're essentially answering the question "if the mesh deformed by X meters, how many pixels of change is that when viewed from the current camera"?</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// https://stackoverflow.com/questions/21648630/radius-of-projected-sphere-in-screen-space/21649403#21649403 </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">lod_error_is_imperceptible</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">sphere_center</span><span>: vec3&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;, </span><span style="color:#268bd2;">sphere_radius</span><span>: </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#268bd2;">bool </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> d2 </span><span style="color:#657b83;">= </span><span style="color:#859900;">dot</span><span style="color:#657b83;">(</span><span>sphere_center, sphere_center</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> r2 </span><span style="color:#657b83;">=</span><span> sphere_radius </span><span style="color:#657b83;">*</span><span> sphere_radius; </span><span> </span><span style="color:#268bd2;">let</span><span> sphere_diameter_uv </span><span style="color:#657b83;">=</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">] *</span><span> sphere_radius </span><span style="color:#657b83;">/ </span><span style="color:#859900;">sqrt</span><span style="color:#657b83;">(</span><span>d2 </span><span style="color:#657b83;">-</span><span> r2</span><span 
style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> view_size </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>view.width, view.height</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> sphere_diameter_pixels </span><span style="color:#657b83;">=</span><span> sphere_diameter_uv </span><span style="color:#657b83;">*</span><span> view_size; </span><span> </span><span style="color:#859900;">return</span><span> sphere_diameter_pixels </span><span style="color:#657b83;">&lt; </span><span style="color:#6c71c4;">1.0</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Knowing if the cluster has imperceptible error is not sufficient by itself. Say you have 4 sets of meshlets - the original one (group 0), and 3 progressively simplified versions (groups 1-3). If group 2 has imperceptible error for the current view, then so would groups 1 and 0. In fact, group 0 will <em>always</em> have imperceptible error, given that it <em>is</em> the base mesh.</p> <p>Given multiple sets of imperceptibly different meshlets, the best set to select is the one made of the fewest triangles (most simplified), which is the highest LOD.</p> <p>Since we're processing each cluster in parallel, we can't communicate between them to choose the correct LOD cut. Instead, we can use a neat trick. We can design a procedure where each cluster evaluates some data, and decides independently whether it's at the correct LOD, in a way that's consistent across all the clusters.</p> <p>The Nanite slides go into the theory more, but it boils down to checking if error is imperceptible for the current cluster, <em>and</em> that its <em>parent's</em> error is <em>not</em> imperceptible. I.e.
this is the simplest cluster we can choose with imperceptible error, and going up to its even simpler parent would cause visible error.</p> <p>We can take the two LOD bounding spheres (the ones containing simplification error) for each meshlet, transform them to view-space, check if the error for each one is imperceptible or not, and then early-out if this cluster is not part of the correct LOD cut.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Calculate view-space LOD bounding sphere for the meshlet </span><span style="color:#268bd2;">let</span><span> lod_bounding_sphere_center </span><span style="color:#657b83;">=</span><span> world_from_local </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>bounding_spheres.self_lod.center, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> lod_bounding_sphere_radius </span><span style="color:#657b83;">=</span><span> world_scale </span><span style="color:#657b83;">*</span><span> bounding_spheres.self_lod.radius; </span><span style="color:#268bd2;">let</span><span> lod_bounding_sphere_center_view_space </span><span style="color:#657b83;">= (</span><span>view.view_from_world </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>lod_bounding_sphere_center.xyz, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">))</span><span>.xyz; </span><span> </span><span style="color:#586e75;">// Calculate view-space LOD bounding sphere for the meshlet&#39;s parent </span><span style="color:#268bd2;">let</span><span> parent_lod_bounding_sphere_center </span><span style="color:#657b83;">=</span><span> world_from_local </span><span
style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>bounding_spheres.parent_lod.center, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> parent_lod_bounding_sphere_radius </span><span style="color:#657b83;">=</span><span> world_scale </span><span style="color:#657b83;">*</span><span> bounding_spheres.parent_lod.radius; </span><span style="color:#268bd2;">let</span><span> parent_lod_bounding_sphere_center_view_space </span><span style="color:#657b83;">= (</span><span>view.view_from_world </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>parent_lod_bounding_sphere_center.xyz, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">))</span><span>.xyz; </span><span> </span><span style="color:#586e75;">// Check LOD cut (meshlet error imperceptible, and parent error not imperceptible) </span><span style="color:#268bd2;">let</span><span> lod_is_ok </span><span style="color:#657b83;">= </span><span style="color:#859900;">lod_error_is_imperceptible</span><span style="color:#657b83;">(</span><span>lod_bounding_sphere_center_view_space, lod_bounding_sphere_radius</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> parent_lod_is_ok </span><span style="color:#657b83;">= </span><span style="color:#859900;">lod_error_is_imperceptible</span><span style="color:#657b83;">(</span><span>parent_lod_bounding_sphere_center_view_space, parent_lod_bounding_sphere_radius</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#859900;">if !</span><span>lod_is_ok </span><span style="color:#859900;">||</span><span> parent_lod_is_ok </span><span style="color:#657b83;">{ </span><span style="color:#859900;">return</span><span>; </span><span style="color:#657b83;">} 
</span></code></pre> <h3 id="occlusion-culling-test">Occlusion Culling Test<a class="zola-anchor" href="#occlusion-culling-test" aria-label="Anchor link for: occlusion-culling-test" style="visibility: hidden;"></a> </h3> <p>We've checked if the cluster is in view (frustum and render layer culling), as well as if it's part of the correct LOD cut. It's now time for the actual occlusion culling part of the first of the two passes for two pass occlusion culling.</p> <p>Our goal in the first pass is to render only clusters that were visible last frame. One possible method would be to store another bitmask of whether each cluster was visible in the current frame, and read from it in the next frame. The problem with this is that it uses a good chunk of memory, and more importantly, does not play well with LODs. Before I implemented LODs I used this method, but with LODs, a cluster that was visible last frame might not be part of the LOD cut in this frame and therefore incorrect to render.</p> <p>Instead of explicitly storing whether a cluster is visible, we're instead going to occlusion cull the clusters against the depth pyramid from the <em>previous</em> frame. We can take the culling bounding sphere of the cluster, project it to view-space using the previous frame's set of transforms, and then project it to a screen-space axis-aligned bounding box (AABB). We can then compare the view-space depth of the bounding sphere's extents with every pixel of the depth buffer that the AABB we calculated covers. If all depth pixels show that there is geometry in front of the bounding sphere, then the mesh was not visible last frame, and therefore should not be rendered in the first occlusion culling pass.</p> <p>Of course sampling every pixel an AABB covers would be extremely expensive, and cache inefficient. Instead we'll use a depth <em>pyramid</em>, which is a mipmapped version of the depth buffer. 
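</p> <p>As a CPU-side sketch (hypothetical helper functions for illustration, not Bevy's actual code), the min-reduction that builds such a pyramid, and the matching mip selection used later when sampling it, look roughly like this:</p>

```rust
// Sketch of a depth pyramid min-reduction. With reverse-Z depth (as Bevy
// uses), min keeps the *farthest* occluder of each 2x2 quad, so sampling a
// coarser mip never reports an occluder closer than any covered pixel.
fn downsample_min(depth: &[f32], size: usize) -> Vec<f32> {
    let half = size / 2;
    let mut out = vec![0.0f32; half * half];
    for y in 0..half {
        for x in 0..half {
            let a = depth[(2 * y) * size + 2 * x];
            let b = depth[(2 * y) * size + 2 * x + 1];
            let c = depth[(2 * y + 1) * size + 2 * x];
            let d = depth[(2 * y + 1) * size + 2 * x + 1];
            out[y * half + x] = a.min(b).min(c).min(d);
        }
    }
    out
}

// Mirrors the shader's `u32(ceil(log2(max(width, height))))`: at this mip
// level, an AABB of the given extent (in mip 0 pixels) spans at most a
// 2x2 quad of pyramid pixels.
fn mip_level_for_aabb(width_px: f32, height_px: f32) -> u32 {
    width_px.max(height_px).log2().ceil().max(0.0) as u32
}

fn main() {
    // 4x4 depth buffer -> 2x2 mip 1 -> 1x1 mip 2.
    let mip0 = [
        0.9, 0.8, 0.5, 0.5,
        0.7, 0.6, 0.5, 0.5,
        0.3, 0.3, 1.0, 1.0,
        0.3, 0.3, 1.0, 0.2,
    ];
    let mip1 = downsample_min(&mip0, 4);
    assert_eq!(mip1, vec![0.6, 0.5, 0.3, 0.2]);
    assert_eq!(downsample_min(&mip1, 2), vec![0.2]);

    // A 7x3 pixel AABB needs mip 3; subpixel AABBs clamp to mip 0.
    assert_eq!(mip_level_for_aabb(7.0, 3.0), 3);
    assert_eq!(mip_level_for_aabb(0.5, 0.5), 0);
}
```

<p>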
Each pixel in MIP 1 corresponds to the min of 4 pixels from MIP 0, each pixel in MIP 2 corresponds to the min of 4 pixels from MIP 1, etc down to a 1x1 layer. Now we only have to sample 4 pixels for each AABB, choosing the mip level that best fits the AABB onto a 2x2 quad. Don't worry about how we generate the depth pyramid for now, we'll talk about that more later.</p> <p>If any of that was confusing, read up on occlusion culling and depth pyramids. The important takeaway is that we're using the previous frame's depth pyramid in the first occlusion culling pass to find which clusters would have been visible last frame.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Project the culling bounding sphere to view-space for occlusion culling </span><span style="color:#268bd2;">let</span><span> previous_world_from_local </span><span style="color:#657b83;">= </span><span style="color:#859900;">affine3_to_square</span><span style="color:#657b83;">(</span><span>instance_uniform.previous_world_from_local</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#268bd2;">let</span><span> previous_world_from_local_scale </span><span style="color:#657b83;">= </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>previous_world_from_local</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">])</span><span>, </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>previous_world_from_local</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">])</span><span>, </span><span 
style="color:#859900;">length</span><span style="color:#657b83;">(</span><span>previous_world_from_local</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">])))</span><span>; </span><span>culling_bounding_sphere_center </span><span style="color:#657b83;">=</span><span> previous_world_from_local </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>bounding_spheres.self_culling.center, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span>culling_bounding_sphere_radius </span><span style="color:#657b83;">=</span><span> previous_world_from_local_scale </span><span style="color:#657b83;">*</span><span> bounding_spheres.self_culling.radius; </span><span style="color:#268bd2;">let</span><span> culling_bounding_sphere_center_view_space </span><span style="color:#657b83;">= (</span><span>view.view_from_world </span><span style="color:#657b83;">* </span><span style="color:#859900;">vec4</span><span style="color:#657b83;">(</span><span>culling_bounding_sphere_center.xyz, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">))</span><span>.xyz; </span><span> </span><span style="color:#268bd2;">let</span><span> aabb </span><span style="color:#657b83;">= </span><span style="color:#859900;">project_view_space_sphere_to_screen_space_aabb</span><span style="color:#657b83;">(</span><span>culling_bounding_sphere_center_view_space, culling_bounding_sphere_radius</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#586e75;">// Halve the view-space AABB size as the depth pyramid is half the view size </span><span style="color:#268bd2;">let</span><span> depth_pyramid_size_mip_0 </span><span style="color:#657b83;">= </span><span>vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span 
style="color:#657b83;">(</span><span>textureDimensions</span><span style="color:#657b83;">(</span><span>depth_pyramid, </span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)) * </span><span style="color:#6c71c4;">0.5</span><span>; </span><span style="color:#268bd2;">let</span><span> width </span><span style="color:#657b83;">= (</span><span>aabb.z </span><span style="color:#657b83;">-</span><span> aabb.x</span><span style="color:#657b83;">) *</span><span> depth_pyramid_size_mip_0.x; </span><span style="color:#268bd2;">let</span><span> height </span><span style="color:#657b83;">= (</span><span>aabb.w </span><span style="color:#657b83;">-</span><span> aabb.y</span><span style="color:#657b83;">) *</span><span> depth_pyramid_size_mip_0.y; </span><span style="color:#586e75;">// Note: I&#39;ve seen people use floor instead of ceil here, but it seems to result in culling bugs. </span><span style="color:#586e75;">// The max(0, x) is also important to prevent out of bounds accesses. 
</span><span style="color:#268bd2;">let</span><span> depth_level </span><span style="color:#657b83;">= </span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span>, </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">(</span><span style="color:#859900;">ceil</span><span style="color:#657b83;">(</span><span style="color:#859900;">log2</span><span style="color:#657b83;">(</span><span style="color:#859900;">max</span><span style="color:#657b83;">(</span><span>width, height</span><span style="color:#657b83;">)))))</span><span>; </span><span style="color:#268bd2;">let</span><span> depth_pyramid_size </span><span style="color:#657b83;">= </span><span>vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">(</span><span>textureDimensions</span><span style="color:#657b83;">(</span><span>depth_pyramid, depth_level</span><span style="color:#657b83;">))</span><span>; </span><span style="color:#268bd2;">let</span><span> aabb_top_left </span><span style="color:#657b83;">= </span><span>vec2&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;</span><span style="color:#657b83;">(</span><span>aabb.xy </span><span style="color:#657b83;">*</span><span> depth_pyramid_size</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#586e75;">// Note: I&#39;d use a min sampler reduction here if it were available in wgpu. </span><span style="color:#586e75;">// textureGather() can&#39;t be used either, as it doesn&#39;t let you specify a mip level.
</span><span style="color:#268bd2;">let</span><span> depth_quad_a </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>depth_pyramid, aabb_top_left, depth_level</span><span style="color:#657b83;">)</span><span>.x; </span><span style="color:#268bd2;">let</span><span> depth_quad_b </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>depth_pyramid, aabb_top_left </span><span style="color:#657b83;">+ </span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span>1u, 0u</span><span style="color:#657b83;">)</span><span>, depth_level</span><span style="color:#657b83;">)</span><span>.x; </span><span style="color:#268bd2;">let</span><span> depth_quad_c </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>depth_pyramid, aabb_top_left </span><span style="color:#657b83;">+ </span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span>0u, 1u</span><span style="color:#657b83;">)</span><span>, depth_level</span><span style="color:#657b83;">)</span><span>.x; </span><span style="color:#268bd2;">let</span><span> depth_quad_d </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>depth_pyramid, aabb_top_left </span><span style="color:#657b83;">+ </span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span>1u, 1u</span><span style="color:#657b83;">)</span><span>, depth_level</span><span style="color:#657b83;">)</span><span>.x; </span><span style="color:#268bd2;">let</span><span> occluder_depth </span><span style="color:#657b83;">= </span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>depth_quad_a, depth_quad_b</span><span style="color:#657b83;">)</span><span>, 
</span><span style="color:#859900;">min</span><span style="color:#657b83;">(</span><span>depth_quad_c, depth_quad_d</span><span style="color:#657b83;">))</span><span>; </span><span> </span><span style="color:#586e75;">// Check whether or not the cluster would be occluded if drawn </span><span>var cluster_visible: </span><span style="color:#268bd2;">bool</span><span>; </span><span style="color:#859900;">if</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">] == </span><span style="color:#6c71c4;">1.0 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Orthographic </span><span> </span><span style="color:#268bd2;">let</span><span> sphere_depth </span><span style="color:#657b83;">=</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">] + (</span><span>culling_bounding_sphere_center_view_space.z </span><span style="color:#657b83;">+</span><span> culling_bounding_sphere_radius</span><span style="color:#657b83;">) *</span><span> view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">]</span><span>; </span><span> cluster_visible </span><span style="color:#657b83;">=</span><span> sphere_depth </span><span style="color:#657b83;">&gt;=</span><span> occluder_depth; </span><span style="color:#657b83;">} </span><span style="color:#859900;">else </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#586e75;">// Perspective </span><span> </span><span style="color:#268bd2;">let</span><span> sphere_depth </span><span style="color:#657b83;">= 
-</span><span>view.clip_from_view</span><span style="color:#657b83;">[</span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">][</span><span style="color:#6c71c4;">2</span><span style="color:#657b83;">] / (</span><span>culling_bounding_sphere_center_view_space.z </span><span style="color:#657b83;">+</span><span> culling_bounding_sphere_radius</span><span style="color:#657b83;">)</span><span>; </span><span> cluster_visible </span><span style="color:#657b83;">=</span><span> sphere_depth </span><span style="color:#657b83;">&gt;=</span><span> occluder_depth; </span><span style="color:#657b83;">} </span></code></pre> <h3 id="result-writeout">Result Writeout<a class="zola-anchor" href="#result-writeout" aria-label="Anchor link for: result-writeout" style="visibility: hidden;"></a> </h3> <p>We're finally at the last step of the first occlusion culling pass/dispatch. As a reminder, everything from after the fill cluster buffers step until the end of this section has all been one shader. I warned you it would be long!</p> <p>The last step for this pass is to write out the results of which clusters should render. This pass is just a compute shader - it doesn't actually render anything. We're just going to fill out the arguments for a single indirect draw command (more on this in the next pass).</p> <p>First, before we get to the indirect draw, we need to write out another piece of data. The second occlusion culling pass later will want to operate only on clusters in view, that passed the LOD test, and that were <em>not</em> drawn in the first pass. That means we didn't early return during the frustum culling or LOD test, and that cluster_visible was false from the occlusion culling test.</p> <p>In order for the second occlusion pass to know which clusters satisfy these conditions, we'll write out another bitmask of 1 bit per cluster, with clusters that the second occlusion pass should operate on having their bit set to 1.
An atomicOr takes care of setting each cluster's bit in parallel amongst all threads.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Write if the cluster should be occlusion tested in the second pass </span><span style="color:#859900;">if !</span><span>cluster_visible </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> bit </span><span style="color:#657b83;">=</span><span> 1u </span><span style="color:#657b83;">&lt;&lt;</span><span> cluster_id </span><span style="color:#657b83;">%</span><span> 32u; </span><span> atomicOr</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>meshlet_second_pass_candidates</span><span style="color:#657b83;">[</span><span>cluster_id </span><span style="color:#657b83;">/</span><span> 32u</span><span style="color:#657b83;">]</span><span>, bit</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>Now we have the final step of filling out the indirect draw data for the clusters that we <em>do</em> want to draw in the first pass.</p> <p>We can do an atomicAdd on the DrawIndirectArgs::vertex_count with the meshlet's vertex count (triangle count * 3). This does two things:</p> <ol> <li>Adds more vertex invocations to the indirect draw for this cluster's triangles</li> <li>Reserves space in a large buffer for all of this cluster's triangles to write out a per-triangle number</li> </ol> <p>With the draw_triangle_buffer space reserved, we can then fill it with an encoded u32 integer: 26 bits for the cluster ID, and 6 bits for the triangle ID within the cluster's meshlet. 
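</p> <p>As a quick illustration, the packing and unpacking can be sketched in Rust (the shader does the equivalent with a shift and extractBits()):</p>

```rust
// Pack a cluster ID (26 bits) and a triangle ID (6 bits) into one u32, as
// written into draw_triangle_buffer during the culling pass.
fn pack_ids(cluster_id: u32, triangle_id: u32) -> u32 {
    debug_assert!(cluster_id < (1 << 26));
    debug_assert!(triangle_id < (1 << 6));
    (cluster_id << 6) | triangle_id
}

// The raster pass recovers the IDs with `packed >> 6` and the low 6 bits.
fn unpack_ids(packed: u32) -> (u32, u32) {
    (packed >> 6, packed & 0x3f)
}

fn main() {
    let packed = pack_ids(123_456, 63);
    assert_eq!(unpack_ids(packed), (123_456, 63));
}
```

<p>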
6 bits gives us 2^6 = 64 possible values, which is perfect as when we were building meshlets during asset preprocessing, we limited each meshlet to max 64 vertices and 64 triangles.</p> <p>During vertex shading in the next pass, each vertex invocation will be able to use this buffer to know what triangle and cluster it belongs to.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// Append a list of this cluster&#39;s triangles to draw if not culled </span><span style="color:#859900;">if</span><span> cluster_visible </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_triangle_count </span><span style="color:#657b83;">=</span><span> meshlets</span><span style="color:#657b83;">[</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>.triangle_count; </span><span> </span><span style="color:#268bd2;">let</span><span> buffer_start </span><span style="color:#657b83;">=</span><span> atomicAdd</span><span style="color:#657b83;">(</span><span style="color:#859900;">&amp;</span><span>draw_indirect_args.vertex_count, meshlet_triangle_count </span><span style="color:#657b83;">*</span><span> 3u</span><span style="color:#657b83;">) /</span><span> 3u; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id_packed </span><span style="color:#657b83;">=</span><span> cluster_id </span><span style="color:#657b83;">&lt;&lt;</span><span> 6u; </span><span> </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>var triangle_id </span><span style="color:#657b83;">=</span><span> 0u; triangle_id </span><span style="color:#657b83;">&lt;</span><span> meshlet_triangle_count; triangle_id</span><span style="color:#657b83;">++) { </span><span> draw_triangle_buffer</span><span style="color:#657b83;">[</span><span>buffer_start 
</span><span style="color:#657b83;">+</span><span> triangle_id</span><span style="color:#657b83;">] =</span><span> cluster_id_packed </span><span style="color:#859900;">|</span><span> triangle_id; </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <h2 id="raster-first-pass">Raster (First Pass)<a class="zola-anchor" href="#raster-first-pass" aria-label="Anchor link for: raster-first-pass" style="visibility: hidden;"></a> </h2> <p>We've now determined what to draw, so it's time to draw it.</p> <p>As I mentioned in the previous section, we're doing a single draw_indirect() call to rasterize every single cluster at once, using the DrawIndirectArgs buffer we filled out in the previous pass.</p> <p>We're going to render to a few different render targets:</p> <ul> <li>Depth buffer</li> <li>Visibility buffer (optional, not rendered for shadow map views)</li> <li>Material depth (optional, not rendered for shadow map views)</li> </ul> <p>The depth buffer is straightforward. The visibility buffer is a R32Uint texture storing the cluster ID + triangle ID packed together in the same way as during the culling pass. Material depth is a R16Uint texture storing the material ID. The visibility buffer and material depth textures will be used in a later pass for shading.</p> <p>Note that it would be better to skip writing material depth here, and write it out as part of the later copy material depth pass. This pass is going to change in the near future when I add software rasterization however (more on this in a second), so for now I've left it as-is.</p> <p>I won't show the entire shader, but getting the triangle data to render for each vertex is fairly straightforward. The vertex invocation index can be used to index into the draw_triangle_buffer that we wrote out during the culling pass, giving us a packed cluster ID and triangle ID. 
The vertex invocation index % 3 gives us which vertex within the triangle this is, and then we can lookup the cluster's meshlet and instance data as normal. Vertex data can be obtained by following the tree of indices using the index ID and meshlet info.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>vertex </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">vertex</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">vertex_index</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">vertex_index</span><span>: </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">) </span><span>-&gt; VertexOutput </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> packed_ids </span><span style="color:#657b83;">=</span><span> draw_triangle_buffer</span><span style="color:#657b83;">[</span><span>vertex_index </span><span style="color:#657b83;">/</span><span> 3u</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> packed_ids </span><span style="color:#657b83;">&gt;&gt;</span><span> 6u; </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet </span><span style="color:#657b83;">=</span><span> meshlets</span><span style="color:#657b83;">[</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> 
</span><span style="color:#268bd2;">let</span><span> triangle_id </span><span style="color:#657b83;">=</span><span> extractBits</span><span style="color:#657b83;">(</span><span>packed_ids, 0u, 6u</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> index_id </span><span style="color:#657b83;">= (</span><span>triangle_id </span><span style="color:#657b83;">*</span><span> 3u</span><span style="color:#657b83;">) + (</span><span>vertex_index </span><span style="color:#657b83;">%</span><span> 3u</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> index </span><span style="color:#657b83;">= </span><span style="color:#859900;">get_meshlet_index</span><span style="color:#657b83;">(</span><span>meshlet.start_index_id </span><span style="color:#657b83;">+</span><span> index_id</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> vertex_id </span><span style="color:#657b83;">=</span><span> meshlet_vertex_ids</span><span style="color:#657b83;">[</span><span>meshlet.start_vertex_id </span><span style="color:#657b83;">+</span><span> index</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> vertex </span><span style="color:#657b83;">= </span><span style="color:#859900;">unpack_meshlet_vertex</span><span style="color:#657b83;">(</span><span>meshlet_vertex_data</span><span style="color:#657b83;">[</span><span>vertex_id</span><span style="color:#657b83;">])</span><span>; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> instance_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_instance_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> 
instance_uniform </span><span style="color:#657b83;">=</span><span> meshlet_instance_uniforms</span><span style="color:#657b83;">[</span><span>instance_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// ... </span><span style="color:#657b83;">} </span></code></pre> <p><img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/depth_buffer.png" alt="Depth buffer" /> <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/visbuffer.png" alt="Visibility buffer" /> <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/material_depth.png" alt="Material depth" /></p> <blockquote> <p>Quad overdraw from Renderdoc <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/quad_overdraw.png" alt="Quad overdraw from Renderdoc" /> Triangle size from Renderdoc <img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/triangle_size.png" alt="Triangle size from Renderdoc" /></p> </blockquote> <hr /> <p>With the overview out of the way, the real topic to discuss for this pass is "why a single draw indirect?" There are several other possibilities I could have gone with:</p> <ul> <li>Mesh shaders</li> <li>Single draw indexed indirect after writing out an index buffer during the culling pass</li> <li>Single draw indirect, with a cluster ID buffer, snapping extra vertex invocations to NaN</li> <li>Multi draw indirect with a sub-draw per cluster</li> <li>Multi draw indirect with a sub-draw per meshlet triangle count bin</li> <li>Software rasterization</li> </ul> <p>Mesh shaders are sadly not supported by wgpu, so that's out. They would be the best option for taking advantage of GPU hardware.</p> <p>Single draw indexed indirect was what I originally used. It's about 10-20% faster (if I remember correctly, it's been a while) than the non-indexed variant I use now. 
However, that means we would need to allocate an index buffer for our worst-case usage at 12 bytes/triangle. That's extremely expensive for the amount of geometry we want to deal with, and you'd quickly run into buffer size limits (~2GB on most platforms). You could dynamically allocate a new buffer size based on the number of rendered triangles after culling with some CPU readback and some heuristics, but that's more complicated and still very memory hungry. Single draw indirect with the 4 bytes/triangle draw_triangle_buffer that I ended up using is still expensive, but good enough to scrape by for now.</p> <p>Single draw indirect with a buffer of cluster IDs is also an option. Each meshlet has max 64 triangles, so we could spawn cluster_count * 64 * 3 vertex invocations. Vertex invocation index / (64 * 3) would give you an index into the cluster ID buffer, and the triangle ID is easy to recover via some simple arithmetic. At 4 bytes/cluster, this option is <em>much</em> cheaper in memory than any of the previous methods. The problem is how to handle excess vertex invocations. Not all meshlets will have a full 64 triangles. It's easy enough to have each vertex invocation check the meshlet's triangle count, and if it's not needed, write out a NaN position, causing the GPU to ignore the triangle. Unfortunately, this performed very poorly when I tested it. All those dummy NaN triangles took up valuable fixed-function time that the GPU could have spent processing other triangles. Maybe performance would be better if I were able to get meshlets much closer to the max triangle count, or if I halved the max triangle count to 32 per meshlet to spawn fewer dummy triangles, but I ended up not pursuing this method.</p> <p>Multi draw is also an option. We could write out a buffer with 1 DrawIndirectArgs per cluster, giving 16 bytes/cluster. Each sub-draw would contain exactly the right number of vertex invocations per cluster.
Each vertex invocation would be able to recover its cluster ID via the instance_id builtin, as we would set DrawIndirectArgs::first_instance to the cluster ID. On the CPU, this would still be a single draw call. In practice, I found this still performed poorly. While we are no longer bottlenecked by the GPU having to process dummy triangles, now the GPU's command processor has to process all these sub-commands. At 1 sub-command per cluster, that's a <em>lot</em> of commands. Like the fixed 64 vertex invocations per cluster path, we're again bottlenecked on something that isn't actual rasterization work.</p> <p>An additional idea I thought of while writing this section is to bin each cluster by its meshlet triangle count. All clusters whose meshlets have 10 triangles would go in one bin, 12 triangles in a second bin, 46 triangles in a third bin, etc., for 64 bins total (we would never have a meshlet with 0 triangles). We could then write out a DrawIndirectArgs and list of cluster IDs per bin, and do a single multi_draw_indirect() call on the CPU, similar to the last section. I haven't tested it out, but this seems like a decent option in theory. I believe Nanite does something similar in recent versions of Unreal Engine 5 in order to support different types of vertex shaders.</p> <p>Finally, we could use software rasterization. We could write out a list of cluster IDs, spawn 1 workgroup per cluster, and have each workgroup manually rasterize the cluster via some linear algebra, bypassing fixed-function GPU hardware entirely. This is what Nanite does for over 90% of its clusters. Only large clusters and clusters needing depth clipping are rendered via hardware draws. Not only is this one of the most memory-efficient options, it's faster than hardware draws for the majority of clusters (hence why Nanite uses it so heavily). Unfortunately, wgpu once again lacks support for a needed feature, this time 64-bit texture atomics.
The good news is that @atlv24 is working on adding support for this feature, and I'm looking forward to implementing software rasterization in a future release of Bevy.</p> <h2 id="downsample-depth">Downsample Depth<a class="zola-anchor" href="#downsample-depth" aria-label="Anchor link for: downsample-depth" style="visibility: hidden;"></a> </h2> <p>With the first of the two passes of two pass occlusion culling rendered, it's time to prepare for the second pass. Namely, we need to generate a new depth pyramid based on the depth buffer we just rendered.</p> <p>For generating the depth pyramid, I ported the FidelityFX Single Pass Downsampler (SPD) to Bevy. SPD lets us perform the downsampling very efficiently, entirely in a single compute dispatch. You could use multiple raster passes, but that's extremely expensive in both CPU time (command recording and wgpu resource tracking), and GPU time (bandwidth reading/writing between passes, pipeline bubbles as the GPU spins up and down between passes).</p> <p>For now, we're actually using two compute dispatches, not one. Wgpu lacks support for globallycoherent buffers, so we have to split the dispatch in two to ensure writes made by the first are visible to the second. I also did not implement the subgroup version of SPD, as wgpu lacked support at the time (it has it now, minus quad operations, which SPD does need). It's still very fast despite these small deficiencies.</p> <p>One important note is that we need to ensure that the depth pyramid is conservative. For non-power-of-two depth textures, for instance, we might need special handling of the downsampling. Same for when we sample the depth pyramid during occlusion culling. I haven't done anything special to handle this, but it seems to work well enough.
I'm not entirely confident in the edge cases here though.</p> <p><img src="https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/depth_pyramid.png" alt="Depth pyramid" /></p> <h2 id="culling-second-pass">Culling (Second Pass)<a class="zola-anchor" href="#culling-second-pass" aria-label="Anchor link for: culling-second-pass" style="visibility: hidden;"></a> </h2> <p>The second culling pass is where we decide whether to render the rest of the clusters - the ones that we didn't think were a good set of occluders for the scene, and decided to hold off on rendering.</p> <p>This culling pass is much the same as the first, with a few key differences:</p> <ul> <li>We skip frustum and LOD culling, as we already did them in the first pass</li> <li>We operate only on the clusters that we explicitly marked as second pass candidates during the first culling pass <ul> <li>We're still doing a large 3d dispatch over all clusters in the scene, but we can early-out for the clusters that are not second pass candidates</li> </ul> </li> <li>We use the current frame's transforms for occlusion culling, instead of last frame's</li> <li>We occlusion cull using the depth pyramid generated from the previous pass</li> </ul> <p>By doing this, we can skip drawing any clusters that would be occluded by the geometry we already rendered in the first pass.</p> <p>As a result of this pass, we have another <code>DrawIndirectArgs</code> we can use to draw the remaining clusters.</p> <h2 id="raster-second-pass">Raster (Second Pass)<a class="zola-anchor" href="#raster-second-pass" aria-label="Anchor link for: raster-second-pass" style="visibility: hidden;"></a> </h2> <p>This pass is identical to the first raster pass, just with the new set of clusters from the second culling pass.</p> <p>Given that the camera and scene are static in the example frame that we're looking at, the first pass perfectly calculated occlusion, and there is nothing left to render in this pass.</p> <h2 id="copy-material-depth">Copy Material 
Depth<a class="zola-anchor" href="#copy-material-depth" aria-label="Anchor link for: copy-material-depth" style="visibility: hidden;"></a> </h2> <p>For reasons we'll get to in the material shading pass, we need to copy the R16Uint material depth texture we rasterized earlier to an actual Depth16Unorm depth texture. A simple fullscreen triangle pass with a sample and a divide performs the copy.</p> <p>I mentioned earlier that ideally we wouldn't write out the material depth during the rasterization pass. It would be better to instead write it out during this pass, by sampling the visibility buffer, looking up the material ID from the cluster ID, and then writing it out to the depth texture directly. I intend to switch to this method in the near future.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">#</span><span>import bevy_core_pipeline::fullscreen_vertex_shader::FullscreenVertexOutput </span><span> </span><span style="color:#859900;">@group</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">) </span><span style="color:#859900;">@binding</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span> var material_depth: texture_2d&lt;</span><span style="color:#268bd2;">u32</span><span>&gt;; </span><span> </span><span style="color:#586e75;">/// This pass copies the R16Uint material depth texture to an actual Depth16Unorm depth texture. 
</span><span> </span><span style="color:#859900;">@</span><span>fragment </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">copy_material_depth</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">in</span><span>: FullscreenVertexOutput</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@builtin</span><span style="color:#657b83;">(</span><span>frag_depth</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">f32 </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#859900;">return </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>textureLoad</span><span style="color:#657b83;">(</span><span>material_depth, vec2&lt;</span><span style="color:#268bd2;">i32</span><span>&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">in</span><span>.position.xy</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span>.r</span><span style="color:#657b83;">) / </span><span style="color:#6c71c4;">65535.0</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <h2 id="material-shading">Material Shading<a class="zola-anchor" href="#material-shading" aria-label="Anchor link for: material-shading" style="visibility: hidden;"></a> </h2> <p>At this point we have the visibility buffer texture containing packed cluster and triangle IDs per pixel, and the material depth texture containing the material ID as a floating point depth value.</p> <p>Now, it's time to apply materials to the frame in a set of "material shading" draws. Note that we're not necessarily rendering a lit and shaded scene. The meshlet feature works with all of Bevy's existing rendering modes (forward, forward + prepass, and deferred). 
For instance, we could be rendering a GBuffer here, or a normal and motion vector prepass.</p> <h3 id="vertex-shader">Vertex Shader<a class="zola-anchor" href="#vertex-shader" aria-label="Anchor link for: vertex-shader" style="visibility: hidden;"></a> </h3> <p>For each material, we will perform one draw call of a fullscreen triangle.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">// 1 fullscreen triangle draw per material </span><span style="color:#859900;">for </span><span style="color:#657b83;">(</span><span>material_id, material_pipeline_id, material_bind_group</span><span style="color:#657b83;">) </span><span style="color:#859900;">in</span><span> meshlet_view_materials.</span><span style="color:#859900;">iter</span><span style="color:#657b83;">() { </span><span> </span><span style="color:#859900;">if</span><span> meshlet_gpu_scene.</span><span style="color:#859900;">material_present_in_scene</span><span style="color:#657b83;">(</span><span>material_id</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#859900;">if </span><span style="color:#268bd2;">let </span><span style="color:#859900;">Some</span><span style="color:#657b83;">(</span><span>material_pipeline</span><span style="color:#657b83;">) =</span><span> pipeline_cache.</span><span style="color:#859900;">get_render_pipeline</span><span style="color:#657b83;">(*</span><span>material_pipeline_id</span><span style="color:#657b83;">) { </span><span> </span><span style="color:#268bd2;">let</span><span> x </span><span style="color:#657b83;">= *</span><span>material_id </span><span style="color:#657b83;">* </span><span style="color:#6c71c4;">3</span><span>; </span><span> render_pass.</span><span style="color:#859900;">set_render_pipeline</span><span style="color:#657b83;">(</span><span>material_pipeline</span><span style="color:#657b83;">)</span><span>; 
</span><span> render_pass.</span><span style="color:#859900;">set_bind_group</span><span style="color:#657b83;">(</span><span style="color:#6c71c4;">2</span><span>, material_bind_group, </span><span style="color:#859900;">&amp;</span><span style="color:#657b83;">[])</span><span>; </span><span> render_pass.</span><span style="color:#859900;">draw</span><span style="color:#657b83;">(</span><span>x</span><span style="color:#859900;">..</span><span style="color:#657b83;">(</span><span>x </span><span style="color:#657b83;">+ </span><span style="color:#6c71c4;">3</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">0</span><span style="color:#859900;">..</span><span style="color:#6c71c4;">1</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span style="color:#657b83;">} </span><span> </span><span style="color:#657b83;">} </span><span style="color:#657b83;">} </span></code></pre> <p>Note that we're not drawing the typical 0..3 vertices for a fullscreen triangle. 
Instead, we're drawing 0..3 for the first material, 3..6 for the second material, 6..9 for the third material, etc.</p> <p>In the vertex shader (which is hardcoded for all materials), we can derive the material_id of the draw from the vertex index, and then use that to set the depth of the triangle.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#859900;">@</span><span>vertex </span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">vertex</span><span style="color:#657b83;">(</span><span>@builtin</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">vertex_index</span><span style="color:#657b83;">) </span><span style="color:#268bd2;">vertex_input</span><span>: </span><span style="color:#268bd2;">u32</span><span style="color:#657b83;">) </span><span>-&gt; </span><span style="color:#859900;">@builtin</span><span style="color:#657b83;">(</span><span>position</span><span style="color:#657b83;">) </span><span>vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt; </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> vertex_index </span><span style="color:#657b83;">=</span><span> vertex_input </span><span style="color:#657b83;">%</span><span> 3u; </span><span> </span><span style="color:#268bd2;">let</span><span> material_id </span><span style="color:#657b83;">=</span><span> vertex_input </span><span style="color:#657b83;">/</span><span> 3u; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> material_depth </span><span style="color:#657b83;">= </span><span style="color:#268bd2;">f32</span><span style="color:#657b83;">(</span><span>material_id</span><span style="color:#657b83;">) / </span><span style="color:#6c71c4;">65535.0</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> uv </span><span 
style="color:#657b83;">= </span><span>vec2&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">(</span><span style="color:#859900;">vec2</span><span style="color:#657b83;">(</span><span>vertex_index </span><span style="color:#657b83;">&gt;&gt;</span><span> 1u, vertex_index </span><span style="color:#859900;">&amp;</span><span> 1u</span><span style="color:#657b83;">)) * </span><span style="color:#6c71c4;">2.0</span><span>; </span><span> </span><span> </span><span style="color:#859900;">return vec4</span><span style="color:#657b83;">(</span><span style="color:#859900;">uv_to_ndc</span><span style="color:#657b83;">(</span><span>uv</span><span style="color:#657b83;">)</span><span>, material_depth, </span><span style="color:#6c71c4;">1.0</span><span style="color:#657b83;">)</span><span>; </span><span style="color:#657b83;">} </span></code></pre> <p>The material's pipeline depth comparison function will be set to equals, so we only shade fragments for which the depth of the triangle is equal to the depth in the depth buffer. The depth buffer attached here is the material depth texture we rendered earlier. Thus, each fullscreen triangle draw per material will only shade the fragments for that material.</p> <p>Note that this is pretty inefficient if you have many materials. Each fullscreen triangle will cost an entire screen's worth of depth comparisons. In the future I'd like to switch to compute-shader-based material shading.</p> <h3 id="fragment-shader">Fragment Shader<a class="zola-anchor" href="#fragment-shader" aria-label="Anchor link for: fragment-shader" style="visibility: hidden;"></a> </h3> <p>Now that we've determined what fragments to shade, it's time to apply the material's shader code to those fragments. Each fragment can sample the visibility buffer, recovering the cluster ID and triangle ID. 
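</p>
<p>As a quick illustration of that packing (matching the <code>packed_ids &gt;&gt; 6u</code> and <code>extractBits(packed_ids, 0u, 6u)</code> logic in the resolve function): the low 6 bits of each visibility buffer texel hold the triangle ID, and the remaining upper bits hold the cluster ID. A minimal Rust model of the scheme:</p>

```rust
// Model of the visibility buffer packing: the low 6 bits store the triangle ID
// (6 bits = up to 64 triangles per meshlet), the upper 26 bits the cluster ID.

fn pack_ids(cluster_id: u32, triangle_id: u32) -> u32 {
    debug_assert!(triangle_id < 64, "triangle IDs must fit in 6 bits");
    (cluster_id << 6) | triangle_id
}

fn unpack_ids(packed: u32) -> (u32, u32) {
    // Mirrors `packed_ids >> 6u` and `extractBits(packed_ids, 0u, 6u)` in WGSL.
    (packed >> 6, packed & 0x3F)
}
```

<p>A cluster ID of 1234 with triangle 37 round-trips cleanly: <code>unpack_ids(pack_ids(1234, 37)) == (1234, 37)</code>.</p>
<p>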
Like before, this provides us access to the rest of the instance and mesh data.</p> <p>The remaining tricky bit is that since we're not actually rendering a mesh in the draw call, and are using a single triangle just to cover some fragments to shade, we don't have automatic interpolation of vertex attributes within a mesh triangle or screen-space derivatives for mipmapped texture sampling.</p> <p>To compute this data ourselves, each fragment can load all 3 vertices of its mesh triangle, and compute the barycentrics and derivatives manually. Big thanks to The Forge for this code.</p> <p>In Bevy, all the visibility buffer loading, data loading and unpacking, vertex interpolation calculations, etc. are wrapped up in the <code>resolve_vertex_output()</code> function for ease of use.</p> <pre data-lang="rust" style="background-color:#002b36;color:#839496;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#586e75;">/// Load the visibility buffer texture and resolve it into a VertexOutput. 
</span><span style="color:#268bd2;">fn </span><span style="color:#b58900;">resolve_vertex_output</span><span style="color:#657b83;">(</span><span style="color:#268bd2;">frag_coord</span><span>: vec4&lt;</span><span style="color:#268bd2;">f32</span><span>&gt;</span><span style="color:#657b83;">) </span><span>-&gt; VertexOutput </span><span style="color:#657b83;">{ </span><span> </span><span style="color:#268bd2;">let</span><span> packed_ids </span><span style="color:#657b83;">=</span><span> textureLoad</span><span style="color:#657b83;">(</span><span>meshlet_visibility_buffer, vec2&lt;</span><span style="color:#268bd2;">i32</span><span>&gt;</span><span style="color:#657b83;">(</span><span>frag_coord.xy</span><span style="color:#657b83;">)</span><span>, </span><span style="color:#6c71c4;">0</span><span style="color:#657b83;">)</span><span>.r; </span><span> </span><span style="color:#268bd2;">let</span><span> cluster_id </span><span style="color:#657b83;">=</span><span> packed_ids </span><span style="color:#657b83;">&gt;&gt;</span><span> 6u; </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet_id </span><span style="color:#657b83;">=</span><span> meshlet_cluster_meshlet_ids</span><span style="color:#657b83;">[</span><span>cluster_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> meshlet </span><span style="color:#657b83;">=</span><span> meshlets</span><span style="color:#657b83;">[</span><span>meshlet_id</span><span style="color:#657b83;">]</span><span>; </span><span> </span><span style="color:#268bd2;">let</span><span> triangle_id </span><span style="color:#657b83;">=</span><span> extractBits</span><span style="color:#657b83;">(</span><span>packed_ids, 0u, 6u</span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// ... 
</span><span> </span><span> </span><span style="color:#586e75;">// https://github.com/ConfettiFX/The-Forge/blob/2d453f376ef278f66f97cbaf36c0d12e4361e275/Examples_3/Visibility_Buffer/src/Shaders/FSL/visibilityBuffer_shade.frag.fsl#L83-L139 </span><span> </span><span style="color:#268bd2;">let</span><span> partial_derivatives </span><span style="color:#657b83;">= </span><span style="color:#859900;">compute_partial_derivatives</span><span style="color:#657b83;">( </span><span> </span><span style="color:#859900;">array</span><span style="color:#657b83;">(</span><span>clip_position_1, clip_position_2, clip_position_3</span><span style="color:#657b83;">)</span><span>, </span><span> frag_coord_ndc, </span><span> view.viewport.zw, </span><span> </span><span style="color:#657b83;">)</span><span>; </span><span> </span><span> </span><span style="color:#586e75;">// ... </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> world_position </span><span style="color:#657b83;">= </span><span style="color:#859900;">mat3x4</span><span style="color:#657b83;">(</span><span>world_position_1, world_position_2, world_position_3</span><span style="color:#657b83;">) *</span><span> partial_derivatives.barycentrics; </span><span> </span><span style="color:#268bd2;">let</span><span> uv </span><span style="color:#657b83;">= </span><span style="color:#859900;">mat3x2</span><span style="color:#657b83;">(</span><span>vertex_1.uv, vertex_2.uv, vertex_3.uv</span><span style="color:#657b83;">) *</span><span> partial_derivatives.barycentrics; </span><span> </span><span> </span><span style="color:#268bd2;">let</span><span> ddx_uv </span><span style="color:#657b83;">= </span><span style="color:#859900;">mat3x2</span><span style="color:#657b83;">(</span><span>vertex_1.uv, vertex_2.uv, vertex_3.uv</span><span style="color:#657b83;">) *</span><span> partial_derivatives.ddx; </span><span> </span><span style="color:#268bd2;">let</span><span> ddy_uv </span><span 
style="color:#657b83;">= </span><span style="color:#859900;">mat3x2</span><span style="color:#657b83;">(</span><span>vertex_1.uv, vertex_2.uv, vertex_3.uv</span><span style="color:#657b83;">) *</span><span> partial_derivatives.ddy; </span><span> </span><span> </span><span style="color:#586e75;">// ... </span><span style="color:#657b83;">} </span></code></pre> <h2 id="downsample-depth-again">Downsample Depth (Again)<a class="zola-anchor" href="#downsample-depth-again" aria-label="Anchor link for: downsample-depth-again" style="visibility: hidden;"></a> </h2> <p>Lastly, for next frame's first culling pass, we're going to need the previous frame's depth pyramid. This is where we'll generate it. We'll use the same exact process that we used for the first depth downsample, but this time we'll use the depth buffer generated as a result of the second raster pass, instead of the first.</p> <h1 id="future-work">Future Work<a class="zola-anchor" href="#future-work" aria-label="Anchor link for: future-work" style="visibility: hidden;"></a> </h1> <p>And with that we're done with the frame breakdown. I've covered all the major steps and shaders of how virtual geometry will work in Bevy 0.14. I did skip some of the CPU-side data management, but it's fairly boring and subject to a rewrite soon anyways.</p> <p>However, Bevy 0.14 is just the start. There's tons of improvements I'm hoping to implement in a future version, such as:</p> <ul> <li>Major improvements to the rasterization passes via software rasterization, and trying out my multi draw with bins idea for hardware raster</li> <li>Copying Nanite's idea of culling and LOD selection via persistent threads. 
This should let us eliminate the separate <code>fill_cluster_buffers</code> step, speed up culling, and remove the need for large 3d dispatches over all clusters in the scene</li> <li>Compressing asset vertex data by using screen-derived tangents and octahedral-encoded normals, and possibly position/UV quantization</li> <li>Performance, quality, reliability, and workflow improvements for the mesh to meshlet mesh asset preprocessing</li> <li>Compute-based material shading passes instead of the fullscreen triangle method, and possibly software variable rate shading, inspired by Unreal Engine 5.4's <a rel="nofollow noreferrer" href="https://www.unrealengine.com/en-US/blog/take-a-deep-dive-into-nanite-gpu-driven-materials">GPU-driven Nanite materials</a> and <a rel="nofollow noreferrer" href="http://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs">this set of blog posts</a> from John Hable</li> <li>Streaming asset data to and from disk, instead of keeping all of it in memory all the time</li> </ul> <p>With any luck, and a lot of hard work, I'll be back for another blog post about all these changes in the future. 
Until then, enjoy Bevy 0.14!</p> Bevy's Third Birthday - Reflections on Rendering 2023-09-12T00:00:00+00:00 2023-09-12T00:00:00+00:00 Unknown https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/ <blockquote> <p>Written in response to <a rel="nofollow noreferrer" href="https://bevyengine.org/news/bevys-third-birthday">Bevy's Third Birthday</a>.</p> </blockquote> <h1 id="introduction">Introduction<a class="zola-anchor" href="#introduction" aria-label="Anchor link for: introduction" style="visibility: hidden;"></a> </h1> <blockquote> <p>You can skip this section if you're only interested in hearing about Bevy - we'll get to that in a minute.</p> </blockquote> <h2 id="who-am-i">Who am I?<a class="zola-anchor" href="#who-am-i" aria-label="Anchor link for: who-am-i" style="visibility: hidden;"></a> </h2> <p>Hi, I'm JMS55, and I've been working on Bevy's 3D renderer for the past ~10 months.</p> <p>I've also been involved in the Rust gamedev community for a long time:</p> <ul> <li>I have been using Rust since pre-1.0 (~7 years ago).</li> <li>Tried out Piston when it first came out; same with Amethyst.</li> <li>Contributed a (very tiny) bit to <a rel="nofollow noreferrer" href="https://veloren.net">Veloren</a>.</li> <li><a rel="nofollow noreferrer" href="https://github.com/JMS55/botnet#botnet">Wrote a demo</a> for a cool RTS simulation kind of game where you program your units via Rust-compiled-to-WASM (and would love to get back to it at some point), using Wasmtime and Macroquad.</li> <li><a rel="nofollow noreferrer" href="https://github.com/JMS55/sandbox#sandbox">Wrote a falling sand game</a> using pixels, wgpu, and imgui-rs (I also tried egui and yakui). I wrote the shaders for it in GLSL - this was before wgpu started using WGSL!</li> </ul> <p><a rel="nofollow noreferrer" href="https://github.com/parasyte/pixels">Pixels</a> was the first time I ever made a non-trivial contribution to an open source library. 
Now, I'm working on Bevy pretty much daily!</p> <h2 id="contributing-to-bevy">Contributing to Bevy<a class="zola-anchor" href="#contributing-to-bevy" aria-label="Anchor link for: contributing-to-bevy" style="visibility: hidden;"></a> </h2> <p>First, some overall thoughts on my experience contributing to Bevy.</p> <p>It's been exceedingly rewarding to work on an open source project with this kind of community. Before Bevy, I mainly worked on my own projects, and inevitably got burnt out. It's hard to maintain motivation when you're the only one involved. It's been super energizing getting to bounce ideas off of the other amazing developers working on Bevy!</p> <p>Additionally, seeing code I write translate directly into real-world use is awesome. Thank you to all the other developers and users of Bevy!</p> <p>If you're a user of Bevy, and are thinking about getting involved in Bevy's development, I highly recommend it! One of the great things about Bevy is its modularity and focus on ECS. "Engine code" and "user code" are not substantially different. If you've used Bevy before, chances are you can write code <em>for</em> Bevy.</p> <p>Quoting Cart, "Bevy users <em>are</em> bevy developers, they just don't know it yet". The developers and community are super friendly. Feel free to join our <a rel="nofollow noreferrer" href="https://discord.com/invite/bevy">Discord</a>, pick a topic you find interesting - say, #rendering-dev :) - and start asking lots of questions!</p> <h2 id="post-overview">Post overview<a class="zola-anchor" href="#post-overview" aria-label="Anchor link for: post-overview" style="visibility: hidden;"></a> </h2> <p>With that out of the way: this blog post will be my reflection on nearly a year of Bevy development (Bevy 0.9-dev to 0.12-dev). Specifically, bevy_pbr, bevy_core_pipeline, and bevy_render, along with some related crates such as wgpu and naga. 
I (mostly) won't be talking about the ECS, 2D renderer, UI, and other areas of Bevy.</p> <p>I'll be covering what I (and others) worked on, what went well, important items we need to spend time developing, and some new features I'm excited to work on in the coming months.</p> <h1 id="this-year">This Year<a class="zola-anchor" href="#this-year" aria-label="Anchor link for: this-year" style="visibility: hidden;"></a> </h1> <h2 id="what-we-achieved">What we achieved<a class="zola-anchor" href="#what-we-achieved" aria-label="Anchor link for: what-we-achieved" style="visibility: hidden;"></a> </h2> <p>This year, I've worked on and merged the following features:</p> <ul> <li>Bloom (In collaboration with others) (Bevy 0.9, 0.10)</li> <li>EnvironmentMapLight (IBL) (Bevy 0.10)</li> <li>Temporal antialiasing (TAA) (Bevy 0.11)</li> <li>Screen-space ambient occlusion (SSAO) (Bevy 0.11)</li> <li>Skybox (Bevy 0.11)</li> </ul> <p>I've also worked on, and either didn't end up merging or am still working on:</p> <ul> <li>Percentage-closer filtering (PCF) for smoothing out the edges of shadows</li> <li>Automated EnvironmentMapLight generation (replacing glTF-IBL-Sampler)</li> <li>Support for AMD's FSR and Nvidia's DLSS upscalers</li> <li>Multithreaded rendering for improved performance</li> <li>Clear coat layer for StandardMaterial</li> <li>GPU pass timing overlay for profiling the renderer (GPU timestamps)</li> <li>Ergonomic improvements for the renderer internals</li> <li>Real-time fully dynamic global illumination (more on this later!)</li> </ul> <p>and many other smaller PRs not interesting enough to mention, new examples contributed, discussion posts and conversations, bug investigations, performance profiling sessions, and reviewing other peoples' PRs.</p> <p><img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/ssao.png" alt="ssao" /> <img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/taa.png" alt="taa" /> <img 
src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/bloom.png" alt="bloom" /></p> <p>Additional major rendering features that we merged, but that I did not directly work on include:</p> <ul> <li>Fast approximate antialiasing (FXAA)</li> <li>Depth and normal prepasses</li> <li>Cascaded shadow maps (CSM)</li> <li>Fog effects</li> <li>Better tonemapping</li> <li>Morph targets</li> <li>A complete revamp of rendering system sets</li> <li>Ergonomic improvements for render node APIs</li> <li>Many performance improvements</li> </ul> <h2 id="things-i-feel-went-well">Things I feel went well<a class="zola-anchor" href="#things-i-feel-went-well" aria-label="Anchor link for: things-i-feel-went-well" style="visibility: hidden;"></a> </h2> <p>Overall, I'm fairly satisfied both with what Bevy has accomplished, and what I've personally learned and accomplished this year.</p> <p>Bevy has gone from "we have some basic PBR shaders with analytic direct lighting" to ~70% of the way to a fully production-ready, indie-game-usable renderer with far fewer caveats, and much fancier lighting and post processing!</p> <p>Take Slime Rancher, a hit indie game from 2017. <a rel="nofollow noreferrer" href="https://pixelalchemy.dev/posts/a-frame-of-slime-rancher">This post</a> goes into detail on the tricks and rendering techniques the game used to achieve its graphics. Bevy 0.11 has support for almost all of the listed techniques! The only things we're missing are decals and refraction (although there's a PR open that implements screen-space refraction!).</p> <p>I would specifically like to note the <em>number</em> of people working on rendering features, and how it's increased over time. It's a great sign to see that rendering isn't the domain of only 1 or 2 dedicated developers. Rather, we have a fairly large number of people contributing major rendering features and improvements.</p> <p>Too often, I feel rendering is seen as a kind of opaque witchcraft. 
To some extent, I feel that perception is true. Writing a shader (GPU program) is not like writing a program for the CPU in Rust. The graphics APIs themselves (in our case, wgpu) are not the most intuitive, and are often subject to compatibility or performance constraints that lead to poor ergonomics. Furthermore, even if you can write a shader, and know the graphics APIs, it's not always clear <em>how</em> to assemble all that together into a performant, compatible, ergonomic renderer.</p> <p>Here's where the "but" comes: I don't think it's that much worse than programming any other part of a game engine in general. Designing a complete UI system, or an ECS, or a physics library, etc, is rarely a simple one person job. Designing a rendering engine is much the same.</p> <p>Bevy has been able to consistently attract new rendering developers, often with little or no professional rendering experience. To me, that's an encouraging sign that we're doing something right. We may not rival Unity's or Godot's renderers <em>now</em>, but in another year, I'm confident that we'll surpass them in a few areas, and at least match them in most of the important ones :)</p> <h2 id="things-i-feel-we-need-to-work-on">Things I feel we need to work on<a class="zola-anchor" href="#things-i-feel-we-need-to-work-on" aria-label="Anchor link for: things-i-feel-we-need-to-work-on" style="visibility: hidden;"></a> </h2> <p>Now that we've covered what I felt went well, it's time to talk about things we need to improve on. 
These are pain points either I or other developers have consistently faced, or things I've seen users bring up many times.</p> <p>In no specific order, here are some things I feel we need to prioritize.</p> <h3 id="more-dynamic-comprehensive-and-accessible-test-scenes">More dynamic, comprehensive, and accessible test scenes<a class="zola-anchor" href="#more-dynamic-comprehensive-and-accessible-test-scenes" aria-label="Anchor link for: more-dynamic-comprehensive-and-accessible-test-scenes" style="visibility: hidden;"></a> </h3> <p>Most rendering development is currently done with either dedicated Bevy example scenes, or Lumberyard's Bistro or Intel's Sponza scenes. The former tend to be too simple for more intensive rendering tests, and the latter are difficult to set up, and don't have the dynamism a real game would have. Furthermore, we don't have any scenes that exercise <em>all</em> of Bevy's rendering features at once, and how they might interact.</p> <p>It would be great to get more test scenes that are easy to set up and tweak, demonstrate many of Bevy's rendering features working in tandem, and overall provide real-world use cases that we can test against, rather than toy scenes. Currently, thoroughly exercising a new rendering feature or performance change requires almost as much work as writing the feature itself. To some extent, I'm asking us to develop a small game, focused on polished rendering and animation.</p> <h3 id="performance">Performance<a class="zola-anchor" href="#performance" aria-label="Anchor link for: performance" style="visibility: hidden;"></a> </h3> <p>A fact that is probably surprising to developers without much experience in rendering is that Bevy's renderer performance is currently heavily CPU-limited - not GPU-limited, as you might expect. 
There are two factors to this:</p> <ol> <li>Bevy is inefficient with how it stores and uses rendering data</li> <li>Bevy makes too many draw calls, and too many state-binding changes between draws</li> </ol> <p>In order to become a serious renderer, we'll need to dramatically improve our CPU performance. Thankfully, we have a <em>lot</em> of changes in progress towards this goal. Many core parts of the renderer that have been neglected in favor of working on new features are being revamped and improved. Expect large performance gains in Bevy 0.12, and probably 0.13.</p> <p>Long-term, we'll want to support GPU-driven rendering, where the GPU handles almost all of the rendering work. An extreme example of this kind of architecture is Unreal Engine's Nanite, which is capable of rendering micro-poly meshes. We (almost certainly) won't go <em>that</em> far, but implementing 60% of the techniques (bindless, draw indirect, compute-based rasterization, compute-based frustum culling, two pass occlusion culling, and also asset streaming) should give us 90% of the benefit, and allow complex scenes with orders of magnitude more meshes. This is an exciting area to work on, and there's a lot to do!</p> <h3 id="documentation">Documentation<a class="zola-anchor" href="#documentation" aria-label="Anchor link for: documentation" style="visibility: hidden;"></a> </h3> <p>While performance and new features have been steadily improving, documentation, not so much. The docs for bevy_render, bevy_core_pipeline, and bevy_pbr are, uhh, sparse at best. 
Frequent questions I see include "how do I do &lt;custom kind of rendering&gt;, which should be a fairly routine kind of extension, but I have no idea where to start integrating it with Bevy", "what shader imports are available", "what does this error mean", and "my rendering looks bad / performance is bad, how do I improve this and understand why?"</p> <p>We need to write more API docs, more module docs, and more long-form guides on how Bevy's renderer is structured and how to achieve common tasks. This is something I've been wanting to work on, but much like blogging, I've discovered writing clear, useful docs is quite hard. This is a great area for new Bevy devs to get involved in!</p> <blockquote> <p>As an aside, this is my first blog post. It's something I've been meaning to do for many years, but never actually gotten around to. Writing a blog post is a <em>lot</em> of work, and there's the temptation to polish the writing until it's perfect. Doing so would leave me no time to actually work on rendering, so I'm going ahead and publishing this despite the fact that it's not perfect :)</p> </blockquote> <h3 id="ease-of-use">Ease of use<a class="zola-anchor" href="#ease-of-use" aria-label="Anchor link for: ease-of-use" style="visibility: hidden;"></a> </h3> <p>Similarly to the above section, while our features may be pretty good, internally they're not that great to work with. The main rendering pass APIs are too abstract, and go through many traits, generic systems, and levels of indirection that greatly complicate understanding the renderer.</p> <p>Writing new post processing or lighting passes involves a <em>lot</em> of boilerplate, especially around bind groups and resource/pipeline creation.</p> <p>These two things are also a large barrier to entry in getting new contributors to work on rendering.</p> <p>Finally, the core Material code is brittle and not very extensible.
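<p>To illustrate the kind of composition that's missing, here's a hypothetical sketch of what chaining small material "fragments" could look like. None of these types exist in Bevy - this is purely an illustration of the idea, in plain Rust rather than shader code.</p>

```rust
// Hypothetical sketch of composable material "fragments" - nothing like this
// exists in Bevy today. Each fragment transforms the inputs flowing into the
// next stage, so a user could start from a standard material and splice in
// small behaviors instead of rewriting everything from scratch.

#[derive(Clone, Copy, Debug, PartialEq)]
struct FragmentInput {
    uv: [f32; 2],
    time: f32,
}

/// One link in a material chain: takes the stage input, returns a modified one.
trait MaterialFragment {
    fn apply(&self, input: FragmentInput) -> FragmentInput;
}

/// "Animate the texture UVs" as a fragment: scroll UVs over time, wrapping.
struct ScrollingUvs {
    velocity: [f32; 2],
}

impl MaterialFragment for ScrollingUvs {
    fn apply(&self, mut input: FragmentInput) -> FragmentInput {
        input.uv[0] = (input.uv[0] + self.velocity[0] * input.time).fract();
        input.uv[1] = (input.uv[1] + self.velocity[1] * input.time).fract();
        input
    }
}

/// A material is just an ordered chain of fragments, applied in sequence.
struct MaterialChain {
    fragments: Vec<Box<dyn MaterialFragment>>,
}

impl MaterialChain {
    fn evaluate(&self, input: FragmentInput) -> FragmentInput {
        self.fragments.iter().fold(input, |acc, f| f.apply(acc))
    }
}

fn main() {
    let material = MaterialChain {
        fragments: vec![Box::new(ScrollingUvs { velocity: [0.25, 0.0] })],
    };
    let out = material.evaluate(FragmentInput { uv: [0.5, 0.5], time: 2.0 });
    println!("animated uv: {:?}", out.uv);
}
```

<p>The appeal of something like this is that a UV animation, a dissolve effect, and so on could each be written once and spliced into any material, instead of forking StandardMaterial wholesale.</p>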
A Bevy user's options are to either use StandardMaterial, or write a completely custom material from scratch. There's no easy way to do something like take the StandardMaterial, but animate the texture UVs according to this shader fragment, and then pass the result into some other shader fragment. Furthermore, writing the shader for a custom material involves some complicated coordination between vertex and fragment stage input and output types, and data bindings. Mismatched bindings or types are a common source of confusing errors for authors of custom materials. People have floated some ideas on how to improve this, but not a ton of concrete code yet. It's something we'll need to work on going forwards.</p> <h3 id="review-speed">Review speed<a class="zola-anchor" href="#review-speed" aria-label="Anchor link for: review-speed" style="visibility: hidden;"></a> </h3> <p>We have too many open PRs, and not enough reviewers! Reviews take a <em>long</em> time. Part of this is the fact that, as I talked about above, rendering boilerplate can get pretty gnarly. Reviewing a rendering PR often involves one session to review the CPU-side code changes, and another entirely to review the GPU-side code. If we can improve rendering boilerplate, we will not only make it easier to write new features, but shorten the (currently fairly substantial) review time each feature has to go through before it can be merged.</p> <p>Another issue is testing. Testing rendering does not lend itself well to unit tests. You need a variety of scenes, setups, and specific GPU and OS platforms (platform-specific bugs are sadly common, and feature support and performance vary widely).
This all slows down reviews, and often we miss fairly impactful rendering bugs anyways.</p> <h3 id="ecosystem-investment">Ecosystem investment<a class="zola-anchor" href="#ecosystem-investment" aria-label="Anchor link for: ecosystem-investment" style="visibility: hidden;"></a> </h3> <p>First, I'd like to thank the maintainers of the wgpu and naga crates, which bevy_render sits atop, for their awesome work. Bevy's renderer would not be possible without them!</p> <p>These crates, however, form an entirely new graphics API/toolchain, with a focus on wide compatibility and API safety. They don't currently support some of the latest GPU features such as ray tracing, mesh shaders, threadgroup/wave/warp intrinsics, async compute, and mutable binding arrays (bindless textures), or support them only with specific caveats. This is totally understandable - after all, not many developers want or need these things, and it's not WebGPU's focus.</p> <p>The solution is of course for us to invest more time in writing those features and helping them out ourselves :). I'm not sure how to foster it, but it would be great to see more investment in wgpu, naga, and naga_oil from Bevy's developers.</p> <h3 id="bevy-editor">Bevy editor<a class="zola-anchor" href="#bevy-editor" aria-label="Anchor link for: bevy-editor" style="visibility: hidden;"></a> </h3> <p>This isn't quite rendering related, but like everyone else, I'm eagerly awaiting bevy_editor. It'll be super useful for testing out rendering features, as clicking through GUI buttons is much easier than writing a system to manually toggle several features on and off with keypresses and rendering an in-game UI to show the enabled settings.</p> <p>I'm also really looking forward to <em>developing</em> Bevy's editor. I originally joined this project to do just that, and somehow ended up working on rendering instead!
I mentioned my Rust gamedev experience earlier, but I also have a lot of experience with Rust UI dev and UI dev in general.</p> <p>The key missing parts are twofold:</p> <ul> <li>An ergonomic, reactive, pretty, capable, and scalable UI system</li> <li>Concrete direction on how the editor will actually operate (as a separate process with message passing, as a bevy_app plugin to the game process, etc)</li> </ul> <p>I'm interested in doing the work of designing the UI for the editor and writing all the UI code and features, but not so much figuring out the basic foundations. Hopefully others will take on this task :)</p> <h1 id="things-i-m-excited-to-work-on-in-the-next-year">Things I'm excited to work on in the next year(?)<a class="zola-anchor" href="#things-i-m-excited-to-work-on-in-the-next-year" aria-label="Anchor link for: things-i-m-excited-to-work-on-in-the-next-year" style="visibility: hidden;"></a> </h1> <p>Finally, I'd like to mention some things I'm excited to work on! Some of these I've already talked about, and others less so:</p> <ul> <li>FSR / DLSS</li> <li>Procedural skybox</li> <li>Bevy editor</li> <li>GPU driven rendering</li> <li>Profiling tools and system/entity tracing and statistics</li> <li>OpenPBR material</li> <li>Raytraced direct lighting (ReSTIR DI / RTX DI)</li> <li>Screen-space reflections, indirect lighting, and SSAO improvements</li> <li>Global illumination</li> </ul> <h2 id="bevy-solari">Bevy Solari<a class="zola-anchor" href="#bevy-solari" aria-label="Anchor link for: bevy-solari" style="visibility: hidden;"></a> </h2> <p>It's not something I've mentioned at all yet, but one of the things I've been spending a <em>lot</em> of time on the past several months is a project I'm calling bevy_solari.</p> <p>Bevy currently has support for direct lighting - i.e., simulating the light coming from a light source, and hitting a surface. In real life however, light doesn't just stop at the first surface it hits.
Light bounces around a scene, leading to mirror or blurry reflections, color bleeding, micro-shadows, and more. Simulating these many bounces of light is called global illumination (GI), and tends to be very expensive and slow to do in real time. Without GI, however, lighting tends to look kinda off, and a lot less pretty.</p> <p>Most games tend to approximate global illumination via baked static lighting methods such as lightmaps, irradiance volumes, and environment maps, as well as very limited dynamic methods such as planar reflections, light/reflection probes, and screen-space raytracing. Of these, Bevy currently only supports environment maps and SSAO, although I know that some people are working on implementing the other methods.</p> <p>Thanks to recent advances in GPU hardware and algorithm development, however, fully dynamic, real time global illumination has become feasible. The field is rapidly developing, but there have been many promising approaches, including Tomasz Stachowiak's Kajiya, DDGI, Unreal Engine's Lumen (as seen in Unreal's Lumen in the Land of Nanite demo, as well as Fortnite), Nvidia's ReSTIR GI / RTX GI (as seen in Cyberpunk 2077), AMD's GI-1.0, and Alexander Sannikov's Radiance Cascades (as seen in the recent Path of Exile 2). It's a <em>super</em> exciting area of research, and something I've been having an absolute (and sometimes frustrating!) blast learning.</p> <p>There's too much literature and detail to cover here (it deserves, and may eventually get, its own blog post), but suffice it to say that I've been working on my own GI system for Bevy inspired by many of these techniques. It utilizes GPU hardware-accelerated raytracing, and is targeted at high-end GPUs. It's not going to be released any time soon, partially due to wgpu lacking official raytracing support, and partially due to the massive amount of work and experimentation I still need to do.
However, it's open source, so feel free to try the <a rel="nofollow noreferrer" href="https://github.com/JMS55/bevy/tree/solari">demo example here</a>. Run <code>cargo run --example solari</code> from the repo root.</p> <p>Below are some static screenshots of the renderer, but keep in mind that this is all running in realtime on an Nvidia RTX 3080 GPU, and is fully dynamic with movable camera, lights, and objects :)</p> <blockquote> <p>Bevy Solari in a cornell box scene - with GI <img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/solari.png" alt="bevy_solari in a cornell box scene" /></p> </blockquote> <blockquote> <p>Direct light only - no GI <img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/solari_direct_only.png" alt="bevy_solari - direct light only - no GI" /></p> </blockquote> <blockquote> <p>Indirect irradiance debug view <img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/solari_indirect.png" alt="bevy_solari - indirect irradiance debug view" /></p> </blockquote> <blockquote> <p>World irradiance cache debug view <img src="https://jms55.github.io/posts/2023-09-12-bevy-third-birthday/solari_world_cache.png" alt="bevy_solari - world irradiance cache debug view" /></p> </blockquote>
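<p>As a parting toy illustration of what those extra bounces buy: in a closed scene where every surface reflects a fraction <code>albedo</code> of the light it receives, each bounce contributes another albedo-scaled term, so the total converges to <code>direct / (1 - albedo)</code>. The numbers below are made up and have nothing to do with bevy_solari's actual implementation - they just show how much energy direct-only lighting leaves on the table.</p>

```rust
// Toy illustration of multi-bounce light transport - not bevy_solari code.
// Each bounce adds another albedo-scaled contribution:
//   total = direct * (1 + albedo + albedo^2 + ...) = direct / (1 - albedo)
// Cutting off after the first hit (direct lighting only) loses the rest.

/// Sum the light gathered over `bounces` extra bounces of the direct term.
fn lighting_with_bounces(direct: f32, albedo: f32, bounces: u32) -> f32 {
    (0..=bounces).map(|b| direct * albedo.powi(b as i32)).sum()
}

fn main() {
    let direct = 1.0;
    let albedo = 0.5;

    let direct_only = lighting_with_bounces(direct, albedo, 0);
    let with_gi = lighting_with_bounces(direct, albedo, 16);
    let analytic = direct / (1.0 - albedo); // limit of infinitely many bounces

    // 16 bounces is already very close to the analytic limit of 2.0
    println!("direct only: {direct_only}, with GI: {with_gi:.4}, limit: {analytic}");
}
```

<p>With an albedo of 0.5, direct-only lighting captures just half of the light actually present in the scene - which is part of why the direct-only screenshot above looks so much dimmer and flatter than the GI one.</p>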