ThreeL: Lighting via Light Linked List

ThreeL is an implementation of a somewhat unusual lighting algorithm: real-time lighting via light linked list.

The project is a bespoke forward renderer written in C++ and HLSL, targeting Direct3D 12. It utilizes physically based rendering (PBR), bindless textures and mesh data, compute-based particles and mipmap generation, the glTF model format, and Dear ImGui.

You can find the source code on GitHub or download pre-built binaries for Windows to check it out for yourself–or continue reading to learn more.

Lighting via Light Linked List

Lighting via Light Linked List (LvLLL) was first described by Abdul Bezrati from Insomniac Games after developing it for Sunset Overdrive.

While not described as such, the algorithm is a variant of tiled shading (better known as Forward+) where the screen is divided into many tiles and each tile stores a list of lights which intersect with it.

Lighting is applied to surfaces by fetching the light list corresponding to the target pixel and iterating over the lights that intersect with the surface.
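As a concrete sketch of that per-pixel lookup, here's how a pixel might map to its tile, assuming 8x8-pixel tiles in a row-major layout (the function and names here are my own illustration, not ThreeL's actual code):

```cpp
#include <cstdint>

// Hypothetical sketch of the per-pixel tile lookup, assuming 8x8-pixel tiles
// laid out row-major across the screen. Names are illustrative.
constexpr uint32_t TileSize = 8;

uint32_t TileIndexForPixel(uint32_t x, uint32_t y, uint32_t screenWidth)
{
    // Round the tile count up so a partial tile on the right edge still counts
    uint32_t tilesPerRow = (screenWidth + TileSize - 1) / TileSize;
    return (y / TileSize) * tilesPerRow + (x / TileSize);
}
```

The pixel shader uses the resulting index to fetch the head of that tile's light list before iterating over it.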

A visualization of a light buffer with 8x8 tiles, warmer colors represent more lights in that tile

Compared to basic forward rendering, this has a major benefit of reducing work done to handle lights that don’t actually affect the surface (either by skipping them on the CPU or in the pixel shader.)

Compared to typical deferred shading you benefit from being able to use the same lighting algorithm for all your objects. (The lighting strategy in deferred shading can’t be applied to transparents or atypical materials like hair and skin, so you end up with a forward lighting strategy just for them.) It also saves memory bandwidth when there’s a significant number of lights in the scene.

The main thing that sets LvLLL apart from what I would call typical Forward+ is in how it stores the list of lights. Forward+ implementations tend to store lights per tile in a tight array. Either a small array per tile (which limits your lights per tile or uses a ton of VRAM) or a variable-sized array per tile (which involves significantly more complexity during light fill.)

In contrast (and as the name suggests), LvLLL stores lights in a linked list per tile. Each tile is represented by an index into a light links heap.

Each light link stores four pieces of information:

- The index of the light
- The minimum depth of the light within the tile
- The maximum depth of the light within the tile
- The index of the next link in the list

The inclusion of depth here is optional but is desirable as it allows us to quickly skip lights which do not intersect the surface. (In this way you might consider LvLLL to exist somewhere between tiled shading and clustered shading.)
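Pulling those pieces together, a light link and its traversal might look like the following CPU-side C++ sketch; the real data lives in HLSL, and the struct layout and names here are hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a light link; field names are illustrative.
struct LightLink
{
    uint32_t lightIndex; // Which light this link refers to
    float    minDepth;   // Closest depth the light covers in this tile
    float    maxDepth;   // Farthest depth the light covers in this tile
    uint32_t nextLink;   // Index of the next link, or End for the last one
};

constexpr uint32_t End = 0xFFFFFFFF;

// Walk the list for one tile, skipping lights whose depth range
// cannot contain the surface being shaded.
std::vector<uint32_t> GatherLights(const std::vector<LightLink>& heap,
                                   uint32_t firstLink, float surfaceDepth)
{
    std::vector<uint32_t> lights;
    for (uint32_t i = firstLink; i != End; i = heap[i].nextLink)
    {
        const LightLink& link = heap[i];
        if (surfaceDepth >= link.minDepth && surfaceDepth <= link.maxDepth)
            lights.push_back(link.lightIndex);
    }
    return lights;
}
```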


The obvious downside to using linked lists over nice dense arrays is that they're not particularly cache-friendly.

However, while I have not done extensive performance testing yet, some very preliminary experiments suggest that the cost of the PBR lighting calculations actually hides any latency from cache misses.

Regardless though, I didn’t create ThreeL because LvLLL is the best tiled shading method out there. I developed ThreeL because I’ve always had an affinity for this algorithm ever since I learned about it.

LvLLL’s relative simplicity while remaining flexible has a certain elegance that appeals to me, plus it’s lived in my head rent free ever since I played Sunset Overdrive back in 2014.

Which is why I’m pretty happy to be able to tell you about…

My improvement to Lighting via Light Linked List

As mentioned earlier, each light link contains the minimum and maximum depth values of the light for that tile.

In Bezrati’s original presentation, he explained that he gathered these depths by rendering both the back and the front faces of each light shell (with depth testing disabled so they both come through.) A separate depth bounds buffer was used to keep track of the depths from both halves of the shell.

The two pixel shader invocations communicated through this buffer using InterlockedExchange: whichever was rendered first stored its light index and depth; whichever rendered second saw the matching light index and knew it had its partner's depth (which would then get stored in the light links heap).
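That pairing trick can be sketched in C++ with std::atomic standing in for HLSL's InterlockedExchange; the packed {light index, depth} encoding here is my own assumption, not Bezrati's actual layout:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the depth-bounds pairing. A depth-bounds cell packs
// a light index (high 32 bits) with a depth (low 32 bits); values illustrative.
constexpr uint64_t Empty = ~0ull;

uint64_t Pack(uint32_t lightIndex, float depth)
{
    uint32_t depthBits;
    std::memcpy(&depthBits, &depth, sizeof(depth));
    return (uint64_t(lightIndex) << 32) | depthBits;
}

// Called once per face of a light's shell. Returns true when this invocation
// completes the pair, handing back the depth its partner recorded earlier.
bool RecordDepth(std::atomic<uint64_t>& cell, uint32_t lightIndex, float depth,
                 float& partnerDepth)
{
    uint64_t previous = cell.exchange(Pack(lightIndex, depth));
    if (previous != Empty && uint32_t(previous >> 32) == lightIndex)
    {
        uint32_t bits = uint32_t(previous);
        std::memcpy(&partnerDepth, &bits, sizeof(partnerDepth));
        return true; // Both depths known; a light link can be emitted
    }
    return false; // First of the pair; the partner invocation finishes it
}
```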


In ThreeL I was able to simplify this step: I only have to render the back faces of my light meshes.

Rather than using the depth from the rasterizer, I instead cast a ray from the camera along the view direction, finding the two world-space points where it intersects the light. Those two points are then projected into clip space and their depths are recorded in the light link.

To save on complexity in the light fill pixel shader, I calculate said ray in the vertex shader by projecting the vertex onto the near and far planes. Those plane positions are then interpolated in screen space (i.e. using noperspective.)

Additionally, projecting the two world-space points is relatively cheap, as only the W clip-space component and one of the Zs are actually needed.
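The ray-against-sphere test described above can be sketched as follows; this is my own illustrative version (assuming a normalized ray direction), not ThreeL's actual shader code:

```cpp
#include <cmath>

// Hypothetical sketch: intersect the view ray with a mathematically perfect
// light sphere. A miss means no light link needs to be emitted at all.
struct Vec3 { float x, y, z; };

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

bool IntersectSphere(Vec3 origin, Vec3 direction, Vec3 center, float radius,
                     float& tNear, float& tFar)
{
    Vec3 toOrigin = Sub(origin, center);
    float b = Dot(toOrigin, direction);
    float c = Dot(toOrigin, toOrigin) - radius * radius;
    float discriminant = b * b - c;
    if (discriminant < 0.0f)
        return false; // The ray misses the sphere entirely

    float s = std::sqrt(discriminant);
    tNear = -b - s; // Where the ray enters the light
    tFar = -b + s;  // Where the ray exits the light
    return true;
}
```

The two distances are then turned into clip-space depths for the light link.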

This has a few benefits:

- Only the back faces of each light mesh need to be rasterized, halving the geometry work during light fill
- No separate depth bounds buffer (and no InterlockedExchange pairing) is needed
- The depth bounds follow the perfect mathematical sphere rather than the low-poly light mesh, which lets us skip some light links entirely

The last point is perhaps a bit tricky to understand at first glance. To understand we need to talk a bit about what light fill actually does.

Light fill works by rendering each light without any render target, instead “rendering” the light to the light buffer. Since we’re rendering something, we have to have a mesh!

Here’s the light sphere from ThreeL:

As is typical, it’s got some retro charm and isn’t super spherical.

A side-effect of this super low-poly sphere is that it ends up generating unnecessary light links because it’s quite a bit bigger and more jagged than a perfect sphere.

However, our ray intersection is done against a perfect mathematical light sphere, so in cases where there isn't any intersection we can skip emitting these light links altogether!

We can visualize this using ThreeL’s light boundaries view:

Light boundary with 8x8 tiles

Light boundary with 1x1 tiles

The visualization shows the status of the light hit across the surface. White indicates that the pixel intersected with a light in the list and would be shaded by it. Gray indicates an intersection which would have no impact due to being outside of the light’s actual range. (Some gray is inevitable, but less is better.)

The main thing to note here is that the shape of the light buffer data is not jagged like our light sphere mesh. It’s nice and round.

The 8x8 tile version is inherently jagged due to the low light buffer resolution, but neither has unnecessary light links within the pointy parts of the mesh.

Performance

I have not done extensive performance testing yet, but results have been favorable so far.

The current test scene features 256 lights scattered about Sponza Palace. Depending on where you are, the light list length peaks around 35 lights in a single tile.

To give a slightly more standardized benchmark though, I’ve set up a test scene similar to the one used by Bezrati–with 16 lights covering the entire screen.

Each light is large enough to affect every pixel on the screen, so no lights are being skipped while rendering objects.

The benchmark scene, 16 lights per tile for every tile

Unfortunately I don’t have time to dig my Radeon HD 7870 out of storage to allow us to compare directly to Bezrati’s results, but I did get performance stats for a variety of GPUs:

All statistics collected with 8x8 light tiles.

(A) NVIDIA GTX 1080 @ 2160p

(B) NVIDIA GTX 1080 @ 1080p

(C) AMD Radeon RX 6700 XT @ 1080p

(D) NVIDIA GTX 1060 @ 1080p

(E) Intel UHD 620 @ 1080p

To make things easier to read, I’ve reproduced the most important GPU times below. All times are in milliseconds:

Measurement          (A) GTX 1080  (B) GTX 1080  (C) RX 6700 XT  (D) GTX 1060  (E) UHD 620
                     3840x2160     1920x1080     1920x1080       1920x1080     1920x1080
FillLightLinkedList  0.38          0.10          0.17            0.23           1.05
OpaquePass           8.77          2.71          4.18            7.57          25.93
ParticleRender       2.06          0.64          1.19            1.80           7.43

It’s hard to put a ton of stock in the absolute numbers here without something to compare to, but nothing here is too surprising and I’m pretty happy overall.

The GTX 1080 results for 2160p (A) are roughly 4x the results from 1080p on the same system (B), which is what you’d expect at quadruple the number of pixels.

The more modest GTX 1060 (D) performs sensibly as well, on par with the GTX 1080 @ 2160p (A).

I was a little surprised by the RX 6700 XT (C). It did fine, but I expected to see it do better considering it’s 5 years newer than the NVIDIA cards. A casual perusal of benchmarks seems to agree that the RX 6700 XT should probably have beaten a GTX 1080, so this is probably worth looking into.

The Intel UHD 620 (E) performs poorly. It’s an anemic iGPU in an already somewhat constrained i7-8650U, so this doesn’t come as much of a surprise. It does OK at filling out the light linked list, but less so at actually rendering. More work is necessary to determine exactly why this is, whether it’s due to the PBR lighting calculations or the actual light linked list traversal.

As a bit of an aside, it’s interesting to see the high present time of nearly 4 ms for the GTX 1060 (D). This GPU lives across a PCIe 3.0 x4 bus and relies on the Intel GPU to present its output and it shows.


Out of curiosity, I also created a worst case scenario scene which has 256 full-screen lights.

(I’ll save your eyeballs from being subjected to what that looked like.)

Worst case NVIDIA GTX 1080 @ 2160p

Worst case NVIDIA GTX 1080 @ 1080p

Nothing too surprising here: the worst case is indeed pretty bad. This is the equivalent of an old-school forward pipeline looping through all 256 lights…except it's through a linked list instead of a constant buffer.

It is mildly interesting, though, to see just how little time the light fill step takes even under these conditions.


In the future I would like to go more in-depth with performance analysis. In particular it’d be interesting to know how much of the cost in the opaque/particle passes is coming from the list traversal and how much is coming from PBR lighting calculations. I’d also be interested in testing some other older GPUs, and maybe figuring out what exactly is tripping up the Intel GPU.

Everything else

One of the benefits of LvLLL is that it works well with transparent materials. To demonstrate this in ThreeL, I implemented a compute-based particle system to create a plume of smoke.

Here’s a better view of the smoke plume intersecting with a handful of lights:

As you can see, the various lights near the smoke transmit and scatter within it just as you’d expect. It’s using the same light lists as all the opaque objects.

This screenshot also shows the little particle editor included with ThreeL. It’s not the most flexible particle system in the world, but it’s fun to play with!


Thanks for taking some time to read about ThreeL!

Download

Screenshots

A screenshot from the lower floor along with performance statistics

A birds-eye view showing many of the lights scattered about the scene
(The size of the sprite indicates the range of the light)

The benchmark scene, 16 lights per tile for every tile

The particle system editor along with a modified smoke particle system

A view from the upper floor

The light count debug view from the same perspective – warmer colors represent more lights in that tile