Raph Levien’s blog

I want a good parallel computer

2025-03-21T17:30:42+00:00

The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?

I believe there are two main things holding it back. One is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs excel at big blocks of data with predictable shape, such as dense matrix multiplication, but struggle when the workload is dynamic. Second, our languages and tools are inadequate. Programming a parallel computer is just a lot harder.

Modern GPUs are also extremely complex, and getting more so rapidly. New features such as mesh shaders and work graphs are two steps forward one step back; for each new capability there is a basic task that isn’t fully supported.

I believe a simpler, more powerful parallel computer is possible, and that there are signs in the historical record. In a slightly alternate universe, we would have those computers now, and be doing the work of designing algorithms and writing programs to run well on them, for a very broad range of tasks.

Last April, I gave a colloquium (video) at the UCSC CSE program with the same title. This blog is a companion to that.

Memory efficiency of sophisticated GPU programs

I’ve been working on Vello, an advanced 2D vector graphics renderer, for many years. The CPU uploads a scene description in a simplified binary SVG-like format, then the compute shaders take care of the rest, producing a 2D rendered image at the end. The compute shaders parse tree structures, do advanced computational geometry for stroke expansion, and sorting-like algorithms for binning. Internally, it’s essentially a simple compiler, producing a separate optimized byte-code like program for each 16x16 pixel tile, then interpreting those programs. What it cannot do, a problem I am increasingly frustrated by, is run in bounded memory. Each stage produces intermediate data structures, and the number and size of these structures depends on the input in an unpredictable way. For example, changing a single transform in the encoded scene can result in profoundly different rendering plans.

The problem is that the buffers for the intermediate results need to be allocated (under CPU control) before kicking off the pipeline. There are a number of imperfect potential solutions. We could estimate memory requirements on the CPU before starting a render, but that’s expensive and may not be precise, resulting either in failure or waste. We could try a render, detect failure, and retry if buffers were exceeded, but doing readback from GPU to CPU is a big performance problem, and creates a significant architectural burden on other engines we’d interface with.

The details of the specific problem are interesting but beyond the scope of this blog post. The interested reader is directed to the Potato design document, which explores the question of how far you can get doing scheduling on CPU, respecting bounded GPU resources, while using the GPU for actual pixel wrangling. It also touches on several more recent extensions to the standard GPU execution model, all of which are complex and non-portable, and none of which quite seem to solve the problem.

Fundamentally, it shouldn’t be necessary to allocate large buffers to store intermediate results. Since they will be consumed by downstream stages, it’s far more efficient to put them in queues, sized large enough to keep enough items in flight to exploit available parallelism. Many GPU operations internally work as queues (the standard vertex shader / fragment shader / rasterop pipeline being the classic example), so it’s a question of exposing that underlying functionality to applications. The GRAMPS paper from 2009 suggests this direction, as did the Brook project, a predecessor to CUDA.

There are a lot of potential solutions to running Vello-like algorithms in bounded memory; most have a fatal flaw on hardware today. It’s interesting to speculate about changes that would unlock the capability. It’s worth emphasizing, I’m not feeling held back by the amount of parallelism I can exploit, as my approach of breaking the problem into variants of prefix sum easily scales to hundreds of thousands of threads. Rather, it’s the inability to organize the overall as stages operating in parallel, connected through queues tuned to use only the amount of buffer memory needed to keep everything smoothly, as opposed to the compute shader execution model of large dispatches separated by pipeline barriers.

Parallel computers of the past

The lack of a good parallel computer today is especially frustrating because there were some promising designs in the past, which failed to catch on for various complex reasons, leaving us with overly complex and limited GPUs, and extremely limited, though efficient, AI accelerators.

Connection Machine

I’m listing this not because it’s a particularly promising design, but because it expressed the dream of a good parallel computer in the clearest way. The first Connection Machine shipped in 1985, and contained up to 64k processors, connected in a hypercube network. The number of individual threads is large even by today’s standards, though each individual processor was extremely underpowered.

Perhaps more than anything else, the CM spurred tremendous research into parallel algorithms. The pioneering work by Blelloch on prefix sum was largely done on the Connection Machine, and I find early paper on sorting on CM-2 to be quite fascinating.

Connection Machine 1 (1985) at KIT / Informatics / TECO • by KIT TECO • CC0

Cell

Another important pioneering parallel computer was Cell, which shipped as part of the PlayStation 3 in 2006. That device shipped in fairly good volume (about 87.4 million units), and had fascinating applications including high performance computing, but was a dead end; the Playstation 4 switched to a fairly vanilla rendering pipeline based on a Radeon GPU.

Probably one of the biggest challenges in the Cell was the programming model. In the version shipped on the PS3, there were 8 parallel cores, each with 256kB of static RAM, and each with 128 bit wide vector SIMD. The programmer had to manually copy data into local SRAM, where a kernel would then do some computation. There was little or no support for high level programming; thus people wanting to target this platform had to painstakingly architect and implement parallel algorithms.

All that said, the Cell basically met my requirements for a “good parallel computer.” The individual cores could run effectively arbitrary programs, and there was a global job queue.

The Cell had approximately 200 GFLOPS of total throughput, which was impressive at the time, but pales in comparison to modern GPUs or even a modern CPU (Intel i9-13900K is approximately 850 GFLOPS, with a medium-high end Ryzen 7 is 3379 GFLOPS).

Larrabee

Perhaps the most poignant road not taken in the history of GPU design is Larrabee. The 2008 SIGGRAPH paper makes a compelling case, but ultimately the project failed. It’s hard to say why exactly, but I think it’s possible it was just poor execution on Intel’s part, and with more persistence and a couple of iterations to improve the shortcomings in the original version, it might well have succeeded. At heart, Larrabee is a standard x86 computer with wide (512 bit) SIMD units and just a bit of special hardware to optimize graphics tasks. Most graphics functions are implemented in software. If it had succeeded, it would very easily fulfill my wishes; work creation and queuing is done in software and can be entirely dynamic at a fine level of granularity.

Bits of Larrabee live on. The upcoming AVX10 instruction set is an evolution of Larrabee’s AVX-512, and supports 32 lanes of f16 operations. In fact, Tom Forsyth, one of its creators, argues that Larrabee did not indeed fail but that its legacy is a success. It did ship in modest volumes as Xeon Phi. Another valuable facet of legacy is ISPC, and Matt Pharr’s blog on The story of ispc sheds light on the Larrabee project.

Likely one of the problems of Larrabee was power consumption, which has emerged as one of the limiting factors in parallel computer performance. The fully coherent (total store order) memory hierarchy, while making software easier, also added to the cost of the system, and since then we’ve gained a lot of knowledge in how to write performant software in weaker memory models.

Another aspect that definitely held Larrabee back was the software, which is always challenging, especially for innovative directions. The drivers didn’t expose the special capabilities of the highly programmable hardware, and performance on traditional triangle-based 3D graphics scenes was underwhelming. Even so, it did quite well on CAD workloads involving lots of antialiased lines, driven by a standard OpenGL interface.

The changing workload

Even within games, compute is becoming a much larger fraction of the total workload (for AI, it’s everything). Analysis of Starfield by Chips and Cheese shows that about half the time is in compute. The Nanite renderer also uses compute even for rasterization of small triangles, as hardware is only more efficient for triangles above a certain size. As games do more image filtering, global illumination, and primitives such as Gaussian splatting, the trend will almost certainly continue.

In 2009, Tim Sweeney gave a thought-provoking talk entitled The end of the GPU roadmap, in which he proposed that the concept of GPU would go away entirely, replaced by a highly parallel general purpose computer. That has not come to pass, though there has been some movement in that direction: the Larrabee project (as described above), the groundbreaking cudaraster paper from 2011 implemented the traditional 3D rasterization pipeline entirely in compute, and found (simplifying quite a bit) that it was about 2x slower than using fixed function hardware, and more recent academic GPU designs based on grids of RISC-V cores. It’s worth noting, a more recent update from Tellusim suggests that cudaraster-like rendering in compute can be close to parity on modern hardware.

An excellent 2017 presentation, Future Directions for Compute-for-Graphics by Andrew Lauritzen, highlighted many of the challenges of incorporating advanced compute techniques into graphics pipelines. There’s been some progress since then, but it speaks to many of the same problems I’m raising in this blog post. Also see comments by Josh Barczak, which also links the GRAMPS work and discusses issues with language support.

Paths forward

I can see a few ways to get from the current state to a good parallel computer. Each basically picks a starting point that might have been on the right track but got derailed.

Big grid of cores: Cell reborn

The original promise of the Cell still has some appeal. A modern high end CPU chip has north of 100 billion transistors, while a reasonably competent RISC CPU can be made with orders of magnitude fewer. Why not place hundreds or even thousands of cores on a chip? For maximum throughput, put a vector (SIMD) unit on each core. Indeed, there are at least two AI accelerator chips based on this idea: Esperanto and Tenstorrent. I’m particularly interested in the latter because its software stack is open source.

That said, there are most definitely challenges. A CPU by itself isn’t enough, it also needs high bandwidth local memory and communications with other cores. One reason the Cell was so hard to program is that the local memory was small and needed to be managed explicitly - your program needed to do explicit transfers through the network to get data in and out. The trend in CPU (and GPU) design is to virtualize everything, so that there’s an abstraction of a big pool of memory that all the cores share. You’ll still want to make your algorithm cache-aware for performance, but if not, the program will still run. It’s possible a sufficiently smart compiler can adapt a high-level description of the problem to the actual hardware (and this is the approach taken by Tenstorrent’s TT-Buda stack, specialized to AI workloads). In analogy to exploiting instruction-level parallelism through VLIW, the Itanium stands as a cautionary tale.

From my read of the Tenstorrent docs, the matrix unit is limited to just matrix multiplication and a few supporting operations such as transpose, so it’s not clear it would be a significant speedup for complex algorithms as needed in 2D rendering. But I think it’s worth exploring, to see how far it can be pushed, and perhaps whether practical extensions to the matrix unit to support permutations and so on would unlock more algorithms.

Most of the “big grid of cores” designs are targeted toward AI acceleration, and for good reason: it is hungry for raw throughput with low power costs, so alternatives to traditional CPU approaches are appealing. See the New Silicon for Supercomputers talk by Ian Cutress for a great survey of the field.

Running Vulkan commands from GPU-side

A relatively small delta to existing GPUs would be the ability to dispatch work from a controller mounted on the GPU and sharing address space with the shaders. In its most general form, users would be able to run threads on this controller that could run the full graphics API (for example, Vulkan). The programming model could be similar to now, just that the thread submitting work is running close to the compute units and therefore has dramatically lower latency.

In their earliest form, GPU’s were not distributed systems, they were co-processors, tightly coupled to the host CPU’s instruction stream. These days, work is issued to the GPU by the equivalent of async remote procedure calls, with end-to-end latency often as high as 100µs. This proposal essentially calls for a return to less of a distributed system model, where work can efficiently be issued on a much finer grain and with much more responsiveness to the data. For dynamic work creation, latency is the most important blocker.

Note that GPU APIs are slowly inventing a more complex, more limited version of this anyway. While it’s not possible to run the Vulkan API directly from a shader, with a recent Vulkan extension (VK_EXT_device_generated_commands) it is possible to encode some commands into a command buffer from a shader. Metal has this capability as well (see gpuweb#431 for more details about portability). It’s worth noting that the ability to run indirect commands to recursively generate more work is one of the missing functions; it seems that the designers did not take Hofstadter to heart.

It is interesting to contemplate actually running Vulkan API directly from a shader. Since the Vulkan API is expressed in terms of C, one of the requirements is the ability to run C. This is being done on an experimental basis (see the vcc project), but is not yet practical. Of course, CUDA can run C. CUDA 12.4 also has support for conditional nodes, and as of 12.0 it had support for device graph launch, which reduces latency considerably.

Work graphs

Work graphs are a recent new extension to the GPU execution model. Briefly, the program is structured as a graph of nodes (kernel programs) and edges (queues) all running in parallel. As a node generates output, filling its output queues, the GPU dispatches kernels (at workgroup granularity) to process those outputs further. To a large extent, this is a modern reinvention of the GRAMPS idea.

While exciting, and very likely useful for an increasing range of graphics tasks, work graphs also have serious limitations; I researched whether I could use them for the existing Vello design and found three major problems. First, they cannot easily express joins, where progress of a node is dependent on synchronized input from two different queues. Vello uses joins extensively, for example one kernel to compute a bounding box of a draw object (aggregating multiple path segments), and another to process the geometry within that bounding box. Second, there is no ordering guarantee between the elements pushed into a queue, and 2D graphics ultimately does require ordering (the whiskers of the tiger must be drawn over the tiger’s face). Third, work graphs don’t support variable-size elements.

The lack of an ordering guarantee is particularly frustrating, because the traditional 3D pipeline does maintain ordering, among other reasons, to prevent Z-fighting artifacts (for a fascinating discussion of how GPU hardware preserves the blend order guarantee, see A trip through the Graphics Pipeline part 9). It is not possible to faithfully emulate the traditional vertex/fragment pipeline using the new capability. Obviously, maintaining ordering guarantees in parallel systems is expensive, but ideally there is a way to opt in when needed, or at least couple work graphs with another mechanism (some form of sorting, which is possible to implement efficiently on GPUs) to re-establish ordering as needed. Thus, I see work graphs as two steps forward, one step back.

CPU convergent evolution

In theory, when running highly parallel workloads, a traditional multi-core CPU design is doing the same thing as a GPU, and if fully optimized for efficiency, should be competitive. That, arguably, is the design brief for Larrabee, and also motivation for more recent academic work like Vortex. Probably the biggest challenge is power efficiency. As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.

An advantage of this approach is that it doesn’t change the execution model, so existing languages and tools can still be used. Unfortunately, most existing languages are poor at expressing and exploiting parallelism at both the SIMD and thread level – shaders have a more limited execution model but at least it’s clear how to execute them in parallel efficiently. And for thread-level parallelism, avoiding performance loss from context switches is challenging. Hopefully, newer languages such as Mojo will help, and potentially can be adapted to GPU-like execution models as well.

I’m skeptical this approach will actually become competitive with GPUs and AI accelerators, as there is just a huge gap in throughput per watt compared with GPUs – about an order of magnitude. Also, GPUs and AI accelerators won’t be standing still either.

Maybe the hardware is already there?

It’s possible that there is hardware currently shipping that meets my criteria for a good parallel computer, but its potential is held back by software. GPUs generally have a “command processor” onboard, which, in cooperation with the host-side driver, breaks down the rendering and compute commands into chunks to be run by the actual execution units. Invariably, this command processor is hidden and cannot run user code. Opening that up could be quite interesting. A taste of that is in Hans-Kristian Arntzen’s notes on implementing work graphs in open source drivers: Workgraphs in vkd3d-proton.

GPU designs vary in how much is baked into the hardware and how much is done by a command processor. Programmability is a good way to make things more flexible. The main limiting factor is the secrecy around such designs. Even in GPUs with open source drivers, the firmware (which is what runs on the command processor) is very locked down. Of course, a related challenge is security; opening up the command processor to user code increases the vulnerability surface area considerably. But from a research perspective, it should be interesting to explore what’s possible aside from security concerns.

Another interesting direction is the rise of “Accelerated Processing Units” which integrate GPUs and powerful CPUs in the same address space. Conceptually, these are similar to integrated graphics chips, but those rarely have enough performance to be interesting. From what I’ve seen, running existing APIs on such hardware (Vulkan for compute shaders, or one of the modern variants of OpenCL) would not have significant latency advantages for synchronizing work back to the CPU, due to context switching overhead. It’s possible a high priority or dedicated thread might quickly process items placed in a queue by GPU-side tasks. The key idea is queues running at full throughput, rather than async remote procedure calls with potentially huge latency.

Complexity

Taking a step back, one of the main features of the GPU ecosystem is a dizzying level of complexity. There’s the core parallel computer, then lots of special function hardware (and the scope of this is increasing, especially with newer features such as ray tracing), then big clunky mechanisms to get work scheduled and run. Those start with the basic compute shader dispatch mechanism (a 3D grid of x, y, z dimensions, 16 bits each), and then augment that with various indirect command encoding extensions.

Work graphs also fit into the category of complexifying the execution model to work around the limitations of the primitive 3D grid. I was initially excited about their prospect, but when I took a closer look, I found they were inadequate for expressing any of the producer/consumer relationships in Vello.

There’s a lot of accidental complexity as well. There are multiple competing APIs, each with subtly different semantics, which makes it especially hard to write code once and have it just work.

CUDA is adding lots of new features, some of which improve autonomy as I’ve been wanting, and there is a tendency for graphics APIs to adopt features from CUDA. However, there’s also a lot of divergence between these ecosystems (work graphs can’t be readily adapted to CUDA, and it’s very unlikely graphics shaders will get independent thread scheduling any time soon).

The complexity of the GPU ecosystem has many downstream effects. Drivers and shader compilers are buggy and insecure, and there is probably no path to really fixing that. Core APIs tend to be very limited in functionality and performance, so there’s a dazzling array of extensions that need to be detected at runtime, and the most appropriate permutation selected. This in turn makes it far more likely to run into bugs that appear only with specific combinations of features, or on particular hardware.

All this is in fairly stark contrast to the CPU world. A modern CPU is also dazzlingly complex, with billions of transistors, but it is rooted in a much simpler computational model. From a programmer perspective, writing code for a 25 billion transistor Apple M3 isn’t that different from, say, a Cortex M0, which can be made with about 48,000 transistors. Similarly, a low performance RISC-V implementation is a reasonable student project. Obviously the M3 is doing a lot more with branch prediction, superscalar issue, memory hierarchies, op fusion, and other performance tricks, but it’s recognizably doing the same thing as a vastly smaller and simpler chip.

In the past, there were economic pressures towards replacing special-purpose circuitry with general purpose compute performance, but those incentives are shifting. Basically, if you’re optimizing for number of transistors, then somewhat less efficient general purpose compute can be kept busy almost all the time, while special purpose hardware is only justified if there is high enough utilization in the workload. However, as Dennard scaling has ended and we’re more constrained by power than transistor count, special purpose hardware starts winning more; it can simply be powered down if it isn’t used by the workload. The days of a purely RISC computational model are probably over. What I’d like to see replacing it is an agile core (likely RISC-V) serving as the control function for a bunch of special-purpose accelerator extensions. That certainly is the model of the Vortex project among others.

Conclusion

In his talk shortly before retirement, Nvidia GPU architect Erik Lindholm said (in the context of work creation and queuing systems), “my career has been about making things more flexible, more programmable. It’s not finished yet. There’s one more step that I feel that needs to be done, and I’ve been pursuing this at Nvidia Research for many years.” I agree, and my own work would benefit greatly. Now that he has retired, it is not clear who will take up the mantle. It may be Nvidia disrupting their previous product line with a new approach as they have in the past. It may be an upstart AI accelerator making a huge grid of low power processors with vector units, that just happens to be programmable. It might be CPU efficiency cores evolving to become so efficient they compete with GPUs.

Or it might not happen at all. On the current trajectory, GPUs will squeeze out incremental improvements on existing graphics workloads at the cost of increasing complexity, and AI accelerators will focus on improving the throughput of slop generation to the exclusion of everything else.

In any case, there is an opportunity for intellectually curious people to explore the alternate universe in which the good parallel computer exists; architectures can be simulated on FPGA like Vortex, and algorithms can be prototyped on multicore wide-SIMD CPUs. We can also start to think about what a proper programming language for such a machine might look like, as frustrating as it is to not have real hardware to run it on.

Progress on a good parallel computer would help my own little sliver of work, trying to make a fully parallel 2D renderer with modest resource requirements. I’ve got to imagine it would in addition help AI efforts, potentially unlocking sparse techniques that can’t run on existing hardware. I also think there’s a golden era of algorithms that can be parallel but aren’t a win on current GPUs, waiting to be developed.

A note on Metal shader converter

2023-06-12T18:05:42+00:00

At WWDC, Apple introduced Metal shader converter, a tool for converting shaders from DXIL (the main compilation target of HLSL in DirectX12) to Metal. While it is no doubt useful for reducing the cost of porting games from DirectX to Metal, I feel it does not move us any closer to a world of robust GPU infrastructure, and in many ways just adds more underspecified layers of complexity.

The specific feature I’m salty about is atomic barriers that allow for some sharing of work between threadgroups. These barriers are present in HLSL, and in fact have been since 2009, when Direct3D 11 and Shader Model 5 were first introduced. This barrier is not supported in Metal, and of the major GPU APIs, Metal is the only one that doesn’t support it. That holds back WebGPU’s performance (see gpuweb#3935 for discussion), as WebGPU must be portable across the major APIs.

I’ve discussed the value of this barrier in my blog post Prefix sum on portable compute shaders, but I’ll briefly recap. Among other things, it enables a single-pass implementation of prefix sum, using a technique such as decoupled look-back or the SAM prefix sum algorithm. A single-pass implementation can achieve the same throughput as memcpy, while a more traditional tree-reduction approach can at best achieve 2/3 that throughput, as it has to read the entire input in two separate dispatches. Further, tree reduction can actually be more complex to implement in practice, as the number of dispatches varies with the input size (it is typically 2 * ceil(log(n) / log(threadgroup size))). Prefix sum, in turn, is an important primitive for advanced compute workloads. There are a number of instances of it in the Vello pipeline, and it’s also commonly used in stream compaction, decoding of variable length data streams, and compression.

I believe there are other important techniques that are similarly unlocked by the availability of these primitives. For example, Nanite’s advanced compute pipelines schedule work through job queues, and in general it is not possible to reliably coordinate work between different threadgroups (even within the same dispatch) without such a barrier.

Complexity and reasoning

The GPU ecosystem exists at the knife edge of being strangled by complexity. A big part of the problem is that features tend to inhabit a quantum superposition of existing and not existing. Typically there is an anemic core, surrounded by a cloud of optional features. The Vulkan ecosystem is notorious for this: the extension list at vulkan.gpuinfo.org currently lists 146 extensions.

The widespread use of shader translation makes the situation even worse. When writing HLSL that will be translated into other shader languages, it’s no longer sufficient to consider Shader Model 5 to be a baseline, but rather the developer needs to keep in mind all the features that don’t translate to other languages. In some cases, the semantics change subtly (the rules for the various flavors “count leading zeros” when the input is 0 vary), and in other cases, like these device scoped barriers.

A separate category is things technically forbidden by the spec, but expected to work in practice. A good example here is the mixing of atomic and non-atomic memory operations (see gpuweb#2229). The spirv-cross shader translation tool casts non-atomic pointers to atomic pointers to support this common pattern, which is technically undefined behavior in C++, but in practice lots of people would be unhappy if the Metal shader compiler did anything other than the reasonable thing. Since Metal’s semantics are based on C++, I’d personally love to see this resolved by adopting std::atomic_ref from C++20 (Metal is still based on C++14). I’ll also note that the official Metal shader compiler tool generates reasonable IR for this pattern. It’s concerning that using open source tools such as spirv-cross triggers technical undefined behavior, but it’s probably not a big problem in practice.

I understand the incentives, but overall I find it disappointing that Metal chases shiny new features like ray-tracing, while failing to provide a solid, spec-compliant foundation for GPU compute.

Onward

The Metal announcements from WWDC move us no closer to a world of robust GPU infrastructure. But there is much we can still do.

For one, there is a GPU infrastructure stack that is based on careful specification and conformance testing, and has two high quality, open source implementations enabling deployment to almost all reasonably current GPU hardware. I speak of course of WebGPU. It’s lacking the shiny features – raytracing, bindless, and cooperative matrix operations (marketed as “tensor cores” and quite important for maximum performance in AI workloads) – but what is there should work.

For two, we can cheer on the work of Asahi Linux. They have recently announced OpenGL 3.1 support on Apple Silicon, and an intent to implement Vulkan. That work may be highly challenging, as obviously that implies implementing barriers which the Apple GPU engineers haven’t been able to manage. But they have done consistently impressive work so far, and I certainly hope they succeed. If nothing else, their work will result in much better public documentation of the hardware’s capabilities and limitations.

I have a recommendations for Apple as well. I hope that they document which HLSL features are expected to work and which are not. Currently in their documentation (which is admittedly beta), it just says “Some features not supported,” which I personally find not very useful. I would also like to give them credit for clarifying the Metal Shading Language Specification with respect to the scope of the mem_device flag to threadgroup_barrier. It now says, “The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup,” which to a very careful reader does indicate threadgroup scope and no guarantee at device scope. Previously it said “Ensure correct ordering of memory operations to device memory,” which could easily be misinterpreted as providing a device scope guarantee.

I am optimistic in the long term about having really good, portable infrastructure for GPU compute, but it is clear that we have a long way to go.

Simplifying Bézier paths

2023-04-18T13:07:42+00:00

Finding the optimal Bézier path to fit some source curve is, surprisingly, not yet a completely solved problem. Previous posts have given good solutions for specific instances: Fitting cubic Bézier curves primarily addressed Euler spirals (which are very smooth), while Parallel curves of cubic Béziers rendered parallel curves. In this post, I describe refinements of the ideas to solve the much more general problem of simplifying arbitrary paths. Along with the theoretical ideas, a reasonably solid implementation is landing in kurbo, a Rust library for 2D shapes.

The techniques in this post, and code in kurbo, come close to producing the globally optimum Bézier path to approximate the source curve, with fairly decent performance (as well as a faster option that’s not always minimal in the number of segments), though as we’ll see, there is considerable subtlety to what “optimum” means. Potential applications are many:

Simplification of Bézier paths for more efficient publishing or editing
Conversion of fonts into cubic outlines
Tracing of bitmap images into vectors
Rendering of offset curves
Conversion from other curve types including NURBS, piecewise spiral
Distortions and transforms including perspective transform

While the primary motivation is simplification of an existing Bézier path into a new one with fewer segments, the techniques are quite general. One of the innovations is the ParamCurveFit trait, which designed for efficient evaluation of any smooth curve segment for curve fitting. Implementations of this trait are provided for parallel curves and path simplification, but it is intentionally open ended and can be implemented by user code.

This work is an opportunity to revisit some of the topics from my thesis, where I also explored systematic search for optimal Bézier fits. The new work is several orders of magnitude faster and more systematic about not getting stuck in local minima.

The ParamCurveFit trait

At its best, a trait in Rust is a way to model some fact about the problem, then serves as an interface or abstraction boundary between pieces of code. The ParamCurveFit trait models an arbitrary curve for the purpose of generating a Bézier path approximating that source curve. The key insight is to sample both the position and derivative of the source curve for given parameters, and there is also a mechanism for detecting cusps (particularly important for parallel curves).

The curve fitting process uses those position and derivative samples to compute candidate approximation curves, then measure the error between source curve and approximation.

Two implementations are provided: parallel curves of cubic Béziers, and arbitrary Bézier paths for simplification. Implementing the trait is not very difficult, so it should be possible to add more source curves and transformations, taking advantage of powerful, general mechanisms to compute an optimized Bézier path.

The core cubic Bézier fitting algorithm is based on measuring the area and moment of the source curve. A default implementation is provided by the ParamCurveFit trait implementing numerical integration based on derivatives (Green’s theorem), but if there’s a better way to compute area and moment, source curves can provide their own implementation.

For Bézier paths, an efficient analytic implementation is possible. No blog post of mine is complete without some reference to a monoid, and this one does not disappoint. It’s relatively straightforward to compute area and moments from a cubic Bézier using symbolic evaluation of Green’s theorem. The area and moments of a sequence of Bézier segments is the sum of the individual values for each segment. Thus, we precompute the prefix sum of the areas and moments for all segments in a path, then can query an arbitrary range in O(1) time by taking the difference between the end point and the start point.

Error metrics

The problem of finding an optimal Bézier path approximation is always relative to some error metric. In general, an optimal path is one with a minimal number of segments while still meeting the error bound, and a minimal error compared to other paths with the same number of segments. A curve fitting algorithm will evaluate many error metrics, at least one for each candidate approximation to determine whether it meets the error bound. The simplest technique subdivides in half when the bound is not met, but a more sophisticated approach attempts to optimize the positions of the subdivision points as well.

The main error metric used for curve fitting is Fréchet distance. Intuitively, it captures the idea of the maximum distance between two curves while also preserving orientation of direction along a path (the related Hausdorff metric does not preserve orientation and so a curve with a sharp zigzag may have a small Hausdorff metric measured against a straight line). The difference is shown in the following illustration:

Computing the exact Fréchet distance between two curves is not in general tractable, so we have to use approximations. It is important for the approximation to not underestimate, as this will yield a result that exceeds the error bound.

The classic Tiller and Hanson paper on parallel curves proposed a practical, reasonably accurate, and efficient error metric to approximate Fréchet distance. It samples the source curve at n points, and for each point casts a ray along the normal from the point, detecting intersections with the cubic approximation. The maxium distance from a source curve point to the corresponding intersection is the metric.

Unfortunately, there is a case Tiller-Hanson handles poorly, and equally unfortunately, it does come up when doing simplification of arbitrary paths. Consider a superellipse shape, approximated (badly) by a cubic Bézier with a loop. The rays strike only parts of the approximating curve and miss the loop entirely.

Increasing the sampling density helps a little but doesn’t guarantee that the approximation will be well covered. Indeed, it is the high curvature in the source curve that makes this coverage more uneven.

A more robust metric is to parametrize the curves by arc length, and measure the distance between samples that share a corresponding portion of the total arc length. This ensures that all parts of both curves are considered. When curves are close (which is a valid assumption for curve fitting), it closely approximates Fréchet distance, though potentially can overestimate (because nearest points don’t necessarily coincide with arc length parametrization) and underestimate due to sampling.

However, arc length parametrization is considerably slower (about a factor of 10) because it requires inverse arc length computations. The approach implemented in kurbo#269 is to classify whether the source curve is “spicy” (considering the deltas between successive normal angles) and use the more robust computation only in those cases.

Finding subdivision points

An extremely common approach is adaptive subdivision. Compute an approximation, evaluate the error metric, and subdivide in half when it is exceeded. This approach is simple, robust, and performant (the total number of evaluations is within a factor of 2 of the number of segments). However, it tends to produce results with more segments than absolutely needed. In the limit, it tends to be about 1.5 times the optimum.

Fancier techniques try to optimize the subdivision points to reduce the number of segments. That is equivalent to dividing the source curve into ranges such that each range is just barely below the threshold; ironically, it basically amounts to maximizing the error metric right up to the constraint.

One technique is given in section 9.6.4 of my thesis. Basically you start on one end and for each segment find a subdivision point from the last point to one that’s just barely under the error threshold. Under the assumption that errors are monotonic (which is not always going to be the case), this finds the global minimum number of segments needed. The last segment will have an error well below the threshold. Then, another search finds the minimum error for which this process yields the same number of segments. Again, if error is monotonic, the result is the Fréchet distance of all segments being equal, which is (at least roughly) equivalent to the overall Fréchet distance being minimized.

For smooth source curves, monotonic error is a reasonable assumption. Even so, the above technique seems to work fairly robustly, producing fewer segments than simple adaptive subdivision, though it is somewhere around 50x slower.

There may be scope to improve this optimization process further. Crates like argmin implement general purpose multidimensional optimization algorithms, and it’s worth exploring whether those could produce equally good results with fewer evaluations.

Bumps

Something unexpected arose during testing: the resulting “optimum” simplified paths had bumps. Obviously at first I thought this was some kind of failure of the algorithm, but now I think something more subtle is happening.

The solution below is with an error tolerance of 0.15, which produces a total of 43 segments:

Near the bottom is a bump. It almost looks like a discontinuity (and other similar examples even more so), but on closer examination it is simply one very long and one very short control arm (ie distance between control point and corresponding endpoint), which creates a high degree of curvature variation:

The underlying problem is that Fréchet distance optimizes only for a distance metric, and does not by itself guarantee a low angle (or curvature) error. In many cases, these objectives are not in tension – the curve that minimizes distance error also smoothly hugs the source curve. But there are cases where there is in fact a tradeoff, and when such tradeoffs exist, agressively optimizing for one causes the other to suffer.

For this particular range of the test data, I believe the Fréchet-minimizing cubic Bézier approximation does indeed exhibit the bump; tweaking the parameters will certainly improve smoothness, but also result in a greater distance from the source curve.

This state of affairs is not acceptable in most applications. It is evidence that Fréchet does not capture all aspects of the objective function. Section 9.2 of the thesis discusses this point.

Given that aggressively optimizing Fréchet may give undesirable results, what is to be done? The most systematic approach would be to design an error metric that takes both distance and angle into account, and calibrate it to correlate strongly with perception. That is perhaps a tall order, requiring research to properly establish tuning parameters, and likely with complexity and runtime performance implications to evaluate. Even so, I think it should be pursued.

A simpler approach, implemented in the current code, is to recognize that these bumpy cubic Béziers can be recognized by their parameters; when the distance from the endpoints to the control points roughly equal to the chord length, then curves can have cusps or nearly so. These are the δ values from the core quartic-based curve fitting solver; a simple approach would be to simply exclude curves with δ values greater than some threshold (0.85 works well), and this also improves performance by decreasing the number of error evaluations needed, but a more sophisticated approach is to multiply the error by a penalty factor for larger δ. One advantage of the latter approach is that it is still possible to fit every actual exact Bézier input.

Applying this tweak gives a much smoother result, though it does require one more segment (44 rather than 43 previously).

Another point to make at this point is that the number of path segments needed scales very gently as the error bound is tightened; changing the threshold from 0.15 to 0.05 increases the number of segments only from 44 to 60. In the limit, the error scales as O(n^6). This extremely fast convergence is one reason that cubic Béziers can be considered a universal representation of curved paths; while some other representation such as NURBS may be able to represent conic sections exactly, any such smooth curve can be approximated to an extremely tight tolerance using a modest number of cubic segments. The path simplification technique in this blog (and in kurbo) gives a practical way to actually attain this degree of accuracy.

And taking the error down to 0.01 requires only 90 segments. This creates an extremely close fit to the source curve, still without requiring an excessive number of segments.

These are vector images, so please feel free to open them in a new tab and zoom in to inspect them more carefully. You should see all solutions display an impressive amount of control over the curvature variation afforded by cubic Béziers. The code is available, so also feel free to experiment with it with your own test data and for your own applications. I’m especially interested in cases where the algorithm doesn’t perform well; it hasn’t been carefully validated yet.

Low pass filtering

The “best” curve depends on the use case. When the source curve is an authoritative source of truth, then making each segment G1 continuous with it at the endpoints is reasonable. However, when the source curve is noisy, perhaps because it’s derived from a scanned image or digitizer samples, then an optimum simplified path may have angle deviations relative to the source that make the overall curve smoother.

I haven’t been working from noisy data and haven’t done experiments, but I do suggest a possibility: use a global optimizing technique such as that provided by argmin, and jointly optimize both the location of the subdivision points and a delta to be applied to the angle (equally on both sides of a subdivision, so the resulting curve remains G1). Another possibility is to explicitly apply a low-pass filter, tuned so the amount of smoothing is consistent with the amount of simplification. In any case, using the existing code with no further tuning may yield less than optimum results.

Discussion

The current code in kurbo is likely considerably better than what’s in your current drawing tool, but curve fitting remains work in progress. The core primitive feels solid, but applying it might require different tuning depending on the specifics of the use case. I invite collaboration along these lines.

Many of the points in this blog were discussed in Zulip threads, particularly curve offsets and More thoughts on cubic fitting. The Zulip is a great place to discuss these ideas, and we welcome newcomers and people bringing questions.

Thanks to Siqi Wang for insightful questions and making test data available.

Moving from Rust to C++

2023-04-01T13:00:42+00:00

Note to readers: this is an April Fools post, intended to satirize bad arguments made against Rust. I have mixed feelings about it now, as I think people who understood the context appreciated the humor, but some people were confused. That’s partly because it’s written in a very persuasive style, and I mixed in some good points. For a good discussion of C++ safety in particular, see JF Bastien’s talk Safety and Security: The Future of C++.

I’ve been involved in Rust and the Rust community for many years now. Much of my work has been related to creating infrastructure for building GUI toolkits in Rust. However, I have found my frustrations with the language growing, and pine for the stable, mature foundation provided by C++.

Choice of build systems

One of the more limiting aspects of the Rust ecosystem is the near-monoculture of the Cargo build system and package manager. While other build systems are used to some extent (including Bazel when integrating into larger polyglot projects), they are not well supported by tools.

In C++, by contrast, there are many choices of build systems, allowing each developer to pick the one best suited for their needs. A very common choice is CMake, but there’s also Meson, Blaze and its variants as well, and of course it’s always possible to fall back on Autotools and make. The latter is especially useful if we want to compile Unix systems such as AIX and DEC OSF/1 AXP. If none of those suffice, there are plenty of other choices including SCons, and no doubt new ones will be created on a regular basis.

We haven’t firmly decided on a build system for the Linebender projects yet, but most likely it will be CMake with plans to evaluate other systems and migrate, perhaps Meson.

Safety

Probably the most controversial aspect of this change is giving up the safety guarantees of the Rust language. However, I do not think this will be a big problem in practice, for three reasons.

First, I consider myself a good enough programmer that I can avoid writing code with safety problems. Sure, I’ve been responsible for some CVEs (including font parsing code in Android), but I’ve learned from that experience, and am confident I can avoid such mistakes in the future.

Second, I think the dangers from memory safety problems are overblown. The Linebender projects are primarily focused on 2D graphics, partly games and partly components for creating GUI applications. If a GUI program crashes, it’s not that big a deal. In the case that the bug is due to a library we use as a dependency, our customers will understand that it’s not our fault. Memory safety bugs are not fundamentally different than logic errors and other bugs, and we’ll just fix those as they surface.

Third, the C++ language is evolving to become safer. We can use modern C++ techniques to avoid many of the dangers of raw pointers (though string views can outlive their backing strings, and closures as used in UI can have very interesting lifetime patterns). The C++ core guidelines are useful, even if they’re honored in the breach. And, as discussed in the next section, the language itself can be expected to improve.

Language evolution

One notable characteristic of C++ is the rapid adoption of new features, these days with a new version every 3 years. C++20 brings us modules, an innovative new feature, and one I’m looking forward to actually being implemented fairly soon. Looking forward, C++26 will likely have stackful coroutines, the ability to embed binary file contents to initialize arrays, a safe range-based for loop, and many other goodies.

By comparison, the pace of innovation in Rust has become more sedate. It used to be quite rapid, and async has been a major effort, but recently landed features such as generic associated types have more of the flavor of making combinations of existing features work as expected rather than bringing any fundamentally new capabilities; the forthcoming “type alias impl trait” is similar. Here, it is clear that Rust is held back by its commitment to not break existing code (enacted by the edition mechanism), while C++ is free to implement backwards compatibility breaking changes in each new version.

C++ has even more exciting changes to look forward to, including potentially a new syntax as proposed by Herb Sutter cppfront, and even the new C++ compatible Carbon language.

Fortunately, we have excellent leadership in the C++ community. Stroustrup’s paper on safety is a remarkably wise and perceptive document, showing a deep understanding of the problems C++ faces, and presenting a compelling roadmap into the future.

Community

I’ll close with some thoughts on community. I try not to spend a lot of time on social media, but I hear that the Rust community can be extremely imperious, constantly demanding that projects be rewritten in Rust, and denigrating all other programming languages. I will be glad to be rid of that, and am sure the C++ community will be much nicer and friendlier. In particular, the C++ subreddit is well known for their sense of humor.

The Rust community is also known for its code of conduct and other policies promoting diversity. I firmly believe that programming languages should be free of politics, focused strictly on the technology itself. In computer science, we have moved beyond prejudice and discrimination. I understand how some identity groups may desire more representation, but I feel it is not necessary.

Conclusion

Rust was a good experiment, and there are aspects I will look back on fondly, but it is time to take the Linebender projects to a mature, production-ready language. I look forward to productive collaboration with others in the C++ community.

We’re looking for people to help with the C++ rewrite. If you’re interested, please join our Zulip instance.

Discuss on Hacker News and /r/rust.

Requiem for piet-gpu-hal

2023-01-07T17:12:42+00:00

Vello is a new GPU accelerated renderer for 2D graphics that relies heavily on compute shaders for its operation. (It was formerly known as piet-gpu, but we renamed it recently because it is no longer based on the [Piet] render context abstraction, and has been substantially rewritten). As such, it depends heavily on having good infrastructure for writing and running compute shaders, consistent with its goals of running portably and reliably across a wide range of GPU hardware. Unfortunately, there hasn’t been a great off-the-shelf solution for that, so we’ve spent a good part of the last couple years building our own hand-rolled GPU abstraction layer, piet-gpu-hal.

During that time, the emerging WebGPU standard, along with the allied WGSL shader language, showed promise for providing such infrastructure, but it has seemed immature, and also has limitations that would have prevented us from doing experiments involving advanced GPU features. This motivated continued work on developing our own abstraction layer, so much so that the piet-gpu vision document has a section entitled “Why not wgpu?”.

Recently, however, we switched Vello to using wgpu for its GPU infrastructure and deleted piet-gpu-hal entirely. This was the right decision, but there was something special about piet-gpu-hal and I hope its vision is realized some day. This post talks about the reasons for the change and the tradeoffs involved.

In the process of doing piet-gpu-hal, we learned a lot about portable compute infrastructure. We will apply that knowledge to improving WebGPU implementations to better suit our needs.

Goals

The goals of piet-gpu-hal were admirable, and I believe there is still a niche for something that implements them. Essentially, it was to provide a lightweight yet powerful runtime for running compute shaders on GPU. Those goals are fulfilled to a large extent by WebGPU, it’s just that it’s not quite as lightweight and not quite as powerful, but on the other hand the developer experience is much better.

The most important goal, especially compared with WebGPU implementations, is to reduce the cost of shader compilation by precompiling them to the intermediate representation (IR) expected by the GPU, rather than compiling them at runtime. This avoids anywhere from about 1MB to about 20MB of binary size for the shader compilation infrastructure, and somewhere around 10ms to 100ms of startup time to compile the shaders.

An additional goal was to unlock the powerful capabilities present on many (but not all) GPUs by runtime query. For our needs, the most important ones are subgroups, descriptor indexing, and detection of unified memory (avoiding the need for separate staging buffers). However, for the coming months we are prioritizing getting things working well for the common denominator, which is well represented by WebGPU, as it’s generally possible to work around the lack of such features by doing a bit more shuffling around in memory.

Implementation choices

In this section, I’ll talk a bit about implementation choices made in piet-gpu-hal and their consequences.

Shader language

After considering the alternatives, we landed on GLSL as the authoritative source for writing shaders. HLSL was in the running, and it seems to be the most popular choice (primarily in the game world) for writing portable shaders, but ultimately GLSL won because it is capable of expressing all of what Vulkan can do. In particular, I wanted to experiment with the Vulkan memory model.

Another choice considered seriously was rust-gpu. That looks promising, and it has many desirable properties, not least being able to run the same code on CPU and GPU, but just isn’t mature enough. Hopefully that will change. I think porting Vello to it would be an interesting exercise, and would shake out many of the issues needing to be solved to use it in production.

Another appealing choice would be Circle, a C++ dialect that targets compute shaders among other things. It might have made it over the edge if it were released as open source; someone should buy out Sean’s excellent work and relicense it.

Ahead of time compilation

From the GLSL source, we also had a pipeline to compile this to shader IR. For all targets, the first step was glslangValidator to compile the GLSL to SPIR-V. And for Vulkan, that was sufficient.

For DX12, the next step was spirv-cross to convert the SPIR-V to HLSL, followed by DXC to convert this to DXIL. We really wanted to use DXC, especially over relying on the system provided shader compiler (some arbitrary version of FXC), both to access advanced features in Shader Model 6 such as wave operations, and also because FXC is somewhat buggy and we have evidence it will cause problems. In fact, I see the current reliance on FXC to be one of the biggest risks for WebGPU working well on Windows. (In the longer term, it is likely that both Tint and naga will compile WGSL directly to DXIL and perhaps DXBC, which will solve these problems, but will take a while)

For Metal, we used spirv-cross to convert SPIR-V to Metal Shading Language. We intended to go one step further, to AIR (also known as metallib), using the command-line tools, but we didn’t actually do that, as it would have made setting up the CI more difficult.

All this was controlled through a simple hand-rolled ninja file. At first, we ran this by hand and committed the results, but it was tricky to make sure all the tools were in place (which would have been considerably worse had we required both DXC with the signing DLL in place and the Metal tools), and it also resulted in messy commits, prone to merge conflicts. Our solution to this was to run shader compilation in CI, which was better but still added considerably to the friction, as it was easy to be in the wrong branch for committing PRs.

GPU abstraction layer

The core of piet-gpu-hal was an abstraction layer over GPU APIs. Following the tradition of gfx-hal, we called this a HAL (hardware abstraction layer), but that’s not quite accurate. Vulkan, Metal, and DX12 are all hardware abstraction layers, responsible among other things for compiling shader IR into the actual ISA run by the hardware, and dispatching compute and other work.

The design was based loosely on gfx-hal, but more tuned to our needs. To summarize, gfx-hal was a layer semantically very close to Vulkan, so that the Vulkan back-end was a very thin layer, and the Metal back-end resembled MoltenVK, with the DX12 back-end also simulating Vulkan semantics in terms of the underlying API. This ultimately wasn’t a very satisfying approach, and for wgpu was deprecated in favor of wgpu-hal.

We only implemented the subset needed for compute, which turned out to be a significant limitation because we found we also needed to run the rasterization pipeline for some tasks such as image scaling. The need to add new functionality to the HAL was also a major friction point. We were able to add features like runtime query for advanced capabilities such as subgroup size control, and a polished timer query implementation on all platforms (the latter is still not quite there for wgpu). Overall it worked pretty well.

I make one general observation: all these APIs and HALs are very object oriented. There are objects representing the adapter, device, command lists, and resources such as buffers and images. Managing these objects is nontrivial, especially because the lifetimes are complex due to the asynchronous nature of GPU work submission; you can’t destroy a resource until all command buffers referencing it have completed. These patterns are fairly cumbersome, and especially don’t translate easily to Rust.

For others seeking to build cross-platform GPU infrastructure, I suggest exploring a more declarative, data-oriented approach. In that approach, build a declarative render graph using simple value types as much as possible, then write specialized engines for each back-end API. This should lead to more straightforward code with less dynamic dispatching, and also resolve the need to find common-denominator abstractions. We are moving in this direction in Vello, and may explore it further as we encounter the need for native back-ends.

WebGPU pain points

Overall the task of porting piet-gpu to WGSL went well, but we ran into some issues. We expect these to be fixed and improved, but at the moment the experience is a bit rough. In particular, the demo is not running on DX12 at all, either in Chrome Canary for the WebGPU version or through wgpu. The main sticking point is uniformity analysis, which only very recently has a good solution in WGSL (workgroupUniformLoad) and the implementation hasn’t fully landed.

Collaboration with community

A major motivation for switching to WGSL and WebGPU is to interoperate better with other projects in that ecosystem. We already have a demo of Bevy interop, which is particularly exciting.

One such project is wgsl-analyzer, which gives us interactive language server features as we develop WGSL shader code: errors and warnings, inline type hints, go to definition, and more. Thanks especially to Daniel McNab, they’ve been super responsive to our needs and we maintain a working configuration. I strongly recommend using such tools; the old way sometimes felt like submitting shader code on a stack of punchcards to the shader compiler.

Obviously another major advantage is deploying to the web using WebGPU, which we have working in at least prototype form. With an intent to ship from Chrome and active involvement from Mozilla and Apple, the prospect of WebGPU shipping in usable state on real browsers seems close.

Precompiled shaders on roadmap?

The goals of a lightweight runtime for GPU compute remain compelling, and in our roadmap we plan to return to precompiled shaders so it is possible to use Vello without a need to carry along runtime shader compilation infrastructure, and also to more aggressively exploit advanced GPU features. There are two avenues we are exploring.

One is to add support for precompiled shaders to either wgpu or wgpu-hal. We have an issue with some thoughts. The advantage of this approach is that it potentially benefits many applications developed on WebGPU, for example native binary distributions of games. If it is possible to identify all permutations of shaders which will actually be used, and bundle the IR.

The other is to add additional native renderer back-ends. The new architecture is much more declarative and less object oriented, so the module that runs the render graph can be written directly in terms of the target API rather than going through an abstraction layer. If there is a desire to ship Vello for a targeted application (for example, animation playback on Android), this will be the best way to go.

Other GPU compute clients

One thing we were watching for was whether there was any interest in using piet-gpu-hal for other applications. Matt Keeter did some experimentation with it, but otherwise there was not much interest. Sometimes the “portable GPU compute” space feels a bit lonely.

An intriguing potential application space is machine learning. It would be an ambitious but doable project to get, say, Stable Diffusion running on portable compute using either piet-gpu-hal or something like it, so that very little runtime (probably less than a megabyte of code) would be required. Related projects include Kompute.cc, which runs machine learning workloads but is Vulkan only, and also MediaPipe.

One downside to trying to implement machine learning workloads in terms of portable compute shaders is that it doesn’t get access to neural accelerators such the Apple Neural Engine. When running in native Vulkan, you may get access to cooperative matrix features, which on Nvidia are branded “tensor cores,” but for the most part these are proprietary vendor extensions and it is not clear if and when they might be exposed through WebGPU. Even so, at least on Nvidia hardware it seems likely that using these features can unlock very high performance.

Going forward, one approach I find particularly promising for running machine learning is wonnx, which implements the ONNX spec on top of WebGPU. No doubt in the first release, performance will lag highly tuned native implementations considerably, but once such a thing exists as a viable open source project, I think it will be improved rapidly. And WebGPU is not standing still…

Beyond WebGPU 1.0

WebGPU 1.0 is basically a “least common denominator” of current GPUs. This has advantages and disadvantages. It means that there are fewer choices (and fewer permutations), and that code written in WGSL can run on all modern GPUs. The downside is that there are a number of features that most (but not all) current GPUs have that can speed things further, and those features are not available.

The likely path through this forest is to define extensions to WebGPU. These can start as privately implemented extensions, running in native, then, hopefully based on that experience, proposed for standardization on the web. They would be optional, meaning that more shader permutations will need to be written to make use of them when available, or fall back when not. One such extension, fp16 arithmetic, has already been standardized, though we don’t yet exploit it in Vello.

In Vello, we have identified three promising candidates for further extension.

First, descriptor indexing, which is the ability to create an array of textures, then have a shader dynamically access textures from that array. Without it, a shader can only have a fixed number of textures bound. To work around that limitation, we plan to have an image atlas, copy the source images into that atlas using rasterization stages, then access regions (defined by uv quads) from the single binding. We don’t expect performance to be that bad, as such copies are fairly cheap, but it does require extra memory and is not ideal. For fully native, the GPU world is moving to bindless, which is popular in DX12, and the recent descriptor buffer extension makes Vulkan work the same way. It is likely that WebGPU will standardize on something more like descriptor indexing than full bindless, because the latter is basically raw pointers and thus unsafe by design. In any case, see the image resources issue for more discussion.

Second, device-scoped barriers, which unlock single-pass (decoupled look-back) prefix sum techniques. They are not present in Metal, but otherwise should be straightforward to add. I wrote about this in my Prefix sum on portable compute shaders blog post. In the meantime, we are using multiple dispatches, which is much more portable, but not quite as performant.

Third, subgroups, also known as SIMD groups, wave operations, and warp operations. Within a workgroup, especially for stages resembling prefix sum, Vello uses a lot of workgroup shared memory and barriers. With subgroups, it’s possible to reduce that traffic, in some cases dramatically. That should make the biggest difference on Intel GPUs, which have relatively slow shared memory.

Unfortunately, there is a tricky aspect to subgroups, which is that in most cases there is no control in advance over the subgroup size, the shader compiler picks it on the basis of a heuristic. The subgroup size control extension, which is mandatory in Vulkan 1.3, fixes this, and allows writing code specialized to a particular subgroup size. Otherwise, it will be necessary to write a lot of conditional code and hope that the compiler constant-folds the control flow based on subgroup size. Another challenge is that it is more difficult to test code for a particular subgroup size, in contrast to workgroup shared memory which is quite portable. More experimentation is needed to determine whether subgroups without the size control extension work well across a wide range of hardware.

And on the topic of that experimentation, it’s difficult to do so without adequate GPU infrastructure. I may find myself reaching for the archived version of piet-gpu-hal.

Conclusion

Choosing the right GPU infrastructure depends on the goals, as sadly there is not yet a good consensus choice for a GPU runtime suitable for a wide variety of workloads including compute. For the goals of researching the cutting edge of performance, hand-rolled infrastructure was the right choice, and piet-gpu-hal served that well. For the goal of lowering the friction for developing our engine, and also interoperating with other projects, WebGPU and wgpu are a better choice. Our experience with the port suggests that the performance and features are good enough, and that it is a good experience all-around.

We hope to make Vello useful enough to use in production within the next few months. For many applications, WebGPU will be an appropriate infrastructure. For others, where the overhead of runtime shader compilation is not acceptable, we have a path forward but will need to consider alternatives. Either ahead-of-time shader compilation can be retrofitted to wpgu, or we will explore a more native approach.

In any case, we look forward to productive development and collaboration with the broader community.

Raph’s reflections and wishes for 2023

2022-12-31T14:44:42+00:00

This post reflects a bit on 2022 and contains wishes for 2023. It mixes project and technical stuff, which I write about fairly frequently, and more personal reflections, which I don’t.

Reflections on 2022

Overall 2022 was a good year for me, though it was stressful and challenging in some ways. A lot of the stuff I did I’ve talked about in this blog, but there are a few other things, and quite a few more things in the pipeline. It was not a year of shipping anything big, and that is one thing I’d like to see different in 2023.

What I do straddles a number of more traditional categories. I do research, but most of my output is blog posts and code, not academic papers. My area of focus is primarily 2D graphics and GUI infrastructure, and these are to a large extent neglected step-children of academic computer science - there aren’t conferences or journals that specialize in the topic, nor are there decent textbooks. As a result, knowledge tends to be somewhat arcane, and just as much “lore” shared among practitioners as a literature. Even so, I find the field intellectually very stimulating and consider this odd situation to be an opportunity.

My portfolio of projects is very ambitious, as it basically includes an entire Rust UI toolkit plus a lot of the supporting technologies for that. It’s arguably too much for one person to take on, but I’m trying my best to make it work, largely by fostering a community around the projects. In late 2022, a big step was setting up weekly office hours, one hour a week, where we check in and discuss the various projects. I think that’s working well and look forward to continuing it.

On happiness

I’m not one of those “quantified self” people, but I have noticed that my happiness tends to correlate pretty directly with how much code I’m writing. I’m sure some of that is simply because if there’s stressful stuff going on that gets in the way of coding, that makes me unhappy, but obviously I just really enjoy it.

In particular, I love solving deep puzzles, and I find plenty of opportunity for that. An especially enjoyable track is adapting algorithms to be massively parallel so they run efficiently on GPUs (especially compute). Sometimes that leads to friction; my projects often have a “rocket science” nature to them, making them hard to contribute to.

Aside from puzzle-solving, which is largely a solitary activity, I also like the aspects of teaching, getting people up to speed, especially in topics that are not easily accessible through a standard computer science education or textbooks. Writing, including this blog, is a big part of that.

To a large extent, I will try to make time in 2023 to focus on these things I really enjoy, that I find make me happy. I am extremely fortunate that my paying job, as a research software engineer on Google Fonts, lets me do that.

Vello

Of the various projects in flight (and there are many!), one is clearly rising to the top of the stack: Vello, the GPU-accelerated 2D renderer. We made a lot of progress last year, but it’s still not quite ready to ship. Part of it is that it’s trying to solve a very hard problem, and part of it is that some solid GPU infrastructure needs to be in place for it to fly. A lot of time and energy last year went into GPU infrastructure in various ways, both pushing piet-gpu-hal forward, and then doing a full rewrite into WGSL and WebGPU (writeup to come).

For those who might be confused by the name, Vello is a rename of piet-gpu but still fundamentally the same project. The old name didn’t really fit, as “Piet” refers to a trait/method abstraction for the traditional 2D graphics API, and we’re moving away from that. The new name is intended to evoke both vellum (parchment, as in books and manuscripts) and velocity.

It is clear to me that there is strong demand for a good, cross-platform 2D rendering engine. There are a few other interesting things going on the space, but I’ll be honest, what I see, I want to compete with. There are big missing pieces (the most important by far is the ability to import images), but I see a pretty clear path to getting all that done.

So this is by far the biggest goal for the year: get Vello to a usable state. That involves shipping a crate, doing a proper writeup, which will probably be a 20-30 page self-standing report, and doing quantitative performance evaluation against other renderers.

Xilem

Another major initiative is Xilem. This was originally just a reactive layer, intended to be generic over the underlying widget tree, but is emerging as an umbrella for the larger project.

My hypothesis is (still) that Xilem is the best known reactive architecture for Rust. If I am correct, that means it is more concise, more ergonomic, more efficient, and better integrated with async than comparable work on Dioxus, Sycamore, pax-lang, and the Elm variants. That would be an exciting result, in which case I hope and expect some interesting systems will be built on the architecture. If I am wrong (which is entirely possible), then those other reactive approaches are all likely good enough for practical use, and have more evidence behind them than Xilem currently does. I feel strongly that it is worthwhile to test the hypothesis, and the only way to do that is to try to build real systems on the architecture. That experience will also likely drive improvements to the Xilem architecture, and I expect we’ll learn something interesting in any case. [Note: this paragraph is a rewrite of the original; see raphlinus#88 for the diff]

Progress on Xilem will have a considerably different texture than Vello. I hope that a huge fraction of the actual implementation work will be done by the community. Most of that work is most decidedly not rocket science, as I expect lots of it to be straightforward adaptation of the existing Druid widget set into the new architecture (and I’m making certain decisions explicitly to make that easier). I’m trying hard not to lick the cookie too much, and encourage other people to take on subprojects. I’m also trying to foster a culture where everyone in the community feels empowered to review PR’s and keep things moving, as having that block on me has been difficult.

One thing I am looking forward to working on is immutable data structures for efficiently (sparsely) diffing collections, which will hopefully realize the promise of my RustLab 2020 talk. This is a problem I feel has never been fully solved in UI toolkits - either you use complex and fragile mechanisms to incrementally update the UI, or you end up diffing the whole collection every time - and I believe using solid computer science to solve it, plus good Rust API design, would be very satisfying.

I consider Xilem to be more speculative and riskier than Vello. The extent to which it succeeds is largely based on how well the community can organize around getting the work done; if it gated on me, it’s a good question when it would all get done, especially with other projects competing for my attention.

The best place to learn more about Xilem is my High Performance Rust UI talk. I go over the goals and motivations of the project, and there’s good Q&A at the end. There’s a bit more about the Rust language aspects, particuarly trying to do ergonomic API design, in my RustLab 2022 keynote. In any case, I will be writing more.

Obviously Xilem depends on having good 2D rendering, so clearly time spent getting Vello production-ready contributes toward the overall success of the project.

To the people who want a good GUI toolkit they can use right now: I apologize, and ask for patience. In the mean time, you might check out Iced, Egui, or Slint. Those are all pretty good right now, and continually improving. I personally find the end goal of Xilem to be extremely compelling, but even in the best case it will take a while to get there.

Curves and other research

Over the holiday break, I let my mind roam more freely than usual, and I found myself coming back to various problems in curves. Probably the most satisfying work was refining the Bézier curve fitting ideas (see kurbo#230 for the code, hopefully writeup coming before too long). I also have what I consider a very promising idea to improve hyperbeziers, which have been on the back burner for a couple years, and an idea for robust boolean path operations.

But probably the juiciest bit of work will be perfecting the path geometry parts in Vello. In particular, I have a fairly compelling prototype of a combined flatten + offset operation in terms of Euler spirals, which I feel can be implemented efficiently on GPU as a compute shader and integrated into the Vello pipeline. That will improve handling of strokes (to handle all the join/cap options), and also serve as the basis for stem darkening of font outlines. Fun fact: the parallel curve work I did a few months ago was motivated by trying to get a good GPU implementation, but ultimately I believe that particular work is best suited for creative vector graphics applications, while the Euler spiral approach is better suited for GPU.

Non-goals

A year ago, I said only half as a joke, that my main New Years resolution for 2022 would be to not write a shader language. I’m pleased to report that I have succeeded. I will renew that vow for next year as well.

In many ways, it would make sense to start designing a shader language. My process for writing compute shaders is basically to design them in a fictional high level language with operations like prefix sum, stream compaction, etc., then translate by hand into the much lower level WGSL (formerly GLSL). If a good high level language existed (Futhark is the closest of what I’ve seen so far, but rust-gpu is also a contender), it would in theory streamline that work and let me write more efficiently.

But I think the advice in Don’t write a programming language is pretty sound. A programming language is a multi-year endeavor, and with a poor chance of success even in the most favorable conditions. In addition, I’ve found that everything in GPU land is at least 5 times harder than it is on the CPU. Case in point, I had trouble even getting Futhark to run on the GPU hardware of my main development machines, as it doesn’t have any compute shader back-ends, only OpenCL (which is basically dead) and CUDA (not much help on AMD or Apple M1 hardware). And that’s a relatively successful effort with over 8 years of experience!

I am also unlikely to write papers for academic journals and conferences. Perhaps it’s sour grapes, but I haven’t had a good experience, and feel that the (nontrivial) time and effort doesn’t really pay off. I’ll continue to use this blog, and publication of open source software, as the primary way I communicate research results. That said, I had one good experience of being asked to collaborate on an ASPLOS paper, and I remain open to collaborations. For someone with an incentive to publish academically, I think there’s quite a bit of raw material to work with.

Conclusion

I really feel that a lot of the research effort of the past year and beyond is poised to pay off in the next. Most especially, I am excited about shipping Vello, as I think that will advance the state of 2D rendering and also points the way for doing other interesting things in GPU compute. I also really look forward to working deeply with the community forming around Xilem to push that forward and hopefully make it reality.

There’s lots more that could be said, especially about the many individual subprojects, but in this reflection I hoped to give a general overview.

Discussion on /r/rust.

In any case, I wish everybody a happy and healthy 2023.

Minikin retrospective

2022-11-08T18:40:42+00:00

There’s a lot of new interest in open source text layout projects, including cosmic-text, and the parley work being done by Chad Brokaw. I’m hopeful we’ll get a good Rust solution soon.

I encourage people working on text to study existing open source code bases, as much of the knowledge is arcane, and there is unfortunately no good textbook on the subject. Obviously browser engines facilitate extremely sophisticated layout, but because of their complexity and the constraints of HTML and CSS compatibility, they may be hard going. The APIs of DirectWrite and Core Text are also worthy of study, but unfortunately their implementations remain closed.

An interesting codebase is Minikin. It powers text layout in the Android framework. I wrote the original version starting back around 2013, but it is since maintained by the very capable Android text team.

A bit of additional context. The design of Minikin was constrained by compatibility with the existing Android text API (mostly exposed through Java, though these days there is a nontrivial NDK surface as well). There is a higher level layer, responsible for rich text representation, editing, and other things, while Minikin is the lower level that powers that and does shaping, itemization, hit testing, and so on. (See Text layout is a loose hierarchy of segmentation if these terms are unfamiliar). For the most part, the interface between the higher and lower levels is runs of text. Line breaking is an interesting case where the responsibility for crossing levels is shared across the layer boundary. For the most part, the higher level iterates its own rich text representation and hands runs to Minikin.

A special feature of Minikin is its optimized line breaking algorithm, strongly inspired by Knuth-Plass. This was motivated largely by the need to handle small screens better, but has some tweaks to make it even better for mobile. The heuristics try not to place a word by itself on the last line, and in general try to balance line lengths. Line breaking generally follows ICU rules, but does special case email addresses and URLs, as those are very common on mobile and the ICU rules work poorly.

There’s also fun stuff in there for grapheme boundaries, deciding between color and text emoji presentation forms, and calculating exactly what should be deleted on backspace (the logic for that is surprisingly complicated and as far as I know has not been written down anywhere).

Much of the internal structure of Minikin is dedicated to itemization, which is done largely on the basis of the cmap coverage of the fonts in the fallback chain. Doing that query font-by-font for every character would be expensive, so there are fancy bitmap acceleration structures. A good, general, cross-platform way to do itemization and fallback is a hard problem, but I think this solution works well for Android’s specific needs.

While there’s unfortunately no excellent documentation on Minkin’s internals, there are some resources, and part of the purpose of this blog is to point people to them. I’ve just gotten permission to publish the Project Minikin slide deck from 2013 (PDF), which explains the motivation and some of the early design ideas. There’s also an lwn article which goes into some detail, largely based on my 2015 ATypI talk (video, slides).

Flutter’s LibTxt was originally based on Minikin, but no doubt has diverged considerably since then, as they have fairly different requirements and of course are not bound by compatibility with Android.

But if you have deep interest in this, I recommend studying the code, as that’s the ultimate source of truth. It’s extremely fortunate that Android’s open source development model gives us access to it!

If you have questions, let me know (an issue on this repo is a good way, or you can ask on Mastodon), and I’ll do my best to answer.

Parallel curves of cubic Béziers

2022-09-09T17:45:42+00:00

Distance

Accuracy

Method

The problem of parallel or offset curves has remained challenging for a long time. Parallel curves have applications in 2D graphics (for drawing strokes and also adding weight to fonts), and also robotic path planning and manufacturing, among others. The exact offset curve of a cubic Bézier can be described (it is an analytic curve of degree 10) but it not tractable to work with. Thus, in practice the approach is almost always to compute an approximation to the true parallel curve. A single cubic Bézier might not be a good enough approximation to the parallel curve of the source cubic Bézier, so in those cases it is sudivided into multiple Bézier segments.

A number of algorithms have been published, of varying quality. Many popular algorithms aren’t very accurate, yielding either visually incorrect results or excessive subdivision, depending on how carefully the error metric has been implemented. This blogpost gives a practical implementation of a nearly optimal result. Essentially, it tries to find the cubic Bézier that’s closest to the desired curve. To this end, we take a curve-fitting approach and apply an array of numerical techniques to make it work. The result is a visibly more accurate curve even when only one Bézier is used, and a minimal number of subdivisions when a tighter tolerance is applied. In fact, we claim $O(n^6)$ scaling: if a curve is divided in half, the error of the approximation will decrease by a factor of 64. I suggested a previous approach, Cleaner parallel curves with Euler spirals, with $O(n^4)$ scaling, in other words only a 16-fold reduction of error.

Though there are quite a number of published algorithms, the need for a really good solution remains strong. Some really good reading is the Paper.js issue on adding an offset function. After much discussion and prototyping, there is still no consensus on the best approach, and the feature has not landed in Paper.js despite obvious demand. There’s also some interesting discussion of stroking in an issue in the COLRv1 spec repo.

Outline of approach

The fundamental concept is curve fitting, or finding the parameters for a cubic Bézier that most closely approximate the desired curve. We also employ a sequence of numerical techniques in support of that basic concept:

Finding the cusps and subdividing the curve at the cusp points.
Computing area and moment of the target curve
- Green’s theorem to convert double integral into a single integral
- Gauss-Legendre quadrature for efficient numerical integration
Quartic root finding to solve for cubic Béziers with desired area and moment
Measure error to choose best candidate and decide whether to subdivide
- Cubic Bézier/ray intersection

Each of these numeric techniques has its own subtleties.

Cusp finding

One of the challenges of parallel curves in general is cusps. These happen when the curvature of the source curve is equal to one over the offset distance. Cubic Béziers have fairly complex curvature profiles, so there can be a number of cusps - it’s easy to find examples with four, and it wouldn’t be surprising to me if there were more. By contrast, Euler spirals have simple curvature profiles, and the location of the cusp is extremely simple to determine.

The general equation for curvature of a parametric curve is as follows:

\[\kappa = \frac{\mathbf{x}''(t) \times \mathbf{x}'(t)}{|\mathbf{x}'(t)|^3}\]

The cusp happens when $\kappa d + 1 = 0$. With a bit of rewriting, we get

\[(\mathbf{x}''(t) \times \mathbf{x}'(t))d + |\mathbf{x}'(t)|^3 = 0\]

As with many such numerical root-finding approaches, missing a cusp is a risk. The approach currently used in the code in this blog post is a form of interval arithmetic: over the (t0..t1) interval, a minimum and maximum value of $|\mathbf{x}’|$ is computed, while the cross product is quadratic in t. Solving that partitions the interval into ranges where the curvature is definitely above or below the threshold for a cusp, and a (hopefully) smaller interval where it’s possible.

This algorithm is robust, but convergence is not super-fast - it often hits the case where it has to subdivide in half, so convergence is similar to a bisection approach for root-finding. I’m exploring another approach of computing bounding parabolas, and that seems to have cubic convergence, but is a bit more complicated and fiddly.

In cases where you know you have one simple cusp, a simple and generic root-finding method like ITP (about more which below) would be effective. But that leaves the problem of detecting when that’s the case. Robust detection of possible cusps generally also gives the locations of the cusps when iterated.

Computing area and moment of the target curve

The primary input to the curve fitting algorithm is a set of parameters for the curve. Not control points of a Bézier, but other measurements of the curve. The position of the endpoints and the tangents can be determined directly, which, just counting parameters, leaves two free. Those are the area and x-moment. These are generally described as integrals. For an arbitrary parametric curve (a family which easily includes offsets of Béziers), Green’s theorem is a powerful and efficient technique for approximating these integrals.

For area, the specific instance of Green’s theorem we’re looking for is this. Let the curve be defined as x(t) and y(t), where t goes from 0 to 1. Let D be the region enclosed by the curve. If the curve is closed, then we have this relation:

\[\iint_D dx \,dy = \int_0^1 y(t)\, x'(t)\, dt\]

I won’t go into the details here, but all this still works even when the curve is open (one way to square up the accounting is to add the return path of the chord, from the end point back to the start), and when the area contains regions of both positive and negative signs, which can be the case for S-shaped curves. The x moment is also very similar and just involves an additional $x$ term:

\[\iint_D x \, dx \,dy = \int_0^1 x(t)\, y(t)\, x'(t)\, dt\]

Especially given that the function being integrated is (mostly) smooth, the best way to compute the approximate integral is Gauss-Legendre quadrature, which has an extremely simple implementation: it’s just the dot product between a vector of weights and a vector of the function sampled at certain points, where the weights and points are carefully chosen to minimize error; in particular they result in zero error when the function being integrated is a polynomial of order up to that of the number of samples. The JavaScript code on this page just uses a single quadrature of order 32, but a more sophisticated approach (as is used for arc length computation) would be to first estimate the error and then choose a number of samples based on that.

Note that the area and moments of a cubic Bézier curve can be efficiently computed analytically and don’t need an approximate numerical technique. Adding in the offset term is numerically similar to an arc length computation, bringing it out of the range where analytical techniques are effective, but fortunately similar numerical techniques as for computing arc length are effective.

The basic approach to curve fitting was described in Fitting cubic Bézier curves. Those ideas are good, but there were some rough edges to be filled in and other refinements.

To recap, the goal is to find the closest Bézier, in the space of all cubic Béziers, to the desired curve (in this case the parallel curve of a source Bézier, but the curve fitting approach is general). That’s a large space to search, but we can immediately nail down some of the parameters. The endpoints should definitely be fixed, and we’ll also set the tangent angles at the endpoints to match the desired curve.

One loose end was the solving technique. My prototype code used numerical methods, but I’ve now settled on root finding of the quartic equation. A major reason for that is that I’ve found that the quartic solver in the Orellana and De Michele paper works well - it is fast, robust, and stable. The JavaScript code on this page uses a fairly direct implementation of that (which I expect may be useful for other applications - all the code on this blog is licensed under Apache 2, so feel free to adapt it within the terms of that license).

Another loose end was the treatment of “near misses.” Those happen when the function comes close to zero but doesn’t quite cross it. In terms of roots of a polynomial, those are a conjugate pair of complex roots, and I take the real part of that as a candidate. It would certainly be possible to express this logic by having the quartic solver output complex roots as well as real ones, but I found an effective shortcut: the algorithm actually factors the original quartic equation into two quadratics, one of which always has real roots and the other some of the time, and finding the real part of the roots of a quadratic is trivial (it’s just -b/2a).

Recently, Cem Yuksel has proposed a variation of Newton-style polynomial solving. It’s likely this could be used, but there were a few reasons I went with the more analytic approach. For one, I want multiple roots and this works best when only one is desired. Second, it’s hard to bound a priori the interval to search for roots. Third, it’s not easy to get the complex roots (if you did want to do this, the best route is probably deflation). Lastly, the accuracy numbers don’t seem as good (the Orellana and De Michele paper presents the results of very careful testing), and in empirical testing I have found that accuracy in root finding is a real problem that can affect the quality of the final results. A Rust implementation of the Orellana and De Michele technique clocks in at 390ns on an M1 Max, which certainly makes it competitive with the fastest techniques out there.

The last loose end was the treatment of near-zero slightly negative arm lengths. These are roots of the polynomial but are not acceptable candidate curves, as the tangent would end up pointing the wrong way. My original thought was to clamp the relevant length to zero (on the basis that it is an acceptable curve that is “nearby” the numerical solution), but that also doesn’t give ideal results. In particular, if you set one length to zero and set the other one based on exact signed area, the tangent at the zero-length side might be wrong (enough to be visually objectionable). After some experimentation, I’ve decided to set the other control point to be the intersection of the tangents, which gets tangents right but possibly results in an error in area, depending on the exact parameters. The general approach is to throw these as candidates into the mix, and let the error measurement sort it out.

Error measurement

A significant amount of total time spent in the algorithm is measuring the distance between the exact curve and the cubic approximation, both to decide when to subdivide and also to choose between multiple candidates from the Bézier fitting. I implemented the technique from Tiller and Hanson and found it to work well. They sample the exact curve at a sequence of points, then for each of those points project that point onto the approximation along the normal. That is equivalent to computing the intersection of a ray and a cubic Bézier. The maximum distance between the projected and true point is the error. This is a fairly good approximation to the Fréchet distance but significantly cheaper to compute.

Computing the intersection of a ray and a cubic Bézier is equivalent to finding the root of a cubic polynomial, a challenging numerical problem in its own right. In the course of working on this, I found that the cubic solver in kurbo would sometimes report inaccurate results (especially when the coefficient on the $x^3$ term was near-zero, which can easily happen when cubic Bézier segments are near raised quadratics), and so implemented a better cubic solver based on a blog post on cubics by momentsingraphics. That’s still not perfect, and there is more work to be done to arrive at a gold-plated cubic solver. The Yuksel polynomial solving approach might be a good fit for this, especially as you only care about results for t strictly within the (0..1) range. It might also be worth pointing out that the fma instruction used in the Rust implementation is not available in JavaScript, so the accuracy of the solver here won’t be quite as good.

The error metric is a critical component of a complete offset curve algorithm. It accounts for a good part of the total CPU time, and also must be accurate. If it underestimates true error, it risks letting inaccurate results slip through. If it overestimates error, it creates excessive subdivision. Incidentally, I suspect that the error measurement technique in the Elber, Lee and Kim paper (cited below) may be flawed; it seems like it may overestimate error in the case where the two curves being compared differ in parametrization, which will happen commonly with offset problems, particularly near cusps. The Tiller-Hanson technique is largely insensitive to parametrization (though perhaps more care should be taken to ensure that the sample points are actually evenly spaced).

Subdivision

Right now the subdivision approach is quite simple: if none of the candidate cubic Béziers meet the error bound, then the curve is subdivided at t = 0.5 and each half is fit. The scaling is n^6, so in general that reduces the error by a factor of 64.

If generation of an absolute minimum number of output segments is the goal, then a smarter approach to choosing subdivisions would be in order. For absolutely optimal results, in general what you want to do is figure out the minimum number of subdivisions, then adjust the subdivision points so the error of all segments are equal. This technique is described in section 9.6.3 of my thesis. In the limit, it can be expected to reduce the number of subdivisions by a factor of 1.5 compared with “subdivide in half,” but not a significant improvement when most curves can be rendered with one or two cubic segments.

Evaluation

Somebody evaluating this work for use in production would care about several factors: accuracy of result, robustness, and performance. The interactive demo on this page speaks for itself: the results are accurate, the performance is quite good for interactive use, and it is robust (though I make no claims it handles all adversarial inputs correctly; that always tends to require extra work).

In terms of accuracy of result, this work is a dramatic advance over anything in the literature. I’ve implemented and compared it against two other techniques that are widely cited as reasonable approaches to this problem: Tiller-Hanson and the “shape control” approach of Yang and Huang. For generating a single segment, it can be considerably more accurate than either.

In addition to the accuracy for generating a single line segment, it is interesting to compare the scaling as the number of subdivisions increases, or as the error tolerance decreases. These tend to follow a power law. For this technique, it is $O(n^6)$, meaning that subdividing a curve in half reduces the error by a factor of 64. For the shape control approach, it is $O(n^5)$, and for Tiller-Hanson is is $O(n^2)$. That last is a surprisingly poor result, suggesting that it is only a constant factor better than subdividing the curves into lines.

The shape control technique has good scaling, but stability issues when the tangents are nearly parallel. That can happen for an S-shaped curve, and also for a U with nearly 180 degrees of arc.

The Tiller-Hanson technique is geometrically intuitive; it offsets each edge of the control polygon by the offset amount, as illustrated in the diagram below. It doesn’t have the stability issues with nearly-parallel tangents and can produce better results for those “S” curves, but the scaling is much worse.

Regarding performance, I have preliminary numbers from the JavaScript implementation, about 12µs per curve segment generated on an M1 Max running Chrome. I am quite happy with this result, and of course expect the Rust implementation to be even faster when it’s done. There are also significant downstream performance improvements from generating highly accurate results; every cubic segment you generate has some cost to process and render, so the fewer of those, the better.

I haven’t implemented all the techniques in the Elber, Lee and Kim paper, but it is possible to draw some tentative conclusions from the literature. I expect the Klass technique (and its numerical refinement by Sakai and Suenaga) to have good scaling but relatively poor acccuracy for a single segment. The Klass technique is also documented to have poor numerical stability, thanks in part to its reliance on Newton solving techniques. The Hoschek and related (least-squares) approaches will likely produce good results but are quite slow (the Yang and Huang paper reports an eye-popping 49s for calculating a simple case with .001 tolerance, of course on older hardware).

The Euler spiral technique in my previous blog post will in general produce considerably more subdivision (with $O(n^4)$ scaling), but perhaps it would be premature to write it off completely. Once the curve is in piecewise Euler spiral form, a result within the given error bounds can be computed directly, with no need to explicitly evaluate an error metric. In addition, the cusps are located robustly with trivial calculation. That said, getting a curve into piecewise Euler spiral form is still challenging, and my prototype code uses a rather expensive error metric to achieve that.

Discussion

This post presents a significantly better solution to the parallel curve problem than the current state of the art. It is accurate, robust, and fast. It should be suitable to implement in interactive vector graphics applications, font compilation pipelines, and other contexts.

While parallel curve is an important application, the curve fitting technique is quite general. It can be adapted to generalized strokes, for example where the stroke width is variable, path simplification, distortions and other transforms, conversion from other curve representations, accurate plotting of functions, and I’m sure there are other applications. Basically the main thing that’s required is the ability to evaluate area and moment of the source curve, and ability to evaluate the distance to that curve (which can be done readily enough by sampling a series of points with their arc lengths).

This work also provides a bit of insight into the nature of cubic Bézier curves. The $O(n^6)$ scaling provides quantitative support to the idea that cubic Bézier curves are extremely expressive; with skillful placement of the control points, they can extremely accurately approximate a wide variety of curves. Parallel curves are challenging for a variety of reasons, including cusps and sudden curvature variations. That said, they do require skill, as geometrically intuitive but unoptimized approaches to setting control points (such as Tiller-Hanson) perform poorly.

There’s clearly more work that could be done to make the evalation more rigorous, including more optimization of the code. I believe this result would make a good paper, but my bandwidth for writing papers is limited right now. I would be more than open to collaboration, and invite interested people to get in touch.

Thanks to Linus Romer for helpful discussion and refinement of the polynomial equations regarding quartic solving of the core curve fitting algorithm.

Discuss on Hacker News.

References

Here is a bibliography of some relevant academic papers on the topic.

An offset spline approximation for plane cubic splines, Klass, 1983
Offsets of Two-Dimensional Profiles, Tiller and Hanson, 1984
High Accuracy Geometric Hermite Interpolation, de Boor, Höllig, Sabin, 1987 (PDF cache)
Optimal approximate conversion of spline curves and spline approximation of offset curves, Hoschek and Wissel, 1988 (PDF cache)
A New Shape Control and Classification for Cubic Bézier Curves, Yang and Huang, 1993 (PDF cache)
Comparing offset curve approximation methods, Elber, Lee, Kim, 1997 (PDF cache)
Cubic spline approximation of offset curves of planar cubic splines, Sakai and Suenaga, 2001 (PDF cache)
Boosting Efficiency in Solving Quartic Equations with No Compromise in Accuracy, Orellana and De Michele, 2020 (PDF cache)
High-Performance Polynomial Root Finding for Graphics, Yuksel, 2022 (PDF cache)

Advice for the next dozen Rust GUIs

2022-07-15T17:53:42+00:00

A few times a week, someone asks on the #gui-and-ui channel on the Rust Discord, “what is the best UI toolkit for my application?” Unfortunately there is still no clear answer to this question. Generally the top contenders are egui, Iced, and Druid, with Slint looking promising as well, but web-based approaches such as Tauri are also gaining some momentum, and of course there’s always the temptation to just build a new one. And every couple or months or so, a post appears with a new GUI toolkit.

This post is something of a sequel to Rust 2020: GUI and community. I hope to offer a clear-eyed survey of the current state of affairs, and suggestions for how to improve it. It also includes some lessons so far from Druid.

The motivations for building GUI in Rust remain strong. While Electron continues to gain momentum, especially for desktop use cases, there is considerable desire for a less resource-hungry alternative. That said, a consensus has not emerged what that alternative should look like. In addition, unfortunately there is fragmentation at the level of infrastructure as well.

Fragmentation is not entirely a bad thing. To some extent, specialization can be a good thing, resulting in solutions more adapted to the problem at hand, rather than a one-size-fits-all approach. More importantly, the diversity of UI approaches is a rich ground for experimentation and exploration, as there are many aspects to GUI in Rust where we still don’t know the best approach.

A large tradeoff space

One of the great truths to understand about GUI is that there are few obviously correct ways to do things, and many, many tradeoffs. At present, this tradeoff space is very sensitive, in that small differences in requirements and priorities may end up with considerably different implementation choices. I believe this is a main factor driving the fragmentation of Rust GUI. Many of the tradeoffs have to do with the extent to use platform capabilities for various things (of which text layout and compositing are perhaps the most intersting), as opposed to a cross-platform implementation of those capabilities, layered on top of platform abstractions.

A small rant about native

You’ll see the word “native” used quite a bit in discussions about UI, but I think that’s more confusing than illuminating, for a number of reasons.

In the days of Windows XP, “native” on Windows would have had quite a specific and precise meaning, it would be the use of user32 controls, which for the most part created a HWND per control and used GDI+ for drawing. Thanks to the strong compatibility guarantees of Windows, such code will still run, but it will look ancient and out of place compared to building on other technology stacks, also considered “native.” In the Windows 7 era, that would be windowless controls drawn with Direct2D (this was the technology used for Microsoft Office, using the internal DirectUI toolkit, which was not available to third party developers, though there were other windowless toolkits such as WPF). Starting with Windows 8 and the (relatively unsuccessful) Metro design language, many of the elements of the UI came together in the compositor rather than being directly drawn by the app. As of Windows 10, Microsoft started pushing Universal Windows Platform as the preferred “native” solution, but that also didn’t catch on. As of Windows 11, there is a new Fluent design language, natively supported only in WinUI 3 (which in turn is an evolution of UWP). “Native” on Windows can refer to all of these technology stacks.

On Windows in particular, there’s also quite a bit of confusion about UI functionality that’s directly provided by the platform (as is the case for user32 controls and much of UWP, including the Windows.UI namespace), vs lower level capabilities (Direct2D, DirectWrite, and DirectComposition, for example) provided to a library which mostly lives in userspace. The new WinUI (and Windows App SDK more generally) is something of a hybrid, with this distinction largely hidden from the developer. An advantage of this hybrid approach is that OS version is abstracted to a large extent; for example, even though UWP proper is Windows 10 and up only, it is possible to use the Uno platform to deploy WinUI apps on other systems, though it is a very good question to what extent that kind of deployment can be considered “native.”

On macOS the situation is not quite as chaotic and fragmented, but there is an analogous evolution from Cocoa (AppKit) to SwiftUI; going forward, it’s likely that new capabilities and evolutions of the design language will be provided for the latter but not the former. There was a similar deprecation of Carbon in favor of Cocoa, many years ago. (One of the main philosophical differences is that Windows maintains backwards compatibility over many generations, while Apple actually deprecates older technology)

The idea of abstracting and wrapping platform native GUI has been around a long time. Classic implementations include wxWidgets and Java AWT, while a more modern spin is React Native. It is very difficult to provide high quality experiences with this approach, largely because of the impedance mismatch between platforms, and few successful applications have been shipped this way.

The meaning of “native” varies from platform to platform. On macOS, it would be extremely jarring and strange for an app not to use the system menubar, for example (though it would be acceptable for a game). On Windows, there’s more diversity in the way apps draw menus, and of course Linux is fragmented by its nature.

Instead of trying to decide whether a GUI toolkit is native or not, I recommend asking a set of more precise questions:

What is the binary size for a simple application?
What is the startup time for a simple application?
Does text look like the platform native text?
To what extent does the app support preferences set at the system level?
What subset of expected text control functionality is provided?
- Complex text rendering including BiDi
- Support for keyboard layouts including “dead key” for alphabetic scripts
- Keyboard shortcuts according to platform human interface guidelines
- Input Method Editor
- Color emoji rendering and use of platform emoji picker
- Copy-paste (clipboard)
- Drag and drop
- Spelling and grammar correction
- Accessibility (including reading the text aloud)

If a toolkit does well on all these axes, then I don’t think it much matters it’s built with “native” technology; that’s basically internal details. That said, it’s also very hard to hit all these points if there is a huge stack of non-native abstractions in the way.

On winit

All GUIs (and all games) need a way to create windows, and wire up interactions with that window - primarily drawing pixels into the window and dealing with user inputs such as mouse and keyboard, but potentially a much larger range. The implementation is platform specific and involves many messy details. There is one very popular crate for this function – winit – but I don’t think a consensus, at least yet, so there are quite a few other alternatives, including the tao fork of winit used by Tauri, druid-shell, baseview (which is primarily used in audio applications because it supports the VST plug-in case), and handrolled approaches such as the one used by makepad.

I would describe the tension this way (perhaps not everyone will agree with me). The stated scope of winit is to create a window and leave the details of what happens inside that window to the application. For some interactions (especially GPU rendering, which is well developed in Rust space), that split works well, but for other interactions it is not nearly as clean. In practice, I think winit has evolved to become quite satisfactory for game use cases, but less so for GUI. Big chunks of functionality, such as access to native menus, are missing (the main motivation behind the tao fork, and the reason iced doesn’t support system menus), and keyboard support is persistently below what’s needed for high quality text input in a GUI.

I think resolving some of this fragmentation is possible and would help move the broader ecosystem forward. For the time being, the Druid family of projects will continue developing druid-shell, but is open to collaboration with winit. One way to frame this is that the extra capabilities of druid-shell serve as a set of requirements, as well as guidance and experience how to implement them well.

In the meantime, which windowing library to adopt is a tough choice, and I wouldn’t be surprised to see yet another one pop up if the application has specialized requirements. Consider the situation in C++ world: for games, both GLFW and SDL are good choices, but both primarily for games. Pretty much every serious UI toolkit has its own platform abstraction layer; while it would be possible to use something like SDL for more general purpose GUI, it wouldn’t be a great fit.

Advice: think through the alternatives when considering a windowing library. After adopting one, learn from the others to see how yours might be improved. Plan on dedicating some time and energy into improving this important bit of infrastructure.

Tradeoff: use of system compositor

One of the more difficult, and I think underappreciated tradeoffs is the extent to which the GUI toolkit relies on the system compositor. See The compositor is evil for more background on this, but here I’ll revisit the issue from the point of view of a GUI toolkit trying to navigate the existing space.

All modern GUI platforms have a compositor which composites the (possibly alpha-transparent) windows from the various applications running on the system. As of Windows 8, Wayland, and Mac OS 10.5, the platform exposes richer access to the compositor, so the application can provide a tree of composition surfaces, and update attributes such as position and opacity (generally providing an additional animation API so the compositor can animate transitions without the application having to provide new values every frame).

If the GUI can be decomposed into the schema supported by the compositor, there are significant advantages. For one, it is decidedly the most power-efficient method to accomplish effects such as scrolling. The system pays the cost of the compositor anyway (in most cases), so any effects that can be done by the compositor (including scrolling) are “for free.”

As an illustration of how a Rust UI app may depend on the compositor, see the Minesweeper using windows-rs demo. Essentially all presentation is done using the compositor rather than a drawing API (this is why the numbers are drawn using dots rather than fonts). This sample application depends on the Windows.UI namespace to be provided by the operating system, so will only run on Windows 10 (build 1803 or later).

All that said, there are significant disadvantages to the compositor as well. One is cross-platform support and compatibility. There is currently no good cross-platform abstraction for the compositor (planeshift was an attempt, but is abandoned). Further, older systems (Windows 7 and X11) cannot rely on the compositor, so there has to be a compatibility path, generally with degraded behavior.

There are other more subtle drawbacks. One is a lowest-common-denominator approach, emphasizing visual effects supported by the compositor, especially cross-platform. As just one example, translation and alpha fading is well-supported, but scaling of bitmap surfaces comes with visual degradation, compared with re-rendering the vector original. There’s also the issue of additional RAM usage for all the intermediate texture layers.

Perhaps the biggest motivation to use the compositor extensively is stitching together diverse visual sources, particularly video, 3D, and various UI embeddings including web and “native” controls. If you want a video playback window to scroll seamlessly and other UI elements to blend with it, there is essentially no other game in town. These embeddings were declared as out of scope for Druid, but people request them often.

Building a proper cross-platform infrastructure for the compositor is a huge and somewhat thankless task. The surface area of these interfaces is large, I’m sure there are lots of annoying differences between the major platforms, and no doubt there will need to be a significant amount of compatibility engineering to work well on older platforms. Browsers have invested in this work (in the case of Safari without the encumbrance of needing to be cross-platform), and this is actually one good reason to use Web technology stacks.

Advice: new UI toolkits should figure out their relationship to the system compositor. If the goal is to provide a truly smooth, native integration of content such as video playback, then they must invest in the underlying mechanisms, much as browsers have.

Tradeoff: platform text rendering

One of the most difficult aspects of building UI from scratch is getting text right. There are a lot of details to text rendering, but there’s also the question of matching the system appearance and being able to access the system font fallback chain. The latter is especially important for non-Latin scripts, but also emoji. Unfortunately, operating systems generally don’t have good mechanisms for enumerating or efficiently querying the system fonts. Either you use the built-in text layout capabilities (which means having to build a lowest common denominator abstraction on top of them), or you end up replicating all the work, and finding heuristics and hacks to access the system fonts without running into either correctness or efficiency problems.

There’s so much work involved in making a fully functional text input box that it is something of a litmus test for how far along a UI toolkit has gotten. Rendering is only part of it, but there’s also IME (including the emoji picker), copy-paste (potentially including rich text), access to system text services such as spelling correction, and one of the larger and richer subsurfaces for integrating accessibility.

Again, browsers have invested a massive amount of work into getting this right, and it’s no one simple trick. Druid, by comparison, does use the system text layout capabilities, but we’re seeing the drawbacks (it tends to be slow, and hammering out all the inconsistencies between platforms is annoying to say the least), so as we go forward we’ll probably do more of that ourselves.

Over the longer term, I’d love to have Rust ecosystem infrastructure crates for handling text well, but it’s an uphill battle. Just how to design the interface abstraction boundaries is a hard problem, and it’s likely that even if a good crate was published, there’d be resistance to adoption because it wouldn’t be trivial to integrate. There are thorny issues such as rich text representation, and how the text layout crate integrates with 2D drawing.

Advice: figure out a strategy to get text right. It’s not feasible to do that in the short term, but longer term it is a requirement for “real” UI. Potentially this is an area for the UI toolkits to join forces as well.

On architecture

One constant I’ve found is that the developer-facing architecture of a UI toolkit needs to evolve. We don’t have a One True architecture yet, and in particular designs made in other languages don’t adapt well to Rust.

Druid itself has had three major architectures: an intial attempt at applying ECS patterns to UI, the current architecture relying heavily on lenses, and the Xilem proposal for a future architecture. In between were two explorations that didn’t pan out. Crochet was an attempt to provide an immediate mode API to applications on top of a retained mode implementation, and lasagna was an attempt to decouple the reactive architecture from the underlying widget tree implementation.

There are a number of triggers that might motivate large scale architectural changes in GUI toolkits. Among them, support for multi-window, accessibility, virtualized scrolling (and efficient large containers in general), async.

The crochet experiment

Now is a good time to review an architectural experiement that ultimately we decided not to pursue. The crochet prototype was an attempt to emulate immediate mode GUI on top of a retained mode widget tree. The theory was that immediate mode is easier for programmers, while retained mode has implementation advantages including making it easier to do rich layouts. There were other goals, including facilitating language bindings (for langages such as Python) and also better async integration. Language bindings were a pain point in the existing Druid architecture.

Ultimately I think it would be possible to build UI with this architecture, but there were a number of pain points, so I don’t believe it would be a good experience overall. One of the inherent problems of an immediate mode API is what I call “state tearing.” Because updates to app state (from processing events returned by calls to functions representing widgets) are interleaved with rendering, the rendering of any given frame may contain a mix of old and new state. For some applications, when continuously updating at a high frame rate, an extra frame of latency may not be a serious issue, but I consider it a flaw. I had some ideas for how to address this, but it basically involves running the app logic twice.

There were other ergonomic paper cuts. Because Rust lacks named and optional parameters in function calls, it is painful to add optional modifiers to existing widgets. Architectures based on simple value types for views (as is the case in the in the greater React family, including Xilem) can just use variations on the fluent pattern, method calls on those views to either wrap them or set optional parameters.

Another annoyingly tricky problem was ensuring that begin-container and end-container calls were properly nested. We experimented with a bunch of different ways to do try to enforce this nesting at compile time, but none were fully satisfying.

A final problem with emulating immediate mode is that the architecture tends to thread a mutable context parameter through the application logic. This is not especially ergonomic (adding to “noise”) but perhaps more seriously effectively enforces the app logic running on a single thread.

Advice: of course try to figure out a good architecture, but also plan for it to evolve.

Accessibility

It’s fair to say that a UI toolkit cannot be considered production-ready unless it supports accessibility features such as support for screen readers for visually impaired users. Unfortunately, accessibility support has often taken the back seat, but fortunately the situation looks like it might improve. In particular, the AccessKit project hopes to provide common accessibility infrastructure to UI toolkits in the Rust ecosystem.

Doing accessibility well is of course tricky. It requires architectural support from the toolkit. Further, platform accessibility capabilities often make architectural assumptions about the way apps are built. In general, they expect a retained mode widget tree; this is a significant impedance mismatch with immediate mode GUI, and generally requires stable widget ID and a diffing approach to create the accessibility tree. For the accessibility part (as opposed to the GPU-drawn part) of the UI, it’s fair to say pure immediate mode cannot be used, only a hybrid approach which resembles emulation of an immediate mode API on top of a more traditional retained structure.

Also, to provide a high quality accessibility experience, the toolkit needs to export fine-grained control of accesibility features to the app developer. Hopefully, generic form-like UI can be handled automatically, but for things like custom widgets, the developer needs to build parts of the accessibility tree directly. There are also tricky interactions with features such as virtualized scrolling.

Accessibility is one of the great strengths of the Web technology stack. A lot of thought went into defining a cross-platform abstraction which could actually be implemented, and a lot of users depend on this every day. AccessKit borrows liberally from the Web approach, including the implementation in Chromium.

Advice: start thinking about accessibility early, and try to build prototypes to get a good understanding of what’s required for a high quality experience.

What of Druid?

I admit I had hopes that Druid would become the popular choice for Rust GUI, though I’ve never explicitly had that as a goal. In any case, that hasn’t happened, and now is a time for serious thought about the future of the project.

We now have clearer focus that Druid is primarily a research project, with a special focus on high performance, but also solving the problems of “real UI.” The research goals are longer term; it is not a project oriented to shipping a usable toolkit soon. Thus, we are making some changes along these lines. We hope to get piet-gpu to a “minimum viable” 0.1 release soon, at which point we will be switching drawing over to that, as opposed to the current strategy of wrapping platform drawing capabilities (which often means that drawing is on the CPU). We change the reactive architecture to Xilem.

Assuming we do a good job solving these problems, over time Druid might evolve into a toolkit usable for production applications. In the meantime, we don’t want to create unrealistic expectations. The primary audience for Druid is people learning how to build UI in Rust. This post isn’t the appropriate place for a full roadmap and vision document, but I expect to be writing more about that in time.

Conclusion

I don’t want to make too many predictions, but I am confident in asserting that there will be a dozen new UI projects in Rust in the next year or two. Most of them will be toys, though it is entirely possible that one or more of them will be in support of a product and will attract enough resources to build something approaching a “real” toolkit. I do expect fragmentation of infrastructure to continue, as there are legitimate reasons to choose different approaches, or emphasize different priorities. It’s possible we never get to a “one size fits all” solution for especially thorny problems such as window creation, input (including keyboard input and IME), text layout, and accessibility.

Meanwhile we will be pushing forward with Druid. It won’t be for everyone, but I am hopeful it will move the state of Rust UI forward. I’m also hopeful that the various projects will continue to learn from each other and build common ground on infrastructure where that makes sense.

And I remain very hopeful about the potential for GUI in Rust. It seems likely to me that it will be the language the next major GUI toolkit is written in, as no other language offers the combination of performance, safety, and high level expressiveness. All of the issues in this post are problems to be solved rather than obstacles why Rust isn’t a good choice for building UI.

Discuss on Hacker News and /r/rust.

Xilem: an architecture for UI in Rust

2022-05-07T15:17:42+00:00

Rust is an appealing language for building user interfaces for a variety of reasons, especially the promise of delivering both performance and safety. However, finding a good architecture is challenging. Architectures that work well in other languages generally don’t adapt well to Rust, mostly because they rely on shared mutable state and that is not idiomatic Rust, to put it mildly. It is sometimes asserted for this reason that Rust is a poor fit for UI. I have long believed that it is possible to find an architecture for UI well suited to implementation in Rust, but my previous attempts (including the current Druid architecture) have all been flawed. I have studied a range of other Rust UI projects and don’t feel that any of those have suitable architecture either.

This post presents a new architecture, which is a synthesis of existing work and a few new ideas. The goals include expression of modern reactive, declarative UI, in components which easily compose, and a high performance implementation. UI code written in this architecture will look very intuitive to those familiar with state of the art toolkits such as SwiftUI, Flutter, and React, while at the same time being idiomatic Rust.

The name “Xilem” is derived from xylem, a type of transport tissue in vascular plants, including trees. The word is spelled with an “i” in several languages including Romanian and Malay, and is a reference to xi-editor, a starting place for explorations into UI in Rust (now on hold).

Like most modern UI architectures, Xilem is based on a view tree which is a simple declarative description of the UI. For incremental update, successive versions of the view tree are diffed, and the results are applied to a widget tree which is more of a traditional retained-mode UI. Xilem also contains at heart an incremental computation engine with precise change propagation, specialized for UI use.

The most innovative aspect of Xilem is event dispatching based on an id path, at each stage providing mutable access to app state. A distinctive feature is Adapt nodes (an evolution of the lensing concept in Druid) which facilitate composition of components. By routing events through Adapt nodes, subcomponents have access to a different mutable state reference than the parent.

A quick tour of existing architectures

This architecture is designed to address limitations and problems with the existing state of the art, both the current Druid architecture and other attempts. It’s a little hard to understand some of the motivation without some knowledge of those architectures. That said, a full survey of reactive UI architectures would be quite a long work; this section can only touch the highlights.

The existing Druid architecture has some nice features, but we consistently see people struggle with common themes.

There is a big difference between creating static widget hierarchies and dynamically updating them.
The app data must have a Data bound, which implies cloning and equality testing. Interior mutability is effectively forbidden.
The “lens” mechanism is confusing and it is not easy to implement complex binding patterns.
We never figured out how to integrate async in a compelling way.
There is an environment mechanism but it is not efficient and doesn’t support fine-grained change propagation.

Another common architecture is immediate mode GUI, both in a relatively pure form and in a modified form. It is popular in Rust because it doesn’t require shared mutable state. It also benefits from overall system simplicity. However, the model is oversimplified in a number of ways, and it is difficult to do sophisticated layout and other patterns that are easier in retained widget systems. There are also numerous papercuts related to sometimes rendering stale state. (I experimented with a retained widget backend emulating an immediate mode API in the “crochet” architecture experiment and concluded that the result was not compelling). The popular egui crate is solidly an implementation of immediate mode, and makepad is also based on it, though it differs in some important ways.

A particularly common architecture for UI in Rust is The Elm Architecture, which also does not require shared mutable state. Rather, to support interactions from the UI, gestures and other related UI actions creates messages which are then sent to an update method which takes central app state. Iced, relm, and Vizia all use some form of this architecture. Generally it works well, but the need to create an explicit message type and dispatch on it is verbose, and the Elm architecture does not support cleanly factored components as well as some other architectures. The Elm documentation specifically warns against components, saying, “actively trying to make components is a recipe for disaster in Elm.”

Lastly, there are a number of serious attempts to port React patterns to Rust, of which I find Dioxus most promising. These rely on interior mutability and other patterns that I think adapt poorly to Rust, but definitely represent a credible alternative to the ideas presented here. I think we will have to build some things and see how well they work out.

Synchronized trees

The Xilem architecture is based around generating trees and keeping them synchronized. In that way it is a refinement of the ideas described in my previous blog post, Towards a unified theory of reactive UI.

In each “cycle,” the app produces a view tree, from which rendering is derived. This tree has fairly short lifetime; each time the UI is updated, a new tree is generated. From this, a widget tree is built (or rebuilt). The view tree is retained only long enough to assist in event dispatching and then be diffed against the next version, at which point it is dropped. The widget tree, by contrast, persists across cycles. In addition to these two trees, there is a third tree containing view state, which also persists across cycles. (The view state serves a very similar function as React hooks)

Of existing UI architectures, the view tree most strongly resembles that of SwiftUI - nodes in the view tree are plain value objects. They also contain callbacks, for example specifying the action to be taken on clicking a button. Like SwiftUI, but somewhat unusually for UI in more dynamic languages, the view tree is statically typed, but with a typed-erased escape hatch (Swift’s AnyView) for instances where strict static typing is too restrictive.

The Rust expression of these trees is instances of the View trait, which has two associated types, one for view state and one for the associated widget. The state and widgets are also statically typed. The design relies heavily on the type inference mechanisms of the Rust compiler. In addition to inferring the type of the view tree, it also uses associated types to deduce the type of the associated state tree and widget tree, which are known at compile time. In almost every other comparable system (SwiftUI being the notable exception), these are determined at runtime with a fair amount of allocation, downcasting, and dynamic dispatch.

A worked example

We’ll use the classic counter as a running example. It’s very simple but will give insight into how things work under the hood. For people who want to follow along with the code, check the idiopath directory of the idiopath branch (in the Druid repo); running cargo doc --open there will reveal a bunch of Rustdoc.

Here’s the application.

fn app(count: &mut u32) -> impl View<u32> {
    v_stack((
        format!("Count: {}", count),
        button("Increment", |count| *count += 1),
    ))
}

This was carefully designed to be clean and simple. A few notes about this code, then we’ll get in to what happens downstream to actually build and run the UI.

This function is run whenever there are significant changes (more on that later). It takes the current app state (in this case a single number, but in general app state can be anything), and returns a view tree. The exact type of the view tree is not specified, rather it uses the impl Trait feature to simply assert that it’s something that implements the View trait (parameterized on the type of the app state). The full type happens to be:

VStack<u32, (), (String, Button<u32, (), {anonymous function of type for <'a> FnMut(&'a mut u32) -> ()}>)>

For such a simple example, this is not too bad (other than the callback), but would get annoying quickly.

Another observation is that three nodes of this tree implement the View trait: VStack and its two children String and Button. Yes, that’s an ordinary Rust string, and it implements the View trait. So will colors and shapes (following SwiftUI; implementation is planned in the prototype but not complete).

The view tree returned by the app logic can be visualized as a block diagram:

The type of the associated widget is widget::VStack. Purely as a pragmatic implementation detail, we’ve chosen to type-erase the children of containers. Among other things, this avoids excessive monomorphization. If we wanted fully-typed widget trees, the architecture would support that, and indeed an earlier prototype chose that approach.

Identity and id paths

A specific detail when building the widget tree is assigning a stable identity to each view. These concepts are explained pretty well in the Demystify SwiftUI talk. As in SwiftUI, stable identity can be based on structure (views in the same place in the view tree get to keep their identity across runs), or an explicit key. To illustrate the latter, assume a list container, and that two of the elements in the list are swapped. That might play an animation of the visual representations of those two elements changing places.

The idea of assigning stable identities is quite standard in declarative UI (it’s also present in basically all non-toy immediate mode GUI implementations), but Xilem adds a distinctive twist, the use of id path rather than a single id. The id path of a widget is the sequence of all ids on the path from the root to that widget in the widget tree. Thus, the id path of the button in the above is [1, 3], while the label is [1, 2] and the stack is just [1].

To continue our running example, given the view tree, the build step produces an associated view state tree as well as a widget tree. Here, we’ve annotated the view tree with ids, and also shown how ids and id paths are stored in the newly built structures. Most importantly, the view state for the VStack contains the ids of the child nodes, and the Button widget contains its id path, which is [1, 3].

As it turns out, neither text labels nor buttons require any special view state other than what’s retained in the widget, so those view states are just (), the unit type. Of course, other view nodes may require more associated state, in which case the type would be something other than the unit type.

Advanced topic: does id identify view or widget?

Raph Levien’s blog

I want a good parallel computer

Memory efficiency of sophisticated GPU programs

Parallel computers of the past

Connection Machine

Cell

Larrabee

The changing workload

Paths forward

Big grid of cores: Cell reborn

Running Vulkan commands from GPU-side

Work graphs

CPU convergent evolution

Maybe the hardware is already there?

Complexity

Conclusion

A note on Metal shader converter

Complexity and reasoning

Onward

Simplifying Bézier paths

The ParamCurveFit trait

Error metrics

Finding subdivision points

Bumps

Low pass filtering

Discussion

Moving from Rust to C++

Choice of build systems

Safety

Language evolution

Community

Conclusion

Requiem for piet-gpu-hal

Goals

Implementation choices

Shader language

Ahead of time compilation

GPU abstraction layer

WebGPU pain points

Collaboration with community

Precompiled shaders on roadmap?

Other GPU compute clients

Beyond WebGPU 1.0

Conclusion

Raph’s reflections and wishes for 2023

Reflections on 2022

On happiness

Vello

Xilem

Curves and other research

Non-goals

Conclusion

Minikin retrospective

Parallel curves of cubic Béziers

Outline of approach

Cusp finding

Computing area and moment of the target curve

Refinement of curve fitting approach

Error measurement

Subdivision

Evaluation

Discussion

References

Advice for the next dozen Rust GUIs

A large tradeoff space

A small rant about native

On winit

Tradeoff: use of system compositor

Tradeoff: platform text rendering

On architecture

The crochet experiment

Accessibility

What of Druid?

Conclusion

Xilem: an architecture for UI in Rust

A quick tour of existing architectures

Synchronized trees

A worked example

Identity and id paths