SpiderMonkey JavaScript/WebAssembly Engine
SpiderMonkey is Mozilla's JavaScript and WebAssembly Engine, used in Firefox, Servo and various other projects. It is written in C++ and Rust.

Flipping Responsibility for Jobs in SpiderMonkey
2026-01-15 · https://spidermonkey.dev/blog/2026/01/15/job-responsibility

This blog post is written both as a heads-up to embedders of SpiderMonkey and as an explanation of why the changes are coming.

As an embedder of SpiderMonkey one of the decisions you have to make is whether or not to provide your own implementation of the job queue.

The responsibility of the job queue is to hold pending jobs for Promises, which the HTML spec calls ‘microtasks’. As of 2025, embedders had two options:

  1. Call JS::UseInternalJobQueues, and then at the appropriate point for your embedding, call JS::RunJobs. This uses an internal job queue and drain function.
  2. Subclass and implement the JS::JobQueue type, storing and invoking your own jobs. An embedding might want to do this if they wanted to add their own jobs, or had particular needs for the shape of jobs and data carried alongside them.

The goal of this blog post is to indicate that SpiderMonkey’s handling of Promise jobs is changing over the next little while, and explain a bit of why.

If you’ve chosen to use the internal job queue, almost nothing should change for your embedding. If you’ve provided your own job queue, read on:

What’s Changing

  1. The actual type of a job from the JS engine is changing to be opaque.
  2. The responsibility for actually storing the Promise jobs is moving from the embedding to the engine, even in the case of an embedding-provided JobQueue.
  3. As a result of (1), the interface to run a job from the queue is also changing.

I’ll cover this in a bit more detail, but a good chunk of the interface discussed is in MicroTask.h (this link is to a specific revision because I expect the header to move).

For most embeddings the changes turn out to be very mechanical. If you have specific challenges with your embedding please reach out.

Job Type

The type of a JS Promise job has been a JSFunction, and thus invoked with JS::Call. The job type is changing to an opaque type. The external interface to this type will be JS::Value (typedef’d as JS::GenericMicroTask).

This means that if you’re an embedder who had been storing your own tasks in the same queue as JS tasks, you’ll still be able to, but you’ll need to use the queue access APIs in MicroTask.h. A queue entry is simply a JS::Value, so an arbitrary pointer can be stored in it as a JS::PrivateValue.

Jobs are now split into two types: JSMicroTasks (enqueued by the JS engine) and GenericMicroTasks (possibly provided by the JS engine, possibly by the embedding).

Storage Responsibility

Previously, if an embedding provided its own JobQueue, we expected it to store the jobs and trace the queue. Now that the queue lives inside the engine, the model is changing: an embedding that wants to share the job queue must ask the JS engine to store any jobs it produces outside of Promises.

Running Micro Tasks

The basic loop of microtask execution now looks like this:


JS::Rooted<JSObject*> executionGlobal(cx);
JS::Rooted<JS::GenericMicroTask> genericTask(cx);
JS::Rooted<JS::JSMicroTask> jsTask(cx);

while (JS::HasAnyMicroTasks(cx)) {
  genericTask = JS::DequeueNextMicroTask(cx);

  if (JS::IsJSMicroTask(genericTask)) {
    jsTask = JS::ToMaybeWrappedJSMicroTask(genericTask);
    executionGlobal = JS::GetExecutionGlobalFromJSMicroTask(jsTask);

    {
      AutoRealm ar(cx, executionGlobal);
      if (!JS::RunJSMicroTask(cx, jsTask)) {
        // Handle job execution failure in the same way
        // a JS::Call failure would have been handled.
      }
    }

    continue;
  }

  // Handle embedding jobs as appropriate.
}

The abstract separation of the execution global is required to handle cases with many compartments and complicated realm semantics (aka a web browser).

An example

In order to see roughly what the changes would look like, I attempted to patch GJS, the GNOME JS embedding which uses SpiderMonkey.

The patch is here. It doesn’t build due to other incompatibilities I found, but this is the rough shape of a patch for an embedding. As you can see, it’s fairly self contained with not too much work to be done.

Why Change?

In a word: performance. The previous form of Promise job management was heavyweight, with per-job overhead that made Promise-heavy code slower than it needed to be.

The changes made here allow us to make SpiderMonkey quite a bit faster for dealing with Promises, and unlock the potential to get even faster.

How do the changes help?

Perhaps the most important change here is making the job representation opaque. This allows us to use pre-existing objects as stand-ins for the jobs. Rather than having to allocate a new object for every job (which is costly), we can often allocate nothing at all, simply enqueuing an existing object that carries enough information to run the job.
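Conceptually, the shared queue can be modeled like this (a JavaScript sketch of the idea, not the engine's actual C++ internals; all names here are invented for illustration):

```javascript
// A single queue of opaque entries. Engine jobs push a pre-existing object
// (e.g. a promise reaction record) directly, so enqueuing allocates nothing
// new; embedding jobs can be any other value, such as a plain callback.
class MicroTaskQueue {
  constructor() {
    this.entries = [];
  }
  enqueueJSJob(reactionRecord) {
    this.entries.push(reactionRecord); // the existing record IS the entry
  }
  enqueueGenericJob(callback) {
    this.entries.push(callback);
  }
  drain(isJSJob, runJSJob) {
    while (this.entries.length > 0) {
      const task = this.entries.shift();
      if (isJSJob(task)) {
        runJSJob(task); // the engine-job path
      } else {
        task(); // embedding-defined handling
      }
    }
  }
}
```

In the real API the two cases are distinguished with JS::IsJSMicroTask and run with JS::RunJSMicroTask, as in the drain loop shown earlier.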

Owning the queue will also allow us to choose the most efficient data structure for JS execution, potentially changing opaquely in the future as we find better choices.

Empirically, changing from the old microtask queue system to the new in Firefox led to an improvement of up to 45% on Promise heavy microbenchmarks.

Is this it?

I do not think this is the end of the story for changes in this area; I plan further investment. Aspirationally, I would like this all to be stabilized by the next ESR release, Firefox 153, which will ship to beta in June, but only time will tell what we can get done.

Future changes I can predict are things like

  1. Renaming JS::JobQueue which is now more of a ‘jobs interface’
  2. Renaming the MicroTask header to be less HTML specific

However, I can also imagine making more changes in the pursuit of performance.

What’s the bug for this work?

You can find most of the work related to this under Bug 1983153 (sm-µ-task).

An Apology

My apologies to those embedders who will have to do some work during this transition period. Thank you for sticking with SpiderMonkey!

Matthew Gaudet
Who needs Graphviz when you can build it yourself?
2025-10-28 · https://spidermonkey.dev/blog/2025/10/28/iongraph-web

We recently overhauled our internal tools for visualizing the compilation of JavaScript and WebAssembly. When SpiderMonkey’s optimizing compiler, Ion, is active, we can now produce interactive graphs showing exactly how functions are processed and optimized.

You can play with these graphs right here on this page. Simply write some JavaScript code in the test function and see what graph is produced. You can click and drag to navigate, ctrl-scroll to zoom, and drag the slider at the bottom to scrub through the optimization process.

As you experiment, take note of how stable the graph layout is, even as the sizes of blocks change or new structures are added. Try clicking a block's title to select it, then drag the slider and watch the graph change while the block remains in place. Or, click an instruction's number to highlight it so you can keep an eye on it across passes.

 

Example iongraph output

We are not the first to visualize our compiler’s internal graphs, of course, nor the first to make them interactive. But I was not satisfied with the output of common tools like Graphviz or Mermaid, so I decided to create a layout algorithm specifically tailored to our needs. The resulting algorithm is simple, fast, produces surprisingly high-quality output, and can be implemented in less than a thousand lines of code. The purpose of this article is to walk you through this algorithm and the design concepts behind it.

Read this post on desktop to see an interactive demo of iongraph.

Background

As readers of this blog already know, SpiderMonkey has several tiers of execution for JavaScript and WebAssembly code. The highest tier is known as Ion, an optimizing SSA compiler that takes the most time to compile but produces the highest-quality output.

Working with Ion frequently requires us to visualize and debug the SSA graph. Since 2011 we have used a tool for this purpose called iongraph, built by Sean Stangl. It is a simple Python script that takes a JSON dump of our compiler graphs and uses Graphviz to produce a PDF. It is perfectly adequate, and very much the status quo for compiler authors, but unfortunately the Graphviz output has many problems that make our work tedious and frustrating.

The first problem is that the Graphviz output rarely bears any resemblance to the source code that produced it. Graphviz will place nodes wherever it feels will minimize error, resulting in a graph that snakes left and right seemingly at random. There is no visual intuition for how deeply nested a block of code is, nor is it easy to determine which blocks are inside or outside of loops. Consider the following function, and its Graphviz graph:

function foo(n) {
  let result = 0;
  for (let i = 0; i < n; i++) {
    if (!!(i % 2)) {
      result = 0x600DBEEF;
    } else {
      result = 0xBADBEEF;
    }
  }

  return result;
}

Counterintuitively, the return appears before the two assignments in the body of the loop. Since this graph mirrors JavaScript control flow, we’d expect to see the return at the bottom. This problem only gets worse as graphs grow larger and more complex.

The second, related problem is that Graphviz’s output is unstable. Small changes to the input can result in large changes to the output. As you page through the graphs of each pass within Ion, nodes will jump left and right, true and false branches will swap, loops will run up the right side instead of the left, and so on. This makes it very hard to understand the actual effect of any given pass. Consider the following before and after, and notice how the second graph is almost—but not quite—a mirror image of the first, despite very minimal changes to the graph’s structure:

None of this felt right to me. Control flow graphs should be able to follow the structure of the program that produced them. After all, a control flow graph has many restrictions that a general-purpose tool would not be aware of: they have very few cycles, all of which are well-defined because they come from loops; furthermore, both JavaScript and WebAssembly have reducible control flow, meaning all loops have only one entry, and it is not possible to jump directly into the middle of a loop. This information could be used to our advantage.

Beyond that, a static PDF is far from ideal when exploring complicated graphs. Finding the inputs or uses of a given instruction is a tedious and frustrating exercise, as is following arrows from block to block. Even just zooming in and out is difficult. I eventually concluded that we ought to just build an interactive tool to overcome these limitations.

How hard could layout be?

I had one false start with graph layout, with an algorithm that attempted to sort blocks into vertical “tracks”. This broke down quickly on a variety of programs and I was forced to go back to the drawing board—in fact, back to the source of the very tool I was trying to replace.

The algorithm used by dot, the typical hierarchical layout mode for Graphviz, is known as the Sugiyama layout algorithm, from a 1981 paper by Sugiyama et al. As an introduction, I found a short series of lectures that broke down the Sugiyama algorithm into 5 steps:

  1. Cycle breaking, where the direction of some edges is flipped in order to produce a DAG.
  2. Leveling, where vertices are assigned into horizontal layers according to their depth in the graph, and dummy vertices are added to any edge that crosses multiple layers.
  3. Crossing minimization, where vertices on a layer are reordered in order to minimize the number of edge crossings.
  4. Vertex positioning, where vertices are horizontally positioned in order to make the edges as straight as possible.
  5. Drawing, where the final graph is rendered to the screen.

A screenshot from the lectures, showing the five steps above

These steps struck me as surprisingly straightforward, and provided useful opportunities to insert our own knowledge of the problem:

  • Cycle breaking would be trivial for us, since the only cycles in our data are loops, and loop backedges are explicitly labeled. We could simply ignore backedges when laying out the graph.
  • Leveling would be straightforward, and could easily be modified to better mimic the source code. Specifically, any blocks coming after a loop in the source code could be artificially pushed down in the layout, solving the confusing early-exit problem.
  • Permuting vertices to reduce edge crossings was actually just a bad idea, since our goal was stability from graph to graph. The true and false branches of a condition should always appear in the same order, for example, and a few edge crossings is a small price to pay for this stability.
  • Since reducible control flow ensures that a program’s loops form a tree, vertex positioning could ensure that loops are always well-nested in the final graph.

Taken all together, these simplifications resulted in a remarkably straightforward algorithm, with the initial implementation being just 1000 lines of JavaScript. (See this demo for what it looked like at the time.) It also proved to be very efficient, since it avoided the most computationally complex parts of the Sugiyama algorithm.

iongraph from start to finish

We will now go through the entire iongraph layout algorithm. Each section contains explanatory diagrams, in which rectangles are basic blocks and circles are dummy nodes. Loop header blocks (the single entry point to each loop) are additionally colored green.

Be aware that the block positions in these diagrams are not representative of the actual computed layout position at each point in the process. For example, vertical positions are not calculated until the very end, but it would be hard to communicate what the algorithm was doing if all blocks were drawn on a single line!

Step 1: Layering

We first sort the basic blocks into horizontal tracks called “layers”. This is very simple; we just start at layer 0 and recursively walk the graph, incrementing the layer number as we go. As we go, we track the “height” of each loop, not in pixels, but in layers.

We also take this opportunity to vertically position nodes “inside” and “outside” of loops. Whenever we see an edge that exits a loop, we defer the layering of the destination block until we are done layering the loop contents, at which point we know the loop’s height.

A note on implementation: nodes are visited multiple times throughout the process, not just once. This can produce a quadratic explosion for large graphs, but I find that an early-out is sufficient to avoid this problem in practice.

The animation below shows the layering algorithm in action. Notice how the final block in the graph is visited twice, once after each loop that branches to it, and in each case, the block is deferred until the entire loop has been layered, rather than processed immediately after its predecessor block. The final position of the block is below the entirety of both loops, rather than directly below one of its predecessors as Graphviz would do. (Remember, horizontal and vertical positions have not yet been computed; the positions of the blocks in this diagram are hardcoded for demonstration purposes.)

Implementation pseudocode
function layerBlock(block, layer = 0) {
  // Omitted for clarity: special handling of our "backedge blocks"

  // Early out if the block would not be updated
  if (layer <= block.layer) {
    return;
  }

  // Update the layer of the current block
  block.layer = Math.max(block.layer, layer);

  // Update the heights of all loops containing the current block
  let header = block.loopHeader;
  while (header) {
    header.loopHeight = Math.max(header.loopHeight, block.layer - header.layer + 1);
    header = header.parentLoopHeader;
  }

  // Recursively layer successors
  for (const succ of block.successors) {
    if (succ.loopDepth < block.loopDepth) {
      // Outgoing edges from the current loop will be layered later
      block.loopHeader.outgoingEdges.push(succ);
    } else {
      layerBlock(succ, layer + 1);
    }
  }

  // Layer any outgoing edges only after the contents of the loop have
  // been processed
  if (block.isLoopHeader()) {
    for (const succ of block.outgoingEdges) {
      layerBlock(succ, layer + block.loopHeight);
    }
  }
}

Step 2: Create dummy nodes

Any time an edge crosses a layer, we create a dummy node. This allows edges to be routed across layers without overlapping any blocks. Unlike in traditional Sugiyama, we always put downward dummies on the left and upward dummies on the right, producing a consistent “counter-clockwise” flow. This also makes it easy to read long vertical edges, whose direction would otherwise be ambiguous. (Recall how the loop backedge flipped from the right to the left in the “unstable layout” Graphviz example from before.)

In addition, we coalesce any edges that are going to the same destination by merging their dummy nodes. This heavily reduces visual noise.
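A sketch of the idea in JavaScript (the function and field names are invented for illustration, not taken from iongraph): an edge spanning several layers gets one dummy node per intermediate layer, and dummies are keyed by destination so that parallel edges to the same block share them.

```javascript
// Route one edge through dummy nodes, reusing an existing dummy on each
// intermediate layer when another edge already heads to the same destination.
function routeEdge(layers, edge) {
  let prev = edge.src;
  for (let i = edge.src.layer + 1; i < edge.dst.layer; i++) {
    // layers[i].dummies maps destination block id -> dummy node on layer i
    let dummy = layers[i].dummies.get(edge.dst.id);
    if (!dummy) {
      dummy = { layer: i, dstId: edge.dst.id, isDummy: true };
      layers[i].dummies.set(edge.dst.id, dummy);
    }
    prev = dummy; // the edge is re-linked through the dummy chain
  }
  return prev; // the last node before the destination block
}
```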

Step 3: Straighten edges

This is the fuzziest and most ad-hoc part of the process. Basically, we run lots of small passes that walk up and down the graph, aligning layout nodes with each other. Our edge-straightening passes include:

  • Pushing nodes to the right of their loop header to “indent” them.
  • Walking a layer left to right, moving children to the right to line up with their parents. If any nodes overlap as a result, they are pushed further to the right.
  • Walking a layer right to left, moving parents to the right to line up with their children. This version is more conservative and will not move a node if it would overlap with another. This cleans up most issues from the first pass.
  • Straightening runs of dummy nodes so we have clean vertical lines.
  • “Sucking in” dummy runs on the left side of the graph if there is room for them to move to the right.
  • Straightening out any edges that are “nearly straight”, according to a chosen threshold. This makes the graph appear less wobbly. We do this by repeatedly “combing” the graph upward and downward, aligning parents with children, then children with parents, and so on.

It is important to note that dummy nodes participate fully in this system. If for example you have two side-by-side loops, straightening the left loop’s backedge will push the right loop to the side, avoiding overlaps and preserving the graph’s visual structure.

We do not reach a fixed point with this strategy, nor do we attempt to. I find that if you continue to repeatedly apply these particular layout passes, nodes will wander to the right forever. Instead, the layout passes are hand-tuned to produce decent-looking results for most of the graphs we look at on a regular basis. That said, this could certainly be improved, especially for larger graphs which do benefit from more iterations.

At the end of this step, all nodes have a fixed X-coordinate and will not be modified further.
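To make the flavor of these passes concrete, here is a minimal JavaScript version of the left-to-right pass described above (details invented; the real passes are more involved): each node is pulled right to line up with its first parent, and overlapping nodes are pushed further right.

```javascript
// One straightening pass: walk a layer left to right, aligning each node
// under its first parent, then push nodes right so that none overlap.
function alignWithParents(layer, spacing = 20) {
  let minX = -Infinity;
  for (const node of layer) {
    const parent = node.parents[0];
    if (parent) {
      node.x = Math.max(node.x, parent.x); // only ever move right
    }
    node.x = Math.max(node.x, minX); // resolve overlaps with the left neighbor
    minX = node.x + node.width + spacing;
  }
}
```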

Step 4: Track horizontal edges

Edges may overlap visually as they run horizontally between layers. To resolve this, we sort edges into parallel “tracks”, giving each a vertical offset. After tracking all the edges, we record the total height of the tracks and store it on the preceding layer as its “track height”. This allows us to leave room for the edges in the final layout step.

We first sort edges by their starting position, left to right. This produces a consistent arrangement of edges that has few vertical crossings in practice. Edges are then placed into tracks from the “outside in”, stacking rightward edges on top and leftward edges on the bottom, creating a new track if the edge would overlap with or cross any other edge.

The diagram below is interactive. Click and drag the blocks to see how the horizontal edges get assigned to tracks.

Implementation pseudocode
function trackHorizontalEdges(layer) {
  const TRACK_SPACING = 20;

  // Gather all edges on the layer, and sort left to right by starting coordinate
  const layerEdges = [];
  for (const node of layer.nodes) {
    for (const edge of node.edges) {
      layerEdges.push(edge);
    }
  }
  layerEdges.sort((a, b) => a.startX - b.startX);

  // Assign edges to "tracks" based on whether they overlap horizontally with
  // each other. We walk the tracks from the outside in and stop if we ever
  // overlap with any other edge.
  const rightwardTracks = []; // [][]Edge
  const leftwardTracks = [];  // [][]Edge
  nextEdge:
  for (const edge of layerEdges) {
    const trackSet = edge.endX - edge.startX >= 0 ? rightwardTracks : leftwardTracks;
    let lastValidTrack = null; // []Edge | null

    // Iterate through the tracks in reverse order (outside in)
    for (let i = trackSet.length - 1; i >= 0; i--) {
      const track = trackSet[i];
      let overlapsWithAnyInThisTrack = false;
      for (const otherEdge of track) {
        if (edge.dst === otherEdge.dst) {
          // Assign the edge to this track to merge arrows
          track.push(edge);
          continue nextEdge;
        }

        const al = Math.min(edge.startX, edge.endX);
        const ar = Math.max(edge.startX, edge.endX);
        const bl = Math.min(otherEdge.startX, otherEdge.endX);
        const br = Math.max(otherEdge.startX, otherEdge.endX);
        const overlaps = ar >= bl && al <= br;
        if (overlaps) {
          overlapsWithAnyInThisTrack = true;
          break;
        }
      }

      if (overlapsWithAnyInThisTrack) {
        break;
      } else {
        lastValidTrack = track;
      }
    }

    if (lastValidTrack) {
      lastValidTrack.push(edge);
    } else {
      trackSet.push([edge]);
    }
  }

  // Use track info to apply offsets to each edge for rendering.
  const tracksHeight = TRACK_SPACING * Math.max(
    0,
    rightwardTracks.length + leftwardTracks.length - 1,
  );
  let trackOffset = -tracksHeight / 2;
  for (const track of [...rightwardTracks.toReversed(), ...leftwardTracks]) {
    for (const edge of track) {
      edge.offset = trackOffset;
    }
    trackOffset += TRACK_SPACING;
  }
}

Step 5: Verticalize

Finally, we assign each node a Y-coordinate. Starting at a Y-coordinate of zero, we iterate through the layers, repeatedly adding the layer’s height and its track height, where the layer height is the maximum height of any node in the layer. All nodes within a layer receive the same Y-coordinate; this is simple and easier to read than Graphviz’s default of vertically centering nodes within a layer.

Now that every node has both an X and Y coordinate, the layout process is complete.

Implementation pseudocode
function verticalize(layers) {
  let layerY = 0;
  for (const layer of layers) {
    let layerHeight = 0;
    for (const node of layer.nodes) {
      node.y = layerY;
      layerHeight = Math.max(layerHeight, node.height);
    }
    layerY += layerHeight;
    layerY += layer.trackHeight;
  }
}

Step 6: Render

The details of rendering are out of scope for this article, and depend on the specific application. However, I wish to highlight a stylistic decision that I feel makes our graphs more readable.

When rendering edges, we use a style inspired by railroad diagrams. These have many advantages over the Bézier curves employed by Graphviz. First, straight lines feel more organized and are easier to follow when scrolling up and down. Second, they are easy to route (vertical when crossing layers, horizontal between layers). Third, they are easy to coalesce when they share a destination, and the junctions provide a clear indication of the edge’s direction. Fourth, they always cross at right angles, improving clarity and reducing the need to avoid edge crossings in the first place.
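Generating such an edge is straightforward. A hypothetical helper (not iongraph's actual rendering code) might emit an SVG path of three axis-aligned segments, using the track offset computed in step 4:

```javascript
// Build a railroad-style SVG path: drop vertically from the source, run
// horizontally along the edge's assigned track, then drop to the destination.
function railroadPath(startX, startY, endX, endY, trackOffset) {
  const trackY = startY + trackOffset; // vertical position within the gap
  return (
    `M ${startX} ${startY} ` +
    `V ${trackY} ` + // down to the track
    `H ${endX} ` + // across the track
    `V ${endY}` // down into the destination block
  );
}
```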

Consider the following example. There are several edge crossings that may traditionally be considered undesirable—yet the edges and their directions remain clear. Of particular note is the vertical junction highlighted in red on the left: not only is it immediately clear that these edges share a destination, but the junction itself signals that the edges are flowing downward. I find this much more pleasant than the “rat’s nest” that Graphviz tends to produce.

Examples of railroad-diagram edges

Why does this work?

It may seem surprising that such a simple (and stupid) layout algorithm could produce such readable graphs, when more sophisticated layout algorithms struggle. However, I feel that the algorithm succeeds because of its simplicity.

Most graph layout algorithms are optimization problems, where error is minimized on some chosen metrics. However, these metrics seem to correlate poorly to readability in practice. For example, it seems good in theory to rearrange nodes to minimize edge crossings. But a predictable order of nodes seems to produce more sensible results overall, and simple rules for edge routing are sufficient to keep things tidy. (As a bonus, this also gives us layout stability from pass to pass.) Similarly, layout rules like “align parents with their children” produce more readable results than “minimize the lengths of edges”.

Furthermore, by rejecting the optimization problem, a human author gains more control over the layout. We are able to position nodes “inside” of loops, and push post-loop content down in the graph, because we reject this global constraint-solver approach. Minimizing “error” is meaningless compared to a human maximizing meaning through thoughtful design.

And finally, the resulting algorithm is simply more efficient. All the layout passes in iongraph are easy to program and scale gracefully to large graphs because they run in roughly linear time. It is better, in my view, to run a fixed number of layout iterations according to your graph complexity and time budget, rather than to run a complex constraint solver until it is “done”.

By following this philosophy, even the worst graphs become tractable. Below is a screenshot of a zlib function, compiled to WebAssembly, and rendered using the old tool.

spaghetti nightmare!!

It took about ten minutes for Graphviz to produce this spaghetti nightmare. By comparison, iongraph can now lay out this function in 20 milliseconds. The result is still not particularly beautiful, but it renders thousands of times faster and is much easier to navigate.

better spaghetti

Perhaps programmers ought to put less trust into magic optimizing systems, especially when a human-friendly result is the goal. Simple (and stupid) algorithms can be very effective when applied with discretion and taste.

Future work

We have already integrated iongraph into the Firefox profiler, making it easy for us to view the graphs of the most expensive or impactful functions we find in our performance work. Unfortunately, this is only available in specific builds of the SpiderMonkey shell, and is not available in full browser builds. This is due to architectural differences in how profiling data is captured and the flags with which the browser and shell are built. I would love for Firefox users to someday be able to view these graphs themselves, but at the moment we have no plans to expose this to the browser. However, one bug tracking some related work can be found here.

We will continue to sporadically update iongraph with more features to aid us in our work. We have several ideas for new features, including richer navigation, search, and visualization of register allocation info. However, we have no explicit roadmap for when these features may be released.

To experiment with iongraph locally, you can run a debug build of the SpiderMonkey shell with IONFLAGS=logs; this will dump information to /tmp/ion.json. This file can then be loaded into the standalone deployment of iongraph. Please be aware that the user experience is rough and unpolished in its current state.

The source code for iongraph can be found on GitHub. If this subject interests you, we would welcome contributions to iongraph and its integration into the browser. The best place to reach us is our Matrix chat.


Thanks to Matthew Gaudet, Asaf Gartner, and Colin Davidson for their feedback on this article.

Ben Visness
5 Things You Might Not Know about Developing Self-Hosted Code
2025-04-23 · https://spidermonkey.dev/blog/2025/04/23/self-hosted-development

Self-hosted code is JavaScript code that SpiderMonkey uses to implement some of its intrinsic functions for JavaScript. Because it is written in JavaScript, it gets all the benefits of our JITs, like inlining and inline caches.

Even if you are just getting started with self-hosted code, you probably already know that it isn’t quite the same as your typical, day-to-day JavaScript. You’ve probably already been pointed at the SMDOC, but here are a few tips to make developing self-hosted code a little easier.

1. When you change self-hosted code, you need to build

When you make changes to SpiderMonkey’s self-hosted JavaScript code, you will not automatically see your changes take effect in Firefox or the JS Shell.

SpiderMonkey’s self-hosted code is split up into multiple files and functions to make it easier for developers to understand, but at runtime, SpiderMonkey loads it all from a single, compressed data stream. This means that all those files are gathered together into a single script file and compressed at build time.

To see your changes take effect, you must remember to build!

2. dbg()

Self-hosted JavaScript code is hidden from the JS Debugger, and debugging JS with a C++ debugger can be challenging. You might want to log messages with console.log() to help you debug your code, but console.log() is not available in self-hosted code!

In debug builds, you can print out messages and objects using dbg(), which takes a single argument to print to stderr.

3. Specification step comments

If you are stuck trying to figure out how to implement a step in the JS specification or a proposal, you can see if SpiderMonkey has implemented a similar step elsewhere and base your implementation off that. We try to diligently comment our implementations with references to the specification, so there’s a good chance you can find what you are looking for.

For example, if you need to use the specification function CreateDataPropertyOrThrow(), you can search for it (SearchFox is a great tool for this) and discover that it is implemented in self-hosted code using DefineDataProperty().

4. getSelfHostedValue()

If you want to explore how a self-hosted function works directly, you can use the JS Shell helper function getSelfHostedValue().

We use this method to write many of our tests. For example, unicode-extension-sequences.js checks the implementation of the self-hosted functions startOfUnicodeExtensions() and endOfUnicodeExtensions().

You can also use getSelfHostedValue() to get C++ intrinsic functions, like how toLength.js tests ToLength().

5. You can define your own self-hosted functions

You can write your own self-hosted functions and make them available in the JS Shell and XPC shell. For example, you could write a self-hosted function to print a formatted error message:

  function report(msg) {
      dbg("|ERROR| " + msg + "|");
  }

Then, while you are setting up globals for your JS runtime, call JS_DefineFunctions(cx, obj, funcs):

  static const JSFunctionSpec funcs[] = {
      JS_SELF_HOSTED_FN("report", "report", 1, 0),
      JS_FS_END,
  };

  if (!JS_DefineFunctions(cx, globalObject, funcs)) {
    return false;
  }

The JS_SELF_HOSTED_FN() macro takes the following parameters:

  1. name - The name you want your function to have in JS.
  2. selfHostedName - The name of the self-hosted function.
  3. nargs - Number of formal JS arguments to the self-hosted function.
  4. flags - This is almost always 0, but could be any combination of JSPROP_*.

Now, when you build the JS Shell or XPC Shell, you can call your function:

js> report("BOOM!");          
Iterator.js#6: |ERROR| BOOM!|
]]>
Bryan Thrall
Shipping Temporal — 2025-04-11
https://spidermonkey.dev/blog/2025/04/11/shipping-temporal

The Temporal proposal provides a replacement for Date, a long-standing pain point in the JavaScript language. This blog post describes some of the history and motivation behind the proposal. The Temporal API itself is well documented on MDN.

Temporal reached Stage 3 of the TC39 process in March 2021. Reaching Stage 3 means that the specification is considered complete, and that the proposal is ready for implementation.

SpiderMonkey began our implementation that same month, with the initial work tracked in Bug 1519167. Incredibly, our implementation was not developed by Mozilla employees, but was contributed entirely by a single volunteer, André Bargull. That initial bug consisted of 99 patches, but the work did not stop there, as the specification continued to evolve as problems were found during implementation. Beyond contributing to SpiderMonkey, André filed close to 200 issues against the specification. Bug 1840374 is just one example of the massive amount of work required to keep up to date with the specification.

As of Firefox 139, we’ve enabled our Temporal implementation by default, making us the first browser to ship it. Sometimes it can seem like the ideas of open source, community, and volunteer contributors are a thing of the past, but the example of Temporal shows that volunteers can still have a meaningful impact both on Firefox and on the JavaScript language as a whole.

Interested in contributing?

Not every proposal is as large as Temporal, and we welcome contributions of all shapes and sizes. If you’re interested in contributing to SpiderMonkey, please have a look at our mentored bugs. You don’t have to be an expert :). If your interests are more on the specification side, you can also check out how to contribute to TC39.

]]>
Daniel Minor
SpiderMonkey Newsletter (Firefox 135-137) — 2025-03-17
https://spidermonkey.dev/blog/2025/03/17/newsletter-firefox-135-137

Hello everyone,

Matthew here from the SpiderMonkey team. As the weather whipsaws from cold to hot to cold, I have elected to spend some time whipping together a too brief newsletter, which will almost certainly not capture the best of what we’ve done these last few months. Nevertheless, onwards!

🧑‍🎓Outreachy

We hosted an Outreachy intern, Serah Nderi, for the most recent Outreachy cycle, with Dan as her mentor. Serah worked on implementing the Iterator.range proposal as well as a few other things. We were happy to host her, and grateful to her for joining. Read about her internship project here.

🥯HYTRADBOI: Have You Tried Rubbing a Database On It

HYTRADBOI is an interesting independent online only conference, which this year had a strong programming languages track. Iain from the SpiderMonkey team was able to produce a stellar video talk called A quick ramp-up on ramping up quickly, where he helps the audience reinvent our baseline interpreter in 10 minutes. The talk is fun and short, so go forth and watch it!

👷🏽‍♀️ New features & In Progress Standards Work

We have done a whole bunch of shipping work this cycle. By far the most important thing is that Temporal has now been shipped on Nightly. We must extend our enormous gratitude to André Bargull, who has been implementing this proposal for years, providing reams of feedback to champions, and making it possible for us to ship so early. We’ve also been working on improving error messages reported to developers, and have a list of “good first bugs” available for people interested in getting started contributing to SpiderMonkey or Firefox.

In addition to Temporal, Dan has worked on shipping a number of our complete proposal implementations, including Atomics.pause.

🚀 Performance

🚉 SpiderMonkey Platform Improvements

]]>
Matthew Gaudet
Implementing Iterator.range in SpiderMonkey — 2025-03-05
https://spidermonkey.dev/blog/2025/03/05/iterator-range

In October 2024, I joined Outreachy as an open source contributor, and in December 2024 I began an Outreachy internship with Mozilla. My role was to implement the TC39 Range Proposal in the SpiderMonkey JavaScript engine. Iterator.range is a new built-in method proposed for JavaScript iterators that allows generating a sequence of numbers within a specified range. It functions similarly to Python’s range, providing an easy and efficient way to iterate over a series of values:

for (const i of Iterator.range(0, 43)) console.log(i); // 0 to 42

But also things like:

function* even() {
  for (const i of Iterator.range(0, Infinity)) if (i % 2 === 0) yield i;
}

In this blog post, we will explore the implementation of Iterator.range in the SpiderMonkey JavaScript engine.

Understanding the Implementation

When I started working on Iterator.range, the initial implementation had already been done, i.e., adding a preference for the proposal and making the built-in accessible in the JavaScript shell.

Iterator.range simply returned false, a stub indicating that the actual implementation was under development or not fully implemented, which is where I came in. As a start, I created a CreateNumericRangeIterator function that the Iterator.range function delegates to. Following that, I implemented the first three steps within the Iterator.range function. Next, I initialised variables and parameters for the NUMBER-RANGE data type in the CreateNumericRangeIterator function.

I focused on implementing sequences that increase by one, such as Iterator.range(0, 10). Next, I created an IteratorRangeGenerator* function (i.e., step 18 of the Range proposal) that, when called, doesn’t execute immediately, but returns a generator object which follows the iterator protocol. Inside the generator function are yield statements, which represent where the function suspends its execution and provides a value back to the caller. Additionally, I updated the CreateNumericRangeIterator function to invoke IteratorRangeGenerator* with the appropriate arguments, aligning with step 19 of the specification, and added tests to verify its functionality.

The generator will pause at each yield, and will not continue until the next method is called on the generator object that is created. The NumericRangeIteratorPrototype (Step 27.1.4.2 of the proposal) is the object that holds the iterator prototype for the Numeric range iterator. The next() method is added to the NumericRangeIteratorPrototype, when you call the next() method on an object created from NumericRangeIteratorPrototype, it doesn’t directly return a value, but it makes the generator yield the next value in the series, effectively resuming the suspended generator.

The first time you invoke next() on the generator object created via IteratorRangeGenerator*, the generator will run up to the first yield statement and return the first value. When you invoke next() again, NumericRangeIteratorNext() will be called.

This method uses GeneratorResume(this), which means the generator will pick up right where it left off, continuing to the next yield statement or until iteration ends.
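A greatly simplified plain-JS sketch of the generator-based approach (the real self-hosted code follows the proposal's numbered steps and also handles BigInt, validation, and an infinite end; the name here is illustrative):

```javascript
// A generator that yields start, start+step, ... while below end.
function* IteratorRangeGeneratorSketch(start, end, step) {
  for (let v = start; v < end; v += step) {
    yield v; // execution suspends here until next() is called again
  }
}

const it = IteratorRangeGeneratorSketch(0, 3, 1);
console.log(it.next()); // { value: 0, done: false }
console.log([...it]);   // [1, 2]
```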

Generator Alternative

After discussions with my mentors Daniel and Arai, I transitioned from a generator-based implementation to a more efficient slot-based approach. This change involved defining slots to store the state necessary for computing the next value. The reasons included:

  • Efficiency: Directly managing iteration state is faster than relying on generator functions.
  • Simplified Implementation: A slot-based approach eliminates the need for generator-specific handling, making the code more maintainable.
  • Better Alignment with Other Iterators: Existing built-in iterators such as StringIteratorPrototype and ArrayIteratorPrototype do not use generators in their implementations.
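The slot-based idea can be sketched in plain JavaScript (the real implementation stores this state in reserved object slots rather than ordinary properties, and the class name here is illustrative):

```javascript
// Iteration state lives in fields of the iterator itself, and next()
// computes values directly, with no generator frames to resume.
class NumericRangeIteratorSketch {
  constructor(start, end, step) {
    this.current = start;
    this.end = end;
    this.step = step;
  }
  next() {
    if (this.current >= this.end) return { value: undefined, done: true };
    const value = this.current;
    this.current += this.step;
    return { value, done: false };
  }
  [Symbol.iterator]() {
    return this;
  }
}

console.log([...new NumericRangeIteratorSketch(0, 3, 1)]); // [0, 1, 2]
```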

Performance and Benchmarks

To quantify the performance improvements gained by transitioning from a generator-based implementation to a slot-based approach, I conducted comparative benchmarks using a test on current mozilla-central and on the revision that used the generator-based approach. My benchmark tested two key scenarios:

  • Floating-point range iteration: Iterating through 100,000 numbers with a step of 0.1
  • BigInt range iteration: Iterating through 1,000,000 BigInts with a step of 2

Each test was run 100 times to eliminate anomalies. The benchmark code was structured as follows:

// Benchmark for Number iteration
var sum = 0;
for (var i = 0; i < 100; ++i) {
  for (num of Iterator.range(0, 100000, 0.1)) {
    sum += num;
  }
}
print(sum);

// Benchmark for BigInt iteration
var sum = 0n;
for (var i = 0; i < 100; ++i) {
  for (num of Iterator.range(0n, 1000000n, 2n)) {
    sum += num;
  }
}
print(sum);

Results

  Implementation     Execution Time (ms)   Improvement
  Generator-based    8,174.60              -
  Slot-based         2,725.33              66.70%

The slot-based implementation completed the benchmark in just 2.7 seconds compared to 8.2 seconds for the generator-based approach. This represents a 66.7% reduction in execution time, or in other words, the optimized implementation is approximately 3 times faster.

Challenges

Implementing BigInt support was straightforward from a specification perspective, but I encountered two blockers:

1. Handling Infinity Checks Correctly

The specification ensures that start is either a Number or a BigInt in steps 3.a and 4.a. However, step 5 states:

  • If start is +∞ or -∞, throw a RangeError.

Despite following this, my implementation still threw an error stating that start must be finite. After investigating, I found that the issue stemmed from using a self-hosted isFinite function.

The specification requires isFinite to throw a TypeError for BigInt, but the self-hosted Number_isFinite returns false instead. This turned out to be more of an implementation issue than a specification issue.

See Github discussion here.

  • Fix: Explicitly check that start is a number before calling isFinite:
// Step 5: If start is +∞ or -∞, throw a RangeError.
if (typeof start === "number" && !Number_isFinite(start)) {
  ThrowRangeError(JSMSG_ITERATOR_RANGE_START_INFINITY);
}

2. Floating Point Precision Errors

When testing floating-point sequences, I encountered an issue where some decimal values were not represented exactly due to JavaScript’s floating-point precision limitations. This caused incorrect test results.

There’s a GitHub issue discussing this in depth. I implemented an approximatelyEqual function to compare values within a small margin of error.

  • Fix: Using approximatelyEqual in tests:
const resultFloat2 = Array.from(Iterator.range(0, 1, 0.2));
approximatelyEqual(resultFloat2, [0, 0.2, 0.4, 0.6, 0.8]);

This function ensures that minor precision errors do not cause test failures, improving floating-point range calculations.
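A sketch of such a helper (the test suite's actual version may differ): compare arrays element-wise within a small epsilon, and throw on any mismatch.

```javascript
function approximatelyEqual(actual, expected, epsilon = 1e-10) {
  if (actual.length !== expected.length) {
    throw new Error("length mismatch");
  }
  for (let i = 0; i < actual.length; i++) {
    if (Math.abs(actual[i] - expected[i]) > epsilon) {
      throw new Error(`element ${i}: ${actual[i]} !== ${expected[i]}`);
    }
  }
}

// 0.1 + 0.2 is 0.30000000000000004, but it is "approximately" 0.3:
approximatelyEqual([0.1 + 0.2], [0.3], 1e-9);
```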

Next Steps and Future Improvements

There are different stages a TC39 proposal goes through before it can be shipped. This document shows the different stages that a proposal goes through, from ideation to consumption. The Iterator.range proposal is currently at Stage 1 (the Proposal stage). Ideally, the proposal should advance to Stage 3, which means that the specification is stable and no changes to the proposal are expected, but some necessary changes may still occur due to web incompatibilities or feedback from production-grade implementations.

Currently, this implementation is in its early stages. It is only built in Nightly and disabled by default until the proposal reaches Stage 3 or 4 and no further revisions to the specification are expected.

Final Thoughts

Working on the Iterator.range implementation in SpiderMonkey has been a deeply rewarding experience. I learned how to navigate a large and complex codebase, collaborate with experienced engineers, and translate a formal specification into an optimized, real-world implementation. The transition from a generator-based approach to a slot-based one was a significant learning moment, reinforcing the importance of efficiency in JavaScript engine internals.

Beyond technical skills, I gained a deeper appreciation for the standardization process in JavaScript. The experience highlighted how proposals evolve through real-world feedback, and how early-stage implementations help shape their final form.

As Iterator.range continues its journey through the TC39 proposal stages, I look forward to seeing its adoption in JavaScript engines and the impact it will have on developers. I hope this post provides useful insights into SpiderMonkey development and encourages others to contribute to open-source projects and JavaScript standardization efforts.

If you’d like to read more, here are my blog posts that I made during the project:

]]>
Serah Nderi
Making Teleporting Smarter — 2025-02-19
https://spidermonkey.dev/blog/2025/02/19/Making-Teleporting-Smarter

Recently I got to land a patch which touches a cool optimization that I had to really make sure I understood deeply. As a result, I wrote a huge commit message. I’d like to expand that message a touch here and turn it into a nice blog post.

This post assumes roughly that you understand how Shapes work in the JavaScript object model, and how prototypical property lookup works in JavaScript. If you don’t understand that just yet, this blog post by Matthias Bynens is a good start.

This patch aims to mitigate a performance cliff that occurs when we have applications which shadow properties on the prototype chain or which mutate the prototype chain.

The problem is that these actions currently break a property lookup optimization called “Shape Teleportation”.

What is Shape Teleporting?

Suppose you’re looking up some property y on an object obj, which has a prototype chain with 4 elements. Suppose y isn’t stored on obj, but instead is stored on some prototype object B, in slot 1.

A diagram of shape teleporting

In order to get the value of this property, officially you have to walk from obj up to B to find the value of y. Of course, this would be inefficient, so what we do instead is attach an inline cache to make this lookup more efficient.

Now we have to guard against future mutation when creating an inline cache. A basic version of a cache for this lookup might look like:

  • Check obj still has the same shape.
  • Check obj’s prototype (D) still has the same shape.
  • Check D’s prototype (C) still has the same shape.
  • Check C’s prototype (B) still has the same shape.
  • Load slot 1 out of B.

This is less efficient than we would like though. Imagine if instead of having 3 intermediate prototypes, there were 13 or 30? You’d have this long chain of prototype shape checking, which takes a long time!

Ideally, what you’d like is to be able to simply say

  • Check obj still has the same shape.
  • Check B still has the same shape.
  • Load slot 1 out of B.

The problem with doing this naively is: what if someone adds y as a property to C? With the faster guards, you’d totally miss that value and, as a result, compute the wrong result. We don’t like wrong results.
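You can see the hazard with ordinary objects (a sketch of the situation in the diagram, using plain property lookup rather than ICs):

```javascript
// obj -> D -> C -> B: y starts out on B, then gets shadowed on C.
const B = { y: "from B" };
const C = Object.create(B);
const D = Object.create(C);
const obj = Object.create(D);

console.log(obj.y); // "from B" (a cache could remember "load y from B")
C.y = "from C";     // shadowing property added on an intermediate prototype
console.log(obj.y); // "from C" (the remembered answer is now wrong)
```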

Shape Teleporting is the existing optimization which says that so long as you actively force a change of shape on objects in the prototype chain when certain modifications occur, then you can guard in inline-caches only on the shape of the receiver object and the shape of the holder object.

By forcing each shape to be changed, inline caches which have baked in assumptions about these objects will no longer succeed, and we’ll take a slow path, potentially attaching a new IC if possible.

We must reshape in the following situations:

  • Adding a property to a prototype which shadows a property further up the prototype chain. In this circumstance, the object getting the new property will naturally reshape to account for the new property, but the old holder needs to be explicitly reshaped at this point, to avoid an inline cache jumping over the newly defined prototype.

A diagram of shape teleporting

  • Modifying the prototype of an object which exists on the prototype chain. For this case we need to invalidate the shape of the object being mutated (natural reshape due to changed prototype), as well as the shapes of all objects on the mutated object’s prototype chain. This is to invalidate all stubs which have teleported over the mutated object.

A diagram of shape teleporting

Furthermore, we must avoid an “A-B-A” problem, where an object returns to a shape prior to prototype modification: for example, even if we re-shape B, what if code deleted and then re-added y, causing B to take on its old shape? Then the IC would start working again, even though the prototype chain may have been mutated!

Prior to this patch, Watchtower watched for prototype mutation and shadowing, and marked the shapes of the prototype objects involved in these operations as InvalidatedTeleporting. This means that property accesses with the objects involved can no longer rely on the shape teleporting optimization. This also avoids the A-B-A problem, as new shapes will always carry along the InvalidatedTeleporting flag.

This patch instead chooses to migrate an object shape to dictionary mode, or generate a new dictionary shape if it’s already in dictionary mode. Using dictionary mode shapes works because all dictionary mode shapes are unique and never recycled. This ensures the ICs are no longer valid as expected, as well as handily avoiding the A-B-A problem.

The patch does keep the InvalidatedTeleporting flag to catch potentially ill-behaved sites that do lots of mutation and shadowing, avoiding having to reshape proto objects forever.

The patch also provides a preference to allow cross-comparison between old and new; however, the patch defaults to dictionary mode teleportation.

Performance testing on micro-benchmarks shows a large impact by allowing ICs to attach where they couldn’t before; however, Speedometer3 shows no real movement.

]]>
Matthew Gaudet
Is Memory64 actually worth using? — 2025-01-15
https://spidermonkey.dev/blog/2025/01/15/is-memory64-actually-worth-using

After many long years, the Memory64 proposal for WebAssembly has finally been released in both Firefox 134 and Chrome 133. In short, this proposal adds 64-bit pointers to WebAssembly.

If you are like most readers, you may be wondering: “Why wasn’t WebAssembly 64-bit to begin with?” Yes, it’s the year 2025 and WebAssembly has only just added 64-bit pointers. Why did it take so long, when 64-bit devices are the majority and 8GB of RAM is considered the bare minimum?

It’s easy to think that 64-bit WebAssembly would run better on 64-bit hardware, but unfortunately that’s simply not the case. WebAssembly apps tend to run slower in 64-bit mode than they do in 32-bit mode. This performance penalty depends on the workload, but it can range from just 10% to over 100%—a 2x slowdown just from changing your pointer size.

This is not simply due to a lack of optimization. Instead, the performance of Memory64 is restricted by hardware, operating systems, and the design of WebAssembly itself.

What is Memory64, actually?

To understand why Memory64 is slower, we first must understand how WebAssembly represents memory.

When you compile a program to WebAssembly, the result is a WebAssembly module. A module is analogous to an executable file, and contains all the information needed to bootstrap and run a program, including:

  • A description of how much memory will be necessary (the memory section)
  • Static data to be copied into memory (the data section)
  • The actual WebAssembly bytecode to execute (the code section)

These are encoded in an efficient binary format, but WebAssembly also has an official text syntax used for debugging and direct authoring. This article will use the text syntax. You can convert any WebAssembly module to the text syntax using tools like WABT (wasm2wat) or wasm-tools (wasm-tools print).

Here’s a simple but complete WebAssembly module that allows you to store and load an i32 at address 16 of its memory.

(module
  ;; Declare a memory with a size of 1 page (64KiB, or 65536 bytes)
  (memory 1)

  ;; Declare, and export, our store function
  (func (export "storeAt16") (param i32)
    i32.const 16  ;; push address 16 to the stack
    local.get 0   ;; get the i32 param and push it to the stack
    i32.store     ;; store the value to the address
  )

  ;; Declare, and export, our load function
  (func (export "loadFrom16") (result i32)
    i32.const 16  ;; push address 16 to the stack
    i32.load      ;; load from the address
  )
)

Now let’s modify the program to use Memory64:

(module
  ;; Declare an i64 memory with a size of 1 page (64KiB, or 65536 bytes)
  (memory i64 1)

  ;; Declare, and export, our store function
  (func (export "storeAt16") (param i32)
    i64.const 16  ;; push address 16 to the stack
    local.get 0   ;; get the i32 param and push it to the stack
    i32.store     ;; store the value to the address
  )

  ;; Declare, and export, our load function
  (func (export "loadFrom16") (result i32)
    i64.const 16  ;; push address 16 to the stack
    i32.load      ;; load from the address
  )
)

You can see that our memory declaration now includes i64, indicating that it uses 64-bit addresses. We therefore also change i32.const 16 to i64.const 16. That’s it. This is pretty much the entirety of the Memory64 proposal [1].

How is memory implemented?

So why does this tiny change make a difference for performance? We need to understand how WebAssembly engines actually implement memories.

Thankfully, this is very simple. The host (in this case, a browser) simply allocates memory for the WebAssembly module using a system call like mmap or VirtualAlloc. WebAssembly code is then free to read and write within that region, and the host (the browser) ensures that WebAssembly addresses (like 16) are translated to the correct address within the allocated memory.
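The JS API exposes the same model: a memory is just a buffer the host allocates, and WebAssembly addresses are offsets into it. A sketch using the standard WebAssembly.Memory constructor:

```javascript
// One WebAssembly page is 64 KiB; addresses are offsets into the buffer.
const memory = new WebAssembly.Memory({ initial: 1 });
console.log(memory.buffer.byteLength); // 65536

// The equivalent of the module's storeAt16/loadFrom16 pair, done from JS:
const view = new DataView(memory.buffer);
view.setInt32(16, 42, true); // WebAssembly memories are little-endian
console.log(view.getInt32(16, true)); // 42
```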

However, WebAssembly has an important constraint: accessing memory out of bounds will trap, analogous to a segmentation fault (segfault). It is the host’s job to ensure that this happens, and in general it does so with bounds checks. These are simply extra instructions inserted into the machine code on each memory access—the equivalent of writing if (address >= memory.length) { trap(); } before every single load [2]. You can see this in the actual x64 machine code generated by SpiderMonkey for an i32.load [3]:

  movq 0x08(%r14), %rax       ;; load the size of memory from the instance (%r14)
  cmp %rax, %rdi              ;; compare the address (%rdi) to the limit
  jb .load                    ;; if the address is ok, jump to the load
  ud2                         ;; trap
.load:
  movl (%r15,%rdi,1), %eax    ;; load an i32 from memory (%r15 + %rdi)

These instructions have several costs! Besides taking up CPU cycles, they require an extra load from memory, they increase the size of machine code, and they take up branch predictor resources. But they are critical for ensuring the security and correctness of WebAssembly code.
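In JavaScript terms, the guarded load is equivalent to something like the following sketch (the engine compares raw machine addresses rather than indexing a DataView, but the shape of the check is the same; the function name is illustrative):

```javascript
// Every i32 load first checks that all 4 accessed bytes are in bounds.
function load_i32(buffer, addr) {
  if (addr + 4 > buffer.byteLength) {
    throw new RangeError("out of bounds"); // the trap
  }
  return new DataView(buffer).getInt32(addr, true);
}

const mem = new ArrayBuffer(65536);
new DataView(mem).setInt32(16, 42, true);
console.log(load_i32(mem, 16)); // 42
```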

Unless…we could come up with a way to remove them entirely.

How is memory really implemented?

The maximum possible value for a 32-bit integer is about 4 billion. 32-bit pointers therefore allow you to use up to 4GB of memory. The maximum possible value for a 64-bit integer, on the other hand, is about 18 sextillion, allowing you to use up to 18 exabytes of memory. This is truly enormous, tens of millions of times bigger than the memory in even the most advanced consumer machines today. In fact, because this difference is so great, most “64-bit” devices are actually 48-bit in practice, using just 48 bits of the memory address to map from virtual to physical addresses [4].

Even a 48-bit memory is enormous: 65,536 times larger than the largest possible 32-bit memory. This gives every process 281 terabytes of address space to work with, even if the device has only a few gigabytes of physical memory.

This means that address space is cheap on 64-bit devices. If you like, you can reserve 4GB of address space from the operating system to ensure that it remains free for later use. Even if most of that memory is never used, this will have little to no impact on most systems.

How do browsers take advantage of this fact? By reserving 4GB of memory for every single WebAssembly module.

In our first example, we declared a 32-bit memory with a size of 64KB. But if you run this example on a 64-bit operating system, the browser will actually reserve 4GB of memory. The first 64KB of this 4GB block will be read-write, and the remaining 3.9999GB will be reserved but inaccessible.

By reserving 4GB of memory for all 32-bit WebAssembly modules, it is impossible to go out of bounds. The largest possible pointer value, 2^32-1, will simply land inside the reserved region of memory and trap. This means that, when running 32-bit wasm on a 64-bit system, we can omit all bounds checks entirely [5].

This optimization is impossible for Memory64. The size of the WebAssembly address space is the same as the size of the host address space. Therefore, we must pay the cost of bounds checks on every access, and as a result, Memory64 is slower.

So why use Memory64?

The only reason to use Memory64 is if you actually need more than 4GB of memory.

Memory64 won’t make your code faster or more “modern”. 64-bit pointers in WebAssembly simply allow you to address more memory, at the cost of slower loads and stores.

The performance penalty may diminish over time as engines make optimizations. Bounds checking strategies can be improved, and WebAssembly compilers may be able to eliminate some bounds checks at compile time. But it is impossible to beat the absolute removal of all bounds checks found in 32-bit WebAssembly.

Furthermore, the WebAssembly JS API constrains memories to a maximum size of 16GB. This may be quite disappointing for developers used to native memory limits. Unfortunately, because WebAssembly makes no distinction between “reserved” and “committed” memory, browsers cannot freely allocate large quantities of memory without running into system commit limits.

Still, being able to access 16GB is very useful for some applications. If you need more memory, and can tolerate worse performance, then Memory64 might be the right choice for you.

Where can WebAssembly go from here? Memory64 may be of limited use today, but there are some exciting possibilities for the future:

  • Bounds checks could be better supported in hardware in the future. There has already been some research in this direction—for example, see this 2023 paper by Narayan et. al. With the growing popularity of WebAssembly and other sandboxed VMs, this could be a very impactful change that improves performance while also eliminating the wasted address space from large reservations. (Not all WebAssembly hosts can spend their address space as freely as browsers.)

  • The memory control proposal for WebAssembly, which I co-champion, is exploring new features for WebAssembly memory. While none of the current ideas would remove the need for bounds checks, they could take advantage of virtual memory hardware to enable larger memories, more efficient use of large address spaces (such as reduced fragmentation for memory allocators), or alternative memory allocation techniques.

Memory64 may not matter for most developers today, but we think it is an important stepping stone to an exciting future for memory in WebAssembly.


  1. The rest of the proposal fleshes out the i64 mode, for example by modifying instructions like memory.fill to accept either i32 or i64 depending on the memory’s address type. The proposal also adds an i64 mode to tables, which are the primary mechanism used for function pointers and indirect calls. For simplicity, they are omitted from this post. 

  2. In practice the instructions may actually be more complicated, as they also need to account for integer overflow, offset, and align.

  3. If you’re using the SpiderMonkey JS shell, you can try this yourself by using wasmDis(func) on any exported WebAssembly function. 

  4. Some hardware now also supports addresses larger than 48 bits, such as Intel processors with 57-bit addresses and 5-level paging, but this is not yet commonplace. 

  5. In practice, a few extra pages beyond 4GB will be reserved to account for offset and align, called “guard pages”. We could reserve another 4GB of memory (8GB in total) to account for every possible offset on every possible pointer, but in SpiderMonkey we instead choose to reserve just 32MiB + 64KiB for guard pages and fall back to explicit bounds checks for any offsets larger than this. (In practice, large offsets are very uncommon.) For more information about how we handle bounds checks on each supported platform, see this SMDOC comment (which seems to be slightly out of date), these constants, and this Ion code. It is also worth noting that we fall back to explicit bounds checks whenever we cannot use this allocation scheme, such as on 32-bit devices or resource-constrained mobile phones. 

]]>
Ben Visness
SpiderMonkey Newsletter (Firefox 132-134) — 2024-11-27
https://spidermonkey.dev/blog/2024/11/27/newsletter-firefox-132-134

Hello! Welcome to another episode of the SpiderMonkey Newsletter. I’m your host, Matthew Gaudet.

In the spirit of the upcoming season, let’s talk turkey. I mean, monkeys. I mean SpiderMonkey.

Today we’ll cover a little more ground than the normal newsletter.

If you haven’t already read Jan’s wonderful blog about how he managed to improve Wasm compilation speed by 75x on large modules, please take a peek. It’s a great story of how O(n^2) is the worst complexity – fast enough to seem OK in small cases, and slow enough to blow up horrendously when things get big.

🚀 Performance

👷🏽‍♀️ New features & In Progress Standards Work

🚉 SpiderMonkey Platform Improvements

]]>
Matthew Gaudet
75x faster: optimizing the Ion compiler backend — 2024-10-16
https://spidermonkey.dev/blog/2024/10/16/75x-faster-optimizing-the-ion-compiler-backend

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

This post describes how we addressed this and some of our longer-term plans for improving WebAssembly performance in the future.

The problem

SpiderMonkey has two compilers for WebAssembly code. First, a Wasm module is compiled with the Wasm Baseline compiler, a compiler that generates decent machine code very quickly. This is good for startup time because we can start executing Wasm code almost immediately after downloading it. Andy Wingo wrote a nice blog post about this Baseline compiler.

When Baseline compilation is finished, we compile the Wasm module with our more advanced Ion compiler. This backend produces faster machine code, but compilation time is a lot higher.

The issue with the ONNX module was that the Ion compiler backend took a long time and used a lot of memory to compile it. On my Linux x64 machine, Ion-compiling this module took about 5 minutes and used more than 4 GB of memory. Even though this work happens on background threads, this was still too much overhead.

Optimizing the Ion backend

When we investigated this, we noticed that this Wasm module had some extremely large functions. For the largest one, Ion’s MIR control flow graph contained 132856 basic blocks. This uncovered some performance cliffs in our compiler backend.

VirtualRegister live ranges

In Ion’s register allocator, each VirtualRegister has a list of LiveRange objects. We were using a linked list for this, sorted by start position. This caused quadratic behavior when allocating registers: the allocator often splits live ranges into smaller ranges and we’d have to iterate over the list for each new range to insert it at the correct position to keep the list sorted. This was very slow for virtual registers with thousands of live ranges.

To address this, I tried a few different data structures. The first attempt was an AVL tree instead of a linked list, and that was a big improvement, but performance was still not ideal and we were also worried about memory usage increasing even further.

After this we realized we could store live ranges in a vector (instead of linked list) that’s optionally sorted by decreasing start position. We also made some changes to ensure the initial live ranges are sorted when we create them, so that we could just append ranges to the end of the vector.

The observation here was that the core of the register allocator, where it assigns registers or stack slots to live ranges, doesn’t actually require the live ranges to be sorted. We therefore now just append new ranges to the end of the vector and mark the vector unsorted. Right before the final phase of the allocator, where we again rely on the live ranges being sorted, we do a single std::sort operation on the vector for each virtual register with unsorted live ranges. Debug assertions are used to ensure that functions that require the vector to be sorted are not called when it’s marked unsorted.

Vectors are also better for cache locality and they let us use binary search in a few places. When I was discussing this with Julian Seward, he pointed out that Chris Fallin also moved away from linked lists to vectors in Cranelift’s port of Ion’s register allocator. It’s always good to see convergent evolution :)

This change from sorted linked lists to optionally-sorted vectors made Ion compilation of this Wasm module about 20 times faster, down to 14 seconds.

Semi-NCA

The next problem that stood out in performance profiles was the Dominator Tree Building compiler pass, in particular a function called ComputeImmediateDominators. This function determines the immediate dominator block for each basic block in the MIR graph.

The algorithm we used for this (based on A Simple, Fast Dominance Algorithm by Cooper et al) is relatively simple but didn’t scale well to very large graphs.

Semi-NCA (from Linear-Time Algorithms for Dominators and Related Problems by Loukas Georgiadis) is a different algorithm that’s also used by LLVM and the Julia compiler. I prototyped this and was surprised to see how much faster it was: it got our total compilation time down from 14 seconds to less than 8 seconds. For a single-threaded compilation, it reduced the time under ComputeImmediateDominators from 7.1 seconds to 0.15 seconds.

Fortunately it was easy to run both algorithms in debug builds and assert they computed the same immediate dominator for each basic block. After a week of fuzz-testing, no problems were found and we landed a patch that removed the old implementation and enabled the Semi-NCA code.

Sparse BitSets

For each basic block, the register allocator allocated a (dense) bit set with a bit for each virtual register. These bit sets are used to check which virtual registers are live at the start of a block.

For the largest function in the ONNX Wasm module, this used a lot of memory: 199477 virtual registers x 132856 basic blocks is at least 3.1 GB just for these bit sets! Because most virtual registers have short live ranges, these bit sets had relatively few bits set to 1.

We replaced these dense bit sets with a new SparseBitSet data structure that uses a hashmap to store 32 bits per entry. Because most of these hashmaps contain a small number of entries, it uses an InlineMap to optimize for this: it’s a data structure that stores entries either in a small inline array or (when the array is full) in a hashmap. We also optimized InlineMap to use a variant (a union type) for these two representations to save memory.

This saved at least 3 GB of memory but also improved the compilation time for the Wasm module to 5.4 seconds.

Faster move resolution

The last issue that showed up in profiles was a function in the register allocator called createMoveGroupsFromLiveRangeTransitions. After the register allocator assigns a register or stack slot to each live range, this function is responsible for connecting pairs of live ranges by inserting moves.

For example, if a value is stored in a register but is later spilled to memory, there will be two live ranges for its virtual register. This function then inserts a move instruction to copy the value from the register to the stack slot at the start of the second live range.

This function was slow because several of its loops had quadratic behavior: for each move’s destination range, it did a linear scan to find the best source range. We optimized the two main loops to run in linear time by taking more advantage of the fact that live ranges are sorted.

With these changes, Ion can compile the ONNX Wasm module in less than 3.9 seconds on my machine, more than 75x faster than before these changes.

Adobe Photoshop

These changes not only improved performance for the ONNX Runtime module, but also for a number of other WebAssembly modules. A large Wasm module downloaded from the free online Adobe Photoshop demo can now be Ion-compiled in 14 seconds instead of 4 minutes.

The JetStream 2 benchmark has a HashSet module that was affected by the quadratic move resolution code. Ion compilation time for it improved from 2.8 seconds to 0.2 seconds.

New Wasm compilation pipeline

Even though these are great improvements, spending at least 14 seconds (on a fast machine!) to fully compile Adobe Photoshop on background threads still isn’t an amazing user experience. We expect this to only get worse as more large applications are compiled to WebAssembly.

To address this, our WebAssembly team is making great progress rearchitecting the Wasm compiler pipeline. This work will make it possible to Ion-compile individual Wasm functions as they warm up instead of compiling everything immediately. It will also unlock exciting new capabilities such as (speculative) inlining.

Stay tuned for updates on this as we start rolling out these changes in Firefox.

- Jan de Mooij, engineer on the SpiderMonkey team
