<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://joym.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://joym.dev/" rel="alternate" type="text/html" /><updated>2026-02-28T16:31:13+00:00</updated><id>https://joym.dev/feed.xml</id><title type="html">Joy Mallik</title><subtitle>I like to understand and build runtime software</subtitle><entry><title type="html">Penny: Correcting my misconceptions after learning more about the execution model</title><link href="https://joym.dev/runtime/penny-model-corrections/" rel="alternate" type="text/html" title="Penny: Correcting my misconceptions after learning more about the execution model" /><published>2026-02-16T03:18:12+00:00</published><updated>2026-02-16T03:18:12+00:00</updated><id>https://joym.dev/runtime/penny-model-corrections</id><content type="html" xml:base="https://joym.dev/runtime/penny-model-corrections/"><![CDATA[<h2 id="context">context</h2>

<p>This post documents a design correction I made while building Penny, a browser-based work-stealing runtime, after better understanding JavaScript’s execution model.</p>

<h2 id="landscape-judgement">landscape judgement</h2>

<p>The scheduler.postTask() API lets tasks be scheduled on the event loop with a priority; the event loop is effectively a time-sliced scheduler. This helps reduce jitter and blocking on the main thread, but a task’s time to completion may go up. Penny’s objective is to use hardware threads to compute faster, reliably, and smoothly. The same constraints apply to scheduler.yield(). These APIs address main-thread responsiveness; they’re not built for scheduling CPU work across workers.</p>

<h2 id="issues-with-penny">issues with Penny</h2>

<p>Based on the previous memory model, I thought that a task registry holding the entire closure, with only IDs passed around, would be enough. It got me to a minor milestone where work was being executed across workers through the API, but it was in no way work stealing. Work stealing, in my head, only works well if units of work are bounded, because we don’t want one long-running task completely monopolizing a thread. And based on my newfound knowledge of the JS execution model, bounded, resumable units aren’t supported by any primitive, at least not in the multithreaded sense. Within a single thread we do have a primitive in the form of generator functions:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span><span class="o">*</span> <span class="nx">gen</span><span class="p">()</span> <span class="p">{</span>
  <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2">first point</span><span class="dl">"</span><span class="p">);</span>
  <span class="k">yield</span><span class="p">;</span>
  <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2">second point</span><span class="dl">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kd">const</span> <span class="nx">g</span> <span class="o">=</span> <span class="nx">gen</span><span class="p">();</span>
<span class="nx">g</span><span class="p">.</span><span class="nx">next</span><span class="p">();</span> <span class="c1">// "first point"</span>
<span class="nx">g</span><span class="p">.</span><span class="nx">next</span><span class="p">();</span> <span class="c1">// "second point"</span>
</code></pre></div></div>

<p>And it does indeed give me controllable execution, holding the execution state within the generator object, but only within a single agent’s context.</p>

<p>But agents hold their own stack, heap, and event loop (and realms, though the relation is one agent to many realms, not vice versa), which means I can’t move a generator across workers.</p>
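<p>A quick way to see this: the structured clone algorithm, which postMessage uses, refuses to serialize a generator object, because its paused execution state lives in internal slots it can’t copy:</p>

```javascript
// Generators can't cross an agent boundary: structured clone (the algorithm
// behind worker.postMessage) throws on objects carrying internal execution state.
function* gen() {
  yield 1;
  yield 2;
}
const g = gen();
g.next(); // advance the generator; its paused state lives in internal slots

let failed = false;
try {
  structuredClone(g); // same serialization rules as worker.postMessage(g)
} catch (err) {
  failed = true; // DataCloneError: the generator is neither clonable nor transferable
}
// failed === true
```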

<p>So, is that it? No more preemption? Do I have to wait for the WASM stack-switching (fiber) proposal to implement it? Well, not exactly. The generator function gave me an idea which, upon research, turns out to be a pretty popular one in the state machine / compiler world.</p>

<p>Most compute-heavy workloads are iteration based, and iterations are essentially state machines. Now, if we mix the concept of yield with iterations expressed as state, i.e. <code class="language-plaintext highlighter-rouge">exampleTask { i: number; maxIter: number; state1: somePODType; state2: AnotherPODType }</code>, we can store this in a SAB and pass it around instead of a closure. For embarrassingly parallel tasks this helps quite a bit, since the iteration can be:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">k</span> <span class="o">&lt;</span> <span class="nx">ChunkSize</span><span class="p">;</span> <span class="nx">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
  <span class="nx">someStateTransformerOrOutput</span><span class="p">(</span><span class="nx">i</span><span class="p">);</span>
  <span class="nx">i</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">i</span> <span class="o">&lt;</span> <span class="nx">N</span><span class="p">)</span> <span class="k">return</span> <span class="dl">"</span><span class="s2">yielded</span><span class="dl">"</span><span class="p">;</span> <span class="c1">// back to scheduler</span>
<span class="k">return</span> <span class="nx">markAsDone</span><span class="p">();</span>
</code></pre></div></div>

<p>It gets a bit more complicated with tasks whose iteration[x] depends on the result of iteration[x-1]; in that case I just have to guarantee that no two instances of a task run at the same time. But this approach seems promising, since it gives me (albeit explicit) preemption points to manage priority inversion, with already established patterns behind it. It’s not “preemption” in the OS sense, but it should be enough to build a usable work-stealing runtime inside browsers.</p>
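<p>Putting the pieces together, here is a hypothetical resumable task expressed as plain POD state (the names are mine, not Penny’s actual API); since all of its state is a handful of numbers, it could live in a SAB slot instead of a closure:</p>

```javascript
// A minimal sketch of a state-machine task. All state is POD, so any worker
// that owns the slot can resume it; nothing is captured in a closure.
function makeSumTask(N) {
  return { i: 0, acc: 0, N }; // resumable state: loop index, accumulator, bound
}

// Run at most `chunkSize` iterations, then hand control back to the scheduler.
function runChunk(task, chunkSize) {
  for (let k = 0; k < chunkSize && task.i < task.N; k++) {
    task.acc += task.i; // the per-iteration state transformer
    task.i++;
  }
  return task.i < task.N ? "yielded" : "done";
}

const task = makeSumTask(10);
let status;
do {
  status = runChunk(task, 4); // explicit preemption point every 4 iterations
} while (status === "yielded");
// task.acc === 45 (0 + 1 + ... + 9)
```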

<p>The thing I have to be a bit careful about now is how I manage my memory model. My current iteration holds:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>

//
//
//                 ┌────────────┼────────────────────────────────────────────────────────────────────┐
//                 │            │                                                                    │
//                 │   Thread 1 │                                                                    │
//                 │┌───┬───┬──┐│               Per Thread Deque                                     │
//                 ││Tsk│Tsk│  ││                                                                    │
//                 │└───┴───┴──┘│                                                                    │
//                 │            │                                                                    │
//                 ├────────────┼───────────┬──────────┬───────────┬─────────┬───────────────────────┤
//                 │            │  Task 2   │          │           │         │                       │
//                 │            │   Args    │          │           │         │                       │
//                 │   Task     │   Iter    │  ────────┼───────────┼─────────┼─►   Task N            │
//                 │    1       │   Max     │          │           │         │                       │
//                 │            │   Owner?  │          │           │         │                       │
//                 │            │MarkForKill?          │           │         │                       │
//                 │            │           │          │           │         │                       │
//                 └────────────┴───────────┴──────────┴───────────┴─────────┴───────────────────────┘
//
//
</code></pre></div></div>

<p>which I have to clean up a bit.</p>

<p>But the main thing is that the registry of tasks (by ID) is stored in the SAB along with the per-thread deques. This has its tradeoffs:</p>

<h3 id="pros">PROS</h3>

<ul>
  <li>Fixed layout</li>
  <li>easy to visualize and debug</li>
</ul>

<h3 id="cons">CONS</h3>

<ul>
  <li>Fixed layout (ironic?)</li>
  <li>memory growth of O(n) in the number of active tasks: functions are not pointers anymore. If someone wants to execute the same task 100 times, there will be 100 copies of the same task in 100 slots.</li>
  <li>args will be limited, due to the fixed-size layout</li>
</ul>

<p>But fixing it to an upper limit (200 slots for now), applying backpressure through a main-thread queue, and getting preemption in return is a win in my opinion.</p>
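<p>The backpressure path can be sketched roughly like this (the slot budget and function names are mine, not Penny’s actual API):</p>

```javascript
// Hypothetical sketch: a fixed budget of in-flight SAB task slots, with a
// plain main-thread array absorbing the overflow as backpressure.
const MAX_SLOTS = 200;   // fixed upper limit on SAB task slots
let usedSlots = 0;
const overflow = [];     // main-thread queue for tasks that didn't fit

function submit(task, dispatch) {
  if (usedSlots < MAX_SLOTS) {
    usedSlots++;
    dispatch(task);      // write into a free SAB slot (elided here)
    return "scheduled";
  }
  overflow.push(task);   // apply backpressure instead of growing the SAB
  return "queued";
}

function onTaskComplete(dispatch) {
  usedSlots--;           // the finished task frees its slot
  if (overflow.length > 0) {
    usedSlots++;
    dispatch(overflow.shift()); // drain the overflow FIFO into the freed slot
  }
}
```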

<p>Also, the yield points need to be inserted exactly at the boundary of each iteration. Due to the nature of this approach, and the lack of stackful coroutines, yielding in the middle of an iteration is just not possible; there would be no guarantee the computation is consistent at that point. I don’t yet know how to encode those rules, or even what the rules would be, which is why I’m manually writing the tasks for the MVP. But I imagine syntactic sugar that lowers into the same (or similar) task layout would be doable once I learn more about it.</p>

<h2 id="why-not-just-use-asyncmicrotasks">why not just use Async/Microtasks?</h2>

<p>I did consider this approach, and after reading this excellent post: <a href="https://jakearchibald.com/2015/tasks-microtasks-queues-and-schedules/">tasks-microtasks-queues-and-schedules</a>, it’s clear it doesn’t really help compute-heavy workloads. Microtasks are optimized for promise callbacks and don’t give me preemption: continuations only run after the current call stack completes, i.e. the run-to-completion policy <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Execution_model#run-to-completion">mention -&gt;</a> (the JS execution model has this design by choice). So this won’t help me deal with priority inversion.</p>
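<p>The run-to-completion behavior is easy to demonstrate: a microtask queued <em>before</em> a block of heavy synchronous work still cannot preempt it.</p>

```javascript
// Microtasks never interleave with the current call stack: even though the
// promise callback is queued first, the busy loop runs to completion.
const order = [];
Promise.resolve().then(() => order.push("microtask"));
for (let i = 0; i < 1e6; i++) {} // stand-in for compute-heavy work
order.push("sync done");
// once the stack unwinds, order becomes ["sync done", "microtask"]
```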

<h2 id="next-steps">next steps</h2>

<ul>
  <li>Implement fixed-size global task pool in SAB</li>
  <li>Represent tasks as explicit state machines</li>
  <li>Integrate single-owner execution + yielding</li>
  <li>Validate work stealing with a simple loop workload</li>
</ul>]]></content><author><name></name></author><category term="runtime" /><summary type="html"><![CDATA[context]]></summary></entry><entry><title type="html">When Draw calls stop being the point</title><link href="https://joym.dev/4/when-draw-calls-stop-being-the-point/" rel="alternate" type="text/html" title="When Draw calls stop being the point" /><published>2026-01-29T17:00:00+00:00</published><updated>2026-01-29T17:00:00+00:00</updated><id>https://joym.dev/4/when-draw-calls-stop-being-the-point</id><content type="html" xml:base="https://joym.dev/4/when-draw-calls-stop-being-the-point/"><![CDATA[<p>I thought rendering performance was about optimizing draw calls.
It turns out my assumption is 10+ years old already.</p>

<h2 id="the-initial-issue">The initial issue</h2>

<p>This topic is actually wild, and I’m glad I came across it the way I did. So, background: I’m building my own engine for my game 004 (name pending), and one of the first frictions I ran into was the way I had to submit draw calls. Now, I’m a beginner to the whole rendering thing, so I’ve been using OpenGL, with LearnOpenGL as my introduction to it.
The issue I was facing with draw calls was twofold: 1. I had to include GL everywhere I wanted rendering logic, and 2. at the end, there was a central place where everything had to be passed down with the relevant data to make the individual draw calls. Obviously this gave me a bad smell that I was doing something wrong, so I wanted to improve it. Now, I’ve been using ImGui for quick debug UI and for my editor, and I’m really a big fan of their API model. The fact that in my main loop I just need to call ImGui::NewFrame() and ImGui::EndFrame() at the end, while any subsystem or class in between can define its own nodes, seemed like the perfect API. My one gripe with this whole API is that it requires a global namespace with a lot of variables. And while I do see the benefits (I love it), there are a few restrictions that come with it, such as: it has to run on the main thread, and thread safety is not guaranteed (I did see a comment mentioning how to get thread safety, but didn’t follow it through, to be fair).</p>

<p>Which is fair play, but my current goal was to use OpenGL to learn the basics and then move on to Vulkan for the sweet promise of multiple queues for parallel submission. So I didn’t want this sort of API for my rendering pipeline.
But while writing more of my engine against my naive approach, I realized I was doing the same thing multiple times: fetch the resource handle -&gt; write data -&gt; in the render loop, use the data to make the draw call. I was writing this manually, and there are a lot of resources used by multiple draw calls, very often. So by default this looked like a DAG problem to me, from my earlier <a href="https://joym.dev/runtime/penny-mem-model/">Penny</a> work. If I could define the unit of work as a node and the dependencies as edges, I could build out a graph and, right before making the draw calls, sort it for optimization! This seemed like a revelation and, lo and behold, this idea (albeit way better and way more thought out) has a name: the render graph. I was ecstatic that I came to this idea (sort of) on my own, and that engines validate it as a legitimate orchestration layer.</p>

<p>So, naturally, I wanted to look at how the great engines do it, so I went to The Forge’s GitHub to get some insight, and that’s where I came across <a href="https://github.com/ConfettiFX/The-Forge/issues/171">this issue</a>. The great Wolfgang Engel (loved his book series btw, can’t recommend it enough; it’s how I learned about SVOs and the wonderful world of Morton encoding) has a comment towards the bottom that says “With a GPU-driven renderer, you do not make many draw calls.” and “It was a good idea 10+ years ago.” (This one stung, ngl.) With a GPU-driven renderer, you do not make many draw calls? I’m not joking when I say this revelation actually shook me. All the resources I’d been reading just have draw calls, and maybe deferred vs. forward rendering, and I thought that was advanced, and that AAA studios just have really tight optimizations to improve performance.
Man, was I wrong.</p>

<h2 id="the-paradigm-shift">The Paradigm shift</h2>

<p>So, based on my research, GPU-driven rendering leans heavily on SSBOs and frustum culling to reduce CPU sync overhead. Traditional uniforms have a size limit of ~16 KB and UBOs ~64 KB, but the SSBO limit is (sort of) the VRAM limit. <strong>AND</strong> it’s read-and-write data. So modern renderers upload what they need to the GPU, and a compute shader / kernel generates the draw commands. That’s a completely different game, one I didn’t remotely see coming. Since the visibility check is embarrassingly parallel, it’s the perfect fit for frustum culling on the GPU. My chunking logic to avoid O(n) over all of my tiles cannot realistically compare in terms of parallelism or bandwidth, even with 32 threads. This pipeline makes the CPU the policy setter, while the GPU becomes solely responsible for all computation regarding geometry, with minimal data sync between the two. Post initial upload, the CPU side only needs to tell the GPU about the data that changed.</p>

<p>So, with good map design, like breaking line of sight, tall structures, broader geometry etc. You can render beautiful scenes, while offloading the massive compute to the GPU.</p>

<p>Now, I imagine this isn’t a silver bullet. The CPU still needs to define barriers and such, along with orchestrating data uploads, so I don’t imagine a CPU-side render graph is useless. But I can definitely see how AAA games achieve the visual fidelity they do; I had a massive misconception about that.</p>

<h2 id="the-mental-model-revision">The mental model revision</h2>

<p>What finally clicked for me through this intense journey is that this isn’t just an optimization; it’s a role reversal. The CPU stops being the thing that decides what gets drawn: it can focus on game logic, state machines, and gameplay mechanics, and just sets policy.</p>

<p>The GPU becomes responsible for what is shown in this frame.</p>]]></content><author><name></name></author><category term="4" /><category term="4" /><summary type="html"><![CDATA[I thought rendering performance was about optimizing draw calls. It turns out my assumption is 10+ years old already.]]></summary></entry><entry><title type="html">Penny: Designing a memory model to support a work stealing scheduler in the browser</title><link href="https://joym.dev/runtime/penny-mem-model/" rel="alternate" type="text/html" title="Penny: Designing a memory model to support a work stealing scheduler in the browser" /><published>2025-12-02T03:18:12+00:00</published><updated>2025-12-02T03:18:12+00:00</updated><id>https://joym.dev/runtime/penny-mem-model</id><content type="html" xml:base="https://joym.dev/runtime/penny-mem-model/"><![CDATA[<p>Modern browsers give you workers but not threads, shared memory but not pointers, atomics but not fences, and message passing that copies everything. I wanted to see: Is it possible to build a real work-stealing scheduler — something like Cilk — inside these constraints?
Penny is my attempt.</p>

<h2 id="why-bother-building-a-scheduler">Why bother building a scheduler</h2>
<p>Because the browser’s concurrency model is fundamentally limited:</p>
<ul>
  <li>No shared thread stacks</li>
  <li>No continuations</li>
  <li>No preemption</li>
  <li>Message-passing is expensive</li>
  <li>Workers cannot inspect each other’s memory</li>
  <li>SABs are powerful but extremely constrained</li>
</ul>

<p>This makes it an interesting systems challenge:<br />
<strong>Can we simulate a real work-stealing runtime inside an environment not designed for it?</strong></p>

<h2 id="objective">Objective</h2>

<p>The goal of <a href="https://github.com/Joymfl/Penny">Penny</a> is to explore whether a work-stealing scheduler — similar to Cilk — can be implemented <em>inside a browser</em>, using only Web Workers, SharedArrayBuffers, and Atomics.</p>

<p>Initially I assumed the hard part would be implementing the Chase–Lev algorithm. In reality, the true difficulty was designing a memory model that fit the browser’s constraints.</p>

<p>This post documents the decisions, dead-ends, and constraints I discovered while designing Penny’s first memory model.</p>

<hr />

<h2 id="environment-limitations">Environment Limitations</h2>

<p>Penny is designed entirely within the constraints of browser execution:</p>

<ol>
  <li><strong>Message Passing is Expensive</strong><br />
postMessage copies non-SAB payloads, so SABs + typed views must be used whenever possible.</li>
  <li><strong>Strict Memory Alignment Rules</strong><br />
JS requires typed arrays to begin on specific byte boundaries. This constraint actually helps SAB → L1/L2 cache locality.</li>
  <li><strong>No Portable Thread Stacks</strong><br />
Workers have no accessible stack: no stack switching, no green threads. Those would be needed to stop a single task from monopolizing a thread.</li>
  <li><strong>Functions Cannot Be Sent to Workers</strong><br />
Closures cannot cross thread boundaries, so workers must use a static function table.</li>
</ol>
<p>These constraints deeply influenced my design decisions, each made with these tradeoffs in mind.</p>

<hr />

<h2 id="memory-model---v0">Memory Model - v0</h2>
<p>The objective was to create a Chase-Lev deque to support work stealing.
<img src="/assets/mem-model-v1.png" alt="Basic sketch" />
I added a basic sketch I used to reason about my first memory model for the scheduler. It looks a bit messy, but the idea is that I would have one SharedArrayBuffer with a layout determined deterministically beforehand. Each thread would have its own view into its own slice of memory, essentially emulating thread-local memory (that can be easily, cheaply, and quickly manipulated from either the main thread or other threads in the pool). The key idea was to use SharedArrayBuffers (or SABs from here on) to reduce message-passing overhead, and to let threads check the task lists of other threads, which is otherwise not possible in the browser.</p>

<p>The basic layout of each thread’s memory slice would be:</p>
<ol>
  <li>4 bytes representing the thread state, with 0 = sleeping, 1 = running &amp; 2 = blocked on I/O.</li>
  <li>4 bytes representing the thread ID. The scheduler is responsible for handing out IDs; an ID is just a number in [0..n), with n being the number of threads. It matters because the ID is what lets a thread index into the right window of memory, using:</li>
</ol>

<div class="language-ts highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">baseOffset</span> <span class="o">=</span> <span class="nx">threadId</span> <span class="o">*</span> <span class="nx">THREAD_MEMORY_SIZE</span><span class="p">;</span>
</code></pre></div></div>
<p>This was designed keeping in mind that SABs don’t let you create views if the memory address isn’t naturally aligned. Also, given that Atomics need an Int32Array view, the state field uses one, so that we can use Atomics to sleep or wake threads instead of busy-waiting.</p>
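<p>Concretely, the slicing can be sketched like this (the sizes are illustrative, not Penny’s real layout):</p>

```javascript
// Hypothetical layout constants; Penny's real slice sizes differ.
const THREAD_MEMORY_SIZE = 256; // bytes per thread slice, a multiple of 4
const N_THREADS = 4;
const sab = new SharedArrayBuffer(THREAD_MEMORY_SIZE * N_THREADS);

// Each thread gets an Int32Array view over its own slice. The byteOffset is
// naturally aligned because THREAD_MEMORY_SIZE is a multiple of 4.
function threadView(threadId) {
  const baseOffset = threadId * THREAD_MEMORY_SIZE;
  return new Int32Array(sab, baseOffset, THREAD_MEMORY_SIZE / 4);
}

const t1 = threadView(1);
Atomics.store(t1, 0, 1); // slot 0 = state: 0 sleeping, 1 running, 2 blocked
// A worker could Atomics.wait(t1, 0, ...) on this slot instead of busy-waiting.
```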

<p>There are a few issues here:</p>
<ol>
  <li>There are only 4 bytes each for the top and bottom pointers, which puts them on the same cache line. The bottom pointer is fine, since it should only be touched by the owning thread, but the top pointer is constantly touched by other threads stealing tasks. This leads to false sharing and tanks performance quite a bit.</li>
  <li>The threads have their own Chase-Lev deques, which is great, but each is a static array that cannot be reallocated to a larger size. This means a global backbuffer needs to exist to store tasks in case all threads have their task lists at capacity.</li>
  <li>There is no way to pass a task’s arguments other than <code class="language-plaintext highlighter-rouge">postMessage()</code>. As mentioned earlier, this tanks performance heavily, and I wanted to avoid it as much as possible.</li>
</ol>

<h2 id="scheduler-design-v0">Scheduler design v0</h2>
<p><img src="/assets/Scheduler-v0.png" alt="Basic Sketch Scheduler v0" /></p>

<p>This was my first go at the scheduler design. The idea was that the scheduler would stay on the main thread, initialize the thread pool, and accept tasks from the user to divvy up between the threads.</p>

<p>There would also be a global backbuffer for excess tasks. The scheduler randomly picks a thread and, if that thread is sleeping, tries to add the task to its task list, provided it’s not at capacity.</p>

<div class="language-ts highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">worker</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">threadPool</span><span class="p">[</span><span class="nx">randThreadIndex</span><span class="p">].</span><span class="nx">worker</span><span class="p">;</span>
<span class="k">this</span><span class="p">.</span><span class="nx">threadPool</span><span class="p">[</span><span class="nx">randThreadIndex</span><span class="p">].</span><span class="nx">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">this</span><span class="p">.</span><span class="nx">threadPool</span><span class="p">[</span><span class="nx">randThreadIndex</span><span class="p">].</span><span class="nx">tasklist</span><span class="p">[</span><span class="k">this</span><span class="p">.</span><span class="nx">threadPool</span><span class="p">[</span><span class="nx">randThreadIndex</span><span class="p">].</span><span class="nx">bottom</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="nx">taskId</span><span class="p">;</span>
</code></pre></div></div>
<p>If the thread is at capacity, the taskId is pushed to the global backbuffer, which is another ring buffer.</p>

<p>The issues with this design:</p>
<ol>
  <li>Function args are copied everywhere.</li>
  <li>Task results are sent back by postMessage().</li>
  <li>To drain the global backbuffer, the scheduler needs to run in a loop or on a timeout on the main thread, which is highly likely to hitch rendering and other main-thread work.</li>
</ol>

<p>Because of the above reasons, it was clear this couldn’t be the right architecture, so I reworked the scheduler around these issues.</p>

<h2 id="scheduler-design-v1">Scheduler design v1</h2>
<p><img src="/assets/Scheduler-v1.png" alt="Scheduler v1" /></p>

<p>Forgive the fancy circles; they’re the same thread deques (ring buffers) that I got fancy with while sketching. The main differences are that the global task buffer is LIFO (to support work stealing), and the result buffer is FIFO.</p>

<p>This idea essentially treats the global task list as just another worker’s deque, which the worker threads steal from if they can’t steal from anyone else. This ensures the main thread never gets stuck in the scheduler; all the scheduler does is place the task into one of the available task lists.</p>

<p>Another improvement is that threads write results directly into the result buffer, avoiding message passing and copying of results. This, again, helps with not blocking the main thread.</p>
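<p>A minimal sketch of such a shared result ring follows; the slot count and field layout are mine, and a real multi-producer version needs a CAS loop to reserve the tail, which this sketch elides:</p>

```javascript
// Hypothetical layout: [0] = head (read index), [1] = tail (write index),
// [2..2+CAP) = result slots. Single consumer (the main thread) assumed.
const CAP = 8;
const sab = new SharedArrayBuffer(4 * (2 + CAP));
const ring = new Int32Array(sab);

// Worker side: publish a result without postMessage. (A real multi-producer
// version must reserve the slot with a CAS loop; this keeps it simple.)
function publishResult(value) {
  const t = Atomics.load(ring, 1);
  if (t - Atomics.load(ring, 0) >= CAP) return false; // ring full
  Atomics.store(ring, 2 + (t % CAP), value);
  Atomics.store(ring, 1, t + 1); // publish only after the slot is written
  return true;
}

// Main-thread side: drain results whenever convenient.
function drainResult() {
  const h = Atomics.load(ring, 0);
  if (h === Atomics.load(ring, 1)) return undefined; // empty
  const value = Atomics.load(ring, 2 + (h % CAP));
  Atomics.store(ring, 0, h + 1); // single consumer: plain bump is fine
  return value;
}
```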

<hr />

<p>The main concerns throughout all these designs have been making sure that race conditions don’t ruin task execution or thread behavior.</p>

<p>For the bottom pointers of the thread deques it’s fine, since only the owning thread manipulates them. But reading and manipulating top pointers has to be done through Atomics, such as <code class="language-plaintext highlighter-rouge">Atomics.load(), Atomics.store()</code> and CAS operations (<code class="language-plaintext highlighter-rouge">Atomics.compareExchange()</code>) to avoid races like ABA behavior.</p>
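<p>Here is a simplified sketch of the steal path. It is not the full Chase–Lev algorithm (the owner-side pop and memory-ordering subtleties are elided, and the names are mine), but it shows why a CAS on a monotonically increasing top index sidesteps ABA:</p>

```javascript
// Hypothetical slot layout: [0]=top, [1]=bottom, [2..15]=14 task-id slots.
const sab = new SharedArrayBuffer(4 * 16);
const d = new Int32Array(sab);

// Owner-side push (bottom end; no contention on `bottom`):
function push(taskId) {
  const b = Atomics.load(d, 1);
  Atomics.store(d, 2 + (b % 14), taskId);
  Atomics.store(d, 1, b + 1); // publish after the slot is written
}

// Thief-side steal (top end; CAS on a monotonically increasing `top`):
function steal() {
  const t = Atomics.load(d, 0);
  const b = Atomics.load(d, 1);
  if (t >= b) return -1; // deque empty, nothing to steal
  const taskId = Atomics.load(d, 2 + (t % 14));
  // `top` only ever increases, so a successful compareExchange proves no
  // other thief claimed this slot in between.
  return Atomics.compareExchange(d, 0, t, t + 1) === t ? taskId : -1;
}

push(42);
push(7);
steal(); // returns 42: thieves take from the top (oldest) end
```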

<h2 id="things-to-improve--explore">Things to improve / explore</h2>
<p>Right now JS execution is not very deterministic, and if you notice, all the designs assume users will be passing and operating on basic data types. Almost all data crossing the boundary between the scheduler and the user is assumed to be one of the few basic data types available. Also, given that all values are read as Int32/Uint32, there’s a significant chance of overflow, since JS numbers won’t adhere to those ranges.
Also, the lack of continuation points stops this from being a true fiber/green-thread-driven scheduling environment. One option I’m playing around with is a small IR layer, where I inject yield() points into user code based on cycles.</p>

<p>I’m genuinely considering writing the core of the scheduler in WASM to get more deterministic performance, and exposure to features like stack switching <a href="https://github.com/WebAssembly/stack-switching">Here</a>.</p>

<p>This is still under heavy exploration and I’ll most probably be changing the design soon, as I learn more.</p>

<p>For a more detailed breakdown, link to <a href="https://github.com/Joymfl/Penny/blob/main/docs/architecture.md">architecture_doc</a></p>]]></content><author><name></name></author><category term="runtime" /><summary type="html"><![CDATA[Modern browsers give you workers but not threads, shared memory but not pointers, atomics but not fences, and message passing that copies everything. I wanted to see: Is it possible to build a real work-stealing scheduler — something like Cilk — inside these constraints? Penny is my attempt.]]></summary></entry></feed>