Victor's Devblog

Physics Performance

March 16^th, 2026 at 05:33

A Performance Problem

Profiling

After implementing some simple code to generate a racing track with a bezier curve, I noticed that I had some performance issues. Here is a recording (of what it looked like mid-performance improvements because I went a bit fast without recording and I am too lazy to git-revert my changes just to record what it really looked like).

Roughly, broad phase would take up to 2ms and narrow phase could take 1.5ms per sub-step with a total of up to 18ms where the map had a decent amount of colliding triangles. Reminder: physics must run below 10ms per update. Here is what a typical profile was:

Ideally, most physics thread updates would be below ~4ms with acceptable spikes near ~8ms, giving enough padding for lower end machines / very specific map locations with excessive amount of static colliders.

Recap of Physics Algorithm

Reminder from last week, here how physics is implemented in my engine:

// BROAD PHASE
for each static blocker
    if car aabb+padding collides with static collider aabb
        for each static collider's triangle [Issue #3]
            if car aabb+padding collides with triangle
                store triangle as broad phase candidate
// NARROW PHASE
for step 1 to 10
    for sub-step 1 to 4
        for each car's ellipsoid
            for each broad phase candidate triangle
                if ellipsoid approximately collides with triangle [Issue #1]
                    compute penetration distance
                    update deepest penetration data [Issue #2]
        apply forces per ellipsoid according to penetration data
        update car position and velocity

Issues

Issue #1:
Car has 8 ellipsoids, and in tight banked turns, we need a lot of triangles for smooth driving. This mean that broad phase may return a couple hundred of candidate triangles, totalling at least 10 steps 4 sub-steps 8 ellipsoids * 200 triangles = 64k ellipsoid-triangle intersection checks per update.

Issue #2:
Physics requires finding penetration distance with an accuracy of ~1e-5 for smooth collision restitution despite the extreme magnitude of elasticity force.
However computation of penetration distance is done by finding the up to 6 zeroes of a non-trivial function, through dichotomy. Each zero requires many evaluation of this function, and sometimes even more (I will explain below). Meaning computing deepest penetration distance is very expensive.

Issue #3:
When a static collider has a lot of triangle and collider's AABB intersect car's, broad phase will check on all of them which is a lot.

First Pass At Improving Performance

Collision Detection Between Ellipsoid And Triangle

Previous Algorithm

The algorithm I was using (and still am) consist in intersecting the sphere (max radius of ellipsoid) with the triangle's OBB (oriented bounding box). It does so by bringing the sphere into a space where a triangle OBB is trivial, for instance:

Since this must be done 10 x 4 x 8 times per triangle during the narrow phase, the broad phase is first converting triangle's (p0, p1, p2) vertices into a specific structure that makes this computation trivial. The later consists in finding a specific space transformation such that in this space, the triangle's vertices look like { (0, 0, 0), (x1, 0, 0), (x1, x2, 0) }.
This structure, which I called NormalizedTriangle, must then store:
- p, a vertex of the triangle, I was using p0
- dirX, the unit vector from p to next point, I was using p1
- dirY, the unit vector such that dirX ^ dirY = triangle's normal
- x1
- x2, y2

This uses a bit more memory but makes the math executed by narrow phase's sub-steps very simple:

auto const spherePositionLocal = spherePosition - p0;

auto const localX = dot(dirX, spherePositionLocal);
if ((localX + sphereRadius < min(0, x2)) || (max(x2, x2) < localX - sphereRadius))
    return false;

auto const localY = dot(dirY, spherePositionLocal);
if ((localY + sphereRadius < 0) || (y2 < localY - sphereRadius))
    return false;

auto const localZ = dot(cross(dirX, dirY), spherePositionLocal);
if (abs(localZ) > sphereRadius)
    return false;

return true;

And with this stored, we can still retrieve p1, p2, and n the triangle's normal, in world space (for eventually computing penetration distance if necessary):

auto const n = cross(dirX, dirY);
auto const p1 = p + dirX * x1;
auto const p2 = p + dirX * x2 + dirY * y2;

Improved Algorithm

Since I iterated a bit quickly on this I forgot how slow it was, but maybe about 20us per test? Nonetheless, I decided to implement 2 changes.

First, given the context (a game's physics, so most of the time things are not colliding together) performing the normal test first (abs(localZ) > sphereRadius) will discard most triangles immediately. This already halved execution time!

Imagine this is a picture showing time halved

Second, it is possible to chose a different reference point p when creating the NormalizedTriangle such that min(0, x2) = 0 and max(x1, x2) = x1 (this may sound stupid but it saves 2 min/max computation per test). To do so, just pick for p whichever pi that is the start of longest edge. This guarantees that 0 < x2 < x1. Again, saving a bit more (forgot exactly how much).

Finding Deepest Penetrating Point

Last Week's Recap With More Math

Last week we saw how given a point Pt (belonging to a triangle) inside an ellipsoid E, the closest point Pe on ellipsoid surface must satisfy that dir(Pt, Pe) is the normal of ellipsoid at Pe. If it was not the case, it means we could do a small step on plane (Pe, n) while still outside (since this plane is tangent to the convex ellipsoid shape at Pe) and projecting this new point P' on ellipsoid would give another point Pe' that is closer to Pt than Pe is.

Knowing ellipsoid's normal at P = (Px, Py, Pz) is colinear to (Px/rx², Py/ry², Pz/rz²) means that, for our solution Pe, there exist a real number t such that Pe-Pt = t · (Px/rx², Py/ry², Pz/rz²).
By combining this equation with ellipsoid's Px²/rx² + Py²/ry² + Pz²/rz² = 1, we have that t is a root of f with:

f(t) = Ptx² / (rx² · (1 + t / rx²)²) + Pty² / (ry² · (1 + t / ry²)²) + Pty² / (ry² · (1 + t / ry²)²) - 1
f'(t) = -2 · (Ptx² / (rx4 · (1 + t / rx²)³) + Pty² / (ry4 · (1 + t / ry²)³) + Ptz² / (rz4 · (1 + t / rz²)³))

(Again, using the 2-to-6 roots of f, we can retrieve 6 potential solution and we can pick the best as explained in last week's newsletter)

Now this is interesting because this function is negative at -Inf and +Inf and has up to 3 asymptots at -rx², -ry², and -rz²* (here only 2 when represented for a 2D ellipse):

(note we can always rotate space so that rx > ry > rz)

Those properties, and the fact f is continuous and f' is strictly positive in ]-Inf, -rx²[ and strictly negative in ]-rz², +Inf[ means there are at least two roots to f (within those two same intervals) that I will name tmin and tmax.

In addition, there may exist 2 additional roots in ]-rx², -ry²[ and 2 other in ]-ry², -rz²[ but this may not always be the case (see graph below). Since f" is strictly positive between an interval such as ]-rx², -ry²[ while f goes to +Inf at those two asymptots, the presence of the 2 roots depends on whether f manages to go negative within the interval. A good way to know if those roots exist it is to find the root tlow of f' in this interval (here where red curves crosses the X axis) which gives the lowest point of f. If f(tlow) is negative, we have 2 additional roots for f, one in ]-rx², tlow[ and one in ]tlow, -ry²[.

Previous Algorithm

So to find f smalest root tmin (same algorithm for tmax) what I implemented last week was to first find t0 such that f(t0) < 0 (I would just substract some distance from t=-rx² until f(t) < 0), and second to proceed by dichotomy:

t1 = -rx^2
while (t1 - t0) > epsilon:
    t = (t0 + t1) / 2
    if f(t) < 0:
        t0 = t
    else:
        t1 = t
return (t0 + t1) / 2

The strict positivity of f' in this interval guarantees this will converge.

Then to find the hypothetical other roots, I would first find tlow the root of f' in ]-rx², -ry²[ (same for ]-ry², -rz²[) and, if f(tlow) < 0 I would proceed just like for tmin and tmax.

The issue is that those algorithm converge to the solution rather slowly: each iteration we divide interval by 2, but since I need a small margin of error (that epsilon in algorithm above) it takes 15 to 16 iterations to reach an acceptable root.

Improved Algorithm

To start with, for tmin, the monotony of f (f'(t) > 0) and strict positivity of f" in ]-Inf, -rx²[ means that we can apply Newton's method in a way that is guaranteed to converge safely (stays in that interval) and rather quickly (this is orders of magnitude faster than the dichotomy method) for any starting point t such that f(t) > 0:

for i in 0..n:
    t = t - f(t) / f'(t)

And the good news is that f(-rx · (rx + |Ptx|)) > 0. Performing as little as 5 iterations is enough to achieve desired margin of error:

Obviously the same operation can be done for tmax.

Finally, for finding roots in in ]-rx², -ry²[ I use a similar algorithm but knowing the following:
- if tlow exists in this interval such that f(tlow) < 0 (i.e. we have 2 extra roots), then applying Newton's method directly to f in interval ]-rx², tlow[ from some initial t such that f(t) > 0 (e.g.: t = -rx · (rx - |Ptx|)) will converge to desired solution (same reasons: f > 0 and f" > 0 when such tlow exists)
- else either t will leave ]-rx², -ry²[ or Newton will not converge (f(t) > epsilon after ~5 steps) so there is no root in this interval

These changes done, computing deepest ellipsoid point inside triangle went from "noticeable on profiling capture" to "is it still even there?". Great news!

Broad Phase

Previous Algorithm

What I first implemented was very simple:
- each static collider entity has an AABB and a list of triangles
- each physics update, if static's AABB collides with car's AABB, iterate through its triangles and those that intersect with car's AABB will be tested in narrow phase
Nothing fancy, just a lot too much work for a reasonnable CPU.

Improved Algorithm (v1)

Ok this is where I hope to entertain again those that got lost with so much math. Inspired by this Amazing Sebastian Lague video on Ray Tracing I decided to implement a Bounding Volume Hiararchy (BVH) algorithm for my broad phase (could use for narrow phase too maybe?).

This means that for every static collider, at game's launch, I split its triangles into increasingly smaller AABBs. Then during broad phase, I can easily test my car's AABB against those expotentially shrinking AABB to only consider static triangles that have a chance to collide with the car.
The hardest part was to figure out how to implement BVH using only two std::vectors (to minimize memory allocation and improve cache coherence). Once done here is the results:

This drastically improved the performance of the broad phase from about 2ms to under 400us.

Another Performance Problem

Large OBB Issues

Earlier, during the recap of physics logic, I somewhat skipped another issues, can you spot it?

// BROAD PHASE
for each static blocker
    if car aabb+padding collides with static collider aabb
        for each static collider's triangle [Issue #3 => FIXED]
            if car aabb+padding collides with triangle
                store triangle as broad phase candidate
// NARROW PHASE
for step 1 to 10
    for sub-step 1 to 4
        for each car's ellipsoid
            for each broad phase candidate triangle
                if ellipsoid approximately collides with triangle [Issue #1 => FIXED]
                    compute penetration distance
                    update deepest penetration data [Issue #2 => FIXED]
        apply forces per ellipsoid according to penetration data
        update car position and velocity

Look at narrow phase: we are still iterating over every triangles found by broad phase for every physics' sub-step. And since broad phase's OBB check can collect up to 200 triangles in those sharp banked turns, this is still a lot of iteration.

The issue is that broad phase uses a large OBB (below in dark green) to handle the case where car would be going fast (up to 1000km/h). This ensures that if car moves a lot during the 10x4 physics sub-steps, triangles it would collide with are still detected.

However, at any of those sub-steps, car occupies a much smaller space and most of those static triangles have no chance of colliding with its ellipsoids.

Partitioning Space

The solution I came up with is to use another partitioning of space (like I did to improve broad phase) except this time I would use a sort of Binary Space Partitionning (BSP, very similary to BHV .. but actually it's not even a real BSP) where broad phase candidate triangles are place on either side of a plane (actually it's two planes of opposit normals, hopefully very close to one another, because otherwise it is very difficult to find a plane splitting a list of triangles in half). Here is an ideal example where a unique plane can be found:

Using separating planes instead of AABB hopefully (there is still some WIP) gives me better performances in non axis-aligned sections of a map like a tight turn: this allows, in theory, finding this diagonal plane that would better split triangles in two separated groups. However in practice it is almost impossible finding the perfect plane, here is the algorithm I went for:

RECURSIVELY (for each node of the BSP tree, root node initially containing all triangles from broad phase):
    find mean position of triangle centroids in that node
    compute covariance matrix of triangle centroids
    find n the primary eigen vector of covariance matrix
    // n is axis that triangle centroids are the most spread along
    order triangles along this axis
    split triangles into left & right nodes
    compute plane of normal -n and n containing triangles in left and right nodes

And here it is in action (hopefully this is gonna be today's satisfying video for you as much as it is for me):

As I mentionned, a first prototype of this is working but this is WIP. The way I perform this BSP is expensive (broadphase is back up to ~1.25ms) and the plane I chose is not always very adequate. For instance, if I have a road section split into thin triangles, their centroids will be split left and right along the width of the road but they will still span across its entire width: meaning the first few space splits performed by my BSP implementation will not really "split" anything (plane for which all "right" triangles are to the right of will still be at the very left of the road, and plane for which all "left" triangles are to the left of will still be at the very right of the road).

My goal is to maybe replace part of my broad phase BVH by this BSP but run it once on each static collider's triangles at the beginning of the game. This way I could spend more CPU time finding a better separating plane that would further reduce computation cost during gameplay. Speaking of gameplay, if you read this newsletter entirely you deserve some (pay attention to bottom right of the screen for a performance sneak peak):

Until next ^Sunday^, MOG well :)