Closing the gap, part 2: Probability and profitability

2026-01-26

Welcome back to the second post in this series looking at how we can improve the performance of RISC-V code from LLVM.

Previously in part 1 we looked at how we can use LNT to analyze performance gaps, then identified and fixed a missed fmsub.d opportunity during instruction selection, giving a modest 1.77% speedup on a SPEC CPU 2017 benchmark.

In this post we’ll be improving another SPEC benchmark by 7% by teaching the loop vectorizer to make smarter cost modelling decisions. It involves a relatively non-trivial analysis, but thanks to LLVM’s modular infrastructure we can do it in just a handful of lines of code. Let’s get started.

Analysis

Just like last time, all fruitful performance work begins by analysing some workloads. In the last post we had already run some comparisons of SPEC CPU 2017 benchmarks on LNT, so we can return to those results and pick another benchmark to focus on. Here’s one that’s 12% slower than GCC:

Screenshot of LNT showing the 531.deepsjeng_r benchmark being 12.14% slower on Clang vs GCC

531.deepsjeng_r is a chess engine that tied first in the World Computer Chess Championships back in 2009. It consists of a lot of bitwise arithmetic and complex loops, since the state of the game is encoded in 64-element arrays: one element for each square on the board. Unlike 508.namd_r from last time, there’s no floating point arithmetic.

Drilling into the profile and its list of functions, right off the bat we can see that one function is much slower on LLVM. On GCC qsearch(state_t*, int, int, int, int) makes up 9.1% of the overall cycles, but on LLVM it’s 16.1%. And if we click in on the function and view the cumulative total of cycles spent in user mode, Clang takes 74.6 billion cycles to do what takes GCC only 37.7 billion cycles.

Screenshot of Clang disassembly and GCC disassembly side by side, with inline total cumulative cycle annotations showing Clang taking 74.6 billion cycles and GCC taking 37.7
Left shows Clang taking 74.6 billion cycles, right shows GCC taking 37.7 billion.

So there’s probably something we can improve upon here, but it’s not immediately obvious from staring at the disassembly. qsearch is a pretty big function with a couple hundred instructions, so switching to the CFG view gives us a better overview.

On LLVM’s side we see the offending loop that’s consuming so many cycles: it’s long, vectorized, and completely if-predicated: there’s no control flow inside the loop itself. This is typical of a loop that’s been auto-vectorized by the loop vectorizer. If you look at the load and store instructions you can see that they are masked with the v0.t operand, stemming from the original control flow that was flattened.

Screenshot of the disassembly from Clang, showing a very hot block
with a lot of masked vector
instructions.

But on the GCC side there’s no equivalent vectorized loop. The loop is in there somewhere, but all the loops are still in their original scalar form with the control flow intact. And if we look at the edges coming from the loop headers, we can see that most of the time it visits one or two basic blocks and then branches back up to the header. Most of the blocks in the loop are completely cold.

Unfortunately the sources for deepsjeng aren’t publicly available so we can’t share them in this post, but the rough structure of the loop is something like this:

for (i = 0; i < N; i++) {
    if (foo[i] == a) {
        if (bar[i] == b) {
            if (baz[i] == c) {
                qux[i] = 123;
                // lots of work here...
            }
        }
    }
}

For any given iteration, it’s statistically unlikely that we enter the first if statement. It’s even more unlikely that the second if’s condition is also true. And even more so for the third nested if where we eventually have lots of work to compute.

In a scalar loop this doesn’t matter because if an if statement’s condition is false, then we don’t execute the code inside it. We just branch back to the start of the loop. But with a vectorized loop, we execute every single instruction regardless of the condition.

This is the core of the performance gap that we’re seeing versus GCC: because the majority of the work in this loop is so deeply nested in the control flow, it would have been better not to vectorize it, since vectorizing requires if-converting it.

Cost modelling

One of the hardest problems when making an optimizing compiler is knowing when an optimization is profitable. Some optimizations are a double-edged sword that can harm performance just as much as they can improve it (if not more), and loop vectorization falls squarely into this category. So rather than blindly applying optimizations at every opportunity, LLVM has detailed cost models for each target to estimate how expensive or cheap a certain sequence of instructions is, which it can then use to evaluate whether or not a transform will be a net positive.

It’s hard to overstate the amount of effort in LLVM spent fine tuning these cost models, applying various heuristics and approximations to make sure different optimizations don’t shoot themselves in the foot. In fact there are some optimizations like loop distribute that are in-tree but disabled by default due to the difficulty in getting the cost model right.

So naturally, we would expect that the loop vectorizer already has a sophisticated solution for the problem we’re seeing in our analysis: Given any predicated block that’s if-converted during vectorization, we would expect the scalar cost for that block to be made slightly cheaper because the scalar block may not always be executed. And the less likely it is to be executed, the cheaper it should be — the most deeply nested if block should be discounted more than the outermost if block.

So how does the loop vectorizer handle this?

/// A helper function that returns how much we should divide the cost of a
/// predicated block by. Typically this is the reciprocal of the block
/// probability, i.e. if we return X we are assuming the predicated block will
/// execute once for every X iterations of the loop header so the block should
/// only contribute 1/X of its cost to the total cost calculation, but when
/// optimizing for code size it will just be 1 as code size costs don't depend
/// on execution probabilities.
///
/// TODO: We should use actual block probability here, if available. Currently,
///       we always assume predicated blocks have a 50% chance of executing.
inline unsigned
getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
}

We’ve come across a load-bearing TODO here. Either the block is executed or it’s not, so it’s assumed to be a fifty-fifty chance.

On its own this hardcoded probability doesn’t seem like an unreasonable guess. But whilst 50% may be an accurate estimate as to whether or not a branch will be taken, it’s an inaccurate estimate as to whether or not a block will be executed. Assuming that a branch has a 1/2 chance of being taken, the most deeply nested block in our example ends up having a 1/2 * 1/2 * 1/2 = 1/8 chance of being executed.

for (i = 0; i < N; i++) {
    if (foo[i] == a) {
        // 1/2 chance of being executed
        if (bar[i] == b) {
            // 1/4 chance of being executed
            if (baz[i] == c) {
                // 1/8 chance of being executed
                // ...
            }
        }
    }
}

The fix to get the loop vectorizer to not unprofitably vectorize this loop will be to teach getPredBlockCostDivisor to take into account control flow between blocks.

It’s worth mentioning that a hardcoded constant managing to work well enough up until this point is the sign of a good trade-off: 1% of the effort for 90% of the benefit. A patch can go off the rails very easily by trying to implement too much in one go, so deferring the more complex cost modelling here till later was an astute choice. Incremental development is key to making progress upstream.

VPlan cost modeling

To get a better picture of how the loop vectorizer is calculating the cost for each possible loop, let’s start with a simplified LLVM IR reproducer:

; for (int i = 0; i < 1024; i++)
;   if (c0)
;     if (c1)
;       p1[p0[i]] = 0; // extra work to increase the cost in the predicated block

define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
entry:
  br label %loop

loop:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
  br i1 %c0, label %then.0, label %latch

then.0:
  br i1 %c1, label %then.1, label %latch

then.1:
  %gep0 = getelementptr i32, ptr %p0, i32 %iv
  %x = load i32, ptr %gep0
  %gep1 = getelementptr i32, ptr %p1, i32 %x
  store i32 0, ptr %gep1
  br label %latch

latch:
  %iv.next = add i32 %iv, 1
  %done = icmp eq i32 %iv.next, 1024
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

We can run opt -p loop-vectorize -debug on this example to see how the loop vectorizer decides if it’s profitable to vectorize the loop or not:

$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 3.
...
Cost of 1 for VF vscale x 4: induction instruction   %iv.next = add i32 %iv, 1
Cost of 0 for VF vscale x 4: induction instruction   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
Cost of 1 for VF vscale x 4: exit condition instruction   %done = icmp eq i32 %iv.next, 1024
Cost of 0 for VF vscale x 4: EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
Cost of 0 for VF vscale x 4: EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI vp<%5> = phi ir<0>, vp<%index.evl.next>
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%avl> = phi [ ir<1024>, vector.ph ], [ vp<%avl.next>, vector.body ]
Cost of 1 for VF vscale x 4: EMIT-SCALAR vp<%6> = EXPLICIT-VECTOR-LENGTH vp<%avl>
Cost of 0 for VF vscale x 4: vp<%7> = SCALAR-STEPS vp<%5>, ir<1>, vp<%6>
Cost of 0 for VF vscale x 4: CLONE ir<%gep0> = getelementptr ir<%p0>, vp<%7>
Cost of 0 for VF vscale x 4: vp<%8> = vector-pointer ir<%gep0>
Cost of 2 for VF vscale x 4: WIDEN ir<%x> = vp.load vp<%8>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: WIDEN-GEP Inv[Var] ir<%gep1> = getelementptr ir<%p1>, ir<%x>
Cost of 12 for VF vscale x 4: WIDEN vp.store ir<%gep1>, ir<0>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: EMIT vp<%index.evl.next> = add nuw vp<%6>, vp<%5>
Cost of 0 for VF vscale x 4: EMIT vp<%avl.next> = sub nuw vp<%avl>, vp<%6>
Cost of 0 for VF vscale x 4: EMIT vp<%index.next> = add nuw vp<%4>, vp<%0>
Cost of 0 for VF vscale x 4: EMIT branch-on-count vp<%index.next>, vp<%1>
Cost of 0 for VF vscale x 4: vector loop backedge
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%bc.resume.val> = phi [ ir<0>, ir-bb<entry> ]
Cost of 0 for VF vscale x 4: IR   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ] (extra operand: vp<%bc.resume.val> from scalar.ph)
Cost of 0 for VF vscale x 4: EMIT vp<%3> = logical-and ir<%c0>, ir<%c1>
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: vscale x 4.
LV: Minimum required TC for runtime checks to be profitable:0
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (vscale x 4) in nested.ll
LV: Vectorizing: innermost loop.
LEV: Unable to vectorize epilogue because no epilogue is allowed.
LV: Loop does not require scalar epilogue
LV: Loop does not require scalar epilogue
Executing best plan with VF=vscale x 4, UF=1

First we see it work out the cost of the original scalar loop, or as the vectorizer sees it, the loop with a vectorization factor (VF) of 1. It goes through each instruction calling into TargetTransformInfo, and arrives at a total scalar cost of 3. If you went through and manually summed up the individual instruction costs, though, you would have gotten a total cost of 4. The difference is that the load and store instructions belong to the predicated then.1 block, so they have their cost divided by 2 from getPredBlockCostDivisor.

For the vectorized loop, the loop vectorizer uses VPlan to cost the one plan for a range of different VFs1. VPlan is an IR specific to the loop vectorizer to help represent various vectorization strategies, which is why you see all the EMIT and WIDEN “recipes” in the output. It calculates a total cost for the loop and divides it by the estimated number of lanes — we’re working with scalable vectors on RISC-V so the target needs to make an estimate of what vscale is — and arrives at 2.1 per lane. There’s no predication discount applied here because it’s a vectorized loop. 2.1 is cheaper than 3, so it ultimately picks the vectorized loop.

BlockFrequencyInfo

Computing an accurate probability that a given block will be executed is a non-trivial task, but thankfully LLVM already has an analysis we can use for this called BlockFrequencyInfo.

BlockFrequencyInfo computes how often a block can be expected to execute relative to other blocks in a function. It in turn uses another analysis called BranchProbabilityInfo to work out how likely a branch to a specific block is going to be taken. And because BranchProbabilityInfo uses profiling information when available, it can give you much more accurate block frequencies when compiling with PGO. Otherwise it will fall back to guessing the probability of a branch being taken, which is just 50/50 a lot of the time, but sometimes influenced by interesting heuristics too: like the probability of icmp eq i32 %x, 0 is 0.375 instead of 0.5, and floats have a near zero chance of being NaN.

Plugging BlockFrequencyInfo into the loop vectorizer is straightforward, all we need to do is tell the pass manager that we want to access BlockFrequencyInfo from LoopVectorizePass:

PreservedAnalyses LoopVectorizePass::run(Function &F,
                                         FunctionAnalysisManager &AM) {
   ...
   BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
   ...
}

(BlockFrequencyAnalysis is the pass that computes the analysis result BlockFrequencyInfo, if you’re wondering why the names are different)

Then we can use it to look up the relative frequency of whichever block we care about and work out the probability of it being executed in the loop:

uint64_t LoopVectorizationCostModel::getPredBlockCostDivisor(
    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) {
  if (CostKind == TTI::TCK_CodeSize)
    return 1;

  uint64_t HeaderFreq =
      BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
  return HeaderFreq / BBFreq;
}

The frequencies returned from BlockFrequencyInfo are relative to the entry block of a function. So if a block has a frequency of 50 and the entry block has a frequency of 100, then you can expect that block to execute 50 times for every 100 times the entry block is executed.

You can use this to work out the probability of a block being executed within a function: in the example above, the block has a 50/100 = 50% chance of being executed every time the function is called. However this only works when the CFG has no loops: otherwise a block may be executed more times than the entry block and we’d end up with probabilities greater than 100%.

Calculating the probability of a block being executed inside a loop is still fine though, since the loop vectorizer currently only vectorizes inner-most loops2, i.e. loops that contain no other loops.

We can consider the frequencies of each block in the loop relative to the frequency of the header block. To give a brief loop terminology recap, the header is the first block inside the loop body which dominates all other blocks in the loop, and is the destination of all backedges. So the header is guaranteed to have a frequency greater than or equal to any other block in the loop — this invariant is important as we’ll see later.

A diagram showing off terminology for different parts of a loop

Then to calculate the probability of a block in a loop being executed, we divide the block frequency by the header frequency. To work out how much we should divide the cost of the scalar block by, we return the inverse of that.

Trying out this change on our sample loop, first we’ll see the debug output from BlockFrequencyInfo as it’s computed:

$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
block-frequency-info: nested
 - entry: float = 1.0, int = 562949953421312
 - loop: float = 32.0, int = 18014398509481984
 - then.0: float = 16.0, int = 9007199254740992
 - then.1: float = 8.0, int = 4503599627370496
 - latch: float = 32.0, int = 18014398509481984
 - exit: float = 1.0, int = 562949953421312

loop is the header block and then.1 is the nested if block, and with BlockFrequencyInfo’s frequency we get a probability of 8/32 = 0.25. So we would expect then.1’s scalar cost to be divided by 4:

...
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction:   br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 2.
...
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: 1.
LV: Vectorization is possible but not beneficial.

then.1’s scalar cost is now 2/4 = 0, so the total cost of the scalar loop is now 2 and the loop vectorizer no longer decides to vectorize. If we try this out on 531.deepsjeng_r, we can see that it no longer vectorizes that loop in qsearch either. Success!

Screenshot of LNT showing a 6.82% improvement on 531.deepsjeng_r

Running it again on LNT showed a ~7% speedup in execution time. Not quite as fast as GCC yet, but a welcome improvement for only a handful of lines of code.

Upstreaming

Now that we know the fix we want to land, we can start to think about how we want to upstream this into LLVM.

If we run llvm-lit --update-tests llvm/test/Transforms/LoopVectorize, we actually get quite a few unexpected test changes. One of the side effects of using BlockFrequencyInfo is that tail folded loops no longer discount the scalar loop if it wasn’t predicated to begin with. A tail folded loop is a loop where the scalar epilogue is folded into the vector loop itself by predicating the vector operations:

// non-tail folded loop:
// process as many VF sized vectors that fit in n
for (int i = 0; i < n - (n % VF); i += VF)
  x[i..i+VF] = y[i..i+VF];
// process the remaining n % VF scalar elements
for (int i = n - (n % VF); i < n; i++)
  x[i] = y[i];
// tail folded loop:
for (int i = 0; i < n; i += VF)
  x[i..i+VF] = y[i..i+VF] mask=[i<n, i+1<n, ..., i+VF-1<n];

However, because this block is technically predicated due to the mask on the vector instructions, the loop vectorizer applied getPredBlockCostDivisor to the scalar loop cost even when the original scalar loop had no control flow in its body. With BlockFrequencyInfo we can detect that a block with no surrounding control flow has a probability of 1 of being executed, so the scalar loop cost is no longer made cheaper than it should be. I split off and landed this change separately, since it makes the test changes easier to review.

Now that the remaining changes in llvm/test/Transforms/LoopVectorize looked more contained, I was almost ready to open a pull request. I just wanted to quickly kick the tyres on llvm-test-suite with a few other targets, since this wasn’t a RISC-V specific change. The plan was to quickly collect some stats on how many loops were vectorized, check for any anomalies when compared to beforehand, and then be on our way:

$ cd llvm-test-suite
$ ninja -C build
...
[222/7278] Building C object External/...nchspec/CPU/500.perlbench_r/src/pp.c.o
FAILED: External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o 
/root/llvm-test-suite/build.x86_64-ReleaseLTO-a/tools/timeit --summary External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.time /root/llvm-project/build/bin/clang -DDOUBLE_SLASHES_SPECIAL=0 -DNDEBUG -DPERL_CORE -DSPEC -DSPEC_AUTO_BYTEORDER=0x12345678 -DSPEC_AUTO_SUPPRESS_OPENMP -DSPEC_CPU -DSPEC_LINUX -DSPEC_LINUX_X64 -DSPEC_LP64 -DSPEC_SUPPRESS_OPENMP -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/dist/IO -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/Time-HiRes -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/HTML-Parser -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/ext/re -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/specrand -march=x86-64-v3 -save-temps=obj     -O3 -fomit-frame-pointer -flto -DNDEBUG   -w -Werror=date-time -save-stats=obj -save-stats=obj -fno-strict-aliasing -MD -MT External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -MF External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.d -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -c /root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /root/llvm-project/build/bin/clang-19 -cc1 -triple x86_64-unknown-linux-gnu -O3 -emit-llvm-bc -flto=full -flto-unit -save-temps=obj -disable-free -clear-ast-before-backend -main-file-name pp.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -relaxed-aliasing -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64-v3 -debugger-tuning=gdb -fdebug-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -fcoverage-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -resource-dir /root/llvm-project/build/lib/clang/23 -Werror=date-time -w -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -vectorize-loops -vectorize-slp -stats-file=External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.stats -faddrsig -fdwarf2-cfi-asm -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -x ir External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc
1.      Optimizer
2.      Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;switch-to-lookup;no-keep-loops;hoist-common-insts;no-hoist-loads-stores-with-cond-faulting;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;no-switch-to-lookup;keep-loops;no-hoist-common-insts;hoist-loads-stores-with-cond-faulting;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc"
3.      Running pass "loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>" on function "Perl_pp_coreargs"
 #0 0x0000556ff93ab158 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/root/llvm-project/build/bin/clang-19+0x2d5c158)
 #1 0x0000556ff93a8835 llvm::sys::RunSignalHandlers() (/root/llvm-project/build/bin/clang-19+0x2d59835)
 #2 0x0000556ff93abf01 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007f305ce49df0 (/lib/x86_64-linux-gnu/libc.so.6+0x3fdf0)
 #4 0x0000556ffaa0dbfb llvm::LoopVectorizationCostModel::expectedCost(llvm::ElementCount) (/root/llvm-project/build/bin/clang-19+0x43bebfb)
 #5 0x0000556ffaa22a0d llvm::LoopVectorizationPlanner::computeBestVF() (/root/llvm-project/build/bin/clang-19+0x43d3a0d)
 #6 0x0000556ffaa36f3b llvm::LoopVectorizePass::processLoop(llvm::Loop*) (/root/llvm-project/build/bin/clang-19+0x43e7f3b)
 #7 0x0000556ffaa413eb llvm::LoopVectorizePass::runImpl(llvm::Function&) (/root/llvm-project/build/bin/clang-19+0x43f23eb)
 ...
...

A crash when building for X86. No assertion message, but a backtrace that points to the loop vectorizer cost model. Unfortunately this did not turn out to be simple to debug and instead turned into a whole other ordeal, so I’ll leave the details of that rabbit hole to the next post. But in the meantime, here are some hints if you want to guess what went wrong:

  • The crash stems from a SIGFPE signal
  • It only occurs when building on X86. Building on AArch64 is unaffected, even when cross-compiling to X86
  • It only occurs with LTO

Hopefully this also gives a bit of insight into the type of upstream work that we carry out at Igalia. If you have an LLVM or RISC-V project that we could help with, feel free to reach out.

  1. The scalar loop is also modeled in VPlan, but currently costed with the legacy cost model rather than the VPlan itself. This is another load-bearing TODO.

  2. Whilst not enabled by default, there is experimental support for outer loop vectorization in the VPlan native path.

Closing the LLVM RISC-V gap to GCC, part 1

2025-12-10

At the time of writing, GCC beats Clang on several SPEC CPU 2017 benchmarks on RISC-V1:

LNT results comparing GCC and Clang

LLVM developers upstream have been working hard on the performance of generated code, in every part of the pipeline from the frontend all the way through to the backend. So when we first saw these results we were naturally a bit surprised. But as it turns out, the GCC developers have been hard at work too.

Sometimes a bit of healthy competition isn’t a bad thing, so this blog post is the first in a series looking at the work going on upstream to improve performance and catch up to GCC.

Please note that this series focuses on RISC-V. Other targets may have more competitive performance but we haven’t measured them yet. We’ll specifically be focusing on the high-performance application processor use case for RISC-V, e.g. compiling for a profile like RVA23. Unfortunately, since we don’t have access to RVA23-compatible hardware just yet, we’ll be benchmarking on a SpacemiT X60-powered Banana Pi BPI-F3 with -march=rva22u64_v. We don’t want to use -mcpu=spacemit-x60 since we want to emulate a portable configuration that an OS distribution might compile packages with. And we want to include the vector extension, as we’ll see in later blog posts that optimizations like auto-vectorization can have a major impact on performance.

Where to start?

It goes without saying that a vague task like “make LLVM faster” is easier said than done. The first thing is to find something to make fast, and while you could read through the couple dozen million lines of code in LLVM until inspiration strikes, it’s generally easier to start the other way around by analyzing the code it generates.

Sometimes you’ll get lucky by just stumbling across something that could be made faster when hacking or poring through generated assembly. But there’s an endless amount of optimizations to be implemented and not all of them are equally impactful. If we really want to make large strides in performance we need to take a step back and triage what’s actually worth spending time on.

LNT, LLVM’s nightly testing infrastructure, is a great tool for this task. It’s both a web server that allows you to analyze benchmark results, and a command line tool to help run the benchmarks and submit the results to said web server.

As the name might imply, it’s normally used for detecting performance regressions by running benchmarks daily with the latest revision of Clang, flagging any tests that may have become slower or faster since the last revision.

But it also allows you to compare benchmark results across arbitrary configurations. You can run experiments to see what effects a flag has, or see the difference in performance on two pieces of hardware.

Moreover, you can pass in different compilers. In our case, we can do two “runs” with Clang and GCC. Here’s how we would kick these off:

for CC in clang riscv64-linux-gnu-gcc
do
  # --toolchain and --remote-host cross-compile and run on another
  # machine over ssh; --exec-multisample and --run-under fight noise by
  # running each benchmark 3 times pinned to the same core; --submit
  # sends the results to a web server for easy viewing.
  lnt runtest test-suite bpi-f3-rva22u64_v-ReleaseLTO \
    --sandbox /var/lib/lnt/ \
    --test-suite=path/to/llvm-test-suite \
    -DTEST_SUITE_SPEC2017_ROOT=path/to/cpu2017 \
    --cc=$CC \
    --cflags="-O3 -flto -march=rva22u64_v" \
    --cxxflags="-O3 -flto -march=rva22u64_v" \
    --benchmarking-only \
    --build-threads=16 \
    --toolchain=rva22u64_v.cmake \
    --remote-host=bpi-f3 \
    --exec-multisample=3 \
    --run-under="taskset -c 5" \
    --submit=https://mylntserver.com/submitRun
done

This command does a lot of heavy lifting. First off it invokes CMake to configure a new build of llvm-test-suite and SPEC CPU 2017 with -O3 -flto -march=rva22u64_v. But because compiling the benchmarks on the Banana Pi BPI-F3 would be painfully slow, we’ve specified a CMake toolchain file to cross-compile to riscv64-linux-gnu from an x86-64 build machine. Here’s what the toolchain file looks like:

# rva22u64_v.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(ARCH riscv64)

If you’ve got your cross toolchain sysroots set up in the right place in /usr/riscv64-linux-gnu/, things should “just work” and CMake will magically build RISC-V binaries. On Debian distros you can install the *-cross packages for this:

$ apt install libc6-dev-riscv64-cross libgcc-14-dev-riscv64-cross \
    libstdc++-12-dev-riscv64-cross

(You could also use mmdebstrap, or see Alex Bradbury’s guide to this)

After the benchmarks are built it rsyncs over the binaries to the remote machine, and then sshes into it to begin running the benchmarks. It expects the sandbox path where the binaries are built to exist at the same location on both the host and the remote, so something like /var/lib/lnt should work across both. The BPI-F3 can also produce some noisy results, so the --exec-multisample=3 and --run-under="taskset -c 5" flags tell it to run the benchmarks multiple times and pin them to the same core.

Finally it generates a report.json file and submits it to the web server of choice. Navigate to the web interface and you’ll be shown two “machines”, LNT’s parlance for a specific combination of hardware, compiler and flags. You should see something like: bpif3-rva22u64_v-ReleaseLTO__clang_DEV__riscv64 and bpif3-rva22u64_v-ReleaseLTO__gcc_DEV__riscv64. Clicking into one of these machines will allow you to compare it against the other.

LNT UI for comparing results across two machines

Profiling

Once on the LNT web interface you’ll be presented with a list of benchmarks with a lot of red percentages beside them. We now know which benchmarks are slower; next we need to know why. We need to profile these benchmarks to see where all the cycles are spent and to figure out what Clang is doing differently from GCC.

LNT makes this easy: all you need to do is add --use-perf=profile to the lnt runtest invocation and it will perform an additional run of each benchmark wrapped in perf record. Profiling impacts run time, so it runs separately to avoid interfering with the final results. If you want to override the default events that are sampled, you can specify them with --perf-events=cycles:u,instructions:u,....

LNT will take care of copying back the collected profiles to the host machine and encoding them in the report, and in the web interface you’ll notice a “Profile” button beside the benchmark. Click on that and you’ll be brought to a side by side comparison of the profiles from the two machines:

LNT UI for comparing profiles

From here you can dive in and see where the benchmark spends most of its time. Select a function from the dropdown, choosing one with a particularly high percentage: this is how much of the active counter (chosen in the top right, e.g. cycles or instructions) the function accounts for overall. Then do the same for the other run and you’ll be presented with the disassemblies side by side below. Most importantly, information about the counters is displayed inline with each instruction, much like the output of perf annotate.

You might find the per-instruction counter cycle information to be a bit too fine-grained, so personally I like to use the “Control-Flow Graph” view mode in the top left. This groups the instructions into blocks and lets you see which blocks are the hottest. It also shows the edges between branches and their destinations which makes identifying loops a lot easier.

A real example

Let’s take a look at how we can use LNT’s web interface to identify something that GCC does but Clang doesn’t (but should). Going back to the list of SPEC benchmark results we can see 508.namd_r is about 17% slower, so hopefully we’ll find something to optimize in there.

Jumping into the profile we can see there’s a bunch of functions that all contribute a similar amount to the runtime. We’ll just pick the hottest one at 14.3%, ComputeNonbondedUtil::calc_pair_energy_fullelect(nonbonded*). It’s a pretty big function, but in GCC’s profile 71% of the dynamic instruction count comes from this single, albeit large, block.

A hot block in the profile for GCC's 508.namd_r

Looking at Clang’s profile on the opposite side we see a similar block that accounts for 85% of the function’s instruction count. This slightly higher proportion is a small hint that the block Clang is producing is sub-optimal. If we take the hint and stare at it for long enough, one thing that stands out is that Clang generates a handful of fneg.d instructions which GCC doesn’t:

	fneg.d  fa0, fa0
	fneg.d  ft0, ft0
	fneg.d  ft2, ft2
	fmul.d  fa3, ft5, fa3
	fmul.d  fa0, fa3, fa0
	fmul.d  ft0, fa3, ft0
	fmul.d  fa3, fa3, ft2
	fmadd.d fa2, fa4, fa2, fa0
	fmadd.d ft6, fa4, ft6, ft0
	fmadd.d fa4, fa4, ft1, fa3

fneg.d rd, rs1 negates a double and fmul.d multiplies two doubles. fmadd.d rd, rs1, rs2, rs3 computes (rs1*rs2)+rs3, so here we’re doing some calculation like (a*b)+(c*-d).

These fneg.ds and fmadd.ds are missing on GCC. Instead it emits fmsub.d, which is entirely absent from the Clang code:

	fmul.d fa1,fa4,fa1
	fmul.d ft10,fa4,fa5
	fmsub.d ft10,ft7,fa0,ft10
	fmsub.d fa5,ft7,fa5,fa1
	fmul.d fa1,fa4,fa1
	fmsub.d fa1,ft7,fa0,fa1

fmsub.d rd, rs1, rs2, rs3 computes (rs1*rs2)-rs3, so GCC is instead doing something like (a*b)-(c*d), and in doing so avoids the need for the fneg.d. This sounds like a missed optimization in LLVM, so let’s take a look at fixing it.

Writing the (right) fix

The LLVM RISC-V scalar backend is pretty mature at this stage so it’s surprising that we aren’t able to match fmsub.d. But if you take a look in RISCVInstrInfoD.td, you’ll see that the pattern already exists:

// fmsub: rs1 * rs2 - rs3
def : Pat<(any_fma FPR64:$rs1, FPR64:$rs2, (fneg FPR64:$rs3)),
          (FMSUB_D FPR64:$rs1, FPR64:$rs2, FPR64:$rs3, FRM_DYN)>;

We’ll need to figure out why this pattern isn’t getting selected, so let’s start by extracting the build commands so we can look under the hood and dump the LLVM IR:

$ cmake -B build -C cmake/caches/ReleaseLTO.cmake --toolchain=...
$ ninja -C build 508.namd_r -t clean
$ ninja -C build 508.namd_r -v
...
[44/45] : && llvm-project/build.release/bin/clang++ --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r

This is an LTO build so the code generation step is actually happening at link time. To dump the IR we can copy and paste the link command from the verbose output and append -Wl,--save-temps to it, which in turn tells the Clang driver to pass --save-temps to the linker [2].

$ llvm-project/build.release/bin/clang++ -Wl,--save-temps --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r 
$ ls External/SPEC/CFP2017rate/508.namd_r/508.namd_r*
External/SPEC/CFP2017rate/508.namd_r/508.namd_r
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.0.preopt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.2.internalize.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.4.opt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.5.precodegen.bc

The bitcode is dumped at various stages, and 508.namd_r.0.5.precodegen.bc is the particular stage we’re looking for. This is after all the middle-end optimisations have run, and is as close as we’ll get before the backend begins. It contains the bitcode for the entire program though, so let’s find the symbol for the C++ function and extract just the corresponding LLVM IR function:

$ llvm-objdump -t 508.namd_r | grep calc_pair_energy_fullelect
...
000000000004562e l     F .text	0000000000001c92 _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded
$ llvm-extract -f 508.namd_r.0.5.precodegen.bc --func _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded \
  | llvm-dis > calc_pair_energy_fullelect.precodegen.ll

Now quickly grep the disassembled LLVM IR to see if we can find the source of the fnegs:

  %316 = fneg double %315
  %neg = fmul double %mul922, %316
  %317 = tail call double @llvm.fmuladd.f64(double %mul919, double %314, double %neg)

This looks promising. We have a @llvm.fmuladd that’s being fed by a fmul of a fneg, which is similar to the (a*b)+(c*-d) pattern in the resulting assembly. But looking back to our TableGen pattern for fmsub.d, we want (any_fma $rs1, $rs2, (fneg $rs3)), i.e. a llvm.fmuladd fed by a fneg of a fmul.

One thing about floating point arithmetic is that whilst it’s generally not associative, we can hoist out the fneg from the fmul since all negation does is flip the sign bit. So we can try to teach InstCombine to hoist the fneg outwards like (fmul x, (fneg y)) -> (fneg (fmul x, y)). But if we try that out we’ll see that InstCombine already does the exact opposite:

Instruction *InstCombinerImpl::visitFNeg(UnaryOperator &I) {
  Value *Op = I.getOperand(0);
  // ...
  Value *OneUse;
  if (!match(Op, m_OneUse(m_Value(OneUse))))
    return nullptr;

  if (Instruction *R = hoistFNegAboveFMulFDiv(OneUse, I))
    return replaceInstUsesWith(I, R);
  // ...
}

Instruction *InstCombinerImpl::hoistFNegAboveFMulFDiv(Value *FNegOp,
                                                      Instruction &FMFSource) {
  Value *X, *Y;
  if (match(FNegOp, m_FMul(m_Value(X), m_Value(Y)))) {
    // Push into RHS which is more likely to simplify (const or another fneg).
    // FIXME: It would be better to invert the transform.
    return cast<Instruction>(Builder.CreateFMulFMF(
        X, Builder.CreateFNegFMF(Y, &FMFSource), &FMFSource));
  }

InstCombine usually has good reasons for canonicalizing certain IR patterns, so we need to seriously reconsider if we want to change the canonical form. InstCombine affects all targets and it could be the case that some other backends have patterns that match (fmul x, (fneg y)), in which case we don’t want to disturb them. However for RISC-V we know what our patterns for instruction selection are and what form we want our incoming IR to be in. So a much better place to handle this is in RISCVISelLowering.cpp, which lets us massage it into shape at the SelectionDAG level, in a way that’s localized to just our target. “Un-canonicalizing” the IR is a common task that backends end up performing, and this is what the resulting patch ended up looking like:

--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -20248,6 +20248,17 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
       return V;
     break;
   case ISD::FMUL: {
+    using namespace SDPatternMatch;
+    SDLoc DL(N);
+    EVT VT = N->getValueType(0);
+    SDValue X, Y;
+    // InstCombine canonicalizes fneg (fmul x, y) -> fmul x, (fneg y), see
+    // hoistFNegAboveFMulFDiv.
+    // Undo this and sink the fneg so we match more fmsub/fnmadd patterns.
+    if (sd_match(N, m_FMul(m_Value(X), m_OneUse(m_FNeg(m_Value(Y))))))
+      return DAG.getNode(ISD::FNEG, DL, VT,
+                         DAG.getNode(ISD::FMUL, DL, VT, X, Y));
+

And if we rebuild our benchmark after applying it, we can see the fmsub.ds getting matched, saving a couple of instructions:

@@ -983,18 +983,15 @@
        fld     ft2, 48(a5)
        fld     ft3, 64(a5)
        fld     ft4, 72(a5)
-       fneg.d  fa0, fa0
-       fneg.d  ft0, ft0
-       fneg.d  ft2, ft2
        fmul.d  fa3, ft5, fa3
        fmul.d  fa0, fa3, fa0
        fmul.d  ft0, fa3, ft0
        fmul.d  fa3, fa3, ft2
        fld     ft2, 0(s1)
        fmul.d  fa4, ft5, fa4
-       fmadd.d fa2, fa4, fa2, fa0
-       fmadd.d ft6, fa4, ft6, ft0
-       fmadd.d fa4, fa4, ft1, fa3
+       fmsub.d fa2, fa4, fa2, fa0
+       fmsub.d ft6, fa4, ft6, ft0
+       fmsub.d fa4, fa4, ft1, fa3

All in all this ended up giving a 1.77% improvement in instruction count for the 508.namd_r benchmark. It’s still not nearly as fast as GCC, but we’re a little bit closer than before we started.

What’s next?

Hopefully this has given you an overview of how to identify opportunities for optimization in LLVM, and what a typical fix might look like. The analysis is really the most important part, but if you don’t feel like setting up an LNT instance locally yourself, Igalia runs one at cc-perf.igalia.com [3]. We run llvm-test-suite and SPEC CPU 2017 nightly, built with Clang and GCC, on a small set of RISC-V hardware [4], which we hope to expand in future. Feel free to use it to investigate some of the differences between Clang and GCC yourself, and maybe you’ll find some inspiration for optimizations.

In the next post in this series I’ll talk about a performance improvement that recently landed related to cost modelling.

  1. Compiled with -march=rva22u64_v -O3 -flto, running the train dataset on a 16GB Banana Pi BPI-F3 (SpacemiT X60), with GCC and Clang from ToT on 2025-11-25. 

  2. LLD in this case, configurable through CMake with -DCMAKE_LINKER_TYPE=LLD

  3. The LLVM foundation is also in the process of rebooting its canonical public server, which should hopefully be up and running in the coming months. 

  4. Currently it consists of a few Banana Pi BPI-F3s and some HiFive Premier P550s, the latter of which were generously donated by RISC-V International. 

How to land a change to LLVM in 20 easy patches
2024-07-17 (http://lukelau.me/2024/07/17/how-to-land-a-change-to-llvm-in-20-easy-patches)

A few months ago Piyou Chen and I landed a sizeable change to how the LLVM RISC-V backend generates code for the vector extension.

The gist is that vector instructions on RISC-V don’t encode the vector length or element type in the instruction itself, but instead read them from two registers vl and vtype that are configured with vset[i]vli:

# setup vl and vtype so we're working with <a0 x i32> vectors
vsetvli zero, a0, e32, m1, ta, ma
vadd.vv v8, v10, v12

So in LLVM we have a pass called RISCVInsertVSETVLI that inserts the necessary vset[i]vlis so that vl and vtype are set up correctly for each pseudo instruction:

%a = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[RISCVInsertVSETVLI]-->

$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, implicit $vl, implicit $vtype

Previously this was run before register allocation on virtual registers, but it had some drawbacks:

  1. PseudoVSETVLI implicitly defines $vl and $vtype whilst the pseudos implicitly use them, which turns every PseudoVSETVLI into a scheduling boundary. For example, we can’t move this high-latency vle32.v before the vadd.vv because they have different vtypes:

    $x0 = PseudoVSETVLI ... /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
    %x = PseudoVADD_VV_M2 ..., implicit $vl, implicit $vtype
    /* Can't move the vle32.v up past here! */
    $x0 = PseudoVSETVLI ... /*e32, m8, ta, ma*/, implicit-def $vl, implicit-def $vtype
    %y = PseudoVLE32_V_M8 ..., implicit $vl, implicit $vtype
    
  2. It’s not practically possible to emit further vector pseudos during or after register allocation, since we would need to keep track of the current vl and vtype and potentially emit new vsetvlis. This is a blocker for vector rematerialization and partial spilling: we’re stuck using whole register loads and stores when spilling, since they don’t use vl or vtype, even though we might not need an entire vector register for a small vector.

The solution we ended up with was to insert the vsetvlis after register allocation instead:

%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[Allocate vector registers]-->

$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, %avl, 5 /*e32*/, 3 /*ta, ma*/

--[RISCVInsertVSETVLI]-->

$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, implicit $vl, implicit $vtype

--[Allocate scalar registers]-->

...

The trick here is that we split register allocation in two, doing one pass for vectors first and then a second one for scalars. This means that RISCVInsertVSETVLI can be run in between the two, operating on physical vector registers whilst still being able to create virtual scalar registers (it might need to write to a register to set the AVL to VLMAX: vsetvli a0, x0, ...).

The other big change needed was to update RISCVInsertVSETVLI to operate on LiveIntervals now that it was out of SSA form. This isn’t so much an issue with the physical vector registers, but doing the dataflow analysis on the scalar AVL registers proved much trickier.

These changes are interesting in their own right, but in this blog post I want to focus on the logistics that went into landing them instead. For reference, there were only around 500 lines in the initial version of the patch:

$ git diff --stat --relative -- lib
 lib/CodeGen/LiveDebugVariables.cpp      |   2 +-
 lib/CodeGen/LiveDebugVariables.h        |  68 ------
 lib/CodeGen/RegAllocBasic.cpp           |   2 +-
 lib/CodeGen/RegAllocGreedy.cpp          |   2 +-
 lib/CodeGen/VirtRegMap.cpp              |   2 +-
 lib/Target/RISCV/RISCVInsertVSETVLI.cpp | 394 ++++++++++++++++++++++++--------
 lib/Target/RISCV/RISCVTargetMachine.cpp |  14 +-
 7 files changed, 318 insertions(+), 166 deletions(-)

But getting this into the main branch took 20-something patches: easily one of the longest-running efforts I’ve been a part of during my time at Igalia.

Putting the test diffs front and centre

As you might expect with a significant change like this, it was initially put behind a flag that was disabled by default, hence why the initial patch was so small. Enabling it by default revealed how nearly every bit of vector code under the sun ended up being affected:

$ git diff --stat --relative -- test
test/CodeGen/RISCV/O0-pipeline.ll                  |    2 +-
 test/CodeGen/RISCV/O3-pipeline.ll                  |    2 +-
 .../early-clobber-tied-def-subreg-liveness.ll      |   18 +-
 test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll   |   26 +-
 
 ...
 
 test/CodeGen/RISCV/rvv/vssubu-vp.ll                |    4 +-
 test/CodeGen/RISCV/rvv/vtrunc-vp.ll                |   22 +-
 test/CodeGen/RISCV/rvv/vuitofp-vp.ll               |   28 +-
 test/CodeGen/RISCV/rvv/vxrm-insert.ll              |   24 +-
 test/CodeGen/RISCV/rvv/vxrm.mir                    |    6 +-
 test/CodeGen/RISCV/rvv/vzext-vp.ll                 |    2 +-
 test/CodeGen/RISCV/spill-fpr-scalar.ll             |   24 +-
 test/CodeGen/RISCV/srem-seteq-illegal-types.ll     |    4 +-
 331 files changed, 14868 insertions(+), 14079 deletions(-)

Since we ultimately wanted post-register allocation RISCVInsertVSETVLI to be the only configuration going forward, we would have to confront these test diffs at some point. Disabling it behind a flag would have just kicked the massive diff further down the road to a separate patch that enables it by default.

Landing the changes enabled by default also keeps the code changes and test diffs in sync, which makes it easier to correlate the two when reading through the commit history. And in our case it was better to review the test diffs whilst the mental model of the code was still fresh in our heads.

We still kept the flag since it’s useful to have a fail-safe with risky changes like these, but we switched the flag over to be enabled by default and were now left with a 30k-ish line test diff.

A large test diff isn’t necessarily unreviewable provided the changes are boring, but unfortunately for us our changes were actually pretty interesting:

diff --git a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
index 29ce7c52e8fd..6fc3e3917a5c 100644
--- a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
+++ b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
@@ -352,15 +352,13 @@ entry:
 define <vscale x 1 x double> @test18(<vscale x 1 x double> %a, double %b) nounwind {
 ; CHECK-LABEL: test18:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    vsetivli zero, 6, e64, m1, tu, ma
-; CHECK-NEXT:    vmv1r.v v9, v8
-; CHECK-NEXT:    vfmv.s.f v9, fa0
-; CHECK-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfadd.vv v8, v8, v8
+; CHECK-NEXT:    vsetivli a0, 6, e64, m1, ta, ma
+; CHECK-NEXT:    vfadd.vv v9, v8, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e64, m1, tu, ma
 ; CHECK-NEXT:    vfmv.s.f v8, fa0
+; CHECK-NEXT:    vfmv.s.f v9, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfadd.vv v8, v9, v8
+; CHECK-NEXT:    vfadd.vv v8, v8, v9
 ; CHECK-NEXT:    ret

In this one diff we have registers being renamed, instructions being shuffled around, and different vsetvlis being inserted. In other places we were seeing changes to the number of spills, sometimes more and sometimes less.

All of these observable side effects are intertwined with the multiple things we needed to change in our patch. And when there are multiple functional changes in a patch, it’s not obvious which line of code caused which line of the test diff to change. Compound that over 300-something test files and you have a patch that no one will realistically be able to review.

Splitting

The main tool for dealing with these massive patches is to split it up so you’re only changing one thing at a time. The first such change split off was the register allocation split:

  • [RISCV] Split regalloc between RVV and other (#72096)
  • [RISCV] default enable splitting regalloc between RVV and other (#72950)

This was nice and self-contained, and the test diff was only around 130 lines. It was a natural first choice to split out from the patch since it was tangential to RISCVInsertVSETVLI itself, and even though it didn’t make a dent in the overall test diff, we could rule out the register allocation split as one of the forces at play. Getting it out of the way early allowed us to stop thinking about it and save our mental effort for other things.

NFCs

The next step was to go ahead and carve out the NFC changes: No Functional Change changes. This is a common practice in LLVM and other projects where changes that don’t (intentionally) affect any observable behaviour are landed as patches labelled as NFC: things like refactoring or adding test cases.

Knowing that a patch isn’t supposed to have functional changes helps reasoning about it, and if it did end up accidentally changing some behaviour (which happens all the time) then it’s easier to bisect. We were able to find a good few NFC bits from the initial patch:

  • [NFC][LLVM][CodeGen] Move LiveDebugVariables.h into llvm/include/llvm/CodeGen (#88374)
  • [RISCV] Add statistic support for VSETVL insertion pass (#78543)
  • [RISCV] Check dead flag on VL def op in RISCVCoalesceVSETVLI. NFC (#91168)
  • [RISCV] Use an enum for demanded LMUL in RISCVInsertVSETVLI. NFC (#92513)
  • [NFC][RISCV] Keep AVLReg define instr inside VSETVLInfo (#89180)

Some of these NFC patches were just a matter of tightening existing assumptions about the code by adding invariants:

  • [RISCV] Add invariants that registers always have definitions. NFC (#90587)

Whilst others were about removing configurations to reduce the number of code paths:

  • [RISCV] Remove -riscv-insert-vsetvl-strict-asserts flag (#90171)

And as the code diff became smaller, some of the edge cases became more apparent. We refactored things to make them more explicit, which made it more clear when we were dealing with them in the main patch:

  • [RISCV] Split out VSETVLIInfo AVL states to be more explicit (#89964)

All of the above had the same goal: doing the thinking up front to simplify things further down the line. But whilst it did make the code changes easier to reason about, we still had the very large, very complicated test diff to deal with.

Absorbing test diffs

At some point the NFC patches started to dry up, and we had to start tweezing apart the tangle of changes in the test diff.

Ideally each patch would contain only one functional change, but that’s easier said than done, especially on a project like LLVM with so many moving parts. Often you’ll need to touch multiple passes to make one codegen change, and a lot of the time these changes will be interlinked: you need to change something in pass A, but on its own it results in regressions until you land a change in pass B.

We began by setting those tricky changes aside, and instead looked for things that could be landed as isolated, incremental, net-positive (or at worst neutral) changes.

  • [RISCV] Convert implicit_def tuples to noreg in post-isel peephole (#91173)
  • [RISCV] Move RISCVDeadRegisterDefinitions to post vector regalloc (#90636)

Some of the test diffs were due to deficiencies in the existing code that just happened to surface when combined with the changes in our patch. For example, changes in the scheduling showed how we sometimes failed to merge vsetvlis when they were in a different order.

  • [RISCV] Unify getDemanded between forward and backwards passes in RISCVInsertVSETVLI (#92860)

Shuffling up the pipeline

A significant chunk of the test diffs were due to RISCVInsertVSETVLI being moved after other passes. For example, moving it past RISCVInsertReadWriteCSR and RISCVInsertWriteVXRM meant that vsetvlis were now inserted after csrwis and fsrmis.

diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 5b271606f08ab..aa11e012af201 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -15,8 +15,8 @@ define <vscale x 1 x half> @vp_ceil_vv_nxv1f16(<vscale x 1 x half> %va, <vscale
 ; CHECK-NEXT:    vfabs.v v9, v8, v0.t
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, mu
 ; CHECK-NEXT:    vmflt.vf v0, v9, fa5, v0.t
-; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    fsrmi a0, 3
+; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfcvt.x.f.v v9, v8, v0.t
 ; CHECK-NEXT:    fsrm a0
 ; CHECK-NEXT:    vfcvt.f.x.v v9, v9, v0.t

We could take these relatively boring 9000-ish lines of test diff separately by moving RISCVInsertVSETVLI just after these passes.

  • [RISCV] Move RISCVInsertVSETVLI after CSR/VXRM passes (#91701)

This was one of the patches where the test diff wasn’t actually a win: it was just important to be able to understand the changes in isolation.

Splitting the pass itself

Another trick we used to isolate the diffs was to split the pass itself and move part of it past register allocation.

  • [RISCV] Separate doLocalPostpass into new pass and move to post vector regalloc (#88295)

This let us absorb the diffs for a specific function, doLocalPostpass, which were small enough to be individually reviewable.

This was unfortunately one of the functional changes that introduced a temporary regression, which wouldn’t be fixed until the rest of the work was landed. We justified it though since we were able to understand where it was coming from and how the final patch would fix it.

Big change, small diff/Small change, big diff

At this stage, the last remaining change was to do the actual move to after register allocation, which meant changing the pass to work on LiveIntervals since we would be out of SSA.

At first glance this appeared to be one indivisible chunk of work, but it turned out we could defer the changes from the pass reordering till later by only moving it up past phi elimination. That was enough to get us out of SSA and into LiveIntervals, which was ultimately the tricky bit that we wanted to be sure we got right.

  • [RISCV] Move RISCVInsertVSETVLI to after phi elimination (#91440)

This patch had most of the code changes, but a minimal test diff of around 300 lines. Being able to land the riskiest part with a small test diff allowed us to see the code that ended up being affected, which would have been completely hidden otherwise in the original 30k line diff.

After this, we were just left with the original patch which was now just a matter of moving RISCVInsertVSETVLI past register allocation:

  • [RISCV] Support postRA vsetvl insertion pass (#70549)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 5aab138dae408..d9f8222669cab 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -399,6 +406,8 @@ bool RISCVPassConfig::addRegAssignAndRewriteFast() {
 bool RISCVPassConfig::addRegAssignAndRewriteOptimized() {
   addPass(createRVVRegAllocPass(true));
   addPass(createVirtRegRewriter(false));
+  if (EnableVSETVLIAfterRVVRegAlloc)
+    addPass(createRISCVInsertVSETVLIPass());

Looking at the numbers, we were now down to a 19k line test diff, still a considerable size, but now the code changes were tiny.

This is the converse of #91440: we have a patch with minimal code changes but a large test diff, and this is also much easier to review. We can have confidence that all changes in the test diff come from a single functional change, which in our case was the interaction of moving RISCVInsertVSETVLI past register allocation and unblocking the machine scheduler.

When and where to split

None of the above splits were obvious to us at first. For a change like this, the only realistic starting point was to just implement it all. Once we had a picture of what the destination should look like, we could start to identify some initial things to split off.

From then on out it’s mostly a matter of splitting, rebasing and repeating the cycle. New incision points will become clearer as both the code and test diffs are whittled down, and hopefully the cycle will eventually lead you to a patch small enough that you can digest everything that’s going on inside of it.

Splitting up big patches is not a new or rare idea, and it’s done all the time in LLVM. Take for example the ongoing work to replace getelementptr with ptradd, being split up into several smaller changes to canonicalize and massage the GEPs into shape before a new instruction is added:

  • [InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
  • [IR] Change representation of getelementptr inrange (#84341)
  • [ConstantFolding] Canonicalize constexpr GEPs to i8 (#89872)

Likewise the epic journey to remove debug intrinsics was carried out over multiple RFCs.

Being able to submit reviewable and landable patches is half the battle of collaborative open source software development. There’s no one-size-fits-all strategy and it often ends up being much more difficult than writing the actual code. But if you’re working on something big, hopefully some of the ideas in this post will be of inspiration to you.

Many thanks to Piyou and the reviewers who helped drive this forward, unblocking the way for further enhancements to vector codegen in the RISC-V backend.
