Previously in part 1
we looked at how we can use LNT to
analyze performance gaps, then identified and fixed a missed fmsub.d
opportunity during instruction selection, giving a modest 1.77%
speedup on a SPEC CPU 2017 benchmark.
In this post we’ll be improving another SPEC benchmark by 7% by teaching the loop vectorizer to make smarter cost modelling decisions. It involves a relatively non-trivial analysis, but thanks to LLVM’s modular infrastructure we can do it in just a handful of lines of code. Let’s get started.
Just like last time, all fruitful performance work begins by analysing some workloads. In the last post we had already run some comparisons of SPEC CPU 2017 benchmarks on LNT, so we can return to those results and pick another benchmark to focus on. Here’s one that’s 12% slower than GCC:

531.deepsjeng_r is a chess engine that tied for first in the World Computer Chess Championships back in 2009. It consists of a lot of bitwise arithmetic and complex loops, since the state of the game is encoded in 64-element arrays: one element for each square on the board. Unlike 508.namd_r from last time, there’s no floating point arithmetic.
Drilling into the profile and its list of functions, right off the bat
we can see that one function is much slower on LLVM. On GCC
qsearch(state_t*, int, int, int, int) makes up 9.1% of the overall
cycles, but on LLVM it’s 16.1%. And if we click in on the function and
view the cumulative total of cycles spent in user mode, Clang takes
74.6 billion cycles to do what takes GCC only 37.7 billion cycles.
So there’s probably something we can improve upon here, but it’s not
immediately obvious from staring at the disassembly. qsearch is a
pretty big function with a couple hundred instructions, so switching
to the CFG view gives us a better overview.
On LLVM’s side we see the offending loop that’s consuming so many
cycles: it’s long, vectorized, and completely if-predicated, with
no control flow inside the loop itself. This is typical of a loop
that’s been auto-vectorized by the loop vectorizer. If you look at the
load and store instructions you can see that they are masked with the
v0.t operand, stemming from the original control flow that was
flattened.

But on the GCC side there’s no equivalent vectorized loop. The loop is in there somewhere, but all the loops are still in their original scalar form with the control flow intact. And if we look at the edges coming from the loop headers, we can see that most of the time it visits one or two basic blocks and then branches back up to the header. Most of the blocks in the loop are completely cold.
Unfortunately the sources for deepsjeng aren’t open source so we can’t share them in this post, but the very rough structure of the loop is something like this:
for (i = 0; i < N; i++) {
  if (foo[i] == a) {
    if (bar[i] == b) {
      if (baz[i] == c) {
        qux[i] = 123;
        // lots of work here...
      }
    }
  }
}
For any given iteration, it’s statistically unlikely that we enter the first if statement. It’s even more unlikely that the second if’s condition is also true. And even more so for the third nested if where we eventually have lots of work to compute.
In a scalar loop this doesn’t matter because if an if statement’s condition is false, then we don’t execute the code inside it. We just branch back to the start of the loop. But with a vectorized loop, we execute every single instruction regardless of the condition.
This is the core of the performance gap we’re seeing versus GCC: since the majority of the work in this loop is so deeply nested in the control flow, and vectorizing it forces us to if-convert it, it would have been better not to vectorize it at all.
One of the hardest problems when making an optimizing compiler is knowing when an optimization is profitable. Some optimizations are a double-edged sword that can harm performance just as much as they can improve it (if not more), and loop vectorization falls squarely into this category. So rather than blindly applying optimizations at any given opportunity, LLVM has detailed cost models for each target to try to estimate how expensive or cheap a certain sequence of instructions is, which it can then use to evaluate whether or not a transform will be a net positive.
It’s hard to overstate the amount of effort in LLVM spent fine tuning these cost models, applying various heuristics and approximations to make sure different optimizations don’t shoot themselves in the foot. In fact there are some optimizations like loop distribute that are in-tree but disabled by default due to the difficulty in getting the cost model right.
So naturally, we would expect that the loop vectorizer already has a sophisticated solution for the problem we’re seeing in our analysis: Given any predicated block that’s if-converted during vectorization, we would expect the scalar cost for that block to be made slightly cheaper because the scalar block may not always be executed. And the less likely it is to be executed, the cheaper it should be — the most deeply nested if block should be discounted more than the outermost if block.
So how does the loop vectorizer handle this?
/// A helper function that returns how much we should divide the cost of a
/// predicated block by. Typically this is the reciprocal of the block
/// probability, i.e. if we return X we are assuming the predicated block will
/// execute once for every X iterations of the loop header so the block should
/// only contribute 1/X of its cost to the total cost calculation, but when
/// optimizing for code size it will just be 1 as code size costs don't depend
/// on execution probabilities.
///
/// TODO: We should use actual block probability here, if available. Currently,
/// we always assume predicated blocks have a 50% chance of executing.
inline unsigned
getPredBlockCostDivisor(TargetTransformInfo::TargetCostKind CostKind) {
  return CostKind == TTI::TCK_CodeSize ? 1 : 2;
}
We’ve come across a load-bearing TODO here. Either the block is executed or it’s not, so it’s a fifty-fifty chance.
On its own this hardcoded probability doesn’t seem like an
unreasonable guess. But whilst 50% may be an accurate estimate as to
whether or not a branch will be taken, it’s an inaccurate estimate
as to whether or not a block will be executed. Assuming that a
branch has a 1/2 chance of being taken, the most deeply nested block
in our example ends up having a 1/2 * 1/2 * 1/2 = 1/8 chance of
being executed.
for (i = 0; i < N; i++) {
  if (foo[i] == a) {
    // 1/2 chance of being executed
    if (bar[i] == b) {
      // 1/4 chance of being executed
      if (baz[i] == c) {
        // 1/8 chance of being executed
        // ...
      }
    }
  }
}
The fix to get the loop vectorizer to not unprofitably vectorize this
loop will be to teach getPredBlockCostDivisor to take into account
control flow between blocks.
It’s worth mentioning that a hardcoded constant managing to work well enough up until this point is the sign of a good trade-off: 1% of the effort for 90% of the benefit. A patch can go off the rails very easily by trying to implement too much in one go, so deferring the more complex cost modelling here till later was an astute choice. Incremental development is key to making progress upstream.
To get a better picture of how the loop vectorizer is calculating the cost for each possible loop, let’s start with a simplified LLVM IR reproducer:
; for (int i = 0; i < 1024; i++)
;   if (c0)
;     if (c1)
;       p1[p0[i]] = 0; // extra work to increase the cost in the predicated block
define void @nested(ptr noalias %p0, ptr noalias %p1, i1 %c0, i1 %c1) {
entry:
  br label %loop
loop:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
  br i1 %c0, label %then.0, label %latch
then.0:
  br i1 %c1, label %then.1, label %latch
then.1:
  %gep0 = getelementptr i32, ptr %p0, i32 %iv
  %x = load i32, ptr %gep0
  %gep1 = getelementptr i32, ptr %p1, i32 %x
  store i32 0, ptr %gep1
  br label %latch
latch:
  %iv.next = add i32 %iv, 1
  %done = icmp eq i32 %iv.next, 1024
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
We can run opt -p loop-vectorize -debug on this example to see how the loop
vectorizer decides if it’s profitable to vectorize the loop or not:
$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction: %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction: %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction: br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 3.
...
Cost of 1 for VF vscale x 4: induction instruction %iv.next = add i32 %iv, 1
Cost of 0 for VF vscale x 4: induction instruction %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
Cost of 1 for VF vscale x 4: exit condition instruction %done = icmp eq i32 %iv.next, 1024
Cost of 0 for VF vscale x 4: EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
Cost of 0 for VF vscale x 4: EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI vp<%5> = phi ir<0>, vp<%index.evl.next>
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%avl> = phi [ ir<1024>, vector.ph ], [ vp<%avl.next>, vector.body ]
Cost of 1 for VF vscale x 4: EMIT-SCALAR vp<%6> = EXPLICIT-VECTOR-LENGTH vp<%avl>
Cost of 0 for VF vscale x 4: vp<%7> = SCALAR-STEPS vp<%5>, ir<1>, vp<%6>
Cost of 0 for VF vscale x 4: CLONE ir<%gep0> = getelementptr ir<%p0>, vp<%7>
Cost of 0 for VF vscale x 4: vp<%8> = vector-pointer ir<%gep0>
Cost of 2 for VF vscale x 4: WIDEN ir<%x> = vp.load vp<%8>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: WIDEN-GEP Inv[Var] ir<%gep1> = getelementptr ir<%p1>, ir<%x>
Cost of 12 for VF vscale x 4: WIDEN vp.store ir<%gep1>, ir<0>, vp<%6>, vp<%3>
Cost of 0 for VF vscale x 4: EMIT vp<%index.evl.next> = add nuw vp<%6>, vp<%5>
Cost of 0 for VF vscale x 4: EMIT vp<%avl.next> = sub nuw vp<%avl>, vp<%6>
Cost of 0 for VF vscale x 4: EMIT vp<%index.next> = add nuw vp<%4>, vp<%0>
Cost of 0 for VF vscale x 4: EMIT branch-on-count vp<%index.next>, vp<%1>
Cost of 0 for VF vscale x 4: vector loop backedge
Cost of 0 for VF vscale x 4: EMIT-SCALAR vp<%bc.resume.val> = phi [ ir<0>, ir-bb<entry> ]
Cost of 0 for VF vscale x 4: IR %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ] (extra operand: vp<%bc.resume.val> from scalar.ph)
Cost of 0 for VF vscale x 4: EMIT vp<%3> = logical-and ir<%c0>, ir<%c1>
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: vscale x 4.
LV: Minimum required TC for runtime checks to be profitable:0
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (vscale x 4) in nested.ll
LV: Vectorizing: innermost loop.
LEV: Unable to vectorize epilogue because no epilogue is allowed.
LV: Loop does not require scalar epilogue
LV: Loop does not require scalar epilogue
Executing best plan with VF=vscale x 4, UF=1
First we see it work out the cost of the original scalar loop, or as
the vectorizer sees it, the loop with a vectorization factor (VF)
of 1. It goes through each instruction calling into
TargetTransformInfo and arrives at a total scalar cost of 3. You
might have noticed, though, that if you manually summed up the
individual instruction costs you would have gotten a total of 4.
However, the load and store instructions belong to the predicated
then.1 block, so their cost is divided by 2 by
getPredBlockCostDivisor.
For the vectorized loop, the loop vectorizer uses
VPlan to cost the one
plan for a range of different VFs1. VPlan is an IR
specific to the loop vectorizer to help represent various
vectorization strategies, which is why you see all the EMIT and
WIDEN “recipes” in the output. It calculates a total cost for the
loop and divides it by the estimated number of lanes — we’re working
with scalable vectors on RISC-V so the target needs to make an
estimate of what vscale is — and arrives at 2.1 per lane. There’s
no predication discount applied here because it’s a vectorized
loop. 2.1 is cheaper than 3, so it ultimately picks the vectorized
loop.
Computing an accurate probability that a given block will be executed is a non-trivial task, but thankfully LLVM already has an analysis we can use for this called BlockFrequencyInfo.
BlockFrequencyInfo computes how often a block can be expected to
execute relative to other blocks in a function. It in turn uses
another analysis called BranchProbabilityInfo to work out how likely a
branch to a specific block is going to be taken. And because
BranchProbabilityInfo uses profiling information when available, it
can give you much more accurate block frequencies when compiling with
PGO. Otherwise
it will fall back to guessing the probability of a branch being taken,
which is just 50/50 a lot of the time, but sometimes influenced by
interesting heuristics too: for example, the probability of an icmp eq
i32 %x, 0 being true is 0.375 instead of 0.5, and floats are assumed to
have a near-zero chance of being NaN.
Plugging BlockFrequencyInfo into the loop vectorizer is straightforward, all we need to do is tell the pass manager that we want to access BlockFrequencyInfo from LoopVectorizePass:
PreservedAnalyses LoopVectorizePass::run(Function &F,
                                         FunctionAnalysisManager &AM) {
  ...
  BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
  ...
}
(BlockFrequencyAnalysis is the pass that computes the analysis result BlockFrequencyInfo, if you’re wondering why the names are different)
Then we can use it to lookup the relative frequencies of whatever block and work out the probability of it being executed in the loop:
uint64_t LoopVectorizationCostModel::getPredBlockCostDivisor(
    TargetTransformInfo::TargetCostKind CostKind, const BasicBlock *BB) {
  if (CostKind == TTI::TCK_CodeSize)
    return 1;
  uint64_t HeaderFreq =
      BFI->getBlockFreq(TheLoop->getHeader()).getFrequency();
  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
  return HeaderFreq / BBFreq;
}
The frequencies returned from BlockFrequencyInfo are relative to the entry block of a function. So if a block has a frequency of 50 and the entry block has a frequency of 100, then you can expect that block to execute 50 times for every 100 times the entry block is executed.
You can use this to work out the probability of a block being executed within a function, so in this example that block has a 50/100 = 50% chance of being executed every time the function is. However this only works if the CFG has no loops: otherwise a block may be executed more times than the entry block and we’d end up with probabilities greater than 100%.
Calculating the probability of a block being executed inside a loop works out fine here though, since the loop vectorizer currently only vectorizes inner-most loops2, i.e. loops that contain no other loops.
We can consider the frequencies of each block in the loop relative to the frequency of the header block. To give a brief loop terminology recap, the header is the first block inside the loop body which dominates all other blocks in the loop, and is the destination of all backedges. So the header is guaranteed to have a frequency greater than or equal to any other block in the loop — this invariant is important as we’ll see later.
Then to calculate the probability of a block in a loop being executed, we divide the block frequency by the header frequency. To work out how much we should divide the cost of the scalar block by, we return the inverse of that.
Trying out this change on our sample loop, first we’ll see the debug output from BlockFrequencyInfo as it’s computed:
$ opt -p loop-vectorize -mtriple riscv64 -mattr=+v nested.ll -disable-output -debug
...
block-frequency-info: nested
- entry: float = 1.0, int = 562949953421312
- loop: float = 32.0, int = 18014398509481984
- then.0: float = 16.0, int = 9007199254740992
- then.1: float = 8.0, int = 4503599627370496
- latch: float = 32.0, int = 18014398509481984
- exit: float = 1.0, int = 562949953421312
loop is the header block and then.1 is the nested if block, and
with BlockFrequencyInfo’s frequency we get a probability of 8/32 =
0.25. So we would expect then.1’s scalar cost to be divided by 4:
...
LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %c0, label %then.0, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %c1, label %then.1, label %latch
LV: Found an estimated cost of 0 for VF 1 For instruction: %gep0 = getelementptr i32, ptr %p0, i32 %iv
LV: Found an estimated cost of 1 for VF 1 For instruction: %x = load i32, ptr %gep0, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction: %gep1 = getelementptr i32, ptr %p1, i32 %x
LV: Found an estimated cost of 1 for VF 1 For instruction: store i32 0, ptr %gep1, align 4
LV: Found an estimated cost of 0 for VF 1 For instruction: br label %latch
LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i32 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %done = icmp eq i32 %iv.next, 1024
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %done, label %exit, label %loop
LV: Scalar loop costs: 2.
...
Cost for VF vscale x 4: 17 (Estimated cost per lane: 2.1)
...
LV: Selecting VF: 1.
LV: Vectorization is possible but not beneficial.
then.1’s scalar cost is now 2/4 = 0 (the costs are integers, so the
division rounds down), the total cost of the scalar loop is now 2,
and the loop vectorizer no longer decides to vectorize. If we try
this out on 531.deepsjeng_r, we can see that it no longer vectorizes
that loop in qsearch either. Success!

Running it again on LNT showed a ~7% speedup in execution time. Not quite as fast as GCC yet, but a welcome improvement for only a handful of lines of code.
Now that we know the fix we want to land, we can start to think about how we want to upstream this into LLVM.
If we run llvm-lit --update-tests
llvm/test/Transforms/LoopVectorize, we actually get quite a few
unexpected test changes. One of the side effects of using
BlockFrequencyInfo is that tail folded loops no longer discount the
scalar loop if it wasn’t predicated to begin
with. A tail folded
loop is a loop where the scalar epilogue is folded into the vector loop itself by predicating the vector operations:
// non-tail folded loop:
// process as many VF sized vectors that fit in n
for (int i = 0; i < n - (n % VF); i += VF)
  x[i..i+VF] = y[i..i+VF];
// process the remaining n % VF scalar elements
for (int i = n - (n % VF); i < n; i++)
  x[i] = y[i];

// tail folded loop:
for (int i = 0; i < n; i += VF)
  x[i..i+VF] = y[i..i+VF] mask=[i<n, i+1<n, ..., i+VF-1<n];
However because this block is technically predicated due to the mask
on the vector instructions, the loop vectorizer applied
getPredBlockCostDivisor to the scalar loop cost even if the original
scalar loop had no control flow in its body. BlockFrequencyInfo can
detect that if the block has no control flow guarding it, its
probability of being executed is 1, so the scalar loop cost is no
longer made cheaper than it should be. I split off and landed this
change separately, since it makes the test changes easier to review.
Now that the remaining changes in llvm/test/Transforms/LoopVectorize
looked more contained, I was almost ready to open a pull request. I
just wanted to quickly kick the tyres on
llvm-test-suite with a few
other targets, since this wasn’t a RISC-V specific change. The plan
was to quickly collect some stats on how many loops were vectorized,
check for any anomalies when compared to beforehand, and then be on
our way:
$ cd llvm-test-suite
$ ninja -C build
...
[222/7278] Building C object External/...nchspec/CPU/500.perlbench_r/src/pp.c.o
FAILED: External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o
/root/llvm-test-suite/build.x86_64-ReleaseLTO-a/tools/timeit --summary External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.time /root/llvm-project/build/bin/clang -DDOUBLE_SLASHES_SPECIAL=0 -DNDEBUG -DPERL_CORE -DSPEC -DSPEC_AUTO_BYTEORDER=0x12345678 -DSPEC_AUTO_SUPPRESS_OPENMP -DSPEC_CPU -DSPEC_LINUX -DSPEC_LINUX_X64 -DSPEC_LP64 -DSPEC_SUPPRESS_OPENMP -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/dist/IO -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/Time-HiRes -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/cpan/HTML-Parser -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/ext/re -I/root/cpu2017/benchspec/CPU/500.perlbench_r/src/specrand -march=x86-64-v3 -save-temps=obj -O3 -fomit-frame-pointer -flto -DNDEBUG -w -Werror=date-time -save-stats=obj -save-stats=obj -fno-strict-aliasing -MD -MT External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -MF External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o.d -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -c /root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: /root/llvm-project/build/bin/clang-19 -cc1 -triple x86_64-unknown-linux-gnu -O3 -emit-llvm-bc -flto=full -flto-unit -save-temps=obj -disable-free -clear-ast-before-backend -main-file-name pp.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -relaxed-aliasing -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64-v3 -debugger-tuning=gdb -fdebug-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -fcoverage-compilation-dir=/root/llvm-test-suite/build.x86_64-ReleaseLTO-a -resource-dir /root/llvm-project/build/lib/clang/23 -Werror=date-time -w -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -vectorize-loops -vectorize-slp -stats-file=External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.stats -faddrsig -fdwarf2-cfi-asm -o External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.c.o -x ir External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc
1. Optimizer
2. Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;switch-to-lookup;no-keep-loops;hoist-common-insts;no-hoist-loads-stores-with-cond-faulting;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;switch-to-arithmetic;no-switch-to-lookup;keep-loops;no-hoist-common-insts;hoist-loads-stores-with-cond-faulting;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "External/SPEC/CINT2017rate/500.perlbench_r/CMakeFiles/500.perlbench_r.dir/root/cpu2017/benchspec/CPU/500.perlbench_r/src/pp.bc"
3. Running pass "loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>" on function "Perl_pp_coreargs"
#0 0x0000556ff93ab158 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/root/llvm-project/build/bin/clang-19+0x2d5c158)
#1 0x0000556ff93a8835 llvm::sys::RunSignalHandlers() (/root/llvm-project/build/bin/clang-19+0x2d59835)
#2 0x0000556ff93abf01 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
#3 0x00007f305ce49df0 (/lib/x86_64-linux-gnu/libc.so.6+0x3fdf0)
#4 0x0000556ffaa0dbfb llvm::LoopVectorizationCostModel::expectedCost(llvm::ElementCount) (/root/llvm-project/build/bin/clang-19+0x43bebfb)
#5 0x0000556ffaa22a0d llvm::LoopVectorizationPlanner::computeBestVF() (/root/llvm-project/build/bin/clang-19+0x43d3a0d)
#6 0x0000556ffaa36f3b llvm::LoopVectorizePass::processLoop(llvm::Loop*) (/root/llvm-project/build/bin/clang-19+0x43e7f3b)
#7 0x0000556ffaa413eb llvm::LoopVectorizePass::runImpl(llvm::Function&) (/root/llvm-project/build/bin/clang-19+0x43f23eb)
...
...
A crash when building for X86. No assertion message, but a backtrace that points to the loop vectorizer cost model. Unfortunately this did not turn out to be simple to debug and instead turned into a whole other ordeal, so I’ll leave the details of that rabbit hole to the next post. But in the meantime, here are some hints if you want to guess what went wrong:
Hopefully this also gives a bit of insight into the type of upstream work that we carry out at Igalia. If you have an LLVM or RISC-V project that we could help with, feel free to reach out.
The scalar loop is also modeled in VPlan, but currently costed with the legacy cost model rather than the VPlan itself. This is another load-bearing TODO. ↩
Whilst not enabled default, there is experimental support for outer loop vectorization in the VPlan native path. ↩

LLVM developers upstream have been working hard on the performance of generated code, in every part of the pipeline from the frontend all the way through to the backend. So when we first saw these results we were naturally a bit surprised. But as it turns out, the GCC developers have been hard at work too.
Sometimes a bit of healthy competition isn’t a bad thing, so this blog post is the first in a series looking at the work going on upstream to improve performance and catch up to GCC.
Please note that this series focuses on RISC-V. Other targets may have
more competitive performance but we haven’t measured them yet. We’ll
specifically be focusing on the high-performance application processor
use case for RISC-V, e.g. compiling for a
profile like
RVA23. Unfortunately since we don’t have access to RVA23 compatible
hardware just yet we’ll be benchmarking on a SpacemiT-X60 powered
Banana Pi BPI-F3 with -march=rva22u64_v. We don’t want to use
-mcpu=spacemit-x60 since we want to emulate a portable configuration
that an OS distribution might compile packages with. And we want to
include the vector extension, as we’ll see in later blog posts that
optimizations like auto-vectorization can have a major impact on
performance.
It goes without saying that a vague task like “make LLVM faster” is easier said than done. The first thing is to find something to make fast, and while you could read through the couple dozen million lines of code in LLVM until inspiration strikes, it’s generally easier to start the other way around by analyzing the code it generates.
Sometimes you’ll get lucky by just stumbling across something that could be made faster when hacking or poring through generated assembly. But there’s an endless amount of optimizations to be implemented and not all of them are equally impactful. If we really want to make large strides in performance we need to take a step back and triage what’s actually worth spending time on.
LNT, LLVM’s nightly testing infrastructure, is a great tool for this task. It’s both a web server that allows you to analyze benchmark results, and a command line tool to help run the benchmarks and submit the results to said web server.
As the name might imply, it’s normally used for detecting performance regressions by running benchmarks daily with the latest revision of Clang, flagging any tests that may have become slower or faster since the last revision.
But it also allows you to compare benchmark results across arbitrary configurations. You can run experiments to see what effects a flag has, or see the difference in performance on two pieces of hardware.
Moreover, you can pass in different compilers. In our case, we can do two “runs” with Clang and GCC. Here’s how we would kick these off:
# --toolchain/--remote-host: cross-compile, then run on another machine over ssh
# --exec-multisample/--run-under: fight noise by running each benchmark 3 times
#                                 pinned to the same core
# --submit: submit the results to a web server for easy viewing
for CC in clang riscv64-linux-gnu-gcc
do
  lnt runtest test-suite bpi-f3-rva22u64_v-ReleaseLTO \
    --sandbox /var/lib/lnt/ \
    --test-suite=path/to/llvm-test-suite \
    -DTEST_SUITE_SPEC2017_ROOT=path/to/cpu2017 \
    --cc=$CC \
    --cflags="-O3 -flto -march=rva22u64_v" \
    --cxxflags="-O3 -flto -march=rva22u64_v" \
    --benchmarking-only \
    --build-threads=16 \
    --toolchain=rva22u64_v.cmake \
    --remote-host=bpi-f3 \
    --exec-multisample=3 \
    --run-under="taskset -c 5" \
    --submit=https://mylntserver.com/submitRun
done
This command does a lot of heavy lifting. First off it invokes CMake
to configure a new build of llvm-test-suite and SPEC CPU 2017 with
-O3 -flto -march=rva22u64_v. But because compiling the benchmarks
on the Banana Pi BPI-F3 would be painfully slow, we’ve specified a
CMake toolchain
file
to cross-compile to riscv64-linux-gnu from an x86-64 build
machine. Here’s what the toolchain file looks like:
# rva22u64_v.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(ARCH riscv64)
If you’ve got your cross toolchain sysroots set up in the right place
in /usr/riscv64-linux-gnu/, things should “just work” and CMake will
magically build RISC-V binaries. On Debian distros you can install the
*-cross packages for this:
$ apt install libc6-dev-riscv64-cross libgcc-14-dev-riscv64-cross \
    libstdc++-12-dev-riscv64-cross
(You could also use mmdebstrap, or see Alex Bradbury’s guide to this)
After the benchmarks are built it rsyncs over the binaries to the
remote machine, and then sshes into it to begin running the
benchmarks. It will expect the sandbox path where the binaries are
built on the remote to also exist on the host, so something like
/var/lib/lnt should work across both. The BPI-F3 can also produce
some noisy results, so the --exec-multisample=3 and
--run-under="taskset -c 5" options tell it to run the benchmarks
multiple times and pin them to the same core.
Finally it generates a report.json file and submits it to the web
server of choice. Navigate to the web interface and you’ll be shown
two “machines”, LNT’s parlance for a specific combination of hardware,
compiler and flags. You should see something like:
bpif3-rva22u64_v-ReleaseLTO__clang_DEV__riscv64 and
bpif3-rva22u64_v-ReleaseLTO__gcc_DEV__riscv64. Clicking into one of
these machines will allow you to compare it against the other.

Once on the LNT web interface you’ll be presented with a list of benchmarks with a lot of red percentages beside them. We now know what is slower, but next we need to know why they’re slower. We need to profile these benchmarks to see where all the cycles are spent and to figure out what Clang is doing differently from GCC.
LNT makes this easy: all you need to do is add --use-perf=profile to
the lnt runtest invocation and it will perform an additional run of
each benchmark wrapped in perf record. Profiling impacts run time, so
LNT does this in a separate run to avoid interfering with the final
results. If
you want to override the default events that are sampled you can
specify them with --perf-events=cycles:u,instructions:u,....
LNT will take care of copying back the collected profiles to the host machine and encoding them in the report, and in the web interface you’ll notice a “Profile” button beside the benchmark. Click on that and you’ll be brought to a side by side comparison of the profiles from the two machines:

From here you can dive in and see where the benchmark spends most of
its time. Select a function from the dropdown and choose one with a
particularly high percentage: this is how much of the overall total the
function accounts for, measured in whatever counter is active in the
top right, like cycles or instructions. Then do the same for the other
run and you’ll be
presented with the disassemblies side-by-side below. Most importantly,
information about the counters is displayed inline with each
instruction, much like the output of perf annotate.
You might find the per-instruction counter cycle information to be a bit too fine-grained, so personally I like to use the “Control-Flow Graph” view mode in the top left. This groups the instructions into blocks and lets you see which blocks are the hottest. It also shows the edges between branches and their destinations which makes identifying loops a lot easier.
Let’s take a look at how we can use LNT’s web interface to identify something that GCC does but Clang doesn’t (but should). Going back to the list of SPEC benchmark results we can see 508.namd_r is about 17% slower, so hopefully we should find something to optimize in there.
Jumping into the profile we can see there’s a bunch of functions that
all contribute a similar amount to the runtime. We’ll just pick the
hottest one at 14.3%,
ComputeNonbondedUtil::calc_pair_energy_fullelect(nonbonded*). It’s a
pretty big function, but in GCC’s profile 71% of the dynamic
instruction count comes from this single, albeit large, block.

Looking at Clang’s profile on the opposite side we see a similar block
that accounts for 85% of the function’s instruction count. This
slightly higher proportion is a small hint that the block that Clang’s
producing is sub-optimal. If we take the hint and stare at it for long
enough, one thing that starts to stand out is that Clang generates a
handful of fneg.d instructions which GCC doesn’t:
fneg.d fa0, fa0
fneg.d ft0, ft0
fneg.d ft2, ft2
fmul.d fa3, ft5, fa3
fmul.d fa0, fa3, fa0
fmul.d ft0, fa3, ft0
fmul.d fa3, fa3, ft2
fmadd.d fa2, fa4, fa2, fa0
fmadd.d ft6, fa4, ft6, ft0
fmadd.d fa4, fa4, ft1, fa3
fneg.d rd, rs1 negates a double and fmul.d multiplies two
doubles. fmadd.d rd, rs1, rs2, rs3 computes (rs1*rs2)+rs3, so here
we’re doing some calculation like (a*b)+(c*-d).
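To see why this matters, note that in IEEE-754 negation only flips the sign bit and is therefore always exact, so (a*b)+(c*-d) computes the same value as (a*b)-(c*d). A quick sketch in Python (whose floats are IEEE doubles; the bit-twiddling fneg is purely illustrative):

```python
import struct

def fneg(x: float) -> float:
    # IEEE-754 negation only flips the sign bit, so it is always exact.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << 63)))[0]

a, b, c, d = 1.5, 2.25, 3.0, 0.1
# (a*b) + (c * -d) equals (a*b) - (c*d): adding a negated value is
# subtraction by definition, so no rounding behaviour changes. This is
# why the fneg.d + fmadd.d pair can be folded into a single fmsub.d.
assert fneg(d) == -d
assert a * b + c * fneg(d) == a * b - c * d
```

Note that Python rounds the multiply and the add separately rather than fusing them like fmadd.d would, but the identity only concerns the negated addend, so it holds either way.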
These fneg.ds and fmadd.ds are missing on GCC. Instead it emits
fmsub.d, which is entirely absent from the Clang code:
fmul.d fa1,fa4,fa1
fmul.d ft10,fa4,fa5
fmsub.d ft10,ft7,fa0,ft10
fmsub.d fa5,ft7,fa5,fa1
fmul.d fa1,fa4,fa1
fmsub.d fa1,ft7,fa0,fa1
fmsub.d rd, rs1, rs2, rs3 computes (rs1*rs2)-rs3, so GCC is
instead doing something like (a*b)-(c*d) and in doing so avoids the
need for the fneg.d. This sounds like a missed optimization in LLVM,
so let’s take a look at fixing it.
The LLVM RISC-V scalar backend is pretty mature at this stage so it’s
surprising that we aren’t able to match fmsub.d. But if you take a
look in RISCVInstrInfoD.td, you’ll see that the pattern already
exists:
// fmsub: rs1 * rs2 - rs3
def : Pat<(any_fma FPR64:$rs1, FPR64:$rs2, (fneg FPR64:$rs3)),
(FMSUB_D FPR64:$rs1, FPR64:$rs2, FPR64:$rs3, FRM_DYN)>;
We’ll need to figure out why this pattern isn’t getting selected, so let’s start by extracting the build commands so we can look under the hood and dump the LLVM IR:
$ cmake -B build -C cmake/caches/ReleaseLTO.cmake --toolchain=...
$ ninja -C build 508.namd_r -t clean
$ ninja -C build 508.namd_r -v
...
[44/45] : && llvm-project/build.release/bin/clang++ --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r
This is an LTO build so the code generation step is actually happening
during link time. To dump the IR we can copy and paste the link
command from the verbose output and append -Wl,--save-temps to it,
which in turn tells the Clang driver to pass --save-temps to the
linker2.
$ llvm-project/build.release/bin/clang++ -Wl,--save-temps --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r
$ ls External/SPEC/CFP2017rate/508.namd_r/508.namd_r*
External/SPEC/CFP2017rate/508.namd_r/508.namd_r
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.0.preopt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.2.internalize.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.4.opt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.5.precodegen.bc
The bitcode is dumped at various stages, and
508.namd_r.0.5.precodegen.bc is the particular stage we’re looking
for. This is after all the middle-end optimisations have run and is as
close as we’ll get before the backend begins. It contains the bitcode
for the entire program though, so let’s find the symbol for the C++
function and extract just that corresponding LLVM IR function:
$ llvm-objdump -t 508.namd_r | grep calc_pair_energy_fullelect
...
000000000004562e l F .text 0000000000001c92 _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded
$ llvm-extract -f 508.namd_r.0.5.precodegen.bc --func _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded \
| llvm-dis > calc_pair_energy_fullelect.precodegen.ll
Now quickly grep the disassembled LLVM IR to see if we can find the
source of the fnegs:
%316 = fneg double %315
%neg = fmul double %mul922, %316
%317 = tail call double @llvm.fmuladd.f64(double %mul919, double %314, double %neg)
This looks promising. We have a @llvm.fmuladd that’s being fed by a
fmul of a fneg, which is similar to the (a*b)+(c*-d) pattern in
the resulting assembly. But looking back to our TableGen pattern for
fmsub.d, we want (any_fma $rs1, $rs2, (fneg $rs3)), i.e. a
llvm.fmuladd fed by a fneg of a fmul.
One thing about floating point arithmetic is that whilst it’s
generally not associative, negation is exact: all it does is flip the
sign bit, so we can hoist the fneg out of the fmul without changing
the result. So we can try to teach
InstCombine to hoist the fneg outwards like (fmul x, (fneg y)) ->
(fneg (fmul x, y)). But if we go to try that out we’ll see that
InstCombine already does the exact opposite:
Instruction *InstCombinerImpl::visitFNeg(UnaryOperator &I) {
  Value *Op = I.getOperand(0);
  // ...
  Value *OneUse;
  if (!match(Op, m_OneUse(m_Value(OneUse))))
    return nullptr;
  if (Instruction *R = hoistFNegAboveFMulFDiv(OneUse, I))
    return replaceInstUsesWith(I, R);
  // ...
}

Instruction *InstCombinerImpl::hoistFNegAboveFMulFDiv(Value *FNegOp,
                                                      Instruction &FMFSource) {
  Value *X, *Y;
  if (match(FNegOp, m_FMul(m_Value(X), m_Value(Y)))) {
    // Push into RHS which is more likely to simplify (const or another fneg).
    // FIXME: It would be better to invert the transform.
    return cast<Instruction>(Builder.CreateFMulFMF(
        X, Builder.CreateFNegFMF(Y, &FMFSource), &FMFSource));
  }
  // ...
}
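Whether hoisting in the other direction is sound is easy to check empirically: negation only flips the sign bit, and IEEE-754 rounding is symmetric in sign, so fmul x, (fneg y) and fneg (fmul x, y) always produce identical bits. A quick brute-force check in Python (floats are IEEE doubles):

```python
import random
import struct

def bits(v: float) -> bytes:
    # Raw IEEE-754 encoding, so we compare bit-exact results (catches
    # any hypothetical -0.0 vs +0.0 or rounding discrepancy).
    return struct.pack("<d", v)

random.seed(0)
for _ in range(100_000):
    x = random.uniform(-1e308, 1e308)
    y = random.uniform(-1e308, 1e308)
    # x * (-y) and -(x * y) are bit-identical, even when the product
    # rounds, or overflows to infinity.
    assert bits(x * -y) == bits(-(x * y))
```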
InstCombine usually has good reasons for canonicalizing certain IR
patterns, so we need to seriously reconsider if we want to change the
canonical form. InstCombine affects all targets and it could be the
case that some other backends have patterns that match (fmul x,
(fneg y)), in which case we don’t want to disturb them. However for RISC-V we
know what our patterns for instruction selection are and what form we
want our incoming IR to be in. So a much better place to handle this
is in RISCVISelLowering.cpp, which lets us massage it into shape at
the SelectionDAG level, in a way that’s localized to just our
target. “Un-canonicalizing” the IR is a common task that backends end
up performing, and this is what the resulting
patch ended up
looking like:
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -20248,6 +20248,17 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
return V;
break;
case ISD::FMUL: {
+ using namespace SDPatternMatch;
+ SDLoc DL(N);
+ EVT VT = N->getValueType(0);
+ SDValue X, Y;
+ // InstCombine canonicalizes fneg (fmul x, y) -> fmul x, (fneg y), see
+ // hoistFNegAboveFMulFDiv.
+ // Undo this and sink the fneg so we match more fmsub/fnmadd patterns.
+ if (sd_match(N, m_FMul(m_Value(X), m_OneUse(m_FNeg(m_Value(Y))))))
+ return DAG.getNode(ISD::FNEG, DL, VT,
+ DAG.getNode(ISD::FMUL, DL, VT, X, Y));
+
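The effect of this combine can be sketched with a toy pattern matcher over nested tuples (hypothetical helpers for illustration, nothing like the real SelectionDAG API):

```python
# Expressions are nested tuples: ("fneg", x), ("fmul", x, y),
# ("fma", a, b, c), with strings as leaves.
def hoist_fneg(node):
    """Rewrite (fmul x, (fneg y)) -> (fneg (fmul x, y)), bottom-up."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(hoist_fneg(op) for op in node[1:])
        if node[0] == "fmul" and isinstance(node[2], tuple) and node[2][0] == "fneg":
            return ("fneg", ("fmul", node[1], node[2][1]))
    return node

def select(node):
    """Mimic the TableGen patterns: an fma whose addend is an fneg
    selects to fmsub.d, otherwise to a plain fmadd.d."""
    if isinstance(node, tuple) and node[0] == "fma":
        addend = node[3]
        if isinstance(addend, tuple) and addend[0] == "fneg":
            return "fmsub.d"
        return "fmadd.d"
    return None

# InstCombine's canonical form buries the fneg inside the fmul, so only
# fmadd matches (and the buried fneg costs an extra fneg.d)...
tree = ("fma", "a", "b", ("fmul", "c", ("fneg", "d")))
assert select(tree) == "fmadd.d"
# ...but once the fneg sits directly under the fma, fmsub matches.
assert select(hoist_fneg(tree)) == "fmsub.d"
```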
And if we rebuild our benchmark after applying it, we can see the
fmsub.ds getting matched, saving a couple of instructions:
@@ -983,18 +983,15 @@
fld ft2, 48(a5)
fld ft3, 64(a5)
fld ft4, 72(a5)
- fneg.d fa0, fa0
- fneg.d ft0, ft0
- fneg.d ft2, ft2
fmul.d fa3, ft5, fa3
fmul.d fa0, fa3, fa0
fmul.d ft0, fa3, ft0
fmul.d fa3, fa3, ft2
fld ft2, 0(s1)
fmul.d fa4, ft5, fa4
- fmadd.d fa2, fa4, fa2, fa0
- fmadd.d ft6, fa4, ft6, ft0
- fmadd.d fa4, fa4, ft1, fa3
+ fmsub.d fa2, fa4, fa2, fa0
+ fmsub.d ft6, fa4, ft6, ft0
+ fmsub.d fa4, fa4, ft1, fa3
All in all this ended up giving a 1.77% improvement in instruction count for the 508.namd_r benchmark. It’s still not nearly as fast as GCC, but we’re a little bit closer than before we started.
Hopefully this has given you an overview of how to identify opportunities for optimization in LLVM, and what a typical fix might look like. The analysis is really the most important part, but if you don’t feel like setting up an LNT instance yourself locally Igalia runs one at cc-perf.igalia.com3. We run llvm-test-suite and SPEC CPU 2017 nightly built with Clang and GCC on a small set of RISC-V hardware4, but hopefully to be expanded in future. Feel free to use it to investigate some of the differences between Clang and GCC yourself, and maybe you’ll find some inspiration for optimizations.
In the next post in this series I’ll talk about a performance improvement that recently landed related to cost modelling.
Compiled with -march=rva22u64_v -O3 -flto, running the train
dataset on a 16GB Banana Pi BPI-F3 (SpacemiT X60), with GCC and
Clang from ToT on 2025-11-25. ↩
LLD in this case, configurable through CMake with
-DCMAKE_LINKER_TYPE=LLD. ↩
The LLVM foundation is also in the process of rebooting its canonical public server, which should hopefully be up and running in the coming months. ↩
Currently it consists of a few Banana Pi BPI-F3s and some HiFive Premier P550s, the latter of which were generously donated by RISC-V International. ↩
The gist is that vector instructions on RISC-V don’t encode the vector
length or element type in the instruction itself, but instead read
them from two registers vl and vtype that are configured with
vset[i]vli:
# setup vl and vtype so we're working with <a0 x i32> vectors
vsetivli zero, a0, e32, m1, ta, ma
vadd.vv v8, v10, v12
So in LLVM we have a pass called RISCVInsertVSETVLI that inserts the
necessary vset[i]vlis so that vl and vtype are setup correctly
for each pseudo instruction:
%a = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/
--[RISCVInsertVSETVLI]-->
$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, implicit $vl, implicit $vtype
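Here’s a toy sketch of what the pass does within a single basic block (a hypothetical representation; the real pass also does cross-block dataflow, AVL register tracking, and vsetvli coalescing):

```python
# Each op is (asm, avl, vtype): the vector instruction plus the vl/vtype
# configuration it needs, mirroring how the pseudos carry their AVL and
# SEW/LMUL operands before RISCVInsertVSETVLI runs.
def insert_vsetvlis(ops):
    out, state = [], None  # state = (avl, vtype) currently held in vl/vtype
    for asm, avl, vtype in ops:
        if (avl, vtype) != state:
            # Configuration changed (or is unknown): emit a vsetvli first.
            out.append(f"vsetvli zero, {avl}, {vtype}")
            state = (avl, vtype)
        out.append(asm)
    return out

ops = [
    ("vadd.vv v8, v10, v12", "a0", "e32, m2, ta, ma"),
    ("vsub.vv v8, v8, v12",  "a0", "e32, m2, ta, ma"),  # same config: reuse it
    ("vle64.v v16, (a1)",    "a0", "e64, m4, ta, ma"),  # config change
]
out = insert_vsetvlis(ops)
assert len(out) == 5  # only two vsetvlis needed for three vector ops
```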
Previously this was run before register allocation on virtual registers, but it had some drawbacks:
PseudoVSETVLI implicitly defines $vl and $vtype whilst the
pseudos implicitly use them, which turns every PseudoVSETVLI into
a scheduling boundary. For example, we can’t move this high-latency
vle32.v before the vadd.vv because they have different
vtypes:
$x0 = PseudoVSETVLI ... /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
%x = PseudoVADD_VV_M2 ..., implicit $vl, implicit $vtype
/* Can't move the vle32.v up past here! */
$x0 = PseudoVSETVLI ... /*e32, m8, ta, ma*/, implicit-def $vl, implicit-def $vtype
%y = PseudoVLE32_V_M8 ..., implicit $vl, implicit $vtype
It’s not practically possible to emit further vector pseudos during
or after register allocation, since we would need to keep track of
the current vl and vtype and potentially emit new vsetvlis.
This is a blocker for vector rematerialization and partial
spilling: we’re stuck using whole register loads and stores when
spilling since they don’t use vl or vtype, even though we might
not need an entire vector register for a small vector.
The solution we ended up with was to insert the vsetvlis after
register allocation instead:
%rd = PseudoVADD_VV_M2 %passthru, %rs2, %rs1, %avl, 5 /*e32*/, 3 /*ta, ma*/
--[Allocate vector registers]-->
$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, %avl, 5 /*e32*/, 3 /*ta, ma*/
--[RISCVInsertVSETVLI]-->
$x0 = PseudoVSETVLI %avl, 209 /*e32, m2, ta, ma*/, implicit-def $vl, implicit-def $vtype
$v8m2 = PseudoVADD_VV_M2 $v8m2, $v8m2, $v10m2, implicit $vl, implicit $vtype
--[Allocate scalar registers]-->
...
The trick here is that we split register allocation into two, doing
one pass for vectors first and then a second one for scalars. This
means that RISCVInsertVSETVLI can be run in between the two,
operating on physical vector registers whilst still being able to
create virtual scalar registers (it might need to write to a register
to set the AVL to VLMAX: vsetvli a0, x0, ...).
The other big change needed was to update RISCVInsertVSETVLI to
operate on LiveIntervals now that it was out of SSA form. This isn’t
so much an issue with the physical vector registers, but doing the
dataflow analysis on the scalar AVL registers proved much trickier.
These changes are interesting in their own right, but in this blog post I want to focus on the logistics that went into landing them instead. For reference, there were only around 500 lines in the initial version of the patch:
$ git diff --stat --relative -- lib
lib/CodeGen/LiveDebugVariables.cpp | 2 +-
lib/CodeGen/LiveDebugVariables.h | 68 ------
lib/CodeGen/RegAllocBasic.cpp | 2 +-
lib/CodeGen/RegAllocGreedy.cpp | 2 +-
lib/CodeGen/VirtRegMap.cpp | 2 +-
lib/Target/RISCV/RISCVInsertVSETVLI.cpp | 394 ++++++++++++++++++++++++--------
lib/Target/RISCV/RISCVTargetMachine.cpp | 14 +-
7 files changed, 318 insertions(+), 166 deletions(-)
But getting this into the main branch took twenty-something patches:
easily one of the longest-running efforts I’ve been a part of during
my time at Igalia.
As you might expect with a significant change like this, it was initially put behind a flag that was disabled by default, hence why the initial patch was so small. Enabling it by default revealed how nearly every bit of vector code under the sun ended up being affected:
$ git diff --stat --relative -- test
test/CodeGen/RISCV/O0-pipeline.ll | 2 +-
test/CodeGen/RISCV/O3-pipeline.ll | 2 +-
.../early-clobber-tied-def-subreg-liveness.ll | 18 +-
test/CodeGen/RISCV/intrinsic-cttz-elts-vscale.ll | 26 +-
...
test/CodeGen/RISCV/rvv/vssubu-vp.ll | 4 +-
test/CodeGen/RISCV/rvv/vtrunc-vp.ll | 22 +-
test/CodeGen/RISCV/rvv/vuitofp-vp.ll | 28 +-
test/CodeGen/RISCV/rvv/vxrm-insert.ll | 24 +-
test/CodeGen/RISCV/rvv/vxrm.mir | 6 +-
test/CodeGen/RISCV/rvv/vzext-vp.ll | 2 +-
test/CodeGen/RISCV/spill-fpr-scalar.ll | 24 +-
test/CodeGen/RISCV/srem-seteq-illegal-types.ll | 4 +-
331 files changed, 14868 insertions(+), 14079 deletions(-)
Since we ultimately wanted post-register allocation
RISCVInsertVSETVLI to be the only configuration going forward, we
would have to confront these test diffs at some point. Leaving it
disabled behind a flag would have just kicked the massive diff further
down the road, to a separate patch that enabled it by default.
Landing the changes enabled by default also keeps the code changes and test diffs in sync, which makes it easier to correlate the two when reading through the commit history. And in our case it was better to review the test diffs whilst the mental model of the code was still fresh in our heads.
We still kept the flag since it’s useful to have a fail-safe with risky changes like these, but we switched the flag over to be enabled by default and were now left with a 30k-ish line test diff.
A large test diff isn’t necessarily unreviewable provided the changes are boring, but unfortunately for us our changes were actually pretty interesting:
diff --git a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
index 29ce7c52e8fd..6fc3e3917a5c 100644
--- a/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
+++ b/test/CodeGen/RISCV/rvv/vsetvli-insert.ll
@@ -352,15 +352,13 @@ entry:
define <vscale x 1 x double> @test18(<vscale x 1 x double> %a, double %b) nounwind {
; CHECK-LABEL: test18:
; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: vsetivli zero, 6, e64, m1, tu, ma
-; CHECK-NEXT: vmv1r.v v9, v8
-; CHECK-NEXT: vfmv.s.f v9, fa0
-; CHECK-NEXT: vsetvli zero, zero, e64, m1, ta, ma
-; CHECK-NEXT: vfadd.vv v8, v8, v8
+; CHECK-NEXT: vsetivli a0, 6, e64, m1, ta, ma
+; CHECK-NEXT: vfadd.vv v9, v8, v8
; CHECK-NEXT: vsetvli zero, zero, e64, m1, tu, ma
; CHECK-NEXT: vfmv.s.f v8, fa0
+; CHECK-NEXT: vfmv.s.f v9, fa0
; CHECK-NEXT: vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT: vfadd.vv v8, v9, v8
+; CHECK-NEXT: vfadd.vv v8, v8, v9
; CHECK-NEXT: ret
In this one diff we have registers being renamed, instructions being
shuffled around, and different vsetvlis being inserted. In other places
we were seeing changes to the number of spills, sometimes more and
sometimes less.
All of these observable side effects are intertwined with the multiple things we needed to change in our patch. And when there are multiple functional changes in a patch, it’s not obvious which line of code caused which line of the test diff to change. Compound that over 300-something test files and you have a patch that no one can realistically review.
The main tool for dealing with these massive patches is to split them up so you’re only changing one thing at a time. The first such change to be split off was the register allocation split.
This was nice and self-contained, and the test diff was only around
130 lines. It was a natural first choice to split out from the patch
since it was tangential to RISCVInsertVSETVLI itself, and even though
it didn’t make a dent in the overall test diff, we could rule out
the register allocation split as one of the forces at play. Getting it
out of the way early allowed us to stop thinking about it and save our
mental effort for other things.
The next step was to go ahead and carve out the NFC changes: No Functional Change changes. This is a common practice in LLVM and other projects where changes that don’t (intentionally) affect any observable behaviour are landed as patches labelled as NFC: things like refactoring or adding test cases.
Knowing that a patch isn’t supposed to have functional changes helps reasoning about it, and if it did end up accidentally changing some behaviour (which happens all the time) then it’s easier to bisect. We were able to find a good few NFC bits from the initial patch:
Some of these NFC patches were just a matter of tightening existing assumptions about the code by adding invariants, whilst others were about removing configurations to reduce the number of code paths.
And as the code diff became smaller, some of the edge cases became more apparent. We refactored things to make them more explicit, which made it clearer when we were dealing with them in the main patch.
All of the above had the same common goal, doing the thinking up front to simplify things further down the line. But whilst it did make the code changes easier to reason about, we still had the very large, very complicated test diff to deal with.
At some point the NFC patches started to dry up, and we had to start tweezing apart the tangle of changes in the test diff.
Ideally each patch would contain only one functional change, but that’s easier said than done, especially on a project like LLVM with so many moving parts. Often you’ll need to touch multiple passes to make one codegen change, and a lot of the time these changes will be interlinked: you need to change something in pass A, but on its own it results in regressions until you land a change in pass B.
We began by setting those tricky changes aside, and instead looked for things that could be landed as isolated, incremental, net-positive (or at worst neutral) changes.
Some of the test diffs were due to deficiencies in the existing code
that just happened to surface when combined with the changes in our
patch. For example, changes in the scheduling showed how we sometimes
failed to merge vsetvlis when they were in a different order.
A significant chunk of the test diffs were due to RISCVInsertVSETVLI
being moved after other passes. For example, moving it past
RISCVInsertReadWriteCSR and RISCVInsertWriteVXRM meant that
vsetvlis were now inserted after csrwis and fsrmis.
diff --git a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
index 5b271606f08ab..aa11e012af201 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll
@@ -15,8 +15,8 @@ define <vscale x 1 x half> @vp_ceil_vv_nxv1f16(<vscale x 1 x half> %va, <vscale
; CHECK-NEXT: vfabs.v v9, v8, v0.t
; CHECK-NEXT: vsetvli zero, zero, e16, mf4, ta, mu
; CHECK-NEXT: vmflt.vf v0, v9, fa5, v0.t
-; CHECK-NEXT: vsetvli zero, zero, e16, mf4, ta, ma
; CHECK-NEXT: fsrmi a0, 3
+; CHECK-NEXT: vsetvli zero, zero, e16, mf4, ta, ma
; CHECK-NEXT: vfcvt.x.f.v v9, v8, v0.t
; CHECK-NEXT: fsrm a0
; CHECK-NEXT: vfcvt.f.x.v v9, v9, v0.t
We could take these relatively boring 9000-ish lines of test diff
separately by moving RISCVInsertVSETVLI just after these passes.
This was one of the patches where the test diff wasn’t actually a win: it was just important to be able to understand the changes in isolation.
Another trick we used to isolate the diffs was to split the pass itself and move part of it past register allocation.
This let us absorb the diffs for a specific function,
doLocalPostpass, which were small enough to be individually reviewable.
This was unfortunately one of the functional changes that introduced a temporary regression, which wouldn’t be fixed until the rest of the work was landed. We justified it though since we were able to understand where it was coming from and how the final patch would fix it.
At this stage, the last remaining change was to do the actual move to
after register allocation, which meant changing the pass to work on
LiveIntervals since we would be out of SSA.
At first glance this appeared to be one indivisible chunk of work, but
it turned out we could defer the changes from the pass reordering till
later by only moving it up past phi elimination. That was enough to
get us out of SSA and into LiveIntervals, which was ultimately the
tricky bit that we wanted to be sure we got right.
This patch had most of the code changes, but a minimal test diff of around 300 lines. Being able to land the riskiest part with a small test diff allowed us to see the code that ended up being affected, which would have been completely hidden otherwise in the original 30k line diff.
After this, we were left with the original patch, which was now just a
matter of moving RISCVInsertVSETVLI past register allocation:
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 5aab138dae408..d9f8222669cab 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -399,6 +406,8 @@ bool RISCVPassConfig::addRegAssignAndRewriteFast() {
bool RISCVPassConfig::addRegAssignAndRewriteOptimized() {
addPass(createRVVRegAllocPass(true));
addPass(createVirtRegRewriter(false));
+ if (EnableVSETVLIAfterRVVRegAlloc)
+ addPass(createRISCVInsertVSETVLIPass());
Looking at the numbers, we were now down to a 19k-line test diff: still a considerable size, but now the code changes were tiny.
This is the converse to #91440: we have a patch with minimal code
changes but a large test diff, and this is also much easier to
review. We can have confidence that all changes in the test diff come
from a single functional change, which in our case was the
interaction of moving RISCVInsertVSETVLI past register allocation
and unblocking the machine scheduler.
None of the above splits were obvious to us at first. For a change like this, the only realistic starting point was to just implement the whole thing. Once we had a picture of what the destination should look like, we could start to identify some initial things to split off.
From there on out it’s mostly a matter of splitting, rebasing and repeating the cycle. New incision points become clearer as both the code and test diffs are whittled down, and hopefully the cycle eventually leads you to a patch small enough that you can digest everything that’s going on inside it.
Splitting up big patches is not a new or rare idea, and it’s done all
the time in LLVM. Take for example the ongoing work to replace
getelementptr with
ptradd, which is
being split up into several smaller changes to canonicalize and
massage the GEPs into shape before the new instruction is added.
Likewise the epic journey to remove debug intrinsics was carried out over multiple RFCs.
Being able to submit reviewable and landable patches is half the battle of collaborative open source software development. There’s no one-size-fits-all strategy and it often ends up being much more difficult than writing the actual code. But if you’re working on something big, hopefully some of the ideas in this post will be of inspiration to you.
Many thanks to Piyou and the reviewers who helped drive this forward, unblocking the way for further enhancements to vector codegen in the RISC-V backend.