<![CDATA[The Coding Nest]]>https://codingnest.com/https://codingnest.com/favicon.pngThe Coding Nesthttps://codingnest.com/Ghost 5.94Mon, 16 Mar 2026 20:57:17 GMT60<![CDATA[Random distributions are not one-size-fits-all (part 2.5)]]>https://codingnest.com/random-distributions-are-not-one-size-fits-all-part-2-5/668198436cedb4034513643cMon, 01 Jul 2024 12:05:28 GMT

I recently realized that I could try to implement the reuse-oriented distributions from part 2 using libdivide instead of native division (modulo). This post will quickly go through the results of benchmarking this approach.

Introduction

If you are not familiar with libdivide, it is a library for optimizing integer division in workloads where you will be using the same runtime[1] divisor multiple times. libdivide does not support computing modulo directly, but we can fake it by substituting n - (n / d) * d for n % d.
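A minimal sketch of the faked modulo, using native division so it stays self-contained (with libdivide, the division below would go through a precomputed libdivide::divider instead; the function name is mine, not libdivide's):

```cpp
#include <cassert>
#include <cstdint>

// Faking modulo via division: n % d == n - (n / d) * d.
// With libdivide, `n / d` would become a multiply + shift against a
// precomputed libdivide::divider<uint64_t>{d}; the identity is the same.
uint64_t mod_via_div(uint64_t n, uint64_t d) {
    uint64_t quotient = n / d;  // the only expensive operation
    return n - quotient * d;    // multiply + subtract recovers n % d
}
```

The extra multiply and subtract are exactly the overhead discussed in the benchmarks below.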

In the distribution reuse workload from part 2, both the OpenBSD and the Java algorithm compute modulo using the same modulus every time, so precomputing a faster division has the potential to improve performance.

Benchmarks

For this post, I just extended the benchmark for generating many random variates from the same range from part 2. For more details, look there.

Results

As in part 2, all running time shown is in time per single element, and you can click all images for a fullscreen version. There is also once again a separate interactive graph with all data and distributions.

Bar graph with the time to generate many numbers from single distributions, OpenBSD with HW division vs OpenBSD with libdivide

Interestingly, using libdivide is not always better. It provides a significant speed-up for small ranges but causes a slowdown for large ranges. This means that for large divisors, the performance of the hardware divider gets close to the performance of multiplication[2].

Bar graph with the time to generate many numbers from single distributions, OpenBSD with libdivide vs Lemire's algorithm

Compared to Lemire's algorithm, OpenBSD + libdivide no longer loses significantly for small ranges. But it also does not win by a meaningful margin for bigger ranges. Technically, it wins against Lemire's with slower multiplication, but that is a deeply unfair comparison; both algorithms now rely on 64x64->128 extended multiplication, so we cannot use the slower implementation for just one of them.

Conclusion

Using libdivide is not a strict win for either the OpenBSD or the Java[3] algorithm. In a world where Lemire's algorithm does not exist, it would be an interesting improvement, as it evens out the performance differences between the different output range sizes.

However, Lemire's algorithm does exist, and using libdivide turns the OpenBSD algorithm into a worse Lemire's algorithm. OpenBSD + libdivide replaces the modulo operation with an extended multiplication, addition, shift, and another multiplication + subtraction. Lemire's algorithm uses a clever insight about extended multiplication + downshift to range reduce the generated random number directly.
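For contrast, the emul + downshift core of Lemire's algorithm is tiny. A sketch, assuming a compiler that provides the non-standard unsigned __int128 (GCC/Clang; the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// The heart of Lemire's algorithm: one extended multiplication plus a
// downshift maps a full-width random x directly into [0, range).
// Assumes unsigned __int128 (a GCC/Clang extension).
uint64_t lemire_reduce(uint64_t x, uint64_t range) {
    return uint64_t((unsigned __int128)x * range >> 64);
}
```

One multiplication and one shift, versus libdivide's multiplication, fix-up adds and shifts, and the extra multiply + subtract for the faked modulo.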



  1. If the divisor is known at compile time, the compiler will optimize it for you. ↩︎

  2. libdivide works by doing extended multiplication between the input and a precomputed constant, followed by some adds and shifts to fix up the results[4]. For our use case, that is, taking modulo, we add extra multiplication and subtraction on top of this. ↩︎

  3. I did not talk about results from Java's algorithm in this post for a simple reason: it was worse for this use case in part 2, and using libdivide has changed nothing about the underlying reason for this. I included the data in the big graph, but the tl;dr is that using libdivide for Java's algorithm is the same as with OpenBSD: better performance for small ranges, worse performance for large ranges. ↩︎

  4. Does this sound weirdly familiar? It should; Lemire's algorithm is based around a similar emul + downshift trick to reduce the input range into a predefined range. ↩︎

]]>
<![CDATA[Random distributions are not one-size-fits-all (part 2)]]>https://codingnest.com/random-distributions-are-not-one-size-fits-all-part-2/65a3048529c70e033d728705Mon, 01 Apr 2024 14:13:42 GMT

Updated 30.06.2024: I noticed a bug in the Java distribution implementation, which changed the performance profile. I reran the benchmarks and updated the text below to reflect reality.

In the previous post, we looked at Lemire's algorithm for generating uniformly distributed integers and how it can be specialized for different use cases. In this post, we will look at different algorithms for generating uniformly distributed random numbers and determine whether they fit specific use cases better than Lemire's algorithm.

Introduction

The original paper that introduced Lemire's algorithm compared it to two other algorithms, dubbed "OpenBSD" and "Java", based on the project they were taken from. Lemire's algorithm has won, by quite a lot. But when I was running the benchmarks for part 1, I noticed that when generating uint64_t, the expected differences decreased, or even disappeared entirely, due to the slow implementation of extended 64x64 -> 128-bit multiplication.

That got me thinking: Would Lemire's algorithm still win if we did not have access to hardware that has native support for 64x64->128 multiplication? Alternatively, how slow would the multiplication have to be for the implementations relying on modulo to win?

Let's first look at the different implementations of extended multiplication we will be using, and then we will look at the alternative algorithms for generating uniformly distributed random integers.

Extended multiplication

This post will benchmark three different approaches towards extended multiplication.

  1. Naive
    This is the approach we all learned in elementary school as "long multiplication". Since we are working on a computer, we will use 32-bit integers as the individual digits. The advantage of this approach is that it is evidently correct and easy to implement.
  2. Optimized
    You can find this approach floating around in places like Hacker's Delight. It also implements long multiplication, but unlike the naive approach, we do not keep to strictly 32-bit digits. This allows us to significantly reduce the number of additions that happen before we get the result. The advantage of this approach is that it is reasonably performant and also widely portable.
  3. Intrinsic
    This approach uses the compiler intrinsic for extended multiplication. This leads to great results[1] on CPUs that natively support 64x64 extended multiplication, but it is not available on all platforms, e.g. 32-bit x86 or ARM.
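As an illustration of the portable option, here is a sketch of an optimized Hacker's Delight-style high-word computation (the exact benchmark code differs; this is my own condensed version):

```cpp
#include <cassert>
#include <cstdint>

// Portable high 64 bits of a 64x64->128 multiplication, long-multiplication
// style with 32-bit digits, but with the carry handling condensed so fewer
// additions sit on the critical path than in a fully naive version.
uint64_t mulhi_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = uint32_t(a), a_hi = a >> 32;
    uint64_t b_lo = uint32_t(b), b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    // Sum the middle column once, keeping only the carry into the high word.
    uint64_t carry = ((lo_lo >> 32) + uint32_t(hi_lo) + uint32_t(lo_hi)) >> 32;
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + carry;
}
```

There are still four multiplications, but the additions are fewer and mostly independent, which is what the assembly comparison below shows.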

You can look at all of these on Compiler Explorer. When you look at the generated assembly, the optimized approach does less work than the naive one: the total number of multiplications is the same, but there are fewer additions and shifts, and the critical dependency chain is shorter.

Obviously, not all ASM instructions are equal, and counting them does not tell the whole story. So, we will benchmark all three approaches and compare their results later.

Alternative algorithms

Let's take a look at the pseudo-implementations of the two alternative algorithms.

OpenBSD

uint64_t openbsd(uint64_t max) {
    uint64_t threshold = (-max) % max;
    uint64_t x;
    do {
        x = random64();
    } while (x < threshold);
    return x % max;
}

As written, this algorithm needs to compute modulo twice per returned number, which virtually ensures terrible performance. The innovation of Lemire's algorithm is that, in most cases, it does not need even a single modulo.

However, if we know we will be generating numbers from the same distribution repeatedly, one of the modulos can be amortized as it depends purely on the distribution range (see part 1). The specialized implementation will require only one modulo per generated number, thus providing reasonable performance.

For use cases where we cannot amortize the first modulo for computing threshold, we can safely assume that this algorithm is not competitive.
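A sketch of the reuse-specialized variant, with the threshold hoisted into a constructor (the class name and the splitmix64 stand-in RNG are my own assumptions, not the benchmark code):

```cpp
#include <cassert>
#include <cstdint>

// splitmix64 as a stand-in RNG (an assumption; not the benchmarks' RNG).
struct splitmix64 {
    uint64_t state = 0x123456789abcdefull;
    uint64_t operator()() {
        uint64_t z = (state += 0x9e3779b97f4a7c15ull);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
        return z ^ (z >> 31);
    }
};

// Hypothetical reuse-specialized OpenBSD variant: the first modulo (the
// rejection threshold) is computed once and amortized over all calls.
struct openbsd_reuse {
    uint64_t max;        // exclusive upper bound, must be nonzero
    uint64_t threshold;  // 2^64 mod max; outputs below it are rejected
    explicit openbsd_reuse(uint64_t bound)
        : max(bound), threshold((0 - bound) % bound) {}
    template <typename Rng>
    uint64_t operator()(Rng& random64) {
        uint64_t x;
        do {
            x = random64();
        } while (x < threshold);
        return x % max;  // the one remaining modulo per variate
    }
};
```

This is the one-modulo-per-variate shape the reuse benchmarks below measure.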

Java

uint64_t java(uint64_t max) {
    uint64_t x = random64();
    uint64_t r = x % max;
    while (x - r > -max) {
        x = random64();
        r = x % max;
    }
    return r;
}

This algorithm needs one modulo per candidate number drawn from the RNG, so usually one per generated number, but occasionally more. Unlike the OpenBSD and Lemire algorithms, there are no optimizations we can apply for generating multiple numbers from the same distribution[2].

We should expect this to be about equal to the OpenBSD algorithm for generating many numbers in the same distribution, and much better for generating numbers from different distributions.
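To make the "one modulo per candidate" point concrete, here is an instrumented sketch that counts modulo operations (the counting parameter and the splitmix64 stand-in RNG are my additions):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// splitmix64 as a stand-in RNG (an assumption; not the benchmarks' RNG).
struct splitmix64 {
    uint64_t state = 42;
    uint64_t operator()() {
        uint64_t z = (state += 0x9e3779b97f4a7c15ull);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
        return z ^ (z >> 31);
    }
};

// Java's algorithm, instrumented to count modulo operations:
// exactly one per candidate drawn from the RNG.
uint64_t java_counting(uint64_t max, splitmix64& rng, size_t& modulo_ops) {
    uint64_t x = rng();
    uint64_t r = x % max;
    ++modulo_ops;
    while (x - r > 0 - max) {  // reject candidates from the incomplete bucket
        x = rng();
        r = x % max;
        ++modulo_ops;
    }
    return r;
}
```

For small bounds, rejections are vanishingly rare, so the count stays at almost exactly one modulo per returned number.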

Benchmarks

Once again, all the benchmarking code is on GitHub. In fact, it is in the same repository. For this post, we have three different benchmarks:

  1. Extended multiplication
  2. Generating multiple random numbers from the same range
  3. Generating random numbers from different ranges

Setup

Just as with the previous post, the numbers in this post were taken from my personal machine with a Ryzen 7 4750 CPU, but for this post, I am using slightly newer MSVC in version 19.39.33520 (this MSVC version comes installed with VS 17.9.1) and compiling for CMake-generated Release configuration[3].

When benchmarking extended multiplication, we generate two arrays of varying sizes, multiply each pair, and sum the results to ensure that the computation is not optimized away. We do this because we cannot benchmark the multiplication of a single pair of numbers, as the overhead from Catch2's benchmarking becomes a significant part of the total runtime.

We use the same basic setup as in the previous part for the distribution benchmarks. There will be one benchmark where we generate a number from a different distribution every time and one where we generate many numbers from the same distribution. However, unlike the previous post, we do not attempt to model some real-world usage and instead look for bounds where Lemire's algorithm loses.

Because the previous post has shown that we can gain performance from specializing the implementation for generating multiple numbers from one distribution, I implemented each distribution in this post twice, once as standard for the no-reuse benchmark and once specialized for the reuse benchmark.

Results

All running time shown is in time per single element. You can click all images for a fullscreen version.

Extended multiplication

Bar graph with the time to multiply two 64-bit numbers into a 128-bit result. You can also play around with the interactive version of the graph.

We see some slowdown for long inputs. This slowdown is not interesting to us, as it is caused by running out of space in the lower levels of cache, and in distribution code we will only ever multiply two numbers at once. Looking at the results for small inputs, we see that naive multiplication is more than 4 times slower than getting the compiler to "do the right thing" using intrinsics. Even with the optimized implementation, the portable code still carries a 2.6x slowdown.

This shows that other algorithms might be able to win if we do not use native 64x64 extended multiplication.

Single-use distributions

This is the case where we generate every number from a different range.

Bar graph with the time to generate numbers from different distributions. You can also play around with the interactive version of the graph.

As expected, the OpenBSD algorithm is not competitive for this use case. It takes roughly twice as long as the Java one because it pays the cost of the modulo twice per number. But Java is not competitive either. Lemire's algorithm is much faster even when it uses the naive extended multiplication implementation, and with the best extended multiplication implementation, it is roughly twice as fast.

Let's take a look at generating a lot of numbers from the same range instead.

Reused distributions

Updated 30.06.2024: I noticed a bug in the Java distribution implementation, which changed its performance profile. I reran the benchmarks and updated the text below to reflect reality.

This is the case where we generate a lot of numbers from the same range.

I originally collected a lot more data than is shown below, but I could not make the graph with all of them readable in image form. If you want to see the data in detail, here is an interactive graph.

Bar graph with the time to generate many numbers from single distributions. You can also play around with the interactive version of the graph.

The first thing we see here is that the time needed to generate one uniformly distributed number varies across different bounds. There are two reasons for this. The first is that the bounds change how many numbers from the underlying random number generators will be rejected[4]. The second reason is that the DIV[5] instruction has variable latency based on the input values.

Because Lemire's algorithm does not use modulo, the differences in its runtime come purely from the effect of rejection frequency. For N == UINT64_MAX / 2 + 1, each output from the random number generator has only \(\approx 50\%\) chance of being usable, which causes the massive spike there. Both OpenBSD and Java algorithms use modulo, and we can see that their runtime decreases[6] even between bounds with roughly the same amount of rejection.
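The rejection fraction for a given bound is easy to compute directly; a small sketch using the same -N % N trick as the threshold computation (the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// Number of raw 64-bit RNG outputs (out of 2^64) that must be rejected
// for bound n: 2^64 mod n, computed as (0 - n) % n in uint64_t arithmetic.
uint64_t rejected_count(uint64_t n) {
    return (0 - n) % n;
}
```

For N == UINT64_MAX / 2 + 1, i.e. 2^63 + 1, this gives 2^63 - 1, so nearly half of all RNG outputs are rejected, which is exactly the spike in the graph.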

When we compare the distributions between themselves, we can see that the OpenBSD algorithm consistently wins over Java. This is a reversal of the results from the previous use case and is caused by the Java algorithm needing to compute one modulo per candidate number from a random number generator. In comparison, OpenBSD only needs one modulo per returned variate.

We can also see the OpenBSD algorithm slightly winning over Lemire's algorithm for specific, very high bounds. However, Lemire's algorithm is the better option for a generic use case, as it outperforms OpenBSD significantly for low bounds and performs equally or only slightly worse for high bounds.

The situation would be more complicated if we forced Lemire's algorithm to use portable extended multiplication. In this case, the OpenBSD algorithm achieves performance parity with Lemire's at lower bounds than against intrinsic extended multiplication and, for higher bounds, actually outperforms it.

Conclusion

In the shuffle-like workload, Lemire's algorithm outperforms both Java and OpenBSD algorithms by a significant margin. In fact, OpenBSD performs so badly for this workload type that if you have to pick a single algorithm for all your needs, you should never pick the OpenBSD algorithm.

The OpenBSD algorithm can outperform Lemire's when generating many numbers from the same, very large, range. However, the biggest measured win for OpenBSD is ~8%, while it loses by much bigger margins for small ranges, so Lemire's algorithm is the better workload-independent choice. Still, we can gain performance by picking different algorithms based on the size of the range from which we generate. This links back to my old claim that an ideal <random> library should expose the different distribution algorithms under their own names rather than under abstract names based on their statistical properties. Back then, I based this opinion on allowing future extensions to the standard library as people invent better algorithms for generating random numbers with specific properties, but these results provide another reason: usage-dependent performance differences.


That's it for this part. The next part will be about generating uniformly distributed floating point numbers and how different algorithms lead not just to different performance, but to fundamentally different generated numbers.



  1. For x64 the emul turns into a single mul instruction (+ few movs to move the results into the right registers), while for ARM64 it turns into a umulh and mul pair. ↩︎

  2. Technically, when writing a C++-style distribution object, we can precompute the max value from the user-provided [a, b] range. This optimization has no measurable effect, but I did this in the linked benchmark code anyway, just for completeness's sake. ↩︎

  3. Visual Studio solutions generated by CMake do not have LTO enabled for Release builds. However, all relevant functions are visible and inlinable in the benchmarks, so there should be no difference from this. ↩︎

  4. Remember that generating uniformly distributed random numbers from a random bit generator requires a range reduction and rejecting the "surplus" numbers that cannot be fully mapped to the target range. ↩︎

  5. On x64, div returns the division result and the remainder. On ARM64, udiv returns only the division result (and needs to be followed by msub to get the remainder), but it also has variable latency. ↩︎

  6. The larger the divisor, the faster the modulo runs. I threw together a benchmark for modulo performance over the different bounds. Here are the results as an interactive graph (and as a still image). ↩︎

]]>
<![CDATA[Random distributions are not one-size-fits-all (part 1)]]>https://codingnest.com/random-distributions-are-not-one-size-fits-all-part-1/65778b060e4757199e312a8dSat, 13 Jan 2024 23:52:51 GMT

Recently, I was once again on CppCast, and once again[1], I talked about Catch2 and C++'s standard library support for random numbers. One of the things I said was that random distributions are not one-size-fits-all.

This post is about how this applies to Lemire's algorithm, which is the state of the art for generating uniformly distributed integers in specific intervals.

Introduction

In C++, <random> separates the concern of generating random numbers between Random Number Engines, which generate integers in some predefined range, and distributions, which turn these into other numbers in the range requested by the user. This is a powerful, expert-friendly design, even though the actual standardized library leaves a lot to be desired.

Notably, the distributions are specified as types and are allowed to be stateful. This means that a distribution could be written to optimize for repeatedly generating numbers from the same instance rather than generating a single number as fast as possible[2].

This post will look specifically at Lemire's algorithm for generating uniformly distributed integers. This is because I recently implemented reproducible uniform integer distribution for Catch2 using Lemire's algorithm, so I can use my own code for the benchmarks[3].

Lemire's algorithm is the current state of the art for generating uniformly distributed random integers in an interval [0, max)[4]. I won't go over the proof of the algorithm in this post; if you want to know why it works, go read the paper, but we will look at a pseudo-implementation.

uint32_t lemire( uint32_t max ) {
  uint64_t long_mult = random32() * uint64_t(max);
  uint32_t low_bits = uint32_t(long_mult);
  // Cheap probabilistic check
  if (low_bits < max) {
    uint32_t rejection_threshold = -max % max;
    // Precise check
    while (low_bits < rejection_threshold) {
      long_mult = random32() * uint64_t(max);
      low_bits = uint32_t(long_mult);
    }
  }
  return long_mult >> 32;
}

The core trick is in the probabilistic check guarding the expensive modulo operation used to calculate precise rejection threshold. The chance that we will need the modulo is \( \frac{\textbf{max}}{2^{32}} \), which translates to "basically zero" for a lot of the common use cases[5].

But we can also end up computing rejection_threshold on every invocation. And if you look at how we compute it, you will notice that it does not depend on the generated random number, only on max. This is where a stateful object can optimize the code for generating multiple numbers; it can compute rejection_threshold once and use it every time it generates a new number.

Will precomputing the threshold pay off? Let's write some benchmarks and see.
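A sketch of what such a reuse-oriented variant looks like, with rejection_threshold moved into the constructor (the class name and the xorshift stand-in RNG are my assumptions; Catch2's actual implementation differs):

```cpp
#include <cassert>
#include <cstdint>

// xorshift32 as a stand-in 32-bit RNG (an assumption, not Catch2's RNG).
struct rng32 {
    uint32_t state = 0x12345678u;
    uint32_t operator()() {
        uint32_t x = state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return state = x;
    }
};

// Hypothetical reuse-oriented variant of the pseudo-implementation above:
// the threshold is precomputed once, so the per-call probabilistic check
// and the expensive modulo disappear from the hot path entirely.
struct lemire_reuse32 {
    uint32_t max;        // exclusive upper bound, must be nonzero
    uint32_t threshold;  // (2^32 - max) % max, precomputed once
    explicit lemire_reuse32(uint32_t bound)
        : max(bound), threshold(uint32_t(0 - bound) % bound) {}
    template <typename Rng>
    uint32_t operator()(Rng& random32) {
        uint64_t long_mult = random32() * uint64_t(max);
        while (uint32_t(long_mult) < threshold) {  // precise check only
            long_mult = random32() * uint64_t(max);
        }
        return uint32_t(long_mult >> 32);
    }
};
```

Each call is now just a multiplication, a shift, and a cheap compare, at the cost of one division paid up front.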

Benchmarks

Please note that while this post has benchmark results for uint32_t and uint64_t, the latter are not representative of an optimized implementation. Lemire's algorithm uses extended multiplication, and the implementation taken from Catch2 uses naive long multiplication. This means that the results for uint64_t are much slower than they would be with an optimized implementation. They can still show the trend, but the performance difference between the two implementations is smaller than it should be.

Setup

All the code for the benchmarks is publicly available on GitHub. I took the code for the RNG and Lemire's algorithm from Catch2, the benchmarks themselves are done using Catch2's micro benchmarking support.

There are three benchmarks:

  1. Stacked heavily for reuse.
    The larger the target interval, the greater the chance Lemire's algorithm will need to compute the modulo. Thus, we will generate numbers in the interval [0, numeric_limits<type>::max() - 1)[6], and generate a lot of them.
  2. Stacked heavily against reuse.
    As long as we never reuse the distribution object instance, there is no advantage to precomputing the modulo. If we also always create the distribution object with a new bound, this corresponds to how it would be used during a Fisher-Yates shuffle.
  3. Benchmark based on use in Catch2.
    Catch2 internally uses uniform int distribution in two places. Both places keep reusing the same distribution but with significantly different bounds and expected number of invocations.

For every benchmark, we have three contenders:

  1. Just the RNG
    This does not generate the required numbers, but it acts as a baseline to tell us what the overhead of the distribution algorithm is.
  2. Plain implementation of Lemire's algorithm
  3. Implementation of Lemire's algorithm with precomputed thresholds

The numbers in this post were taken from my personal machine with a Ryzen 7 4750 CPU, using MSVC in version 19.38.33133 (this MSVC version comes installed with VS 17.8.3) and compiling for CMake-generated Release configuration[7].

Results

Benchmark stacked for reuse

uint32_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000 271.75 ± 2.34 us 291.29 ± 3.57 us 292.30 ± 4.15 us 1.00
1'000'000 2.70 ± 0.01 ms 2.94 ± 0.05 ms 2.92 ± 0.03 ms 1.01
10'000'000 27.06 ± 0.05 ms 29.26 ± 0.16 ms 29.33 ± 0.16 ms 1.00

uint64_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000 430.37 ± 8.55 us 545.54 ± 11.43 us 540.70 ± 10.84 us 1.01
1'000'000 4.36 ± 0.03 ms 5.42 ± 0.04 ms 5.48 ± 0.04 ms 0.99
10'000'000 43.37 ± 0.09 ms 54.46 ± 0.11 ms 54.45 ± 0.13 ms 1.00

As we can see, when we stack the deck towards the reuse-optimized implementation, the reuse-optimized implementation performs …

… exactly the same as the basic implementation.

Wait what?

Benchmarking is hard: the optimizer is good at its job

As it turns out, when the optimizer sees code like this

constexpr uint32_t high = std::numeric_limits<uint32_t>::max()-1;
BENCHMARK("noreuse, iters=" + std::to_string(iters)) {
	uint32_t sum = 0;
	lemire_algorithm_no_reuse<uint32_t> dist(0, high);
	for (size_t n = 0; n < iters; ++n) {
		sum += dist(rng);
	}
	return sum;
};

it uses the fact that high is statically known to inline everything and precompute the rejection threshold during compilation, so there is no div instruction in the resulting code. Thus, the plain implementation never pays extra cost by recomputing the modulo and cannot lose.

You can play with a simplified example on Compiler Explorer. To keep the generated ASM short, the code only generates a single number, and the RNG implementation is opaque to avoid inlining it.

Since the compiler does more or less the same for the reuse-optimized implementation, neither has an advantage when the target interval is statically known. So, to measure the effect we care about, we have to hide the interval from the compiler. In the benchmark code, this is done by calling an opaque function, which returns the desired value but cannot be inlined due to lack of LTO.
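The benchmark code hides the bound behind a non-inlinable function in another translation unit; a self-contained alternative sketch of the same idea uses a volatile read (my own illustration, not the benchmark's mechanism):

```cpp
#include <cassert>
#include <cstdint>

// Hide a compile-time-known value from the optimizer: the volatile read
// forces the compiler to treat the value as a runtime unknown, so it can
// no longer fold the rejection threshold at compile time.
uint32_t opaque(uint32_t value) {
    volatile uint32_t hidden = value;
    return hidden;
}
```

Passing the interval bound through such a function restores the div instruction that the earlier Compiler Explorer example showed being optimized away.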

Benchmark results revisited

Benchmark stacked for reuse (again)

uint32_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 271.68 ± 4.14 us 335.53 ± 3.65 us 275.80 ± 4.79 us 1.22 15.50
1'000'000 2.69 ± 0.01 ms 3.38 ± 0.01 ms 2.74 ± 0.01 ms 1.23 13.80
10'000'000 26.89 ± 0.05 ms 33.76 ± 0.04 ms 28.00 ± 0.04 ms 1.21 6.19

uint64_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 418.77 ± 6.33 us 638.53 ± 11.42 us 588.60 ± 11.42 us 1.08 1.29
1'000'000 4.35 ± 0.04 ms 6.26 ± 0.04 ms 6.07 ± 0.03 ms 1.03 1.11
10'000'000 42.51 ± 0.09 ms 64.74 ± 0.12 ms 59.53 ± 0.12 ms 1.09 1.31

As we can see, after we stop the optimizer from optimizing out the important part, precomputing the modulo speeds up the random number generation by \(\approx 20\%\), with the distribution itself being up to \(15\times\) faster.[8] Remember that this benchmark is the best case for the stateful implementation, as we stacked everything in its favour.

Let's take a look at the opposite benchmark.

Benchmark stacked against reuse

uint32_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 263.53 ± 3.69 us 284.64 ± 5.10 us 525.50 ± 3.14 us 0.54 0.08
1'000'000 2.63 ± 0.01 ms 2.86 ± 0.02 ms 4.90 ± 0.01 ms 0.58 0.10
10'000'000 26.39 ± 0.20 ms 28.37 ± 0.08 ms 44.97 ± 0.04 ms 0.63 0.11

uint64_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 449.11 ± 10.71 us 630.46 ± 12.96 us 928.24 ± 6.31 us 0.68 0.38
1'000'000 4.54 ± 0.04 ms 6.27 ± 0.04 ms 8.96 ± 0.03 ms 0.70 0.39
10'000'000 44.84 ± 0.14 ms 62.77 ± 0.10 ms 84.98 ± 0.07 ms 0.74 0.45

Once again, the difference in the overhead of the two implementations is massive, with plain Lemire being 9 to 12 times faster. This boils down to this benchmark being an inversion of the first one. Where the first benchmark forced the plain Lemire implementation to compute modulo for every number and let the reuse variant compute it exactly once, this benchmark forces the reuse variant to always compute modulo while plain Lemire avoids the computation most of the time.

The difference in the total runtime is much more significant this time around, with the reuse variant being 60-80% slower. This is a terrible result and shows that the reuse variant does not make a good default.

Benchmark(s) from Catch2

The two places in Catch2 needing uniformly distributed integers are generating uniformly distributed floating point numbers[9], and resampling benchmark measurements. The interval for the former depends on the range the user asks for, but we will use \([0, 33554430]\) and \([0, 18014398509481982]\) for the benchmark[10]. These numbers present the upper bound on what we can expect and thus are most favourable for the reuse implementation. The number of iterations is unknown, but generally, I would expect it to be low (\(<1000\)).

Resampling benchmark measurements defaults to interval \([0, 100)\) and generates 10M random numbers with the one instance it creates[11]. The user can customize both of these, but generally they won't.

Generating floats

uint32_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1 2.64 ± 0.11 ns 2.86 ± 0.10 ns 4.06 ± 0.06 ns 0.70 0.15
5 13.20 ± 0.45 ns 13.89 ± 0.76 ns 13.70 ± 0.23 ns 1.01 1.38
10 26.30 ± 0.37 ns 27.60 ± 0.42 ns 27.06 ± 0.45 ns 1.02 1.71
100 262.76 ± 3.77 ns 276.58 ± 3.71 ns 270.62 ± 0.52 ns 1.02 1.76
1000 2.63 ± 0.03 us 2.75 ± 0.05 us 2.69 ± 0.04 us 1.02 2.00

uint64_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1 3.93 ± 0.13 ns 8.34 ± 0.36 ns 8.44 ± 0.10 ns 0.99 0.98
5 20.08 ± 0.31 ns 31.09 ± 0.50 ns 31.35 ± 0.80 ns 0.99 0.98
10 40.48 ± 0.94 ns 59.29 ± 0.70 ns 60.26 ± 4.21 ns 0.98 0.95
100 407.88 ± 14.16 ns 575.40 ± 8.08 ns 580.08 ± 10.15 ns 0.99 0.97
1000 4.08 ± 0.08 us 5.74 ± 0.23 us 5.69 ± 0.19 us 1.01 1.03

Unsurprisingly, when you only want to get a single random (floating point) number, the plain Lemire algorithm has a large advantage. What is surprising is how few we need to generate for the reuse implementation to catch up and eventually gain an advantage. Even 1000 numbers should be too low to require the plain implementation to recompute the division repeatedly.

For uint64_t, we see the slooow multiplication implementation hide any other difference.

Benchmark resampling

uint32_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000 2.63 ± 0.01 ms 2.72 ± 0.01 ms 2.71 ± 0.01 ms 1.00 1.13
10'000'000 26.33 ± 0.04 ms 27.21 ± 0.05 ms 27.19 ± 0.04 ms 1.00 1.05
100'000'000 263.29 ± 0.17 ms 271.99 ± 0.17 ms 271.77 ± 0.16 ms 1.00 1.03
1'000'000'000 2.63 ± 0.01 s 2.83 ± 0.14 s 2.79 ± 0.02 s 1.01 1.25

uint64_t

n RNG plain Lemire reuse Lemire \(\frac{\textbf{plain}}{\textbf{reuse}} \) \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000 4.18 ± 0.01 ms 5.79 ± 0.02 ms 5.84 ± 0.01 ms 0.99 0.97
10'000'000 41.87 ± 0.04 ms 57.77 ± 0.15 ms 58.41 ± 0.07 ms 0.99 0.96
100'000'000 418.45 ± 0.23 ms 577.70 ± 0.58 ms 585.25 ± 1.64 ms 0.99 0.95
1'000'000'000 4.19 ± 0.02 s 5.80 ± 0.02 s 5.78 ± 0.01 s 1.00 1.01

For benchmark resampling, there are minimal differences between the two implementations. The plain Lemire implementation is very unlikely to pay the extra division cost due to the small upper bound, and the reuse implementation easily amortizes the additional cost over the many generated numbers. When generating a lot[12] of numbers, the reuse implementation starts winning again, but only by small margins.

Conclusion

So, what did we learn from the benchmarks?

First, if the compiler can see the interval bounds during compilation time, it does not matter which of the two shown implementations we pick, because it will optimize the division away in either case. In other words, a function like this

uint32_t roll_D6(RNG& rng) {
    lemire_algorithm<uint32_t> dist(1, 6);
    return dist(rng);
}

will always perform optimally with either implementation. But when the compiler cannot see the bounds, their performance difference can be massive. If we craft benchmarks specifically to advantage one implementation over the other, we see more than a 10x difference between the overhead imposed by the two implementations.

From how we crafted the edge case benchmarks, we also see that if we can expose only one implementation, the implementation optimized for reuse is a bad default. The benchmark favouring reuse is contrived, and as we make it less so, the advantage of the reuse implementation decreases. On the other hand, the benchmark favouring plain Lemire is based on a real-world usage pattern representing the Fisher-Yates shuffle.

We have also learned that Lemire's algorithm strongly relies on fast extended multiplication. If we do not have optimized extended multiplication for a specific width, other implementation considerations do not matter, and Lemire's algorithm becomes much slower.

From the workload-specific benchmarks, we've learned that Catch2 is indeed better off with the reuse-optimized implementation.[13] Catch2 does not use random shuffle[14], and while the performance improvement will be slight, it is functionally free because Catch2 had to implement its own distribution anyway.


That's it for this post. I will write at least one follow-up post about the case where we care about the statistical properties of the generated numbers but not their reproducibility. This means that we can use completely different algorithms, rather than looking into different implementation strategies for the same one.



  1. Previously, I was on for episode 248, and the topic was "Catch2 and std::random". ↩︎

  2. Technically, some distributions already generate more than one number at a time, e.g. the Marsaglia polar method generates two normally distributed numbers per run. ↩︎

  3. I also implemented reproducible uniform floating point distribution for Catch2. This might come up later in this post. ↩︎

  4. The transformation from a [0, max) interval to an [a, b) interval is trivial; generate from [0, b - a) and add a to the result. For [a, b], you use max = b - a + 1. ↩︎

  5. Consider simulating a fair D6. In that case, \(\text{max} = 6\), and the chance of any call having to calculate the exact rejection threshold is \(\frac{3}{2^{31}}\). But even in less nice cases, the chance that we have to calculate the modulo is small. When randomly shuffling 100k elements, there is still \(\approx 31\%\) chance of not having to compute modulo at all[15]. ↩︎

  6. Why numeric_limits::max() - 1 and not numeric_limits::max()? Because when all values in the type are valid, returning bits from the random number generator directly is a common optimization. Thus, we ask for the largest number in type to be invalid to force the actual rejection algorithm to run. ↩︎

  7. Visual Studio solutions generated by CMake do not have LTO enabled for Release builds. However, all relevant functions are visible and inlinable in the benchmarks, so there should be no difference from this. ↩︎

  8. Why look at both the total runtime and the distribution overhead? Because both are interesting. The overall runtime is what the end user sees, but the speed of the RNG largely dominates it. Meanwhile, the difference in the distribution overhead tells us what effect the implementation choices have, but it does not tell us whether the difference matters in practice. ↩︎

  9. Catch2 implements Frédéric Goualard's algorithm for drawing uniformly distributed floating point numbers in an interval. As part of the process, it generates a uniformly distributed integer first. ↩︎

  10. \(33554430\) (or \(2^{25} - 2\)) is how many equidistant floats are there between -FLT_MAX and FLT_MAX. \(18014398509481982\) (or \(2^{54} - 2\)) is how many equidistant doubles are there between -DBL_MAX and DBL_MAX. ↩︎

  11. This is because the default is to perform 100k resamples of 100 samples each. Selecting each sample requires generating a random number first. ↩︎

  12. Originally, I thought benchmarking with 1'000'000'000 numbers was unreasonable, but when I was writing this post, someone came onto Catch2's Discord for help with slow benchmarks. They set benchmarking samples to 2000, which means that just preparing all the resamplings required generating 2 (short) billion numbers from the same distribution. ↩︎

  13. But I do need to implement 64 -> 128-bit multiplication before I can use it for resampling benchmark results, or it will be a significant slowdown from the currently used std::uniform_int_distribution. ↩︎

  14. Catch2 supports random shuffling of tests, but it is not implemented as a random shuffle to guarantee subset invariance during the shuffle. ↩︎

  15. You could also decrease the chance of needing modulo to almost zero by using 64-bit indices to shuffle the elements, but doubling the output width of the generator will likely cost you more than the occasional modulo would. ↩︎

]]>
<![CDATA[The Little Things: The Missing Performance in std::vector]]>https://codingnest.com/the-little-things-the-missing-performance-in-std-vector/64d143120e4757199e3123deSun, 27 Aug 2023 15:36:33 GMT

std::vector is often said to be the default container. This is because it provides a simple abstraction over a growable contiguous chunk of memory, thus providing good baseline performance for common operations, such as adding more elements or iterating over them.

However, many non-standard reimplementations of std::vector have emerged over time. These are usually driven by performance considerations, e.g. avoiding allocations for small vectors[1], having static capacity[2], tightly integrating with the OS for faster reallocations[3], or optimizing moving elements with bitwise copying[4].

Most of these cannot be merged into std::vector without changing it significantly, e.g. small vector optimization completely changes the expected complexity of moving the vector[5]. Others require further library and language changes, such as support for bitwise moves. But recently, I ran into a simple API change that can improve the performance of a common usage pattern by 10+%.

Let's consider a simple function that takes one vector and returns a transformation of the input.

std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}

As I mentioned in a previous article, this implementation leaves a bunch of performance on the table. We can easily make it faster by reserving enough memory for indices ahead of time, like so

std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    indices.reserve(keys.size());
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}

But this code still leaves some performance on the table, depending on the compiler[6] and stdlib implementation. So, how can we fix this?

We cannot, not without changing std::vector. The lost performance comes from the fact that vector::push_back has to check whether it needs to grow the vector, even though it will never happen in the convert_to_indices function. A sufficiently smart compiler could optimize this out, but neither GCC nor Clang do (and MSVC makes a complete hash of things).

What we need is a function that inserts a new element into the vector, but does not check for capacity[7]. I like the name push_back_unchecked for this functionality, so I will use it in the rest of this post.
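To make the idea concrete, here is a hypothetical sketch of what such a member could look like. The member names and layout are made up for this illustration; any real implementation will differ in details.

```cpp
#include <cstddef>
#include <new>

// Sketch of an unchecked push_back: the caller promises there is
// spare capacity, so no grow-check branch is needed.
template <typename T>
struct tiny_vector_view {
    T* m_first = nullptr; // start of storage
    T* m_next = nullptr;  // one past the last constructed element
    T* m_end = nullptr;   // one past the allocated storage

    // Precondition: m_next != m_end. Violating it is UB, which is
    // exactly the trade-off discussed later in this post.
    void push_back_unchecked(T const& elem) {
        // No capacity branch: construct in place and bump the pointer.
        ::new (static_cast<void*>(m_next)) T(elem);
        ++m_next;
    }

    std::size_t size() const { return std::size_t(m_next - m_first); }
};
```

The point of the exercise is the body of push_back_unchecked: one placement-new and one pointer increment, with nothing for the optimizer to reason about across loop iterations.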

Benchmarks

How much of a difference does using push_back_unchecked make? To figure this out, we would need a reimplementation of std::vector, which also includes push_back_unchecked. So I made a very simple one. It only supports push_back and push_back_unchecked and a few miscellaneous functions like size or reserve.

Let's run two different benchmarks, both following the basic structure from the snippet earlier, that is, create an output vector, reserve enough space for all elements, and then add the elements to the output vector.

The transformation operation is going to be

  1. The Key -> index conversion snippet from earlier
  2. Multiplying every element by 2

For each of these, we will compare the performance of std::vector::push_back, nonstd::vector::push_back, and nonstd::vector::push_back_unchecked[8].

You can check the benchmark implementation in the GitHub repository.

I ended up collecting results for the three main compilers, specifically MSVC 19.29.30151[9], GCC 10.5, and Clang 15.0.7 (using libc++-15)[10]. GCC and Clang were run under WSL2 and MSVC on Windows 10 directly.

Results

I collated the results for all benchmarks into one table per compiler.

  • std column is the result of using std::vector::push_back
  • nonstd is the result of using nonstd::vector::push_back
  • unchecked is the result of using nonstd::vector::push_back_unchecked
  • \(\frac{\textbf{std}}{\textbf{nonstd}}\) is the speedup from using nonstd::vector::push_back instead of std::vector::push_back
  • \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) is the speedup from using nonstd::vector::push_back_unchecked instead of nonstd::vector::push_back

Generally, we want to see \(\frac{\textbf{std}}{\textbf{nonstd}}\) around 1, and \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) above 1.

Clang

benchmark n std nonstd unchecked \(\frac{\textbf{std}}{\textbf{nonstd}}\) \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\)
copy + mult 1'000 540 ± 48 ns 551 ± 27 ns 93 ± 2 ns 0.98 5.92
100'000 51 ± 2.0 us 51 ± 4.1 us 10 ± 3.0 us 1.00 5.10
10'000'000 9.5 ± 0.2 ms 9.5 ± 0.3 ms 7.3 ± 0.7 ms 1.00 1.30
to_index 1'000 460 ± 19 ns 470 ± 40 ns 168 ± 10 ns 0.98 2.80
100'000 44 ± 3.5 us 44 ± 8.9 us 15 ± 4.6 us 1.00 2.93
10'000'000 14.2 ± 0.1 ms 14.5 ± 0.2 ms 13.1 ± 0.1 ms 0.98 1.11

The results from Clang+libc++ are about what we want to see. The performance of using plain push_back is roughly the same between nonstd::vector and libc++'s std::vector, and push_back_unchecked provides improvements ranging from significant for N == 10'000'000 to massive for smaller inputs.

GCC

benchmark n std nonstd unchecked \(\frac{\textbf{std}}{\textbf{nonstd}}\) \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\)
copy + mult 1'000 546 ± 12 ns 540 ± 17 ns 167 ± 8 ns 1.01 3.23
100'000 56 ± 11.4 us 55 ± 4.9 us 12 ± 2.9 us 1.02 4.58
10'000'000 9.7 ± 0.2 ms 9.7 ± 0.1 ms 7.5 ± 0.8 ms 1.00 1.29
to_index 1'000 528 ± 29 ns 552 ± 91 ns 175 ± 17 ns 0.95 3.15
100'000 51 ± 7.7 us 52 ± 8.2 us 15 ± 4.8 us 0.98 3.47
10'000'000 15.1 ± 0.1 ms 15.2 ± 0.1 ms 13.4 ± 1.6 ms 0.99 1.13

The results from GCC+libstdc++ are very similar to those from Clang+libc++. Compared to Clang's results, there is a slightly greater difference between using std::vector and nonstd::vector, but the differences are still minor. GCC does seem to be worse at taking advantage of push_back_unchecked in the copy + mult benchmark, providing significantly worse performance for small input sizes.

MSVC

benchmark n std nonstd unchecked \(\frac{\textbf{std}}{\textbf{nonstd}}\) \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\)
copy + mult 1'000 792 ± 36 ns 599 ± 32 ns 525 ± 33 ns 1.32 1.14
100'000 73 ± 2.4 us 54 ± 3.2 us 46 ± 3.9 us 1.35 1.17
10'000'000 24.0 ± 2.6 ms 13.3 ± 5.1 ms 11.5 ± 4.1 ms 1.80 1.16
to_index 1'000 791 ± 39 ns 563 ± 35 ns 397 ± 23 ns 1.40 1.42
100'000 91 ± 31.5 us 50 ± 5.9 us 34 ± 9.5 us 1.82 1.47
10'000'000 29.4 ± 4.3 ms 20.4 ± 7.3 ms 18.4 ± 6.6 ms 1.44 1.11

The results for MSVC are, in a way, the most interesting ones. This is because nonstd::vector already significantly outperforms MS' std::vector. MSVC also does not take advantage of push_back_unchecked to generate massively faster code for small inputs the way GCC and Clang do.

Conclusion

From the benchmarks, push_back_unchecked provides a nice (10+%) improvement for a common usage pattern. The improvement could even be called "massive" for specific compiler and input size combinations. Obviously, it is not a panacea; the improvement comes from simplifying the code the compiler has to reason about when adding a new element to the vector. As adding the new element (or computing the new element) becomes more expensive, the total improvement tends to zero.

The trade-off for the increased performance is a whole new way to trigger UB when using a vector. Is this trade-off worth it? I think it certainly is, and we should add something like this to the actual std::vector. There is even a precedent for this in the standard library, std::basic_string::resize_and_overwrite[11], which also allows the user to bypass the capacity checks when operating on a basic_string instance.

And finally, push_back is not the only member function on std::vector we could apply this treatment to. emplace_back is an obvious candidate here, as it also operates on individual elements and is often used in cases where we know the required capacity ahead of time. To me, the question mark is over things like assign and other range-based functions. On the one hand, doing one capacity check per the whole range should be cheap enough not to matter. On the other, if we add push_back_unchecked, adding, e.g. assign_unchecked would be symmetrical, and there is no doubt that it would help someone out.

Other lessons learned

Having learned my lesson from trying to improve the performance of Catch2, I always benchmarked my own vector implementation against std's. These baseline benchmarks have paid off almost immediately because in the first implementation, my vector's push_back was almost twice as slow as libstdc++'s std::vector's.

The code below is the first implementation of push_back I had, see if you can spot the issue:

    void push_back(T&& elem) {
        if (m_next != m_end) {
            Detail::construct_at(m_next, std::move(elem));
            ++m_next;
            return;
        }
        const auto current_size = size();
        const auto new_cap = calculate_new_cap(current_size + 1);
        auto* new_memory = Detail::allocate_uninit<T>(new_cap);
        // Constructing new element has to happen before we move
        // old elements, in case someone is doing `v.push_back(v[0])`
        try {
            Detail::construct_at(new_memory + current_size, std::move(elem));
        }
        catch (...) {
            Detail::deallocate_no_destroy(new_memory, new_cap);
            throw;
        }
        m_next = std::uninitialized_move(m_first, m_next, new_memory);
        // Account for the inserted element
        ++m_next;
        Detail::deallocate_no_destroy(m_first, current_size);
        m_first = new_memory;
        m_end = m_first + new_cap;
    }

There is no real issue if you define real issue as code that is inefficient on its own. However, when I benchmarked this implementation against libstdc++'s std::vector, I found that its performance was up to 4x worse than the actual std::vector.

Why? Inlining is the most important optimization in the world of modern C++, and GCC's inliner did not like how big the function was. When I instead outlined the last few lines into their own helper function, adopt_new_memory, I got performance on par with std::vector.

This is the "fixed" version:

    void push_back(T&& elem) {
        if (m_next != m_end) {
            Detail::construct_at(m_next, std::move(elem));
            ++m_next;
            return;
        }
        const auto current_size = size();
        const auto new_cap = calculate_new_cap(current_size + 1);
        auto* new_memory = Detail::allocate_uninit<T>(new_cap);
        // Constructing new element has to happen before we move
        // old elements, in case someone is doing `v.push_back(v[0])`
        try {
            Detail::construct_at(new_memory + current_size, std::move(elem));
        }
        catch (...) {
            Detail::deallocate_no_destroy(new_memory, new_cap);
            throw;
        }
        adopt_new_memory(new_memory, new_cap);
        // Account for the element inserted ahead of time
        ++m_next;
    }

Next up was benchmarking against MSVC's stdlib. Compared to MSVC's std::vector, the implementation above was significantly slower, up to 3x. Because both std::vector and nonstd::vector were significantly slower when compiled with MSVC than when compiled with GCC, I had to look for some pattern that broke MSVC's optimizer badly.

After some thinking, I noticed that I could completely remove the try-catch block because in nonstd::vector, I static_assert that the elements in it are no-throw move-constructible.

This meant that the implementation of push_back(T&&)[12] could be simplified down to just

    void push_back(T&& elem) {
        if (m_next != m_end) {
            Detail::construct_at(m_next, std::move(elem));
            ++m_next;
            return;
        }
        const auto current_size = size();
        const auto new_cap = calculate_new_cap(current_size + 1);
        auto* new_memory = Detail::allocate_uninit<T>(new_cap);
        // Constructing new element has to happen before we move
        // old elements, in case someone is doing `v.push_back(v[0])`
        Detail::construct_at(new_memory + current_size, std::move(elem));
        adopt_new_memory(new_memory, new_cap);
        // Account for the element inserted ahead of time
        ++m_next;
    }

This change made no difference in the performance with GCC (or, as I later checked, Clang), but this version of push_back(T&&) now beats MS's std::vector easily. Why? Because the MSVC version I have is bad at handling try-catch blocks where no exceptions can be thrown.

Let's take a look at an example where two functions should compile into the same ASM if the compiler properly optimizes try-catch blocks:

#include <cstdio>

void f1(int* ptr, int i) {
    *ptr = i;
}

void f2(int* ptr, int i) {
    try {
        *ptr = i;
    } catch (...) {
        puts("RIP\n");
        throw;
    }
}

godbolt

Notice that GCC is smart enough to compile both f1 and f2 into the same ASM. MSVC 19.29 instead makes a hash of f2, compiling in the whole exception handler. MSVC 19.37 performs a lot better; the two functions still aren't identical, but now the difference is only that f2 allocates 40 bytes of stack space for no reason.


And that's all for this article. Reimplementing a tiny subset of std::vector in a way that keeps the comparison fair was an interesting exercise, and figuring out why it optimizes differently was quite fun.

Standardizing std::vector::push_back_unchecked would be nice, but I am not touching wg21 unless I am well-compensated for doing so.



  1. This is usually called a "Small Vector". The idea is to have a buffer for some N elements inline with the object so that the vector does not need to allocate memory while it contains less than N elements. std::string already does this because it has the advantage that its elements are trivial types that cannot tell the difference. Overall, I think this is the most common vector mutation in the wild, see e.g. LLVM, Abseil, Boost, many hits on GitHub and in 3 of 4 commercial codebases I've worked on. ↩︎

  2. This is usually called "Static Vector". It boils down to a small vector that does not allow increasing capacity. This mutation used to be useful for constexpr programming, but C++20 allows us to use std::vector in most constexpr contexts. This leaves the static vector with a much smaller niche; cases where there is a statically known bound on the size, the bound is small, and we either won't need to move the vector, or the elements are cheap to move. This was another fairly common vector mutation, but lately, it became less common. It can still be found in, e.g. Boost. ↩︎

  3. As far as I know, this is a relatively new idea, usually called something like "Pinned Vector". The idea is that a vector can reserve lots of contiguous virtual memory on construction and only ask for it to be committed and backed by actual physical memory once it is used. This lets the pinned vector avoid copying/moving elements during reallocation, thus providing performance benefits when the final size is not known up-front. Pinned vectors also have a secondary advantage in that adding more elements does not invalidate pointers to already inserted elements. ↩︎

  4. Many types are movable without invoking their move constructor/assignment operator, just by copying their bits into another place. Both std::vector itself and std::unique_ptr<T> are examples of such types because their representation is not self-referencing, even though they need a custom destructor. The C++ standard currently does not support this, even though there have been some proposals to introduce support for this into the language (by adding a new keyword along the lines of =default and =delete) and into the standard library (by adding a trait the types can opt into). An example of this approach is the vector reimplementation in Facebook's folly library. While folly::fbvector fails to compile if the type it stores does not specialize a trait to indicate that it can be moved bitwise, some implementations instead leave checking this to their user. ↩︎

  5. std::string already does small string optimization, but it has the advantage that its elements are always trivial. This means that it can easily move the elements without risking exceptions being thrown or causing unexpected performance overhead. ↩︎

  6. A sufficiently smart compiler could fix this. But I have to use GCC/Clang/MSVC, so I have no sufficiently smart compiler. ↩︎

  7. Yes, this means that calling it without sufficient capacity invokes UB. ↩︎

  8. The reason to include std::vector::push_back in the benchmarks is to check that the implementation of nonstd::vector::push_back is not doing something stupid, and thus giving an unfair advantage to nonstd::vector::push_back_unchecked. I needed to make multiple changes for my custom vector to match std::vector on GCC and MSVC. I describe some of them in the last section. ↩︎

  9. This MSVC version corresponds to fully patched VS 2019 as of the time of writing. ↩︎

  10. These compiler versions are what I use at work, so it is what I already have set up locally. Given the results, it would be nice to rerun the benchmark with newer MSVC to see how much the improved optimizer helps. ↩︎

  11. This API covers a few more use cases than just plain push_back_unchecked would, but it also imposes a lot more compilation costs and takes advantage of character types being much simpler to work with. I don't think a copy of this API would be a good fit for a generic vector type. ↩︎

  12. Note that we cannot make the same change for push_back(T const&), because we only assert no-throwability of moves, not copies. For copies, we generally assume that they will want to acquire resources, e.g. allocate memory, thus, throwing copy constructors are reasonably common. ↩︎

]]>
<![CDATA[Stop talking about Amdahl's law]]>https://codingnest.com/stop-talking-about-amdahls-law/64c28599a062e6148d2e46d2Sun, 02 Apr 2023 15:02:39 GMT

Without talking about Gustafson's law and the assumptions both make.

Let me preface this by saying there is nothing wrong with Amdahl's law[1] itself. After all, it is just an observation of the result of applying simple math.

But I believe there is a lot wrong with its current use in pedagogy around parallel programming[2].

Amdahl's law

Amdahl's law is a formula to calculate the speedup of a task by using more computing resources, e.g. multiple threads (assuming perfect parallelization). Formally written

\[
S(n) = \frac{1}{(1-p) + \frac{p}{n}}
\]

where

  • \(n\) is the number of threads you have
  • \(p\) is the fraction of runtime taken up by parallelizable code

There are two important things to notice about the formula.

  • The result is the improvement in latency of the task, that is, how much less real time you have to wait from start to finish.
  • For large \(n\), the result approaches \(\frac{1}{1 - p}\).

Informally we can also generalize it a bit:

Optimizing a single part of code cannot provide overall improvements larger than the fraction of time initially taken by the optimized part.

We expect experienced programmers to understand the underlying principle naturally; the colder a section of code is, the less payoff there is in optimizing it. This is because if a section of code takes up 1% of the overall runtime, you will, at most, knock off 1% of the overall runtime by optimizing it.

My issue with Amdahl's law when talking about multithreading is simple. Amdahl's law is used to explain the limit of parallelizing code across a large number of threads. If the parallelizable code takes up 95% of the runtime before parallelization, even perfect parallelization across thousands of cores won't cut the runtime by more than a factor of 20 (down to 5% of the original time).
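As a sanity check of that 20x figure, Amdahl's formula is easy to evaluate directly (the helper name is mine):

```cpp
// Amdahl's law: improvement in overall latency when the parallelizable
// fraction p of the runtime is spread over n threads.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.95, even n = 10'000 threads yields a speedup just below 20, because the serial 5% dominates the denominator for large n.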

This limit is correct, but providing further context is necessary for the students to avoid leaving with the idea that parallelizing past few cores is pointless. The issue is that Amdahl's law makes an important assumption about the problem, and students who have just been exposed to a bunch of new complex ideas (concurrency, synchronization, etc.), are unlikely to realize what the assumption is. Can you figure out the assumption?

> Click here for the answer <

Amdahl's law assumes that there is always a fixed amount of work.


This assumption holds for a lot of important software.

Take grep as a simple example. No matter how quickly you can search for a pattern in a single[3] file, you will always have to wait the time it takes to transfer the bytes from the drive[4] to the memory so you can read and match them.

Filters in Photoshop are another example. Most of the time, they will be applied to a single image, sometimes to a well-defined batch of images. No matter how fast/parallelized the computational part of the filter becomes, there will always be non-parallelizable steps, like dealing with IO.

Gustafson's law

Gustafson's law talks about the slowdown from forcing parallel code to be run on a serial computer.

\[
S(n) = (1 - p) + p \cdot n
\]

as with Amdahl's law,

  • \(n\) is the number of threads you have
  • \(p\) is the fraction of runtime taken up by parallelizable code

Notice that there are no diminishing returns on the parallelism level applied to the task. Where does this difference come from? From a different basic assumption about the shape of the work. Can you figure out the assumption underlying Gustafson's law?

> Click here for the answer <

Gustafson's law assumes a fixed amount of time to do the work.


This assumption also holds for a lot of important software.

For example, models for weather forecasting are massively parallel but must finish running within a set amount of time[5]. So when we get more parallel resources, we can run more granular[6] models or larger model cohorts, improving our ability to accurately forecast short-term weather.

Various other physics simulations have the same property, where we can saturate an arbitrary number of cores[7] to achieve better and more precise results, but we need to get the results within some predetermined amount of time.

But the tasks do not have to be inherently parallel for the software to obey Gustafson's law. The work that uses up the extra resources can be orthogonal to the work that is always done; e.g. an old word processor running on an old machine might not even have the resources to run a simple spell check, while a new one can do online machine learning on your writing style to tell you when your voice has changed.
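The absence of a ceiling is easiest to see by evaluating Gustafson's formula directly (the helper name is mine):

```cpp
// Gustafson's law: scaled speedup when the amount of work grows to
// keep n threads busy for the same wall-clock time.
double gustafson_speedup(double p, double n) {
    return (1.0 - p) + p * n;
}
```

For the same 95%-parallel workload that Amdahl's law caps at 20x, the scaled speedup grows linearly with the thread count: 1000 threads give roughly a 950x larger problem solved in the same time.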

Conclusion

Both Amdahl's and Gustafson's laws are correct while holding very different positions about the efficiency of parallelism. This difference comes from their contradictory assumptions about the properties of the problem our code is solving, and thus at most one of them applies to our code.

Somehow it became common to show students just Amdahl's law as part of multithreading courses without explaining the underlying assumption it makes. But showing your students just Amdahl's law[8] without explaining it is doing them a huge disservice, so, please:

stop talking about Amdahl's law without also explaining Gustafson's law

If you do not have the time to do both, skip showing either.


Quick terminology note: keeping the problem size constant while increasing the number of cores is called strong scaling. Increasing the number of cores and the size of the problem is called weak scaling.



  1. The fact that it is called "law" annoys me a bit. There is nothing factually wrong with Amdahl's law though[9]. ↩︎

  2. Amdahl's law is perfectly applicable to single-threaded programming, but I have never seen it brought up in that context. I find this interesting because we teach people about its implications, but we never name the rule as such. ↩︎

  3. Arguably, as the implementation of your grep-like utility becomes faster, the users will use it to search through more files. This is because a common usage pattern is to look for the file with a specific pattern, and once users realize that they can get the results in a reasonable time, even if they are lazy with prefiltering the files, they will get lazy. ↩︎

  4. Assuming cold FS cache. ↩︎

  5. If the weather model does not finish in time, we won't get the results (prediction) before the weather is actually happening. And once the weather is happening, well, we can just observe it for much more accurate data. ↩︎

  6. Higher granularity means that the "cells" used to simplify the simulations become smaller, which in turn means that there will be more of them, and thus there is more work to do. ↩︎

  7. To quote a friend working on physics simulations in a field I won't pretend to understand: "If it got me 10x cores, I would turn Hindu". ↩︎

  8. Obviously, this is also true for showing just Gustafson's law. But I have never met someone who said somebody showed them just Gustafson's law without further explanation. I have only seen this happen with Amdahl's law. ↩︎

  9. Technically, second-order effects can violate Amdahl's law in some specific cases[10], but that requires code explicitly crafted to do that. Let's assume we are talking about plain old boring programs instead. ↩︎

  10. The simplest way to violate Amdahl's law is through CPU cache effects on microbenchmark results. If an optimization of one part reduces how much i/d cache it uses up, that can provide speed up to the non-optimized parts due to not evicting them from the CPU cache. ↩︎

]]>
<![CDATA[The Little Things: Why you should always have benchmarks ready]]>https://codingnest.com/the-little-things-why-you-should-always-have-benchmarks-ready/64c28599a062e6148d2e46d1Wed, 08 Mar 2023 19:45:02 GMT

At the end of January, I removed some low-hanging allocations from Catch2. This meant that running the SelfTest binary made 8k fewer allocations with v3.3.1 than with v3.3.0. Unbeknownst to me, it also made section tracking in v3.3.1 ~4x slower. This post is about how that happened and what I/you should learn from this.

Let's start at the beginning.

The beginning

As I already mentioned, I decided to work on the number of allocations in Catch2. This was motivated by this topic coming up at work[1] and by looking at Snitch, which is a test framework that provides API similar to Catch2's but with zero allocations[2].

As this was originally just a one-day project[3], I did not want to do any significant refactoring, only small local changes, like reserving enough memory ahead of time or ensuring that things end up moved rather than copied. I say originally because I noticed that the section tracking requires a quadratic number of allocations in the number of sibling sections, and I decided to fix this, even though it required a larger surgery.

The result was that running tests/SelfTest with v3.3.1 needed just over 8k fewer allocations than it needed with v3.3.0. And also, unbeknownst to me, that section tracking became about 4x slower. I did not notice this until quite a lot later because I did not benchmark the changes. After all, all the changes were trivial, and having fewer allocations means better runtime, no?

The realization

About three weeks later, I wrote some microbenchmarks for specific extreme uses of Catch2[4]. The most interesting results came from a microbenchmark for section tracking, with 1000 sibling sections:

TEST_CASE("Some simple test hahaha", "[tag1][tag2]"){
    SECTION("Some section name 0"){  }
    SECTION("Some section name 1"){  }
    SECTION("Some section name 2"){  }
    SECTION("Some section name 3"){  }
    SECTION("Some section name 4"){  }
    ...
    SECTION("Some section name 999"){  }
}

Tracking allocations coming from running this empty test case gave me some impressive numbers; v3.3.0 made 2013410[5] allocations, but v3.3.1 made only 11410 allocations[6].

This is a great improvement, but there was a Catch2 catch. Running the test compiled against v3.3.1 was noticeably slower than running the one compiled against v3.3.0. So I benchmarked the binaries using hyperfine and found out that the difference was between 2.5x (debug configuration) to more than 4x (release configuration).

At this point, I looked through the 8 commits that went into v3.3.1 again, expecting to quickly find some very suspicious change that would make the issue obvious. I found absolutely nothing suspicious, so I had to use a more systematic approach...

The investigation

With only 8 commits that could've caused the difference, the simplest way of narrowing down the suspects was to just build the benchmark against all of them and compare their runtimes. This clearly narrowed down the slowdown to a single commit[7], unsurprisingly the most complex one.

Try looking at the linked commit as if you were doing a code review, and see if you find the issue by inspection. Do you think you see it?

> Click here for a hint < It is one of the first changes in catch_test_case_tracker.cpp

I sure didn't, so I had to dig deeper to figure it out. First, I tried profiling the binary built with v3.3.1. Let's take a look at a summary of functions on the hot path:

Function Name                                     Inclusive CPU %   Exclusive CPU %
Catch::TestCaseTracking::SectionTracker::acquire           97.14%             0.09%
Catch::TestCaseTracking::ITracker::findChild               96.75%            17.22%
Catch::StringRef::operator==                               77.78%            27.14%
[External Call] vcruntime140.dll                           50.64%            50.64%

Looking at the inclusive[8] time, we spend the absolute majority of the time looking for a child of a tracker. This is done when checking whether Catch2 should enter the currently encountered section. Given that our microbenchmark only ever walks over SECTION macros (it will walk over 1M SECTIONs in one execution), this makes sense and is about what we would expect.

Looking for the child section spends most of the time comparing StringRefs, which in turn spends most of the time in a runtime library, specifically in std::memcmp, to compare the contents.

With the benefit of hindsight, the issue is obvious from the trace.

However, at the time, I had no idea, so I tried various things:

  • Make StringRef::operator== eligible for inlining without LTO by moving it to the header. This helped a bit, but not enough to offset the slowdown.
  • Avoid passing the newly introduced NameAndLocationRef by value. The difference this made did not surpass noise.
  • Better inlining and devirtualization of various functions in trackers. In aggregate, these changes improved performance slightly, but not nearly enough to bring the perf back to previous levels.

After spinning my wheels for an embarrassingly long time, I finally had an idea:

I could compare profiles from running the benchmark with v3.3.0 and v3.3.1.

This is the hot path for the same benchmark with v3.3.0:

Function Name                                     Inclusive CPU %   Exclusive CPU %
Catch::TestCaseTracking::SectionTracker::acquire           79.75%             0.08%
Catch::TestCaseTracking::ITracker::findChild               79.25%            22.76%
Catch::SourceLineInfo::operator==                          53.61%            53.61%

We still spend the majority of time looking for trackers, but as it runs faster in v3.3.0, it takes up less of the total CPU time. However, this time the majority of the time is taken up by a different operator== than with v3.3.1[9].

After seeing this, the issue became obvious.

Look at the commit in question again. Do you see it now?

> Click here for the answer <
         auto it = std::find_if(
             m_children.begin(),
             m_children.end(),
             [&nameAndLocation]( ITrackerPtr const& tracker ) {
-                return tracker->nameAndLocation().location ==
-                           nameAndLocation.location &&
-                       tracker->nameAndLocation().name == nameAndLocation.name;
+                return tracker->nameAndLocation() == nameAndLocation;
             } );
> Click here for the explanation <

Changing the lambda to use NameAndLocation::operator== directly changed the order in which the tracker's members are compared. The original version checked the tracker's location first and its name second. op== instead checks the name first and the location second, because that is the order in which the members are declared in NameAndLocation.

In most cases, the line where a SECTION is defined is at least as good a discriminant as the SECTION's name, and comparing two lines (plain numbers) is a lot faster than comparing two names (strings). Thus, when the refactoring changed the order in which they were checked, the comparison slowed down significantly, and with it the whole section tracking.
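The effect is easy to reproduce in isolation. Below is a small sketch — the types are simplified stand-ins I made up, not Catch2's actual definitions — showing how member declaration order drives a member-wise operator==, and how the hand-written comparison can put the cheap check first:

```cpp
#include <string>

// Simplified stand-ins for Catch2's types; the real definitions differ.
struct SourceLineInfo {
    std::string file;
    unsigned line;
};

struct NameAndLocation {
    std::string name;        // declared first, so a member-wise
    SourceLineInfo location; // operator== compares it first
};

// Order after the refactoring: the (potentially long) name string is
// compared before the cheap line number.
bool equalNameFirst(NameAndLocation const& lhs, NameAndLocation const& rhs) {
    return lhs.name == rhs.name &&
           lhs.location.line == rhs.location.line;
}

// Original order: the line number acts as a cheap early-out, so the
// string comparison only runs for the one matching sibling.
bool equalLocationFirst(NameAndLocation const& lhs, NameAndLocation const& rhs) {
    return lhs.location.line == rhs.location.line &&
           lhs.name == rhs.name;
}
```

Both functions return the same results for every input; they differ only in how much work they do before rejecting a non-matching sibling, which is exactly what the benchmark ended up measuring 1M times per run.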



Takeaways

  • Even innocuous changes can have massive consequences.

The refactoring was an obvious way to make the code cleaner. Instead of manually comparing two members, we called the appropriate operator==. The result was cleaner code that took significantly longer to run, due to differences in how it processes real-world inputs.

  • You should always have some benchmarks, and run them regularly[10].

I ran into a similar issue last week at work. I noticed a function that copied a vector, and optionally also XORed the data with a constant. I thought it was overcomplicated and cleaned it up, but because the benchmarks showed that the result was slower[11], the changes never made it to git.

  • When investigating differences in running time, profiling just one binary might not be enough.

You should profile both binaries and then diff the resulting traces. I am not aware of any tools for smart trace diffing, but usually, just opening both of them side by side is enough.



  1. We did not need to reduce the number of allocations by itself, but we needed to reduce the fragmentation caused by having interspersed short-lived and long-lived allocations. ↩︎

  2. It makes some interesting tradeoffs, some of which I agree with (e.g. if it did not break backwards compatibility, TEST_CASE would also accept only string literals), and some of which I don't (matcher description has to be saved internally in the object, SECTION tracking is much more limited). ↩︎

  3. I really don't have much spare time for projects nowadays, even less for Catch2. ↩︎

  4. E.g. 1000s of TEST_CASEs in a single file, 1000s of REQUIREs in a single test case, 1000s of sibling SECTIONs, huge generators and so on. ↩︎

  5. Given that the absolute majority of these allocations come from std::strings, having shorter section names would lead to fewer allocations. Trying to figure out a realistic distribution of section name lengths is beyond the scope of this post though. For the sake of completeness, if the names are shortened to just the number, the result is 8410 vs 6410 allocations. ↩︎

  6. Further cleanups made in v3.3.2 brought this down to 10409 allocations, and 6409 allocations for the short name case. ↩︎

  7. Other commits also had some effect, both positive and negative, but this one was responsible for the absolute majority of the slowdown. ↩︎

  8. Simplified a bit, a function's inclusive time is the time spent between entering and exiting the function. This means that time spent in functions called from it also counts towards its inclusive time. A function's exclusive time is the time spent only inside that one function. ↩︎

  9. If you noticed that this time there is no separate line for calls into the runtime for std::strcmp — yeah, there isn't. There should be a bunch of calls there, but the profile shows nothing. I suspect that MSVC treats calls to std::strcmp differently from std::memcmp and inlines it. ↩︎

  10. Incidentally, I am looking for help setting up some benchmarks for Catch2. The hard part is tracking the results over time, so that trends can be identified, rather than just getting the results of the latest run. ↩︎

  11. I investigated why the new code was slower, and the result turned out quite surprising and funny. The compiler was able to vectorize the rewritten function, so it should've been quite a lot faster. But because all inputs were very short, the vectorized loop did not pay off. ↩︎

]]>
<![CDATA[The Little Things: My ?radical? opinions about unit tests]]>https://codingnest.com/the-little-things-my-radical-opinions-about-unit-tests/64c28599a062e6148d2e46ceSat, 10 Sep 2022 15:17:36 GMT

Due to maintaining Catch2 and generally caring about software correctness, I spend a lot of my time thinking about tests[1]. This has left me with many opinions about tests, some conventional and some heterodox.

Originally I wanted to make a lightning talk out of these, but I won't be giving any this year, so instead, I wrote them up here. And since this is a blog post, I also have enough space to explain the reasons behind the opinions.

Unit test opinions

Note that the list below is in no particular order. I added opinions to the list as I remembered them.

  0. You should be using Catch2.

Okay, I am not entirely serious about this one, but I had to include it anyway.

  1. If the test runner finishes without running any tests, it should return a non-zero error code.

When I was making this change to Catch2, I expected it to be non-controversial. After all, everyone has run into having green CI while tests weren't running due to a misconfiguration, no? And making the test runner fail if it didn't run any tests would easily catch this problem.

I ran two Twitter polls[2] about this, and as it turns out, people disagree in some cases and agree in others. If the test runner is called without a filter, e.g. just ./tests, then the majority of people voted to return 0, but when there is a filter, e.g. ./tests "some test filter or another", then the majority voted to return a non-zero error code.

There is an easy argument about the logical purity of returning 0. A non-zero return code means that at least one test has failed. If no tests were run, then no tests could've failed; thus, returning 0 is the correct answer.

And logically, it is the correct answer. But pragmatically, returning non-zero return code when no tests were run is much more useful behaviour, and I prefer usefulness in practice over logical purity.

If you want to read the arguments people made for/against returning 0 in either case, look at the replies to the poll tweets.
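The pragmatic policy can be sketched in a few lines. Note that the names below (RunTotals, exitCodeFor, the specific exit code) are made up for illustration; this is not Catch2's actual code:

```cpp
// Hypothetical run summary; not Catch2's actual types.
struct RunTotals {
    int testsRan = 0;
    int testsFailed = 0;
};

int exitCodeFor(RunTotals const& totals) {
    // The logically pure answer: no tests ran, so none failed, return 0.
    // Pragmatically, an invocation that matches no tests is almost
    // always a misconfiguration, so make CI go red instead.
    if (totals.testsRan == 0) {
        return 4; // arbitrary non-zero code meaning "no tests were run"
    }
    return totals.testsFailed == 0 ? 0 : 1;
}
```

Using a distinct code for the "no tests ran" case also lets scripts that genuinely expect an empty match (rare, but they exist) tell it apart from an actual test failure.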

  2. Unit tests aren't good because they provide correctness. In fact, they are bad at providing correctness.

Unit tests only provide a little bit of correctness. What makes unit tests valuable is that they also need only a little effort to provide that small bit of correctness. They also provide correctness under the pay-as-you-go model.

This means that there is good correspondence between the effort you put into your unit tests and how much correctness you get out of it. And you can put in as much, or as little, effort as you want/can afford. This contrasts sharply with formal verification, complex fuzzing setups, etc., where you have to invest a significant amount of effort up-front for them to pay out. And if you stop investing the effort halfway through, you get nothing back.

The downside to the pay-as-you-go model is that if you want a high degree of correctness guarantees, you will pay a lot more than you would for using a different approach.

  3. You should not be running your unit tests in isolation.

There are two main reasons to run your tests in isolation. The first one is to run them in parallel. The second one is to avoid positive interference between tests, where one test changes the global state in a way that causes a later test to pass. Both are good reasons, but there is a better way to achieve these goals.

There are also two issues with running your tests in isolation. The first one is that on some platforms, the static overhead from running each test in its own process can be a significant part of the total runtime of the tests. The other is that running your unit tests in isolation also hides negative interference between tests, where one test changes the global state in a way that would cause a later test to fail.

The first issue is rarely a problem in practice[3]. The real problem with running tests in isolation is the second issue. In my experience, negative interference between tests usually means issues in the code under test rather than the tests themselves, and thus it is important to find it and fix it.

If you instead run your tests in randomized order inside a single process, you will eliminate both positive and negative interference between tests. The disadvantage to this approach is that you don't get parallel test execution and that figuring out which tests interfere with each other is complicated due to the shuffling.

The thing is, good tools[4] can solve both of these disadvantages. You can get back parallel test execution if your test runner supports splitting up tests into randomized batches. And debugging test interference is easy if your test runner's shuffle implementation is subset-invariant[5].
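One way to get a subset-invariant shuffle is to not permute the list at all, but to sort it by a key derived from the seed and the test name; the relative order of any two tests then never depends on which other tests are present. Catch2 does something along these lines. The sketch below is mine, though — the FNV-1a hash and the function names are illustrative, not Catch2's implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Derive a deterministic sort key from the seed and the test name.
// Plain FNV-1a, seeded; purely illustrative.
std::uint64_t shuffleKey(std::uint64_t seed, std::string const& name) {
    std::uint64_t h = 14695981039346656037ull ^ seed;
    for (unsigned char c : name) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// "Shuffle" = sort by the per-test key. Two tests compare the same
// way no matter what subset of the suite they appear in.
std::vector<std::string> shuffled(std::vector<std::string> tests,
                                  std::uint64_t seed) {
    std::sort(tests.begin(), tests.end(),
              [seed](std::string const& a, std::string const& b) {
                  return shuffleKey(seed, a) < shuffleKey(seed, b);
              });
    return tests;
}
```

With this property, you can bisect an interfering pair by repeatedly rerunning ever-smaller subsets with the same seed, knowing the order of the survivors never changes under you.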

  4. Tests should be run in random order by default.

Running tests in random order is the correct behaviour for the long-term health of tests. Your defaults should adhere to the correct configuration unless there is an overriding constraint.

Catch2 currently does not default to this but will with the next major (breaking) release. During the last major release, I considered this bit too radical, especially in conjunction with the other changes to defaults I made.

  5. Test Driven Development (TDD) is not useful for actual development.

In my experience, there are two options when I need to write a larger piece of code. The first one is that I know how to implement it; thus, I get no benefit from the "write tests first" part of TDD. The second one is that I don't know how to implement something, and then TDD is still useless because what I actually need is to do a bunch of design up-front... and then writing tests first has the same issue as in the first case.

These issues are compounded further by the fact that I often work on things where the "run tests" step can require multiple CPU-years to evaluate appropriately, or crunching through half a terabyte of data and comparing the output with the current implementation.

Don't read this to mean that you should not write tests. You should be writing tests. You just don't have to dogmatically write them first. Write them when it makes sense, and don't be afraid to write them in large batches.

  6. TDD is a good learning tool.

While I don't think TDD is useful in day-to-day work (see the previous opinion), I would still recommend every starting developer to try using it for some time, say a year, while they are learning to write code. The reason behind this is quite simple. A common complaint when starting with unit tests is that writing them is annoying because the interface of the code under test is annoying to use.

Being forced to use the interface before writing the implementation (through writing the tests first) is an excellent way to determine whether the interface is usable. And since the implementation is not written yet, the barrier to changing the interface is trivial.

Eventually, the developer should be able to evaluate the interface without being forced to write code that uses it, which is a good time to stop using TDD.

  7. Your use of tests will be influenced by your test framework. This makes picking a good test framework critical.

This does not seem like a radical opinion, but I think few people appreciate how much this is true and what it means.

A straightforward example is the difference between the test names you end up with, if your test names have to be legal identifiers (e.g. GTest) or if they can be arbitrary strings (e.g. Catch2). Which one of these do you prefer to read?

TEST(Exceptions, ExceptionMessagesContainTimestampAndLocation) {
    ...
}
TEST_CASE("Exception messages contain timestamp and location", "[exceptions]") {
    ...
}

I ran into another, worse, example this year at NDC TechTown. During a discussion about tests and why tests should be run shuffled and not isolated (see point 3), someone told me that debugging shuffled tests is too hard. Why? Because you have to find which tests cause the issue, but removing tests changes their order, subsetting the tests is annoying, and so on.

This is only true if your test framework does not provide good support for test shuffling, but if you picked such a framework, then trying to do the right thing is now more painful than it has to be. Maybe even painful enough that you won't bother running the tests shuffled, losing out on significant correctness improvement.

I also saw a huge example last year, incidentally also at NDC TechTown. In his keynote "Testing as an equal 1st class citizen (to coding)", Jon Jagger had a part where he notes that he no longer uses descriptive test names because writing them as a valid identifier is annoying and ends up unreadable anyway. Instead, he uses UUID-ish names, like test_de724a, test_de724b and so on. The description of the test is then pushed down into the test's docstring (this is in Python). Because the docstring is an arbitrary string, it can contain newlines, and long sentences, making it more readable than the identifier-like name.

Another supposed bonus to this approach was that it makes running all tests related to some functionality easy, e.g. pytest -k de7, because the test name prefix encoded other properties of the test.

I think this idea is pretty ingenious.

But it is an ingenious workaround for inadequate tools. Users shouldn't have to write test names as valid identifiers, and users should be able to group tests without messing with the test names.



update 14.09.2022

Shortly after publication, I remembered another great example of this principle, the "Lakos rule"[6]. The purpose behind the rule was to make defensive checks in narrow contract[7] functions "testable". The idea was that the assertions would be configured to throw instead of terminating, and then the tests would check for exceptions being thrown. But for this to work, the narrow contract functions couldn't be marked noexcept because that disallows them from throwing.

There is a relatively simple alternative to this; something called "death tests". A death test checks whether a specific expression terminates the binary that executes it. GTest is a good example of a test framework that supports death tests and supports them well.
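The mechanism behind a death test can be sketched in a few lines of POSIX code: fork, run the checked code in the child, and assert that the child died. The assertion handler can then stay a plain abort, and the function under test can keep its noexcept. This is my own minimal sketch of the idea — real implementations, like GTest's EXPECT_DEATH, handle far more edge cases (output capture, message matching, threads):

```cpp
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run fn in a child process and report whether the child was killed
// by a signal (e.g. SIGABRT from a failed assertion).
// Error handling (fork failure etc.) is omitted for brevity.
template <typename Fn>
bool diesWhenCalled(Fn fn) {
    pid_t pid = fork();
    if (pid == 0) {
        fn();      // child: run the checked code...
        _exit(0);  // ...and exit cleanly if it survives
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status); // died = terminated by a signal
}
```

A test can then check `diesWhenCalled([]{ vec[bad_index]; })` without the assertion machinery ever having to throw.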

For a bit over 8 years[8], the Lakos rule was used to guide the use of the noexcept specifier in the C++ standard library. Many functions that could be marked noexcept are not because they have a narrow contract and the Lakos rule said that they should not be marked noexcept.

So a lack of good death test support in the test framework Bloomberg uses caused a non-trivial difference in the contents of the C++ standard. That's a nice impact for something as trivial as a test framework choice, right?


Other random testing opinions

I also decided to add a smattering of opinions that are not about unit tests. I will not try to explain these as much as the ones about unit tests, maybe in a later article.

  • Property-based testing is neat; we should do more of it. But we need good tooling support first.
  • Dogmatically insisting on one assertion per test case is stupid. Writing more assertions is often easier than defining a new matcher.
  • Being pragmatic is more important than being clean/correct/logical.
  • Fuzzing is good. The issue is making it work in resource-constrained environments.
  • Mocking should be only used very rarely.
  • Overtesting is worse than (slight) undertesting.
  • Using special function names to declare that a function is a test is wrong.

  1. And I don't mean just unit tests, e.g. my undergrad thesis was about applying symbolic execution towards testing hard real-time safety-critical systems. Later on I did a bunch of work with formal verification tools, namely Alloy and UPPAAL. ↩︎

  2. Twitter poll about the case without filter, Twitter poll about the case with test filter. Later on, I also ran a Twitter poll about the case with a tag filter. ↩︎

  3. I've run into a case where the static overhead from running each test in its own process was about 15% of the total runtime. This was due to a combination of most test cases being very cheap to execute, spawning processes on that platform being expensive, and large test binaries. Remember that all the tests in the binary must be registered to run even a single one. ↩︎

  4. Such as Catch2. ↩︎

  5. Subset-invariant shuffle means that the relative order of two test cases is the same, no matter how many other test cases are also shuffled at the time. Obviously, this assumes a fixed random seed. Different random seeds can (and should) change the relative order of the same two test cases. ↩︎

  6. Lakos rule says that, in general, narrow contract functions should not be marked noexcept. There is a bit more to it than just that, e.g. it also says that non-throwing functions with a wide contract[7:1] should be marked noexcept and has a part about special functions with a conditionally wide contract. ↩︎

  7. Narrow contract functions, or sometimes just "narrow functions", are functions that can be called only when some preconditions are met. The classic example in C++ is std::vector::operator[] because it expects to be called only with a valid index as the argument. Calling it with an invalid index is undefined behaviour and a user error. Wide contract functions, or "wide functions", are functions that have no preconditions. The classic example in C++ is std::vector::at because it accepts out-of-range indices... it reacts by throwing an exception. ↩︎ ↩︎

  8. N3279 formalized stdlib's noexcept policy following arguments made in N3248. P1656 changed the policy at the end of 2019. Funnily enough, P1656 explicitly calls out that neither libc++ nor libstdc++ use in-process validation because trying to do so ran into too many issues to be helpful. When P1656 was presented, MSVC's STL implementer had a similar, maybe even a bit more extreme, position. (ISO rules say that the meeting notes are not public, so you need to be part of wg21 to open the link.) ↩︎

]]>
<![CDATA[NDC TechTown 2022: Trip Report]]>https://codingnest.com/ndc-techtown-2022-trip-report/64c28599a062e6148d2e46cdFri, 09 Sep 2022 10:00:32 GMT

Last week I was at NDC TechTown 2022, and I decided to write down my thoughts on the talks I saw (including the two talks given at the meetup before the conference) and other related things. Note that I saw just one of the talks scheduled for Thursday due to giving two talks of my own.

The meetup

There were two talks at the meetup. Robert C. Seacord gave the first one, Arne Mertz gave the second one, and to be honest, I did not like either of them. Content-wise Arne Mertz's talk was fine[1], but the presentation itself was boring. But to be fair, that might be because it started at 8:30, and everyone had a bunch of pizza in them by that point. On the other hand, the first talk (the one given by RCS) was kinda entertaining but lousy.

The conference

Generally, I prefer talks that give me new ideas or contexts to compose thoughts into over well-made talks that reiterate a thing I've known for many years. For this reason, I ended up liking the "Auto-testing for situational awareness" talk by James Westfall.

Technically, the talk was terrible: bad colour choices, and code listings in a tiny font that was unreadable when projected. Content-wise, the parts I found interesting could fit within a well-made lightning talk. When I walked past the evaluation box, I saw that it mainly got yellow cards[2], and I think the talk-as-given entirely deserved it. And yet, I don't regret attending.

There were two essential core ideas to the talk. The first one was that when looking at collected metrics, you need to look not only at the current (e.g. today's) values but at the results over time, as the delta between individual results carries the critical information. The second one was to keep your dashboards updated so that correlated metrics are easy to cross-reference. Doing so speeds up investigations into suspicious results.

As an example, consider a sudden jump in the average memory usage of your machines. Is this a sign of issues with the new release? It might be. But a quick look at work batch sizes might tell you that actually, it is the batch sizes that have changed, and the memory usage change is perfectly fine.

I also attended five other talks. I wrote down some thoughts about them below, in chronological order:

  1. "Keynote: Abstraction Patterns" by Kate Gregory

It was fine. The talk was competently made and given, and the contents were correct. The issue I have with the keynote is that I would feel condescending giving this talk to fresh grads, because I expect them to already know at least 80% of the contents.

And yet, people who do consulting, and have thus seen more workplaces than I have, tell me that not only are there professionals who need to be told this, there are a lot of them. At the same time, I cannot think of an employer of mine where the team really needed this. I find these two facts hard to square...

I guess the moral of the story is to select your employers well. ¯\_(ツ)_/¯

  1. "Properties of Unit Tests" by Arne Mertz

It was fine. I think Arne is a boring speaker[3], but the contents were, again, an uncontroversial recap of good practices for unit tests. It was fun to see how Catch2 solves some issues better than the competition. The talk also gets some bonus points from me for giving me inspiration for two new lightning talks.

  1. "Typical C++, but why?" by Björn Fahller

This was a good talk. Or rather, a bunch of good lightning talks concatenated into a full-length talk. I liked the throughline with puzzle art, and the contents provided a small set of easy-to-apply refactorings towards using strong types to improve codebases. Even though the contents weren't new to me, I expect this to be new for more people than the keynote and would recommend it to more people.

  1. "Keynote - The Boeing 737 MAX: When Humans and Technology Don't Mix" by Kyle Kotowick

Fun talk. Also an excellent topic for the party.

Seriously though, the speaker was good, and since I stopped following the 737 MAX saga, it was nice to get an update on the final results. Spoilers: the CEO walked away with a ton of money.

  1. "Learning Rust the wrong way." by Ólafur Waage

The last talk I attended and also my favourite one. If I had to pick one talk to recommend, it would be this one. Despite the title, the talk wasn't about Rust; it was about learning, approaches to learning and what evidence for the different approaches we have.

My talks

This year I gave two talks, "5 Years Of Teaching C++: A Retrospective" (slides, video) and "An Introduction To Floating Point Math" (slides, video), both on Thursday. I thought the first one went better, as it had enough audience discussion to end exactly at the end of the timeslot, while the second one ended at ~40 minutes due to no audience questions.

Interestingly, the audience evaluation did not agree with me, with the floating-point talk getting much better results.

Other stuff

As always, the catering was excellent (unlimited ice cream! good food! snacks!), and the AV crew was friendly and helpful. However, NDC TechTown still does not provide the speakers with a screen mirroring the projection. This means that to check what the audience sees or to use a pointer, the speaker has to turn around[4].

However, this year I was pretty annoyed by the scheduling. When I submitted my talks, one thing I put into the notes was that if I were to give both, I would want to give them on different days. When the agenda was first published, I had one talk on Wednesday and one on Thursday, so that was fine.

Except... about two weeks before the conference, I was finishing my slides[5], went to the conference agenda, and could not find my talk on Wednesday. Both of my talks were now scheduled on Thursday, in consecutive slots. That was not a pleasant surprise.

Later the scheduling changed again, literally on the day before the conference (I got a Slack DM while on the train to Kongsberg). The floating point talk was moved to the very last timeslot at the conference, Thursday, from 15:00. Overall, this did a lot to convince me not to go again next year, so I am not sure if I will submit to the next CFP.

To make my early impressions even worse, when I arrived at the hotel, I discovered that the room I got this year was atrocious. Oh well.


I got feedback from early readers that this post won't make me any friends due to openly stating when I think a talk was bad or the speaker is boring. I can see it, but I prefer giving honest feedback rather than quietly not mentioning a talk at all.


  1. It recapped some best practices for code review, nothing controversial. I disagreed with some parts, but the differences mainly come down to personal preferences and the style of the project, e.g. I think large-ish PRs are fine, as long as the individual commits are small and atomic. I prefer this for both internal projects and especially for OSS because for OSS splitting up useful features into progressive PRs risks ending up without useful features but with useless scaffolding code. ↩︎

  2. NDC TechTown provides a straightforward evaluation tool for the audience. When you are leaving the room, you walk past a plastic box with a stack of green (great!), yellow (meh), and red (bad!) papers. You then toss one of the papers into the box; the crew later counts and records them, and presumably NDC TechTown evaluates the results[6]. ↩︎

  3. I know that I have the same problem. That does not change the fact that it is an issue. ↩︎

  4. Actually, Hanka got them to provide her with a monitor. But only in a single room, which I luckily also was giving my talk in, and it was on the floor, which isn't a great place for it. Hopefully, they will take the hint for the following year. ↩︎

  5. I wanted to double-check that my title slides match the title written in the agenda. ↩︎

  6. Did you notice that I did not mention sharing the results with the speaker? For some reason, if you want to know what the results your talk got as a speaker, you have to ask for that explicitly later. ↩︎

]]>
<![CDATA[The Little Things: Comparing Floating Point Numbers]]>https://codingnest.com/the-little-things-comparing-floating-point-numbers/64c28599a062e6148d2e46cbThu, 02 Sep 2021 21:53:01 GMT

There is a lot of confusion about floating-point numbers and a lot of bad advice going around. IEEE-754 floating-point numbers are a complex beast[1], and comparing them is not always easy, but in this post, we will take a look at different approaches and their tradeoffs.

Note that this whole post assumes binary IEEE-754 floating-point numbers. There are more different types of floating-point numbers, e.g. IBM likes decimal floating-point numbers enough to support them in hardware. However, most of the text below should be applicable to different representations too.

Floating point basics

I do not want to get into too many details about the representation of floating-point numbers or their arithmetics, but we still need to go over some important points. They are required to build an understanding of the different comparison methods we will look at later.

Floating-point numbers are one way of dealing with real numbers in fixed-size storage inside a computer. The binary representation consists of 3 parts: the sign bit, the mantissa, and the exponent.

The sign bit should be self-explanatory. It decides which sign the number resulting from the rest of the bits will have[2]. The mantissa stores the digits of the represented number, while the exponent stores the magnitude of the number.

Because the total number of bits split between these three parts is fixed, we must logically lose precision when representing some numbers due to insufficient bits in the mantissa. The fact that the bit allocation to each part of the representation is also fixed[3] means that as we represent higher numbers, the absolute loss of precision increases. However, the relative loss of precision remains the same.
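You can observe this growing absolute error directly: near 1, the gap between adjacent doubles is about 2×10⁻¹⁶, but near 2⁵³ the gap is already 2, so adding 1 changes nothing. A quick check (the helper name is mine):

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable double above it.
double gapAbove(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

For example, `gapAbove(1.0)` is 2⁻⁵², while `gapAbove(9007199254740992.0)` (that is, 2⁵³) is exactly 2 — the absolute precision loss grew, while the relative precision stayed roughly the same.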

Floating-point numbers also contain some special values used to represent specific "states" outside of normal operations. As an example, if a number is so big it overflows the floating-point type, it will be represented as infinity (or negative infinity if it overflows in the negative direction). The other important kind of special values are the NaN (Not a Number) values.

There are different types of NaN, but the important things about them are that they are the result of invalid floating-point operations, e.g. \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\), and that they behave unintuitively, because \(\textrm{NaN} \neq \textrm{NaN}\)[4].

With this knowledge we can now look at how we can compare two floating-point numbers.

Comparing floating-point numbers

There are 4 (5) different ways to compare floating-point numbers. They are:

  • Bitwise comparison
  • Direct ("exact") IEEE-754 comparison
  • Absolute margin comparison
  • Relative epsilon comparison
  • ULP (Unit In Last Place) based comparison

Apart from bitwise comparison, all of them have their merits (and drawbacks). The bitwise comparison is included only to contrast it with the "exact" comparison; I am not aware of any use for it in the real world.

Bitwise and direct comparison

The idea behind bitwise comparison is exceedingly simple. Two floating-point numbers are equal iff their bit representations are the same.

This is not what happens if you write lhs == rhs[5] in your code.

If you write lhs == rhs in your code, you get what is often called "exact" comparison. However, this doesn't mean that the numbers are compared bitwise, because e.g. -0. == 0. and NaN != NaN, even though in the first case the two sides have different bit representations, and in the latter case both sides might have the exact same bit representation.

Direct comparison is useful only rarely, but it is not completely useless. Because the basic operations[6] are specified exactly, any computation using only them should[7] provide specific output for an input. The situation is worse for various transcendental functions[8], but reasonably fast correctly rounded libraries are beginning to exist.

All in all, if you are writing code that does some computations with floating-point numbers and you require the results to be portable, you should have a bunch of tests relying purely on direct comparison.

Absolute margin comparison

Absolute margin comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \textrm{margin}\)[9]. This means that two numbers are equal if their distance is less than some fixed margin.

The two big advantages of absolute margin comparison are that it is easy to reason about decimally ("I want to be within 0.5 of the correct result") and that it does not break down close to 0. The disadvantage is that it instead breaks down for large values of lhs or rhs, where it decays into direct comparison[10].

Relative epsilon comparison

The relative epsilon[11] comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \varepsilon * \max(|\textrm{lhs}|, |\textrm{rhs}|)\)[12]. This means that two numbers are equal if they are within some factor of each other.

Unlike margin comparison, epsilon comparison does not break down for large lhs and rhs values. The tradeoff is that it instead breaks down (by decaying to exact comparison) around 0[13]. Just like margin comparison, it is quite easy to reason about decimally ("I want to be within 5% of the correct result").

You can also swap the maximum for a minimum of the two numbers, which gives you a stricter comparison[14] but with the same advantages and disadvantages.

ULP-based comparison

The last option is to compare two numbers based on their ULP distance. The ULP distance of two numbers is the number of representable floating-point numbers between them, plus 1. This means that if two numbers do not have any other representable numbers between them, their ULP distance is 1. If there is one number between them, the distance is 2, and so on.

The big advantage of using ULP comparisons is that it automatically scales across different magnitudes of compared numbers. It doesn't break down around 0, nor does it break down for large numbers. ULP based comparison is also very easy to reason about numerically. You know what operations happened to the input and thus how far the output can be from the canonical answer and still be considered correct.

The significant disadvantage is that it is very hard, bordering on impossible, to reason about decimally without being an expert in numerical computations. Imagine explaining to a non-technical customer that you guarantee to be within 5 ULP of the correct answer.


So, what does all this mean? What comparison should you use in your code?

Sadly there is no one-size-fits-all answer. When comparing two floating-point numbers, you need to understand your domain and how the numbers came to be and then decide based on that.

What about Catch2?

I maintain a popular testing framework, Catch2, so you might be wondering how Catch2 handles comparing floating-point numbers. Catch2 provides some useful tools for comparing floating-point numbers, namely Approx and 3 different floating-point matchers, but it doesn't make any decisions for you.

Approx is a type that provides standard relational operators, so it can be used directly in assertions, and it supports both margin and epsilon comparisons. Approx compares equal to a number if the number is within either the margin or the epsilon (or both) of the target.

There are two crucial things[15] to remember about Approx. The first is that the epsilon comparison scales only with the Approx'd value, not the min/max of both sides of the comparison. The other is that a default-constructed Approx instance only performs epsilon comparison (margin defaults to 0).

The matchers each implement one of the three approximate comparisons, and since they are matchers, you can arbitrarily combine them to compare two numbers with the desired semantics. However, it is important to remember that the ULP matcher does have a slightly non-standard interpretation of ULP distance.

The ULP matcher's underlying assumption is that the distance between two numbers that directly compare equal should be 0, even though this is not the interpretation by the standard library, e.g. through std::nextafter. This means that e.g. ulpDistance(-0, 0) == 0 as far as the ULP matcher is concerned, leading to other minor differences from naive ULP distance calculations.

Summarizing the behaviour of the ULP matcher:
\[
\begin{align}
x = y &\implies \textrm{ulpDistance}(x, y) = 0 \\
\textrm{ulpDistance}(\textrm{max-finite}, \infty) &= 1 \\
\textrm{ulpDistance}(x, -x) &= 2 \times \textrm{ulpDistance}(x, 0) \\
\textrm{ulpDistance}(\textrm{NaN}, x) &= \infty
\end{align}
\]


That is all for this post. Now you can go and fix floating-point comparisons in your code. Or use this post to win internet arguments. As long as you don't give advice assuming that floating-point comparisons are one-size-fits-all, it is fine by me.


  1. As it turns out, trying to represent real numbers (of which there is an uncountable infinity) using only fixed space is a very complex problem. ↩︎

  2. This simplifies some things but also means that zero can be either positive or negative. Luckily they compare equal, but that makes the cases where you cannot use them interchangeably even more surprising. ↩︎

  3. For single-precision floating-point numbers (usually float), the exponent has 8 bits, and mantissa has 23 bits. For double-precision floating-point numbers (usually double), the exponent has 11 bits, and the mantissa has 52 bits. ↩︎

  4. Not only are NaNs never equal to any floating value (including other NaNs), trying to compare them also always produces an unordered result. This means that all of \(\textrm{NaN} < x\), \(\textrm{NaN} \le x\), \(\textrm{NaN} > x\), \(\textrm{NaN} \ge x\) are false, even if \(x\) is another NaN. ↩︎

  5. Obviously this assumes that a and b are of a floating-point type, e.g. double. ↩︎

  6. This means addition, subtraction, multiplication, division, and, surprisingly, sqrt given specific rounding mode. ↩︎

  7. This doesn't apply if you use -ffast-math or equivalent, if your code targets x87 FPU, or if your compiler likes to use fused multiply-add instructions. ↩︎

  8. These are your logarithms, sine and cosine, etc. The definition of transcendental function is that they are not expressible as a finite combination of additions, subtractions, multiplications, divisions, powers or roots. ↩︎

  9. Interestingly, you will get different results if you implement the comparison using absolute value versus splitting it into two comparisons (so you get \((\textrm{lhs} + \textrm{margin} \geq \textrm{rhs}) \wedge (\textrm{rhs} + \textrm{margin} \geq \textrm{lhs})\)). Specifically, the comparison using absolute value will reject two infinities, while the split comparison accepts infinities as equal. For this reason, Catch2 internally uses the latter implementation. ↩︎

  10. This is because of the floating nature of the floating-point numbers, and the smallest representable difference becoming bigger for bigger numbers. Remember that for large enough numbers, X + 1 == X. ↩︎

  11. To make matters more confusing, people also sometimes talk about something called machine epsilon. Machine epsilon is the difference between 1.0 and the next higher representable value (in other words, a value that is 1 ULP from 1.0 in the direction of positive infinity). This epsilon is a different epsilon than the one used in relative comparisons, even though they can be related (e.g. Catch2 defaults relative epsilon to 100 * machine epsilon of the type). ↩︎

  12. Notice that the formulas for absolute margin and relative epsilon comparisons are very similar. The only difference is whether we scale the allowed differences or not. ↩︎

  13. To understand how this happens, consider a comparison between 0 and some other number. The only way this comparison can ever pass is to set epsilon to 1, thus allowing up to 100% difference between the two sides. Such epsilon, in turn, can only fail for numbers with different signs, making the comparison useless.

    Even if the other side is the smallest representable positive number and thus likely "close enough" for approximate comparisons, we would still have to set the epsilon to 1 for the comparison to succeed. ↩︎

  14. Consider \(\varepsilon = 0.1\), \(\textrm{lhs} = 10\), and \(\textrm{rhs} = 11.1\). ↩︎

  15. Both of these behaviours are there for legacy reasons because I do not want to break backwards compatibility in a way that can cause tests to pass when they would not before. Generally, I recommend considering Approx legacy and building the comparisons out of matchers. ↩︎

]]>
<![CDATA[The Little Things: Testing with Catch2]]>https://codingnest.com/the-little-things-testing-with-catch-2/64c28599a062e6148d2e46caWed, 05 May 2021 21:51:50 GMT

This post will go over testing with Catch2 and will be very example heavy. I want to cover first the basic usage of Catch2 (tests, assertions, sections, test grouping) and then some more advanced and less used features. I will not cover every feature Catch2 has, just those that I think are most likely to be generally helpful.

Note that this post is not about the whys, hows, and whats of testing. I intend to write a post about that too, but this one exists to show off Catch2.

All examples in this post will be written against the v3 branch of Catch2.

Catch2 basics

As with all testing frameworks, the two most fundamental parts of Catch2 are test cases that contain assertions. Assertions are written using the REQUIRE[1] macro and must be contained within a test case[2], which in turn is created using the TEST_CASE macro.

The following simple example defines a single test case with 3 assertions. The test case is called "simple test case", which we can use to refer to the test case later. There is also an implementation of factorial with a tiny bug which the tests will run into.

#include <catch2/catch_test_macros.hpp>

static int factorial(int n) {
    if (n <= 1) {
        return n;
    }
    return n * factorial(n - 1);
}

TEST_CASE("Simple test case") {
    REQUIRE(factorial( 1) == 1);
    REQUIRE(factorial(10) == 3'628'800);
    REQUIRE(factorial( 0) == 1);
}

Compiling and running the example gives this (abridged) output:

...............................................................................

/app/example.cpp:13: FAILED:
  REQUIRE( factorial( 0) == 1 )
with expansion:
  0 == 1

===============================================================================
test cases: 1 | 1 failed
assertions: 3 | 2 passed | 1 failed

The interesting part of it is that in the case of a failure[3], we see both the original expression, REQUIRE(factorial( 0) == 1), and the actual compared values: 0 == 1.

Do you see the bug?[4]

Sections

Sections are a feature that is not common in the xUnit family of testing frameworks. They allow defining multiple paths through a test case. These paths can (partially) overlap and thus can be used to provide set-up and tear-down functionality. In the simple example below, there will be two paths through the test. The first one will print "1A\n", and the other will print "1B\n".

#include <catch2/catch_test_macros.hpp>

#include <iostream>

TEST_CASE("Section showcase") {
    std::cout << '1';
    SECTION("A") {
        std::cout << 'A';
    }
    SECTION("B") {
        std::cout << 'B';
    }
    std::cout << '\n';
}

(try it on godbolt)

Sections can also be nested. The following example defines 4 paths through the test case, printing "1Aa\n", "1Ab\n", "1Ba\n", and "1Bb\n" respectively.

#include <catch2/catch_test_macros.hpp>

#include <iostream>

TEST_CASE("Section showcase") {
    std::cout << '1';
    SECTION("A") {
        std::cout << 'A';
        SECTION("a") { std::cout << 'a'; }
        SECTION("b") { std::cout << 'b'; }
    }
    SECTION("B") {
        std::cout << 'B';
        SECTION("a") { std::cout << 'a'; }
        SECTION("b") { std::cout << 'b'; }
    }
    std::cout << '\n';
}

(try it on godbolt)

Ultimately, the use of SECTIONs boils down to defining a tree of tests that share some of the code. The tests are then run in a depth-first, top-to-bottom order.

Please note that while the only absolute limit on nesting sections is whatever your compiler can handle before giving up or running out of memory, nesting beyond 2-3 levels is usually unreadable in practice.

Stringifying custom types

In the very first example, when the assertion failed, Catch2 showed us the actual values on both sides of the comparison. To do this, it needs to know how to turn a type into a string it can display; otherwise, it will just show the value as "{ ? }". There are two ways[5] to have your type properly stringified by Catch2:

  1. Provide ADL-findable overload of operator<<(std::ostream&, T const&) for your type.
  2. Specialize Catch::StringMaker<T> for your type.

The second option has higher priority, so if a type has both operator<< overload and StringMaker specialization, the specialization will be used.

(try it on godbolt)

Test case tagging and grouping

Test cases can also be associated with strings called tags. Tags have two purposes. One is to allow users of Catch2 to group tests that have something in common, e.g. tests for custom allocators, and the other is to mark a test as having some specific property, e.g. that it is expected to fail.

Test cases are assigned their tags via the second[6] (optional) argument to TEST_CASE macro, e.g. TEST_CASE("widgets can be constructed from strings", "[widget][input-validation]") creates a test case with two tags, [widget] and [input-validation].

Some tags can also have special meaning. In general, Catch2 reserves tag names starting with "!" for its own purposes, e.g. [!shouldfail] inverts the pass/fail logic of a test: if an assertion fails, the test case succeeds, but if no assertion fails, then the test case fails. Catch2 also ascribes special meaning to tags starting with ".", e.g. [.] or [.widget]. These mark the tagged tests as "hidden" – hidden tests will be run only if they are explicitly selected; they will not be run by default.

Let's take a look at an example:

#include <catch2/catch_test_macros.hpp>
#include <iostream>

TEST_CASE("first", "[A][foo]") {
    std::cout << "first\n";
}

TEST_CASE("second", "[B][.foo]") {
    std::cout << "second\n";
}

TEST_CASE("third", "[C][bar]") {
    std::cout << "third\n";
}

TEST_CASE("fourth", "[A][.][bar]") {
    std::cout << "fourth\n";
}

Compiling the tests above into their own binary and running it without further arguments will run tests "first" and "third" because the other two tests are hidden. Specifying the "[foo]" tag will run tests "first" and "second", and so on. You can also ask for all tests that are not tagged with "[foo]" by negating the tag: "~[foo]". This will run only one test, "third".

You can also specify multiple tags as the test filter; "[tag1][tag2]" means run tests that have both tags, "[tag1],[tag2]" means run tests that have either of the two tags.

More advanced features

There are three more advanced features that I want to showcase:

  • Matchers
  • Generators
  • Benchmarking

Matchers

Matchers are helpful for testing more complex properties than can be expressed with a simple comparison operator. For example, if a function returns a set of values but does not promise a specific order, we cannot compare the result to expected values directly.

In Catch2, matchers are usually[7] used in the REQUIRE_THAT(expression, matcher) macro. This is shown in the example below, where we check that the (shuffled) vector contains the correct elements in an unspecified order:

#include <catch2/catch_test_macros.hpp>
#include <catch2/matchers/catch_matchers_vector.hpp>

#include <algorithm>
#include <random>

TEST_CASE("vector unordered matcher", "[matchers][vector]") {
    using Catch::Matchers::UnorderedEquals;
    std::vector<int> vec{0, 1, 2, 3, 4};
    
    std::shuffle(vec.begin(), vec.end(), std::random_device{});
    
    REQUIRE_THAT(vec, UnorderedEquals<int>({0, 1, 2, 3, 4}));
}

(try it on godbolt)

Catch2's matchers can also be combined together with the logical operators &&, || and !. These do what you would expect given their meaning for boolean expressions, so that matcher1 && !matcher2 only accepts an input if matcher1 accepts it and matcher2 does not. Thus, in the example below, the combined matcher requires the input string to either not contain "MongoDB" or not contain "web scale".

#include <catch2/catch_test_macros.hpp>
#include <catch2/matchers/catch_matchers_string.hpp>

std::string description() {
    return "MongoDB is web scale!";
}

TEST_CASE("combining matchers") {
    using Catch::Matchers::Contains;
    
    REQUIRE_THAT(description(),
                 !Contains("MongoDB") || !Contains("web scale"));
}

(try it on godbolt)

For more on Catch2's matchers (e.g. which matchers are implemented in Catch2 and how to implement your own matchers), look into the matcher documentation.

Generators

Generators are Catch2's implementation of data-driven testing. The core idea is that you can keep the same test code but feed the test code different inputs to test different cases.

Data generators are declared inside test cases with the GENERATE macro, and a generator expression inside it. The example below shows a test case that will be run for 3 different inputs - 2, 4, and 6:

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>

TEST_CASE("Simple generator use") {
    auto number = GENERATE(2, 4, 6);
    CAPTURE(number);
    REQUIRE(number % 2 == 0);
}

(try it on godbolt)

Generators can be mixed with sections. When doing so, you can reason about them as if they defined another section from their GENERATE statement until the end of the scope, and that section will be entered for each generated input. This means that the example below will print 6 lines, "A\n", "B\n", "B\n", "A\n", "B\n", and "B\n".

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>

#include <iostream>

TEST_CASE("Simple generator use") {
    auto number = GENERATE(2, 4);
    SECTION("A") {
        std::cout << "A\n";
    }
    SECTION("B") {
        auto number2 = GENERATE(1, 3);
        std::cout << "B\n";
    }
}

(try it on godbolt)

Catch2 also provides some built-in utility generators, like table, which helps with defining sets of inputs and the expected results:

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>

#include <string.h>
#include <tuple>

TEST_CASE("tables", "[generators]") {
    auto data = GENERATE(table<char const*, int>({
        {"first", 5},
        {"second", 6},
        {"third", 5},
        {"etc...", 6}
    }));

    REQUIRE(strlen(std::get<0>(data)) == static_cast<size_t>(std::get<1>(data)));
}

(try it on godbolt)

There is also a variety of higher-order generators, e.g. filter or take. These can be used to create complex test data generators, like in the example below, where we generate 10 odd random integers in the range [-100, 100]:

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators_adapters.hpp>
#include <catch2/generators/catch_generators_random.hpp>

TEST_CASE("Chaining generators") {
    auto i = GENERATE(take(10, filter([](int i) {
                              return i % 2 == 1;
                           }, random(-100, 100))));
    REQUIRE(i > -100);
    REQUIRE(i < 100);
    REQUIRE(i % 2 == 1);
}

(try it on godbolt)

For more on Catch2's generators (e.g. which generators are implemented in Catch2 and how to implement your own), look into the generator documentation.

(Micro)Benchmarking

Catch2 also provides basic microbenchmarking support. You can insert a benchmark into any test case using the BENCHMARK macro followed by a block of code to benchmark. You can also combine benchmarks and assertions[8], as shown in the example below:

#include <catch2/catch_test_macros.hpp>
#include <catch2/benchmark/catch_benchmark.hpp>

static int factorial(int n) {
    return n <= 1? 1 : n * factorial(n-1);
}

TEST_CASE("Simple benchmark") {
    REQUIRE(factorial(12) == 479'001'600);

    BENCHMARK("factorial 12") {
        return factorial(12); // <-- returned values won't be optimized away
    }; // <--- !! semicolon !!
}

(try it on godbolt)

If you want to run benchmarks for different input sizes, you can combine generators with benchmarks, like in the example below:

#include <catch2/catch_test_macros.hpp>
#include <catch2/benchmark/catch_benchmark.hpp>
#include <catch2/generators/catch_generators.hpp>

#include <string>
#include <tuple>

static int factorial(int n) {
    return n <= 1? 1 : n * factorial(n-1);
}

TEST_CASE("Validated benchmark") {
    int input, expected_result;
    std::tie(input, expected_result) = GENERATE(table<int, int>( {
        {0, 1},
        {1, 1},
        {5, 120},
        {10, 3'628'800},
        {12, 479'001'600},
    }));

    REQUIRE(factorial(input) == expected_result);

    BENCHMARK("factorial " + std::to_string(input)) {
        return factorial(input);
    };
}

(try it on godbolt)

For more on Catch2's microbenchmarking support (e.g. how to handle constructors and destructors, or how to add a setup step for your benchmark), look into the benchmarking documentation.

Final words

The above is by no means everything that Catch2 provides. I picked three features that I feel are most widely useful while being least widely known, and just on top of my head, I know I have skipped over at least:

  • Templated test cases (same test across different types)
  • Running specific sections in a test case
  • Running test cases in random order
  • Facilities for comparing floating-point numbers
  • Writing your own reporters
  • Logging extra information during a test run

And even I definitely do not remember everything present in Catch2. However, most[9] of the things provided are documented, and often you can find handy features by reading through the documentation.


  1. All REQUIRE* assertion macros also have a corresponding CHECK* assertion that has continue-on-test-failure semantics instead. The sole exception is STATIC_REQUIRE, which is functionally a static_assert but integrated with Catch2's assertion reporting. ↩︎

  2. This does not mean that it has to be directly in the code of TEST_CASE, just that a test case must be executing when an assertion is encountered. So it is fine to have, e.g. a helper function that uses REQUIRE, as long as it is only called from test cases. ↩︎

  3. You can also get this output for non-failing assertions if you pass the -s flag when invoking the test binary. ↩︎

  4. The first return statement should always return 1, not n. ↩︎

  5. You can also change how an exception's message is retrieved by Catch2. The default is to rely on std::exception::what. ↩︎

  6. Note that the first argument is also optional. TEST_CASE() is a valid "constructor" and will give the resulting test a special, "anonymous" name + number. However, you cannot skip the test name if you want to specify tags due to how optional arguments in C++ work. You can ask for your test case to be given an anonymous name by providing an empty name instead. This means that TEST_CASE("", "[some-tag-or-another]") will be an "anonymous" test with [some-tag-or-another] as its tag. ↩︎

  7. The two exceptions are REQUIRE_THROWS_WITH and REQUIRE_THROWS_MATCHES, which check that provided expression throws an exception, and then use matchers to validate that the desired exception was thrown. ↩︎

  8. Just make sure you are not benchmarking the assertions alongside your own code. ↩︎

  9. I am sorry for the state of reporter docs. I am waiting to actually write them until I finish breaking their interface for v3. ↩︎

]]>
<![CDATA[The Little Things: Everyday efficiencies]]>https://codingnest.com/the-little-things-everyday-efficiencies/64c28599a062e6148d2e46c9Thu, 29 Apr 2021 15:04:11 GMT

At some point, we have all heard a quote attributed to Donald Knuth, saying that:

Premature optimization is the root of all evil

There have been many fights over whether this applies, when is an optimization premature, and so on. This post is not meant to participate in these fights[1], but I do want to quote Donald Knuth in full before continuing:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

The full quote says that we should avoid pessimizing maintenance in the name of performance, unless we know that the performance matters. Thus the topic of this post: some ways to speed-up frequently written code without sacrificing the maintainability of the code.

We will look at two boring, yet commonly done, things, and see how we can easily lose (or gain) performance based on how we implement them. The two things are:

  • iterating over containers
  • bulk data transformation

Iterating over containers

C++11 added a new type of loop, called range-based for loop (or for-each loop, or range-for loop). It serves to iterate over all elements in a range, as defined by the iterators returned from begin() and end(). Using range-based for loops greatly simplifies some patterns relying on iterators, such as iterating over all entries in a std::set.

// pre C++11
for (std::set<std::string>::const_iterator it = set.begin(); it != set.end(); ++it) {
    std::cout << *it << '\n';
}

// post C++11
for (auto const& elem : set) {
    std::cout << elem  << '\n';
}

The most significant advantage of using range-for is that it is more limited than other forms of loops. Inside the loop you cannot refer to the index or iterator of the element[2], which helps communicate your intent: you want to iterate all elements, and there is no index-based trickery going on.

There is also a secondary advantage, though, and that is its potential to improve runtime performance. We will look at some examples and compare the generated assembly for an index loop over a std::vector with the assembly generated when using a range loop instead.

Consider these two simple functions:

void indexed(std::vector<int>& vec) {
    for (size_t idx = 0; idx < vec.size(); ++idx) {
        vec[idx] *= 2;
    }
}

void foreach(std::vector<int>& vec) {
    for (auto& elem : vec) {
        elem *= 2;
    }
}

both of them do the same thing, that is, multiply each element in a vector by 2. However, when using GCC 10.2 -O2, they do not compile into quite the same assembly (godbolt link):

indexed(std::vector<int, std::allocator<int> >&):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi+8]
        sub     rdx, rax
        mov     rcx, rdx
        shr     rcx, 2
        je      .L1
        add     rdx, rax
.L3:
        sal     DWORD PTR [rax]
        add     rax, 4
        cmp     rdx, rax
        jne     .L3
.L1:
        ret

foreach(std::vector<int, std::allocator<int> >&):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi+8]
        cmp     rax, rdx
        je      .L9
.L11:
        sal     DWORD PTR [rax]
        add     rax, 4
        cmp     rax, rdx
        jne     .L11
.L9:
        ret

The critical part, the inner loop itself, is the same for both - 4 instructions, but indexed has 7 instructions before the loop, while foreach has only 4. While the difference is tiny, and with larger inputs completely negligible, we should understand where it comes from before moving onto more complex examples.

The explanation is quite simple. std::vector consists of 3 pointers[3], one for the start of allocated memory, one for the first empty slot, and one that points one-past-the-allocation. This representation then means that std::vector::size has to be implemented as a subtraction between two pointers, which adds the extra instructions to the start of indexed.

So, for a simple example, the performance advantage goes to the range-for loop, but it is only a constant factor advantage. This means that the larger the actual input, the smaller the difference between the two loops is.

Now, let's take a look at a more complex example. More specifically, we will look at what happens if we call an opaque function inside the loop:

void foo(std::vector<int> const&);

void indexed(std::vector<std::vector<int>> const& in) {
    for (size_t idx = 0; idx < in.size(); ++idx) {
        foo(in[idx]);
    }
}

void foreach(std::vector<std::vector<int>> const& in) {
    for (auto& elem : in) {
        foo(elem);
    }
}

again, both of them do the same thing, that is, call foo on every element in in, and again, they compile into different assembly. But this time, the assembly is significantly different (godbolt link):

indexed(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
        mov     rdx, QWORD PTR [rdi]
        cmp     QWORD PTR [rdi+8], rdx
        je      .L6
        push    r12
        mov     r12, rdi
        push    rbp
        movabs  rbp, -6148914691236517205
        push    rbx
        xor     ebx, ebx
.L3:
        lea     rax, [rbx+rbx*2]
        add     rbx, 1
        lea     rdi, [rdx+rax*8]
        call    foo(std::vector<int, std::allocator<int> > const&)
        mov     rdx, QWORD PTR [r12]
        mov     rax, QWORD PTR [r12+8]
        sub     rax, rdx
        sar     rax, 3
        imul    rax, rbp
        cmp     rbx, rax
        jb      .L3
        pop     rbx
        pop     rbp
        pop     r12
        ret
.L6:
        ret

foreach(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     rbx, QWORD PTR [rdi]
        mov     rbp, QWORD PTR [rdi+8]
        cmp     rbx, rbp
        je      .L10
.L12:
        mov     rdi, rbx
        add     rbx, 24
        call    foo(std::vector<int, std::allocator<int> > const&)
        cmp     rbp, rbx
        jne     .L12
.L10:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

This time, the inner loops differ significantly, and foreach has a significant performance advantage. In indexed, the inner loop consists of 11 instructions, while in foreach it consists of only 5 instructions. The difference comes from the opaque call to foo.

The call to foo forbids the compiler from assuming that in is unchanged[4] between iterations. Without this assumption, in.size() has to be recalculated every iteration[5], which requires loading in in's members from memory, followed by a subtraction+division[6] to get the actual size.

The foreach function does not have to reload in on every iteration for a very simple reason: range-for is syntax sugar for an iterator loop that stores the begin and end iterators locally, before the loop starts. Thanks to this, the range-for loop does not have to recalculate size on every iteration[7]. There is a cost to this difference though. If foo does modify in, and causes reallocation, then foreach invokes UB, while indexed works correctly.
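To make that desugaring concrete, here is a sketch of roughly what the range-for in foreach expands to (with a counting stand-in for the opaque foo, so the behaviour can be checked):

```cpp
#include <vector>

// stand-in for the opaque foo from the examples above; counts its calls
static int calls = 0;
void foo(std::vector<int> const& /*elem*/) { ++calls; }

// roughly what the range-for loop in foreach desugars to: both iterators
// are computed once, before the loop, so in.end() is never recalculated
void foreach_desugared(std::vector<std::vector<int>> const& in) {
    auto it = in.begin();
    auto end = in.end();
    for (; it != end; ++it) {
        foo(*it);
    }
}
```

Because end is a plain local variable, the compiler does not need to reload in's members on every iteration, which is exactly the advantage visible in the generated assembly.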

Bulk data transformation

Another common operation is bulk transforming data from one representation to another, e.g. extracting list of UserIds from JSON. Let's take a look at two simple functions:

std::vector<int> no_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}

std::vector<int> do_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    ret.reserve(input.size());
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}

Both of these functions take a vector<int> and return a new vector, with all elements multiplied by two. The difference is that do_reserve reserves sufficient space in the return vector before filling it. Obviously this is going to perform better, but by how much? Let's benchmark it, using Catch2's benchmarking support:

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <catch2/benchmark/catch_benchmark_all.hpp>
#include <vector>

namespace {

std::vector<int> generate_ints(size_t sz) {
    std::vector<int> ret;
    ret.reserve(sz);
    
    for (size_t i = 0; i < sz; ++i) {
        ret.push_back(i % 128);
    }
    
    return ret;
}

std::vector<int> no_reserve(std::vector<int> const& input) { ... }
std::vector<int> do_reserve(std::vector<int> const& input) { ... }

} // end unnamed namespace


TEST_CASE("Benchmark reserving vectors", "[reserve][benchmark]") {
    const auto size = GENERATE(10'000,
                              100'000,
                            1'000'000,
                           10'000'000);
    auto data = generate_ints(size);
    CAPTURE(size);
    BENCHMARK("no reserve") {
        auto doubled = no_reserve(data);
        return doubled;
    };
    BENCHMARK("reserve") {
        auto doubled = do_reserve(data);
        return doubled;
    };
    SUCCEED();
}

Compiling the above in the Release configuration, using Clang 10, and running it on my machine, I get these results:

size no_reserve do_reserve relative speedup
10K 9.89 ± 0.08 us 7.42 ± 0.01 us 1.15x
100K 94.34 ± 0.31 us 76.56 ± 0.27 us 1.23x
1M 1.01 ± 0.00 ms 0.79 ± 0.00 ms 1.27x
10M 36.06 ± 0.02 ms 17.70 ± 0.01 ms 2.04x

The exact timings aren't important. What is important is that the speed-up increases with the increasing size of the data. The speed-up increases because the larger the input size is, the more times the no_reserve function ends up reallocating the return vector, and the more times the elements inside it are copied. Given that both functions perform the same transformation, the difference is entirely due to the superfluous reallocations.
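To see those reallocations directly, you can count how often push_back changes the vector's capacity. The following is a quick sketch (the exact counts depend on your standard library's growth factor):

```cpp
#include <cstddef>
#include <vector>

// counts how many times push_back had to reallocate while growing a
// vector to n elements without calling reserve() up front
std::size_t count_reallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocs = 0;
    std::size_t last_cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_cap) { // capacity changed => reallocation
            ++reallocs;
            last_cap = v.capacity();
        }
    }
    return reallocs;
}
```

With the typical geometric growth, the number of reallocations grows roughly logarithmically with n, but each late reallocation copies all elements inserted so far, which is where no_reserve loses its time.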

When interpreting the numbers above, you should keep in mind that in our example, the transformation work per element is trivial[8]. If the per-element work were less trivial, the relative speed-up would be smaller. An example with the inner loop changed to calculate exp(elem) is shown in this table:

size no_reserve do_reserve relative speedup
10K 119.15 ± 0.41 us 115.94 ± 0.42 us 1.03x
100K 1.19 ± 0.00 ms 1.16 ± 0.00 ms 1.03x
1M 12.13 ± 0.00 ms 11.60 ± 0.00 ms 1.05x
10M 183.38 ± 0.04 ms 138.98 ± 0.03 ms 1.32x

As with using range-for to iterate ranges, calling vector::reserve when we know the final size of a vector improves performance without impacting the code's future maintainability. Thus, we should use it when possible.

However, calling vector::reserve multiple times on a single instance is very likely a performance bug. Repeated calls to vector::reserve on the same instance can easily lead to O(n^2) overall complexity for appending elements (or O(n) for a single vector::push_back call). This problem commonly occurs when we insert elements in batches of, say, 100, and every time we "helpfully" reserve current_size + batch_size.

As a general rule, unless you 100% know what you are doing, reserve should never be called on one vector instance more than once during its lifetime. Ideally, you know what the final size will be and can reserve that outright. Less ideally, you can guard the call to reserve with a check that the vector instance has not allocated any capacity yet. Doing so can improve the performance of repeated batch inserts without risking the accidentally quadratic behaviour.
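As a sketch of the guarded variant (append_batch and expected_total are hypothetical names, not from the text above):

```cpp
#include <vector>

// appends a batch, reserving at most once per vector lifetime: only when
// the vector has not allocated any capacity yet
void append_batch(std::vector<int>& out, std::vector<int> const& batch,
                  std::size_t expected_total) {
    if (out.capacity() == 0) {
        out.reserve(expected_total); // one up-front guess, never repeated
    }
    out.insert(out.end(), batch.begin(), batch.end());
}
```

Later batches then rely on the vector's normal geometric growth, instead of repeatedly reserving current_size + batch_size.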

Bonus: inserting newlines into streams

Even though std::format has been standardized into C++20, and should be preferred to formatting using streams, I expect that we will still be dealing with streams and stream formatting for a long time[9]. Because streams are commonly poorly taught, many people end up writing unintentionally pessimized code, and I would prefer if they did not. Luckily, keeping with the theme of this post, the better-performing code is also more maintainable.

Let's say that we want to write a bunch of strings to a stream, with each string on its own line. A straightforward implementation of such a function could look like this:

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << std::endl;
    }
}

This code works, but the use of std::endl to write the newlines is inefficient, because it does more than just write a newline: it also flushes the stream, which is an expensive operation. Keeping with the theme of this post, the way to remove this inefficiency is, once again, to explicitly state our intent in the code, and insert \n into the stream.

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << "\n";
    }
}

But wait, why are we appending a string consisting of a single character to the stream? We only want to add a single character, not a string. This gives us our third implementation:

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << '\n';
    }
}

I wrote a quick benchmark, where these functions wrote out a bunch of strings[10] to a file. Running it on a Linux machine with an SSD as the main drive, I get the following numbers:

n std::endl "\n" '\n' endl vs "\n" speedup "\n" vs '\n' speedup
100k 1.90 ms 1.61 ms 1.60 ms 1.18x 1.01x
1M 19.59 ms 16.79 ms 16.47 ms 1.17x 1.02x
10M 196.43 ms 169.23 ms 166.93 ms 1.16x 1.01x

From the numbers, you can see that going from std::endl to "\n" is a significant improvement, and there is also a tiny improvement going from "\n" (inserting the newline as a string of a single character) to '\n' (inserting the newline as a single character).

Putting it all together, if you want to insert a newline to a stream, you should insert it as \n, whether as a part of a string, or as a single character. If you want to also flush the stream at the same time, you should use \n + std::flush, to explicitly document[11] your intent of flushing the stream, rather than using std::endl.
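A sketch of the flushing variant described above (write_strings_flushed is a hypothetical name):

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// same loop as before, but for callers who also want the output flushed:
// '\n' per line, plus one explicit, documented flush at the end
void write_strings_flushed(std::ostream& out,
                           std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << '\n';
    }
    out << std::flush;
}
```

If each line really must be flushed individually, `out << elem << '\n' << std::flush;` states that intent just as clearly, without hiding the flush inside std::endl.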


That's all for this post. Maybe the next one will come sooner than in 6 months.



  1. My opinion on this boils down to "it depends, be smart about it, consider more than just the code in isolation". The long version could easily take up another 4k words post, so I will not go there today. ↩︎

  2. You can, of course, explicitly ask for the index by wrapping the iterated range in an enumerate, which then turns the returned element into an (index, element) pair. In doing so, you state you are using the indices for something up-front, at the loop declaration, which is still helpful in determining the programmer's intent. ↩︎

  3. This is true for all three big implementations, libc++, libstdc++, MS's STL, and also some other, like EA STL, and FB's Folly. The reason for this is that it improves the performance of insert-at-the-end operations (push_back, emplace_back, etc), and calls to end(), at the cost of worsened performance of size() and capacity() queries. As those should be used relatively rarely, this tradeoff is generally seen as beneficial. ↩︎

  4. Yes, foo is allowed to modify in, and the only thing preventing that is that it would be a terrible programming style. While both functions take in by const&, the constness there only constrains the indexed/foreach functions, nothing else. Especially not the actual argument passed by the const&. ↩︎

  5. Actually, the loop in indexed could be modified to provide the same performance by writing it as for (size_t idx = 0, end = in.size(); idx < end; ++idx). However, this version is less common, and thus less maintainable than the range-for loop. ↩︎

  6. The compiler is smart enough to replace the division with multiplication by reciprocal constant, which has the same final effect but performs better. ↩︎

  7. Note that just using iterators for the loop is not enough. If you use them naively and formulate your end condition as iter != in.end(), then the loop will be simpler but will still try to reload in in every iteration. You have to store the iterator returned from in.end() outside the loop. ↩︎

  8. Multiplying int by two should be turned into a single bitshift by any optimizing compiler worth the name. ↩︎

  9. One of the big reasons they won't die out soon is that appending to a stream was the big customization point on how to serialize a type since C++98. There is significant inertia in that, and there is a reason format has special support for types that know how to write themselves to a stream but do not provide custom formatter. ↩︎

  10. This was a very hacky benchmark, so the input strings are nowhere near evenly distributed. Most of the strings have 30 characters, about 10% have 20 characters, and 1% have 10 characters. The strings are also strictly numeric. ↩︎

  11. Yes, you could use std::endl and add a comment saying that you do want the flushing behaviour. However, doing so is more work than having a separate newline and stream flush, and also makes for a worse maintenance experience later. Because the use of std::endl is usually an antipattern, everyone touching the code will have to read the comment and understand that this one use is fine, unlike usual. ↩︎

]]>
<![CDATA[The Little Things: Speeding up C++ compilation]]>https://codingnest.com/the-little-things-speeding-up-c-compilation/64c28599a062e6148d2e46c7Sun, 20 Sep 2020 19:25:18 GMT

The Little Things is a new series of posts based on Locksley's internal training sessions. Often the contents are either proprietary (e.g. the inner workings of specific master key platforms) or not generally interesting (e.g. our internal libraries and tooling), but sometimes the contents are suitable for a wider audience, in which case I want to share them.


This post will be about some source-level techniques for speeding up C++ compilation, and their (dis)advantages. It will not talk about things external to C++, such as buying better hardware, using a better build system, or using a smarter linker[1]. It will also not talk about the tooling that can find compilation bottlenecks, as that will be the subject of a later post.

Overview of C++ compilation model

I will start with a quick overview of the C++ compilation model, to provide context for some of the tricks I will show later. Note that this overview will be very coarse; if you want a detailed look at the subtleties of the 9-phase compilation model defined in the C++ standard, look elsewhere.

We will consider the compilation of C++ binary to happen in 3 steps:

  1. Preprocessing
  2. Compilation
  3. Linking

Preprocessing

The first step is preprocessing. During it, the preprocessor takes a .cpp file and parses it, looking for preprocessor directives, such as #include, #define, #ifdef, etc.

Let's take this super simple file as an example

// tiny.cpp
#define KONSTANTA 123

int main() {
    return KONSTANTA;
}

It contains one preprocessor directive, #define. It says that any following occurrence of KONSTANTA should be replaced with 123. Running the file through a preprocessor leads to output like this one:

$ clang++ -E tiny.cpp
# 1 "tiny.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 383 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "tiny.cpp" 2


int main() {
    return 123;
}

We can see that in return KONSTANTA the KONSTANTA part was replaced with 123, as it should be. We also see that the compiler left itself a bunch of other notes that we do not care about that much[2].

The big problem with the preprocessor model is that the #include directive literally means "copy-paste all of this file's contents here". Of course, if that file's contents contain further #include directives, then more files will be opened, their contents copied out, and in turn, the compiler will have more code to deal with. In other words, preprocessing increases the size of the input, usually significantly so.

The following is a simple "Hello World" in C++, using streams.

// hello-world.cpp
#include <iostream>

int main() {
    std::cout << "Hello World\n";
}

After preprocessing, the file will have 28115[3] lines for the next step, compilation, to deal with.

$ clang++ -E hello-world.cpp | wc -l
28115

Compilation

After a file is preprocessed, it is compiled into an object file. Object files contain the actual code to run, but cannot be run without linking. One of the reasons for this is that object files can refer to symbols (usually functions) that they do not have the definition (code) for. This happens, e.g. if a .cpp file uses a function that has been declared, but not defined, like so:

// unlinked.cpp
void bar(); // defined elsewhere (hopefully)

void foo() {
    bar();
}

You can look inside a compiled object file to see what symbols it provides and what symbols it needs, using nm (Linux) or dumpbin (Windows). If we look at the output for the unlinked.cpp file, we get this:

$ clang++ -c unlinked.cpp && nm -C unlinked.o
                 U bar()
0000000000000000 T foo()

U means that the symbol is not defined in this object file. T means that the symbol is in the text/code section and that it is exported, which means that other object files can get foo from this unlinked.o. It is important to know that symbols might also be present in an object file, but not be available to other object files. Such symbols are marked with t.

Linking

After all the files have been compiled into object files, they have to be linked into the final binary artefact. During linking, all the various object files are smashed together in a specific format, e.g. ELF, and the various references to undefined symbols in object files are resolved with the address of the symbol, as provided by a different object file (or library).

With this overview done, we can start tackling the different ways to speed up the compilation of your code. Let's start simple.

#include less

Including a file usually brings in a lot of extra code, which the compiler then needs to parse and check. Thus the simplest, and usually also the biggest, way to speed up the compilation of your code, is to just #include fewer files. Reducing the include set is especially beneficial in header files, as they are likely to be included from other files, thus amplifying the impact of your improvements.

The easiest way to do this is to remove any unused includes. Unused includes shouldn't happen often, but sometimes they are left behind during refactoring, and a tool like IWYU can[4] make finding them simple. However, just cleaning up unused includes is unlikely to provide many benefits, and so you will have to reach for bigger guns: forward declarations and manual outlining.

But before explaining forward declarations and manual outlining, I want to go over the costs of header inclusion quickly, so we can build up intuition on what sort of speed-ups we can expect from pruning down include graphs.

The cost of header inclusion

The table below shows the time required by Clang[5] to compile a file that only includes some stdlib headers.

header(s) included time to compile (ms) difference from baseline (ms)
none 11.3 ± 0.2 -
<vector> 68.8 ± 0.3 57.5 ± 0.36
<string> 136.3 ± 0.8 125.0 ± 0.82
<stdexcept> 137.0 ± 0.8 125.7 ± 0.82
<vector>, <string> 155.3 ± 0.9 144.0 ± 0.92
<string>, <stdexcept> 136.7 ± 0.7 125.4 ± 0.73
<vector>, <string>, <stdexcept> 156.1 ± 0.8 144.8 ± 0.82

The first row shows the time needed to compile a completely empty file, to provide a baseline: the time required by the compiler to start, read the file, and do nothing. The other lines are more interesting. As the second line says, just including <vector> adds 57 ms to the compilation time, even though no actual code is emitted. As we can see, the cost of including <string> is more than double that of <vector>, and the cost of including <stdexcept> is about the same as for <string>.

More interesting are the rows for combinations of headers, because no combination of headers is as expensive as compiling each of them on its own. The reason is quite simple: their internal includes overlap. The most extreme case is <string> + <stdexcept>, because <stdexcept> is basically <string> + a couple of types deriving from std::exception.

What you should take away from this are two things:

  • Even if you do not use anything from a header, you still have to pay for it.
  • Include costs do not neatly sum, nor subtract.

Now let's go through techniques we can use to include fewer files.

Forward declarations

Quite often, when we mention a type, we only need to know that it exists but do not need to know its definition. The common case is creating a pointer or a reference to a type, in which case you need to know that the type exists (a forward declaration), but not what it looks like (a definition).

As an example, this header is valid:

class KeyShape; // forward declaration

size_t count_differences(KeyShape const& lhs, KeyShape const& rhs);

as long as the implementation file includes the appropriate headers:

#include "key-shape.hpp" // provides the full definition of KeyShape

size_t count_differences(KeyShape const& lhs, KeyShape const& rhs) {
    assert(lhs.positions() == rhs.positions());
    ...
}

You can also use forward declaration together with some templated classes, whose size does not change depending on the template argument, e.g. std::unique_ptr and std::vector[6]. However, doing so can force you to outline your constructors, destructors and other special member functions (SMFs), as those usually need to see the full definition of the type. Your code then ends up looking like this:

// foo.hpp
#include <memory>

class Bar;

class Foo {
    std::unique_ptr<Bar> m_ptr;
public:
    Foo(); // = default;
    ~Foo(); // = default;
};
// foo.cpp
#include "bar.hpp"

Foo::Foo() = default;
Foo::~Foo() = default;

Notice that we still use the compiler-generated default constructor and destructor, but do so in the .cpp file, where we see the full definition of Bar. I also like to use the // = default; comment to signal to other programmers reading the code that the SMF is explicitly declared but will be defaulted, and thus there won't be any special logic in it.

When using this technique, please remember that the outlined functions cannot be inlined without LTO. In other words, you probably do not want to outline every function just because you can, because calling trivial functions can be much more expensive than inlining their code directly.

Explicit outlining

The idea underlying explicit outlining is quite simple: sometimes we get better results if a piece of code is explicitly split away from a function. One of the most common reasons is, perhaps ironically, improving inlining by making the common path of a function small. However, in our case, the reason for doing this is to improve the compilation times.

If a piece of code is expensive to compile, and inlining it is not crucial for performance, it can be moved out of line so that only one TU has to pay for compiling it. The canonical example of this is throwing an exception in general, and exceptions from <stdexcept> in particular. Throwing an exception generates quite a lot of code, and throwing more complex standard exception types, such as std::runtime_error, also requires an expensive[7] header, <stdexcept>, to be included.

By instead replacing all throw foo; statements with calls to a helper function along the lines of [[noreturn]] void throw_foo(char const* msg), the call sites become smaller, and all compilation costs related to the throw statements are concentrated in a single TU. This is a useful optimization even for code that is only present in a .cpp file. For code in headers[8], this optimization is almost critical, due to the multiplicative effect of textual code inclusion.

Let's try this with a simple example: consider a toy constexpr static_vector[9] implementation. It will throw std::logic_error from push_back if there is no more capacity, and we will test two versions: one that throws the exception inline, and one that instead calls a helper function to do it.

The inline-throwing implementation looks something like this:

#include <stdexcept>

class static_vector {
    int arr[10]{};
    std::size_t idx = 0;
public:
    constexpr void push_back(int i) {
        if (idx >= 10) {
            throw std::logic_error("overflew static vector");
        }
        arr[idx++] = i;
    }
    constexpr std::size_t size() const { return idx; }
    
    // other constexpr accessors and modifiers as appropriate
};

The only change in the out-of-line throwing implementation is that the throw std::logic_error(...) line is replaced with a call to a throw_logic_error helper function. Otherwise, they are the same.
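The helper itself might look like the following sketch (the file names and exact signature are assumptions; the post only names the throw_logic_error helper):

```cpp
// static-vector-throw.hpp -- cheap to include: no <stdexcept> needed here
[[noreturn]] void throw_logic_error(char const* msg);

// static-vector-throw.cpp -- the only TU that pays for <stdexcept>
#include <stdexcept>

[[noreturn]] void throw_logic_error(char const* msg) {
    throw std::logic_error(msg);
}
```

push_back then calls throw_logic_error("overflew static vector") instead of the throw expression, and every other TU that includes the header compiles without seeing <stdexcept>.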

We will now create 5 TUs that include the static vector header, and contain a simple function that uses the static vector, like this:

#include "static-vector.hpp"

void foo1(int n) {
    static_vector vec;
    for (int i = 0; i < n / 2; ++i) {
        vec.push_back(i);
    }
}

Using the same compiler, settings[5:1], and machine as before, compiling a full binary in the inline-throwing case takes 883.2 ms (± 1.8), while the out-of-line-throwing case takes 285.5 ms (± 0.8). This is a significant (~3x) improvement, and the improvement grows with the number of compiled TUs that include the static-vector.hpp header. Of course, it is good to keep in mind that the more complex the TUs are, the smaller the improvement becomes, as the cost of the <stdexcept> header makes up a smaller part of the total cost of the TU.

There is not much more to be said about improving your build times by just including less stuff, so it is time to look at another trick: using hidden friends.

Hidden Friends

Hidden friends is the name of a technique that uses a relatively obscure rule about the visibility of names (functions/operators) to reduce the size of overload sets. The basic idea is that a friend function declared only inside a class can only be found and called via Argument Dependent Lookup (ADL). This then means that the function does not participate in overload resolution unless its "owning" type is present in the expression.

Hidden friends are best explained with some examples.

operator<< as hidden friend

struct A {
    friend int operator<<(A, int); // hidden friend
    friend int operator<<(int, A); // not a hidden friend
};
int operator<<(int, A);

In the snippet above, only the first overload of operator<< is a hidden friend. The second overload is not, because it is also declared outside of A's declaration.
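A small runnable sketch of the visibility rule itself (lib, Tag, and describe are hypothetical names): the hidden friend can only be reached when an argument of its class type brings it in via ADL.

```cpp
#include <string>

namespace lib {
struct Tag {
    // hidden friend: declared (and defined) only inside the class, so
    // neither ordinary nor qualified lookup can see the name -- only ADL
    friend std::string describe(Tag) { return "lib::Tag"; }
};
} // namespace lib

std::string use() {
    lib::Tag t;
    return describe(t);   // OK: ADL finds describe via the lib::Tag argument
    // lib::describe(t);  // error: the name is invisible to qualified lookup
}
```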

Pruning the overload set has multiple advantages:

  • Shorter compilation errors when overload resolution fails. Compare the error for the same expression with hidden friends versus without them.
  • Less chance for implicit conversions to happen. For an implicit conversion to happen, at least one argument must already have the target type; an overload that would require implicit conversions of all of its arguments cannot be selected. Example
  • Faster compilation, because the compiler has less work to do.

Given the topic of this post, that last advantage is what we care about. So how much difference does using hidden friends make? To test this, I generated a simple .cpp file with 200 structs like the one above, giving a total of 400[10] overloads of operator<<. The TU also contains a one-line function that returns A1{} << 1, to induce overload resolution of operator<<.

When using hidden overloads, it took Clang[5:2] 25.4 (± 0.1) ms to compile this TU into an object file. Without hidden overloads, it took 36.7 (± 0.2) ms. This is already a nice speed-up, but will it scale with more overload resolutions in the TU? Let's try modifying the function to contain 1/10/50/100 summed-up operator<< calls, and see the results.

operator<< calls hidden (ms) non-hidden (ms) speed up
1 25.4 ± 0.1 36.7 ± 0.2 1.44 ± 0.01
10 25.3 ± 0.1 40.2 ± 0.2 1.59 ± 0.01
50 27.6 ± 0.2 57.9 ± 0.6 2.10 ± 0.02
100 29.9 ± 0.1 79.9 ± 1.4 2.67 ± 0.05

As we can see, the speed-up increases with the number of overload resolutions required by the TU, even though the overload resolution always happens for the same expression. However, even for large TUs, with large overload sets and many overload resolutions, the difference in absolute numbers is ~50 ms. This is a nice speed-up, but if you remember the table on the cost of including different stdlib headers, you know that it is less than the difference between compiling an empty file and a file that includes <vector>.

In practice, this means that you are more likely to see larger improvements in compilation times from pruning unnecessary #includes than from using hidden friends. However, hidden friends also improve your code in other ways and are surprisingly powerful in highly templated code.

There is one disadvantage to using hidden friends. The header where you declare the class and the hidden friend must contain all other declarations involved in declaring the hidden friend. This can increase the header's heft significantly, e.g. if you need to include <iosfwd> for std::ostream& for the stream insertion operator[11].

To sum it all up, using hidden friends improves your compilation times, improves your error messages, and also prevents some cases of implicit conversions. This means that you should default to providing operator overloads and ADL customization points as hidden friends[12].
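As a sketch of that default, here is a stream-insertion operator written as a hidden friend (Point is a hypothetical example class):

```cpp
#include <ostream>
#include <sstream>

class Point {
    int x, y;
public:
    Point(int x_, int y_) : x(x_), y(y_) {}
    // hidden friend: participates in overload resolution only when a
    // Point appears in the expression
    friend std::ostream& operator<<(std::ostream& os, Point const& p) {
        return os << '(' << p.x << ", " << p.y << ')';
    }
};
```

Streaming a Point, e.g. into a std::ostringstream, produces "(1, 2)", while unrelated uses of operator<< never even see this overload.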

Now let's look at the last trick we will look at today, putting less pressure on the linker.

Link less

There are two ways to have the linker do less work. The first one is to hide symbols from linking, the second one is to make symbol names shorter. Because the latter is... not worth it except in extreme cases[13], we will only look at the former.

During the compilation model overview, I mentioned that a symbol could be present in an object file without being available to other object files. Such a symbol is said to have internal linkage (as opposed to external linkage). The compilation speed advantage of symbols with internal linkage comes from the fact that the linker does not have to keep track of them as available, and thus has less work to do.

As we will see later, there are also runtime-performance and object file size benefits to symbol hiding, but first, let's look at an example.

// local-linkage.cpp
static int helper1() { return -1; }

namespace {
int helper2() { return  1; }
}

int do_stuff() { return helper1() + helper2(); }

In the example above, both helper1 and helper2 have internal linkage. helper1 because of the static keyword, helper2 because it is enclosed in an unnamed[14] namespace. We can check this with nm:

$ clang++ -c local-linkage.cpp && nm -C local-linkage.o
0000000000000000 T do_stuff()
0000000000000030 t helper1()
0000000000000040 t (anonymous namespace)::helper2()

What's even more interesting is that if we up the optimization level, both helper1 and helper2 disappear entirely. This is because they are small enough to be inlined in do_stuff, and no code from different TU can refer to them, because they have internal linkage.

$ clang++ -c local-linkage.cpp -O1 && nm -C local-linkage.o
0000000000000000 T do_stuff()

This is also how internal linkage can improve runtime performance. Because the compiler sees all places where the symbol is used, it has more motivation to inline it into the call sites to remove the function altogether. And even if it cannot, it can optimize the code with extra knowledge based on its call sites.

The compilation performance improvements from hiding your symbols are generally small. After all, the amount of work a linker does per symbol is small, especially if your linker is smart about it. However, large binaries can have millions of symbols, and just as with hidden friends, hiding symbols also has benefits beyond compilation performance, namely preventing ODR violations between helper functions.


That's all for this post. In a later post, I intend to write about tools that can be used to find places where your compile times suffer unnecessarily, and about some other techniques to mitigate this.


  1. However, if your day job is writing C++, you should be using ninja, work on a machine with lots of cores and RAM, and see if lld works for you. ↩︎

  2. These can have many different uses. One of the more interesting ones is that compiler can mark down which file (and line) a line came from, so it then can issue diagnostics with the proper context. MSVC STL uses a similar trick to tell the compiler that it should step through all layers of, e.g. std::function internals, when the user is asking for "Just My Code" debugging. ↩︎

  3. At least against a specific version of libstdc++ using a specific version of Clang. Changing the version of either can change the exact number, but not that much... on the other hand, switching to MSVC + MSVC STL gives you about 50k lines after preprocessing. ↩︎

  4. As of the time of writing, IWYU has a big problem in that it hardcodes assumptions from Google's C++ style guide around includes. This means that if your project uses the same style, it will work quite well for you, even though you still have to check what changes it made, or it will include <iosfwd> to provide size_t. If your project uses <> style for internal includes, then it will suggest replacing every single one of your includes. ↩︎

  5. The compilation used Clang in version 10, compiling against libstdc++ version 8 (release date 20200304), using -g -std=c++17 -c -o /dev/null command line arguments. ↩︎ ↩︎ ↩︎

  6. Remember that std::vector consists of three pointers to a chunk of dynamically allocated memory. Because sizeof(T*) does not change based on T, sizeof(std::vector<T>) also does not change based on T. ↩︎

  7. At least if you are not already including <string>. ↩︎

  8. This can mean templated code, constexpr code, or just plain code that is important to inline even without LTO build. ↩︎

  9. static_vector is a vector with fixed capacity, usually allocated in the object itself. This allows it to be constexpr before C++20's constexpr allocations[15], with the tradeoff being that you have to know how much memory you will need ahead of time and that you have to fit onto the stack. ↩︎

  10. I think that using 400 overloads simulates a TU in a relatively large (but not impossibly so) codebase. If you feel the number seems too high, consider that basically every QT type has operator<< for QDebug and that your own custom types are likely to provide operator<< for std::ostream. Alternatively, think of highly generic libraries that create many different types, which leads to many different instantiations of templated functions. ↩︎

  11. Our codebase avoids using hidden friends for specific classes, because even on Linux, including <iosfwd> increases the header heft 3+ times. With MSVC, the difference is orders-of-magnitude. ↩︎

  12. Like with the advice to use std::vector by default, this does not mean that there are never cases where using something else is better. Just that you need to have a reason to deviate, and you should always document the reasons you had. ↩︎

  13. The one case where I can see myself recommending shortening your symbols names for compilation performance is for internal types in expression templates. If you are doing something like converting brainfuck code into a complex nested type, using shorter names for the internal types might very well be worth it, and the type will be unreadable either way. ↩︎

  14. It is more often called anonymous namespace, but unnamed namespace is more technically correct. ↩︎

  15. Note that even with C++20, the memory allocation cannot outlive the compilation phase and be accessible during runtime. If you need a dynamically sized constexpr data structure, you still need something like the static_vector. ↩︎

]]>
<![CDATA[Generating random numbers using C++ standard library: the solutions]]>https://codingnest.com/generating-random-numbers-using-c-standard-library-the-solutions/64c28599a062e6148d2e46c6Sat, 23 May 2020 19:59:01 GMT

Last week I wrote about the various problems with using C++'s standard library (mainly <random>) to generate random numbers. This week I will outline what I think are the (standardizable) solutions to fix the functionality in <random>[1] and make it widely usable.

The content of this post is based on the three C++ standardization papers I presented in Prague, P2058, P2059, P2060, and various conversations I had afterwards on the same topic.

Now, onto the solutions themselves.


Fixing std::random_device

In my last post, I complained that std::random_device is allowed to be not random at all, and there is no way to find out, because std::random_device::entropy is interpreted very differently across different standard library implementations.

My ideal way to fix this would be to mandate that a standard library implementation only provides std::random_device if it provides proper randomness. And by proper, I mean cryptographically strong. While this sounds onerous, the three major implementations already provide this in practice; they just do not advertise it... However, I also think that such a proposal would never pass the standard committee, and so we need to fix it differently.

Provide users with better queries for the properties of the implementation

Users generally care about one of two things.

  1. Whether the random_device is random, that is, it does not produce the same sequence every time the code is run.
  2. Whether the random_device produces cryptographically secure outputs.

Obviously, the second property is much stronger, because a random_device that is cryptographically secure is also random, but random_device can be random while not being cryptographically secure. As currently standardized, a random_device is also allowed to be neither random nor cryptographically secure[2].

A nice feature of these properties is that they are binary, so the answer to them is either yes or no, with no possibilities in-between. They are also reasonably well defined, which should avoid an entropy-like fiasco with implementations interpreting them differently and causing them to be useless in practice.

My proposal to fix std::random_device in standard simply follows from the above. std::random_device interface should be extended with 2 new member functions:

class random_device {
   ...
   // Returns true if different instances generate different bytes
   constexpr bool is_random() const;
   
   // Returns true if generated bytes are cryptographically secure
   bool is_cryptographically_secure() const;
};

You might notice that only is_random is constexpr. The reason for that is that it is the weaker property and, outside of maliciously constructed cases, the implementation should know whether the random_device is randomized. is_random could even be made static, if we restricted users from using the explicit random_device(const string& token) constructor[3].

is_cryptographically_secure is not constexpr to increase implementations' latitude to handle things like hardware errata, which can only be checked at runtime. Just like is_random, it could be made static if we imposed further restrictions on users of random_device.

Deprecate std::random_device::entropy

Now that random_device provides a way to query the basic properties of its implementation, we should also deprecate[4] random_device::entropy, because it is wholly useless, and (very) potentially even dangerous.

Provide reproducible distributions

How reproducible distributions should be standardized is the place where I have changed my opinion the most since writing the papers. Initially, my preferred solution was to standardize the algorithms underlying std::*_distribution, but that is no longer the case. Nowadays, my preferred solution is to:

Standardize specific algorithms as distributions

The basic idea is simple: we standardize specific algorithms under their own names, and users who want reproducibility just use one of these specific algorithms. As an example, one of the possible algorithms to implement std::normal_distribution is the Marsaglia polar method. To provide a reproducible normal distribution, it would be standardized as std::marsaglia_polar_method_distribution.

This solution has a significant advantage in that it is backwards compatible, because it does not change the meaning of existing code, and it also allows future extensions. If we standardize some set of algorithms as the reproducible distributions, and 10 years after that someone comes up with a better algorithm for generating normally[5] distributed numbers, then it can easily be standardized in the next C++ standard. Code can then adopt the new algorithm if it does not need backwards compatibility, or keep using the old ones if it does.

It is also very expert friendly, as different algorithms have different performance and numeric characteristics, which experts might care about. As an example, the Marsaglia polar method calls the underlying RNG more often than the Box-Muller transform does, but it does not use trigonometric functions and provides slightly better numeric properties.
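For illustration, here is a minimal sketch of the polar method. This is my own sketch of the textbook algorithm, not what any std::normal_distribution implementation actually does, and a truly reproducible version would also need a fixed way of producing the uniform doubles:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>

// Sketch of the Marsaglia polar method: draw points uniformly in the
// square [-1, 1)^2, reject those outside the unit circle, and transform
// the rest into a pair of independent standard normal variates.
template <typename URBG>
std::pair<double, double> polar_method_pair(URBG& gen) {
    std::uniform_real_distribution<double> unit(-1.0, 1.0);
    double u, v, s;
    do {
        u = unit(gen);
        v = unit(gen);
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    double m = std::sqrt(-2.0 * std::log(s) / s);
    return {u * m, v * m};
}
```

Note the rejection loop: on average about 1 − π/4 ≈ 21% of candidate points are thrown away, which is why the polar method consumes more RNG output than the Box-Muller transform.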

This approach is not without its negatives. The two big ones are that it introduces a lot of new types, and thus maintenance burden, into the standard library, and that it makes using <random> even less user-friendly. A user that wants a reproducible distribution has to pick which exact algorithm to use. Doing so requires either obtaining a significant amount of expert knowledge, or picking one essentially at random.

Other considered (and rejected) options

Back at Prague's meeting, I proposed two other alternatives[6] to the option above. In fact, I considered the option outlined above the worst one. However, I've changed my mind since then and no longer consider them good. They are:

  1. Mandate specific implementation of all std::foo_distribution types
  2. Provide std::reproducible_foo_distribution types with specified implementation

Both of these options share the same problem, that they do not provide future extensibility, and the same advantage in that they introduce less burden on both the maintainers and the non-expert users of <random>. They also provide some different trade-offs in regards to backwards compatibility, implementation latitude and so on.

Challenges, problems, and pitfalls

All three options mentioned above share one big problem: floating-point numbers. This problem further splits into two more problems, floating-point representations and transcendental functions.

The problem with floating-point representations is that the C++ standard does not mandate a specific one. In practice, it is unlikely to encounter a platform that does not support IEEE-754, but the C++ standard allows them. There is also the issue of floating-point dialects, caused by compiler flags such as -ffast-math.

This means that any standard-provided reproducible distribution over floating-point numbers will require some wording to the effect of "results are only reproducible between platforms with the same floating-point number representation"[7].

The other challenge to providing reproducible floating-point distributions is the fact that most algorithms for e.g. normal distribution use transcendental functions, such as trigonometric operations (Box-Muller), or logarithms (Marsaglia). The problem is that transcendental functions are computed by approximation, both the result and the precision of such approximations vary, and which approximation your code ends up using is compiler, platform, and settings dependent[8].

There are two possible workarounds for the transcendental functions issue:

  1. Standard mandates specific implementation for use in <random>
  2. We use algorithms that avoid these issues at the cost of performance[9]

Neither of these options is great, but they are workable. I don't think that <random> would be well served by just option 2, but I also don't think it should be overlooked.

Rework seeding of Random Number Engines

The last of my complaints in the previous post was that there is no right way to seed an unknown Random Number Engine[10] properly. This issue is caused by a combination of the requirements on Seed Sequence being overly restrictive, and that there is no way to ask an RNE how much seeding it requires upfront.

Strictly speaking, it is possible to fix this with just one change: letting users query any random number engine on how much data it requires for seeding itself. However, that would still leave proper seeding very unergonomic, and so I propose more changes to fix this. They are:

  1. Let users query RNEs for required seed size
  2. Provide a weaker version of the Seed Sequence requirements
  3. Modify std::random_device to fulfil these requirements

Let users query Random Number Engines required seed size

The idea behind this change is simple. If we know how much random data is required to seed some RNE, we can generate that much randomness ahead of time, and then use a straightforward Seed Sequence type that just copies randomness in and out, while obeying all Seed Sequence requirements.

To do this, we add static constexpr size_t required_seed_size member function to the requirements on Random Number Engines. Its return value is the number of bytes the RNE requires to fully seed itself. Together with a simple, randomness-copying Seed Sequence sized_seed_seq, the code to fully seed a mt19937 with random data would look something like this:

// This prepares the seed sequence
constexpr auto data_needed = std::mt19937::required_seed_size() / sizeof(std::random_device::result_type);
std::array<std::random_device::result_type, data_needed> random_data;
std::generate(random_data.begin(), random_data.end(), std::random_device{});

// Actual seeding
std::mt19937 urbg(sized_seed_seq(random_data.begin(), random_data.end()));
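The sized_seed_seq used above does not exist anywhere; it is part of this proposal. A minimal sketch of such a randomness-copying type might look like this (in practice the standard engines only ever call generate on the sequence, so this suffices for seeding even though it deliberately drops the other Seed Sequence requirements):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sized_seed_seq: a Seed Sequence-ish type that just
// copies pre-generated randomness out, instead of mixing it the way
// std::seed_seq does. Assumes it is constructed with at least one word.
class sized_seed_seq {
    std::vector<std::uint32_t> data_;

public:
    using result_type = std::uint32_t;

    template <typename Iter>
    sized_seed_seq(Iter first, Iter last) : data_(first, last) {}

    template <typename Iter>
    void generate(Iter first, Iter last) {
        // Hand the stored words out verbatim, wrapping around if the
        // engine asks for more than we were given.
        std::size_t i = 0;
        for (; first != last; ++first) {
            *first = data_[i++ % data_.size()];
        }
    }
};
```

Because generate just copies, the engine's state is exactly the randomness you gathered, with no lossy mixing step in between.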

While this works and does what we want, the usability is terrible. To fix the usability for the typical case of random seeding, we need to change the requirements of Seed Sequence.

Provide a weaker version of Seed Sequence requirements

In the ideal world, we would just pass a std::random_device to the constructor of the engine, like so:

std::mt19937(std::random_device{});

However, std::random_device is not a Seed Sequence, and thus the code above does not work. The requirements of Seed Sequence are also such that we cannot create a simple wrapper around random_device that fulfils them. Let's see what requirements we have to drop before a randomized_seed_seq, a seed sequence that just wraps std::random_device, is implementable.

Many of the requirements on Seed Sequence boil down to requiring Seed Sequence instances to be serializable and reproducible. A Seed Sequence-ish that wraps std::random_device cannot provide either, which means that

  • We should drop both param and size member functions. Without param, size is useless, and param cannot be implemented on top of random_device.
  • We should also drop both the range and the initializer list constructors. They require that the bits provided therein are used in the seed sequence, but that cannot be done with random_device.

Removing these functions leaves us with the default constructor and the generate member function. And also with the result_type typedef, but that is almost trivial[11]. We obviously need to keep the default constructor, but we cannot satisfy the requirement that the state of all default-constructed instances is the same, so we will drop that part. The same thing applies to the generate member function. Any reasonable Seed Sequence has to provide it, but we would need to drop the requirement that the output depends on the inputs during construction (not that there are any).

Thus I propose a new set of named requirements, Basic Seed Sequence[12]. Type only has to fulfil 3 requirements to be considered a Basic Seed Sequence, namely:

  • It provides result_type typedef that is an unsigned integer type of at least[13] 32 bits.
  • It provides a default constructor with constant runtime complexity.
  • It provides a generate(rb, re) member function, where rb and re are mutable random access iterators[14], that fills [rb, re) with 32-bit quantities. There are no constraints on the generated data.

This is the minimal set of requirements for a useful Seed Sequence-ish type, and a wrapper type over std::random_device can easily fulfil them:

class randomized_seed_seq {
    std::random_device m_dev;
    
    static_assert(32 <= sizeof(std::random_device::result_type) * CHAR_BIT,
                  "I don't wanna handle this case");
public:

    using result_type = std::random_device::result_type;
    
    template <typename Iter, typename Sentinel>
    void generate(Iter first, Sentinel last) {
        using dest_type = typename std::iterator_traits<Iter>::value_type;
        // We should also check that it is unsigned, but eh.
        static_assert(32 <= sizeof(dest_type) * CHAR_BIT, "");
        
        
        while (first != last) {
            // Note that we are _required_ to only output 32 bits
            *first++ = static_cast<uint32_t>(m_dev());
        }
    }
};

With the wrapper above, we can now seed any Random Number Engine like this:

randomized_seed_seq sseq;
std::mt19937 rng(sseq);

RNEs take the Seed Sequence constructor argument by plain reference, so we cannot quite write a one-liner, but compared to the original monstrosity, this is good enough. However, I also think that users shouldn't have to wrap std::random_device in their own type to get this behaviour; the standard should provide it instead. This leads me to my last suggestion:

Turn std::random_device into a Basic Seed Sequence

This one is simple. If we add generate to std::random_device, it becomes a Basic Seed Sequence as per the definition above. This would let users write these two lines to get a randomly seeded Random Number Engine:

std::random_device dev;
std::mt19937 rng(dev);

Users who require a large number of random bytes could also use this interface to achieve a significant performance gain over successively calling random_device::operator()[15].

Other possible improvements

Until now, this post was about fixing the problems outlined in the previous one. However, in that post I skipped over the "small" issues with <random>, ones that are annoying but do not make it unusable. In this section, I want to go over some of those as well. They are too small to stop people from using <random>, but they are still annoying enough to run into while using it.

The following issues are mentioned in no specific order.

There are no modern PRNGs in <random>. The best PRNG in <random> is probably[16] the Mersenne Twister, but using Mersenne Twister instead of say Xorshift, or a PCG variant leaves a lot of performance lying on the table. This lack of modern PRNGs means that serious users will end up writing their own, even if all issues with seeding, distributions, and so on, are fixed.

Most (all?) of the PRNGs in <random> could be constexpr, but they are not. As far as I can tell, this is caused by the fact that nobody actually uses <random> enough to care about constexpr-ing it, rather than any technical reasons.
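There is no technical obstacle here. As an example, a toy engine like Marsaglia's xorshift32 (not a standard engine, just a sketch) can be stepped entirely at compile time:

```cpp
#include <cstdint>

// Marsaglia's xorshift32, marked constexpr throughout to show that
// nothing about stepping a PRNG requires runtime evaluation.
struct xorshift32 {
    std::uint32_t state;

    constexpr explicit xorshift32(std::uint32_t seed) : state(seed) {}

    constexpr std::uint32_t operator()() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }
};

// Evaluated entirely at compile time:
static_assert(xorshift32(1)() == 270369, "first output from seed 1");
```

The standard engines are all similarly simple state transitions over integers, so the same treatment should apply to them.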

Random Number Engines take Seed Sequence arguments by plain reference. This prevents creating and fully seeding an RNE from being a one-liner.

There are no ease-of-use utilities. If all the fixes proposed in this post were incorporated, seeding a PRNG becomes easy. However, selecting a random element from a std::vector would still require a significant amount of boilerplate.
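For concreteness, a correct version of that boilerplate today looks roughly like this (random_element is a hypothetical helper, not anything the standard provides):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: pick a uniformly random element of a non-empty
// vector. Even this "one-liner" needs an engine, a distribution with
// inclusive bounds, and care about the index type.
template <typename T, typename URBG>
T const& random_element(std::vector<T> const& v, URBG& gen) {
    assert(!v.empty());
    std::uniform_int_distribution<std::size_t> dist(0, v.size() - 1);
    return v[dist(gen)];
}
```

A standard utility along these lines would remove most of the remaining day-to-day friction.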

There are likely many more tiny issues with <random> that I am either unaware of completely, or that I haven't run into recently enough to remember them. The point is that if all of my proposed changes were standardized, <random> would become much better but definitely not perfect.


That's it for this post, and for my writing about <random>. At some point in the future I want to write a post about my standardization efforts towards fixing <random>, but that will be a non-technical post about the standardization process itself, rather than about the technical details of <random>.


  1. I don't think the C library and std::rand can be fixed without effectively standardizing <random> for C as well. ↩︎

  2. I think that a nonrandom random_device is useless, and if a standard library implementation cannot provide random random_device, then it should not provide random_device at all. However, I do not see such a requirement getting past the WG21 committee. ↩︎

  3. MSVC already disallows it and libstdc++ allows the arguments to be /dev/urandom and /dev/random. libc++ is the odd one out here, but given that using the constructor is already non-portable, I believe it could be made more restrictive as well. ↩︎

  4. One does not simply remove things from C++ standard. I am honestly surprised how many things were actually removed after a period of deprecation, but it still takes time even if the removal happens. ↩︎

  5. Or any other specific distribution. ↩︎

  6. Given the experience from the meeting, I now know that this was a fundamentally wrong approach. Present the committee with one opinionated option if you want anything to be done. ↩︎

  7. And probably no guarantees for -ffast-math and equivalents, because that allows the compiler to reorder operations arbitrarily, potentially allowing different results from the same code due to different computational context. ↩︎

  8. As an example, if you are using x86 CPU, it almost definitely has hardware instruction for computing both sin and cos at the same time. No compiler worth its salt uses them, because it is significantly slower than a well-tuned software implementation. Instead, compilers call into sin/cos from the supporting math library. However, some library implementations punt on implementing these functions and use the CPU-provided instruction instead... ↩︎

  9. I am not aware of many algorithms in this category, but I am also not a numerics expert, and I did not perform much of research on this. This paper on generating normally distributed numbers without involving floating-point arithmetics at all looks interesting though. ↩︎

  10. Random Number Engine is any type that fulfils a set of named requirements in the standard. ↩︎

  11. We cannot blindly typedef it to random_device::result_type, because random_device::result_type is unsigned int, and thus is allowed to be less than 32 bits. However, papering over this difference is easy enough, and not even needed on most platforms. ↩︎

  12. I would prefer to give this set of requirements the name Seed Sequence and rename the current Seed Sequence to something like Serializable Seed Sequence, or Repeatable Seed Sequence, or Deterministic Seed Sequence, but it is easier to introduce a new name for the new requirements than to deal with backwards compatibility concerns. ↩︎

  13. The fact that this is at least 32 bits, rather than exactly 32 bits, has caused some interesting bugs in the real world, but let's be compatible with Seed Sequence. ↩︎

  14. The random access requirements are to keep as close to the current Seed Sequence as possible. If we went with the principle of requiring minimal iterator strength for the operation, we should probably require just forward iterators. On the other hand, I think it is unlikely that a random number engine will store its state in something like a list, and if we required contiguous iterators, we could utilize batch writes for significant performance gains. ↩︎

  15. As an example, MSVC's random device generates, (IIRC) 256 random bits per call to operator(). It then returns 32 of them and throws the rest away. Trying to generate a substantial amount of randomness via operator() then leads to a lot of wasted work. ↩︎

  16. It depends on your use case, but if you just want some PRNG, Mersenne Twister is a fine default. ↩︎

]]>
<![CDATA[Generating random numbers using C++ standard library: the problems]]>https://codingnest.com/generating-random-numbers-using-c-standard-library-the-problems/64c28599a062e6148d2e46c5Sun, 17 May 2020 19:23:20 GMT

Recently I found myself once again writing a long forum post about the problems with the standard-provided random number generation facilities in C++ (both C++'s <random> and C's rand). Since I keep writing these, I decided to write it all down into one blog post that I can link people to later. This is that blog post.

A quick summary of this post would be "Using C++'s standard library for random number generation is a bad idea, and you should either roll your own, or use an existing library. I recommend C++ PCG utilities, or, if you already use Boost, Boost.Random".

Now, onto the actual content itself.


In this post, we will use what should be a straightforward task: generate a bunch of uniformly distributed integers in the range [0, 100k).

C's standard library facilities

Let's start with some C-style random number generation.

// Seed based on time. Not really random.
std::srand(std::time(nullptr));

// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << std::rand() % 100'000 << '\n';
}

This code is simple enough to write and to understand but comes with a host of problems.

  1. The resulting numbers will not be uniformly distributed. The results will be biased towards lower numbers, due to the use of modulo.
  2. Numbers above 32767 might not be present at all.
  3. Whether the code is thread-safe is up to the implementation. Which functions invoke rand is also up to the implementation, so data races can happen without you expecting them.

If you do not see why converting the numbers using modulo causes non-uniformly distributed results, consider a simple case where std::rand can only return 0, 1, or 2, each with the same probability, and we desire numbers in the range [0, 2). There are two ways to get 0 (0 % 2 and 2 % 2), while there is only one way to get 1 (1 % 2). In other words, we get a 2:1 ratio of 0s to 1s due to using modulo.
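The same counting argument can be checked mechanically. The sketch below enumerates every output of the hypothetical three-value generator and tallies the remainders:

```cpp
#include <array>

// Tally how often each value in [0, 2) is produced when the raw
// outputs 0, 1, 2 (each equally likely) are mapped through % 2.
constexpr std::array<int, 2> modulo_tally() {
    std::array<int, 2> counts{};
    for (int raw = 0; raw < 3; ++raw) {
        ++counts[raw % 2];
    }
    return counts;
}

static_assert(modulo_tally()[0] == 2, "two ways to get 0");
static_assert(modulo_tally()[1] == 1, "one way to get 1");
```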

The second problem is more obscure, but simpler to understand. The range of possible values generated by std::rand is specified as [0, RAND_MAX], where RAND_MAX can be any constant larger-or-equal to 32767. On platforms that use this lower bound[1], the example above will never print a number larger than 32767.

The last problem is just a symptom of the original C specification ignoring threading.

The first two problems are solvable. Replacing modulo with rejection sampling (and potentially calling std::rand multiple times if needed) solves the bias issue. To generate values larger than RAND_MAX, you can just concatenate the result of multiple calls to std::rand.
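A sketch of the rejection-sampling fix (only the bias fix; concatenating multiple calls to cover ranges beyond RAND_MAX is left out):

```cpp
#include <cassert>
#include <cstdlib>

// Unbiased mapping of std::rand onto [0, bound): reject raw values in
// the incomplete final "block" of [0, RAND_MAX], so that every
// remainder is backed by the same number of raw values. Slightly
// wasteful when bound divides RAND_MAX + 1, but always unbiased.
int unbiased_rand(int bound) {
    assert(bound > 0 && bound <= RAND_MAX);
    int const limit = (RAND_MAX / bound) * bound;
    int candidate;
    do {
        candidate = std::rand();
    } while (candidate >= limit);
    return candidate % bound;
}
```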

The thread-safety is impossible to solve in the general case[2], but in specific cases, you can guard user code calls to std::rand with a mutex, and it should work well enough. Some implementations provide a per-thread std::rand, which is a much better solution, but you cannot rely on it.

However, solving all of this is either impossible, or a lot of non-trivial work, and even then you run into the problem that std::rand is allowed to return different numbers on different platforms given the same seed. At this point, it is easier to write your own set of random number generation tools, and so C++11 standardized its own set, in the form of <random>.

C++'s standard library facilities

At first glance, <random> seems exceedingly complex for a simple task. You have to pick a templated Uniform Random Bit Generator, possibly seed it, pick a templated Distribution, and then pass an instance of your URBG to the distribution to get a number... This is the C example rewritten using <random>:

// Truly random seed. 
std::mt19937 rng(std::random_device{}());

// Avoid constructing the distribution all the time.
// Note that both bounds are inclusive.
std::uniform_int_distribution<> dist(0, 99'999);

// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << dist(rng) << '\n';
}

There is a bit more code than there was with C, but bearably so, and most of the issues are fixed. The distribution will be uniform, all numbers in the desired interval are possible, and the code is thread-safe.

At second glance, <random> is awesome, even if there is a bit of boilerplate for simple operations. The decomposed and pluggable design means that you can customize your random numbers by replacing only a small part of the random number generation pipeline. The standard also provides a wide range of Random Number Engines and distributions[3], so you should be able to do most things you want out of the box. It even provides an abstraction for getting actually random numbers for seeding the generators, std::random_device.

At the third glance, when you've started using <random> extensively and started digging deeper, you will find out that every single part of it is deeply flawed, and the best solution is to avoid using it completely.

Distributions are nonportable

Did you notice that the text above said

most of the issues are fixed

and then did not talk about portability? That's because both of the snippets, the C one and the C++ one, share one issue. Even if you hardcode the seed, the snippets will give you different results on different platforms[4]. For bonus points, the results are not even guaranteed to be portable between different versions of the same standard library, as the standard library implementations are allowed to change how they implement std::uniform_int_distribution[5].

What this boils down to is that if you have repeatability requirements for your generated random numbers[6], then you cannot use the standard-provided distributions. Luckily, generating random numbers using <random> is properly decomposed, and you can "just" write your own distributions, and keep using the rest of <random>, right?

Well...

std::random_device might not be random, and there is no way to check

The C++ snippet uses std::random_device to generate some initial randomness to seed our instance of Mersenne Twister in the form of std::mt19937. The problem is that std::random_device is poorly specified, and inscrutable.

In theory, it should serve as an abstraction over some external source of entropy. In practice, an implementation is allowed to use any deterministic random number engine to implement it, e.g. a Mersenne Twister, and there is no way to find out. There is a member function std::random_device::entropy(), which is in theory there to detect such a case, but it does not work in practice.

The blame for this is shared between the standard and the implementations. The function's full signature is double entropy() const noexcept, and it is the return type that breaks it. The standard provides a definition of entropy[7], but it does not provide any sort of guidance on how to count entropy of an external source of randomness, or expected return values for different cases.

This, in turn, caused different implementations to do their own thing. We will take a look at the big three, MS's STL, libc++ and libstdc++.

MS's implementation handles this the best. It knows its random_device is just a thin wrapper over kernel's cryptographically secure random, so it always returns 32 and inlines the member function into the header to allow for constant propagation[8].

In order of sanity of implementation, libc++ is next, because it always just returns 0. This return value does not reflect reality: 4 out of 5 possible configurations[9] of libc++'s random_device use a strong random backend, and the last one also provides strong random bytes unless the user deliberately sabotages themselves. The return value also makes libc++'s implementation of std::random_device::entropy useless, but at least it is obviously useless, so the user is not given false hopes and expectations. There is value in this.

The worst implementation of std::random_device::entropy can be found in libstdc++. The reason it is the worst is that it is not obviously useless, you have to think about it for a bit to figure out why the return value is useless. This is because, unlike libc++, libstdc++ can return non-zero values. In most configurations, libstdc++ always returns 0[10], but when it is configured to read from /dev/urandom (or /dev/random), it uses RNDGETENTCNT to check how much entropy the kernel thinks it has available and returns that to the user.

The underlying problem of this approach is TOCTOU. If you first check whether there is enough randomness[11], and only then ask for that randomness, then by the time you ask for the randomness it could've been depleted, and you cannot get it anymore.

At this point, we know that we will likely have to implement our own distributions, and either implement our own random_device, or detect which standard library we are compiling against, and hardcode versions that provide good random_device::operator() implementations. But at least we can still use all the different Random Number Engines provided by the standard library, right?

Well...

There is no way to seed a Random Number Engine properly

The Random Number Engines almost work. But if something only almost works, it is broken.

Let's go back to the first line of the C++ example.

std::mt19937 rng(std::random_device{}());

It seeds a specific version of the Mersenne Twister with an unsigned int worth of random data. Let's assume sizeof(unsigned int) == 4. The internal state of mt19937 is 2496 (624 * 4) bytes. Taken together, this means that for every state we can seed the rng into, there are \(2^{19936}\) states that we cannot seed the rng into.

This has some interesting implications. For example, the program below will never print 7[12].

#include <iostream>
#include <random>

int main() {
    std::mt19937 urbg(std::random_device{}());
    std::cout << urbg() << '\n';
}

Some output values also uniquely identify their seed. If I tell you that the program above printed 3046098682, then you can quickly[13] find the seed generated by random_device, and thus predict all future outputs of a Mersenne Twister seeded in this way[14].
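A sketch of that recovery in principle (the search space is capped here so the demonstration finishes quickly; the real attack simply walks all 2^32 candidate seeds, which is entirely feasible on a modern machine):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Brute-force the 32-bit seed of a mt19937 from its first output.
// max_seed caps the search for demonstration purposes; the full attack
// just iterates over every possible 32-bit value.
std::uint32_t recover_seed(std::uint32_t first_output, std::uint32_t max_seed) {
    for (std::uint32_t seed = 0; seed <= max_seed; ++seed) {
        if (std::mt19937(seed)() == first_output) {
            return seed;
        }
    }
    return 0; // not found within the cap
}
```

This works precisely because mt19937 is fully specified: the same seed produces the same stream on every conforming implementation.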

In theory, the standard provides a way to seed the Mersenne twister properly. The tool is called SeedSequence, and there is an implementation of it in the standard library, std::seed_seq. Once again, when you try to use it in practice, it breaks down.

std::seed_seq is essentially a wrapper over std::vector that you can give a bunch of randomness to, and then a random number engine can extract (stretched) randomness out. It is used like this:

auto rd_dev = std::random_device{};
std::seed_seq seq{rd_dev(), rd_dev(), rd_dev(), rd_dev()};
std::mt19937 urbg(seq);

This time we initialized our instance of mt19937 with 16 (4 * 4) bytes of randomness. Progress! There are two problems with this snippet though:

  1. There is no way to know how much randomness you have to provide to a RandomNumberEngine T, and thus how much randomness you have to feed into seed_seq.
  2. std::seed_seq is very tightly specified by the standard. The implementation forced by the standard is not a bijection[15].

A fun fact about 1. is that std::mersenne_twister_engine provides a member variable you can query to find out how much data it needs[16]. However, this is an accident of standardization, and no other standard-provided random number engine provides a way to retrieve this information.

The second problem means that even if you hardcoded the seed sizes of all the random number engine types your program uses, you still couldn't use std::seed_seq for initialization, because it loses entropy. Here is an example of this on Godbolt:

#include <array>
#include <iostream>
#include <random>

int main() {
    std::seed_seq seq1({0xf5e5b5c0, 0xdcb8e4b1}),
                  seq2({0xd34295df, 0xba15c4d0});

    std::array<uint32_t, 2> arr1, arr2;
    seq1.generate(arr1.begin(), arr1.end());
    seq2.generate(arr2.begin(), arr2.end());

    // prints 1 because seed_seq::generate is not a bijection
    std::cout << (arr1 == arr2) << '\n';
}

In other words, even if you write your own type that fulfils the SeedSequence named requirements, you have to hardcode the sizes of your Random Number Engine types somewhere.

Recap

To recapitulate, generating random numbers using the C standard library has many problems, some fixable at great programming effort, others unfixable. If you are for some reason stuck with just the C library, you should definitely write your own random number generation tools.

Generating random numbers using the C++ standard library fixes most of the problems of using the C library. However, the operative word here is most, and it introduces its own problems instead. In the end, whether you can successfully use <random> depends on your requirements.

  • If you need cross-platform reproducibility, then you cannot use standard-provided distributions at all, and you have to write your own.
  • If you need actually random data for whatever reason, you either have to write your own version of random_device, or hardcode a list of platforms + configurations where you can use std::random_device.
  • If you want to properly seed a Random Number Engine, you have to write your own SeedSequence, and also hardcode the required seed sizes of all your Random Number Engines.

My use cases for <random> usually require cross-platform reproducibility, need properly random seed values, and would prefer fully seeded RNEs. This means that I either have to write 90% of <random> on my own, or use a different implementation, such as Boost.Random or PCG random utilities...

And I am not the only one. When I was writing a couple of standardization proposals for fixing <random>, I made an informal Reddit poll asking people about their use of <random>. The absolute majority of people answered either that they have their own implementation, or that they use Boost.Random. Few people used other open source libraries, and very, very, very few used the standard <random>.


That's it for this post. The next post explores possible avenues for fixing <random> and making it usable by more people in more domains.


  1. This is not a theoretical problem, e.g. MSVC defines RAND_MAX as 32767. ↩︎

  2. There are two problems. The first is that your standard library is allowed to call std::rand from arbitrary other functions. The other is that the other libraries in your binary are also allowed to call std::rand, and you have no way to tell them to lock a mutex before doing so. ↩︎

  3. Including some less known distributions, like Student's t-distribution. ↩︎

  4. Both these configurations share the same libc, so the result of std::rand will be the same. However, it obviously differs from MSVC's results, because MSVC's std::rand cannot return a number that big. ↩︎

  5. Microsoft's standard library implementation is in fact considering a change to a more performant implementation of std::uniform_int_distribution. ↩︎

  6. There are many different use cases for reproducible random generation. Fuzzing, procedural generation, networked clients, physics simulations, etc. ↩︎

  7. I am reasonably sure that standard library maintainers are smart enough to look it up on their own, but ¯\_(ツ)_/¯. ↩︎

  8. This does not mean you can use it for compile-time checking, but at least it is something. ↩︎

  9. Libc++ can either use getentropy, arc4random, nacl_secure_random, rand_s, or read from a file, like /dev/urandom. In theory the user can specify their own file, and thus get nonrandom data. ↩︎

  10. As an example, current versions of libstdc++ use rand_s when on Windows. rand_s is a kernel-provided cryptographically secure random number generator, but libstdc++ still returns 0. ↩︎

  11. I do not subscribe to the view that we can use up the kernel's entropy after it has initialized its pools properly, but I can see how a good-faith implementation can arrive here. ↩︎

  12. It actually cannot print over 30% of all possible values of a 4-byte unsigned integer. ↩︎

  13. It takes about 10 minutes on my desktop. ↩︎

  14. With a properly seeded Mersenne twister, you would have to observe 624 successive outputs to be able to predict all future generated numbers. ↩︎

  15. If you assume that std::seed_seq never gets more random data than a random number engine will request later, then it would suffice for it to be an injection. It is not an injection either. ↩︎

  16. You actually have to query more than one member variable and do some arithmetic, but it is possible to find out how much random data you need to feed a std::mersenne_twister_engine to initialize it fully. ↩︎

]]>
<![CDATA[Modern SAT solvers: fast, neat and underused (part 1.5 of N)]]>


]]>
https://codingnest.com/modern-sat-solvers-fast-neat-and-underused-part-1-5-of-n/64c28599a062e6148d2e46c4Fri, 27 Sep 2019 13:55:17 GMT

In part 1 of this series, we built a Sudoku solver based on translating Sudoku to SAT and then giving the resulting SAT instance to a SAT solver. We also benchmarked our solver and found out that it, unsurprisingly, loses to the state of the art of Sudoku solvers. Since then, I convinced[1] a couple of my friends to also write a C++ sudoku solver, and we can compare our solver against those.

We have 2 other solvers to compare our solver to, one written by Aleš Hrabalík, and one written by Ben Steffan (BiCapitalize on Twitter/Discord).

Implementations

Aleš's implementation can be found in his GitHub repo, and the algorithm itself is a simple backtracking solver. It has been savagely optimized though, and relies heavily on the CPU's native support for finding the lowest set bit (lsb) and counting set bits (popcnt). According to Aleš, it took roughly 8 hours to write, but he had previous experience with writing a Sudoku solver in C++.

Ben's implementation can also be found in his GitHub repo, and it is an implementation of Knuth's Algorithm X using Dancing Links. According to Ben, it took him roughly 20 hours to write, but he had no previous experience with writing a Sudoku solver.

Benchmark results

The benchmarked versions were commits 132c1d4f for Aleš's solver, 243f546d for Ben's solver and 4894ff6f for our SAT-based solver. The inputs used for benchmarking were the same set of 95 hard Sudokus as in part 1, on the same machine using the same compiler and environment.

These are the results:

[Plot: solution time distributions of all three solvers]

As we can see, Aleš's solver has the fastest single solution time, and also the fastest average solution time. However, it also has the longest tail, taking roughly 40ms for the slowest of the inputs. The other two solvers have significantly less runtime variance, and at this level of detail, are basically identical as far as their average runtime goes.

Let's zoom in a bit.

[Plot: zoomed-in solution times of Ben's solver and the SAT-based solver]

Now we can see that the SAT-based solver performs slightly better, but the differences in both best and average performance are small enough as to be pointless. The most interesting part of comparing these two is that the SAT-based solver is comparatively more consistent and has a shorter tail.

Conclusion

We got 3 different Sudoku solvers written in "reasonable prototyping time", that is, a day or two. We found out that the performance of our SAT-based solver is competitive with the other 2 dedicated Sudoku solvers. We also found out that it took other people more time to write a direct Sudoku solver[2] than it took us to write a solver based on translating Sudoku to SAT.


That is all for part 1.5.

Part 2 shows how to implement a SAT-based solver for master-key systems.


  1. Read as "pestered until they did it". ↩︎

  2. In the case of Ben's solver, it took significantly more time to implement it than it took to implement the SAT-based solver. ↩︎

]]>