Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
I tested this change on an aarch64 machine with GCC 14.1.1, and the benchmark was ~ the same, or slower: | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 27.32 | 36,601,145.02 | 0.2% | 201.00 | 80.58 | 2.494 | 3.00 | 0.0% | 11.01 | `SipHash_32b`PR | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 28.66 | 34,891,459.89 | 0.2% | 240.00 | 83.01 | 2.891 | 11.00 | 0.0% | 10.99 | `SipHash_32b`I tested on the same machine with Clang 18.1.6, and the performance was also ~ the same, or slower: | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 25.89 | 38,623,097.25 | 0.0% | 197.00 | 77.18 | 2.552 | 3.00 | 0.0% | 11.01 | `SipHash_32b`PR | ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 27.00 | 37,036,922.01 | 0.2% | 197.00 | 80.44 | 2.449 | 3.00 | 0.0% | 10.92 | `SipHash_32b |
|
Thanks for checking @fanquake, I noticed that the coverage report also reported the same - seems the M1 processor is doing something differently. |
1dd479c to
2c58c9d
Compare
Also removed the `*((uint64_t*)` introduced in 619d569 since it seems to distort measurement (and no other benchmark uses this style)
2c58c9d to
b6dd181
Compare
…ial optimization issues 42066f4 Refactor SipHash_32b benchmark to improve accuracy and avoid optimization issues (Lőrinc) Pull request description: This PR stems from the discussions in #30317 (comment) The previous benchmark for `SipHash` was slightly less accurate in representing real-world usage and allowed for potential compiler optimizations that could invalidate the benchmark. This change aims to ensure the benchmark produces more realistic results. By modifying the initial values and only incrementing the bytes of `val`, the benchmark should reflects a more typical usage patterns - and prevent the compiler from optimizing away the calculations. ------- On my M1 processor the benchmark's speed changed significantly (but the CI seems to produce the same result as before): > cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCH=ON && cmake --build build -j10 && ./build/src/bench/bench_bitcoin --filter=SipHash_32b --min-time=1000 Before: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 35.15 | 28,445,856.66 | 0.2% | 1.10 | `SipHash_32b` After (note that only the benchmark changed): | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 22.05 | 45,350,886.64 | 0.3% | 1.10 | `SipHash_32b` ACKs for top commit: maflcko: review ACK 42066f4 achow101: ACK 42066f4 hodlinator: ACK 42066f4 Tree-SHA512: 6bbe9d725d4c3396642e55ce48c31baa5339e56838d6d5fb377fb1069daa9292375e7020ceff7da0d78befffc1e984f717b5232217fe911989613480adaa937e
WIP, not finished yet.
Profiling

./bitcoind -reindex-chainstate -checkblocks=10000revealed that hashing is a bottleneck:I tried optimizing it, but the benchmark seemed off, it showed a 10% speedup which didn't replicate on CI - so I unified the bechmark to follow the format of other ones, which seems to have fixed the inconsistent behavior.
I didn't end up optimizing the code (has the exact same speed as before), but if I've already simplified it, I'll push it anyway.