vectorized CRC on ARM64

Started by John Naylor11 months ago18 messageshackers

[email protected]

11 months ago

We did something similar for x86 for v18, and here is some progress
towards Arm support.

0001: Like e2809e3a101 -- inline small constant inputs to compensate
for the fact that 0002 will do a runtime check even if the usual CRC
extension is targeted. There is a difference from x86, however: On Arm
we currently align on 8-byte boundaries before looping on 8-byte
chunks. That requirement would prevent loop unrolling. We could use
4-byte chunks to get around that, but it's not clear which way is
best. I've coded it so it's easy to try both ways.

0002: Like 3c6e8c12389 and in fact uses the same program to generate
the code, by specifying Neon instructions with the Arm "crypto"
extension instead. There are some interesting differences from x86
here as well:
- The upstream implementation chose to use inline assembly instead of
intrinsics for some reason. I initially thought that was a way to get
broader compiler support, but it turns out you still need to pass the
relevant flags to get the assembly to link.
- I only have Meson support for now, since I used MacOS on CI to test.
That OS and compiler combination apparently targets the CRC extension,
but the PMULL instruction runtime check uses Linux-only headers, I
believe, so previously I hacked the choose function to return true for
testing. The choose function in 0002 is untested in this form.
- On x86 it could be fairly costly to align on a cacheline boundary
before beginning the main loop so I elected to skip that for short-ish
inputs in PG18. On Arm the main loop uses 4 16-byte accumulators, so
the patch acts like upsteam and always aligns on 16-byte boundaries.

0003: An afterthought regarding the above-mentioned alignment, this is
an alternative preamble that might shave a couple cycles for 4-byte
aligned inputs, e.g. WAL.

--
John Naylor
Amazon Web Services

[email protected]

3 months ago

In reply to: John Naylor (#1)

Re: vectorized CRC on ARM64

On Wed, May 14, 2025 I wrote:

We did something similar for x86 for v18, and here is some progress
towards Arm support.

Coming back to this, since there's been recent interest in Arm support.

v2 is a rebase, with a few changes.

- I simplified it by leaving out the inlining for "assume CRC" builds,
since I wanted to avoid alignment considerations if I can. I think
always indirecting through a pointer will have less risk of
regressions in a realistic setting than for x86 since Arm chips
typically have low latency for carryless multiplication instructions.
With just a bit of code we can still use the direct call for small
constant inputs, so I did that to avoid regressions under WAL insert
lock.

- One coding idiom for a vector literal in the generated code was
giving pgindent indigestion, I so rewrote it using Neon intrinsics and
verified it in Godbolt.

0002: Like 3c6e8c12389 and in fact uses the same program to generate
the code, by specifying Neon instructions with the Arm "crypto"
extension instead. There are some interesting differences from x86
here as well:
- The upstream implementation chose to use inline assembly instead of
intrinsics for some reason. I initially thought that was a way to get
broader compiler support, but it turns out you still need to pass the
relevant flags to get the assembly to link.

To follow-up for curiosity's sake, [1]https://dougallj.github.io/applecpu/firestorm.html says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.

- I only have Meson support for now, since I used MacOS on CI to test.
That OS and compiler combination apparently targets the CRC extension,
but the PMULL instruction runtime check uses Linux-only headers, I
believe, so previously I hacked the choose function to return true for
testing. The choose function in 0002 is untested in this form.

This is still true, but now the CI hack lives in a separate
not-for-commit patch for clarity.

autoconf support is a WIP, and I will share that after I do some
testing on an Arm Linux instance.

[1]: https://dougallj.github.io/applecpu/firestorm.html

--
John Naylor
Amazon Web Services

[email protected]

18 days ago

In reply to: John Naylor (#2)

Re: vectorized CRC on ARM64

Hi John

Thank yo for working on this. I had one question about the mixed use of intrinsics and inline asm here.

On Jan 12, 2026, at 1:27 AM, John Naylor <[email protected]> wrote:

On Wed, May 14, 2025 I wrote:

We did something similar for x86 for v18, and here is some progress
towards Arm support.

Coming back to this, since there's been recent interest in Arm support.

v2 is a rebase, with a few changes.

- I simplified it by leaving out the inlining for "assume CRC" builds,
since I wanted to avoid alignment considerations if I can. I think
always indirecting through a pointer will have less risk of
regressions in a realistic setting than for x86 since Arm chips
typically have low latency for carryless multiplication instructions.
With just a bit of code we can still use the direct call for small
constant inputs, so I did that to avoid regressions under WAL insert
lock.

- One coding idiom for a vector literal in the generated code was
giving pgindent indigestion, I so rewrote it using Neon intrinsics and
verified it in Godbolt.

0002: Like 3c6e8c12389 and in fact uses the same program to generate
the code, by specifying Neon instructions with the Arm "crypto"
extension instead. There are some interesting differences from x86
here as well:
- The upstream implementation chose to use inline assembly instead of
intrinsics for some reason. I initially thought that was a way to get
broader compiler support, but it turns out you still need to pass the
relevant flags to get the assembly to link.

Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.

Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?

To follow-up for curiosity's sake, [1] says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.

- I only have Meson support for now, since I used MacOS on CI to test.
That OS and compiler combination apparently targets the CRC extension,
but the PMULL instruction runtime check uses Linux-only headers, I
believe, so previously I hacked the choose function to return true for
testing. The choose function in 0002 is untested in this form.

This is still true, but now the CI hack lives in a separate
not-for-commit patch for clarity.

autoconf support is a WIP, and I will share that after I do some
testing on an Arm Linux instance.

[1] https://dougallj.github.io/applecpu/firestorm.html

--
John Naylor
Amazon Web Services
<v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch><v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch>

Regards
Haibo

[email protected]

17 days ago

In reply to: Haibo Yan (#3)

Re: vectorized CRC on ARM64

On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <[email protected]> wrote:

Hi John

Thank yo for working on this. I had one question about the mixed use of intrinsics and inline asm here.

Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.

Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?

I answered that in the email you replied to, re-quoted here:

To follow-up for curiosity's sake, [1] says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.
[1] https://dougallj.github.io/applecpu/firestorm.html

I don't know if that's relevant for current server hardware, so it
could be pointless. I'm personally not a fan of inline assembly, but I
also didn't yet want to put in the effort to alter generated code. I
don't think it would be very hard to do, however.

--
John Naylor
Amazon Web Services

[email protected]

17 days ago

In reply to: John Naylor (#4)

Re: vectorized CRC on ARM64

On Tue, Mar 17, 2026 at 11:52 PM John Naylor <[email protected]>
wrote:

On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <[email protected]> wrote:

Hi John

Thank yo for working on this. I had one question about the mixed use of

intrinsics and inline asm here.

Since the implementation already uses NEON intrinsics such as vld1q_u64,

I was wondering why the pmull / pmull2 + eor helpers still need to be
inline asm rather than intrinsics.

Is that due to compiler/toolchain support, or because the

intrinsic-based version produced noticeably worse code?

I answered that in the email you replied to, re-quoted here:

To follow-up for curiosity's sake, [1] says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.
[1] https://dougallj.github.io/applecpu/firestorm.html

I don't know if that's relevant for current server hardware, so it
could be pointless. I'm personally not a fan of inline assembly, but I
also didn't yet want to put in the effort to alter generated code. I
don't think it would be very hard to do, however.

Thanks, that makes sense as an explanation for why the inline asm is there
today. But it also sounds like this is more of a temporary implementation
choice than a conclusion that intrinsics are unsuitable. If so, I wonder
whether it would be better to treat an intrinsics-based version as the
preferred end state unless benchmarks show a clear regression.
Regards
Haibo

[email protected]

14 days ago

In reply to: Haibo Yan (#5)

Re: vectorized CRC on ARM64

On Thu, Mar 19, 2026 at 12:17 AM Haibo Yan <[email protected]> wrote:

On Tue, Mar 17, 2026 at 11:52 PM John Naylor <[email protected]> wrote:

I don't know if that's relevant for current server hardware, so it
could be pointless. I'm personally not a fan of inline assembly, but I
also didn't yet want to put in the effort to alter generated code. I
don't think it would be very hard to do, however.

Thanks, that makes sense as an explanation for why the inline asm is there today. But it also sounds like this is more of a temporary implementation choice than a conclusion that intrinsics are unsuitable.

I can see how my words imply that, but after a moment's thought I
still don't want to put in that effort without a good reason. For
starters, what I said above about "not very hard" may be wishful
thinking.

If so, I wonder whether it would be better to treat an intrinsics-based version as the preferred end state unless benchmarks show a clear regression.

To meet your criterion, we'd not only have to rewrite it correctly,
we'd have to test on multiple vendors of non-Apple hardware and
multiple compiler vendors/versions (at least where the binary output
is different) to prove we haven't caused a regression. I wouldn't
recommend anyone to accept that challenge as stated, since the
risk/reward ratio is just not favorable. Especially considering we're
2 1/2 weeks away from feature freeze.

--
John Naylor
Amazon Web Services

[email protected]

4 days ago

In reply to: John Naylor (#2)

Re: vectorized CRC on ARM64

I wrote:

autoconf support is a WIP, and I will share that after I do some
testing on an Arm Linux instance.

I've only checked paths with objdump and debugging printouts (no perf
testing), but this seems to work in v3. My main concern now is whether
it's a maintenance hazard to overwrite CFLAGS_CRC in a separate check.

In master, we can have one of:

CFLAGS_CRC=""
CFLAGS_CRC="-march=armv8-a+crc+simd"
CFLAGS_CRC="-march=armv8-a+crc"

...and then based on that we set either USE_ARMV8_CRC32C or
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, and set PG_CRC32C_OBJS.

But below that, v3 runs a new test for pmull instructions with the
flag "-march=armv8-a+crc+simd+crypto" and if it links, it will reset
CFLAGS_CRC to that set of flags. That doesn't seem like the right
thing to do, but I don't see a good alternative. I suppose I could
sidestep that with function attributes, but that's not as well
supported. Another idea would be to turn the relevant line here

if test x"$Ac_cachevar" = x"yes"; then
CFLAGS_CRC="$1"
pgac_arm_pmull_intrinsics=yes
fi

...into CFLAGS_CRC="CFLAGS_CRC$1", where in this case $1 is just
"+crypto". That seems even more fragile, though.

--
John Naylor
Amazon Web Services

[email protected]

4 days ago

In reply to: John Naylor (#7)

Re: vectorized CRC on ARM64

On Tue, Mar 31, 2026 at 06:19:49PM +0700, John Naylor wrote:

I've only checked paths with objdump and debugging printouts (no perf
testing), but this seems to work in v3. My main concern now is whether
it's a maintenance hazard to overwrite CFLAGS_CRC in a separate check.

In master, we can have one of:

CFLAGS_CRC=""
CFLAGS_CRC="-march=armv8-a+crc+simd"
CFLAGS_CRC="-march=armv8-a+crc"

...and then based on that we set either USE_ARMV8_CRC32C or
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, and set PG_CRC32C_OBJS.

But below that, v3 runs a new test for pmull instructions with the
flag "-march=armv8-a+crc+simd+crypto" and if it links, it will reset
CFLAGS_CRC to that set of flags. That doesn't seem like the right
thing to do, but I don't see a good alternative. I suppose I could
sidestep that with function attributes, but that's not as well
supported. Another idea would be to turn the relevant line here

if test x"$Ac_cachevar" = x"yes"; then
CFLAGS_CRC="$1"
pgac_arm_pmull_intrinsics=yes
fi

...into CFLAGS_CRC="CFLAGS_CRC$1", where in this case $1 is just
"+crypto". That seems even more fragile, though.

Appending +crypto to the current CFLAGS_CRC seems like the right thing to
do to me. I'm trying to understand why you are concerned about fragility.
I suppose someone could add something else between the existing checks and
the one you're adding that appends a different option or something, but
besides that, I'd think merely appending +crypto to the -march value
wouldn't invalidate previous tests or anything like that.

--
nathan

[email protected]

3 days ago

In reply to: Nathan Bossart (#8)

Re: vectorized CRC on ARM64

On Wed, Apr 1, 2026 at 1:21 AM Nathan Bossart <[email protected]> wrote:

On Tue, Mar 31, 2026 at 06:19:49PM +0700, John Naylor wrote:

In master, we can have one of:

CFLAGS_CRC=""
CFLAGS_CRC="-march=armv8-a+crc+simd"
CFLAGS_CRC="-march=armv8-a+crc"

...and then based on that we set either USE_ARMV8_CRC32C or
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK, and set PG_CRC32C_OBJS.

But below that, v3 runs a new test for pmull instructions with the
flag "-march=armv8-a+crc+simd+crypto" and if it links, it will reset
CFLAGS_CRC to that set of flags. That doesn't seem like the right
thing to do, but I don't see a good alternative. I suppose I could
sidestep that with function attributes, but that's not as well
supported. Another idea would be to turn the relevant line here

if test x"$Ac_cachevar" = x"yes"; then
CFLAGS_CRC="$1"
pgac_arm_pmull_intrinsics=yes
fi

...into CFLAGS_CRC="CFLAGS_CRC$1", where in this case $1 is just
"+crypto". That seems even more fragile, though.

Appending +crypto to the current CFLAGS_CRC seems like the right thing to
do to me. I'm trying to understand why you are concerned about fragility.
I suppose someone could add something else between the existing checks and
the one you're adding that appends a different option or something, but
besides that, I'd think merely appending +crypto to the -march value
wouldn't invalidate previous tests or anything like that.

Maybe it's a low risk, but this stuff is awfully difficult to debug
when it goes wrong.

I don't think appending +crypto would work everywhere IIUC -- if the
packager set +crc in the CFLAGS, then CFLAGS_CRC="" so there is no
existing -march to put it on, and the PMULL check would fail. Maybe
that's okay if we call that out in the release notes, since that's
probably rare. Then we could check both with and without +crypto
tacked on.

I tried appending the new -march value, and that works since last one
wins. But that might have the same problem as above if the packager
put something special in CFLAGS for -march, that would get wiped out
by our new one.

--
John Naylor
Amazon Web Services

[email protected]

3 days ago

In reply to: John Naylor (#9)

Re: vectorized CRC on ARM64

On Wed, Apr 01, 2026 at 06:48:10PM +0700, John Naylor wrote:

I don't think appending +crypto would work everywhere IIUC -- if the
packager set +crc in the CFLAGS, then CFLAGS_CRC="" so there is no
existing -march to put it on, and the PMULL check would fail. Maybe
that's okay if we call that out in the release notes, since that's
probably rare. Then we could check both with and without +crypto
tacked on.

I tried appending the new -march value, and that works since last one
wins. But that might have the same problem as above if the packager
put something special in CFLAGS for -march, that would get wiped out
by our new one.

The other idea I had was to always add +crypto in the existing tests
(unless we're not setting CFLAGS_CRC), and then to just do the PMULL check
with whatever CFLAGS_CRC is set to, not bothering to try different values.
That doesn't fix the problem you mentioned in the quoted text, but maybe
it's a little sturdier.

... or maybe we should just use __attribute__((target("..."))) for the
PMULL stuff. That wouldn't work well for clang versions before 16, but it
at least wouldn't regress anything. They just wouldn't get PMULL support.

--
nathan

[email protected]

2 days ago

In reply to: Nathan Bossart (#10)

Re: vectorized CRC on ARM64

On Wed, Apr 1, 2026 at 10:24 PM Nathan Bossart <[email protected]> wrote:

... or maybe we should just use __attribute__((target("..."))) for the
PMULL stuff. That wouldn't work well for clang versions before 16, but it
at least wouldn't regress anything. They just wouldn't get PMULL support.

Okay, that works as far back as gcc 6.3, so v4 does it that way. The
attribute doesn't seem to be necessary on the inline helpers for
production builds, but they're needed to work with -O0.

Also
- removed the term 'intrinsics' from config variables, since we're not
checking those
- removed the crc intrinsic from the pmull tests
- fixed a failure to restore CFLAGS
- fixed it to work with +crc CFLAGS

For some reason, my CI builds with MacOS are failing on v3 (v2 skipped
the runtime check to get some exposure on CI) with the following, and
running CI from my Github account fails as well, so it's not a
temporary glitch. Adding a __linux__ guard to the runtime check didn't
help, so not yet sure what to make of it.

3/382 setup - postgresql:initdb_cache TIMEOUT 300.51s killed by
signal 15 SIGTERM

--
John Naylor
Amazon Web Services

[email protected]

2 days ago

In reply to: John Naylor (#11)

Re: vectorized CRC on ARM64

On Thu, Apr 02, 2026 at 08:16:27PM +0700, John Naylor wrote:

For some reason, my CI builds with MacOS are failing on v3 (v2 skipped
the runtime check to get some exposure on CI) with the following, and
running CI from my Github account fails as well, so it's not a
temporary glitch. Adding a __linux__ guard to the runtime check didn't
help, so not yet sure what to make of it.

I think the new pg_comp_crc32_choose() is infinitely recursing on macOS
because USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK is not defined but
pg_crc32c_armv8_available() returns false. If I trace through that
function, I see that it's going straight to the

#else
return false;
#endif

at the end. And sure enough, both HAVE_ELF_AUX_INFO and HAVE_GETAUXVAL
aren't defined in pg_config.h. I think we might need to use sysctlbyname()
to determine PMULL support on macOS, but at this stage of the development
cycle, I would probably lean towards just compiling in the sb8
implementation.

--
nathan

[email protected]

2 days ago

In reply to: Nathan Bossart (#12)

Re: vectorized CRC on ARM64

On Thu, Apr 02, 2026 at 10:53:24AM -0500, Nathan Bossart wrote:

I think the new pg_comp_crc32_choose() is infinitely recursing on macOS
because USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK is not defined but
pg_crc32c_armv8_available() returns false. If I trace through that
function, I see that it's going straight to the

#else
return false;
#endif

at the end. And sure enough, both HAVE_ELF_AUX_INFO and HAVE_GETAUXVAL
aren't defined in pg_config.h. I think we might need to use sysctlbyname()
to determine PMULL support on macOS, but at this stage of the development
cycle, I would probably lean towards just compiling in the sb8
implementation.

Hm. On second thought, that probably regresses macOS builds because it was
presumably using the armv8 path without runtime checks before...

--
nathan

[email protected]

1 day ago

In reply to: Nathan Bossart (#13)

Re: vectorized CRC on ARM64

On Thu, Apr 2, 2026 at 11:17 PM Nathan Bossart <[email protected]> wrote:

On Thu, Apr 02, 2026 at 10:53:24AM -0500, Nathan Bossart wrote:

I think the new pg_comp_crc32_choose() is infinitely recursing on macOS
because USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK is not defined but
pg_crc32c_armv8_available() returns false. If I trace through that
function, I see that it's going straight to the

#else
return false;
#endif

at the end. And sure enough, both HAVE_ELF_AUX_INFO and HAVE_GETAUXVAL

Ah of course.

aren't defined in pg_config.h. I think we might need to use sysctlbyname()
to determine PMULL support on macOS, but at this stage of the development
cycle, I would probably lean towards just compiling in the sb8
implementation.

Hm. On second thought, that probably regresses macOS builds because it was
presumably using the armv8 path without runtime checks before...

I went with the following for v5, and it passes MacOS on my Github CI:

+  /* set fallbacks */
+#ifdef USE_ARMV8_CRC32C
+  /* On e.g. MacOS, our runtime feature detection doesn't work */
+  pg_comp_crc32c = pg_comp_crc32c_armv8;
+#else
+  pg_comp_crc32c = pg_comp_crc32c_sb8;
+#endif
+ [...crc and pmull checks]

That should keep scalar hardware support working, but now it'll only
use direct calls for constant inputs.

I also did some benchmarking on an ARM Neoverse N1 / gcc 8.3
(attached). There the vector loop still works well all the way down to
the minimum input size of 64 bytes, and on long inputs it's almost
twice as fast as scalar. For reproduceability, I slightly modified the
benchmark we used last year, to make sure the input is aligned
(attached but not for CI). In the end, I want to add a length check so
that inputs smaller than 80 bytes go straight to the scalar path.
Above 80, after alignment adjustments in the preamble, that still
guarantees at least one loop iteration in the vector path.

--
John Naylor
Amazon Web Services

[email protected]

1 day ago

In reply to: John Naylor (#14)

Re: vectorized CRC on ARM64

On Fri, Apr 03, 2026 at 03:22:59PM +0700, John Naylor wrote:

I went with the following for v5, and it passes MacOS on my Github CI:
+  /* set fallbacks */
+#ifdef USE_ARMV8_CRC32C
+  /* On e.g. MacOS, our runtime feature detection doesn't work */
+  pg_comp_crc32c = pg_comp_crc32c_armv8;
+#else
+  pg_comp_crc32c = pg_comp_crc32c_sb8;
+#endif
+ [...crc and pmull checks]
That should keep scalar hardware support working, but now it'll only
use direct calls for constant inputs.

v5 LGTM

--
nathan

[email protected]

about 4 hours ago

In reply to: Nathan Bossart (#15)

Re: vectorized CRC on ARM64

On Fri, Apr 3, 2026 at 8:54 PM Nathan Bossart <[email protected]> wrote:

v5 LGTM

Thanks for looking! Pushed with a minor comment tweak and removal of
the CFLAGS save-and-restore dance since we don't need it anymore.
Let's see what the buildfarm thinks.

--
John Naylor
Amazon Web Services

[email protected]

about 3 hours ago

In reply to: John Naylor (#16)

Re: vectorized CRC on ARM64

On Sat, Apr 04, 2026 at 08:52:34PM +0700, John Naylor wrote:

Thanks for looking! Pushed with a minor comment tweak and removal of
the CFLAGS save-and-restore dance since we don't need it anymore.
Let's see what the buildfarm thinks.

Ha, I think koel is going to complain about your comment that talks about
pgindent...

diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
index b404e6c373e..5fa57fb4927 100644
--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -162,8 +162,8 @@ pg_comp_crc32c_pmull(pg_crc32c crc, const void *data, size_t len)
         }

         /*
-         * pgindent complained of unmatched parens, so the following has
-         * been re-written with intrinsics:
+         * pgindent complained of unmatched parens, so the following has been
+         * re-written with intrinsics:
          *
          * x0 = veorq_u64((uint64x2_t) {crc0, 0}, x0);
          */

--
nathan

[email protected]

about 3 hours ago

In reply to: Nathan Bossart (#17)

Re: vectorized CRC on ARM64

On Sat, Apr 4, 2026 at 9:37 PM Nathan Bossart <[email protected]> wrote:

On Sat, Apr 04, 2026 at 08:52:34PM +0700, John Naylor wrote:

Thanks for looking! Pushed with a minor comment tweak and removal of
the CFLAGS save-and-restore dance since we don't need it anymore.
Let's see what the buildfarm thinks.

Ha, I think koel is going to complain about your comment that talks about
pgindent...

:-)

--
John Naylor
Amazon Web Services