TEMP: Test overhead of feature checking #6255

Draft
hazzlim wants to merge 1 commit into microsoft:main from hazzlim:feat-check-overhead

Conversation

@hazzlim
Contributor

@hazzlim hazzlim commented Apr 15, 2026

Speedup (or rather slowdown), this PR versus baseline, measured on a Neoverse N2 system.

| Benchmark | MSVC Speedup |
|---|---|
| `r<std::uint8_t>/3449` | 0.969 |
| `r<std::uint8_t>/63` | 0.675 |
| `r<std::uint8_t>/31` | 0.722 |
| `r<std::uint8_t>/15` | 0.767 |
| `r<std::uint8_t>/7` | 0.598 |
| `r<std::uint16_t>/3449` | 0.978 |
| `r<std::uint16_t>/63` | 0.804 |
| `r<std::uint16_t>/31` | 0.67 |
| `r<std::uint16_t>/15` | 0.753 |
| `r<std::uint16_t>/7` | 0.717 |
| `r<std::uint32_t>/3449` | 1 |
| `r<std::uint32_t>/63` | 0.754 |
| `r<std::uint32_t>/31` | 0.666 |
| `r<std::uint32_t>/15` | 0.667 |
| `r<std::uint32_t>/7` | 0.722 |
| `r<std::uint64_t>/3449` | 0.814 |
| `r<std::uint64_t>/63` | 0.76 |
| `r<std::uint64_t>/31` | 0.613 |
| `r<std::uint64_t>/15` | 0.7 |
| `r<std::uint64_t>/7` | 0.41 |
| `rc<std::uint8_t>/3449` | 0.939 |
| `rc<std::uint8_t>/63` | 0.587 |
| `rc<std::uint8_t>/31` | 0.583 |
| `rc<std::uint8_t>/15` | 0.608 |
| `rc<std::uint8_t>/7` | 0.649 |
| `rc<std::uint16_t>/3449` | 0.979 |
| `rc<std::uint16_t>/63` | 0.637 |
| `rc<std::uint16_t>/31` | 0.551 |
| `rc<std::uint16_t>/15` | 0.513 |
| `rc<std::uint16_t>/7` | 0.563 |
| `rc<std::uint32_t>/3449` | 0.982 |
| `rc<std::uint32_t>/63` | 0.681 |
| `rc<std::uint32_t>/31` | 0.581 |
| `rc<std::uint32_t>/15` | 0.525 |
| `rc<std::uint32_t>/7` | 0.516 |
| `rc<std::uint64_t>/3449` | 0.975 |
| `rc<std::uint64_t>/63` | 0.732 |
| `rc<std::uint64_t>/31` | 0.651 |
| `rc<std::uint64_t>/15` | 0.528 |
| `rc<std::uint64_t>/7` | 0.478 |

@azure-pipelines

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.
1 pipeline(s) require an authorized user to comment /azp run to run.

@jaykang10
Contributor

It looks like the runtime feature check IsProcessorFeaturePresent introduces overhead. Since we need to verify whether the Windows OS supports ISAs like SVE, which require additional OS support, we cannot avoid this runtime overhead.

@hazzlim mentioned there was a discussion about caching the result of IsProcessorFeaturePresent, as follows:

> Note that when implementing feature detection directly in `vector_algorithms.cpp`, you can't use magic statics (i.e., static local variables). You have to resort to either DCL-like singletons or an init function called by the runtime during early initialization.

I think we could use IsProcessorFeaturePresent to initialize __isa_available and __isa_enabled in vcruntime, allowing us to cache ISA information.
@StephanTLavavej, could you please share your thoughts on this?
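
To illustrate the caching idea discussed above, here is a minimal sketch of a DCL-like cache built on a constant-initialized atomic rather than a magic static. Every name in it (`probe_features`, `get_features`, the `feat_*` bits) is hypothetical, and the probe is a stub standing in for the `IsProcessorFeaturePresent` calls so that the sketch compiles anywhere. It is not how VCRuntime initializes `__isa_available`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical feature bits; illustrative only, not part of VCRuntime.
enum : std::uint32_t {
    feat_initialized = 1u << 0, // set once the probe has run
    feat_sve         = 1u << 1,
    feat_sve2        = 1u << 2,
};

// Counts probe invocations so the caching effect is observable.
std::atomic<int> probe_calls{0};

// Stand-in for the expensive OS query. A real Windows build would call
// IsProcessorFeaturePresent here; this stub pretends that SVE is
// available and SVE2 is not.
std::uint32_t probe_features() {
    ++probe_calls;
    return feat_sve;
}

// Constant-initialized atomic: no static local, no dynamic initializer.
std::atomic<std::uint32_t> cached_features{0};

std::uint32_t get_features() {
    std::uint32_t f = cached_features.load(std::memory_order_relaxed);
    if (f == 0) {
        // Slow path. Two threads may both probe, but they compute the
        // same value, so the duplicate store is harmless.
        f = probe_features() | feat_initialized;
        cached_features.store(f, std::memory_order_relaxed);
    }
    return f;
}

bool has_sve() {
    return (get_features() & feat_sve) != 0;
}
```

After the first call, each check costs one relaxed load and a bit test. The alternative mentioned above, an init function run by the runtime during early initialization, would remove even the `f == 0` branch from the hot path.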

@StephanTLavavej
Member

I can add variables to VCRuntime, but the flag-style behavior of IsProcessorFeaturePresent is the easiest to directly express. The linear behavior (where each setting 1, 2, 3, 4 is strictly more powerful) might make sense for certain coarse-grained features (NEON-only, SVE, SVE2?) but I'm not enough of an expert to figure out whether that would be useful, and the flags would presumably be strictly more powerful?

@StephanTLavavej added the ARM64 and uncharted labels Apr 15, 2026
@StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews Apr 15, 2026
@hazzlim
Contributor Author

hazzlim commented Apr 15, 2026

> I can add variables to VCRuntime, but the flag-style behavior of IsProcessorFeaturePresent is the easiest to directly express. The linear behavior (where each setting 1, 2, 3, 4 is strictly more powerful) might make sense for certain coarse-grained features (NEON-only, SVE, SVE2?) but I'm not enough of an expert to figure out whether that would be useful, and the flags would presumably be strictly more powerful?

I think the flags mirroring the IsProcessorFeaturePresent flags probably makes the most sense.

Strictly increasing (NEON-only, SVE, SVE2, SVE2p1) would honestly cover most of the use cases I foresee, but it's probably worth keeping the fine-grained resolution. (And it's less of a headache: it's often quite fiddly to track down at which architecture tick a feature becomes mandatory, and which feature's presence strictly implies the presence of another.)
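
As a rough sketch of why the flags are at least as expressive as the linear scheme: a linear level can always be derived from flags, but not the other way round. The enumerator names and bit values below are hypothetical and do not correspond to the `PF_*` constants or to anything in VCRuntime. `feature_other` stands for an arbitrary optional extension outside the NEON/SVE/SVE2/SVE2p1 chain.

```cpp
#include <cstdint>

// Hypothetical flag-style encoding: one bit per extension.
enum arm_feature : std::uint32_t {
    feature_neon   = 1u << 0,
    feature_sve    = 1u << 1,
    feature_sve2   = 1u << 2,
    feature_sve2p1 = 1u << 3,
    feature_other  = 1u << 4, // placeholder for some optional extension
};

// Hypothetical linear encoding: each level implies all lower ones.
enum class arm_level : int { neon = 0, sve = 1, sve2 = 2, sve2p1 = 3 };

// Collapse flags into the highest level whose prerequisites are all set.
// Bits outside the chain (such as feature_other) are simply dropped.
constexpr arm_level level_from_flags(std::uint32_t flags) {
    arm_level level = arm_level::neon;
    if (flags & feature_sve) {
        level = arm_level::sve;
        if (flags & feature_sve2) {
            level = arm_level::sve2;
            if (flags & feature_sve2p1) {
                level = arm_level::sve2p1;
            }
        }
    }
    return level;
}

// Flag-style query: true only if every required bit is present.
constexpr bool has_all(std::uint32_t flags, std::uint32_t required) {
    return (flags & required) == required;
}
```

With this encoding, a dispatch site that only cares about the coarse tiers can compare levels, while one that needs a specific combination can still ask `has_all(flags, feature_sve | feature_other)`, which a level alone cannot answer.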

@jaykang10
Contributor

> I can add variables to VCRuntime, but the flag-style behavior of IsProcessorFeaturePresent is the easiest to directly express. The linear behavior (where each setting 1, 2, 3, 4 is strictly more powerful) might make sense for certain coarse-grained features (NEON-only, SVE, SVE2?) but I'm not enough of an expert to figure out whether that would be useful, and the flags would presumably be strictly more powerful?

Thanks for your kind comment, @StephanTLavavej.
I agree with @hazzlim.
In the future, we may see more custom Arm cores for Windows on Arm, each with different combinations of extensions. It would be beneficial to keep the flag-style behavior of IsProcessorFeaturePresent aligned with Arm extensions.
