-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add Neon implementation of minmax #5963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3119a3a to
c92dcef
Compare
stl/inc/xutility
Outdated
| _INLINE_VAR constexpr bool _Is_64bit_int_on_arm64_v = false; | ||
| #endif // ^^^ defined(_M_ARM64) ^^^ | ||
|
|
||
| template <class _Iter, class _Pr, bool _Support64BitIntOnArm = false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that we expose this quirk less. Instead of adding template parameter, you could do one of this:
- Just repeat the
_Is_min_max_iterators_safe<_Iter> && _Is_predicate_less<_Iter, _Pr>formula in_Is_min_max_value_optimization_safe - Extract common base for
_Is_min_max_optimization_safeand_Is_min_max_value_optimization_safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone with the first suggestion, which seemed like the simplest thing!
stl/src/vector_algorithms.cpp
Outdated
| return vreinterpretq_s64_u64(vcgtq_u64(vreinterpretq_u64_s64(_First), vreinterpretq_u64_s64(_Second))); | ||
| } | ||
|
|
||
| static _Vec_t _Min(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const _Mask .
I observe it is interesting that both ISA have same limitation of having usual vector max for 32-bit elements, but having to blend for 64bit elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, It's an interesting quirk! Done.
stl/src/vector_algorithms.cpp
Outdated
| return _Min(_First, _Second, _Cmp_gt_u(_First, _Second)); | ||
| } | ||
|
|
||
| static _Vec_t _Max(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto const _Mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| _Advance_bytes(_First, sizeof(_Ty)); | ||
| } | ||
|
|
||
| #pragma loop(no_vector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug fix, or just a performance thing?
In either cases may worth commenting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment to indicate the reason for doing it. Essentially, we don't want to auto-vectorize the scalar tail after the manual vectorization because we'll just have a bunch of dead autovec code, and unecessary extra conditional checks for how many elements are left. (MSVC does auto-vectorize this tail, at least on ARM64.)
|
#5949 has been merged, so this should be ready to be revised. As a reminder, draft mode disables PR checks when commits are pushed. Merely moving a PR out of draft mode won't trigger checks, so you should mark as "ready for review" and then push commits to trigger PR checks. |
Add and enable a Neon implementation of minmax on ARM64 platforms. The existing code is refactored to support an unrolled version for sufficiently large inputs.
This is stacked on #5949.
Performance numbers: