Add Neon implementation of minmax #5963

hazzlim · 2025-12-15T12:22:40Z

Add and enable a Neon implementation of minmax on ARM64 platforms. The
existing code is refactored to support an unrolled version for
sufficiently large inputs.
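
As a rough illustration of what "an unrolled version for sufficiently large inputs" can look like, here is a minimal portable sketch. It is not the code in this PR: `min_value_unrolled`, `lanes`, and `unroll_threshold` are hypothetical names, plain arrays stand in for Neon registers, and the cutoff of two registers' worth of elements is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the PR's code: minimum of a non-empty range.
// Plain arrays of `lanes` elements stand in for 128-bit vector registers.
inline std::uint32_t min_value_unrolled(const std::uint32_t* first, std::size_t count) {
    assert(count > 0);

    constexpr std::size_t lanes            = 4;         // four 32-bit lanes per register
    constexpr std::size_t unroll_threshold = 2 * lanes; // assumed cutoff for the unrolled path

    std::uint32_t result = first[0];
    std::size_t i        = 0;

    if (count >= unroll_threshold) {
        // Two independent accumulators, so consecutive iterations do not
        // form a single dependency chain.
        std::uint32_t acc0[lanes];
        std::uint32_t acc1[lanes];
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            acc0[lane] = first[lane];
            acc1[lane] = first[lanes + lane];
        }

        for (i = 2 * lanes; i + 2 * lanes <= count; i += 2 * lanes) {
            for (std::size_t lane = 0; lane < lanes; ++lane) {
                acc0[lane] = std::min(acc0[lane], first[i + lane]);
                acc1[lane] = std::min(acc1[lane], first[i + lanes + lane]);
            }
        }

        // Combine the accumulators, then reduce across lanes.
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            result = std::min(result, std::min(acc0[lane], acc1[lane]));
        }
    }

    // Scalar tail: whatever the unrolled loop did not cover
    // (or the whole input when it is below the threshold).
    for (; i < count; ++i) {
        result = std::min(result, first[i]);
    }

    return result;
}
```

Short inputs skip the accumulator setup entirely, which is the usual reason for gating an unrolled path on input size.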

This is stacked on #5949.

Performance numbers:

Benchmark	Speedup MSVC	Speedup Clang
bm<uint8_t, Op::Min_val>/8021	26.191	1.073
bm<uint8_t, Op::Min_val>/63	2.833	0.931
bm<uint8_t, Op::Max_val>/8021	24.444	1.04
bm<uint8_t, Op::Max_val>/63	3.209	1.004
bm<uint8_t, Op::Both_val>/8021	66.888	28.622
bm<uint8_t, Op::Both_val>/63	2.219	2.946
bm<uint16_t, Op::Min_val>/8021	13.457	1.023
bm<uint16_t, Op::Min_val>/31	2.25	1.023
bm<uint16_t, Op::Max_val>/8021	13.457	1.023
bm<uint16_t, Op::Max_val>/31	2.588	1.023
bm<uint16_t, Op::Both_val>/8021	38.111	14.178
bm<uint16_t, Op::Both_val>/31	2.324	2.689
bm<uint32_t, Op::Min_val>/8021	23.837	7.554
bm<uint32_t, Op::Min_val>/15	3.896	3
bm<uint32_t, Op::Max_val>/8021	24.318	7.495
bm<uint32_t, Op::Max_val>/15	3.987	2.921
bm<uint32_t, Op::Both_val>/8021	12.5	8.182
bm<uint32_t, Op::Both_val>/15	2.338	2.857
bm<uint64_t, Op::Min_val>/8021	2.01	0.684
bm<uint64_t, Op::Min_val>/7	1.374	1.5
bm<uint64_t, Op::Max_val>/8021	2.115	0.7
bm<uint64_t, Op::Max_val>/7	1.495	1.5
bm<uint64_t, Op::Both_val>/8021	1.029	0.741
bm<uint64_t, Op::Both_val>/7	0.967	1.321
bm<int8_t, Op::Min_val>/8021	26.191	1.073
bm<int8_t, Op::Min_val>/63	3.103	1.111
bm<int8_t, Op::Max_val>/8021	26.111	1.087
bm<int8_t, Op::Max_val>/63	3.286	1.167
bm<int8_t, Op::Both_val>/8021	66.957	28
bm<int8_t, Op::Both_val>/63	1.87	2.574
bm<int16_t, Op::Min_val>/8021	12.866	0.99
bm<int16_t, Op::Min_val>/31	1.952	0.978
bm<int16_t, Op::Max_val>/8021	12.866	1.023
bm<int16_t, Op::Max_val>/31	1.933	0.996
bm<int16_t, Op::Both_val>/8021	28.667	14.178
bm<int16_t, Op::Both_val>/31	2.066	2.42
bm<int32_t, Op::Min_val>/8021	24.405	7.595
bm<int32_t, Op::Min_val>/15	3.983	2.921
bm<int32_t, Op::Max_val>/8021	24.405	7.667
bm<int32_t, Op::Max_val>/15	4.075	2.921
bm<int32_t, Op::Both_val>/8021	14.659	8.182
bm<int32_t, Op::Both_val>/15	2.442	2.921
bm<int64_t, Op::Min_val>/8021	2.16	0.735
bm<int64_t, Op::Min_val>/7	1.6	1.64
bm<int64_t, Op::Max_val>/8021	2.07	0.713
bm<int64_t, Op::Max_val>/7	1.495	1.605
bm<int64_t, Op::Both_val>/8021	1.23	0.761
bm<int64_t, Op::Both_val>/7	0.946	1.35
bm<float, Op::Min_val>/8021	8.928	4.167
bm<float, Op::Min_val>/15	1.971	1.338
bm<float, Op::Max_val>/8021	9.126	4.148
bm<float, Op::Max_val>/15	2.017	1.307
bm<float, Op::Both_val>/8021	4.767	3.855
bm<float, Op::Both_val>/15	0.954	0.933
bm<double, Op::Min_val>/8021	4.563	2.182
bm<double, Op::Min_val>/7	0.854	0.76
bm<double, Op::Max_val>/8021	4.362	2.121
bm<double, Op::Max_val>/7	0.937	0.767
bm<double, Op::Both_val>/8021	2.118	1.964
bm<double, Op::Both_val>/7	0.899	0.47

AlexGuteniev · 2025-12-15T12:38:35Z

stl/inc/xutility

+_INLINE_VAR constexpr bool _Is_64bit_int_on_arm64_v = false;
+#endif // ^^^ defined(_M_ARM64) ^^^
+
+template <class _Iter, class _Pr, bool _Support64BitIntOnArm = false>


I'd prefer that we expose this quirk less. Instead of adding a template parameter, you could do one of these:

- Just repeat the _Is_min_max_iterators_safe<_Iter> && _Is_predicate_less<_Iter, _Pr> formula in _Is_min_max_value_optimization_safe.

- Extract a common base for _Is_min_max_optimization_safe and _Is_min_max_value_optimization_safe.

hazzlim (2026-01-09): I've gone with the first suggestion, which seemed like the simplest thing!
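
For readers following the thread, the first suggestion could be sketched roughly as below. Every name here is a hypothetical, non-reserved stand-in for the STL-internal traits mentioned above, the stand-in definitions are deliberately simplistic, and the assumption that only the non-value trait excludes 64-bit integers on ARM64 is my reading of the diff, not something stated in the thread.

```cpp
#include <cstdint>
#include <functional>
#include <type_traits>

// Hypothetical stand-ins; the real traits are internal to the STL.
template <class Iter>
constexpr bool is_iterators_safe = std::is_pointer_v<Iter>;

template <class Iter, class Pred>
constexpr bool is_predicate_less = std::is_same_v<Pred, std::less<>>;

// The platform quirk stays in one place.
#if defined(_M_ARM64) || defined(__aarch64__)
template <class Iter>
constexpr bool is_64bit_int_on_arm64 =
    std::is_integral_v<std::remove_pointer_t<Iter>> && sizeof(std::remove_pointer_t<Iter>) == 8;
#else
template <class Iter>
constexpr bool is_64bit_int_on_arm64 = false;
#endif

// Assumed shape of the existing trait: base formula plus the ARM64 exclusion.
template <class Iter, class Pred>
constexpr bool is_min_max_optimization_safe =
    is_iterators_safe<Iter> && is_predicate_less<Iter, Pred> && !is_64bit_int_on_arm64<Iter>;

// First suggestion: repeat the base formula instead of threading an extra
// bool template parameter through the other trait.
template <class Iter, class Pred>
constexpr bool is_min_max_value_optimization_safe =
    is_iterators_safe<Iter> && is_predicate_less<Iter, Pred>;

static_assert(is_min_max_value_optimization_safe<const std::int64_t*, std::less<>>,
    "the value trait does not depend on the ARM64 exclusion");
```

The duplication is a single line, which keeps the quirk out of both traits' template parameter lists.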

AlexGuteniev · 2025-12-15T12:42:34Z

stl/src/vector_algorithms.cpp

+                return vreinterpretq_s64_u64(vcgtq_u64(vreinterpretq_u64_s64(_First), vreinterpretq_u64_s64(_Second)));
+            }
+
+            static _Vec_t _Min(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {


const _Mask.

It is interesting that both ISAs have the same limitation: there is an ordinary vector max for 32-bit elements, but 64-bit elements need a compare followed by a blend.

hazzlim (2026-01-09): Yep, it's an interesting quirk! Done.
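
The compare-then-blend pattern discussed here can be sketched in portable C++ as follows. This is an illustration only: `vec_u64x2`, `cmp_gt_u`, `blend`, `min_u64`, and `max_u64` are hypothetical names, and a two-element `std::array` stands in for a `uint64x2_t` register. On Neon the corresponding intrinsics would be `vcgtq_u64` for the mask and `vbslq_u64` for the bitwise select.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit register holding two unsigned 64-bit lanes.
using vec_u64x2 = std::array<std::uint64_t, 2>;

// Lane-wise a > b: all ones where true, zero where false.
constexpr vec_u64x2 cmp_gt_u(const vec_u64x2 a, const vec_u64x2 b) noexcept {
    vec_u64x2 mask{};
    for (std::size_t i = 0; i < mask.size(); ++i) {
        mask[i] = a[i] > b[i] ? ~std::uint64_t{0} : std::uint64_t{0};
    }
    return mask;
}

// Bitwise select: bits of if_set where the mask is set, bits of if_clear elsewhere.
constexpr vec_u64x2 blend(const vec_u64x2 mask, const vec_u64x2 if_set, const vec_u64x2 if_clear) noexcept {
    vec_u64x2 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (if_set[i] & mask[i]) | (if_clear[i] & ~mask[i]);
    }
    return out;
}

// mask is expected to be cmp_gt_u(a, b): where a > b, the minimum is b.
constexpr vec_u64x2 min_u64(const vec_u64x2 a, const vec_u64x2 b, const vec_u64x2 mask) noexcept {
    return blend(mask, b, a);
}

// Same mask, opposite selection: where a > b, the maximum is a.
constexpr vec_u64x2 max_u64(const vec_u64x2 a, const vec_u64x2 b, const vec_u64x2 mask) noexcept {
    return blend(mask, a, b);
}
```

Taking the mask as a parameter means one comparison can feed both the min and the max when both are needed.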

AlexGuteniev · 2025-12-15T12:42:48Z

stl/src/vector_algorithms.cpp

+                return _Min(_First, _Second, _Cmp_gt_u(_First, _Second));
+            }
+
+            static _Vec_t _Max(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {


Ditto: const _Mask.

AlexGuteniev · 2025-12-15T12:43:49Z

stl/src/vector_algorithms.cpp

                _Advance_bytes(_First, sizeof(_Ty));
            }

+#pragma loop(no_vector)


Is this a bug fix, or just a performance thing?

In either case, it may be worth commenting.

hazzlim (2026-01-09): I've added a comment to indicate the reason for doing it. Essentially, we don't want the compiler to auto-vectorize the scalar tail that follows the manually vectorized loop, because we would just end up with a bunch of dead auto-vectorized code and unnecessary extra conditional checks for how many elements are left. (MSVC does auto-vectorize this tail, at least on ARM64.)
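
A minimal sketch of the pattern being discussed, assuming a block-wise main loop followed by a short scalar tail. `max_value_with_scalar_tail` and `lanes` are hypothetical names, and the inner loop merely stands in for the manually vectorized code. `#pragma loop(no_vector)` is specific to MSVC, so the sketch guards it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not the PR's code: maximum of a non-empty range.
inline std::int32_t max_value_with_scalar_tail(const std::int32_t* first, std::size_t count) {
    assert(count > 0);

    constexpr std::size_t lanes = 4; // elements per (imagined) vector register
    std::int32_t result         = first[0];
    std::size_t i               = 0;

    // Stand-in for the manually vectorized main loop: whole blocks only.
    for (; i + lanes <= count; i += lanes) {
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            if (first[i + lane] > result) {
                result = first[i + lane];
            }
        }
    }

    // Fewer than `lanes` elements remain, so a vectorized version of this
    // loop could never run; ask MSVC not to generate one.
#if defined(_MSC_VER) && !defined(__clang__)
#pragma loop(no_vector)
#endif
    for (; i < count; ++i) {
        if (first[i] > result) {
            result = first[i];
        }
    }

    return result;
}
```

Other compilers ignore the guarded pragma, and the function behaves the same with or without it; only the generated code differs.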

StephanTLavavej · 2026-01-08T17:48:53Z

#5949 has been merged, so this should be ready to be revised.

As a reminder, draft mode disables PR checks when commits are pushed. Merely moving a PR out of draft mode won't trigger checks, so you should mark as "ready for review" and then push commits to trigger PR checks.

github-project-automation bot added this to STL Code Reviews Dec 15, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 15, 2025

hazzlim force-pushed the minmax-val-pr branch from 3119a3a to c92dcef December 15, 2025 12:24

AlexGuteniev reviewed Dec 15, 2025

StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels Dec 18, 2025

StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews Jan 8, 2026

hazzlim added 8 commits January 9, 2026 01:31

Add Neon implementation of minmax

0fdf9e2

Use strict types and conversions for all Neon in _Sorting namespace

38165c3

Add _Sign_correction

1571345

inspect _Traits::_Has_unsigned_cmp

9f0039d

Add const qualifier to _Mask parameters

578dbaa

Add comment regarding preventing autovec of scalar tail

ed516ff

Chain together ARM64 and ARM64EC preprocessor conditionals.

9bcd65f

Refactor _Is_min_max_optimization_safe/_Is_min_max_value_optimization_safe

87b0fa8

hazzlim marked this pull request as ready for review January 9, 2026 01:42

hazzlim requested a review from a team as a code owner January 9, 2026 01:42

hazzlim force-pushed the minmax-val-pr branch from c92dcef to 87b0fa8 January 9, 2026 01:44

hazzlim added 3 commits January 9, 2026 02:03

Fix x64 build

b68b39c

Wrap max/min of unrolled accumulators in if constexpr blocks

e73b2fc

Make vector tail loop condition more explicit (and fix shadowing)

69f80eb

StephanTLavavej self-assigned this Jan 9, 2026

StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Jan 9, 2026
