Add Neon implementation of `find` for ARM64 targets #6003

hazzlim · 2026-01-09T15:00:17Z

This PR vectorizes `find` using Neon 🚀

For `char`/`uint8_t` we still just dispatch to `memchr` (as the existing code does), since it is already well optimized using Neon.
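As a rough illustration of the dispatch described above, here is a minimal, portable sketch. It is not the code from this PR: `find_sketch` is a hypothetical name, and the fallback loop is plain scalar code standing in for the Neon path. It only shows the idea that 1-byte element types are forwarded to `memchr` while wider types take a separate path.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the STL's internal function): returns a pointer to
// the first element equal to `value` in [first, first + count), or
// first + count when there is no match.
template <class T>
const T* find_sketch(const T* first, std::size_t count, T value) noexcept {
    static_assert(std::is_integral_v<T>, "sketch only covers integral element types");

    if constexpr (sizeof(T) == 1) {
        // 1-byte elements: defer to memchr, which the C runtime already optimizes.
        if (count == 0) {
            return first; // avoid handing a possibly null pointer to memchr
        }
        const void* hit = std::memchr(first, static_cast<unsigned char>(value), count);
        return hit ? static_cast<const T*>(hit) : first + count;
    } else {
        // Wider elements: scalar stand-in for where a vectorized loop would go.
        for (std::size_t i = 0; i < count; ++i) {
            if (first[i] == value) {
                return first + i;
            }
        }
        return first + count;
    }
}
```

Called as `find_sketch(ptr, n, value)`, a `char` buffer goes through `memchr`, while a buffer of a wider type such as `unsigned int` takes the loop. The branch is chosen at compile time by `if constexpr`, so there is no runtime dispatch cost in this sketch.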

Benchmark results (figures are speedups relative to the existing code; values below 1 are slowdowns):

Benchmark	MSVC speedup	Clang speedup
bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056	9.375	9.199
bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62	4.39	4.3
bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30	2.511	2.581
bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14	1.491	1.398
bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6	0.752	0.773
bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056	2.338	2.237
bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/63/62	1.405	1.366
bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/31/30	1.187	1.145
bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/15/14	1.095	0.931
bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/7/6	1.023	0.769
bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056	2.332	2.143
bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/63/62	1.368	1.319
bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/31/30	1.199	1.106
bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/15/14	1.147	1.022
bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/7/6	1.17	0.926
bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056	5.455	5.331
bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62	3.738	3.814
bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30	2.554	2.653
bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14	1.564	1.775
bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6	0.999	1.023
bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056	4	4.103
bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/63/62	3.178	3.409
bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/31/30	2.4	2.783
bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/15/14	1.886	1.894
bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/7/6	1.362	1.179
bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056	4.027	4
bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/63/62	2.985	4
bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/31/30	2.281	3.03
bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/15/14	1.721	2.016
bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/7/6	1.105	1.235
bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056	2.727	2.686
bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62	2.613	3.273
bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30	2.236	2.705
bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14	1.655	1.947
bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6	1.099	1.207
bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056	2.023	2.024
bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/63/62	1.948	2.119
bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/31/30	1.837	1.941
bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/15/14	1.581	1.51
bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/7/6	1.235	1.179

AlexGuteniev · 2026-01-10T10:09:57Z

stl/src/vector_algorithms.cpp

+                return vgetq_lane_u64(vpaddq_u64(_Cmp, _Cmp), 0);
+            }
+
+            static uint8x16_t _Not_q(const uint8x16_t _Val) {


`noexcept` is missing here and in the other occurrences.

AlexGuteniev · 2026-01-10T10:10:45Z

stl/src/vector_algorithms.cpp

-#ifdef _M_ARM64EC
+#ifdef _M_ARM64
+        struct _Find_traits_1 {
+            static uint8x16_t _Load_q(const void* _Ptr) noexcept {


Top-level `const` (here and in the other occurrences).

AlexGuteniev · 2026-01-10T10:35:04Z

stl/src/vector_algorithms.cpp

+            }
+
+            static uint64x2_t _Not_q(const uint64x2_t _Val) {
+                return vreinterpretq_u64_u8(vmvnq_u8(vreinterpretq_u8_u64(_Val)));


Question: why is it preferable to negate the vector rather than the bits?

Add Neon implementation of find for ARM64 targets

8027e53

hazzlim requested a review from a team as a code owner January 9, 2026 15:00

github-project-automation bot added this to STL Code Reviews Jan 9, 2026

github-project-automation bot moved this to Initial Review in STL Code Reviews Jan 9, 2026

StephanTLavavej added the performance (Must go faster) and ARM64 (Related to the ARM64 architecture) labels Jan 9, 2026

StephanTLavavej self-assigned this Jan 9, 2026

AlexGuteniev reviewed Jan 10, 2026


3 participants
