@hazzlim hazzlim commented Jan 9, 2026

This PR vectorizes `find` using Neon 🚀
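As a rough illustration of the general idea (this is not the code in the PR), a Neon `find` over 32-bit elements might look like the sketch below. The name `find_u32_sketch` is hypothetical, and the way a match is located inside a block (falling back to the scalar loop) is a simplification chosen for brevity. On non-ARM64 targets only the scalar loop is compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

// Hypothetical helper, for illustration only: returns a pointer to the first
// element equal to `value` in [first, last), or `last` if there is none.
inline const std::uint32_t* find_u32_sketch(
    const std::uint32_t* first, const std::uint32_t* const last, const std::uint32_t value) noexcept {
    assert(first <= last);

#if defined(__aarch64__) || defined(_M_ARM64)
    // Broadcast the needle to all four 32-bit lanes.
    const uint32x4_t needle = vdupq_n_u32(value);

    while (last - first >= 4) {
        const uint32x4_t data = vld1q_u32(first);       // load 4 elements
        const uint32x4_t eq   = vceqq_u32(data, needle); // all-ones in matching lanes

        // Horizontal max is nonzero if any lane matched.
        if (vmaxvq_u32(eq) != 0) {
            break; // the scalar loop below pinpoints the lane
        }

        first += 4;
    }
#endif

    // Scalar tail (and the whole search on non-ARM64 targets).
    for (; first != last; ++first) {
        if (*first == value) {
            return first;
        }
    }

    return last;
}
```

A production version would normally compute the matching lane index directly from the comparison mask rather than rescanning the block.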

For char/uint8_t we just dispatch to memchr (as is done currently), as this is already well optimized using Neon.
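A minimal sketch of this kind of dispatch, assuming a hypothetical helper named `find_sketch` (the actual dispatch code in the STL is structured differently): 1-byte integral elements are forwarded to `memchr`, and everything else takes a separate path, shown here as a plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper, for illustration only: returns a pointer to the first
// element equal to `value` in [first, last), or `last` if there is none.
template <class T>
const T* find_sketch(const T* first, const T* const last, const T value) noexcept {
    assert(first <= last);

    if constexpr (sizeof(T) == 1 && std::is_integral_v<T>) {
        if (first == last) {
            return last; // avoid calling memchr on an empty (possibly null) range
        }

        // memchr compares bytes as unsigned char, so convert the needle accordingly.
        const void* const hit =
            std::memchr(first, static_cast<unsigned char>(value), static_cast<std::size_t>(last - first));
        return hit ? static_cast<const T*>(hit) : last;
    } else {
        // Wider elements: stand-in for the vectorized path.
        for (; first != last; ++first) {
            if (*first == value) {
                return first;
            }
        }

        return last;
    }
}
```

The sketch uses `if constexpr`, so it assumes C++17 or later.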

Benchmark results (figures are relative speedup compared to existing code):

| Benchmark | MSVC speedup | Clang speedup |
| --- | --- | --- |
| `bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056` | 9.375 | 9.199 |
| `bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62` | 4.39 | 4.3 |
| `bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30` | 2.511 | 2.581 |
| `bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14` | 1.491 | 1.398 |
| `bm<char, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6` | 0.752 | 0.773 |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 2.338 | 2.237 |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 1.405 | 1.366 |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 1.187 | 1.145 |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 1.095 | 0.931 |
| `bm<uint16_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 1.023 | 0.769 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056` | 2.332 | 2.143 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/63/62` | 1.368 | 1.319 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/31/30` | 1.199 | 1.106 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/15/14` | 1.147 | 1.022 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFind>/7/6` | 1.17 | 0.926 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056` | 5.455 | 5.331 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62` | 3.738 | 3.814 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30` | 2.554 | 2.653 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14` | 1.564 | 1.775 |
| `bm<wchar_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6` | 0.999 | 1.023 |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 4 | 4.103 |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 3.178 | 3.409 |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 2.4 | 2.783 |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 1.886 | 1.894 |
| `bm<uint32_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 1.362 | 1.179 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/8021/3056` | 4.027 | 4 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/63/62` | 2.985 | 4 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/31/30` | 2.281 | 3.03 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/15/14` | 1.721 | 2.016 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFind>/7/6` | 1.105 | 1.235 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/8021/3056` | 2.727 | 2.686 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/63/62` | 2.613 | 3.273 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/31/30` | 2.236 | 2.705 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/15/14` | 1.655 | 1.947 |
| `bm<char32_t, not_highly_aligned_allocator, Op::StringFindNotFirstOne>/7/6` | 1.099 | 1.207 |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/8021/3056` | 2.023 | 2.024 |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/63/62` | 1.948 | 2.119 |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/31/30` | 1.837 | 1.941 |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/15/14` | 1.581 | 1.51 |
| `bm<uint64_t, not_highly_aligned_allocator, Op::FindSized>/7/6` | 1.235 | 1.179 |

@hazzlim hazzlim requested a review from a team as a code owner January 9, 2026 15:00
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Jan 9, 2026
@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Jan 9, 2026
@StephanTLavavej StephanTLavavej self-assigned this Jan 9, 2026
return vgetq_lane_u64(vpaddq_u64(_Cmp, _Cmp), 0);
}

static uint8x16_t _Not_q(const uint8x16_t _Val) {

`noexcept` is missing here, and in the other occurrences.

#ifdef _M_ARM64EC
#ifdef _M_ARM64
struct _Find_traits_1 {
static uint8x16_t _Load_q(const void* _Ptr) noexcept {

Top-level `const` here (and in the other occurrences).

}

static uint64x2_t _Not_q(const uint64x2_t _Val) {
return vreinterpretq_u64_u8(vmvnq_u8(vreinterpretq_u8_u64(_Val)));

Question: why is it preferable to negate while the value is still in a vector register, rather than negating the extracted bits?
