perf: improve performance of spark hex function
#19738
    let hex_string = hex_encode(bytes, lowercase);
    Ok(hex_string)
    /// Generic hex encoding for int64 type
    fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError>
nit: this doesn't need to be generic since it's only used for int64 arrays
    for v in iter {
        if let Some(num) = v {
            buffer.clear();
            hex_int64(num, &mut buffer);
I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?
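One possible way to unify the two buffers, as a minimal sketch rather than the PR's actual code: work out the digit count up front, grow the caller's buffer by that many bytes, and fill the new tail in place, so no inner scratch array is needed. The name `hex_int64_into`, the contents of the lookup table, and the special case for zero are assumptions made for illustration.

```rust
// Assumed contents of the lookup table referenced in the diff.
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Hypothetical helper: appends the uppercase hex digits of `num` to `out`
/// without an intermediate scratch buffer. Negative values are encoded via
/// their two's-complement bit pattern, matching `format!("{num:X}")`.
pub fn hex_int64_into(num: i64, out: &mut Vec<u8>) {
    let mut n = num as u64;
    if n == 0 {
        out.push(b'0');
        return;
    }
    // Number of significant hex digits: ceil(significant_bits / 4).
    let bits = 64 - n.leading_zeros() as usize;
    let digits = (bits + 3) / 4;

    // Grow the caller's buffer once, then fill the new tail back to front.
    let start = out.len();
    out.resize(start + digits, 0);
    for slot in out[start..].iter_mut().rev() {
        *slot = HEX_CHARS_UPPER[(n & 0xF) as usize];
        n >>= 4;
    }
}
```

Because the helper appends rather than overwrites, the caller can keep the existing `buffer.clear()` per row, or drop it and record offsets instead.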
    buffer.clear();
    hex_int64(num, &mut buffer);
    unsafe {
        builder.append_value(from_utf8_unchecked(&buffer));
nit: would be good to add a safety comment, even if it's just something like `SAFETY: all chars are from HEX_CHARS_UPPER, which are valid ASCII`
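For illustration, here is roughly how such a comment could sit next to the `unsafe` call. This is a standalone sketch: `hex_bytes_upper` is a hypothetical helper, not a function from the PR, and the table contents are assumed.

```rust
use std::str::from_utf8_unchecked;

// Assumed contents of the lookup table referenced in the diff.
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Hypothetical helper: hex-encodes `bytes` and returns an owned `String`.
pub fn hex_bytes_upper(bytes: &[u8]) -> String {
    let mut buffer = Vec::with_capacity(bytes.len() * 2);
    for &b in bytes {
        buffer.push(HEX_CHARS_UPPER[(b >> 4) as usize]);
        buffer.push(HEX_CHARS_UPPER[(b & 0x0F) as usize]);
    }
    // Cheap check of the invariant in debug builds only.
    debug_assert!(buffer.is_ascii());
    // SAFETY: every byte in `buffer` was copied from HEX_CHARS_UPPER,
    // which contains only ASCII characters, so `buffer` is valid UTF-8.
    let s: &str = unsafe { from_utf8_unchecked(&buffer) };
    s.to_owned()
}
```

The `debug_assert!` is optional. It documents the invariant the `unsafe` block relies on without adding cost to release builds.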
    while n != 0 {
        i -= 1;
        let digest = (n & 0xF) as u8;
        temp[i] = HEX_CHARS_UPPER[digest as usize];
        n >>= 4;
    }
Would it make a performance difference if we unroll this loop manually, given we know the upper limit of iterations needed?
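A sketch of what a fixed-trip-count variant might look like, so the question can be benchmarked: it always encodes all 16 nibbles, which leaves the compiler free to unroll the loop, then uses `leading_zeros` to find where the significant digits start. The name `hex_u64_fixed` and the table contents are assumptions, and whether this is actually faster than the data-dependent `while` loop would need measuring.

```rust
// Assumed contents of the lookup table referenced in the diff.
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Hypothetical variant: encodes every nibble of `n` into a stack array and
/// returns it together with the index of the first significant digit.
pub fn hex_u64_fixed(n: u64) -> ([u8; 16], usize) {
    let mut temp = [b'0'; 16];
    // The trip count is a compile-time constant (16), independent of `n`.
    for (i, slot) in temp.iter_mut().enumerate() {
        let shift = 60 - 4 * i;
        *slot = HEX_CHARS_UPPER[((n >> shift) & 0xF) as usize];
    }
    // Each group of four leading zero bits is one leading '0' digit.
    // Clamp to 15 so that n == 0 still yields a single "0".
    let start = ((n.leading_zeros() / 4) as usize).min(15);
    (temp, start)
}
```

The caller would then append `&temp[start..]` to the output buffer or builder.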
Which issue does this PR close?
Rationale for this change
The current `hex` function implementation uses the `format!` macro and the `StringArray::from` iterator pattern, which causes:
- … `String` via `format!`
What changes are included in this PR?
This PR optimizes the `hex` encoding by:
- Replacing `format!("{num:X}")` with a fast lookup table approach
- … `StringBuilder`
Benchmark Results
Are these changes tested?
Yes. Existing unit tests and sqllogictest tests pass. New benchmarks added.
Are there any user-facing changes?
No.