@lyne7-sc
Contributor

Which issue does this PR close?

Rationale for this change

The current hex function implementation uses the format! macro and a StringArray::from iterator pattern, which causes:

  1. Per-element String allocations: Each value allocates a new String via format!
  2. Unnecessary conversions: Multiple intermediate conversions between types
  3. Inefficient dictionary type handling: Collects all values into vectors before building the result

What changes are included in this PR?

This PR optimizes the hex encoding by:

  • Replacing format!("{num:X}") with a fast lookup table approach
  • Building results directly using StringBuilder
  • Reusing a single Vec buffer across iterations to avoid re-allocation
  • Optimizing dictionary array handling by building results iteratively
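
The lookup-table and buffer-reuse ideas listed above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `hex_encode_into`, `hex_encode_all`, and `FlatStrings` are hypothetical names, and `FlatStrings` is only a stand-in for Arrow's `StringBuilder` so that the example depends on nothing outside the standard library.

```rust
/// Uppercase hex digits, indexed by nibble value (0..=15).
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Hypothetical stand-in for a string builder: one contiguous data
/// buffer plus the end offset of each appended value.
#[derive(Default)]
struct FlatStrings {
    data: String,
    offsets: Vec<usize>,
}

impl FlatStrings {
    fn append_value(&mut self, s: &str) {
        self.data.push_str(s);
        self.offsets.push(self.data.len());
    }

    fn get(&self, i: usize) -> &str {
        let start = if i == 0 { 0 } else { self.offsets[i - 1] };
        &self.data[start..self.offsets[i]]
    }
}

/// Appends the hex encoding of `bytes` to `out`, two digits per input byte.
fn hex_encode_into(bytes: &[u8], out: &mut Vec<u8>) {
    out.reserve(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX_CHARS_UPPER[(b >> 4) as usize]);
        out.push(HEX_CHARS_UPPER[(b & 0x0F) as usize]);
    }
}

/// Encodes every value while reusing one scratch buffer across iterations.
fn hex_encode_all(values: &[&[u8]]) -> FlatStrings {
    let mut buffer: Vec<u8> = Vec::new();
    let mut builder = FlatStrings::default();
    for value in values {
        buffer.clear(); // drops the contents but keeps the allocated capacity
        hex_encode_into(value, &mut buffer);
        // The buffer holds only ASCII hex digits, so this check cannot fail.
        let s = std::str::from_utf8(&buffer).expect("hex digits are ASCII");
        builder.append_value(s);
    }
    builder
}
```

Because `buffer.clear()` keeps the capacity, the scratch buffer stops reallocating once it has grown to fit the longest value seen so far.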

Benchmark Results

| Group          | Size | Before   | After    | Speedup |
|----------------|------|----------|----------|---------|
| hex_binary     | 1024 | 89.9 µs  | 51.9 µs  | 1.73x   |
| hex_binary     | 4096 | 385.2 µs | 218.7 µs | 1.76x   |
| hex_binary     | 8192 | 741.6 µs | 451.6 µs | 1.64x   |
| hex_int64      | 1024 | 32.0 µs  | 12.4 µs  | 2.57x   |
| hex_int64      | 4096 | 132.4 µs | 59.7 µs  | 2.22x   |
| hex_int64      | 8192 | 258.5 µs | 120.6 µs | 2.14x   |
| hex_int64_dict | 1024 | 75.2 µs  | 12.4 µs  | 6.04x   |
| hex_int64_dict | 4096 | 313.2 µs | 60.5 µs  | 5.18x   |
| hex_int64_dict | 8192 | 614.7 µs | 129.0 µs | 4.76x   |
| hex_utf8       | 1024 | 88.5 µs  | 53.5 µs  | 1.66x   |
| hex_utf8       | 4096 | 357.6 µs | 211.1 µs | 1.69x   |
| hex_utf8       | 8192 | 698.7 µs | 424.8 µs | 1.64x   |

Are these changes tested?

Yes. Existing unit tests and sqllogictest tests pass. New benchmarks were added.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the spark label Jan 11, 2026
let hex_string = hex_encode(bytes, lowercase);
Ok(hex_string)
/// Generic hex encoding for int64 type
fn hex_encode_int64<I>(iter: I, len: usize) -> Result<ColumnarValue, DataFusionError>
Contributor

nit: this doesn't need to be generic since it's only used for int64 arrays

for v in iter {
if let Some(num) = v {
buffer.clear();
hex_int64(num, &mut buffer);
Contributor

I find it a little interesting how we have this buffer across iterations here, but then inside hex_int64 we have another buffer, which is used to copy into this outer buffer. Can we unify them somehow?
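
One possible way to unify the two buffers, as a rough sketch rather than a concrete proposal for this PR: compute the digit count up front and push the digits straight into the caller's buffer, most significant nibble first. `hex_int64_into` is a hypothetical name. The sketch assumes negative values are encoded as their two's-complement bit pattern and that zero encodes as "0"; both should be checked against the existing behavior.

```rust
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Appends the uppercase hex digits of `num` to `out` without a temporary array.
fn hex_int64_into(num: i64, out: &mut Vec<u8>) {
    // Assumption: reinterpret the value as its two's-complement bit pattern.
    let n = num as u64;
    if n == 0 {
        // Assumption: zero is rendered as a single "0" digit.
        out.push(b'0');
        return;
    }
    // Number of significant hex digits: ceil(significant_bits / 4).
    let digits = (64 - n.leading_zeros() as usize + 3) / 4;
    // Emit from the most significant nibble down to the least significant.
    for i in (0..digits).rev() {
        let nibble = ((n >> (i * 4)) & 0xF) as usize;
        out.push(HEX_CHARS_UPPER[nibble]);
    }
}
```

Whether this is faster than filling a stack array and copying it over would need to be confirmed with the benchmarks added in this PR.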

buffer.clear();
hex_int64(num, &mut buffer);
unsafe {
builder.append_value(from_utf8_unchecked(&buffer));
Contributor

nit: would be good to add a safety comment, even if it's just something like

SAFETY: all chars are from HEX_CHARS_UPPER which are valid ascii
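
For illustration, here is how such a comment might sit next to the `unsafe` block. This is a standalone sketch with a hypothetical helper, `hex_u8`, and not the PR's actual code:

```rust
use std::str::from_utf8_unchecked;

const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Hypothetical helper: writes the two hex digits of `b` into `buffer`
/// and returns them as a `&str` borrowed from that buffer.
fn hex_u8(b: u8, buffer: &mut Vec<u8>) -> &str {
    buffer.clear();
    buffer.push(HEX_CHARS_UPPER[(b >> 4) as usize]);
    buffer.push(HEX_CHARS_UPPER[(b & 0x0F) as usize]);
    // SAFETY: `buffer` was cleared above and then filled only with bytes
    // taken from HEX_CHARS_UPPER, which are all ASCII and therefore valid UTF-8.
    unsafe { from_utf8_unchecked(buffer.as_slice()) }
}
```

The comment states the invariant that makes skipping UTF-8 validation sound, so a later change that writes other bytes into the buffer is easier to catch in review.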

Comment on lines +128 to +133
while n != 0 {
i -= 1;
let digest = (n & 0xF) as u8;
temp[i] = HEX_CHARS_UPPER[digest as usize];
n >>= 4;
}
Contributor

Would it make a performance difference if we unroll this loop manually, given we know the upper limit of iterations needed?
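
A fixed-trip-count variant could look roughly like the sketch below, which always fills all 16 nibble positions and then skips the leading zeros. `hex_u64_fixed` is a hypothetical name, and this has not been benchmarked against the `while n != 0` loop above.

```rust
const HEX_CHARS_UPPER: &[u8; 16] = b"0123456789ABCDEF";

/// Fills all 16 nibble positions of `temp`, then returns the sub-slice
/// that starts at the first significant digit.
fn hex_u64_fixed(n: u64, temp: &mut [u8; 16]) -> &[u8] {
    // The constant bounds make this loop a candidate for compiler unrolling;
    // whether it is unrolled, and whether that helps, has to be measured.
    for i in 0..16 {
        let shift = (15 - i) * 4;
        temp[i] = HEX_CHARS_UPPER[((n >> shift) & 0xF) as usize];
    }
    // Skip leading zero digits, but always keep at least one digit.
    let skip = (n.leading_zeros() / 4).min(15) as usize;
    &temp[skip..]
}
```

The trade-off is that small values do more work than in the early-exit loop (always 16 table lookups), in exchange for a branch-free body with a known iteration count.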
