Add additional benchmarks for utf8view comparison kernels#7351
Add additional benchmarks for utf8view comparison kernels#7351alamb merged 5 commits intoapache:mainfrom
Conversation
|
alamb
left a comment
There was a problem hiding this comment.
Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.
I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think
So for example, put the new like tests with longer strings next to the existing like tests in
arrow-rs/arrow/benches/comparison_kernels.rs
Lines 228 to 232 in a2cc426
arrow/benches/comparison_kernels.rs
Outdated
| criterion_group!(benches, add_benchmark); | ||
| // Generate a string array that meets the requirements: | ||
| // All strings start with "test", followed by a tail, ensuring the total length is greater than 12 bytes. | ||
| fn make_custom_string_array(size: usize, tail_len: usize, rng: &mut StdRng) -> Vec<Option<String>> { |
There was a problem hiding this comment.
Maybe we could call this long_strings or something as it doesn't really make a string array
There was a problem hiding this comment.
Added it to the original benchmark, thanks @alamb.
arrow/benches/comparison_kernels.rs
Outdated
| .collect() | ||
| } | ||
|
|
||
| fn add_custom_string_benchmarks(c: &mut Criterion) { |
There was a problem hiding this comment.
I found "custom" somewhat vague. Perhaps we could use the term long instead
let long_strings = make_long_string_array(SIZE, 12, &mut rng);
let long_string_array = StringArray::from(long_strings);
let long_string_view = StringViewArray::from_iter(long_string_array.iter());
arrow/benches/comparison_kernels.rs
Outdated
| let custom_string_view = StringViewArray::from_iter(custom_string_array.iter()); | ||
|
|
||
| // Benchmark the eq operation for utf8 (StringArray) | ||
| c.bench_function("eq custom utf8", |b| { |
There was a problem hiding this comment.
Can we use a name that includes the types of arguments as well as the operator?
Maybe something like
| c.bench_function("eq custom utf8", |b| { | |
| c.bench_function("eq utf8 utf8view long", |b| { |
And then similarly below
arrow/benches/comparison_kernels.rs
Outdated
| (0..size) | ||
| .map(|_| { | ||
| // Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string. | ||
| let tail: String = (0..tail_len) |
There was a problem hiding this comment.
It probably doesn't matter in this case, but you can avoid at least one allocation like this:
// Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string.
let s: String = "test".chars()
.chain((0..tail_len).map(|_| rng.random_range(32u8..127u8) as char))
.collect();
Some(s)
Thank you @alamb for review, it makes sense to keep the original format for new benchmark, i will address the comments. |
|
Updated the result for the changed code, we need to optimize the compare operation first which is the most regression for same prefix long strings: 📌 UTF-8 vs UTF-8 View Benchmark Comparison
We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it. |
alamb
left a comment
There was a problem hiding this comment.
Thank you @zhuqi-lucas -- looks great to me
I agree Thanks again for pushing this work forward |
Which issue does this PR close?
Rationale for this change
Add reproducer cases which the Utf8View will slower than Utf8
What changes are included in this PR?
Add reproducer cases which the Utf8View will slower than Utf8
Are there any user-facing changes?
No