Add additional benchmarks for utf8view comparison kernels by zhuqi-lucas · Pull Request #7351 · apache/arrow-rs

zhuqi-lucas · 2025-03-28T10:25:21Z

Which issue does this PR close?

part of StringArrayView(Utf8View) slower cases compare to StringArray(Utf8) #7350

Rationale for this change

Add reproducer cases which the Utf8View will slower than Utf8

What changes are included in this PR?

Add reproducer cases which the Utf8View will slower than Utf8

Are there any user-facing changes?

No

zhuqi-lucas · 2025-03-28T10:27:26Z

Benchmark	Utf8 Time (µs)	Utf8View Time (µs)	Slower (%)
eq custom	97.426 - 97.940	210.93 - 214.13	116%
neq custom	100.27 - 102.04	211.39 - 212.46	111%
gt custom	84.338 - 84.634	196.90 - 197.52	133%
gt custom scalar	99.384 - 99.817	181.69 - 182.25	83%
gt custom scalar long string	100.32 - 100.65	212.39 - 212.92	112%
like custom scalar	89.111 - 90.811	177.80 - 180.97	100%
nlike custom scalar	87.976 - 89.682	98.116 - 101.02	12%
ilike custom scalar	199.82 - 200.68	312.25 - 314.25	56%
nilike custom scalar	198.95 - 199.67	232.68 - 234.71	17%

alamb

Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.

I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think

So for example, put the new like tests with longer strings next to the existing like tests in

arrow-rs/arrow/benches/comparison_kernels.rs

Lines 228 to 232 in a2cc426

    
           // StringArray: LIKE benchmarks 
        
           c.bench_function("like_utf8 scalar equals", |b| { 
        
               b.iter(|| bench_like_utf8_scalar(&arr_string, "xxxx")) 
        
           });

alamb · 2025-03-28T15:54:15Z

arrow/benches/comparison_kernels.rs

-criterion_group!(benches, add_benchmark);
+// Generate a string array that meets the requirements:
+// All strings start with "test", followed by a tail, ensuring the total length is greater than 12 bytes.
+fn make_custom_string_array(size: usize, tail_len: usize, rng: &mut StdRng) -> Vec<Option<String>> {


Maybe we could call this long_strings or something as it doesn't really make a string array

Added it to the original benchmark, thanks @alamb.

alamb · 2025-03-28T15:56:12Z

arrow/benches/comparison_kernels.rs

+        .collect()
+}
+
+fn add_custom_string_benchmarks(c: &mut Criterion) {


I found "custom" somewhat vague. Perhaps we could use the term long instead

let long_strings = make_long_string_array(SIZE, 12, &mut rng); let long_string_array = StringArray::from(long_strings); let long_string_view = StringViewArray::from_iter(long_string_array.iter());

alamb · 2025-03-28T15:57:32Z

arrow/benches/comparison_kernels.rs

+    let custom_string_view = StringViewArray::from_iter(custom_string_array.iter());
+
+    // Benchmark the eq operation for utf8 (StringArray)
+    c.bench_function("eq custom utf8", |b| {


Can we use a name that includes the types of arguments as well as the operator?

Maybe something like

Suggested change

c.bench_function("eq custom utf8", |b| {

c.bench_function("eq utf8 utf8view long", |b| {

And then similarly below

alamb · 2025-03-28T16:03:24Z

arrow/benches/comparison_kernels.rs

+    (0..size)
+        .map(|_| {
+            // Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string.
+            let tail: String = (0..tail_len)


It probably doesn't matter in this case, but you can avoid at least one allocation like this:

// Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string. let s: String = "test".chars() .chain((0..tail_len).map(|_| rng.random_range(32u8..127u8) as char)) .collect(); Some(s)

zhuqi-lucas · 2025-03-29T03:21:13Z

Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.

I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think

So for example, put the new like tests with longer strings next to the existing like tests in

arrow-rs/arrow/benches/comparison_kernels.rs

Lines 228 to 232 in a2cc426

// StringArray: LIKE benchmarks

c.bench_function("like_utf8 scalar equals", |b| {

b.iter(|| bench_like_utf8_scalar(&arr_string, "xxxx"))

});

Thank you @alamb for review, it makes sense to keep the original format for new benchmark, i will address the comments.

zhuqi-lucas · 2025-03-29T05:20:46Z

Updated the result for the changed code, we need to optimize the compare operation first which is the most regression for same prefix long strings:

📌 UTF-8 vs UTF-8 View Benchmark Comparison

Benchmark	StringArray (UTF-8)	StringViewArray (UTF-8 View)	Performance Change
eq long same prefix strings	294.95 µs	514.73 µs	StringArray is ~1.7x faster
neq long same prefix strings	296.12 µs	517.70 µs	StringArray is ~1.75x faster
lt long same prefix strings	331.46 µs	493.17 µs	StringArray is ~1.5x faster
long same prefix strings like_utf8 scalar equals	181.78 µs	196.91 µs	StringArray is ~8.3% faster
long same prefix strings like_utf8 scalar contains	1.7854 ms	1.8405 ms	StringArray is ~3% faster
long same prefix strings like_utf8 scalar ends with	583.86 µs	594.73 µs	StringArray is ~1.9% faster
long same prefix strings like_utf8 scalar starts with	664.69 µs	672.16 µs	StringArray is ~1.1% faster
long same prefix strings like_utf8 scalar complex	590.86 µs	604.81 µs	StringArray is ~2.3% faster
long same prefix strings like_utf8view scalar equals	181.78 µs	196.91 µs	StringArray is ~8.3% faster
long same prefix strings like_utf8view scalar contains	1.7854 ms	1.8405 ms	StringArray is ~3% faster
long same prefix strings like_utf8view scalar ends with	583.86 µs	594.73 µs	StringArray is ~1.9% faster
long same prefix strings like_utf8view scalar starts with	664.69 µs	672.16 µs	StringArray is ~1.1% faster
long same prefix strings like_utf8view scalar complex	590.86 µs	604.81 µs	StringArray is ~2.3% faster

We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it.

alamb

Thank you @zhuqi-lucas -- looks great to me

alamb · 2025-03-30T12:17:29Z

We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it.

I agree

Thanks again for pushing this work forward

zhuqi-lucas added 2 commits March 28, 2025 18:22

Add reproducer cases which the Utf8View will slower than Utf8

c559942

fix fmt

0ce8dd0

github-actions bot added the arrow Changes to the arrow crate label Mar 28, 2025

zhuqi-lucas mentioned this pull request Mar 28, 2025

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… apache/datafusion#15447

Merged

zhuqi-lucas changed the title ~~Slower utf8view reproducer~~ Slower utf8view cases reproducer Mar 28, 2025

alamb reviewed Mar 28, 2025

View reviewed changes

alamb changed the title ~~Slower utf8view cases reproducer~~ Add additional benchmarks for utf8view comparison kernels Mar 28, 2025

zhuqi-lucas added 3 commits March 29, 2025 11:42

Add the original format for new benchmark

2b7d5e4

fmt

d679d4a

Fix clippy

591a9b7

alamb approved these changes Mar 30, 2025

View reviewed changes

alamb merged commit 99b2b3b into apache:main Mar 30, 2025
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add additional benchmarks for utf8view comparison kernels#7351

Add additional benchmarks for utf8view comparison kernels#7351
alamb merged 5 commits intoapache:mainfrom
zhuqi-lucas:slower_utf8view_reproducer

zhuqi-lucas commented Mar 28, 2025 •

edited by alamb

Loading

Uh oh!

zhuqi-lucas commented Mar 28, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb Mar 28, 2025

Uh oh!

zhuqi-lucas Mar 29, 2025

Uh oh!

alamb Mar 28, 2025

Uh oh!

alamb Mar 28, 2025

Uh oh!

alamb Mar 28, 2025

Uh oh!

zhuqi-lucas commented Mar 29, 2025

Uh oh!

zhuqi-lucas commented Mar 29, 2025 •

edited

Loading

Uh oh!

alamb left a comment

Uh oh!

alamb commented Mar 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// StringArray: LIKE benchmarks

	c.bench_function("like_utf8 scalar equals", \|b\| {
	b.iter(\|\| bench_like_utf8_scalar(&arr_string, "xxxx"))
	});

	c.bench_function("eq custom utf8", \|b\| {
	c.bench_function("eq utf8 utf8view long", \|b\| {

Conversation

zhuqi-lucas commented Mar 28, 2025 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

zhuqi-lucas commented Mar 28, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

zhuqi-lucas Mar 29, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

zhuqi-lucas commented Mar 29, 2025

Uh oh!

zhuqi-lucas commented Mar 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📌 UTF-8 vs UTF-8 View Benchmark Comparison

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zhuqi-lucas commented Mar 28, 2025 •

edited by alamb

Loading

zhuqi-lucas commented Mar 29, 2025 •

edited

Loading