Implement a special-case lookup for ascii grapheme categories. by cessen · Pull Request #79 · unicode-rs/unicode-segmentation

cessen · 2020-02-14T12:24:25Z

This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.

cessen · 2020-02-14T12:35:21Z

I took a crack at special-casing the ASCII range as discussed in #77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach.

Here are the bench results, including results for both approaches:

No ascii special-casing:

test graphemes_arabic      ... bench:     893,766 ns/iter (+/- 68,494) = 56 MB/s
test graphemes_english     ... bench:     989,329 ns/iter (+/- 42,635) = 50 MB/s
test graphemes_hindi       ... bench:     933,926 ns/iter (+/- 30,817) = 53 MB/s
test graphemes_japanese    ... bench:     727,820 ns/iter (+/- 37,069) = 69 MB/s
test graphemes_korean      ... bench:   1,083,001 ns/iter (+/- 33,362) = 46 MB/s
test graphemes_mandarin    ... bench:     520,048 ns/iter (+/- 10,724) = 97 MB/s
test graphemes_russian     ... bench:     842,000 ns/iter (+/- 32,343) = 60 MB/s
test graphemes_source_code ... bench:   1,060,247 ns/iter (+/- 42,456) = 47 MB/s

Table:

test graphemes_arabic      ... bench:     692,838 ns/iter (+/- 34,703) = 72 MB/s
test graphemes_english     ... bench:   1,018,363 ns/iter (+/- 48,135) = 48 MB/s
test graphemes_hindi       ... bench:     826,851 ns/iter (+/- 39,035) = 59 MB/s
test graphemes_japanese    ... bench:     739,136 ns/iter (+/- 68,602) = 68 MB/s
test graphemes_korean      ... bench:   1,000,320 ns/iter (+/- 85,934) = 50 MB/s
test graphemes_mandarin    ... bench:     522,754 ns/iter (+/- 24,797) = 96 MB/s
test graphemes_russian     ... bench:     717,681 ns/iter (+/- 25,272) = 71 MB/s
test graphemes_source_code ... bench:   1,008,405 ns/iter (+/- 31,694) = 49 MB/s

If branching:

test graphemes_arabic      ... bench:     663,521 ns/iter (+/- 36,100) = 75 MB/s
test graphemes_english     ... bench:     927,851 ns/iter (+/- 35,652) = 53 MB/s
test graphemes_hindi       ... bench:     801,972 ns/iter (+/- 35,350) = 61 MB/s
test graphemes_japanese    ... bench:     716,374 ns/iter (+/- 40,118) = 70 MB/s
test graphemes_korean      ... bench:     964,766 ns/iter (+/- 47,644) = 51 MB/s
test graphemes_mandarin    ... bench:     503,575 ns/iter (+/- 15,838) = 100 MB/s
test graphemes_russian     ... bench:     665,667 ns/iter (+/- 27,923) = 76 MB/s
test graphemes_source_code ... bench:     959,008 ns/iter (+/- 66,405) = 52 MB/s

Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice get() method to roll the bounds check and ascii-range check into one, so the whole thing was just a single branch and an array lookup.
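For readers who want to see the shape of the table variant described above, here is a minimal sketch. It is not the code from this PR: the `Cat` enum, `build_table`, `ASCII_TABLE`, and `ascii_category_table` are hypothetical names, and the category assignments assume the usual grapheme cluster break classification of ASCII (CR, LF, other control characters including U+007F, and everything else). The point it illustrates is that `get()` on a 128-element array performs the bounds check and the ASCII-range check in one step.

```rust
// Hypothetical, simplified stand-in for the crate's grapheme category type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cat {
    Any,
    Cr,
    Lf,
    Control,
}

// Build the 128-entry table at compile time (assumed layout, not the PR's table).
const fn build_table() -> [Cat; 128] {
    // Start with everything as Control, then overwrite the exceptions.
    let mut table = [Cat::Control; 128];
    let mut i = 0x20;
    // Printable ASCII, U+0020 through U+007E, has no special break property.
    while i < 0x7f {
        table[i] = Cat::Any;
        i += 1;
    }
    table[0x0a] = Cat::Lf;
    table[0x0d] = Cat::Cr;
    table
}

static ASCII_TABLE: [Cat; 128] = build_table();

// `get` returns None for any index >= 128, so the slice bounds check
// doubles as the "is this ASCII?" check: one branch plus one array load.
fn ascii_category_table(ch: char) -> Option<Cat> {
    ASCII_TABLE.get(ch as usize).copied()
}
```

A caller would treat `None` as "not ASCII" and fall back to the full Unicode table lookup.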

The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories.
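As a rough illustration of why the branching version is easy to check by eye, here is a sketch of what such a function could look like. It is an assumption-laden example, not the merged code: `Cat` and `ascii_category_branch` are hypothetical names, and the sketch covers the whole ASCII range, which the actual patch may or may not do.

```rust
// Hypothetical, simplified stand-in for the crate's grapheme category type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cat {
    Any,
    Cr,
    Lf,
    Control,
}

// Classify an ASCII character with a few comparisons.
// Returns None for non-ASCII input so the caller can use the full lookup.
fn ascii_category_branch(ch: char) -> Option<Cat> {
    if ch > '\u{7f}' {
        return None;
    }
    let cat = if ch >= '\u{20}' && ch != '\u{7f}' {
        // Printable ASCII: the common case, so it is tested first.
        Cat::Any
    } else if ch == '\n' {
        Cat::Lf
    } else if ch == '\r' {
        Cat::Cr
    } else {
        // Remaining C0 controls and DEL (U+007F).
        Cat::Control
    };
    Some(cat)
}
```

Every rule is one visible comparison, so a reviewer can verify the whole mapping without reading through 128 table entries.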

cessen · 2020-02-14T12:39:44Z

(Also, sorry for the commit message typo. I can amend the commit if it's important to you.)

Manishearth

Oh, right, for graphemes there's not much of a need for tables, good call.

The tables vs if branching might be a bigger deal if we want to do this for word/line/sentence breaking.

src/grapheme.rs

Manishearth · 2020-02-14T20:46:23Z

Thanks!

cessen · 2020-02-15T00:53:03Z

Awesome! Thanks for the clean up and merge!

Implement a special-case lookup for ascii grapeheme categories.

c9aa6fa

cessen changed the title ~~Implement a special-case lookup for ascii grapeheme categories.~~ Implement a special-case lookup for ascii grapheme categories. Feb 14, 2020

Manishearth approved these changes Feb 14, 2020

Manishearth reviewed Feb 14, 2020

Apply suggestions from code review

945dbb6

Manishearth merged commit fbba2a6 into unicode-rs:master Feb 14, 2020

Manishearth mentioned this pull request Feb 14, 2020

Try out caching and ASCII special casing for other segmenters #80

Open

cessen deleted the ascii_grapheme_optimization branch February 15, 2020 00:52

zachwolfe mentioned this pull request Jan 14, 2021

Revisit my special case for ASCII in the lexer dusklang/dusk#91

Open

Manishearth merged 2 commits into unicode-rs:master from cessen:ascii_grapheme_optimization
