Implement a special-case lookup for ascii grapheme categories. #79
Manishearth merged 2 commits into unicode-rs:master from cessen:ascii_grapheme_optimization
This speeds up processing even for many non-ascii texts, since they often still use ascii-range punctuation and whitespace.
I took a crack at special-casing the ASCII range as discussed in #77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach.

Here are the bench results, including results for both approaches:

Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice

The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories.
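For readers who want to see the two approaches side by side, here is a minimal sketch of both. It is not the code from this PR: the `AsciiCat` enum and the functions `ascii_cat_branch`, `ascii_cat_table`, and `build_table` are hypothetical names made up for illustration, and the sketch assumes the standard Unicode grapheme break properties for the ASCII range (U+000A is LF, U+000D is CR, the other C0 controls and U+007F are Control, everything else is "any").

```rust
/// Hypothetical, simplified stand-in for a grapheme category type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AsciiCat {
    Any,
    Cr,
    Lf,
    Control,
}

/// Branch-based classification (the style of approach this PR takes).
/// Returns `None` for non-ASCII input, where a caller would fall back
/// to the general table lookup.
fn ascii_cat_branch(ch: char) -> Option<AsciiCat> {
    if ch > '\u{7f}' {
        return None;
    }
    Some(if ch >= '\u{20}' && ch != '\u{7f}' {
        AsciiCat::Any
    } else if ch == '\n' {
        AsciiCat::Lf
    } else if ch == '\r' {
        AsciiCat::Cr
    } else {
        // Remaining C0 controls and DEL (U+007F).
        AsciiCat::Control
    })
}

/// Builds the 128-element table at compile time.
const fn build_table() -> [AsciiCat; 128] {
    let mut table = [AsciiCat::Any; 128];
    let mut i = 0;
    while i < 0x20 {
        table[i] = AsciiCat::Control;
        i += 1;
    }
    table[0x0a] = AsciiCat::Lf;
    table[0x0d] = AsciiCat::Cr;
    table[0x7f] = AsciiCat::Control;
    table
}

static ASCII_TABLE: [AsciiCat; 128] = build_table();

/// Table-based classification. `get` performs the single bounds check,
/// so non-ASCII input yields `None` without a separate range test.
fn ascii_cat_table(ch: char) -> Option<AsciiCat> {
    ASCII_TABLE.get(ch as usize).copied()
}
```

Both functions should agree on every ASCII input; which one is faster depends on the machine and workload, so it is worth benchmarking rather than assuming.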
(Also, sorry for the commit message typo. I can amend the commit if it's important to you.)
Manishearth
left a comment
Oh, right, for graphemes there's not much of a need for tables, good call.
The tables vs if branching might be a bigger deal if we want to do this for word/line/sentence breaking.
Thanks!
Awesome! Thanks for the clean up and merge!