Unicode sentence boundaries#24
Conversation
Passes all tests in the examples provided here: http://www.unicode.org/Public/9.0.0/ucd/auxiliary/SentenceBreakTest.txt
|
Sorry for letting this stagnate! I'm rather busy to review this right now, but will try to get to it soon! Sentence boundaries is something I've wanted implemented here for a while 😄 |
|
(still no time to look at this, apologies. Really hope to get to it soon) |
|
This PR looks quite good and it would be great to have this functionality. Could I help in any way ? |
|
@rth I can split this out into another crate if required? |
|
@tomcumming I would really like to use this implementation (and compare it with other sentences splitting approaches) in rth/vtext#51 . Having this implementation in the Any chance @Manishearth that you would have some review bandwidth for this, or could suggest someone who could review it? |
|
@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it. |
|
Sure, I'll try to review it in the next few days. |
Manishearth
left a comment
There was a problem hiding this comment.
Code looks correct! Mostly want more documentation.
|
@Manishearth @rth I have updated the PR including requested changes |
|
Looks good! @rth want to do a second review? |
rth
left a comment
There was a problem hiding this comment.
Thanks a lot for the review @Manishearth !
I went through the code in more detail, I find it quite readable and I don't really have anything to add. (Though I am fairly new to rust and don't know that much about Unicode segmentation specs).
I can confirm that src/tables.rs and src/testdata.rs in this PR can be re-generated in their current state with the included python scripts, but they require setting,
scripts/unicode.py
- os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
+ os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"as otherwise data for latest Unicode 12.0 is downloaded.
|
Fixing the URL for test data should probably be another PR |
Yes, I'll do it. Thanks @tomcumming I don't have any other comments. |
|
Thank you! I'll push a release soonish |
|
Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing! |
This is an implementation of the sentence breaks specification, including changes to the python files to grab the sentence break test data.
I welcome any advice for improving.