fold: process streams as bytes, not strings, to handle non-utf8 data#8241
RenjiSann merged 2 commits into uutils:main
RenjiSann left a comment
Thank you for your contribution!
A few questions:
- What locale did you use to perform your tests?
- Can you check that there is no discrepancy with GNU's `fold` with `LC_ALL` being `en_US`, `en_US.UTF-8`, `C` and `C.UTF-8`?
A few remarks on unwraps, but otherwise it looks good to me!
As an extra remark, please squash your "clippy fixes" commit into the first one. Thanks!
ffd3838 to faa6a9b
Here's the locale I was using for all tests:
I've tried all three of my new tests with those locales, and they all give exactly the same output as each other (and as the coreutils `fold`).
I also consolidated all the code changes into one commit and re-pushed, so the commit history should be simpler now.
Thank you for your contribution!
This fixes #8227 by making `fold` process its input as bytes rather than strings. Because it was reading input as a string, anything that wasn't valid UTF-8 (including valid Latin-1-encoded data, as in the bug report) would cause an error. GNU's `fold` appears to operate on bytes, so this improves compatibility as well.
I didn't need to change any existing tests, and I added three that work on non-UTF-8 files.
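To illustrate the difference between string-based and byte-based reading, here is a minimal sketch using only the Rust standard library. It is not the code from this PR; `read_as_string` and `read_lines_as_bytes` are hypothetical helpers named for this example only.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical helper: string-based reading.
/// `read_to_string` returns an `InvalidData` error if the input is not valid UTF-8.
fn read_as_string<R: Read>(mut reader: R) -> io::Result<String> {
    let mut s = String::new();
    reader.read_to_string(&mut s)?;
    Ok(s)
}

/// Hypothetical helper: byte-based reading.
/// Each returned line keeps its trailing b'\n' if one was present.
fn read_lines_as_bytes<R: BufRead>(mut reader: R) -> io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    loop {
        let mut buf = Vec::new();
        // `read_until` accepts arbitrary bytes, so invalid UTF-8 is not an error.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        lines.push(buf);
    }
    Ok(lines)
}
```

Given the Latin-1 bytes `b"caf\xe9\n"` (where `0xE9` on its own is not valid UTF-8), `read_as_string` returns an error, while `read_lines_as_bytes` returns the line unchanged.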
There is a change in behavior that isn't covered by the tests: Unicode isn't "properly" handled anymore. Take this example, where "test.input" contains these emoji:
Before my change:
And after:
That looks like a regression, but it matches GNU `fold` behavior:
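As a rough sketch of why byte-oriented folding can cut through a multi-byte character, assume every byte counts as one column. This is an illustration only, not the PR's implementation or GNU's, and `fold_bytes` is a hypothetical helper.

```rust
/// Hypothetical helper: break `line` into chunks of at most `width` bytes,
/// treating every byte as one column (an assumption made for this sketch).
fn fold_bytes(line: &[u8], width: usize) -> Vec<&[u8]> {
    assert!(width > 0, "width must be positive");
    // `chunks` splits purely on byte offsets and knows nothing about UTF-8,
    // so a 4-byte emoji may end up divided across two chunks.
    line.chunks(width).collect()
}
```

With two 4-byte emoji (8 bytes in total) and a width of 3, this produces chunks of 3, 3 and 2 bytes, none of which is valid UTF-8 on its own; concatenating the chunks still reproduces the original bytes exactly.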