fold: process streams as bytes, not strings, to handle non-utf8 data by phinjensen · Pull Request #8241 · uutils/coreutils

phinjensen · 2025-06-22T04:15:43Z

This fixes #8227 by making fold process its input as bytes, rather than strings. Because it was reading input as a string, anything that wasn't valid UTF-8 (including valid Latin 1-encoded data, as the bug report has) would cause an error. GNU's fold appears to operate on bytes, so this improves its compatibility as well.
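To illustrate the difference described above, here is a minimal sketch (not the actual code in `fold.rs`) contrasting string-based and byte-based line reads with the standard library. The helper names `read_line_as_string` and `read_line_as_bytes` are hypothetical and exist only for this example.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical helper: read one line into a `String`.
/// `BufRead::read_line` returns an error if the bytes are not valid UTF-8.
fn read_line_as_string(input: &[u8]) -> std::io::Result<String> {
    let mut reader = Cursor::new(input);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line)
}

/// Hypothetical helper: read one line into a `Vec<u8>`.
/// `BufRead::read_until` accepts any byte sequence.
fn read_line_as_bytes(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut reader = Cursor::new(input);
    let mut line = Vec::new();
    reader.read_until(b'\n', &mut line)?;
    Ok(line)
}
```

With Latin-1 input such as `b"caf\xe9\n"` (where `0xE9` is "é"), the string-based read should fail while the byte-based read returns the line unchanged.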

I didn't need to change any tests and I added three that work on non-UTF8 files.

There is a change in behavior that isn't covered by the tests: Unicode isn't "properly" handled anymore. Take this example, where "test.input" contains these emoji:

🐕‍🦺🐕‍🦺🐕‍🦺

Before my change:

% fold -w1 test.input
🐕

🦺
🐕

🦺
🐕

🦺

And after:

% fold -w1 test.input

That looks like a regression, but it matches GNU fold behavior:

% diff <(fold -w1 test.input) <(gfold -w1 test.input)
%
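The behavior change can be sketched as follows. This is a simplified illustration that assumes the old code counted width in `char`s and the new code counts it in bytes; it ignores tabs, backspaces, and the other options the real `fold` handles. The names `fold_by_chars` and `fold_by_bytes` are hypothetical, and both assume `width > 0`.

```rust
/// Hypothetical helper: break `line` after every `width` chars (Unicode scalar values).
fn fold_by_chars(line: &str, width: usize) -> Vec<String> {
    let chars: Vec<char> = line.chars().collect();
    chars
        .chunks(width)
        .map(|piece| piece.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper: break `line` after every `width` bytes,
/// regardless of whether a break falls inside a multi-byte sequence.
fn fold_by_bytes(line: &[u8], width: usize) -> Vec<Vec<u8>> {
    line.chunks(width).map(|piece| piece.to_vec()).collect()
}
```

Each 🐕‍🦺 is three scalar values (U+1F415, U+200D, U+1F9BA) encoded as 11 bytes in UTF-8. Folding one emoji at width 1 by chars gives three pieces, the middle one being the invisible zero-width joiner. Folding by bytes gives 11 one-byte pieces, none of which is valid UTF-8 on its own, which would explain why a terminal shows nothing readable.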

github-actions · 2025-06-22T08:51:01Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

github-actions · 2025-06-22T13:47:36Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)

RenjiSann

Thank you for your contribution !

A few questions:

What locale did you use to perform your tests ?
Can you check there is no discrepancy with GNU's fold with LC_ALL being en_US, en_US.UTF-8, C and C.UTF-8 ?

A few remarks on unwraps, but otherwise it looks good to me !

src/uu/fold/src/fold.rs

RenjiSann · 2025-06-22T23:41:07Z

As an extra remark, please squash your "clippy fixes" commit into the first one.
It is useful to split implementation and tests, and we will keep both commits in the main branch in case we ever need to revert or check something, but clippy fixes are not useful to keep track of.

Thanks !

phinjensen · 2025-06-23T03:08:01Z

* What locale did you use to perform your tests ?

Here's the locale I was using for all tests:

% locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

* Can you check there is no discrepancy with GNU's `fold` with `LC_ALL` being `en_US`, `en_US.UTF-8`, `C` and `C.UTF-8` ?

I've tried all three of my new tests with those locales and they all give the exact same output as each other (and as the coreutils fold).

I also consolidated all the code changes into one commit and re-pushed, so the commit history should be simpler now.

github-actions · 2025-06-23T06:55:01Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

RenjiSann · 2025-06-23T09:03:35Z

Thank you for your contribution !

RenjiSann reviewed Jun 22, 2025

phinjensen added 2 commits June 22, 2025 20:57

fold: handle non-utf8 streams

cad7d0f

tests/fold: add tests for non-utf8 streams

faa6a9b

phinjensen force-pushed the fold-non-utf8 branch from ffd3838 to faa6a9b June 23, 2025 03:05

phinjensen requested a review from RenjiSann June 23, 2025 03:09

RenjiSann merged commit b8228fb into uutils:main Jun 23, 2025
116 of 117 checks passed

BrewTestBot mentioned this pull request Sep 6, 2025

uutils-coreutils 0.2.0 Homebrew/homebrew-core#236403

Merged

RenjiSann merged 2 commits into uutils:main from phinjensen:fold-non-utf8
