Is there 'strings' command for utf-8? [closed]

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.

Closed 12 years ago.


The Linux command strings looks for ASCII strings in a binary file. Is there a command-line tool that shows UTF-8 strings in a binary file?

Related: sourceware.org/bugzilla/show_bug.cgi?id=27551 | sourceware.org/bugzilla/show_bug.cgi?id=9984 Here's one implementation: github.com/getreu/stringsext "stringsext is a Unicode enhancement of the GNU strings tool with additional functionalities: stringsext recognizes Cyrillic, Arabic, CJKV characters and other scripts in all supported multi-byte-encodings"
– Ciro Santilli OurBigBook.com, Commented Mar 14, 2024 at 20:12

hek2mgl · Accepted Answer · 2020-03-03 22:20:18Z

8

The strings command supports the --encoding option. Check the man page.

However, I failed to extract UTF-8 strings with any of the possible option values. I am currently searching the project's mailing list and will update this answer if I find more help.
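In case strings does not do the job on a given system, here is a minimal Python sketch of the same idea for UTF-8. It is not how strings works internally and not a replacement for tools such as stringsext; it only scans raw bytes for runs of well-formed UTF-8 characters. The function name utf8_strings, the min_chars parameter, and the choice to treat printable ASCII plus tab as "text" are assumptions made for this illustration.

```python
import re

# Byte-level alternatives for ONE well-formed UTF-8 character.
# Assumption for this sketch: on the one-byte side only printable ASCII
# and tab count as text, and the C1 control range U+0080..U+009F is skipped.
UTF8_CHAR = (
    rb"[\x09\x20-\x7E]"                       # printable ASCII and tab
    rb"|\xC2[\xA0-\xBF]"                      # 2 bytes, U+00A0..U+00BF
    rb"|[\xC3-\xDF][\x80-\xBF]"               # 2 bytes, U+00C0..U+07FF
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"           # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"    # 3 bytes
    rb"|\xED[\x80-\x9F][\x80-\xBF]"           # 3 bytes, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"        # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"            # 4 bytes
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"        # 4 bytes, up to U+10FFFF
)


def utf8_strings(data: bytes, min_chars: int = 4) -> list[str]:
    """Return runs of at least min_chars well-formed UTF-8 characters."""
    # The repetition counts characters, not bytes, because each repetition
    # of the group consumes exactly one encoded character.
    pattern = re.compile(b"(?:%b){%d,}" % (UTF8_CHAR, min_chars))
    return [m.group().decode("utf-8") for m in pattern.finditer(data)]


if __name__ == "__main__":
    # In-memory sample: text fragments separated by bytes that are not UTF-8.
    sample = (
        b"\x00\x01" + "Grüße, мир".encode("utf-8")
        + b"\xff\xfe" + b"abc\x00"
        + "日本語テキスト".encode("utf-8") + b"\x80"
    )
    for s in utf8_strings(sample):
        print(s)
```

To scan a real file, read it in binary mode (for example open(path, "rb").read()) and pass the bytes to utf8_strings. As with strings itself, arbitrary binary data can happen to form valid sequences, so expect some false positives.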

edited Mar 3, 2020 at 22:20

answered Jul 4, 2013 at 12:45



11 Comments

Alastair McCormack Over a year ago

UTF-8 characters have a variable byte width, which won't work with the fixed-width pattern matching that strings relies on.

12431234123412341234123 Over a year ago

On my Debian 9, strings -e S works with UTF-8. strings version: 2.28, LANG=de_CH.UTF-8.

hek2mgl Over a year ago

@12431234123412341234123 Thanks for the comment! I'll test it later and update the answer if I can reproduce it.

DevSolar Over a year ago

@hek2mgl: Indeed. The 128 characters that require only 7 bits to encode. Coincidentally, these are the 128 characters that UTF-8 encodes with one byte, because a set most significant bit is UTF-8's way of marking a byte as part of a multibyte sequence.

DevSolar Over a year ago

Can't make much sense of that last comment. The characters which UTF-8 encodes in one byte are code points U+0000 through U+007F. These are identical to the same range (0x00 through 0x7f) of ASCII-7, which incidentally is all of ASCII-7. For anything beyond that, like "Ä", UTF-8 uses two bytes or more (whereas ISO/IEC 8859 / "extended ASCII" uses the range of 0x80..0xff to encode additional characters in a collection of one-byte encodings). "UTF-8 characters which are encoded in one byte, which is more than ASCII" do not exist.
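The byte counts described in this comment can be checked with a few lines of Python. This is only an illustration; the helper name utf8_len and the sample characters are chosen for this sketch.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # U+0041 and U+007F fit in one byte; everything from U+0080 up needs more.
    for ch in ["A", "\x7f", "\x80", "Ä", "€", "😀"]:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), hex {encoded.hex()}")

    # Every byte of a multibyte sequence has its most significant bit set.
    print(all(b & 0x80 for b in "Ä".encode("utf-8")))

    # ISO/IEC 8859-1 (Latin-1) stores the same character in a single byte.
    print("Ä".encode("latin-1").hex())
```

Code points up to U+007F take one byte each, and "Ä" takes two bytes in UTF-8 but one byte in Latin-1, which matches the distinction drawn above.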
