The Linux command strings looks for ASCII strings in a binary file.
Are there any command line tools to show UTF-8 strings in a binary file?
No, there is not. – Weihang Jian, Sep 16, 2016 at 3:55
Related: sourceware.org/bugzilla/show_bug.cgi?id=27551 | sourceware.org/bugzilla/show_bug.cgi?id=9984. Here's one implementation: github.com/getreu/stringsext. "stringsext is a Unicode enhancement of the GNU strings tool with additional functionalities: stringsext recognizes Cyrillic, Arabic, CJKV characters and other scripts in all supported multi-byte-encodings" – Ciro Santilli OurBigBook.com, Mar 14, 2024 at 20:12
1 Answer
The strings command supports the --encoding option. Check the man page.
However, I failed to extract UTF-8 strings with any of the possible option values. I am currently searching their mailing list and will update this answer if I find more help.
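As a fallback while the --encoding question is open, here is a minimal Python sketch of a UTF-8-aware scan. It is not part of binutils and is not how strings works internally; the function name utf8_strings and the min_chars parameter are hypothetical names chosen for illustration. The default of 4 mirrors the default minimum length used by strings, and the byte ranges follow the well-formed UTF-8 table (no overlong forms, no surrogates).

```python
import re

# One well-formed UTF-8 character, written as a byte-level regex.
# Control bytes below 0x20 (except tab) and 0x7F are deliberately excluded.
_UTF8_CHAR = (
    rb"(?:[\x20-\x7E\t]"                    # printable ASCII and tab (1 byte)
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2-byte sequences
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, overlong forms excluded
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3-byte, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, surrogates excluded
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4-byte, overlong forms excluded
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4-byte, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"     # 4-byte, up to U+10FFFF
)


def utf8_strings(data: bytes, min_chars: int = 4):
    """Yield runs of at least min_chars well-formed UTF-8 characters.

    Hypothetical helper for illustration; the length is counted in
    characters, not bytes.
    """
    pattern = re.compile(_UTF8_CHAR + b"{%d,}" % min_chars)
    for match in pattern.finditer(data):
        # The pattern only matches well-formed sequences, so decoding is safe.
        yield match.group().decode("utf-8")


if __name__ == "__main__":
    # In-memory stand-in for a binary file; for a real file, use
    # open(path, "rb").read() instead.
    sample = (
        b"\x00\x01"
        + "Grüße, мир, 日本語".encode("utf-8")
        + b"\xff\xfeabc\x00"
        + "naïve café".encode("utf-8")
        + b"\x80"
    )
    for s in utf8_strings(sample):
        print(s)
```

Because each repetition of the pattern consumes one whole character, variable byte width is not a problem here. The trade-off is that random binary data can occasionally look like valid multibyte text, so expect more false positives than with an ASCII-only scan.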
11 Comments
Alastair McCormack
UTF-8 characters are variable byte width, which won't work with the fixed-width pattern-matching nature of strings.
12431234123412341234123
On my Debian 9, strings -e S works with UTF-8. strings version: 2.28, LANG=de_CH.UTF-8.
hek2mgl
@12431234123412341234123 Thanks for the comment! I'll test it later and update the answer if I can reproduce it.
DevSolar
@hek2mgl: Indeed. The 128 chars that require only 7 bits to encode. Coincidentally, the 128 chars that UTF-8 encodes with one byte, because a set most significant bit is UTF-8's way of marking a byte as part of a multibyte sequence.
DevSolar
Can't make much sense of that last comment. The characters which UTF-8 encodes in one byte are code points U+0000 through U+007F. These are identical with the same range (0x00 through 0x7f) of ASCII-7, which incidentally is all of ASCII-7. For anything beyond that, like "Ä", UTF-8 uses two bytes or more (whereas ISO/IEC 8859 / "extended ASCII" uses the range of 0x80..0xff to encode additional characters in a collection of one-byte encodings). "UTF-8 characters which are encoded in one byte, which is more than ASCII" do not exist.
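The byte widths described in this comment can be checked with a short Python sketch. The helper name utf8_len is hypothetical, and latin-1 (ISO/IEC 8859-1) is used only as one example of a one-byte "extended ASCII" encoding.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))


# Every code point from U+0000 through U+007F takes exactly one byte,
# and that byte is the same as its ASCII encoding.
assert all(utf8_len(cp) == 1 for cp in range(0x80))
assert all(chr(cp).encode("utf-8") == chr(cp).encode("ascii") for cp in range(0x80))

# The first code point past that range already needs two bytes.
assert utf8_len(0x80) == 2

# "Ä" (U+00C4): two bytes in UTF-8, one byte in ISO/IEC 8859-1.
assert "Ä".encode("utf-8") == b"\xc3\x84"
assert "Ä".encode("latin-1") == b"\xc4"

# Print the width of a few sample characters.
for ch in ["A", "\x7f", "Ä", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {encoded.hex(' ')}")
```

Every byte of a multibyte sequence has its most significant bit set, which is why a scan restricted to 7-bit bytes cannot see anything outside ASCII.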