Fix #70091: Phar does not mark UTF-8 filenames in ZIP archives#6630
Closed
cmb69 wants to merge 3 commits intophp:PHP-7.4from
Closed
Fix #70091: Phar does not mark UTF-8 filenames in ZIP archives#6630cmb69 wants to merge 3 commits intophp:PHP-7.4from
cmb69 wants to merge 3 commits intophp:PHP-7.4from
Conversation
The default encoding of filenames in a ZIP archive is IBM Code Page 437. Phar, however, only supports UTF-8 filenames. Therefore we have to mark non ASCII filenames as being stored in UTF-8 by setting the general purpose bit 11 (the language encoding flag). The effect of not setting this bit for non ASCII filenames can be seen in popular tools like 7-Zip and UnZip, but not when extracting the archives via ext/phar (which is agnostic to the filename encoding), or via ext/zip (which guesses the encoding). Thus we add a somewhat brittle low-level test case.
nikic
reviewed
Jan 25, 2021
ext/phar/zip.c
Outdated
| memcpy(central.datestamp, local.datestamp, sizeof(local.datestamp)); | ||
| PHAR_SET_16(central.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0)); | ||
| PHAR_SET_16(local.filename_len, entry->filename_len + (entry->is_dir ? 1 : 0)); | ||
| if (!is_ascii(entry->filename, entry->filename_len)) { |
Member
There was a problem hiding this comment.
Just wondering, would just unconditionally setting the flag be fine? ASCII and UTF-8 are the same when only ASCII characters are used.
Member
Author
There was a problem hiding this comment.
I don't see a real problem doing this unconditionally; if a ZIP tool doesn't cater to that flag, there still shouldn't be a difference regarding ASCII only filenames. OTOH, setting the flag conditionally, wouldn't cause any behavioral change for ASCII only filenames.
Member
Author
There was a problem hiding this comment.
I pushed a commit which would set the flag unconditionally. I'm fine with either solution.
Member
There was a problem hiding this comment.
Always setting the flag is less code, so if that works, let's go for it :)
nikic
approved these changes
Jan 26, 2021
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The default encoding of filenames in a ZIP archive is IBM Code Page
437. Phar, however, only supports UTF-8 filenames. Therefore we have
to mark non ASCII filenames as being stored in UTF-8 by setting the
general purpose bit 11 (the language encoding flag).
The effect of not setting this bit for non ASCII filenames can be seen
in popular tools like 7-Zip and UnZip, but not when extracting the
archives via ext/phar (which is agnostic to the filename encoding), or
via ext/zip (which guesses the encoding). Thus we add a somewhat
brittle low-level test case.