When dealing with attachments encoded via uuencoding (Content-transfer-encoding is uuencode or x-uuencode), mail-parser treats them as text, as can be seen in parse() (mailparser.py:378):
    if transfer_encoding == "base64" or (
            transfer_encoding == "quoted-printable" and "application" in mail_content_type):
        ...
    else:
        payload = ported_string(p.get_payload(decode=True), encoding=charset)
        log.debug("Filename {!r} part {!r} is not binary".format(filename, i))
Within the else block, the payload is correctly decoded with p.get_payload(decode=True), but it is then passed to ported_string(), which attempts to decode the returned bytes as text, first with the part's charset and then with UTF-8 as a fallback (utils.py:85):
    def ported_string(raw_data, encoding='utf-8', errors='ignore'):
        ...
        try:
            return six.text_type(raw_data, encoding).strip()
        except (LookupError, UnicodeDecodeError):
            return six.text_type(raw_data, "utf-8", errors).strip()
Since errors are ignored, decoding doesn't fail, but it returns an attachment stripped of every byte sequence that is not valid UTF-8, and strip() additionally removes any leading and trailing whitespace. This is easy to verify by writing the resulting attachment to disk with write_attachments.
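The data loss can be reproduced with the standard library alone. The following is a minimal sketch, not mail-parser's own code: the sample payload, the file name sample.bin and the hand-built message are assumptions made for illustration, and str(...) stands in for six.text_type on Python 3.

```python
import binascii
import hashlib
from email import message_from_string

# Hypothetical binary attachment: every possible byte value once.
original = bytes(range(256))

# uuencode by hand (45 input bytes per line) so the sketch does not rely on
# the uu module, which newer Python versions no longer ship.
lines = ["begin 644 sample.bin"]
for start in range(0, len(original), 45):
    chunk = original[start:start + 45]
    lines.append(binascii.b2a_uu(chunk).decode("ascii").rstrip("\n"))
lines += ["`", "end"]

raw_mail = (
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: x-uuencode\n"
    "\n" + "\n".join(lines) + "\n"
)

part = message_from_string(raw_mail)

# The email package uudecodes the body and returns the original bytes.
decoded = part.get_payload(decode=True)

# The same conversion as the fallback branch of ported_string():
# decode as UTF-8, ignore errors, strip whitespace.
as_text = str(decoded, "utf-8", "ignore").strip()
lossy = as_text.encode("utf-8")

print("decoded intact:", decoded == original)
print("bytes before/after text conversion:", len(decoded), len(lossy))
print("md5 of decoded bytes:", hashlib.md5(decoded).hexdigest())
print("md5 after text conversion:", hashlib.md5(lossy).hexdigest())
```

The bytes returned by get_payload(decode=True) match the input, while the round trip through text yields a shorter payload with a different digest.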
I encountered this issue while porting SpamScope to Python 3. SpamScope has a test, test_store_samples_unicode_error, that parses and saves a uuencoded attachment. According to the test, the resulting file should have an MD5 checksum of 2ea90c996ca28f751d4841e6c67892b8. The test passes with Python 2 because the incorrectly parsed payload does indeed have that hash, whereas with Python 3 the hash changes due to differences in Unicode handling. The correct checksum, however, is 4f2cf891e7cfb349fca812091f184ecc.
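One possible direction for a fix is to treat uuencoded parts like other binary parts and never run them through a text decoder. The sketch below is only an illustration under assumptions: payload_for_storage and BINARY_TRANSFER_ENCODINGS are hypothetical names that do not exist in mail-parser, and base64 is chosen here simply as one lossless way to keep the bytes in a string.

```python
import base64

# Assumption: parts with these transfer encodings should be kept as raw bytes.
BINARY_TRANSFER_ENCODINGS = {"base64", "uuencode", "x-uuencode"}


def payload_for_storage(raw_bytes, transfer_encoding, charset="utf-8"):
    """Return (payload, is_binary). Hypothetical helper, not mail-parser API."""
    if transfer_encoding.lower() in BINARY_TRANSFER_ENCODINGS:
        # Lossless: the original bytes can be recovered with b64decode().
        return base64.b64encode(raw_bytes).decode("ascii"), True
    try:
        text = raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Lossy fallback, acceptable only for parts that really are text.
        text = raw_bytes.decode("utf-8", "ignore")
    return text.strip(), False


# Usage: a uuencoded attachment survives unchanged.
blob = bytes(range(256))
payload, is_binary = payload_for_storage(blob, "x-uuencode")
print(is_binary, base64.b64decode(payload) == blob)
```

With a check of this kind in place, the MD5 of the written file would be computed over the undamaged bytes, regardless of the Python version.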