
UUencoded attachment parsing #80

@sylencecc


When dealing with uuencoded attachments (Content-Transfer-Encoding is uuencode or x-uuencode), mail-parser treats them as text, as can be seen in parse() (mailparser.py:378):

if transfer_encoding == "base64" or (
  transfer_encoding == "quoted-\
  printable" and "application" in mail_content_type):
    ...
else:
  payload = ported_string(p.get_payload(decode=True), encoding=charset)
  log.debug("Filename {!r} part {!r} is not binary".format(filename, i))

Within the else block, the payload is correctly decoded with p.get_payload(decode=True), but the resulting bytes are then passed to ported_string(), which attempts to decode them as text, falling back to UTF-8 (utils.py:85):

def ported_string(raw_data, encoding='utf-8', errors='ignore'):
...
  try:
    return six.text_type(raw_data, encoding).strip()
  except (LookupError, UnicodeDecodeError):
    return six.text_type(raw_data, "utf-8", errors).strip()

Since errors are ignored in the fallback, decoding doesn't fail, but it returns an attachment stripped of all bytes that are not valid UTF-8. This is easy to verify by writing the binary to disk with write_attachments.
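To illustrate the effect outside of mail-parser, here is a minimal sketch that uses only the standard library. The sample bytes, the file name sample.bin, and the hand-built message are assumptions made for the demonstration; they are not taken from the SpamScope test data. The last step only mimics the fallback branch of ported_string() shown above.

```python
import binascii
from email import message_from_string

# Hypothetical sample payload: includes bytes that are not valid UTF-8.
raw = bytes([0x00, 0xFF, 0xFE, 0x80, 0x41, 0x42])

# uuencode the payload by hand (b2a_uu handles one line of up to 45 bytes).
uu_line = binascii.b2a_uu(raw, backtick=True).decode("ascii")

message_text = "".join([
    'Content-Type: application/octet-stream; name="sample.bin"\n',
    "Content-Transfer-Encoding: x-uuencode\n",
    "\n",
    "begin 644 sample.bin\n",
    uu_line,   # already ends with a newline
    "`\n",     # zero-length line that closes the data
    "end\n",
])

part = message_from_string(message_text)

# The standard library decodes the uuencoded body back to the original bytes.
decoded = part.get_payload(decode=True)

# Mimic the fallback path: decode as UTF-8, ignore errors, strip whitespace.
lossy = decoded.decode("utf-8", errors="ignore").strip().encode("utf-8")

print(decoded == raw)            # the decoded payload is intact
print(len(raw), len(lossy))      # the text round trip drops bytes
```

In this sketch the bytes 0xFF, 0xFE and 0x80 are silently discarded by the text conversion, which is the same kind of loss described above for real attachments.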

I encountered this issue while porting SpamScope to Python 3. Its test test_store_samples_unicode_error parses and saves a uuencoded attachment. According to the test, the resulting file should have an MD5 checksum of 2ea90c996ca28f751d4841e6c67892b8. The test passes with Python 2 because the incorrectly parsed payload does indeed have that hash. With Python 3, the hash changes due to differences in Unicode handling. The correct checksum, however, is actually 4f2cf891e7cfb349fca812091f184ecc.
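A checksum comparison of this kind can be sketched as follows. The helper md5_hex and the sample bytes are hypothetical and only show why a payload that went through a lossy text conversion cannot match the checksum of the original binary; the hashes produced here are unrelated to the two values quoted above.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Hypothetical attachment content with bytes that are not valid UTF-8.
original = bytes([0x00, 0xFF, 0xFE, 0x80, 0x41, 0x42])

# The same content after a decode/encode round trip that ignores errors.
lossy = original.decode("utf-8", errors="ignore").encode("utf-8")

md5_original = md5_hex(original)
md5_lossy = md5_hex(lossy)

# The digests differ because the round trip removed bytes.
print(md5_original != md5_lossy)
```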
