Commit 4305454

Author: martin.v.loewis

Issue #5915: Implement PEP 383, Non-decodable Bytes in
System Character Interfaces.

git-svn-id: http://svn.python.org/projects/python/branches/py3k@72313 6015fed2-1504-0410-9fe1-9d1591cc4771
1 parent 1835bf9 commit 4305454

15 files changed

Lines changed: 726 additions & 289 deletions


Doc/library/codecs.rst

Lines changed: 3 additions & 1 deletion
@@ -322,6 +322,8 @@ and implemented by all standard Python codecs:
 | ``'backslashreplace'``  | Replace with backslashed escape sequences     |
 |                         | (only for encoding).                          |
 +-------------------------+-----------------------------------------------+
+| ``'utf8b'``             | Replace byte with surrogate U+DCxx.           |
++-------------------------+-----------------------------------------------+

 In addition, the following error handlers are specific to a single codec:

@@ -333,7 +335,7 @@ In addition, the following error handlers are specific to a single codec:
 +------------------+---------+--------------------------------------------+

 .. versionadded:: 3.1
-   The ``'surrogates'`` error handler.
+   The ``'utf8b'`` and ``'surrogates'`` error handlers.

 The set of allowed values can be extended via :meth:`register_error`.
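A minimal Python sketch of the round trip that the new table row describes. It assumes a released Python 3 interpreter, where the PEP 383 handler is registered under the name `'surrogateescape'`; `'utf8b'` is the name used in this commit. The variable names are illustrative only.

```python
# Assumption: "surrogateescape" is the released name of the handler that
# this commit registers as "utf8b".
HANDLER = "surrogateescape"

raw = b"foo\x80bar"  # a lone 0x80 byte is not valid UTF-8

# Strict decoding rejects the stray byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The handler maps the undecodable byte 0x80 to the lone surrogate U+DC80.
text = raw.decode("utf-8", HANDLER)

# Encoding with the same handler turns U+DC80 back into the byte 0x80.
restored = text.encode("utf-8", HANDLER)

print(strict_failed, ascii(text), restored)
```

The point of the handler is that decoding never loses information: whatever bytes went in come back out unchanged on encoding.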

Doc/library/os.rst

Lines changed: 28 additions & 10 deletions
@@ -51,6 +51,30 @@ the :mod:`os` module, but using them is of course a threat to portability!
    ``'ce'``, ``'java'``.


+.. _os-filenames:
+
+File Names, Command Line Arguments, and Environment Variables
+-------------------------------------------------------------
+
+In Python, file names, command line arguments, and environment
+variables are represented using the string type. On some systems,
+decoding these strings to and from bytes is necessary before passing
+them to the operating system. Python uses the file system encoding to
+perform this conversion (see :func:`sys.getfilesystemencoding`).
+
+.. versionchanged:: 3.1
+   On some systems, conversion using the file system encoding may
+   fail. In this case, Python uses the ``utf8b`` encoding error
+   handler, which means that undecodable bytes are replaced by a
+   Unicode character U+DCxx on decoding, and these are again
+   translated to the original byte on encoding.
+
+
+The file system encoding must guarantee to successfully decode all
+bytes below 128. If the file system encoding fails to provide this
+guarantee, API functions may raise UnicodeErrors.
+
+
 .. _os-procinfo:

 Process Parameters
@@ -688,12 +712,8 @@ Files and Directories

 .. function:: getcwd()

-   Return a string representing the current working directory. On Unix
-   platforms, this function may raise :exc:`UnicodeDecodeError` if the name of
-   the current directory is not decodable in the file system encoding. Use
-   :func:`getcwdb` if you need the call to never fail. Availability: Unix,
-   Windows.
-
+   Return a string representing the current working directory.
+   Availability: Unix, Windows.

 .. function:: getcwdb()

@@ -800,10 +820,8 @@ Files and Directories
    entries ``'.'`` and ``'..'`` even if they are present in the directory.
    Availability: Unix, Windows.

-   This function can be called with a bytes or string argument. In the bytes
-   case, all filenames will be listed as returned by the underlying API. In the
-   string case, filenames will be decoded using the file system encoding, and
-   skipped if a decoding error occurs.
+   This function can be called with a bytes or string argument, and returns
+   filenames of the same datatype.


 .. function:: lstat(path)
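The documented behaviour of `listdir` and `getcwd` above can be checked with a short Python sketch. It assumes a released Python 3 (3.2 or later) where `os.fsencode` is available; the file name `example.txt` and the variable names are illustrative.

```python
import os
import tempfile

# Create a scratch directory holding one plainly named file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass
    # A str argument yields str names; a bytes argument yields bytes names.
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))

# getcwd() and getcwdb() describe the same directory in the two datatypes.
cwd_text = os.getcwd()
cwd_bytes = os.getcwdb()

print(str_names, bytes_names)
```

Because the string form can always be encoded back to the original bytes, the string variants no longer need to skip entries or raise on undecodable names.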

Include/unicodeobject.h

Lines changed: 29 additions & 19 deletions
@@ -198,6 +198,7 @@ typedef PY_UNICODE_TYPE Py_UNICODE;
 # define PyUnicode_FromStringAndSize PyUnicodeUCS2_FromStringAndSize
 # define PyUnicode_FromUnicode PyUnicodeUCS2_FromUnicode
 # define PyUnicode_FromWideChar PyUnicodeUCS2_FromWideChar
+# define PyUnicode_FSConverter PyUnicodeUCS2_FSConverter
 # define PyUnicode_GetDefaultEncoding PyUnicodeUCS2_GetDefaultEncoding
 # define PyUnicode_GetMax PyUnicodeUCS2_GetMax
 # define PyUnicode_GetSize PyUnicodeUCS2_GetSize
@@ -296,6 +297,7 @@ typedef PY_UNICODE_TYPE Py_UNICODE;
 # define PyUnicode_FromStringAndSize PyUnicodeUCS4_FromStringAndSize
 # define PyUnicode_FromUnicode PyUnicodeUCS4_FromUnicode
 # define PyUnicode_FromWideChar PyUnicodeUCS4_FromWideChar
+# define PyUnicode_FSConverter PyUnicodeUCS4_FSConverter
 # define PyUnicode_GetDefaultEncoding PyUnicodeUCS4_GetDefaultEncoding
 # define PyUnicode_GetMax PyUnicodeUCS4_GetMax
 # define PyUnicode_GetSize PyUnicodeUCS4_GetSize
@@ -693,25 +695,6 @@ PyAPI_FUNC(PyObject *) _PyUnicode_AsDefaultEncodedString(
     PyObject *unicode,
     const char *errors);

-/* Decode a null-terminated string using Py_FileSystemDefaultEncoding.
-
-   If the encoding is supported by one of the built-in codecs (i.e., UTF-8,
-   UTF-16, UTF-32, Latin-1 or MBCS), otherwise fallback to UTF-8 and replace
-   invalid characters with '?'.
-
-   The function is intended to be used for paths and file names only
-   during bootstrapping process where the codecs are not set up.
-*/
-
-PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefault(
-    const char *s /* encoded string */
-    );
-
-PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefaultAndSize(
-    const char *s, /* encoded string */
-    Py_ssize_t size /* size */
-    );
-
 /* Returns a pointer to the default encoding (normally, UTF-8) of the
    Unicode object unicode and the size of the encoded representation
    in bytes stored in *size.
@@ -1252,6 +1235,33 @@ PyAPI_FUNC(int) PyUnicode_EncodeDecimal(
     const char *errors /* error handling */
     );

+/* --- File system encoding ---------------------------------------------- */
+
+/* ParseTuple converter which converts a Unicode object into the file
+   system encoding, using the PEP 383 error handler; bytes objects are
+   output as-is. */
+
+PyAPI_FUNC(int) PyUnicode_FSConverter(PyObject*, void*);
+
+/* Decode a null-terminated string using Py_FileSystemDefaultEncoding.
+
+   If the encoding is supported by one of the built-in codecs (i.e., UTF-8,
+   UTF-16, UTF-32, Latin-1 or MBCS), otherwise fallback to UTF-8 and replace
+   invalid characters with '?'.
+
+   The function is intended to be used for paths and file names only
+   during bootstrapping process where the codecs are not set up.
+*/
+
+PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefault(
+    const char *s /* encoded string */
+    );
+
+PyAPI_FUNC(PyObject*) PyUnicode_DecodeFSDefaultAndSize(
+    const char *s, /* encoded string */
+    Py_ssize_t size /* size */
+    );
+
 /* --- Methods & Slots ----------------------------------------------------

    These are capable of handling Unicode objects and strings on input
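The comment on `PyUnicode_FSConverter` above describes a simple rule: strings are encoded with the file system encoding and the PEP 383 handler, and bytes pass through unchanged. A rough Python analogue of that rule might look like the sketch below. The function `fs_converter` and its parameters are hypothetical and not part of the commit, and the handler name `'surrogateescape'` is the one used in released Python versions rather than `'utf8b'`.

```python
import sys


def fs_converter(path, encoding=None, errors="surrogateescape"):
    """Hypothetical Python analogue of the C converter (illustration only).

    bytes are returned as-is; str is encoded with the file system encoding
    (or an explicitly supplied one) using the PEP 383 error handler.
    """
    if isinstance(path, bytes):
        return path
    if isinstance(path, str):
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        return path.encode(encoding, errors)
    raise TypeError("expected str or bytes, got %s" % type(path).__name__)


# bytes pass through untouched.
passthrough = fs_converter(b"foo\xf6bar")
# A lone surrogate U+DCF6 is turned back into the raw byte 0xF6.
escaped = fs_converter("foo\udcf6bar", encoding="utf-8")
# An ordinary non-ASCII character is encoded normally.
regular = fs_converter("foo\xf6bar", encoding="utf-8")
```

Released Python versions expose a comparable helper at the Python level as `os.fsencode`; the explicit `encoding` parameter here exists only to make the example deterministic.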

Lib/test/test_codecs.py

Lines changed: 29 additions & 0 deletions
@@ -1516,6 +1516,34 @@ def test_unicode_escape(self):
         self.assertEquals(codecs.raw_unicode_escape_decode(r"\u1234"), ("\u1234", 6))
         self.assertEquals(codecs.raw_unicode_escape_decode(br"\u1234"), ("\u1234", 6))

+class Utf8bTest(unittest.TestCase):
+
+    def test_utf8(self):
+        # Bad byte
+        self.assertEqual(b"foo\x80bar".decode("utf-8", "utf8b"),
+                         "foo\udc80bar")
+        self.assertEqual("foo\udc80bar".encode("utf-8", "utf8b"),
+                         b"foo\x80bar")
+        # bad-utf-8 encoded surrogate
+        self.assertEqual(b"\xed\xb0\x80".decode("utf-8", "utf8b"),
+                         "\udced\udcb0\udc80")
+        self.assertEqual("\udced\udcb0\udc80".encode("utf-8", "utf8b"),
+                         b"\xed\xb0\x80")
+
+    def test_ascii(self):
+        # bad byte
+        self.assertEqual(b"foo\x80bar".decode("ascii", "utf8b"),
+                         "foo\udc80bar")
+        self.assertEqual("foo\udc80bar".encode("ascii", "utf8b"),
+                         b"foo\x80bar")
+
+    def test_charmap(self):
+        # bad byte: \xa5 is unmapped in iso-8859-3
+        self.assertEqual(b"foo\xa5bar".decode("iso-8859-3", "utf8b"),
+                         "foo\udca5bar")
+        self.assertEqual("foo\udca5bar".encode("iso-8859-3", "utf8b"),
+                         b"foo\xa5bar")
+

 def test_main():
     support.run_unittest(
@@ -1543,6 +1571,7 @@ def test_main():
         CharmapTest,
         WithStmtTest,
         TypesTest,
+        Utf8bTest,
     )
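The three test methods above exercise the same property across different codecs. A compact way to see that, sketched for a released Python 3 where the handler is named `'surrogateescape'` instead of `'utf8b'`, is a single helper applied to each case. The helper name `roundtrip` is illustrative and not part of the commit.

```python
# Assumption: "surrogateescape" is the released name of this commit's "utf8b".
HANDLER = "surrogateescape"


def roundtrip(raw, encoding):
    """Decode raw with the PEP 383 handler, then encode the result back."""
    text = raw.decode(encoding, HANDLER)
    return text, text.encode(encoding, HANDLER)


# The same inputs as the commit's tests: (raw bytes, codec name).
cases = [
    (b"foo\x80bar", "utf-8"),       # stray continuation byte
    (b"\xed\xb0\x80", "utf-8"),     # UTF-8-encoded surrogate, itself invalid
    (b"foo\x80bar", "ascii"),       # byte outside the ASCII range
    (b"foo\xa5bar", "iso-8859-3"),  # 0xA5 is unmapped in ISO 8859-3
]

results = [roundtrip(raw, encoding) for raw, encoding in cases]

for (raw, encoding), (text, restored) in zip(cases, results):
    print(encoding, ascii(text), restored == raw)
```

Each undecodable byte 0xNN becomes U+DCNN regardless of the codec, which is why one handler can serve UTF-8, ASCII, and charmap-based codecs alike.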

Lib/test/test_os.py

Lines changed: 38 additions & 1 deletion
@@ -7,6 +7,7 @@
 import unittest
 import warnings
 import sys
+import shutil
 from test import support

 # Tests creating TESTFN
@@ -698,9 +699,44 @@ def test_setregid(self):
                     self.assertRaises(os.error, os.setregid, 0, 0)
                 self.assertRaises(OverflowError, os.setregid, 1<<32, 0)
                 self.assertRaises(OverflowError, os.setregid, 0, 1<<32)
+
+    class Pep383Tests(unittest.TestCase):
+        filenames = [b'foo\xf6bar', 'foo\xf6bar'.encode("utf-8")]
+
+        def setUp(self):
+            self.fsencoding = sys.getfilesystemencoding()
+            sys.setfilesystemencoding("utf-8")
+            self.dir = support.TESTFN
+            self.bdir = self.dir.encode("utf-8", "utf8b")
+            os.mkdir(self.dir)
+            self.unicodefn = []
+            for fn in self.filenames:
+                f = open(os.path.join(self.bdir, fn), "w")
+                f.close()
+                self.unicodefn.append(fn.decode("utf-8", "utf8b"))
+
+        def tearDown(self):
+            shutil.rmtree(self.dir)
+            sys.setfilesystemencoding(self.fsencoding)
+
+        def test_listdir(self):
+            expected = set(self.unicodefn)
+            found = set(os.listdir(support.TESTFN))
+            self.assertEquals(found, expected)
+
+        def test_open(self):
+            for fn in self.unicodefn:
+                f = open(os.path.join(self.dir, fn))
+                f.close()
+
+        def test_stat(self):
+            for fn in self.unicodefn:
+                os.stat(os.path.join(self.dir, fn))
 else:
     class PosixUidGidTests(unittest.TestCase):
         pass
+    class Pep383Tests(unittest.TestCase):
+        pass

 def test_main():
     support.run_unittest(
@@ -714,7 +750,8 @@ def test_main():
         ExecTests,
         Win32ErrorTests,
         TestInvalidFD,
-        PosixUidGidTests
+        PosixUidGidTests,
+        Pep383Tests
     )

 if __name__ == "__main__":
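A rough modern counterpart of `Pep383Tests`, written as a standalone Python sketch. It assumes a released Python 3 (3.2 or later) with `os.fsencode` and `tempfile.TemporaryDirectory`, and it does not switch the file system encoding the way the commit's test does. The helper `roundtrip_filename` is hypothetical; it returns `None` on platforms or file systems that refuse the raw byte sequence.

```python
import os
import tempfile


def roundtrip_filename(raw_name):
    """Create a file named raw_name and list it back as str and as bytes.

    Returns (str_names, bytes_names), or None when the platform or the
    file system rejects the byte sequence.
    """
    with tempfile.TemporaryDirectory() as tmp:
        btmp = os.fsencode(tmp)
        try:
            with open(os.path.join(btmp, raw_name), "wb"):
                pass
        except (OSError, UnicodeError):
            return None
        str_names = os.listdir(tmp)     # undecodable bytes become U+DCxx
        bytes_names = os.listdir(btmp)  # names as returned by the system
        for name in str_names:
            # The escaped str name is usable in ordinary os calls.
            os.stat(os.path.join(tmp, name))
        return str_names, bytes_names


# 0xF6 is "ö" in Latin-1 but is not valid UTF-8 on its own.
result = roundtrip_filename(b"foo\xf6bar")
print(result)
```

On a typical POSIX system with a UTF-8 file system encoding, the string listing contains a lone surrogate in place of the byte 0xF6, and encoding that name again yields the original bytes.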

Misc/NEWS

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ What's New in Python 3.1 beta 1?
 Core and Builtins
 -----------------

+- Implement PEP 383, Non-decodable Bytes in System Character Interfaces.
+
 - Issue #5890: in subclasses of 'property' the __doc__ attribute was
   shadowed by classtype's, even if it was None. property now
   inserts the __doc__ into the subclass instance __dict__.

Modules/_io/fileio.c

Lines changed: 1 addition & 1 deletion
@@ -245,7 +245,7 @@ fileio_init(PyObject *oself, PyObject *args, PyObject *kwds)
             return -1;

         stringobj = PyUnicode_AsEncodedString(
-            u, Py_FileSystemDefaultEncoding, NULL);
+            u, Py_FileSystemDefaultEncoding, "utf8b");
         Py_DECREF(u);
         if (stringobj == NULL)
             return -1;
