Current py3k stage and next steps

Antoine Pitrou solipsis at pitrou.net
Fri Jun 25 22:21:40 UTC 2010


Matt Mackall <mpm <at> selenic.com> writes:
> 
> The tricky part is this:
> 
> ui.write() and the like are used to handle three kinds of data:
> 
> - utf-8 encoded metadata that's been transcoded to the local encoding
> - internal ASCII messages that may or may not go through gettext()
> before being presented to the user in the local encoding
> - raw byte data that is presented to the user byte-for-byte as-is
> 
> In the last case, it's unacceptable to do any form of transcoding even
> if we knew what encoding the data was in (which we don't and which is
> not possible in the general case).

The 3.x IO subsystem is layered: sys.stdout is a text (unicode) layer, but you
can access sys.stdout.buffer, which is the underlying buffered bytes layer. Of
course, when mixing the two, it is better to flush() the text layer before
writing to the buffer (even though it might not appear necessary in interactive
use):

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdout.write("some unicode text: é\n")
some unicode text: é
21
>>> sys.stdout.buffer.write(b"some undecodable bytes: \x00\xff\n")
some undecodable bytes: �
27

So, basically, ui.write() could output both unicode strings and bytestrings, if
you are willing to let Python handle the encoding of unicode strings to the
output device.
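
Here is a minimal sketch of what such a dual-mode write helper could look like
(my illustration only; the name and the type-based dispatch are assumptions, not
Mercurial's actual ui.write()):

import sys

def write_any(data):
    if isinstance(data, bytes):
        # Flush the text layer first so output ordering is preserved,
        # then pass the raw bytes through untouched.
        sys.stdout.flush()
        sys.stdout.buffer.write(data)
    else:
        # Let the text layer encode with sys.stdout.encoding.
        sys.stdout.write(data)

write_any("some unicode text: é\n")
write_any(b"raw bytes, written as-is: \x00\xff\n")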

> If Unicode had, say, a codeplane to represent "unknown byte 0x??" such
> that arbitrary byte strings could round-trip losslessly to Unicode, none
> of this would be a problem (except for overhead). But since that's not
> possible, Unicode strings are a bad fit for much of what Mercurial
> does. 

Python 3.x does allow you to round-trip arbitrary bytes through unicode and back,
losslessly, by using the "surrogateescape" error handler. It translates all
undecodable bytes to lone unicode surrogates, and does the reverse operation
when encoding:

>>> b"valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff".decode("utf-8",
"surrogateescape")
'valid UTF-8: é ; invalid UTF-8: \udcff'
>>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8", "surrogateescape")
b'valid UTF-8: \xc3\xa9 ; invalid UTF-8: \xff'

Moreover, lone surrogates can never result from decoding legal UTF-8, and the
strict codec refuses to encode them, so escaped bytes cannot be confused with
normally decoded text:

>>> 'valid UTF-8: é ; invalid UTF-8: \udcff'.encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 32: surrogates not allowed

... which ensures that the transformation is bijective: every byte string maps
to a distinct unicode string and back.

(the relevant PEP is PEP 383: http://www.python.org/dev/peps/pep-0383/)
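
As a quick sanity check (my example, not from the original message), every
possible byte value survives the round-trip:

>>> data = bytes(range(256))
>>> data.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape") == data
True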

Of course, this may not be right for all use cases. But if your concern is to be
able to integrate arbitrary byte data into a unicode-processing step without
having to fear UnicodeErrors, it can be a good fit.
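
For instance (an illustration with made-up data), a text-level operation such as
str.replace() works transparently on such a string, and the undecodable byte
survives the round-trip:

>>> raw = b"path: /tmp/caf\xe9"
>>> text = raw.decode("utf-8", "surrogateescape")
>>> text.replace("/tmp", "/var/tmp").encode("utf-8", "surrogateescape")
b'path: /var/tmp/caf\xe9'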

(FWIW, it also allows filename-returning functions such as os.listdir() to
always return unicode strings without failing on undecodable filenames)
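
A short sketch of recovering the original on-disk bytes from such a filename
(assuming a POSIX system, where the filesystem encoding uses the surrogateescape
handler):

import os, sys

fs_encoding = sys.getfilesystemencoding()
for name in os.listdir("."):
    # `name` is always a str; undecodable bytes show up as lone surrogates.
    # Re-encoding with "surrogateescape" recovers the exact on-disk bytes.
    raw_name = name.encode(fs_encoding, "surrogateescape")
    print(repr(raw_name))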

Regards

Antoine.




