[PATCH 4 of 8] encode all output in stdio encoding
Andrey
grooz-work at gorodok.net
Mon Nov 20 20:04:04 UTC 2006
On 21 November 2006 (Tue) 01:15, Matt Mackall wrote:
> On Mon, Nov 20, 2006 at 04:34:20PM -0200, Alexis S. L. Carvalho wrote:
> > Thus spake Alexis S. L. Carvalho:
> > > Thus spake Andrey:
> > > > I should better have written something like 'encode all output in
> > > > stdio encoding, if not already encoded' in commit message. :) That
> > > > ui.ui.encode() function leaves all non-Unicode strings untouched, so
> > > > hg cat works as expected.
> > >
> > > It prints a traceback with hg log --patch with a revision that changes
> > > the encoding of a file.
> >
> > Hmm... ok, it doesn't even get to ui.write - the current log code puts
> > all strings in a list and does a ui.write("".join(strings)). This
> > patchset changes some of these strings from str's to unicode's, and so
> > the "".join() raises an exception when it fails to convert the patch to
> > a unicode.
>
> This is a great example of why having a mix of Unicode and regular
> strings in an app travelling the same paths is generally Not A Good
> Idea. Especially as one of our primary concerns as an SCM is to pass
> all data through the system unmangled.
>
> Regular strings never throw exceptions. Functions that were written to
> work on regular strings will explode in unexpected places when passed
> unicode strings. That's bad. And retrofitting code to accept both is
> complicated.
>
> Especially given that we generally _don't_ know the encoding of the
> data we're manipulating. As far as I know, Unicode doesn't have an
> encoding that says "I don't know what this is, it might be binary for
> all I know, don't complain, and when you encode it back to 8-bit, it
> must be exactly identical."
>
> Going the other way, manipulations on regular encoded strings will
> generally work. Operations that fail are things like upper(), lower(),
> grep with mismatched encodings, and truncation that happens to chop
> inside a character. And their failure modes are relatively harmless.
> For instance, about the only significant user of lower is log -k,
> which will continue to work roughly as advertised.
Indeed, it was a bad idea to treat Unicode and byte strings in the same way.
But that does not mean we should not use Unicode at all. We just have to
clearly distinguish between Unicode data and byte data. For example, log
messages are obviously Unicode data, and so are user names, because they
represent textual information and their exact byte representation is
unimportant. The contents of revision-controlled files (and thus diffs and
grep results), on the other hand, are byte data, and it is a good idea to
use byte strings for them.
The problems arise when we try to treat byte strings as Unicode strings and
vice versa. For example, byte strings must be sent to (or read from) the
terminal as-is, while Unicode strings have to be encoded with the proper
encoding before output and decoded after input. And every
UnicodeDecodeError tells us that something is wrong with our encodings and
needs to be fixed: for example, that Unicode data is not being properly
encoded before it is written to stdout. If we use UTF-8 byte strings instead
of Unicode, we will not get any exceptions, but the bugs will not vanish by
themselves.
They will just get less obvious. So I'd say that using UTF-8 byte strings is
burying our heads in the sand. And in fact, Python 3000 is going to get rid
of the good old byte strings altogether (Unicode strings will be used by
default, and there will also be a mutable 'bytes' type without any
string-like methods, which will never be coerced to Unicode automatically).
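
To make the distinction concrete, here is a minimal sketch. It is written in
Python 3 syntax (str is text, bytes is raw data), not in the Python 2 that the
patch targets, and the helpers encode_for_output() and write_all() are
hypothetical names used for illustration only; they are not the ui.ui.encode()
from the patch. Text is encoded once, at the output boundary, and byte data is
passed through untouched.

```python
import io


def encode_for_output(data, encoding="utf-8"):
    """Return bytes suitable for writing to a byte stream.

    Hypothetical helper: text (str) is encoded with the output encoding,
    byte data is returned unchanged.
    """
    if isinstance(data, str):
        # Textual data: the exact byte representation is unimportant,
        # so encode it for whatever the terminal expects.
        return data.encode(encoding, "replace")
    if isinstance(data, bytes):
        # Byte data (file contents, diffs): must pass through as-is.
        return data
    raise TypeError("expected str or bytes, got %s" % type(data).__name__)


def write_all(stream, parts, encoding="utf-8"):
    """Write a mixed sequence of text and byte chunks to a byte stream."""
    for part in parts:
        stream.write(encode_for_output(part, encoding))


log_message = "caf\u00e9"        # text: a log message
patch_body = b"\xf1\xe0\xff\n"   # bytes: file data in an unknown encoding

out = io.BytesIO()               # stands in for a binary stdout
write_all(out, ["summary: ", log_message, "\n", patch_body],
          encoding="latin-1")

# Joining text and bytes directly fails loudly instead of mangling data.
try:
    "".join(["summary: ", patch_body])
    join_failed = False
except TypeError:
    join_failed = True
```

The point of the sketch is only that the decision "encode or pass through" is
made by the type of the data, in one place, rather than by guessing inside a
"".join() call.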
Andrey