[PATCH 4 of 8] encode all output in stdio encoding

Matt Mackall mpm at selenic.com
Mon Nov 20 21:14:02 UTC 2006


On Tue, Nov 21, 2006 at 02:04:04AM +0600, Andrey wrote:
> On 21 November 2006 (Tue) 01:15, Matt Mackall wrote:
> > On Mon, Nov 20, 2006 at 04:34:20PM -0200, Alexis S. L. Carvalho wrote:
> > > Thus spake Alexis S. L. Carvalho:
> > > > Thus spake Andrey:
> > > > > I should have written something like 'encode all output in
> > > > > stdio encoding, if not already encoded' in the commit message. :)
> > > > > That ui.ui.encode() function leaves all non-Unicode strings
> > > > > untouched, so hg cat works as expected.
> > > >
> > > > It prints a traceback with hg log --patch with a revision that changes
> > > > the encoding of a file.
> > >
> > > Hmm...  ok, it doesn't even get to ui.write - the current log code puts
> > > all strings in a list and does a ui.write("".join(strings)).  This
> > > patchset changes some of these strings from str's to unicode's, and so
> > > the "".join() raises an exception when it fails to convert the patch to
> > > a unicode.
> >
> > This is a great example of why having a mix of Unicode and regular
> > strings in an app travelling the same paths is generally Not A Good
> > Idea. Especially as one of our primary concerns as an SCM is to pass
> > all data through the system unmangled.
> >
> > Regular strings never throw exceptions. Functions that were written to
> > work on regular strings will explode in unexpected places when passed
> > unicode strings. That's bad. And retrofitting code to accept both is
> > complicated.
> >
> > Especially given that we generally _don't_ know the encoding of the
> > data we're manipulating. As far as I know, Unicode doesn't have an
> > encoding that says "I don't know what this is, it might be binary for
> > all I know, don't complain, and when you encode it back to 8-bit, it
> > must be exactly identical."
> >
> > Going the other way, manipulations on regular encoded strings will
> > generally work. Operations that fail are things like upper(), lower(),
> > grep with mismatched encodings, and truncation that happens to chop
> > inside a character. And their failure modes are relatively harmless.
> > For instance, about the only significant user of lower is log -k,
> > which will continue to work roughly as advertised.
> 
> Indeed, it was a bad idea to treat Unicode and byte strings in the same way. 
> But that does not mean we should not use Unicode at all.

No, what it means is that its usage should be as isolated as possible
from the rest of the system. 

There are three options:

a) use all wide (unicode) strings internally
b) mix wide and regular strings internally
c) convert all wide strings to regular strings as quickly as possible

In some ways (a) is the best answer, but I'm afraid it's right out
because there's no way to represent arbitrary binary bytes in Unicode
in a way that survives a round trip.
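
A minimal sketch of the round-trip problem. It assumes Python 3 syntax
(where bytes plays the role of the regular string) and UTF-8 as the
guessed encoding; the sample bytes are made up for illustration:

```python
# Arbitrary binary data, e.g. a chunk of a file whose encoding is unknown.
raw = b"caf\xe9 \xff\xfe\x00binary"

# A strict decode refuses the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode succeeds, but every undecodable byte is replaced by
# U+FFFD, so encoding back does not reproduce the original bytes.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

print(strict_ok)     # False
print(lossy == raw)  # False
```

Either the decode fails or the data comes back changed, and neither is
acceptable for file contents.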

Option (b) is prohibitive because of the automatic coercion of regular
strings to Unicode. Every place where two strings are used together is
a potential exception, not because there's an actual logic bug but
because there's no way to coerce binary data to Unicode. This isn't a
one-time expense either; it's an ongoing complexity cost.
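
To illustrate, here is a rough sketch of what the implicit coercion
amounts to. It is written in Python 3 syntax and emulates the coercion
by hand with an explicit ASCII decode; coerce_and_join and the sample
strings are hypothetical, not code from the patch:

```python
def coerce_and_join(parts):
    """Join mixed wide/byte strings, decoding byte strings as ASCII
    first. This emulates the implicit coercion; it is not real hg code."""
    return "".join(
        p if isinstance(p, str) else p.decode("ascii") for p in parts
    )

header = "changeset: 42\n"    # wide string produced by new code
ascii_patch = b"+hello\n"     # byte string that happens to be pure ASCII
other_patch = b"+caf\xe9\n"   # byte string in some unknown 8-bit encoding

# Same code path, no logic bug: this input works...
joined = coerce_and_join([header, ascii_patch])

# ...and this one blows up, purely because of the data.
try:
    coerce_and_join([header, other_patch])
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(joined))
print(failed)
```

The failure depends only on what bytes show up at runtime, which is why
every such join needs auditing.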

Option (c) is admittedly not perfect, but its imperfections are ones
we can live with and work around more easily.
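
One possible shape for (c), as a sketch only. It uses Python 3 syntax,
and to_local with its default encoding is a hypothetical helper, not an
existing Mercurial function:

```python
def to_local(s, encoding="utf-8"):
    """Hypothetical boundary helper: encode a wide string to a byte
    string right away; pass byte strings through untouched."""
    if isinstance(s, str):
        return s.encode(encoding)
    return s

user = to_local("Andr\u00e9")    # wide string from a Unicode-aware API
patch = b"+caf\xe9 \xff\x00\n"   # file data, encoding unknown

# Past the boundary everything is bytes, so joining never has to decode
# anything and the file data passes through unmodified.
out = b"".join([b"user: ", user, b"\n", patch])

print(out.endswith(patch))
```

The conversion happens once, at the edge, and the rest of the code only
ever sees regular strings.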

> And in fact, Python 3000 is going to get rid of the good old byte
> strings altogether (Unicode strings will be the default, and there
> will also be a mutable 'bytes' type without any string-like methods,
> which will never be coerced to Unicode automatically).

Well thankfully that's still 994 years off.

-- 
Mathematics is the supreme nostalgia of our time.


