[PATCH 2 of 3] revset: transcode revsets to UTF-8

Matt Mackall mpm at selenic.com
Tue Nov 16 19:33:57 UTC 2010


On Tue, 2010-11-16 at 09:57 +0100, Martin Geisler wrote:
> Matt Mackall <mpm at selenic.com> writes:
> 
> > On Tue, 2010-11-16 at 00:43 +0100, Martin Geisler wrote:
> >
> >> Well, it works just the same for commit:
> >> 
> >>   % echo >> a.txt && hg commit -m bøb
> >>   transaction abort!
> >>   rollback completed
> >>   abort: decoding near 'bøb': 'ascii' codec can't decode byte 0xf8 in
> >>   position 1: ordinal not in range(128)!
> >> 
> >>   % echo >> a.txt && LC_ALL=en_US.UTF-8 hg commit -m bøb
> >>   transaction abort!
> >>   rollback completed
> >>   abort: decoding near 'bøb': 'utf8' codec can't decode byte 0xf8 in
> >>   position 1: invalid start byte!
> >> 
> >>   % echo >> a.txt && LC_ALL=en_US.ISO8859-1 hg commit -m bøb
> >> 
> >> This has always seemed quite right to me: we take the bytes given by
> >> the user and decode them using his locale. If we cannot do this, then
> >> we abort and give the user a chance to fix things.
> >
> > Uh, yes? The above matches precisely with this:
> >
> >> > which can be summed up as "restrict encoding-aware code to the
> >> > smallest set possible". If revset can't look up non-ASCII branch
> >> > names in a Latin1 locale, then that means that branch lookup is
> >> > broken, not that revsets needs to become encoding-aware.
> 
> I thought you meant that it should be possible to look up non-ASCII
> branch names in an ASCII locale.

I did.

>  That was what I tried to illustrate
> above: if my locale is X and I enter a commit message in an incompatible
> encoding Y, then I get an error.

Commit messages are not lookups.

> That is how I expect it to work and I thought Dan's patch made it work
> the same for revsets so that 'branch(bøb)' raises an error when you are
> in an ASCII locale.

It should, but he did more than that. He pushed UTF-8 up the code stack
higher than it ought to be.
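
To make the commit case concrete, here is a minimal Python 3 sketch of
decoding at one boundary only: the bytes from the user are decoded with
the locale's encoding and stored as UTF-8, and an incompatible encoding
aborts. The names `to_internal` and `Abort` are hypothetical stand-ins
for this sketch, not a quote of the actual changelog code.

```python
class Abort(Exception):
    """Hypothetical stand-in for a user-facing abort."""


def to_internal(raw: bytes, local_encoding: str) -> bytes:
    """Transcode bytes from the user's locale encoding to UTF-8 for storage."""
    try:
        return raw.decode(local_encoding).encode("utf-8")
    except UnicodeDecodeError as exc:
        # Give the user a chance to fix things instead of guessing.
        raise Abort("decoding near %r: %s" % (raw, exc))


# 'bøb' as typed in a Latin-1 terminal: ø is the single byte 0xf8.
message = b"b\xf8b"

for enc in ("ascii", "utf-8", "iso8859-1"):
    try:
        print(enc, "->", to_internal(message, enc))
    except Abort as err:
        print(enc, "-> abort:", err)
```

Everything above and below this one function can treat the message as
opaque bytes, which is the point of keeping the encoding-aware set small.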

> > In particular, the only piece of code that gives a damn about
> > transcoding the commit message is this ONE LINE right here:
> >
> > http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/changelog.py#l215
> >
> > (Ok, there's a matching line on 177 for reading commits.)
> >
> > Compare this to alternately transcoding in every single path where we
> > can receive a commit message from the user (import, mq, commit, commit
> > -m, rebase, etc.) and reversing it everywhere we show one (hgweb, log,
> > export, summary, etc.) and then think about this again:
> >
> > "restrict encoding-aware code to the smallest set possible"
> 
> Yes, of course -- of course I agree that there should be only a few
> places that are responsible for decoding the user's bytes.
> 
> > The inverse of this statement is:
> >
> > "be encoding-agnostic wherever possible"
> >
> > (By the way, this reminds me of something I recently spotted with
> > sys.setdefaultencoding("undefined"):
> >
> > http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/minirst.py#l26
> >
> > The substs table always consists (as it should) of non-Unicode ASCII
> > strings that get promoted to Unicode, so the transcoding is
> > unnecessary. If transcoding -were- necessary, this code would break,
> > because the default encoding for Unicode promotion is ASCII. Ergo,
> > this code is over-engineered.)
> >
> >> > Related: how should lookup work for names that can't be represented
> >> > in the local charset? Answer: if hg branches shows "caf?"
> >> > rather than "café", then I should be able to "hg up caf?".
> >> 
> >> That sounds bad to me -- the immediate question that arises is what
> >> to do if there is a branch named 'caf?' with a "real" question mark?
> >
> > Bah. You're being a purist.
> 
> I just want to start by making things correct... it feels wrong to me
> that we would start guessing what the user really meant.
> 
> > The intersection of users using ? (Q) and non-ASCII names (U) is going
> > to be negligible, because both sets will be pretty small. And the
> > number of collisions those users experience is going to be vanishingly
> > small (C). The utility of checking out non-ASCII branchnames is larger
> > by definition: U > Q and U > C.
> 
> I'm not sure what these equations should tell me?

If 1% of users are using ids with '?' and 1% of users are using ids with
non-ASCII, then the set of users who are using both is the intersection,
which can be no larger than either set. If their idiocy is uncorrelated, then
we'd expect .01% to have both. Of this .01%, the set who actually
experience collisions (café vs caf?) will be quite a bit smaller still,
like .0001%.
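
The arithmetic, spelled out as a small Python sketch. The two 1% figures
are the hypothetical ones from the paragraph above; the 1% conditional
collision rate is an assumption, chosen only because it is what turns
.01% into .0001%.

```python
from fractions import Fraction

# Hypothetical figures from the text, kept exact with Fraction.
p_question = Fraction(1, 100)   # Q: users with '?' in ids
p_nonascii = Fraction(1, 100)   # U: users with non-ASCII ids

# Uncorrelated means the joint probability is the product.
p_both = p_question * p_nonascii

# Assumed share of the "both" group that really has a café vs caf? clash.
p_clash_given_both = Fraction(1, 100)
p_collision = p_both * p_clash_given_both   # C

print("both:      %s%%" % float(p_both * 100))
print("collision: %s%%" % float(p_collision * 100))
print("U / C =", p_nonascii / p_collision)
```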

So the set of people using non-ASCII ids (1%) and who might get some
value out of being able to check out such ids in ASCII locales is far
larger than the set who experience collisions and thus get a
(well-deserved) surprise.
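
A rough Python 3 sketch of what such a lookup could look like, assuming
names are stored as UTF-8 and shown to the user with unrepresentable
characters replaced by '?'. The functions `tolocal` and `lookup` are
illustrative names for this sketch only; it is one possible approach,
not a description of how branch lookup is implemented.

```python
def tolocal(name_utf8: bytes, local_encoding: str) -> str:
    """Render a stored UTF-8 name as the user would see it in this locale."""
    text = name_utf8.decode("utf-8")
    # 'replace' substitutes '?' for each character the locale cannot show.
    return text.encode(local_encoding, "replace").decode(local_encoding)


def lookup(typed: str, branches, local_encoding: str):
    """Find a stored branch name from what the user typed."""
    # An exact match always wins, so a real 'caf?' branch shadows 'café'.
    for name in branches:
        if name.decode("utf-8") == typed:
            return name
    # Otherwise accept the name as displayed in the user's locale.
    for name in branches:
        if tolocal(name, local_encoding) == typed:
            return name
    return None


branches = ["café".encode("utf-8"), b"default"]
print(tolocal(branches[0], "ascii"))
print(lookup("caf?", branches, "ascii"))
```

The collision case is the one where both "café" and a literal "caf?"
exist: the exact match is returned and the user who wanted "café" gets
the surprise.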

-- 
Mathematics is the supreme nostalgia of our time.




