xml style doesn't generate valid xml

Matt Mackall mpm at selenic.com
Wed Nov 24 20:31:48 UTC 2010


On Wed, 2010-11-24 at 19:17 +0000, Haszlakiewicz, Eric wrote:
> >-----Original Message-----
> >From: Matt Mackall [mailto:mpm at selenic.com]
> >
> >On Tue, 2010-11-23 at 23:51 +0000, Haszlakiewicz, Eric wrote:
> >> >-----Original Message-----
> >> >From: Matt Mackall [mailto:mpm at selenic.com]
> >> >
> >> >The thing is: no one knows what's in the extra field. It could be
> >> >binary, it could be ASCII, it could be UTF-8, it could even be a mix.
> >>
> >> huh?  Since the output I got for the extra field didn't change when I
> >> switched my locale around I figured that it was being treated as a
> >> binary field.  How does mercurial decide whether to treat it as binary
> >> or ASCII or UTF-8?
> >
> >extra["transplant_whatever"] -> binary
> >extra["branch"] -> UTF-8.
> >
> >The log code doesn't know anything about that and assumes a) everything
> >is potentially binary and b) it's ok to print it anyway, because that's
> >what the user asked for.
> 
>    Except from what I can tell, mercurial does not treat
> extra["branch"] as UTF-8.  Regardless of what I set the encoding to,
> the output of hg log has the _same_ value for extra["branch"] and
> there is no conversion to the local encoding that happens.
> 
> hg init x && cd x
> xchar=$(perl -e 'printf "%c", 205;')
> HGENCODING=cp1251 hg branch "branch${xchar}"
> HGENCODING=cp1251 hg ci -m "log msg ${xchar}"
> HGENCODING=cp1251 hg log --style xml --debug > log1
> HGENCODING=utf8 hg log --style xml --debug > log2
> diff log1 log2

Yes, exactly.

The extra field is OPAQUE to the log command. log makes no attempt to
apply any semantics AT ALL to any fields in extra because it is
impossible to do so with any sort of consistency.

> In other words, the "extra" elements, regardless of which specific key
> within extra you are talking about, are a low level, un-encoded view
> of the information and it isn't appropriate to use fromlocal() to
> fiddle with it.

Generally, you're right, it would be incorrect to apply fromlocal()
here.

However, there are two philosophical impedance mismatches at play here.
First, XML really wants characters and Mercurial has only bytes. Second,
unlike filenames and file contents, the extra field is metadata that
Mercurial owns and manages and ought to have asserted an encoding on.
Internally, it uses UTF-8 for extra data, but various extensions
(*coughtransplant*) have decided to stick binary ids in there.

But if extra['foo']="abcd", we have no moral qualms about passing it off
to XML as ASCII, even though we explicitly don't know whether it's
nominally Latin1 or a binary encoding of 1633837924 that happens to look
like ASCII. 
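To make the "abcd" ambiguity concrete, here is a small Python sketch. It assumes the number is a 32-bit big-endian unsigned integer, which is my reading of the example rather than anything Mercurial specifies; the variable names are illustrative only.

```python
import struct

# The same four bytes, seen two ways.
value = b"abcd"

# Read as text: plain ASCII.
as_text = value.decode("ascii")

# Read as a binary id: one 32-bit big-endian unsigned integer
# (the byte order is an assumption made for this illustration).
(as_number,) = struct.unpack(">I", value)

print(as_text, as_number)
```

Nothing in the bytes themselves says which reading was intended, which is why log cannot assign semantics to them.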

Similarly, if we have a string that _cleanly decodes_ as UTF-8, we
should have no qualms about presenting it as such. Unlike Latin1, where
every byte string is a valid Latin1 string, the probability of a random
byte string being valid UTF-8 starts at 50% for one-byte strings and
drops off exponentially (just like ASCII).
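The "cleanly decodes" test can be checked by brute force. The sketch below uses Python's strict UTF-8 decoder as the definition of "valid"; the helper names is_clean_utf8 and valid_fraction are made up for this illustration and are not Mercurial functions.

```python
from itertools import product


def is_clean_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_fraction(length: int) -> float:
    """Fraction of all byte strings of the given length that are valid UTF-8."""
    total = 256 ** length
    valid = sum(
        is_clean_utf8(bytes(combo))
        for combo in product(range(256), repeat=length)
    )
    return valid / total


# Byte 205 (0xCD) is the one used in the reproduction steps quoted above.
lone_byte = b"\xcd"
print(is_clean_utf8(lone_byte))    # not valid UTF-8 on its own
print(lone_byte.decode("latin-1"))  # Latin1 accepts any byte

for n in (1, 2):
    print(n, valid_fraction(n))
```

For one-byte strings only 0x00 to 0x7F are valid, which gives the 50% figure; the fraction shrinks as the length grows.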


(I also have a secret plan here: I want xmlescape (and the html
equivalent!) to work properly with log messages and user names, and I've
arranged for that to actually work losslessly with fromlocal() in my
tree by caching the original UTF-8 value in tolocal().)
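A rough sketch of what caching the original UTF-8 value might look like follows. This is not the code from my tree: the class name LocalBytes, the function signatures, and the use of "replace" for unrepresentable characters are all assumptions chosen to keep the example short.

```python
class LocalBytes(bytes):
    """Bytes in the local encoding that remember their UTF-8 origin.

    Hypothetical helper for illustration only.
    """

    def __new__(cls, utf8: bytes, local: bytes):
        obj = super().__new__(cls, local)
        obj.utf8 = utf8  # cached original, used for lossless round trips
        return obj


def tolocal(utf8: bytes, encoding: str) -> bytes:
    """Convert UTF-8 bytes to the local encoding, caching the original if lossy."""
    text = utf8.decode("utf-8")
    local = text.encode(encoding, "replace")
    if local.decode(encoding) == text:
        return local  # nothing was lost, no cache needed
    return LocalBytes(utf8, local)


def fromlocal(local: bytes, encoding: str) -> bytes:
    """Convert local bytes back to UTF-8, preferring the cached original."""
    if isinstance(local, LocalBytes):
        return local.utf8
    return local.decode(encoding).encode("utf-8")


original = "na\u00efve".encode("utf-8")
shown = tolocal(original, "ascii")        # lossy: the i-diaeresis becomes "?"
restored = fromlocal(shown, "ascii")      # recovered from the cache
print(shown, restored == original)
```

The point of the design is that the lossy local form is still an ordinary byte string for display, while anything that needs the real value (such as an XML escaper) can get it back exactly.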

-- 
Mathematics is the supreme nostalgia of our time.




