Getting http://mercurial.selenic.com/wiki/FixUtf8Extension as a part of hgsubversion

Tom Anderson tom.anderson at e2x.co.uk
Wed Oct 19 13:59:59 UTC 2011


2011/10/19 Martin Geisler <mg at aragost.com>:

> You ask why Subversion can work on Windows/Linux and the answer is
> simple: they have chosen to transcode the filenames to and from
> Unicode. Mercurial has chosen to not do this.
>
> It is a tradeoff: by transcoding we would support some filenames better
> on the two systems, but break some build tools. We would also have to
> deal with a lot more bug reports about encoding problems when the
> transcoding fails.
>
> As an example, if you have a repository with a file called "罗勇刚.txt",
> then I can make a clone to my Latin-1 Linux box and I can see the file
> today. If Mercurial would try to transcode the file into Latin-1, then
> the checkout would fail. Depending on what I need to do with the file,
> failing might be good or bad.

Hold on - at the moment, when you try to check out Luo's file, you'll
get a file whose name is just complete gibberish. If Luo uses UTF-8,
the bytes are e7 bd 97 e5 8b 87 e5 88 9a, which in ISO 8859-1 gives
"ç½?å??å??", where the question marks are characters my machine
doesn't know. Are you seriously suggesting that this is in any way
useful, let alone correct behaviour?
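
The byte arithmetic above can be checked with a short Python sketch. It
only illustrates the two behaviours under discussion (reinterpreting the
stored bytes versus transcoding the name); it is not Mercurial's or
Subversion's code, the variable names are made up for the example, and
unprintable characters are replaced with "?" to mimic a terminal that
cannot display them.

```python
# Illustration only: not Mercurial code. All names here are hypothetical.
name = "罗勇刚"

# What a UTF-8 client stores as the filename.
utf8_bytes = name.encode("utf-8")
hex_dump = utf8_bytes.hex(" ")
print(hex_dump)

# Byte-string approach: a Latin-1 system reinterprets the very same bytes.
as_latin1 = utf8_bytes.decode("latin-1")
shown = "".join(c if c.isprintable() else "?" for c in as_latin1)
print(shown)

# Transcoding approach: the name has no Latin-1 representation, so it fails.
try:
    name.encode("latin-1")
    transcode_failed = False
except UnicodeEncodeError:
    transcode_failed = True
print(transcode_failed)
```

The first path always succeeds but yields a name nobody typed; the
second refuses outright, which is the failure Martin describes.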

What happens if you check in some files with entirely alphabetical
names on your Latin-1 box, and I check them out on my EBCDIC machine?
There, the names can be perfectly accurately represented at either
end. If Mercurial treats names as byte strings, doesn't that mean I
will get gibberish again?
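
The same comparison can be sketched for the EBCDIC case. This assumes
cp037, one of the EBCDIC code pages that ships with Python, stands in
for "my EBCDIC machine", and uses a made-up filename; it is a sketch of
the two strategies, not of how any real tool handles them.

```python
# Illustration only: cp037 is an assumed stand-in for an EBCDIC system.
name = "abc.txt"

# Bytes committed on the Latin-1 box.
stored = name.encode("latin-1")

# Byte-string approach: the EBCDIC machine reads the same bytes as cp037.
reinterpreted = stored.decode("cp037")

# Transcoding approach: decode with the source charset, re-encode for the target.
transcoded_bytes = stored.decode("latin-1").encode("cp037")
round_tripped = transcoded_bytes.decode("cp037")

print(repr(reinterpreted))  # not the name that was checked in
print(round_tripped)        # the original name, carried by different bytes
```

Both character sets can represent the name, yet only the transcoding
path preserves it, because the byte values for the letters differ
between the two encodings.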

You are absolutely right that this is a tough problem, because not
every filename that someone might write can be represented correctly
on everyone else's filesystem. But there are many filenames which can
be represented correctly on a great many people's filesystems, and
Mercurial gets those wrong too.

I read the EncodingStrategy page on the wiki. It seems that the only
real argument for treating filenames as bytes is the "makefile
problem". The comment that "non-ASCII filenames are not reliably
portable between systems in general" is hokum. In essence, this means
that the Mercurial project made an early decision that it cared more
about supporting broken Unix build tools than it did about supporting
users of non-ASCII languages. That's fine, but it's a decision that
the project should be open about.

tom

-- 
Tom Anderson         |                e2x Ltd, 1 Norton Folgate, London E1 6DB
(e) tom at e2x.co.uk    |    (m) +44 (7960) 989794    |    (f) +44 (20) 7100 3749


