Re: A proposal on solve encoding problem on Windows.

Tom Anderson tom.anderson at e2x.co.uk
Wed Oct 26 08:21:52 UTC 2011


On 26 October 2011 08:39, Jouni Airaksinen <Jouni.Airaksinen at descom.fi> wrote:

> I've been following this conversation and it seems to me that people are
> trying to overengineer this?
>
> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations are
> done?

That would obviously require knowing the encoding being used by the
local platform, so that filenames could be transcoded on commit.

I believe the objection to this idea is that Mercurial can't
accurately know what that encoding is. Unix systems mostly do use
UTF-8, as it happens, but that is not necessarily the case - they
could use Latin-1, or any other encoding. Because the encoding in use
is not recorded anywhere definitive, Mercurial can't tell for sure. It
would have to do something like assume UTF-8 unless configured
otherwise, which leads to the situation where someone using a Latin-1
system forgets to configure that, and they end up committing files
with corrupt names.
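
To make that failure concrete, here is a minimal sketch in plain Python
(nothing here is Mercurial code, and the filename is a made-up example).
It shows what happens when bytes written by a Latin-1 system are assumed
to be UTF-8, and the reverse mistake, which fails silently:

```python
# Hypothetical filename containing one non-ASCII character.
name = "caf\u00e9.txt"  # "café.txt"

# The bytes a Latin-1 system would use for this name on disk.
latin1_bytes = name.encode("latin-1")  # b'caf\xe9.txt'

# Wrongly assuming UTF-8: a lone 0xE9 byte followed by '.' is not valid
# UTF-8, so a strict decode fails. If the bytes were instead stored
# unchecked, the repository would hold a name that is not UTF-8 at all.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The opposite mistake: UTF-8 bytes read as Latin-1 never raise an
# error, because every byte is a valid Latin-1 character. The result is
# simply the wrong name.
utf8_bytes = name.encode("utf-8")          # b'caf\xc3\xa9.txt'
mojibake = utf8_bytes.decode("latin-1")    # 'cafÃ©.txt'
```

The second case is the nastier one, since nothing signals that the
name has been corrupted.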

That said, i think this is essentially the right solution, but it
should be optional, not the default. I have a very wordy email sitting
in my drafts about this, but basically, i think we should have two
modes, set on a per-repository basis, and burned into the repository
when created:

- Old mode, in which repositories use a 'passthrough' encoding for
both the filesystem and the repository; the same bytes are used in
both places (if you like, and if you'll excuse some Java, think of
this as char decode(byte b) {return (char) (b & 0xff);} byte
encode(char ch) {return (byte) ch;}). This reproduces the current
behaviour.
- New mode, in which repositories use UTF-8 for the repository, and a
local encoding for the filesystem. The local encoding could be set in
hgrc in the usual way. If it was not set, Mercurial could either (a)
guess an encoding based on some combination of system settings (the
LANG environment variable on Unix, don't know about Windows), a survey
of the bytes in some local filenames, whatever, or (b) refuse to
commit (like how it won't commit until you specify a username). The
former would be easier on users; the latter would be safer and more
Pythonic ("In the face of ambiguity, refuse the temptation to
guess."). It could perhaps follow a canny middle way: as long as any
path being committed appears to be plain ASCII (which works for
everyone except users of EBCDIC and PETSCII machines - not many of
them around), then guess that it's ASCII, but if it has any high bits
set, throw a strop and demand to be configured properly.
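
The two modes, with the "middle way" for an unconfigured local
encoding, could be sketched roughly as below. This is only an
illustration of the idea: the function names and the
EncodingNotConfigured exception are hypothetical, not anything
Mercurial provides, and the local encoding is passed in as a plain
argument rather than read from hgrc.

```python
class EncodingNotConfigured(Exception):
    """Hypothetical error: a non-ASCII path was seen and no local
    encoding has been configured."""


def old_mode_to_repo(fs_name: bytes) -> bytes:
    # Old mode: passthrough. The repository stores exactly the bytes
    # the filesystem uses.
    return fs_name


def new_mode_to_repo(fs_name: bytes, local_encoding=None) -> bytes:
    # New mode: the repository always stores UTF-8.
    if local_encoding is None:
        # Middle way: plain ASCII is already valid UTF-8, so it can be
        # stored safely without knowing the local encoding.
        if all(b < 0x80 for b in fs_name):
            return fs_name
        # Any high bit set: refuse to guess.
        raise EncodingNotConfigured(
            "non-ASCII filename and no local encoding configured"
        )
    # Transcode from the local encoding to UTF-8 on commit.
    return fs_name.decode(local_encoding).encode("utf-8")


def new_mode_to_fs(repo_name: bytes, local_encoding: str) -> bytes:
    # Transcode from UTF-8 back to the local encoding on checkout.
    return repo_name.decode("utf-8").encode(local_encoding)
```

For example, new_mode_to_repo(b"caf\xe9.txt", "latin-1") gives
b"caf\xc3\xa9.txt", while the same call without an encoding raises
EncodingNotConfigured instead of storing a corrupt name.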

I assume old mode would have to be the default, so as not to trip up
users currently depending on that behaviour. Personally, i would like
to see the new mode be the default, but i don't think that will fly.

You could actually convert a repository between these modes as long
as it only contained ASCII or UTF-8 filenames, i think, since the
stored bytes would then be the same in both.
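
A rough sketch of that precondition, assuming the stored filenames are
available as a list of byte strings (is_convertible is a hypothetical
helper, not an existing Mercurial function):

```python
def is_convertible(filenames):
    """Return True if every stored filename is valid UTF-8.

    ASCII is a subset of UTF-8, so pure-ASCII names pass as well.
    """
    for name in filenames:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True
```

Note that this is a necessary check rather than a proof: some byte
sequences written under another encoding happen to be valid UTF-8 too,
so a passing result only says the names can be read as UTF-8, not that
they were meant to be.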

tom

-- 
Tom Anderson         |                e2x Ltd, 1 Norton Folgate, London E1 6DB
(e) tom at e2x.co.uk    |    (m) +44 (7960) 989794    |    (f) +44 (20) 7100 3749


