Unicode support request.

Matt Mackall mpm at selenic.com
Tue Oct 18 18:49:58 UTC 2011


On Tue, 2011-10-18 at 11:20 -0700, Snidely wrote:
> On Oct 18, 9:13 am, Augie Fackler <duri... at gmail.com> wrote:
> > On Oct 18, 2011, at 5:13 AM, 罗勇刚(Yonggang Luo) wrote:
> >
> > > Mercurial has poor Unicode support on Windows, and maybe also on
> > > Mac OS, so I am requesting an improvement, especially for file path
> > > encoding.
> > > At present it depends entirely on Windows' native encoding, which
> > > is a really bad idea and does not work at all.
> > > So I propose using UTF-8 as the default encoding for file paths.
> >
> > Filenames are opaque bytes on Linux as well as most unixes (FreeBSD,
> > Solaris, etc). Only OS X and Windows have this brain damage about
> > reencoding filenames into different bytes, and it causes problems. As
> > a specific example, make(1) is oblivious to the notion of encodings,
> > so if you start mangling the bytes that come in and out you run the
> > risk of breaking people's build scripts. This has _actually happened_
> > to Subversion users, and it ends up being a nightmare when you
> > encounter such a problem.
> >
> > Trying to fix this in Mercurial is likely to cause more headaches
> > than it's worth.
> 
> 
> Out of curiosity, is the problem doing a UTF-16 -> UTF-8 conversion,
> or that you only get UTF-16 on NTFS filesystems? 

The fundamental problem is that Unix treats filenames as _bytes_ but
Windows treats filenames as _characters_. This is a deep conceptual
divide and no amount of throwing Unicode at it will bridge it.
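
To make the bytes-versus-characters divide concrete, here is a small
Python illustration. It is only a sketch: the filename and the helper
is_valid_utf8 are made up for this example and are not part of
Mercurial.

```python
# Illustration only: one filename seen as characters and as bytes.
name = "r\u00e9sum\u00e9.txt"  # the characters 'résumé.txt'

# The same characters give different byte strings under different encodings.
utf8_bytes = name.encode("utf-8")      # 12 bytes
latin1_bytes = name.encode("latin-1")  # 10 bytes


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A byte-oriented filesystem accepts both names as-is, but only one of
# them can be interpreted as UTF-8 characters.
assert is_valid_utf8(utf8_bytes)
assert not is_valid_utf8(latin1_bytes)
```

A byte-oriented system stores either byte string without caring which
encoding produced it; a character-oriented one has to pick an
interpretation, and that is where the trouble starts.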

This doesn't mean we're not going to try to do anything about it; it
just means there is no perfect fix. Realistically, people working
cross-platform have two reasonable choices: stick to ASCII (works
perfectly today) or use UTF-8 (needs help on Windows).

So the best fix we can do is something similar in approach to fixutf8:
translate filenames that happen to be stored in Mercurial in UTF-8
to/from wide characters on Windows. This is tricky, because it can't
change the behavior of repos using Latin1 and Windows-1251.
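
As a rough sketch of that idea (not fixutf8's or Mercurial's actual
code), the translation could look something like the Python below. The
function names, the "legacy" parameter and the choice of cp1251 as the
fallback are all assumptions made for illustration.

```python
# Sketch only: hypothetical helpers, not an existing Mercurial API.

def decode_stored_name(stored: bytes, legacy: str = "cp1251"):
    """Turn a stored byte-string filename into characters.

    Tries UTF-8 first; if the bytes are not valid UTF-8, falls back to the
    assumed legacy encoding. Returns (text, encoding_used) so the caller
    can later reproduce the exact original bytes.
    """
    try:
        return stored.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return stored.decode(legacy), legacy


def encode_for_storage(name: str, encoding: str) -> bytes:
    """Turn characters back into bytes using the encoding recorded above."""
    return name.encode(encoding)


# Round trip for a UTF-8 name and for a Windows-1251 name.
for raw in ("файл.txt".encode("utf-8"), "файл.txt".encode("cp1251")):
    text, enc = decode_stored_name(raw)
    assert encode_for_storage(text, enc) == raw

# The tricky part: some legacy byte strings are also valid UTF-8.
# These two bytes are 'Р©' in Windows-1251 but decode as 'Щ' in UTF-8.
ambiguous = b"\xd0\xa9"
assert decode_stored_name(ambiguous) == ("Щ", "utf-8")
assert ambiguous.decode("cp1251") == "Р©"
```

The last case shows why guessing from the bytes alone is not enough: a
legacy-encoded name can be misread as UTF-8, which would change the
behavior of an existing repo.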

We've talked about this for years. But there seems to be a shortage of
people who actually a) understand the real constraints of the problem
and b) hack on Windows. 

-- 
Mathematics is the supreme nostalgia of our time.




