Unicode support for non-unicode locales

Shun-ichi GOTO shunichi.goto at gmail.com
Tue Oct 9 05:59:31 UTC 2007


My position is that filename encoding conversion is the right thing.

Matt, you may assume that "any byte sequence is accepted as a filename
in the ASCII/latin-1 world, so it is better for users and tools", but
that is not true in other worlds such as UTF-8 and Shift_JIS
file-systems, because byte sequences are validated and sanitized
there.
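
To illustrate the validation point, here is a small Python sketch (the
file names are made up for the example). A byte sequence that is a
perfectly good latin-1 name is rejected as UTF-8, and a Shift_JIS name
can contain the byte 0x5C, which tools that treat names as ASCII bytes
will read as a backslash.

```python
# A byte sequence that is a valid Latin-1 filename ("café.txt").
raw = b"caf\xe9.txt"

# Latin-1 maps every byte to a character, so this always succeeds.
latin1_name = raw.decode("latin-1")

# The same bytes are not valid UTF-8: 0xE9 announces a multi-byte
# character, but the following byte is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In Shift_JIS the second byte of a character may be 0x5C, which is the
# ASCII backslash. U+8868 is encoded as the two bytes 0x95 0x5C.
sjis = "\u8868.txt".encode("shift_jis")
has_backslash_byte = b"\\" in sjis
```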

The filename is certainly user data, but in the sense of a sequence of
characters, not of bytes. The byte sequence of a filename is just its
representation on a specific file-system.
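
A short Python sketch of what I mean, using an example name chosen only
for illustration: one sequence of characters has a different byte
representation in each encoding, and each of them decodes back to the
same name.

```python
# One filename as a sequence of characters (Japanese for "Japanese" + .txt).
name = "\u65e5\u672c\u8a9e.txt"

encodings = ["utf-8", "shift_jis", "euc_jp"]

# The byte representation depends on the encoding in use.
representations = {enc: name.encode(enc) for enc in encodings}

# Three encodings give three different byte sequences...
distinct = len(set(representations.values()))

# ...but each one decodes back to the same characters.
round_trips = all(representations[enc].decode(enc) == name for enc in encodings)
```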

I want to say this loudly: leaving filenames untouched does not always
solve the problem, and it breaks the file-system.


2007/10/9, Matt Mackall <mpm at selenic.com>:
> Again, what happens if someone does a checkout in an ASCII/latin-1
> locale? That's most of the computing world. The answer is: your
> Russian characters are not just mangled, they're completely LOST. In
> fact, you probably won't be able to check out your project at all
> because filename "??????" will collide with filename "??????".

One answer is to refuse to extract those files and report an error.
Of course, in this case, an ASCII user cannot check out that repository.
But that would not matter, because the project owner does not expect
such users. It is the policy (or rule) of the project.  Problems
around non-ASCII filenames are the responsibility of the project,
not of the SCM. The SCM should do the right thing, not attempt to be
convenient for tools.

If you are worried about this, hg could have an option that specifies
the file-system encoding, to enforce a particular encoding on the local
file-system.
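
As a rough sketch of the two ideas above (an error instead of mangled
names, and an explicit file-system encoding), something like the
following could work. This is not hg code: to_local_filename, lossy and
FilenameEncodingError are hypothetical names, and it assumes that
filenames are stored in the repository as UTF-8.

```python
import sys


class FilenameEncodingError(Exception):
    """Hypothetical error: a name cannot be written in the local encoding."""


def to_local_filename(stored, fs_encoding=None):
    """Convert a stored name to local file-system bytes, or fail loudly.

    Assumption: `stored` is the UTF-8 byte form kept in the repository.
    `fs_encoding` plays the role of the option that overrides the
    file-system encoding.
    """
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    name = stored.decode("utf-8")
    try:
        return name.encode(fs_encoding)
    except UnicodeEncodeError as exc:
        raise FilenameEncodingError(
            "%r cannot be represented in %s" % (name, fs_encoding)
        ) from exc


def lossy(stored, fs_encoding):
    """The behaviour to avoid: unrepresentable characters become '?'."""
    return stored.decode("utf-8").encode(fs_encoding, errors="replace")


# Two different Russian names of six letters each.
a = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")
b = "\u043f\u0440\u0438\u043c\u0435\u0440".encode("utf-8")

# Lossy conversion to ASCII makes the two names collide.
collide = lossy(a, "ascii") == lossy(b, "ascii")

# Strict conversion refuses instead of producing colliding names.
try:
    to_local_filename(a, "ascii")
    refused = False
except FilenameEncodingError:
    refused = True

# With a suitable encoding given explicitly, the name converts cleanly.
koi8 = to_local_filename(a, "koi8_r")
```

With strict conversion the ASCII user gets an error at checkout, and a
user who sets a Russian encoding gets correctly named files.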


Not all hg users want to make a world-wide repository.  If we
want to publish a repository for all users in the world, it is
best to use only ASCII filenames.  That has been the principle since
the old days.


> This fix might work fine for special cases like going from one Russian
> or Japanese encoding to another, but in general, it makes a bad
> problem worse. It's much better overall for data to be "corrupted" by
> "passing it through untouched".

No, "passing it through untouched" makes things worse for every
language.  As I said at the start, using the untouched byte sequence
does not solve the Makefile issue you mentioned.

It goes back to the nightmare with CVS...

-- 
Shun-ichi GOTO


