Unicode support for non-unicode locales

Shun-ichi GOTO shunichi.goto at gmail.com
Mon Oct 8 16:02:01 UTC 2007


2007/10/8, Matt Mackall <mpm at selenic.com>:
> Does it make the corresponding changes to your project's Makefile,
> etc., as well? What happens if someone does a checkout in an
> ASCII/latin-1 locale?
>
> Filenames, just like their contents, are the users' data. Our mandate
> is to preserve that data exactly.

As a Japanese user, I cannot agree with treating filenames as raw byte
data in the same way as file contents.

1st point:
We should not handle filenames as byte data, because how those bytes
are interpreted depends on the rules of the OS/filesystem.  For example,
Shift_JIS (an MBCS) is used on Japanese Windows, and Shift_JIS allows
the '\' character (0x5c) as the 2nd byte of a character.  As you know,
the path separator on Windows is also the '\' character.  This means a
path string cannot be handled as a simple byte sequence.  If we treat
filenames as raw byte data, some filenames may be broken by path
operations.  So the Python code should handle filenames as unicode
characters, by decoding them.
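
As a rough illustration of this breakage (Python 3 syntax; the path
"dir\<kanji>.txt" is made up for the example, and U+8868 is just one of
the kanji whose Shift_JIS form ends in 0x5c):

```python
# Hypothetical path: directory "dir" plus a file named U+8868 + ".txt".
# U+8868 is encoded in Shift_JIS as the two bytes 0x95 0x5c.
path = "dir\\\u8868.txt"
raw = path.encode("shift_jis")

# Splitting the raw bytes on the separator cuts the kanji in half,
# because its 2nd byte is the same value as '\'.
byte_parts = raw.split(b"\\")

# Decoding first, then splitting on the character, keeps it intact.
text_parts = raw.decode("shift_jis").split("\\")

print(byte_parts)
print(text_parts)
```

The byte-level split yields three pieces instead of two, so the file
name is lost; the decoded split yields the directory and the file name.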

2nd point:
I want to handle non-ASCII filenames across different platforms.  In
Japan, 4 encodings (Shift_JIS, EUC-JP, JIS and UTF-8) are used for
filenames.  Generally, old UNIX systems use EUC-JP or JIS, recent UNIX
systems use EUC-JP or UTF-8, and Windows uses Shift_JIS (with unicode
used internally).  If mercurial holds filenames as raw bytes in one
specific encoding, a repository that contains kanji filenames can be
used only on platforms with that same encoding.  In other words, the
repository is not portable.
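
To make the portability problem concrete, here is a small sketch
(Python 3 syntax).  The file name is made up, and I am assuming that
"JIS" corresponds to Python's iso2022_jp codec:

```python
# One hypothetical file name: the kanji U+8868 followed by ".txt".
name = "\u8868.txt"

# Python codec names assumed to match the four encodings above.
encodings = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]

# The same name becomes a different byte string on each platform.
encoded = {enc: name.encode(enc) for enc in encodings}
for enc in encodings:
    print(enc, encoded[enc])

# Bytes stored on a Shift_JIS platform are not a valid name elsewhere,
# for example they cannot even be decoded as UTF-8.
try:
    encoded["shift_jis"].decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False
```

Each encoding gives different bytes for the same name, so raw bytes
recorded on one platform do not name the same file on another.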

Subversion handles this better by holding filenames as UTF-8, which is
very useful for us.  We can check out Japanese filenames on both UNIX
(EUC-JP or UTF-8) and Windows (Shift_JIS), and also on UTF-8 platforms
that are not in a Japanese locale.

# Of course, there's a Makefile issue as you said, but it is another
# issue.

Also, holding filenames as raw bytes may create strangely named files
on non-Japanese platforms on "hg update".  However, if mercurial held
filenames as UTF-8, we could sanitize them or warn when encoding them
to the local filename encoding.

So, I strongly hope that mercurial will hold filenames as UTF-8 and
encode/decode them to/from the file-system encoding.
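
A minimal sketch of what I mean (Python 3 syntax).  The functions
to_local() and from_local() are hypothetical names for illustration
only, not existing mercurial code, and replacing unencodable characters
with "_" is just one possible sanitizing policy:

```python
import warnings


def from_local(local_bytes, fs_encoding):
    """Convert a file-system name to the UTF-8 form to be stored."""
    return local_bytes.decode(fs_encoding).encode("utf-8")


def to_local(stored_utf8, fs_encoding):
    """Convert a stored UTF-8 name to the file-system encoding.

    Characters the local encoding cannot represent are replaced
    with "_" and a warning is issued (assumed policy).
    """
    name = stored_utf8.decode("utf-8")
    try:
        return name.encode(fs_encoding)
    except UnicodeEncodeError:
        warnings.warn("%r cannot be represented in %s"
                      % (name, fs_encoding))
        safe = []
        for ch in name:
            try:
                ch.encode(fs_encoding)
                safe.append(ch)
            except UnicodeEncodeError:
                safe.append("_")
        return "".join(safe).encode(fs_encoding)


# A name committed on Japanese Windows, checked out on other platforms.
stored = from_local(b"\x95\\.txt", "shift_jis")
on_euc = to_local(stored, "euc_jp")
on_utf8 = to_local(stored, "utf-8")
```

With this scheme the stored name is the same no matter where it was
committed, and a checkout in a latin-1 locale gets a warning and a
sanitized name instead of unreadable bytes.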

-- 
Shun-ichi GOTO


