Features in Mercurial 1.6
Paul Boddie
paul.boddie at biotek.uio.no
Wed Jun 16 10:03:02 UTC 2010
Lester Caine wrote:
> And Yonggang I am sure is not the only programmer who will NOT be
> using ASCII/English as his primary language? At the end of the day,
> the majority of the world does not use ASCII as their base and it is
> about time we who do finally accept that and ensure unicode is
> transparent everywhere.
Sure, but as others have pointed out, Mercurial can handle filenames
containing non-ASCII characters; it just doesn't attempt to interpret
the raw filename in any particular way. On Linux-based (and other)
operating systems, many (if not most) tools don't try to work out which
characters a given filename actually contains; they leave it to whatever
presents the filename to decode it according to the user's locale.
Naturally, this can be a problem: if one person on a system has a locale
using, say, ISO-8859-1, and another person has a locale using UTF-8,
then when the first person uses their file manager of choice to look at
the second user's filenames, they'll see lots of "weird" characters
which don't represent the true nature of the names, and the second
person will probably see lots of question marks in the file manager's
representation of the first user's filenames because those filenames
will contain invalid byte sequences when interpreted using the UTF-8
encoding.
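To make the mismatch concrete, here is a minimal Python sketch. The
filename is made up for illustration, and the two decode calls only
approximate what a file manager might do; real tools differ in how they
display undecodable names.

```python
# Hypothetical filename "blåbær.txt", written with escapes so the
# example does not depend on the encoding of this message.
name = "bl\u00e5b\u00e6r.txt"

# The bytes each user would end up with on disk under their own locale.
latin1_bytes = name.encode("iso-8859-1")  # b'bl\xe5b\xe6r.txt'
utf8_bytes = name.encode("utf-8")         # b'bl\xc3\xa5b\xc3\xa6r.txt'

# An ISO-8859-1 user looking at the UTF-8 user's file: every byte is a
# valid character, so decoding succeeds but the result looks "weird".
seen_by_latin1_user = utf8_bytes.decode("iso-8859-1")

# A UTF-8 user looking at the ISO-8859-1 user's file: the bytes 0xE5 and
# 0xE6 are not valid UTF-8 here, so a tolerant decoder substitutes a
# replacement character (often rendered as a question mark).
seen_by_utf8_user = latin1_bytes.decode("utf-8", errors="replace")

print(repr(seen_by_latin1_user))
print(repr(seen_by_utf8_user))
```

Note that neither user's bytes have changed; only the interpretation
applied when displaying them differs.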
In short, Linux/Unix filesystems typically don't attempt to impose a
universally agreed and established interpretation of a filename; as far
as the filesystem is concerned, the filenames are all just collections
of bytes. This can be a pain for the reasons already noted. Suppose you
start with a filesystem with filenames encoded in ISO-8859-1, and you
then consider switching to UTF-8 (which is what everyone recommends
nowadays). Even if you go and rename all the files to make sense in
UTF-8, you can't be sure that some file somewhere (particularly a
configuration file) doesn't still contain a filename encoded in
ISO-8859-1, which then gets read and used in an attempt to find the
referenced file, an attempt that fails.
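The stale-reference problem can be sketched in a few lines of Python.
This is only a simulation: the set `names_on_disk` stands in for a
directory listing, and `config_reference` for a filename that some
configuration file recorded before the migration. Both names are
hypothetical.

```python
# Filenames as the filesystem sees them: plain byte strings.
names_on_disk = {"r\u00e9sum\u00e9.txt".encode("iso-8859-1")}

# A configuration file recorded the name while ISO-8859-1 was in use.
config_reference = "r\u00e9sum\u00e9.txt".encode("iso-8859-1")
found_before = config_reference in names_on_disk

# Migration: rename every file so that its name makes sense in UTF-8.
names_on_disk = {
    old.decode("iso-8859-1").encode("utf-8") for old in names_on_disk
}

# The configuration file was not rewritten, so its bytes no longer
# match anything on disk, even though the "same" name is still there.
found_after = config_reference in names_on_disk

print(found_before, found_after)  # True False
```

The lookup fails because the comparison is made on bytes, not on the
characters those bytes were meant to represent.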
I suppose programs could be configured to always store filename
information in a form that precisely defines the interpretation of the
filename, either using an implicit encoding (filenames are stored using
UTF-8, for example) or an explicit or well-defined encoding (which is
what XML provides, for example), and such programs could then use other
information to know how to convert filenames into the encoding that the
filesystem is using.
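As a rough sketch of that idea, and not a description of how Mercurial
or any other program actually behaves, a program could keep filenames
in UTF-8 internally and convert at the boundary. The helper names
`to_storage` and `to_filesystem` are invented for this example, and the
filesystem encoding is passed in explicitly as the "other information"
mentioned above.

```python
STORAGE_ENCODING = "utf-8"  # the implicit encoding used for stored names


def to_storage(fs_name: bytes, fs_encoding: str) -> bytes:
    """Convert a raw filesystem name into the stored (UTF-8) form."""
    return fs_name.decode(fs_encoding).encode(STORAGE_ENCODING)


def to_filesystem(stored_name: bytes, fs_encoding: str) -> bytes:
    """Convert a stored (UTF-8) name into the bytes the filesystem uses."""
    return stored_name.decode(STORAGE_ENCODING).encode(fs_encoding)


# A name as it appears on an ISO-8859-1 system.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")  # b'caf\xe9.txt'

# Stored once, in a form whose interpretation is unambiguous.
stored = to_storage(latin1_name, "iso-8859-1")      # b'caf\xc3\xa9.txt'

# Converted back for whichever encoding the target filesystem uses.
on_latin1_system = to_filesystem(stored, "iso-8859-1")
on_utf8_system = to_filesystem(stored, "utf-8")
```

Both helpers raise an exception if a name cannot be decoded or cannot
be represented in the target encoding, so a real program would also
need a policy for such names.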
Paul