Features in Mercurial 1.6

Paul Boddie paul.boddie at biotek.uio.no
Wed Jun 16 10:03:02 UTC 2010


Lester Caine wrote:
> And Yonggang I am sure is not the only programmer who will NOT be 
> using ASCII/English as his primary language? At the end of the day, 
> the majority of the world does not use ASCII as their base and it is 
> about time we who do finally accept that and ensure unicode is 
> transparent everywhere.

Sure, but as others have pointed out, Mercurial can handle filenames 
containing non-ASCII characters; it just doesn't attempt to interpret 
the raw filename in any particular way. On Linux-based (and other) 
operating systems, many (if not most) tools don't seek to know what any 
given filename's actual characters are, leaving the presentation of 
filenames to be decided by each user's locale. 
Naturally, this can be a problem: if one person on a system has a locale 
using, say, ISO-8859-1, and another person has a locale using UTF-8, 
then when the first person uses their file manager of choice to look at 
the second user's filenames, they'll see lots of "weird" characters 
in place of any non-ASCII characters in the names. The second person, 
meanwhile, will probably see lots of question marks in the file 
manager's representation of the first user's filenames, because those 
filenames will contain invalid byte sequences when interpreted using 
the UTF-8 encoding.
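
To make the mismatch concrete, here is a minimal Python sketch. The 
filename is made up for illustration; the code simply encodes one name 
both ways and decodes each result with the other user's encoding:

```python
# Hypothetical filename containing non-ASCII characters.
name = "blåbær.txt"

# The raw bytes each user's tools would write to the filesystem.
latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

# The ISO-8859-1 user looking at the UTF-8 user's file: every byte is
# valid ISO-8859-1, so the result is readable but wrong.
seen_by_latin1_user = utf8_bytes.decode("iso-8859-1")

# The UTF-8 user looking at the ISO-8859-1 user's file: the bytes are
# not valid UTF-8, so each bad byte becomes U+FFFD, which many tools
# display as a question mark.
seen_by_utf8_user = latin1_bytes.decode("utf-8", errors="replace")

print(seen_by_latin1_user)  # blÃ¥bÃ¦r.txt
print(seen_by_utf8_user)
```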

In short, Linux/Unix filesystems typically don't attempt to impose a 
universally agreed and established interpretation of a filename; as far 
as the filesystem is concerned, the filenames are all just collections 
of bytes. This can be a pain for the reasons already noted: if you start 
with a filesystem with filenames encoded in ISO-8859-1, and you then 
consider switching to UTF-8 (which is what everyone recommends nowadays) 
then even if you go and rename all the files to make sense in UTF-8, you 
can't be sure that some file somewhere (particularly a configuration 
file) doesn't contain some filename encoded in ISO-8859-1 that then gets 
read and used to try to find the referenced file.
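
The stale-reference problem can be sketched in Python as follows. This 
is only a simulation: a set of byte strings stands in for a directory 
listing, and the names `directory` and `config_reference` are invented 
for the example rather than taken from any real tool:

```python
# A pretend directory: the "filesystem" only knows filenames as bytes.
directory = {"résumé.txt".encode("iso-8859-1")}

# A configuration file written in the ISO-8859-1 days stores the same
# bytes, so a byte-for-byte lookup succeeds.
config_reference = "résumé.txt".encode("iso-8859-1")
found_before = config_reference in directory

# Migrate the directory: reinterpret each name as ISO-8859-1 and
# re-encode it as UTF-8. The configuration file is left untouched.
directory = {n.decode("iso-8859-1").encode("utf-8") for n in directory}

# The old reference no longer matches any filename's bytes.
found_after = config_reference in directory

print(found_before, found_after)  # True False
```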

I suppose programs could be configured to always store filename 
information in a form that precisely defines the interpretation of the 
filename, either using an implicit encoding (filenames are stored using 
UTF-8, for example) or an explicit or well-defined encoding (which is 
what XML provides, for example), and such programs could then use other 
information to know how to convert filenames into the encoding that the 
filesystem is using.
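
As a rough sketch of the implicit-encoding idea in Python: assume a 
program always stores filenames as UTF-8 and is told separately which 
encoding the filesystem uses. The function name and parameters here are 
hypothetical, and this is not a description of what Mercurial does:

```python
def to_filesystem_bytes(stored_name: bytes, fs_encoding: str) -> bytes:
    """Convert a filename stored as UTF-8 into the filesystem's encoding.

    Raises UnicodeEncodeError if the name contains characters that the
    filesystem encoding cannot represent.
    """
    # The stored form has one agreed interpretation: UTF-8.
    text = stored_name.decode("utf-8")
    # Re-encode into whatever the filesystem is assumed to be using.
    return text.encode(fs_encoding)


stored = "blåbær.txt".encode("utf-8")
on_latin1_fs = to_filesystem_bytes(stored, "iso-8859-1")
on_utf8_fs = to_filesystem_bytes(stored, "utf-8")

print(on_latin1_fs)  # b'bl\xe5b\xe6r.txt'
```

In practice the hard part is the "other information": something has to 
supply `fs_encoding`, and on Linux/Unix the filesystem itself usually 
won't.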

Paul


