Initial support of Unicode filenames
Victor Stinner
victor.stinner at haypocalc.com
Thu Nov 3 15:13:23 UTC 2011
On Thursday 3 November 2011 13:19:04, you wrote:
> Also, I'll see "Sweet crêpe recipe.txt" on my Latin-1 system.
I'm not sure that you understood. If we store the filename as Unicode in
Mercurial, a checkout will encode filenames to the locale encoding when
creating files.
You have the Unicode string u"Sweet crêpe recipe.txt". If your locale encoding
is latin1, Mercurial will create the file:
>>> u"Sweet crêpe recipe.txt".encode('latin1')
'Sweet cr\xeape recipe.txt'
If your locale encoding is UTF-8, it creates the file:
>>> u"Sweet crêpe recipe.txt".encode('UTF-8')
'Sweet cr\xc3\xaape recipe.txt'
If you list the directory contents using the "ls" command, the filename is
decoded from the locale encoding and you will get back your ê (U+00EA).
> > If this issue does really matter, we may add workarounds like encoding
> > the unencodable characters to something encodable. E.g. replace "ê"
> > (U+00EA) by "%EA" (3 characters encodable to ASCII), Mac OS X and
> > Gnome use this trick somewhere (I am not sure).
>
> We'll need to recognize the file again for 'hg status' purposes. So it's
> probably no good to encode the "ê" by "%EA" unless we also start
> decoding all "%EA" into "ê" characters.
If we replace the non-encodable ê character with %EA when writing the file, we
also have to replace %EA with ê (U+00EA) again when reading the filename back.
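A minimal sketch of that round trip, in Python 3 syntax. The names escape_name
and unescape_name are hypothetical helpers, not Mercurial functions, and the
sketch assumes an ASCII-compatible locale encoding and only handles characters
up to U+00FF (enough for the %EA example). Note that a literal "%" has to be
escaped too, otherwise a file really named "100%EA.txt" would be misread:

```python
import re

def escape_name(name, encoding="ascii"):
    """Encode name, writing each unencodable character as %XX (hypothetical scheme)."""
    out = []
    for ch in name:
        if ch == "%":
            out.append("%25")  # escape the escape character itself
            continue
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise ValueError("this sketch only handles U+0000..U+00FF")
            out.append("%%%02X" % ord(ch))
    return "".join(out).encode(encoding)

def unescape_name(raw, encoding="ascii"):
    """Reverse escape_name: turn every %XX back into the character U+00XX."""
    text = raw.decode(encoding)
    return re.sub(r"%([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), text)

on_disk = escape_name("Sweet cr\u00eape recipe.txt")
print(on_disk)                 # b'Sweet cr%EApe recipe.txt'
print(unescape_name(on_disk))  # Sweet crêpe recipe.txt
```

The on-disk name is now "Sweet cr%EApe recipe.txt", which is not the name that
anything else expects to find.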
I don't really like the idea of using a custom "encoding" scheme (UTF-7,
base64, punycode or anything else) because it just moves the problem to
somewhere else. For example, if another file refers to the "Sweet crêpe
recipe.txt" file, it will fail to find the file.
If your locale encoding is unable to encode all filenames: change your locale.
If you cannot change the locale on your computer, use another computer.
> That would again be a serious change compared to what we do today.
Does the problem really exist? I'm not sure that people with ASCII locale
encoding manipulate repositories with non-ASCII filenames.
Why do you focus on the worst case when Mercurial fails completely on the
most common case? The common case is two locale encodings that can both encode
all the characters you are using, but to different bytes, and so you get
mojibake.
Mojibake is already a big problem, because if a file refers to the "Sweet
crêpe recipe.txt" file, it also fails to find the file.
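To make the common case concrete, here is a small illustration in Python 3
syntax. It assumes the filename bytes were written by a user with a UTF-8
locale and are read unchanged by a user with a Latin-1 locale; the variable
names are only for illustration:

```python
name = "Sweet cr\u00eape recipe.txt"   # the filename with U+00EA

stored = name.encode("utf-8")    # bytes written under a UTF-8 locale
seen = stored.decode("latin-1")  # the same bytes read under a Latin-1 locale

print(seen)          # Sweet crÃªpe recipe.txt
print(seen == name)  # False: a reference to the original name no longer matches
```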
Mercurial does have a problem today, and I don't see how moving to Unicode
would make the situation worse.
> I would really like to see Mercurial do transcoding of filenames.
What are you calling "transcoding"? If the filename is stored as UTF-8, I
consider that the filename type is Unicode. So you never *transcode* filenames.
You *decode* filenames when you add a new file, you *encode* filenames when you
do a checkout.
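A rough sketch of what I mean, in Python 3 syntax. The functions add_file and
checkout_file are hypothetical and only model the filename handling, they are
not Mercurial's API; the sketch assumes the repository stores names as UTF-8:

```python
STORAGE_ENCODING = "utf-8"  # assumption: names are stored as UTF-8 in the repository

def add_file(raw_name, locale_encoding):
    """Decode the on-disk bytes to Unicode, then store the name as UTF-8."""
    name = raw_name.decode(locale_encoding)
    return name.encode(STORAGE_ENCODING)

def checkout_file(stored_name, locale_encoding):
    """Read the stored Unicode name and encode it to the user's locale encoding."""
    name = stored_name.decode(STORAGE_ENCODING)
    return name.encode(locale_encoding)

# A Latin-1 user adds the file, a UTF-8 user and a Latin-1 user check it out.
stored = add_file(b"Sweet cr\xeape recipe.txt", "latin-1")
print(checkout_file(stored, "utf-8"))    # b'Sweet cr\xc3\xaape recipe.txt'
print(checkout_file(stored, "latin-1"))  # b'Sweet cr\xeape recipe.txt'

# The worst case: the locale encoding cannot encode the name at all.
try:
    checkout_file(stored, "ascii")
except UnicodeEncodeError as exc:
    print("cannot create the file:", exc.reason)
```

Each user gets different bytes on disk, but both see the same ê (U+00EA).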
Please see the Definitions chapter of my Unicode book to avoid confusion:
http://www.haypocalc.com/tmp/unicode-2011-07-20/html/definitions.html
> I've deployed Mercurial at Swiss customers
> and they immediately ran into problems with their umlauts.
Latin1 is able to encode Latin letters with umlauts.
Victor