weird filenames on windows

Alexander Belchenko bialix at ukr.net
Wed Jan 16 19:44:07 UTC 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Mackall wrote:
| On Wed, 2008-01-16 at 09:17 +0200, Alexander Belchenko wrote:
|> j w wrote:
|> | On Jan 15, 2008 10:09 AM, Matt Mackall <mpm at selenic.com> wrote:
|> |>> Currently, if I try to add that directory, it cuts out with:
|> |>>     abort: The system cannot find the file specified: mydir\Bilgisayar ve İletişim
|> |> That's quite mysterious. We got the filename from the operating system
|> |> and then handed it back exactly as we received it. What encoding are you
|> |> using? Can you tell us what the exact bytestring of that filename is?
|> |>
|> |
|> | Ok, I looked at this some more.
|> | It may not be a mercurial problem per se, since it affects other tools
|> | (well, at least GnuWin32 tools)
|> | It looks like just a codepage issue.
|>
|> I'm sure this "issue" is directly related to unix nature of these tools and therefore to the habit
|> of looking at filesystem names as bytestrings. It's OK on Linux, but *completely* wrong on Windows,
|> where all filenames should be interpreted as unicode strings.
|
| You'll have to explain why simply passing through data unmodified
| creates a problem, because it's a mystery to me. If a directory listing
| contains bytecode <x> and I ask to open a file with bytecode <x>, it
| should work.

Really?

You are again thinking about the Windows world from a Unix point of view.
On Windows, filenames are stored internally as sequences of 16-bit units in a Unicode-like
representation (not strictly valid Unicode in every case, but for simplicity let's treat it as Unicode).

Windows has two forms of the API for getting filenames back from a directory listing.
The first form returns ANSI strings (i.e. plain 8-bit strings in the system codepage; this is what
Windows calls MBCS, multibyte character set).
The second form returns Unicode strings (not MBCS).

The main question is: how do you represent Unicode characters from different alphabets, e.g. Cyrillic
and Turkish or Japanese, in the same 8-bit string, where each byte represents one character?
Please answer this question, at least for yourself.
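
To make the question concrete, here is a minimal sketch (assuming a reasonably modern Python; the
helper name `encodable` and the three sample characters are only illustrative, they are not from
the thread). It checks which of three 8-bit Windows codepages can encode each character.

```python
# Illustrative helper (hypothetical name): can `text` be encoded in `codepage`?
def encodable(text, codepage):
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

# One sample character per alphabet (chosen for illustration only).
samples = {
    "latin_a_grave": u"\u00e0",     # LATIN SMALL LETTER A WITH GRAVE
    "cyrillic_zhe": u"\u0436",      # CYRILLIC SMALL LETTER ZHE
    "turkish_dotted_i": u"\u0130",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
}

# Russian, Western European and Turkish ANSI codepages.
codepages = ["cp1251", "cp1252", "cp1254"]

table = {}
for cp in codepages:
    table[cp] = dict((key, encodable(ch, cp)) for key, ch in samples.items())

# True if at least one codepage can hold all three characters at once.
some_codepage_fits_all = any(all(row.values()) for row in table.values())
```

With these samples, each codepage covers its own alphabet but none covers all three, which is why
a single 8-bit string cannot carry mixed-language filenames.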

And now the example.

Let's try to create a file named LATIN SMALL LETTER A WITH GRAVE, Unicode character U+00E0, on Russian Windows.
Russian Windows is primarily intended for the Cyrillic alphabet, so its primary ANSI codepage (cp1251)
does not contain Latin letters with diacritics. (My mother tongue is Russian, so I use Russian
Windows.)

Here is the Python session:

C:\Temp\6>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(u'\u00E0', 'wb').close()
>>> os.listdir(u'.')
[u'\xe0']
>>> os.listdir('.')
['a']
>>> hex(ord(os.listdir('.')[0]))
'0x61'
>>>
>>> name = os.listdir('.')[0]
>>> open(name, 'rb')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'a'
>>>
>>> name = os.listdir(u'.')[0]
>>> open(name, 'rb')
<open file u'\xe0', mode 'rb' at 0x00AA1698>
>>>

If you think that we did not actually create A WITH GRAVE, look at the screenshot of Windows
Explorer. See the attachment.
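
The 'a' in the session above comes from a lossy conversion to the ANSI codepage. The sketch below
only imitates that effect so it can be tried on any platform: it is an assumption-laden stand-in
(Unicode NFKD decomposition instead of Windows' real best-fit tables, and a hypothetical helper
named `best_fit_approx`), not what Windows or Mercurial actually runs.

```python
import unicodedata

def best_fit_approx(name, codepage="cp1251"):
    """Rough approximation of a lossy Unicode -> ANSI filename conversion.

    ASSUMPTION: real Windows uses its own best-fit mapping tables; here we
    merely fall back to the base letter of the NFKD decomposition.
    """
    out = []
    for ch in name:
        try:
            ch.encode(codepage)
            out.append(ch)            # representable as-is
        except UnicodeEncodeError:
            base = unicodedata.normalize("NFKD", ch)[0]
            try:
                base.encode(codepage)
                out.append(base)      # e.g. U+00E0 degrades to plain 'a'
            except UnicodeEncodeError:
                out.append("?")       # nothing close enough
    return "".join(out)

real_name = u"\u00e0"                 # the name stored on disk
ansi_name = best_fit_approx(real_name)

# The 8-bit name no longer identifies the file that was listed.
name_survives = (ansi_name == real_name)
```

Once the name has been degraded this way, handing it back "exactly as received" asks for a file
called 'a', which does not exist.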

Any questions?

| And indeed it does for most people (in particular, people
| using the prebuilt binaries).

Most people never create filenames in an encoding or language foreign to their Windows version.
For example, I have the Russian version of Windows, and most of the time I work with Russian
filenames, if not English, and never with Turkish, French or Japanese. But if you start to mix
languages, you will have very serious problems handling working trees unless you decide to
use Unicode on Windows. That's how it works inside Bazaar.
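
As a minimal sketch of the "use Unicode names" approach (assuming Python 3, where `os.listdir`
called with a text path returns text names; the function name `roundtrip_names` is hypothetical,
and this is not Bazaar's actual code), the following lists a directory and checks that every
listed name can be opened again:

```python
import os
import shutil
import tempfile

def roundtrip_names(directory):
    """List `directory` using text (Unicode) names and try to reopen each entry."""
    results = {}
    for name in os.listdir(directory):      # text path in, text names out
        path = os.path.join(directory, name)
        try:
            with open(path, "rb"):
                results[name] = True
        except OSError:
            results[name] = False
    return results

# Recreate the experiment from the session in a scratch directory.
workdir = tempfile.mkdtemp()
try:
    open(os.path.join(workdir, u"\u00e0"), "wb").close()
    status = roundtrip_names(workdir)
finally:
    shutil.rmtree(workdir)
```

Because the name never passes through an 8-bit codepage, the listed name and the reopened name
stay the same string.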

| It's in fact much more likely that j w's problem is caused by something
| in his setup (probably GnuWin32) attempting to interpret filenames as
| unicode strings and not doing it consistently. But it won't be
| Mercurial, because Mercurial knows better than to even try.

Of course, you know better what Mercurial does internally.
I haven't looked at the sources closely yet.
I'm actually a Bazaar hacker.

It's rather ironic that Bazaar still suffers from performance problems that Mercurial
solved from the beginning, while at the same time Windows support in Mercurial is still
buggy and Bazaar resolved most such problems a long time ago. The two systems have almost
mirror-image sets of problems, and each has its own weaknesses.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHjl6HzYr338mxwCURAhvQAKCPdydB4RWHN6+9SMij0aTVBCxsMACfUUJw
iFnesKWrRqFzAxw5FJrb5WU=
=nNpM
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a-with-grave.png
Type: image/png
Size: 19207 bytes
Desc: not available
URL: <http://lists.mercurial-scm.org/pipermail/mercurial/attachments/20080116/ca4ba3b7/attachment-0003.png>

