Unicode support for non-unicode locales
Micah Cowan
micah at cowan.name
Tue Oct 9 20:59:51 UTC 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Matt Mackall wrote:
> On Tue, Oct 09, 2007 at 02:59:31PM +0900, Shun-ichi GOTO wrote:
>> Matt, you may assume "any byte sequence can be accepted as filename in
>> ASCII/latin-1 world, so it is better for users and tools", but it is
>> not true in other world like UTF-8 file-system and Shift_JIS
>> file-system because of validation of byte sequence and sanitization
>> of byte sequence.
>
> You're absolutely right about Shift_JIS. (At the same time, you're
> absolutely wrong about UTF-8.)
In what way is he "absolutely wrong about UTF-8"? As soon as you see a
byte with the high bit set, UTF-8 tells you that (1) at least one
adjacent byte also has the high bit set, and (2) the byte is part of a
series of N bytes: a lead byte with the N highest bits set (followed by
an unset bit), then N-1 bytes with the highest bit set and the
next-highest unset.
So UTF-8 cannot accept "any byte sequence": any high-bit-set bytes that
don't follow this structure are not valid UTF-8. For instance, most
latin-1 names that contain non-ASCII characters are not valid UTF-8
strings. What the system and various tools will do with such a name,
who can say?
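To make the structural rule concrete, here is a rough Python sketch of
a checker for it. The function names are made up for illustration, and
it tests only the lead-byte/continuation-byte layout described above;
a real validator (such as Python's own strict "utf-8" codec) also
rejects overlong forms, surrogates, and code points past U+10FFFF.

```python
def utf8_sequence_length(lead):
    """Return the sequence length a lead byte implies, or 0 if the
    byte cannot start a sequence (hypothetical helper)."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2            # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3            # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4            # 11110xxx
    return 0                # stray continuation byte or invalid lead


def follows_utf8_structure(data):
    """Check only the lead/continuation layout of a bytes object."""
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        # Every following byte must look like 10xxxxxx.
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += n
    return True


utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
print(follows_utf8_structure(utf8_name))
print(follows_utf8_structure(latin1_name))
```

In the latin-1 name, the byte 0xE9 looks like the lead of a three-byte
sequence, but the bytes after it are plain ASCII, so the check fails.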
Also, how is "any byte sequence" valid even in the ASCII/latin-1 world?
What happens when your UTF-8 filename includes bytes that in latin-1
would be interpreted as control characters?
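As a small illustration of that last question, the following Python
sketch decodes a UTF-8 encoded name as latin-1 and lists any control
characters that result. The helper name is hypothetical, and it assumes
Python's "latin-1" codec, which maps bytes 0x80-0x9F to the C1 control
characters U+0080-U+009F.

```python
import unicodedata


def latin1_control_chars(raw_name):
    """Return the control characters seen when raw_name is read as
    latin-1 (hypothetical helper)."""
    # Decoding as latin-1 never fails: every byte maps to a code point.
    text = raw_name.decode("latin-1")
    # Unicode general category "Cc" covers both C0 and C1 controls.
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]


# U+00C0 encodes in UTF-8 as the bytes C3 80; read as latin-1, the
# second byte becomes the C1 control character U+0080.
utf8_name = "\u00c0.txt".encode("utf-8")
print([hex(ord(ch)) for ch in latin1_control_chars(utf8_name)])
```

Any UTF-8 continuation byte in the range 0x80-0x9F has the same effect,
so this is not limited to one unlucky character.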
IMO, file encodings are the business of the developers; filename
encodings are Mercurial's responsibility, at least to the degree that it
can do something about it.
- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFHC+u77M8hyUobTrERCDezAJ0Ygj70DHW1JeOjAkj9gWbZiv62yACePYQ7
jfAWTswlyePIeiiwIQd/Rr8=
=swkE
-----END PGP SIGNATURE-----
More information about the Mercurial-devel
mailing list