Unicode support in log messages and file names
Andrey
grooz-work at gorodok.net
Sat Nov 11 22:24:06 UTC 2006
> Do you really want to handle characters internally as 32-bit quantities?
> In practice, this will quadruple string overhead and break almost all of
> the ordinary string handling routines?
>
> I've been through this in several systems now. Using UTF-8 encoding
> internally is the right answer.
Python's string handling routines work perfectly well with unicode strings,
with no noticeable performance overhead. Moreover, they usually DO NOT work
on UTF-8 byte strings. For example, s[:3] or s.upper() won't give correct
results for UTF-8 strings containing non-Latin (multibyte) characters.
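
A small sketch of what I mean. It is written for Python 3, where str is the
unicode type and bytes plays the role of the UTF-8 byte string; the sample
text and the helper decodes_cleanly() are made up for illustration only.

```python
text = "Привет, мир"           # unicode string, 11 characters
raw = text.encode("utf-8")     # the same text as a UTF-8 byte string

# Slicing a unicode string counts characters.
head_text = text[:3]

# Slicing the byte string counts bytes. Each Cyrillic letter takes two
# bytes, so three bytes cut the second letter in half.
head_raw = raw[:3]


def decodes_cleanly(data):
    """Return True if data is valid, complete UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# upper() on unicode knows about Cyrillic; upper() on bytes only touches
# ASCII letters, so the byte string comes back unchanged.
upper_text = text.upper()
upper_raw = raw.upper()

print(head_text, decodes_cleanly(head_raw))
print(upper_text, upper_raw == raw)
```

The byte slice is not even valid UTF-8 any more, so it cannot be displayed
at all, let alone correctly.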
> Note also: it is insufficient to say "UNICODE UTF-8". You also need to
> specify the normalization.
>
> The normalization that is almost universally adopted for UNICODE is
> normalization C.
I am not sure normalization is necessary for us if all we want is to have
non-Latin log messages displayed correctly. :) Still, we can use
unicodedata.normalize() for that if it turns out to be needed.
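
If we did want normalization C, it could look roughly like this. Again this
is only a sketch in Python 3 syntax; normalize_message() is a hypothetical
helper name, not something that exists in Mercurial.

```python
import unicodedata

# The same visible letter "й" written two ways:
composed = "\u0439"            # one precomposed code point
decomposed = "\u0438\u0306"    # "и" followed by a combining breve


def normalize_message(msg):
    """Return msg in normalization form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", msg)


# Without normalization the two spellings compare as different strings.
same_before = composed == decomposed

# After NFC both collapse to the single precomposed code point.
same_after = normalize_message(composed) == normalize_message(decomposed)

print(same_before, same_after)
```

Both spellings display the same way, so this matters for comparing or
searching text rather than for just showing it.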