Unicode support in log messages and file names
Christian Boos
cboos at neuf.fr
Sat Nov 11 23:08:22 UTC 2006
Andrey wrote:
>> Do you really want to handle characters internally as 32-bit quantities?
>> In practice, this will quadruple string overhead and break almost all of
>> the ordinary string handling routines.
>>
>> I've been through this in several systems now. Using UTF-8 encoding
>> internally is the right answer.
>>
>
> Python string handling routines work perfectly with unicode strings without
> noticeable performance overhead. And moreover, they usually DO NOT work for
> UTF-8 byte strings. For example, s[:3] or s.upper() won't work for UTF-8
> strings containing non-Latin (multibyte) characters.
>
I was going to answer something like that, too.
Before version 0.10, Trac used UTF-8 byte strings internally, and we had
several weird issues caused by "the ordinary string handling routines",
which work on bytes and know _nothing_ about multi-byte characters...
All these issues went away once we switched to unicode internally.
Also, most Python builds use 16-bit quantities to store unicode
characters, so the overhead is two bytes per character rather than four,
if this was ever a concern.
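
To make the byte-versus-character point concrete, here is a minimal sketch.
It assumes present-day Python 3, where `str` plays the role of the unicode
type and `bytes` the role of the old UTF-8 byte string; the sample word and
the variable names are only illustrative, not taken from Trac or Mercurial.

```python
# `str` holds code points; `bytes` holds the UTF-8 encoding of the same text.
text = "Привет"                # 6 Cyrillic letters (illustrative sample)
raw = text.encode("utf-8")     # 12 bytes: each letter takes 2 bytes in UTF-8

# Slicing the text keeps whole characters.
first_three = text[:3]

# Slicing the bytes stops after 3 bytes, in the middle of the second letter.
chopped = raw[:3]
try:
    chopped.decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False       # the slice is no longer valid UTF-8

# str.upper() knows about Cyrillic; bytes.upper() only maps ASCII letters,
# so the encoded Cyrillic bytes come back unchanged.
upper_text = text.upper()
upper_raw = raw.upper()

print(first_three, truncated_ok, upper_text, upper_raw == raw)
```

The byte slice and the byte-level upper() do not raise any error on their
own, which is why such bugs tend to surface far away from their cause.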
-- Christian