Unicode support in log messages and file names

Christian Boos cboos at neuf.fr
Sat Nov 11 23:08:22 UTC 2006


Andrey wrote:
>> Do you really want to handle characters internally as 32-bit quantities?
>> In practice, this will quadruple string overhead and break almost all of
>> the ordinary string handling routines.
>>
>> I've been through this in several systems now. Using UTF-8 encoding
>> internally is the right answer.
>>     
>
> Python string handling routines work perfectly with unicode strings without 
> noticeable performance overhead. And moreover, they usually DO NOT work for 
> UTF-8 byte strings. For example, s[:3] or s.upper() won't work for UTF-8 
> strings containing non-Latin (multibyte) characters.
>   

I was going to answer something like that, too.
Before version 0.10, Trac used UTF-8 strings and we had several 
weird issues due to the use of "the ordinary string handling 
routines", which work on bytes and know _nothing_ about multi-byte 
characters... All these issues went away once we used unicode internally.
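
To make the s[:3] and s.upper() point concrete, here is a minimal sketch.
It is written for Python 3, where bytes stands in for the old UTF-8 byte
string and str for the unicode string. The sample word and the helper
decodes_cleanly are only illustrative and are not taken from Trac or
Mercurial.

```python
# bytes = UTF-8 encoded data, str = unicode text (Python 3 assumption).
text = "привет"              # 6 Cyrillic characters
raw = text.encode("utf-8")   # same text as UTF-8: 2 bytes per character

# Slicing counts characters on str, but bytes on the encoded form.
char_slice = text[:3]        # first three characters
byte_slice = raw[:3]         # first three bytes: splits the 2nd character


def decodes_cleanly(data):
    """Return True if data is a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case conversion knows the alphabet on str; on bytes it only
# touches ASCII letters, so the Cyrillic bytes are left unchanged.
upper_text = text.upper()
upper_raw = raw.upper()

if __name__ == "__main__":
    print(repr(char_slice), decodes_cleanly(byte_slice))
    print(repr(upper_text), upper_raw == raw)
```

The byte-level slice is no longer valid UTF-8, and the byte-level upper()
silently does nothing, which is the kind of issue described above.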

Also, most Python builds use 16-bit quantities for storing 
unicode characters, so the overhead is not that significant, if it was 
ever a concern.
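
As a rough illustration of the size question, the sketch below compares
the encoded length of one sample string under 8-, 16- and 32-bit
encodings. It only measures encoded sizes and is an assumption-laden
stand-in: it does not inspect how any particular Python build stores
unicode strings in memory, and the sample string is made up.

```python
# Compare how many bytes the same text needs in different encodings.
sample = "log message: привет"   # mixed ASCII and Cyrillic, hypothetical

# The "-le" variants are used so that no byte order mark is added.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {}
for name in encodings:
    sizes[name] = len(sample.encode(name))

if __name__ == "__main__":
    print("characters:", len(sample))
    for name in encodings:
        print(name, sizes[name], "bytes")
```

For text like this, the 16-bit form costs twice the character count
rather than four times, which is the sense in which the overhead is
smaller than the 32-bit worst case.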

-- Christian



More information about the Mercurial-devel mailing list