encoding.py bug?

Matt Mackall mpm at selenic.com
Sun Apr 22 20:43:15 UTC 2012


On Sun, 2012-04-22 at 15:48 -0400, Cesar Mena wrote:
> hi,
> 
> in encoding.py, stable branch, up to date, shouldn't the first attempt at
> encoding catch UnicodeEncodeError, as opposed to UnicodeDecodeError?
> 
> ie,
> 
> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> --- a/mercurial/encoding.py
> +++ b/mercurial/encoding.py
> @@ -169,7 +169,7 @@
>      "best-effort encoding-aware case-folding of local string s"
>      try:
>          return s.encode('ascii').lower()
> -    except UnicodeDecodeError:
> +    except UnicodeEncodeError:

Let's check:

        >>> "not äscii".encode('ascii')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
        
So the current code is correct. What's happening here? Here's a clue:

        >>> "not äscii".encode('utf-8')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
        
Note it's still complaining about 'ascii'. When we try to encode() a
byte string, we implicitly try to decode() the string first to a Unicode
object using the "default" encoding, before trying to encode it to the
target encoding.
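
Spelled out explicitly, that implicit step is roughly equivalent to this
(an illustration of the mechanism, not what encoding.py actually does):

        >>> "not äscii".decode('ascii').encode('utf-8')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)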

Ok, so what is the default encoding? In Python 2, as shipped, it's
always "ascii" unless someone has adventurously changed it to something
else globally:

        http://hg.python.org/releasing/2.7.3/file/7bb96963d067/Lib/site.py#l487
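
That's easy to verify on a stock Python 2 interpreter:

        >>> import sys
        >>> sys.getdefaultencoding()
        'ascii'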

Why are we doing this in the first place? Because Unicode case-folding
is ridiculously slow and most of the time we're folding ASCII strings:

        $ python -m timeit -c '"blahblah".decode("utf-8").lower().encode("ascii")'
        1000000 loops, best of 3: 1.42 usec per loop
        
This is faster:

        $ python -m timeit -c '"blahblah".encode("ascii").lower()'
        1000000 loops, best of 3: 0.55 usec per loop
        
Explicitly decoding and encoding is slow again:

        $ python -m timeit -c '"blahblah".decode("ascii").encode("ascii").lower()'
        1000000 loops, best of 3: 0.96 usec per loop
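
For pure-ASCII input all of these spellings produce the same bytes, which
is what makes the fast path a safe shortcut; a quick sanity check:

        >>> "BlahBlah".encode("ascii").lower()
        'blahblah'
        >>> "BlahBlah".decode("utf-8").lower().encode("ascii")
        'blahblah'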
        
In an ideal world, we could do:

        if s.isascii():
            return s.lower()
        
...which would be much faster (Python 2 strings have no isascii() method,
so the timing below uses isalpha() as a stand-in):
        
        $ python -m timeit -s 's = "blahblah"' -c 'if s.isalpha(): s.lower()'
        10000000 loops, best of 3: 0.163 usec per loop

But I suppose we should protect against people messing with the default
encoding, probably by changing the exception type to UnicodeError.
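
Something like this, roughly (a sketch of the shape of the fix, not a
tested patch against encoding.py):

        try:
            return s.encode('ascii').lower()  # fast path for ASCII-only strings
        except UnicodeError:  # covers UnicodeDecodeError and UnicodeEncodeError
            pass  # fall back to the slower encoding-aware folding below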

-- 
Mathematics is the supreme nostalgia of our time.




