encoding.py bug?
Cesar Mena
cesar.mena at gmail.com
Mon Apr 23 01:38:56 UTC 2012
On Sun, Apr 22, 2012 at 4:43 PM, Matt Mackall <mpm at selenic.com> wrote:
> On Sun, 2012-04-22 at 15:48 -0400, Cesar Mena wrote:
> > hi,
> >
> > in encoding.py, stable branch, up to date, shouldn't the first attempt at
> > encoding catch UnicodeEncodeError, as opposed to UnicodeDecodeError?
> >
> > ie,
> >
> > diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> > --- a/mercurial/encoding.py
> > +++ b/mercurial/encoding.py
> > @@ -169,7 +169,7 @@
> >      "best-effort encoding-aware case-folding of local string s"
> >      try:
> >          return s.encode('ascii').lower()
> > -    except UnicodeDecodeError:
> > +    except UnicodeEncodeError:
>
> Let's check:
>
> >>> "not äscii".encode('ascii')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>
> So the current code is correct. What's happening here? Here's a clue:
>
> >>> "not äscii".encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>
> Note it's still complaining about 'ascii'. When we try to encode() a
> byte string, we implicitly try to decode() the string first to a Unicode
> object using the "default" encoding, before trying to encode it to the
> target encoding.
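
The two-step behaviour described above can be sketched in Python 3, where bytes no longer have an encode() method and the implicit step has to be spelled out. This is only an emulation under stated assumptions: py2_style_encode and error_name are hypothetical helper names, and the "default" argument stands in for Python 2's global default encoding.

```python
def py2_style_encode(raw: bytes, target: str, default: str = "ascii") -> bytes:
    """Emulate Python 2's str.encode(): decode with the default codec, then encode."""
    text = raw.decode(default)   # implicit step in Python 2; may raise UnicodeDecodeError
    return text.encode(target)   # explicit step; may raise UnicodeEncodeError


def error_name(raw: bytes, target: str, default: str = "ascii") -> str:
    """Return the name of the Unicode exception raised, or "ok" if none."""
    try:
        py2_style_encode(raw, target, default)
    except UnicodeError as exc:
        return type(exc).__name__
    return "ok"


# UTF-8 bytes for "not äscii"; they contain 0xc3 0xa4 at positions 4 and 5.
utf8_bytes = "not \u00e4scii".encode("utf-8")

print(error_name(utf8_bytes, "ascii"))                   # decode step fails first
print(error_name(utf8_bytes, "utf-8"))                   # same failure; the target is irrelevant
print(error_name(utf8_bytes, "ascii", default="utf-8"))  # changed default: encode step fails
print(error_name(b"blahblah", "ascii"))                  # pure ASCII passes both steps
```

With the default left at "ascii" the failure is always a UnicodeDecodeError, whatever the target codec is. Only when the default is changed does the encode step get a chance to fail with UnicodeEncodeError.
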
>
> Ok, so what is the default encoding? In Python 2, as shipped, it's
> always "ascii" unless someone has adventurously changed it to something
> else globally:
>
>
> http://hg.python.org/releasing/2.7.3/file/7bb96963d067/Lib/site.py#l487
>
> Why are we doing this in the first place? Because Unicode case-folding
> is ridiculously slow and most of the time we're folding ASCII strings:
>
> $ python -m timeit -c '"blahblah".decode("utf-8").lower().encode("ascii")'
> 1000000 loops, best of 3: 1.42 usec per loop
>
> This is faster:
>
> $ python -m timeit -c '"blahblah".encode("ascii").lower()'
> 1000000 loops, best of 3: 0.55 usec per loop
>
> Explicitly decoding and encoding is slow again:
>
> $ python -m timeit -c '"blahblah".decode("ascii").encode("ascii").lower()'
> 1000000 loops, best of 3: 0.96 usec per loop
>
> In an ideal world, we could do:
>
> if s.isascii():
>     return s.lower()
>
> ...which would be much faster:
>
> $ python -m timeit -s 's = "blahblah"' -c 'if s.isalpha(): s.lower()'
> 10000000 loops, best of 3: 0.163 usec per loop
>
> But I suppose we should protect against people messing with the default
> encoding, probably by changing the exception type to UnicodeError.
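
Catching UnicodeError works because it is the common base class of both UnicodeDecodeError and UnicodeEncodeError, so the fast path stays safe whichever step fails. A minimal Python 3 sketch of an ASCII fast path with a slower fallback follows. fold_case is a hypothetical name, the "utf-8" fallback encoding is an assumption, and this is not the actual code in mercurial/encoding.py.

```python
def fold_case(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: lower-case a byte string, trying pure ASCII first."""
    try:
        # Fast path check: succeeds only when every byte is ASCII.
        raw.decode("ascii")
    except UnicodeError:
        # Slow path: full Unicode case-folding in the assumed encoding.
        # Raises UnicodeDecodeError if raw is not valid in that encoding.
        return raw.decode(encoding).lower().encode(encoding)
    return raw.lower()  # bytes.lower() only touches ASCII letters


# Both specific exceptions are covered by the single UnicodeError handler.
print(issubclass(UnicodeDecodeError, UnicodeError))
print(issubclass(UnicodeEncodeError, UnicodeError))

print(fold_case(b"BlahBlah"))
print(fold_case("NOT \u00c4SCII".encode("utf-8")).decode("utf-8"))
```

The handler is deliberately broad: it does not need to know whether the decode or the encode half of the conversion failed, only that the string is not plain ASCII.
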
>
> --
> Mathematics is the supreme nostalgia of our time.
>
Thank you for the detailed explanation. A patch to catch UnicodeError instead
has been submitted to stable.
-cm