[PATCH 1 of 2] encoding: make utf8b encoder more robust (issue4927)
Matt Mackall
mpm at selenic.com
Fri Nov 6 17:09:35 UTC 2015
On Fri, 2015-11-06 at 23:03 +0900, Yuya Nishihara wrote:
> On Wed, 4 Nov 2015 22:46:26 +0900, Yuya Nishihara wrote:
> > It might be possible to use the error handler to map invalid chars
> > to \udcxx, but I've never tried it, and it seems the handler table
> > is global.
> >
> > https://docs.python.org/2.7/library/codecs.html#codecs.register_error
>
> Catching the error won't work if the source string contains a valid
> surrogate-encoded sequence.
>
> >>> s = u'\udc00'.encode('utf-8')
> >>> encoding.toutf8b(s)
> '\xed\xb0\x80' # should be '\xed\xb3\xad\xed\xb2\xb0\xed\xb2\x80'?
> >>> encoding.fromutf8b(encoding.toutf8b(s))
> '\x00'
Don't worry, I've got a stack of changes to fix this that holds up
under thorough fuzz-testing.
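
For reference, the expected value quoted above and the round-trip property
can be checked with a minimal sketch. This is not Mercurial's
implementation: it assumes Python 3, and the helpers toutf8b and fromutf8b
below are hypothetical stand-ins built on the standard surrogateescape and
surrogatepass error handlers. The small random loop only illustrates the
idea of fuzz-testing the round trip.

```python
import random


def toutf8b(s):
    """Hypothetical stand-in: encode arbitrary bytes as UTF-8b."""
    # Bytes that are not valid UTF-8 are mapped to U+DC80..U+DCFF by
    # surrogateescape; surrogatepass then writes those code points out
    # as ordinary 3-byte sequences.
    return s.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass')


def fromutf8b(s):
    """Hypothetical stand-in: recover the original bytes from UTF-8b."""
    return s.decode('utf-8', 'surrogatepass').encode('utf-8', 'surrogateescape')


# The byte string that u'\udc00'.encode('utf-8') produces on Python 2.
s = b'\xed\xb0\x80'

# Python 3's strict decoder rejects the encoded surrogate, so each of the
# three bytes is escaped separately, giving the "should be" value.
assert toutf8b(s) == b'\xed\xb3\xad\xed\xb2\xb0\xed\xb2\x80'
assert fromutf8b(toutf8b(s)) == s

# A tiny deterministic fuzz loop: every random byte string must round-trip.
rng = random.Random(4927)
for _ in range(1000):
    data = bytes(rng.randrange(256) for _ in range(rng.randrange(16)))
    assert fromutf8b(toutf8b(data)) == data
```

The sketch sidesteps the global handler table because it only uses
handlers that are already registered by default in Python 3.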
--
Mathematics is the supreme nostalgia of our time.