[PATCH 2 of 3] templater: replace jsonescape in main json templater (issue4926)
Matt Mackall
mpm at selenic.com
Thu Jan 14 18:49:27 UTC 2016
On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote:
> On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> > On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > > # HG changeset patch
> > > > # User Matt Mackall <mpm at selenic.com>
> > > > # Date 1452542432 21600
> > > > # Mon Jan 11 14:00:32 2016 -0600
> > > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > > # Parent 544d391bd3b42b96975a3521b73c25223db930b0
> > > > templater: replace jsonescape in main json templater (issue4926)
> > > >
> > > > This version differs in a couple ways:
> > > >
> > > > - it skips optional escaping of codepoints > U+007f
> > > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > > - it uses UTF-8b to round-trip undecodeable bytes
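
(A minimal Python 3 sketch of the first three points in the list above. It
uses the standard json module as a stand-in, not Mercurial's
encoding.jsonescape, and the helper names escape_all and escape_minimal are
hypothetical.)

```python
import json


def escape_all(text):
    # json.dumps escapes every code point above U+007F by default.
    # Characters outside the BMP become a UTF-16 surrogate pair.
    return json.dumps(text)


def escape_minimal(text):
    # With ensure_ascii=False only the mandatory escapes are applied
    # (quotes, backslash, control characters); the rest is emitted raw.
    return json.dumps(text, ensure_ascii=False)


emoji = "\U0001F600"
ascii_form = escape_all(emoji)    # surrogate pair, pure ASCII
raw_form = escape_minimal(emoji)  # the character itself, unescaped

# U+2028 (LINE SEPARATOR) is legal inside a JSON string, so minimal
# escaping passes it through. Older JavaScript engines treat it as a
# line terminator when the JSON text is evaluated as script source.
line_sep = escape_minimal("a\u2028b")
```

Both forms decode to the same string with json.loads; the difference only
matters to consumers that cannot handle raw non-ASCII output.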
> > >
> > > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > > where only 7bit ASCII is allowed,
> >
> > Example scenarios, please.
>
> HGENCODING=utf-8
> export HGENCODING
>
> hg init a
> cd a
> touch foo
> hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
> hg serve --encoding iso-8859-1
>
> Then access http://localhost:8000/graph/tip .
> (In our real-world example, --encoding Shift_JIS and Japanese characters.)
>
> Before this patch, there was no mojibake because "À" is escaped to "\u00c0".
> With this patch, "À" is lost as follows:
>
> u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
> -> "\xed\xb3\x80" (iso-8859-1)
>
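(The byte chain above can be reproduced with a short Python 3 sketch. This is
an approximation under stated assumptions: the surrogateescape and
surrogatepass error handlers stand in for Mercurial's toutf8b(), and
json.dumps stands in for the old escaping behaviour.)

```python
import json

text = "\u00c0"  # "À"

# What the templater sees when the web encoding is iso-8859-1.
local = text.encode("iso-8859-1")  # b"\xc0"

# Assumed stand-in for UTF-8b: the undecodable byte 0xC0 is mapped to
# the lone surrogate U+DCC0, which is then written out as three bytes.
utf8b = local.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")

# The page is served as iso-8859-1, so the browser decodes those three
# bytes as three unrelated characters instead of "À".
seen_by_browser = utf8b.decode("iso-8859-1")

# Escaping to ASCII survives the same trip, because the escaped form
# contains only 7-bit characters.
escaped = json.dumps(text)
recovered = json.loads(escaped.encode("ascii").decode("iso-8859-1"))
```

Here utf8b matches the "\xed\xb3\x80" shown in the chain, while recovered is
the original "À".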
> > There's no configuration of hgweb that won't potentially display non-ASCII
> > if it exists in files. If you commit Unicode "á" to a file and fire up
> > "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design
> > (and the correct bytes verbatim if you select raw mode). So I'm not sure
> > what you mean by "allowed". I guess we could get into trouble if we expand
> > JSON directly into some in-page Javascript when the page metadata marks it
> > as non-UTF8.
>
> JSON data can be embedded in a non-UTF8 page so long as it is represented in
> ASCII and the page encoding is compatible with ASCII.
Ok.
> > > and JSON input (i.e. template string)
> > > is a local-encoding text in general.
> >
> > encoding.jsonescape (indirectly) knows about localstr objects, and thus
> > recovers the original UTF-8 text to encode if it exists.
>
> Yes, but localstr is mostly lost in templater,
Oh? Please elaborate.
> and toutf8b() takes it as bytes,
> not as local-encoding text.
You may be right, but according to the docstring, that means you've found a bug,
> > > I have a patch series to fix issue4926, but I found that my patch seems
> > > to have the emoji issue right now.
> >
> > Whatever we do, we need to kill the second implementation of jsonescape in
> > the templater.
>
> Sure. My series will do:
>
> 1. add option to escape all non-ASCII characters by encoding.jsonescape()
Ok.
> 2. add "|utf8" template filter to explicitly convert localstr|str to utf-8
Sounds suspicious. What's it going to do with filenames?
> 3. change "|json" to take input as utf8b bytes (BC)
Sounds like a really big break from our encoding philosophy.
--
Mathematics is the supreme nostalgia of our time.