[PATCH 2 of 3] templater: replace jsonescape in main json templater (issue4926)

Matt Mackall mpm at selenic.com
Thu Jan 14 18:49:27 UTC 2016


On Thu, 2016-01-14 at 22:12 +0900, Yuya Nishihara wrote:
> On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> > On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > > # HG changeset patch
> > > > # User Matt Mackall <mpm at selenic.com>
> > > > # Date 1452542432 21600
> > > > #      Mon Jan 11 14:00:32 2016 -0600
> > > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > > # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> > > > templater: replace jsonescape in main json templater (issue4926)
> > > > 
> > > > This version differs in a couple ways:
> > > > 
> > > > - it skips optional escaping of codepoints > U+007f
> > > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > > - it uses UTF-8b to round-trip undecodable bytes
> > > 
> > > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > > where only 7bit ASCII is allowed,
> > 
> > Example scenarios, please.
> 
> HGENCODING=utf-8
> export HGENCODING
> 
> hg init a
> cd a
> touch foo
> hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
> hg serve --encoding iso-8859-1
> 
> Then access http://localhost:8000/graph/tip.
> (In our real-world example, --encoding Shift_JIS and Japanese characters.)
> 
> Before this patch, there was no mojibake because "À" was escaped to "\u00c0".
> With this patch, "À" is lost as follows:
> 
>   u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
>   -> "\xed\xb3\x80" (iso-8859-1)
> 
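
A minimal sketch of the byte path traced above, for anyone who wants to
reproduce it outside hgweb. It assumes Python 3 and uses the standard
surrogateescape/surrogatepass error handlers as a stand-in for Mercurial's
toutf8b(); the helper name utf8b_encode is hypothetical and this is not the
code in encoding.py.

```python
def utf8b_encode(raw: bytes) -> bytes:
    """Approximate UTF-8b (assumption: stand-in for Mercurial's toutf8b).

    Valid UTF-8 passes through unchanged; each undecodable byte 0xNN is
    mapped to the lone surrogate U+DCNN and written out as three bytes.
    """
    text = raw.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogatepass")


# Commit message text as the author typed it.
commit_text = "\u00c0"  # "À"

# hgweb running with --encoding iso-8859-1 holds it as local bytes.
local_bytes = commit_text.encode("iso-8859-1")

# Treating those local bytes as raw bytes yields the UTF-8b escape.
escaped = utf8b_encode(local_bytes)

# The page is declared iso-8859-1, so the browser decodes byte by byte.
shown = escaped.decode("iso-8859-1")

print(repr(local_bytes), repr(escaped), repr(shown))
```

With --encoding iso-8859-1 the browser would therefore see three junk
characters where the old "\u00c0" escape produced "À".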
> > There's no configuration of hgweb that won't potentially display non-ASCII if
> > it exists in files. If you commit Unicode "á" to a file and fire up
> > "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and
> > the correct bytes verbatim if you select raw mode). So I'm not sure what you
> > mean by "allowed". I guess we could get into trouble if we expand JSON directly
> > into some in-page Javascript when the page metadata marks it as non-UTF8.
> 
> JSON data can be embedded in a non-UTF8 page so long as it is represented in
> ASCII and the page encoding is compatible with ASCII.

Ok.
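
As a rough illustration of the ASCII-only point, here is a sketch using
Python 3's standard json module rather than Mercurial's jsonescape (the helper
names json_ascii and json_raw are hypothetical). With ensure_ascii=True every
non-ASCII character becomes a \uXXXX escape, and characters outside the BMP
become a UTF-16 surrogate pair, so the output is pure 7-bit ASCII and decodes
to the same text under any ASCII-compatible encoding.

```python
import json


def json_ascii(text: str) -> str:
    """Escape every non-ASCII character; output is pure 7-bit ASCII."""
    return json.dumps(text, ensure_ascii=True)


def json_raw(text: str) -> str:
    """Leave non-ASCII characters as-is; output depends on page encoding."""
    return json.dumps(text, ensure_ascii=False)


latin = "\u00c0"       # "À"
emoji = "\U0001f600"   # outside the BMP, needs a surrogate pair in JSON

escaped_latin = json_ascii(latin)
escaped_emoji = json_ascii(emoji)

# The escaped form survives being served under a non-UTF-8 page encoding.
as_served = escaped_latin.encode("ascii").decode("iso-8859-1")

print(escaped_latin, escaped_emoji, json.loads(as_served) == latin)
```

The unescaped form from json_raw is only safe when the page encoding matches
the encoding of the bytes actually sent.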

> > >  and JSON input (i.e. template string)
> > > is a local-encoding text in general.
> > 
> > encoding.jsonescape (indirectly) knows about localstr objects, and thus
> > recovers the original UTF-8 text to encode if it exists.
> 
> Yes, but localstr is mostly lost in templater,

Oh? Please elaborate.

>  and toutf8b() takes it as bytes,
> not as local-encoding text.

You may be right, but according to the docstring, that means you've found a bug.

> > > I have a patch series to fix issue4926, but I found my patch seems to have
> > > the emoji issue right now.
> > 
> > Whatever we do, we need to kill the second implementation of jsonescape in the
> > templater.
> 
> Sure. My series will do:
> 
>  1. add option to escape all non-ASCII characters by encoding.jsonescape()

Ok.

>  2. add "|utf8" template filter to explicitly convert localstr|str to utf-8

Sounds suspicious. What's it going to do with filenames?

>  3. change "|json" to take input as utf8b bytes (BC)

Sounds like a really big break from our encoding philosophy.

-- 
Mathematics is the supreme nostalgia of our time.



