Current py3k stage and next steps
Martin Geisler
mg at lazybytes.net
Mon Jun 28 19:51:41 UTC 2010
Matt Mackall <mpm at selenic.com> writes:
> On Mon, 2010-06-28 at 18:05 +0200, Martin Geisler wrote:
>> (I'm sorry about the top post... my phone insists)
>>
>> I think your example f) shows the main point: you cannot mix bytes
>> and text.
>>
>> This is already broken today - you must not output raw bytes in the
>> middle of a string encoded in the local encoding. It is a bug to do
>> so since you can end up producing a byte stream with a mixed encoding
>> such as Latin-1 inside UTF-8.
>
> Sorry, that's just a rule you invented to make the world a less scary
> place. But there is no such rule.
>
> If someone creates a file like this:
>
> $ echo "This is what Latin1 looks like: blah blah blah" > latin1.txt
> <switch to UTF-8>
> $ echo "And this is what UTF-8 looks like: blah blah blah" > utf-8.txt
> $ hg ci -Am"encoding examples"
> $ hg export tip > example.patch
>
> ..that last line will work today without complaint, and it will
> generate a patch that patch(1) understands and recreates a file with
> the same byte contents. -That- is the rule, and anything else is
> wrong.
It sounds like you have the impression that I want to somehow enforce
that the above patch has a single encoding? I fully understand that we
cannot control/guess the encoding of bytes in files.
When constructing a patch, we should continue to do what we do today.
Despite being pretty readable, I consider the patch format a binary
format. We also treat it like that in the current code: we do not
translate the special strings recognized by patch(1) or use unicode
strings or anything stupid like that.
> Really, Martin, it's high time you wrapped your head around the idea
> of being encoding-agnostic: we only care about the encoding of data
> when we need to. It's how Unix works and it's how Mercurial works and
> it's a perfectly valid alternative (if not bloody obviously superior!)
> to the "everything is characters in some knowable and consistent
> encoding" approach on Windows.
Sure!
All I'm suggesting is that we make a clear separation between strings
which are part of the interface and which should therefore be translated
and transcoded, and other strings which are binary strings used for
outputting patches and the like.
So 'hg export tip' gives binary output, 'hg log -l 3' gives transcoded
output, and 'hg log -l 3 -p' must unfortunately give mixed output.
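To make the "mixed output" case concrete, here is a small Python 3 sketch. The strings are made up for illustration and nothing here is Mercurial code; it only assumes a UTF-8 terminal and a file whose contents happen to be stored as Latin-1. Concatenating the transcoded header with the raw patch bytes gives a stream that cannot be decoded with a single encoding:

```python
# Assumption: the user's terminal encoding is UTF-8.
local_encoding = "utf-8"

# Interface text ("summary: r\u00e9sum\u00e9"), transcoded to the local encoding.
header = "summary: r\u00e9sum\u00e9\n".encode(local_encoding)

# File content as stored in the repository: Latin-1 bytes, passed through raw.
patch_line = "+bl\u00e5b\u00e6r\n".encode("latin-1")

# What 'hg log -p' style output would have to emit: both, back to back.
mixed = header + patch_line

try:
    mixed.decode(local_encoding)
    decodes_cleanly = True
except UnicodeDecodeError:
    # The Latin-1 bytes are not valid UTF-8, so the stream has no single encoding.
    decodes_cleanly = False
```

The header on its own decodes fine, and the patch line on its own is just bytes; it is only the combination that has no well-defined encoding.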
The thing I want to make explode is

  _("foo %s bar") % rawbytes

which inserts some scary bytes in the middle of a unicode string. In a
world of strict separation between unicode and str objects, this would
have to be output using code similar to

  ui.write(_("foo"))
  ui.writeraw(rawbytes)
  ui.write(_("bar"))

which I believe Antoine already suggested somewhere else.
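A rough Python 3 sketch of what such a separation could look like follows. The SketchUI class, its constructor arguments, and the identity _() function are hypothetical stand-ins for illustration; this is not the actual ui API, and writeraw is only the name proposed above:

```python
import io


def _(message):
    # Placeholder for gettext: returns the message untranslated.
    return message


class SketchUI:
    """Hypothetical stand-in for a ui object with a strict text/bytes split."""

    def __init__(self, out, encoding="utf-8"):
        self.out = out            # binary output stream
        self.encoding = encoding  # assumed local encoding

    def write(self, text):
        # Interface strings: must be text, transcoded to the local encoding.
        if not isinstance(text, str):
            raise TypeError("write() expects text, got %s" % type(text).__name__)
        self.out.write(text.encode(self.encoding, "replace"))

    def writeraw(self, data):
        # Binary data (patches, file contents): passed through untouched.
        if not isinstance(data, bytes):
            raise TypeError("writeraw() expects bytes, got %s" % type(data).__name__)
        self.out.write(data)


buf = io.BytesIO()
ui = SketchUI(buf)
rawbytes = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

ui.write(_("foo "))
ui.writeraw(rawbytes)
ui.write(_(" bar"))
output = buf.getvalue()

# Handing raw bytes to the text channel explodes instead of passing silently.
try:
    ui.write(rawbytes)
    exploded = False
except TypeError:
    exploded = True
```

One caveat with relying on the language alone: in Python 3, "%s" formatting of a bytes object into a str does not raise by default, it inserts the repr (b'...'), so the explicit type checks in write() and writeraw() are what enforce the split in this sketch.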
--
Martin Geisler
Mercurial links: http://mercurial.ch/