A proposal to solve the encoding problem on Windows.

罗勇刚(Yonggang Luo) luoyonggang at gmail.com
Fri Oct 21 17:03:17 UTC 2011


2011/10/22 Matt Mackall <mpm at selenic.com>

> On Fri, 2011-10-21 at 08:48 -0700, Andrey wrote:
> > >
> > > > The most important goal for me was actually this: 2 (3?). use utf8 as
> > > > the default encoding for new commits.
> > > >
> > > > Now I see (thanks, Matt), that it may introduce serious regression
> > > > problems. I need some time to think about a possible solution.
> > > >
> > > >>   if all files in manifest are valid UTF-8:
> > > >>     # repo is already in UTF-8 mode or is pure ASCII
> > > >>     mode = utf8transcoding
> > > >
> > > > This check is just a guess. We cannot rely on it. In general, it is
> > > > not possible to detect the encoding from the sequence of bytes.
> > >
> > > You're right in principle that a Latin-1 encoded text with "pÃ¦re"
> > > also happens to be the UTF-8 encoding of "pære". However, what Matt
> > > writes is that the chance of that happening is small and so it is okay
> > > with him to declare a text to be UTF-8 if it can be correctly decoded
> > > as such.
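
The ambiguity Martin describes can be checked directly. A small sketch in Python 3 (the variable names are only illustrative): the five bytes that UTF-8 reads as "pære" are also a perfectly valid Latin-1 string, which displays as "pÃ¦re".

```python
# The same byte sequence, interpreted under two different encodings.
raw = "pære".encode("utf-8")        # five bytes: 70 c3 a6 72 65

as_utf8 = raw.decode("utf-8")       # 'æ' is the two-byte sequence c3 a6
as_latin1 = raw.decode("latin-1")   # Latin-1 maps every byte to one character

print(repr(raw))
print(as_utf8)
print(as_latin1)
```

Latin-1 accepts any byte sequence, so the bytes alone cannot rule it out; the argument in the thread is only that real Latin-1 text rarely happens to form valid UTF-8 sequences.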
> > >
> > > --
> > > Martin Geisler
> > >
> > >
> > What I mean is that UTF-16 encoded text may look like (have the same
> > bytes as) the UTF-8 encoded text.
>
> This is incorrect.
>
> >>> u = u'Here is a string. Български'
> >>> u.encode('utf-8')
> 'Here is a string. \xd0\x91\xd1\x8a\xd0\xbb\xd0\xb3\xd0\xb0\xd1\x80\xd1
> \x81\xd0\xba\xd0\xb8'
> >>> u.encode('utf-16')
> '\xff\xfeH\x00e\x00r\x00e\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00t\x00r
> \x00i\x00n\x00g\x00.\x00 \x00\x11\x04J\x04;\x043\x040\x04@\x04A\x04:
> \x048\x04'
>
> UTF-8 has a couple key properties:
>
> - if a byte looks like ASCII, it is ASCII
> - the first byte of a multibyte character is always of the form 11xxxxxx
> - the first byte encodes the length of the character in bytes
> - the second and later bytes are of the form 10xxxxxx
>
> ..which means that it's very easy to recognize properly encoded UTF-8
> from even a small sample with very high probability. Here's a quick
> program that generates 50k random byte strings of each length and
> reports what percentage are valid UTF-8:
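
A minimal sketch of such an experiment, written for Python 3. This is not the original uc.py; the function names `is_valid_utf8` and `valid_fraction` are assumptions made for illustration, and validity is judged by Python's strict built-in UTF-8 decoder.

```python
import random


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_fraction(length: int, trials: int, rng: random.Random) -> float:
    """Fraction of `trials` random byte strings of `length` that are valid UTF-8."""
    hits = 0
    for _ in range(trials):
        # Draw 8 * length random bits and turn them into `length` bytes.
        sample = rng.getrandbits(8 * length).to_bytes(length, "little")
        if is_valid_utf8(sample):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random()
    for length in range(1, 21):
        pct = 100.0 * valid_fraction(length, 50000, rng)
        print("%2d %.8f%%" % (length, pct))
```

The exact percentages vary from run to run, and a strict decoder may reject a few sequences that a more lenient one accepts, so the figures need not match the listing digit for digit.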


> $ python uc.py
>  1 50.19400000%
>  2 28.46400000%
>  3 15.50600000%
>  4 8.90000000%
>  5 5.05600000%
>  6 2.90200000%
>  7 1.56200000%
>  8 0.96000000%
>  9 0.46400000%
> 10 0.28000000%
> 11 0.13400000%
> 12 0.08000000%
> 13 0.05200000%
> 14 0.03000000%
> 15 0.02600000%
> 16 0.01200000%
> 17 0.00400000%
> 18 0.00200000%
> 19 0.00000000%
> 20 0.00600000%
>
> Given our manifest will be more like 10k - 1MB rather than just 20
> bytes, the odds of getting confused here are really quite negligible.
>
Yes, detecting UTF-8 is fairly easy. But after migrating from an old repo
to a UTF-8 repo, the detection becomes meaningless, because the same
repository then contains two different encodings: the old one (such as
CP936 or CP1251) alongside UTF-8.

> Also, as it's impossible to store UTF-16 in Mercurial's -manifest-
> (where we store filenames) due to the presence of NUL bytes, there's no
> chance of confusion.
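
The NUL-byte point can be illustrated with a short Python 3 sketch. The path below is a made-up example; the only assumption is that the name contains at least one ASCII character (such as "/" or "."), which UTF-16 encodes with a zero high byte.

```python
# A hypothetical file path mixing ASCII and Cyrillic characters.
name = "docs/Български.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")   # explicit byte order, no byte-order mark

has_nul_utf8 = b"\x00" in utf8     # UTF-8 never emits 0x00 for non-NUL characters
has_nul_utf16 = b"\x00" in utf16   # every ASCII character becomes <byte> 0x00

print(has_nul_utf8, has_nul_utf16)
```

Any format that treats NUL as a delimiter, as the manifest is described as doing here, would therefore be unable to hold such a UTF-16 name intact.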
>
> And elsewhere:
>
> > As far as I understand, if you create a file now on Windows, you will get
> > a UTF-16 encoded name, because this is how Python 2 gets the name from the
> > OS. That is why the existing repositories do contain UTF-16 encoded names.
>
> Also wrong. What actually happens is that UTF-16 gets decoded into the
> 8-bit "filesystem encoding" when using the standard C interfaces that
> Python 2 wraps. This encoding may be different from the console encoding
> or the GUI encoding. And the console and GUI encodings generally don't
> agree anyway.
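
A rough illustration of the disagreement between 8-bit encodings, in Python 3. It assumes a Russian-locale Windows setup, where the "ANSI" (GUI) code page is typically 1251 and the OEM console code page is typically 866; the variable names are illustrative.

```python
name = "Български"

ansi = name.encode("cp1251")   # assumed GUI / "ANSI" code page
oem = name.encode("cp866")     # assumed console / OEM code page

# The same name yields different bytes under the two code pages.
same_bytes = ansi == oem

# Reading ANSI bytes as if they were OEM bytes produces mojibake, not an error.
misread = ansi.decode("cp866")

# A code page without Cyrillic cannot represent the name at all.
lossy = name.encode("cp1252", errors="replace")

print(same_bytes)
print(misread)
print(lossy)
```

Bytes recorded under one code page are therefore not portable to a machine, or even a different program on the same machine, that assumes another.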
>
> --
> Mathematics is the supreme nostalgia of our time.



-- 
Yours sincerely,
Yonggang Luo (罗勇刚)