[PATCH STABLE V2] i18n: fix case folding problem with problematic encodings

Matt Mackall mpm at selenic.com
Wed Nov 30 18:35:55 UTC 2011


On Wed, 2011-11-30 at 15:24 +0900, FUJIWARA Katsunori wrote:
> At Tue, 29 Nov 2011 15:38:30 -0600,
> Matt Mackall wrote:
> > 
> > On Wed, 2011-11-30 at 05:24 +0900, FUJIWARA Katsunori wrote:
> > > # HG changeset patch
> > > # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> > > # Date 1322598040 -32400
> > > # Branch stable
> > > # Node ID 5bf954f0303aefbcbfc2eefbefc5d7e9f95b98a7
> > > # Parent  e387e760b207383c961ed8accd35583791a33bb0
> > > i18n: fix case folding problem with problematic encodings
> > > 
> > > changeset 28e98a8b173d for case folding problem with problematic
> > > encoding was not enough.
> > > 
> > > this patch covers up a fault of fix in it.
> > 
> > Eep, way too much in one patch. Each of these bullet points ought to be
> > its own patch.
> > 
> > >   - switch internal format from str to unicode for "util.fspath()"
> > 
> > Broken broken broken on Linux. You can have _any bytes except null and /
> > in a valid Unix filename_, which means they can't be assumed to be
> > decodable in any encoding, let alone the current user's personal
> > encoding. Sensible users will use UTF-8 and UTF-8 only and only exchange
> > files with other people using UTF-8, but there's no guarantee that users
> > are sensible.
> > 
> > (NTFS has a related issue: filenames can be arbitrary 16-bit strings,
> > and needn't map into the valid UTF-16 codepoint space.)
> 
> Thank you for your comment. I'll re-write with your suggestions.
> 
> > >   - switch from "str.lower()" to "encoding.lower()"
> > 
> > Again, lower() is known to be wrong for NTFS. We need to use upper().
> > 
> > https://blogs.msdn.com/b/michkap/archive/2005/01/16/353873.aspx
> 
> Please confirm my understanding.
> 
> "use upper()" seems to consist of below actions.
> 
>   1. use "upper()" (or NEW "encoding.upper()") for "posix.normcase()"

No. The only place "NTFS" and "POSIX" are related is in Microsoft's
dreams. Different filesystems -must- have different folding rules.
Please examine the link I gave you earlier:

http://www.selenic.com/hg/file/ad686c818e1c/mercurial/posix.py#l174

HFS+ does:

- LOWER case 
- Unicode NFD normalization
- percent-escaping

...so we should mirror those rules on Mac when we detect case-folding.
But they'd be wrong on any other filesystem.
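
A rough illustration of the three HFS+ rules listed above, as a minimal
Python 3 sketch. The name hfs_like_normcase is hypothetical and this is
not the code behind the posix.py link; it assumes paths are byte strings
that are mostly UTF-8 and percent-escapes any byte that fails to decode.

```python
import unicodedata


def hfs_like_normcase(path: bytes) -> bytes:
    """Sketch of HFS+-style folding: percent-escape, NFD, then lower."""
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN,
    # so decoding never raises on arbitrary filename bytes.
    text = path.decode("utf-8", errors="surrogateescape")
    # Replace each escaped byte with a %XX sequence.
    escaped = "".join(
        "%%%02X" % (ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )
    # Decompose first, then lowercase (this also lowers the hex digits).
    return unicodedata.normalize("NFD", escaped).lower().encode("utf-8")


ascii_folded = hfs_like_normcase(b"README")
accent_folded = hfs_like_normcase("\u00c9".encode("utf-8"))  # precomposed E-acute
invalid_folded = hfs_like_normcase(b"a\xffB")  # 0xff is never valid UTF-8
```

Nothing here is appropriate for NTFS or VFAT; the point is only that the
folding function has to be chosen per filesystem.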

(Note that this means that util.normcase is actually charged with
handling other forms of folding/mapping as well.)

On Linux, where we can have any one of HFS+, NTFS, VFAT, ISO9660, etc.
connected either natively, or via one of dozens of network filesystems,
we're going to have a really hard time figuring out the underlying
case-folding rules for a given path. Also note that the character set
used to mount a non-native filesystem may disagree with the user's
locale. For instance, NTFS can be mounted in a mode where filenames are
represented as UTF-8, but a given user uses Latin1, or vice-versa. The
conservative thing to do here is str.lower(). This will be good enough
for something like 99% of users: 90% don't use non-native filesystems,
and 90% of the rest won't encounter case-collisions of non-ASCII
characters. 
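
To make the "conservative" option concrete, here is a small sketch using
Python 3's bytes.lower() as a stand-in for str.lower() on byte strings.
It folds only ASCII letters and never has to decode anything, so it
cannot fail on odd bytes, but it also misses non-ASCII case collisions.
The helper names conservative_fold and find_collisions are made up for
this example.

```python
from collections import defaultdict


def conservative_fold(name: bytes) -> bytes:
    """ASCII-only folding; bytes >= 0x80 pass through unchanged."""
    return name.lower()


def find_collisions(names):
    """Return the groups of names that fold to the same key."""
    groups = defaultdict(list)
    for n in names:
        groups[conservative_fold(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


names = [
    b"README",
    b"readme",
    b"\xff\xfeNotes",             # not decodable as UTF-8, still handled
    "\u00c4rger".encode("latin-1"),  # capital A-umlaut in Latin1
    "\u00e4rger".encode("latin-1"),  # small a-umlaut in Latin1
]
collisions = find_collisions(names)
```

The two Latin1 names are not reported as colliding, which is the
trade-off being accepted for the remaining 1% of cases.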

Why does the lower vs upper thing matter at all? It mostly doesn't, but
there are a few cases where the upper/lower mapping is not 1:1, like
Turkish iİıI and Georgian (which has three alphabets, only one of which
has "lowercase"). But as long as we have to have filesystem-specific
folding, we ought to try to match the filesystem insofar as Python's
Unicode database makes that easy.
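
The Turkish case can be checked directly against Python's Unicode
database. This sketch uses Python 3 str methods (the behaviour of
Python 2's unicode type may differ in detail), and the two helper
functions are hypothetical names for this example only.

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ


def same_under_lower(a: str, b: str) -> bool:
    """Compare two names after folding with lower()."""
    return a.lower() == b.lower()


def same_under_upper(a: str, b: str) -> bool:
    """Compare two names after folding with upper()."""
    return a.upper() == b.upper()


# Dotless ı uppercases to plain I, so upper() merges it with "i" ...
upper_merges = same_under_upper(DOTLESS_SMALL_I, "i")
# ... but lower() leaves ı alone, so lower() keeps them distinct.
lower_merges = same_under_lower(DOTLESS_SMALL_I, "i")
# Dotted İ lowercases to "i" plus a combining dot above (two code points).
dotted_lowered = DOTTED_CAPITAL_I.lower()
```

So the choice of upper() or lower() changes which names are treated as
the same file, which is why it should follow the filesystem's own rule.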

>   2. switch from "lower()" (or "encoding.lower()") for filename case
>      folding to "util.normcase()"
> 
>      # this is for readability/maintainability
> 
>   3. upper case of fixed strings which are compared against normcase-d
>      string (or introduce case-folding-compare function ?)
> 
> But "os.path.normcase()" of Windows native Python lowers specified
> strings, so compare with upper-ed string seems to cause unexpected
> failure.

We should probably just ban os.path.normcase() from the Mercurial
codebase.
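
For reference, the mismatch described in the quoted text can be seen
without a Windows machine, because the Windows flavour of
os.path.normcase is importable as ntpath.normcase on any platform. This
is a Python 3 sketch; the sample name is arbitrary.

```python
import ntpath
import posixpath

name = "Foo/Bar.TXT"

# Windows flavour: lowercases and converts forward slashes to backslashes.
nt_folded = ntpath.normcase(name)
# POSIX flavour: returns the path unchanged.
posix_folded = posixpath.normcase(name)
# An upper()-folded string never matches the normcase() result.
matches_upper = nt_folded == name.upper()
```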

-- 
Mathematics is the supreme nostalgia of our time.




