[PATCH STABLE V2] i18n: fix case folding problem with problematic encodings
Matt Mackall
mpm at selenic.com
Thu Dec 1 16:55:11 UTC 2011
On Thu, 2011-12-01 at 18:03 +0900, FUJIWARA Katsunori wrote:
> At Wed, 30 Nov 2011 12:35:55 -0600,
> Matt Mackall wrote:
>
> > > Please confirm my understanding.
> > >
> > > "use upper()" seems to consist of below actions.
> > >
> > > 1. use "upper()" (or NEW "encoding.upper()") for "posix.normcase()"
> >
> > No. The only place "NTFS" and "POSIX" are related is in Microsoft's
> > dreams. Different filesystems -must- have different folding rules.
> > Please examine the link I gave you earlier:
> >
> > http://www.selenic.com/hg/file/ad686c818e1c/mercurial/posix.py#l174
> >
> > HFS+ does:
> >
> > - LOWER case
> > - Unicode NFD normalization
> > - percent-escaping
> >
> > ..so we should mirror those rules on Mac when we detect case-folding.
> > But they'd be wrong elsewhere.
> >
> > (Note that this means that util.normcase is actually charged with
> > handling other forms of folding/mapping as well.)
> >
> > On Linux, where we can have any one of HFS+, NTFS, VFAT, ISO9660, etc.
> > connected either natively, or via one of dozens of network filesystems,
> > we're going to have a really hard time figuring out the underlying
> > case-folding rules for a given path. Also note that the character set
> > used to mount a non-native filesystem may disagree with the user's
> > locale. For instance, NTFS can be mounted in a mode where filenames are
> > represented as UTF-8, but a given user uses Latin1, or vice-versa. The
> > conservative thing to do here is str.lower(). This will be good enough
> > for something like 99% of users: 90% don't use non-native filesystems,
> > and 90% of the rest won't encounter case-collisions of non-ASCII
> > characters.
> >
> > Why does the lower vs upper thing matter at all? It mostly doesn't, but
> > there are a few cases where the upper/lower mapping is not 1:1, like
> > Turkish iİıI and Georgian (which has three alphabets, only one of which
> > has "lowercase"). But as long as we have to have filesystem-specific
> > folding, we ought to try to match the filesystem insofar as Python's
> > Unicode database allows us to easily.
> >
> > > 2. switch from "lower()" (or "encoding.lower()") for filename case
> > > folding to "util.normcase()"
> > >
> > > # this is for readability/maintainability
> > >
> > > 3. upper case of fixed strings which are compared against normcase-d
> > > string (or introduce case-folding-compare function ?)
> > >
> > > But "os.path.normcase()" of Windows native Python lowers specified
> > > strings, so compare with upper-ed string seems to cause unexpected
> > > failure.
> >
> > We should probably just ban os.path.normcase() from the Mercurial
> > codebase.
>
> Thank you for detailed explanation !
>
>
> As I understand it:
>
> "util.normcase()" should abstract case folding policy, so
> normcase-ed result should not be expected to be either lower or
> upper.
>
>
> Then, I categorize lower/upper-ing points in current implementation.
>
> A. compare between filenames (directly or in-directly)
>
> "util.normcase()" should be applied on them.

util.normcase should only be applied after we've determined that we're
on a case-insensitive filesystem. We've done a pretty good job of
restricting its usage to dirstate.py, which carefully caches all the
relevant bits with _foldmap and normalize.

You should spend a while understanding dirstate.normalize. It's not
enough to be case-insensitive; we also have to be case-preserving.
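
To make "case-insensitive but case-preserving" concrete, here is a minimal
sketch in Python 3. It is not the actual dirstate code: the FoldMap class,
its constructor arguments, and the use of str.lower as the folding function
are all assumptions made for illustration. The idea is that lookups go
through the folded form, but the answer is always the spelling that was
recorded first.

```python
class FoldMap:
    """Hypothetical helper, not Mercurial's API: maps a case-folded
    name to the spelling that was recorded for it."""

    def __init__(self, known_names, normcase=str.lower):
        # normcase stands in for a filesystem-specific folding function;
        # str.lower is only the conservative default assumed here.
        self._normcase = normcase
        self._foldmap = {normcase(name): name for name in known_names}

    def normalize(self, path):
        folded = self._normcase(path)
        # Known name: return the recorded spelling (case-preserving).
        # Unknown name: keep the caller's spelling and remember it, so
        # later lookups that differ only in case agree with it.
        return self._foldmap.setdefault(folded, path)


fm = FoldMap(["README.txt", "src/Main.py"])
print(fm.normalize("readme.TXT"))
print(fm.normalize("New.File"))
print(fm.normalize("new.file"))
```

A caller would only route paths through something like this after
detecting a case-insensitive filesystem; on a case-sensitive one the path
is used as given.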
--
Mathematics is the supreme nostalgia of our time.