[PATCH 1 of 8 RFC] vfs: replace invocation of file APIs of os module by ones via vfs

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Sun Jun 17 08:27:16 UTC 2012


At Sat, 16 Jun 2012 23:34:47 +0200,
Adrian Buehlmann wrote:

> Some further, perhaps stupid and wild ideas:
> 
> For the openers (e.g. scmutil.opener) I think we might have to put a unicode
> string into base (see scmutil.py):
> 
> 199:    def __init__(self, base, audit=True):
> 200:        self.base = base
> 
> For the store openers, the path parameter on __call__
> 
> 218:    def __call__(self, path, mode="r", text=False, atomictemp=False):
> 
> would then be plain ASCII strings, as the filenames in the store are
> all encoded already, using ASCII characters only.
> 
> Then the join function
> 
> 293:    def join(self, path):
> 293:        return os.path.join(self.base, path)
> 
> needs to return a unicode string, which is formed by using the "base"
> unicode string and joining it with the ASCII path.
> 
> join() is used in __call__() to form the final, complete path f
>
> 224:        f = self.join(path)
> 
> which needs to be a unicode string as well (on Windows, of course).
>
> We then need a unicode version of util.posixfile
> 
> 261:        fp = util.posixfile(f, mode)
> 
> Which takes the unicode filename f.
> 
> So we would then also need a unicode version of posixfile for Windows in
> osutil.c, line 410.
> 
> The store openers need to be unicode-aware because of the base.
> 
> base is somewhere under the repo root. Which in turn can have funny characters
> (e.g. Japanese).
> 
> I think this has to be done unconditionally, if we want to support repo
> roots with funny paths.
> 
> Likewise, the base of wopeners need to be unicode strings as well for
> the same reasons.
> 
> But there, we ideally most likely want to have the path parameter on
> __call__ in UTF-8, or some other encoding (e.g. latin1 or whatever?),
> depending on some other conditions (the switching as per Matt's ideas).

For example, I can create files with the names below via the Python
Unicode file API, even on Japanese Windows using cp932 as the system
code page:

  - u'\u00c0'
  - u'\u30cf\u309a' (the NFD form of u'\u30d1'; u'\u30d1' itself is
    valid in cp932)

But I can't access them via the Python ANSI file API, because these
Unicode characters have no corresponding characters in cp932.

# "os.listdir('.')" returns mangled names for them
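
The encoding side of this can be checked without touching the file
system. A minimal sketch, assuming Python 3 and its built-in cp932
codec (the helper name encodable() is only for illustration, it is not
part of Mercurial):

```python
import unicodedata

def encodable(name, codepage="cp932"):
    """Return True if every character of name exists in codepage."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

a_grave = "\u00c0"        # LATIN CAPITAL LETTER A WITH GRAVE
pa_nfd = "\u30cf\u309a"   # KATAKANA HA + combining semi-voiced sound mark
pa_nfc = unicodedata.normalize("NFC", pa_nfd)  # composes to "\u30d1"

for label, name in (("a_grave", a_grave), ("pa_nfd", pa_nfd), ("pa_nfc", pa_nfc)):
    print(label, encodable(name))
```

Only the composed form has a cp932 byte sequence, so the other two
names can be reached through the Unicode file API only.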

So, I think that there are two kinds of "funny" paths:

  (A) using only characters valid in the system code page
  (B) also using characters not valid in the system code page

If the repo root path is (A):

  - root path (A) can be encoded to a valid byte sequence in the
    system code page, and

  - the encoded "root path (A)" and a path in the workdir, in any
    encoding, can be joined as byte sequences

So, we can also access target files correctly via the ANSI file API:
we can switch between the ANSI and Unicode file APIs, according to the
conditions suggested by Matt.
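
A minimal sketch of case (A), assuming Python 3. The root and the file
name are made up for illustration, and the join is done by plain byte
concatenation rather than by any Mercurial function:

```python
# Hypothetical repo root of kind (A): every character exists in cp932.
root = "C:\\\u65e5\u672c\u8a9e"
root_bytes = root.encode("cp932")   # succeeds for a root of kind (A)

# A "legacy" workdir path kept as raw bytes. It happens to be latin-1
# here, but the join below never needs to know that.
legacy_path = "caf\u00e9.txt".encode("latin-1")

# Joining is a pure byte operation, so no decoding is involved.
full_path = root_bytes + b"\\" + legacy_path
print(repr(full_path))
```

The resulting byte string is what would be handed to the ANSI file
API.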

On the other hand, if the repo root path is (B):

  - root path (B) can't be encoded to a valid byte sequence in the
    system code page, so we should use the Unicode file API to access
    files under such a directory, but

  - "root path (B)" and a "legacy" path (not encoded in UTF-8) can't
    be joined as Unicode without any information about the encoding
    of the "legacy" path

So, we can't access target "legacy" files in this case!
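
A minimal sketch of case (B), assuming Python 3. The root and the file
name are made up for illustration; the point is only that the same
"legacy" bytes give different Unicode paths depending on the guessed
encoding:

```python
# Hypothetical repo root of kind (B): u'\u00c0' has no cp932 byte form.
root = "C:\\\u00c0"

try:
    root.encode("cp932")
    root_is_kind_a = True
except UnicodeEncodeError:
    root_is_kind_a = False   # only the Unicode file API can reach it

# A "legacy" path stored as bytes, with its encoding not recorded.
legacy_path = "\u30d1.txt".encode("cp932")

# Joining as Unicode requires decoding first, and each guess about the
# encoding yields a different file name.
guess_cp932 = root + "\\" + legacy_path.decode("cp932")
guess_latin1 = root + "\\" + legacy_path.decode("latin-1")

print(root_is_kind_a)
print(guess_cp932 == guess_latin1)
```

Both decodings succeed without error, so nothing tells us which of
the two joined paths is the intended one.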


Paths to subrepos in the workdir, or manual renaming by users, may
also cause this problem, even if we restrict repo root paths to (A)
at creation by clone or init.


Another problem seems to be that "valid in the system code page" is
not a portable concept across environments: paths valid on Japanese
Windows may not be valid on others.
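
A minimal sketch of this, assuming Python 3 and using cp1252 as an
example of a non-Japanese system code page (the helper name
encodable() is only for illustration):

```python
def encodable(name, codepage):
    """Return True if every character of name exists in codepage."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

names = {"a_grave": "\u00c0", "katakana_pa": "\u30d1"}

# Each name is of kind (A) under one code page and (B) under the other.
for codepage in ("cp932", "cp1252"):
    for label, name in names.items():
        print(codepage, label, encodable(name, codepage))
```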


But sorry, I know only a little about the Windows native API, so
please tell me if there is a good API to resolve this problem!

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp


