[PATCH 1 of 8 RFC] vfs: replace invocation of file APIs of os module by ones via vfs
FUJIWARA Katsunori
foozy at lares.dti.ne.jp
Sun Jun 17 08:27:16 UTC 2012
At Sat, 16 Jun 2012 23:34:47 +0200,
Adrian Buehlmann wrote:
> Some further, perhaps stupid and wild ideas:
>
> For the openers (e.g. scmutil.opener) I think we might have to put a unicode
> string into base (see scmutil.py):
>
> 199: def __init__(self, base, audit=True):
> 200: self.base = base
>
> For the store openers, the path parameter on __call__
>
> 218: def __call__(self, path, mode="r", text=False, atomictemp=False):
>
> would then be plain ASCII strings, as the filenames in the store are
> all encoded already, using ASCII characters only.
>
> Then the join function
>
> 293: def join(self, path):
> 293: return os.path.join(self.base, path)
>
> needs to return a unicode string, which is formed by using the "base"
> unicode string and joining it with the ASCII path.
>
> join() is used in __call__() to form the final, complete path f
>
> 224: f = self.join(path)
>
> which needs to be a unicode string as well (on Windows, of course).
>
> We then need a unicode version of util.posixfile
>
> 261: fp = util.posixfile(f, mode)
>
> Which takes the unicode filename f.
>
> So we would then also need a unicode version of posixfile for Windows in
> osutil.c, line 410.
>
> The store openers need to be unicode-aware because of the base.
>
> base is somewhere under the repo root. Which in turn can have funny characters
> (e.g. Japanese).
>
> I think this has to be done unconditionally, if we want to support repo
> roots with funny paths.
>
> Likewise, the base of wopeners need to be unicode strings as well for
> the same reasons.
>
> But there, we ideally most likely want to have the path parameter on
> __call__ in UTF-8, or some other encoding (e.g. latin1 or whatever?),
> depending on some other conditions (the switching as per Matt's ideas).
For example, I can create files with the names below via Python's Unicode
file API, even on Japanese Windows using cp932 as the system code page:
- u'\u00c0'
- u'\u30cf\u309a' (the NFD form of u'\u30d1'; u'\u30d1' itself is valid in cp932)
But I can't access them via Python's ANSI file API, because these Unicode
characters have no corresponding characters in cp932.
# "os.listdir('.')" returns mangled names for them
So, I think that there are two kinds of "funny" paths:
(A) paths using only characters valid in the system code page
(B) paths also using characters not valid in the system code page
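The two kinds can be told apart by trying to encode a name in the code
page. This is only a sketch, written for Python 3 and using Python's
built-in "cp932" codec as a stand-in for the Windows system code page (the
two may differ in details); is_kind_a is a hypothetical helper name, not an
existing Mercurial function:

```python
import unicodedata

def is_kind_a(path, codepage="cp932"):
    """Return True if every character of `path` is encodable in `codepage`."""
    try:
        path.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

precomposed = "\u30d1"                                  # KATAKANA LETTER PA
decomposed = unicodedata.normalize("NFD", precomposed)  # "\u30cf\u309a"

# Report kind (A) or (B) for each of the example names.
for name in (precomposed, decomposed, "\u00c0"):
    kind = "A" if is_kind_a(name) else "B"
    print(ascii(name), "is kind", kind)
```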
If the repo root path is (A):
- root path (A) can be encoded to a valid byte sequence in the system
code page, and
- the encoded "root path (A)" and the path in the workdir, in any
encoding, can be joined as byte sequences
So, we can also access the target files correctly via the ANSI file API:
we can switch between the ANSI and Unicode file APIs according to the
conditions suggested by Matt.
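A minimal sketch of the byte-level join for case (A), assuming Python 3
and its built-in "cp932" codec; join_kind_a and the sample names are
hypothetical and only illustrate the idea:

```python
import os

def join_kind_a(root, workdir_path, codepage="cp932"):
    """Join a Unicode repo root with a byte-string workdir path.

    Raises UnicodeEncodeError if `root` is kind (B) for `codepage`.
    """
    root_bytes = root.encode(codepage)
    # Both operands are bytes, so the encoding of workdir_path never matters.
    return os.path.join(root_bytes, workdir_path)

root = "repo_\u30d1"            # kind (A): encodable in cp932
legacy_name = b"caf\xe9.txt"    # e.g. a latin-1 name kept as raw bytes
full_path = join_kind_a(root, legacy_name)
```

The result is a byte string that could be handed to a byte-oriented (ANSI)
file API without ever decoding the workdir path.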
On the other hand, if the repo root path is (B):
- root path (B) can't be encoded to a valid byte sequence in the system
code page, so we should use the Unicode file API to access files under
such a directory, but
- "root path (B)" and a "legacy" path (not encoded in UTF-8) can't be
joined as Unicode without any information about the encoding of the
"legacy" path
So, we can't access the target "legacy" files in this case!
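The difficulty in case (B) can be sketched as follows, again assuming
Python 3 and its built-in codecs; the names are hypothetical. A Unicode
root and a raw byte path cannot be joined directly, and each guess at the
byte path's encoding yields a different Unicode file name:

```python
root = "repo_\u00c0"                    # kind (B) for cp932
legacy_name = "\u30d1".encode("cp932")  # raw bytes, encoding not recorded

# Python 3 refuses to mix str and bytes in one path.
try:
    root + "/" + legacy_name
    mixed_join_failed = False
except TypeError:
    mixed_join_failed = True

# Guessing an encoding "works", but each guess names a different file.
candidates = {
    enc: root + "/" + legacy_name.decode(enc)
    for enc in ("cp932", "latin-1")
}
```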
Paths to subrepos in the workdir, or manual renaming by users, may
also cause this problem, even if we restrict repo root paths to (A) when
they are created by clone or init.
Another problem seems to be that "valid in the system code page" is not
a portable concept across environments: paths valid on Japanese
Windows may not be valid on other systems.
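As a rough illustration, assuming Python 3 and taking its "cp932" and
"cp1252" codecs as stand-ins for Japanese and Western European Windows
code pages, the same name can be valid in one and invalid in the other
(encodable is a hypothetical helper):

```python
def encodable(name, codepage):
    """Return True if `name` can be encoded in `codepage`."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

names = ["\u30d1", "\u00c0"]
codepages = ["cp932", "cp1252"]

# Map each (name, code page) pair to whether the name is valid there.
validity = {
    (name, cp): encodable(name, cp)
    for name in names
    for cp in codepages
}
```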
But sorry, I know only a little about the Windows native API, so please
tell me if there is a good API for resolving this problem!
----------------------------------------------------------------------
[FUJIWARA Katsunori] foozy at lares.dti.ne.jp