[PATCH 6 of 6 stable] convert: handle percent-encoded bytes in file URLs like Subversion
Yuya Nishihara
yuya at tcha.org
Tue Jun 30 14:48:30 UTC 2020
On Tue, 30 Jun 2020 16:28:55 +0200, Manuel Jacob wrote:
> On 2020-06-30 16:03, Yuya Nishihara wrote:
> > On Tue, 30 Jun 2020 15:43:29 +0200, Manuel Jacob wrote:
> >> >>> - path = pycompat.fsencode(unicodepath)
> >> >>> + return False
> >> >>> + path = unicodepath.encode(fsencoding)
> >> >>
> >> >> I think pycompat.fsencode() is more correct since the path will be
> >> >> later
> >> >> tested by os.path.*() functions. On Python 2, I'm not sure. Maybe our
> >> >> encoding.unitolocal() is okay.
> >> >
> >> > It was deliberate to use something that fails when it can’t be
> >> > encoded. However, I thought about failing in case of a logic error,
> >> > not about that the URL could contain a (UTF-8-encoded) code point that
> >> > is not available in `fsencoding`.
> >> >
> >> > On Windows, Subversion will be able to handle all code points. We’re
> >> > unnecessarily restricting ourselves by doing the rest of the function
> >> > on bytes. Although not entirely Mercurial-idiomatic, we should maybe
> >> > consider doing the rest of the function in Unicode. We can also bail
> >> > out, but we have to make sure the warning mentions that it’s not the
> >> > user’s fault.
> >> >
> >> > On Unix, Subversion will try later to convert the path to the locale
> >> > encoding, which fails if the code point is not encodable. We should
> >> > bail out in this case.
> >>
> >> Let me add that with "bail out", I included that a warning describing
> >> the problem is shown.
> >>
> >> Since what I described is mostly about showing nicer messages, should
> >> they be added in an amended patch or as follow-ups?
> >
> > I'm not sure if I understood what you mean.
> >
> > I just pointed out that the fsencoding is the encoding in which
> > Subversion
> > will convert unicode to bytes, whereas fsencode() is the function for
> > Python
> > layer. And the path should be a bytes encoded in Python way.
>
> Subversion will do a similar check as in the rest of the function to
> determine whether the repo is actually a SVN repo.
>
> If we use a different encoding, we have a higher chance of running into
> cases where our function thinks it’s a SVN repo, while Subversion thinks
> it’s not (and vice versa). Therefore, I think we should use
> `fsencoding`.
Then, can you adjust the documentation of fsencoding? Without that, I would
see there are at least three types of path variables:
- py_fs_bytes_t (or maybe hg_fs_bytes_t)
- svn_fs_bytes_t
- unicode_t (or svn_url_unicode_t)
Be aware that the path will be converted back to UCS2-ish encoding on
Windows by either Python 3 or Win32 ANSI API.
More information about the Mercurial-devel
mailing list