Unicode Windows API, Was: Concerns about using Python's ctypes library on Windows
Mads Kiilerich
mads at kiilerich.com
Sat Jul 30 19:03:07 UTC 2011
Adrian Buehlmann wrote, On 07/29/2011 09:48 PM:
> On 2011-07-29 18:37, Andrei Polushin wrote:
>> 1. What if Mercurial were to switch to Unicode Windows APIs?
> This won't happen. Ask Matt why. (But it wouldn't be a problem anyway)
I'm not so sure that it won't happen. There is a problem with cross
platform unicode and it should be fixed.
Some things obviously won't change. Mercurial will remain backward
compatible and will keep storing and working with encoded filenames
exactly like they are represented by traditional unix systems: as a
sequence of bytes without \0 and with / only used as separator between
path elements. That works just fine on and between systems that use the
same encoding, and with the UTF-8 encoding unicode file names are no
problem. All major platforms - except Windows - now use UTF-8 and it is
the de facto standard encoding in Mercurial. (Most experienced
developers are however smart enough to restrict themselves to (a subset
of) 7-bit ASCII.)
Non-Windows platforms generally work fine and are not going to change,
so some kind of Windows-specific solution/hack is needed. It could be
argued that it is unfair to define the problem in such a way that the
burden is put on Windows, but that is how it is, and that is how we have
to look at it if we want to be constructive and improve the situation.
This is where I think Mercurial on Windows could and should grow
_optional_ support for using unicode Windows APIs. The problem on
Windows is that Mercurial doesn't use UTF-8 APIs - both because Windows
uses UTF-16 instead of UTF-8 for its unicode APIs, and because Mercurial
uses the 8-bit API. Instead we could use the UTF-16 API and convert
to/from UTF-8 at the API level.
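As a rough illustration of what "convert at the API level" could mean,
here is a minimal Python sketch. The function names are made up for this
example and are not taken from Mercurial or fixutf8, and the sketch
assumes that the stored name really is valid UTF-8 (names stored in some
other local encoding would raise UnicodeDecodeError here and need
separate handling).

```python
def to_wide_api(stored: bytes) -> str:
    """Turn a stored path (UTF-8 bytes, '/' separated) into a text path.

    Hypothetical helper: the result is what would be handed to a
    unicode (UTF-16) file API instead of the 8-bit one.
    """
    return stored.decode("utf-8").replace("/", "\\")


def from_wide_api(native: str) -> bytes:
    """Turn a text path from the unicode API back into the stored form."""
    return native.replace("\\", "/").encode("utf-8")


def utf16_units(native: str) -> bytes:
    """Show the little-endian UTF-16 bytes for a text path."""
    return native.encode("utf-16-le")


stored = "dir/fil\u00e9.txt".encode("utf-8")
native = to_wide_api(stored)
# The conversion is lossless in both directions for valid UTF-8 input.
assert from_wide_api(native) == stored
```

The point of the sketch is only that everything above this boundary keeps
working with the same bytes as on unix, and only the lowest layer knows
about UTF-16.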
I think we should acknowledge, support and utilize that local and
"unknown" encodings all are converging towards UTF-8 for most users. It
happens automatically on other platforms, but some extra attention and
hacks are required on Windows.
Note that:
* It will be a bit tricky to introduce unicode on Windows in such a way
that existing repos keep working the old quirky way while new repos
use a more cross-platform UTF-8 approach.
* Console input/output of unicode on Windows seems to be a lost game no
matter what we do. Users will have to accept some garbage and use
wildcards (or file://...%xx...) to specify unicode filenames.
Re-encoding might not be an option for a general solution to console
output, but keeping everything in UTF-8 until the low-level write
function where it can be re-encoded (with loss?) to whatever encoding
the user wants seems to be one of the least broken solutions.
* References to filenames from file content (such as make files or build
systems) can in principle break if files on Windows no longer are
created with unreadable UTF-8 in their names. Many build systems are
however unicode aware, so while the build file might be encoded in UTF-8
the build system on Windows will reference the file using the UTF-16
encoded name.
* The fixutf8 extension already implements this UTF-8 approach and seems
to have some happy users. It could perhaps be promoted to a standard
extension or it could be used as inspiration for a new and better and
more complete and maintainable implementation.
* Case-folding and unicode normalization could be considered a separate
(and almost solved) problem.
* This discussion is mostly about file system access in the working
directory. All (?) files in .hg have names in plain ASCII and do thus
not have encoding issues but do instead have other requirements for file
system semantics.
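The console point above (keep everything in UTF-8 until the low-level
write function, then re-encode with possible loss) could be sketched
roughly like this in Python. The function name and the choice of cp1252
as the user's encoding are assumptions for the example only, not existing
Mercurial code.

```python
import io


def write_console(stream, data: bytes, encoding: str) -> None:
    """Re-encode UTF-8 output to the user's encoding at the last moment.

    Hypothetical helper: characters that the target encoding cannot
    represent are replaced with '?', so the conversion may lose data.
    """
    text = data.decode("utf-8", errors="replace")
    stream.write(text.encode(encoding, errors="replace"))


# Example with an in-memory stream standing in for the console.
buf = io.BytesIO()
write_console(buf, "caf\u00e9 \u4e2d\n".encode("utf-8"), "cp1252")
```

Here the e-acute survives because cp1252 has it, while the CJK character
becomes "?". That is the kind of garbage users would have to accept.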
/Mads