Unicode support for non-unicode locales
Densetsu no Ero-sennin
densetsu.no.ero.sennin at gmail.com
Tue Oct 9 17:45:08 UTC 2007
On 9 October 2007 (Tue), Matt Mackall wrote:
> utf-8$ touch <japan>
> utf-8$ tar --posix -c -f foo.tar <japan>
> utf-8$ zip foo.zip <japan>
>
> ascii$ tar --posix -x -v -f ../foo.tar
> \346\227\245\346\234\254\345\233\275
> ascii$ ls
> ?????????
> ascii$ rm *
> ascii$ unzip ../foo.zip
> Archive: ../foo.zip
> extracting: <garbage>
> ascii$ ls
> ?????????
Things are not that simple. Here's another example (I assume you can see
Cyrillic).
$ echo $LANG
en_US.UTF-8
$ tar --version | head -n 1
tar (GNU tar) 1.18
$ touch проверка
$ tar --posix -c -f foo.tar проверка
$ tar -t -f foo.tar
проверка
$ rm проверка
$ LC_ALL=ru_RU.KOI8-R tar -t -f foo.tar | iconv -f KOI8-R
проверка
$ LC_ALL=ru_RU.KOI8-R tar -x -f foo.tar
$ ls # must output some garbage
п©я─п╬п╡п╣я─п╨п╟
$ ls | iconv -f KOI8-R
проверка
Actually, when unpacking an archive in POSIX.1-2001 format, tar produces
correctly encoded filenames if the locale encoding allows it. But if not, it
encodes filenames in UTF-8 regardless of the locale, producing lots of
garbage characters. Personally, I'd prefer it to fail with lots of error
messages instead.
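
To make the behaviour described above concrete, here is a rough Python
sketch. It is not GNU tar's code, only an illustration of the "use the
locale encoding if possible, otherwise fall back to UTF-8" rule. The names
encode_filename, as_seen_in and the strict flag are hypothetical; strict
models the fail-with-an-error behaviour I would prefer.

```python
def encode_filename(name: str, charset: str, strict: bool = False) -> bytes:
    """Encode an archive member name for a locale that uses `charset`.

    Hypothetical helper: mimics the fallback described in the text,
    it is not taken from tar itself.
    """
    try:
        # The locale encoding can represent the name: use it.
        return name.encode(charset)
    except UnicodeEncodeError:
        if strict:
            # Preferred alternative: refuse instead of writing garbage.
            raise
        # Fallback described above: raw UTF-8, regardless of the locale.
        return name.encode("utf-8")


def as_seen_in(raw: bytes, charset: str) -> str:
    """Show how raw filename bytes look when displayed in `charset`."""
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # KOI8-R can represent the Cyrillic name, so no fallback happens.
    print(encode_filename("проверка", "koi8-r"))
    # ASCII cannot represent the Japanese name, so UTF-8 bytes come out.
    print(encode_filename("日本国", "ascii"))
    # UTF-8 bytes displayed by a KOI8-R reader turn into garbage.
    print(as_seen_in("проверка".encode("utf-8"), "koi8-r"))
```

With strict=True the second call raises UnicodeEncodeError instead of
silently returning UTF-8 bytes.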