Problem with Unicode file names under Windows

Joop Eggen

Apr 14, 2011, 7:12:36 AM

to merc...@selenic.com

Hello.

Technical Problem
Files and directories in Unicode (UTF-8) cannot be dealt with.

Our Problem
This really prevents us from switching from Subversion to Mercurial.

Weight
Nowadays Unicode is a required standard in projects made internationally available.
For instance, for business logic we use the German terms, so we have a source file Begrüßungspauschale.java.
An ASCII name such as Begruessungspauschale is plain ugly and harder to read.

Remarks
As Unicode is not canonical (an accented letter may also be letter + accent), one should probably also use a normalization like NFC / NFKC.
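
A minimal Python sketch of that point, using the standard unicodedata module and the file name from the example above (the variable names are illustrative only): the precomposed and decomposed spellings look the same but are different code point and byte sequences until both are normalized to one form.

```python
import unicodedata

# File name from the example above, written with a precomposed u-umlaut.
name = "Begr\u00fc\u00dfungspauschale.java"

nfc = unicodedata.normalize("NFC", name)  # u-umlaut as one code point
nfd = unicodedata.normalize("NFD", name)  # "u" followed by a combining diaeresis

# Same visible text, different lengths and different bytes.
print(len(nfc), len(nfd))  # 24 25
print(nfc == nfd)
print(nfc.encode("utf-8"))
print(nfd.encode("utf-8"))

# Normalizing both sides to the same form makes them compare equal again.
renormalized = unicodedata.normalize("NFC", nfd)
print(renormalized == nfc)
```

Which form to pick is a design choice; the sketch only shows that comparing names without agreeing on one form can fail.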

Motive
I would love to introduce Mercurial, as it is integrated in the NetBeans IDE and is, in my eyes, better than Git and the like.

Question
Could there be a foreseeable solution in, say, two months?

I am aware of the scope of such a change request.
Friendly regards,
Joop Eggen

Scott Palmer

Apr 14, 2011, 8:43:32 PM

to Joop Eggen, merc...@selenic.com

Are you aware of this extension: http://mercurial.selenic.com/wiki/FixUtf8Extension ?

I don't know how well it works, as I am lucky enough to only need to deal with English file names (so far). :-)

Cheers,

Scott

Masklinn

Apr 15, 2011, 5:30:38 AM

to Joop Eggen, merc...@selenic.com

I have to say, this report is anything but informative, and I have (had?) a hard time making sense of some sections:

> Files and directories in Unicode (UTF-8) cannot be dealt with.

Do you mean that Mercurial does not transcode file names from some original encoding to NTFS-flavored UTF-16 [-1]?

> Our Problem

You don't really explain your problem, give any example, allow for any repro or anything like that. I do not see how the core team could help you without you providing any useful information.

I am guessing you created some non-ASCII file names on a system using UTF-8 (likely a Linux system?), and the names came out garbled on Windows? I created a test repository containing non-ASCII file names on OS X [0] and could reproduce the issue (I believe) after cloning it under Windows 7 using TortoiseHg: the names come out OK in TortoiseHg's log [1] but not on the file system [2]. There was no problem editing a file, even with the most basic text editor available on the platform (Notepad): I created a revision by opening a file in Notepad and adding a line taken from Wikipedia (Korean, I believe; I just took a line from a random non-European-language Wikipedia) [3].

Have you checked the Mercurial wiki page on encoding, at least as a reference [4]? Quite simply (see section 4), Mercurial assumes that non-ASCII file names are not portable between systems (because they are not) and (as far as I understand what I read; I may be wrong) treats file names as byte streams, without trying to perform any manipulation on them. This is similar to what Unix filesystems (e.g. the ext* family) generally do.
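
The effect of treating names as opaque bytes can be sketched in Python as follows. This is only an illustration, not Mercurial's code; the choice of cp1252 as the Windows code page is an assumption, and the file name is borrowed from the original report.

```python
# File name borrowed from the original report.
name = "Begr\u00fc\u00dfungspauschale.java"

# Bytes as a UTF-8 system would store the name.
utf8_bytes = name.encode("utf-8")
# Bytes for the same name in a Windows code page (cp1252 is an assumption).
cp1252_bytes = name.encode("cp1252")

# The same characters map to different byte sequences.
print(utf8_bytes == cp1252_bytes)

# If the UTF-8 bytes are passed through unchanged and then interpreted
# as cp1252, the non-ASCII characters come out garbled.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))

# The bytes themselves are intact: reversing the wrong decoding restores them.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == name)
```

Nothing is lost at the byte level; the garbling comes only from interpreting the bytes with a different encoding than the one that produced them.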

> As unicode is not canonical (accented letter may also be letter+accent), one probably should also use normalization like NFC / NFKC.

I see no reason to do that, short of a tool or filesystem dealing incorrectly with a normalization it does not like.

[-1] http://blogs.msdn.com/b/michkap/archive/2006/09/10/748699.aspx
[0] https://bitbucket.org/masklinn/filenames
[1] http://imgur.com/5jkNV
[2] http://imgur.com/16Rwp
[3] https://bitbucket.org/masklinn/filenames/changeset/cbcbdaf5c8f6
[4] http://mercurial.selenic.com/wiki/EncodingStrategy

Martin Geisler

Apr 15, 2011, 12:17:10 PM

to Masklinn, Joop Eggen, merc...@selenic.com

Masklinn <mask...@masklinn.net> writes:

> On 2011-04-14, at 13:12 , Joop Eggen wrote:
>
>> Our Problem
>
> You don't really explain your problem, give any example, allow for any
> repro or anything like that. I do not see how the core team could help
> you without you providing any useful information.

Well, there is no need for a reproduction script -- the problem he
describes is unfortunately a well-known problem in Mercurial: filenames
are read and written as bytes whereas other systems, Subversion in
particular, write them as Unicode characters.

> Have you checked the Mercurial wiki page on encoding, at least as a
> reference [4]? Quite simply (see section 4), Mercurial assumes that
> non-ASCII file names are not portable between systems (because they
> are not)

That is not true in general: it depends on the tool you use to access
the files. Modern tools like OpenOffice and Subversion use the current
locale settings to decode the bytes into Unicode characters
themselves.
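
A rough Python sketch of what "decoding with the locale settings" means. The helper names decode_filename and transcode_filename are hypothetical, and this is not how OpenOffice or Subversion are actually implemented; it only shows the general idea of turning stored bytes into characters and re-encoding them for another system.

```python
import locale
from typing import Optional


def decode_filename(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode file name bytes with the given encoding, or with the
    locale's preferred encoding when none is given."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


def transcode_filename(raw: bytes, source: str, target: str) -> bytes:
    """Re-encode file name bytes from one encoding to another by going
    through Unicode characters."""
    return raw.decode(source).encode(target)


name = "Begr\u00fc\u00dfungspauschale.java"
stored = name.encode("utf-8")  # bytes as written on a UTF-8 system

# Decoding with the right encoding recovers the characters.
print(decode_filename(stored, "utf-8") == name)

# The same characters re-encoded as UTF-16, two bytes per character here.
wide = transcode_filename(stored, "utf-8", "utf-16-le")
print(len(stored), len(wide))
```

The decoding step only works if the encoding that produced the bytes is known or correctly guessed, which is exactly the information a byte-oriented tool does not record.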

Older tools like make and CVS have no clue about character encodings in
filenames and Mercurial has unfortunately chosen to follow along in that
old tradition.

(I know that there might be filenames that cannot be decoded or encoded
with the current locale settings, but I feel that it would be better if
Mercurial dealt with that instead of punting on the issue.)
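
One possible way to cope with such names, sketched in Python 3 under the assumption that its surrogateescape error handler (PEP 383) is acceptable: undecodable bytes are carried through as lone surrogates and restored on encoding. This illustrates the general technique only; it is not something Mercurial does.

```python
# Latin-1 bytes for the example name; they are not valid UTF-8.
raw = b"Begr\xfc\xdfungspauschale.java"

# A strict decode refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# With surrogateescape, each undecodable byte becomes a lone surrogate
# code point, so the name can still be handled as text.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)
```

The resulting string is not displayable as is, but the original bytes survive a round trip, so a name is never lost or silently altered.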

--
Martin Geisler

Mercurial links: http://mercurial.ch/

罗勇刚(Yonggang Luo)

Apr 15, 2011, 12:58:00 PM

to Martin Geisler, mercurial

Yes, I agree. It is very hard for people like the Chinese to use hg
or git. Compared to Subversion, it is a step backward, like returning
to ancient times. We can restrict ourselves to ASCII characters to
avoid possible issues, but as users we don't care about the technical
advantages, only about comfort of use. Indeed, many software issues
arise from not supporting Unicode; comparing Python 2 with Python 3 is
an obvious example. So please do not just consider the ASCII world,
but also consider the rest of the world.

2011/4/16, Martin Geisler <m...@lazybytes.net>:

--
Sent from my mobile device

With best regards,
罗勇刚
Yours sincerely,
Yonggang Luo

Andrey

Oct 14, 2011, 4:59:16 AM

to mercuria...@googlegroups.com, mercurial

As far as I can see the message is clear: the problem is known but ignored.

Sad...

-

Andrey

Dennis Brakhane

Oct 20, 2011, 5:01:01 AM

to mercuria...@googlegroups.com, mercurial

On 14.10.2011 at 10:59, "Andrey" <py4...@gmail.com> wrote:
>
> As far as I can see the message is clear: the problem is known but ignored.

As I understood it, it's more like "nobody knows how to fix it properly without breaking other use cases".

罗勇刚(Yonggang Luo)

Oct 20, 2011, 5:57:35 AM

to Dennis Brakhane, mercurial, mercuria...@googlegroups.com

There is a proper way to do that; it is just that no one (among the maintainers) has accepted it.

They just stick to the bad and wrong scheme on Windows and think they are right.
