Add a Unicode mode, but keep the bytes mode

Victor Stinner

Nov 4, 2011, 8:47:14 AM
to mercuri...@selenic.com
Hi,

Summary: because we cannot solve all issues with a single data type (bytes or
Unicode), I propose to offer two mutually exclusive modes: bytes and Unicode.
People who need full Unicode support can choose the new Unicode mode, whereas
existing repositories will continue to work as before. This email lists the
limitations of each data type (repository "kind").

--

Thanks to the recent discussions, I now have a better idea of the issues
related to "Unicode filenames" (store filenames as UTF-8 and use a Unicode type
in Python). All issues listed below concern non-ASCII filenames. If you only
use ASCII filenames (which is the most common case), you don't have to
worry about these issues :-)

There are two main use cases:

A) Portable project used on any platform (Windows, Linux and Mac OS X), shared
by a lot of people, with tools compatible with Unicode filenames
B) Platform-specific project shared by a small group, typically only on
UNIX, with "legacy" tools (incompatible with Unicode filenames)

Unicode has to be used for (A), and bytes have to be used for (B). So I
propose to add a new "Unicode" mode to Mercurial.

--

It will be possible to convert a repository between the two modes under the
following conditions:

* Unicode->bytes requires an encoding able to encode all filenames. E.g. you
cannot convert to Latin1 if a filename contains a Japanese character.

* bytes->Unicode requires an encoding able to decode all filenames. E.g. if
filenames were created on a latin1 system, you cannot convert the repository
by decoding them as UTF-8 (you will get Unicode decode errors).

If the same computer is used to create and convert the repository, it will
work in both cases (the locale encoding will be used).
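To illustrate the two conversion conditions, here is a minimal Python 3 sketch. It is not Mercurial code: the helper names `undecodable` and `unencodable` are made up for this example, and a real conversion would check every filename in the repository history.

```python
def undecodable(names, encoding):
    """Return the byte filenames that cannot be decoded with `encoding`."""
    bad = []
    for raw in names:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


def unencodable(names, encoding):
    """Return the Unicode filenames that cannot be encoded with `encoding`."""
    bad = []
    for name in names:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            bad.append(name)
    return bad


# bytes -> Unicode: a name created on a latin1 system is not valid UTF-8.
# "caf\u00e9.txt" is "cafe" with a precomposed e-acute.
latin1_names = ["caf\u00e9.txt".encode("latin-1"), b"README"]
print(undecodable(latin1_names, "utf-8"))    # conversion would fail
print(undecodable(latin1_names, "latin-1"))  # conversion would succeed

# Unicode -> bytes: Latin1 cannot represent Japanese characters.
unicode_names = ["caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]
print(unencodable(unicode_names, "latin-1"))
```

An empty result means the conversion can proceed with that encoding.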

You will have to use the same mode as everyone else on your project. You cannot
use bytes while others use Unicode. The mode has to be chosen when you
create a new repository, or the repository has to be converted only once, when
everybody agrees (and after some tests).

There is no reason to convert from bytes to Unicode if you don't manipulate
non-ASCII filenames. You may want to move to Unicode if you have mojibake
issues or if you need to support the full Unicode range on Windows.

The default kind will be bytes until enough third-party tools are compatible
with Unicode (e.g. make).

--

Each repository mode has limitations:

* Unicode: you cannot checkout a repository on UNIX if your locale is unable
to encode all filenames. E.g. if your locale encoding is ASCII on Linux, you
cannot clone a repository containing non-ASCII filenames.

* bytes: you don't have access to the full Unicode range on Windows

* bytes: mojibake issues (filenames not displayed correctly) depending on your
locale

--

Summary of all issues related to filenames.

"Makefile": if a file contains a filename stored as bytes, you cannot "transcode"
filenames between two computers.
=> continue to use the bytes mode (until you solve these issues?)

Mojibake: filenames are currently stored as bytes without the encoding
information. If a filename was created on Windows with the cp1252 ANSI code
page, or on Linux with the latin1 locale encoding, it is not displayed correctly
on Windows or Linux systems using a different code page/locale encoding (e.g.
UTF-8 on Linux).
=> convert your repository to Unicode
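
The mojibake problem can be reproduced in a few lines of Python 3. This is only a sketch of the effect, assuming a filename written by a cp1252 client and read by a UTF-8 one (and the reverse); it does not involve Mercurial itself.

```python
name = "caf\u00e9.txt"  # "cafe" with a precomposed e-acute

# Bytes as an old client with the cp1252 code page would store them.
stored = name.encode("cp1252")
print(stored)  # b'caf\xe9.txt'

# A UTF-8 system cannot decode these bytes strictly...
try:
    stored.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)

# ...and in the other direction the decoding "succeeds" but displays garbage:
# the two UTF-8 bytes of the e-acute become two unrelated cp1252 characters.
utf8_stored = name.encode("utf-8")
garbled = utf8_stored.decode("cp1252")
print(ascii(garbled))
```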

If filenames are stored as Unicode and your locale encoding cannot encode them,
you cannot checkout the repository.
=> continue to use the bytes mode, change your locale encoding or rename files

Mac OS X normalizes filenames to a variant of the decomposed form (NFD) when
the filesystem is HFS+.
=> Unicode filenames will be normalized

--

Now some technical details.

In the Python source code, it is not a good idea to have two versions of each
function, one to process bytes filenames and one to process Unicode filenames. I
suggest always using the Unicode type because:

- we can store any bytes in Unicode using the ASCII encoding and the
surrogateescape error handler (PEP 383)
- you don't need to store the encoding of a Unicode string, because the charset
is known (it's the Universal Character Set of Unicode)
- in Python 3, it's more practical to manipulate Unicode than bytes

We might use bytes by encoding Unicode to UTF-8, but it would be more
error-prone because you have to be very careful not to concatenate two byte
strings of different encodings.

So non-ASCII bytes will be stored in memory as surrogates in U+DC80-U+DCFF,
whereas ASCII characters will be stored as Unicode characters (U+0000-U+007F
range). On disk, the filenames will be stored as bytes.
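
A short Python 3 sketch of this storage scheme, using only the standard `surrogateescape` error handler from PEP 383. The sample filename is an assumption chosen for illustration.

```python
# A latin1-encoded name whose encoding is unknown to us.
raw = b"caf\xe9.txt"

# ASCII bytes become U+0000-U+007F; the byte 0xE9 becomes the lone
# surrogate U+DCE9 (0xDC00 + 0xE9).
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))

# The round trip back to bytes is lossless.
assert text.encode("ascii", "surrogateescape") == raw

# Such a string is not valid text, so a strict encoder refuses it.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogates cannot be encoded strictly")
```

The last step shows the price of this approach: a string carrying escaped bytes has to be encoded with the same error handler, otherwise encoding fails.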

A global flag (maybe something like "unicode" in .hg/requires?) would indicate
if we use bytes or Unicode.
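
As a rough sketch of how such a flag could be checked: `.hg/requires` is a plain-text file with one requirement name per line. The requirement name "unicode" is only the tentative name suggested above, and `has_requirement` is a hypothetical helper, not a Mercurial API.

```python
import os


def has_requirement(repo_root, name):
    """Return True if `name` is listed in <repo_root>/.hg/requires."""
    path = os.path.join(repo_root, ".hg", "requires")
    try:
        with open(path, "r", encoding="ascii") as fp:
            return name in fp.read().splitlines()
    except FileNotFoundError:
        # No requires file: treat the repository as having no requirements.
        return False


# "unicode" is a hypothetical requirement name used for illustration.
mode = "Unicode" if has_requirement(".", "unicode") else "bytes"
print("repository mode:", mode)
```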

(Unicode mode) Filenames will be normalized to NFC when a directory's content
is listed or when you pass a filename on the command line. So a checkout will
pass filenames normalized to NFC to the kernel. On Linux, Windows and Mac OS X,
keyboards produce precomposed characters (NFC), so it's better to use NFC. On
Mac OS X, the kernel normalizes the filenames to its variant of NFD.
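
The normalization step can be sketched with the standard `unicodedata` module. Note that this uses standard NFD, which is not exactly the variant HFS+ applies, and `normalize_filename` is a hypothetical helper name.

```python
import unicodedata

nfc = "caf\u00e9"   # precomposed e-acute (U+00E9), as typed on a keyboard
nfd = "cafe\u0301"  # "e" followed by a combining acute accent (U+0301)

# Both display the same, but they are different strings of different length.
print(nfc == nfd, len(nfc), len(nfd))


def normalize_filename(name):
    """Normalize a Unicode filename to NFC, as proposed for the Unicode mode."""
    return unicodedata.normalize("NFC", name)


# A name read back from a decomposing filesystem compares equal again
# once it has been normalized.
print(normalize_filename(nfd) == nfc)
```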

Victor
_______________________________________________
Mercurial-devel mailing list
Mercuri...@selenic.com
http://selenic.com/mailman/listinfo/mercurial-devel

Antoine Pitrou

Nov 4, 2011, 10:20:16 AM
to mercuri...@selenic.com
On Fri, 04 Nov 2011 13:47:14 +0100
Victor Stinner <victor....@haypocalc.com> wrote:
>
> Each repository mode has limitations:
>
> * Unicode: you cannot checkout a repository on UNIX if your locale is unable
> to encode all filenames. E.g. if your locale encoding is ASCII on Linux, you
> cannot clone a repository containing non-ASCII filenames.

You can clone, you cannot checkout.
But you could also checkout if you accepted some (bijective) mangling of
un-decodable filenames.
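
One possible shape for such a reversible mangling, as a Python 3 sketch. The escaping scheme ("%" doubled, other characters written as "%" plus six hexadecimal digits) and the names `mangle` and `unmangle` are assumptions invented for this example; nothing like this exists in Mercurial.

```python
import re


def mangle(name, encoding):
    """Escape the characters of `name` that `encoding` cannot represent."""
    out = []
    for ch in name:
        if ch == "%":
            out.append("%%")  # escape the escape character itself
            continue
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("%{:06X}".format(ord(ch)))
    return "".join(out)


_ESCAPE = re.compile(r"%(%|[0-9A-F]{6})")


def unmangle(mangled):
    """Reverse mangle(): restore the original Unicode filename."""
    def replace(match):
        token = match.group(1)
        return "%" if token == "%" else chr(int(token, 16))
    return _ESCAPE.sub(replace, mangled)


original = "\u65e5\u672c\u8a9e 100%.txt"  # Japanese characters plus a literal "%"
safe = mangle(original, "ascii")
print(safe)
assert unmangle(safe) == original
```

Because "%" is always doubled, a mangled name can never be confused with a name that already contained an escape-like sequence, which is what makes the mapping reversible.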

Regards

Antoine.

Andrey

Nov 4, 2011, 1:20:28 PM
to mercuri...@googlegroups.com, mercuri...@selenic.com
Great work.


On Friday, November 4, 2011 1:47:14 PM UTC+1, Victor Stinner wrote:

> The default kind will be bytes until enough third-party tools are compatible
> with Unicode (e.g. make).

I think the default should be Unicode. Users begin to use non-ASCII file names only once all the tools they use support them.
First, you create a file, run the tools, check the result and only then commit and push the changeset.

When new repositories are created with the proper style (UTF-8 encoded file names), it helps with backwards compatibility. Otherwise we will be forever stuck with the legacy layout.

-
Andrey

Victor Stinner

Nov 4, 2011, 7:48:29 PM
to mercuri...@selenic.com

The problem is that you will need the latest Mercurial version to check out (and
work on) such a repository.

Well, we may allow checking out such a repository, but old Mercurial versions
store filenames in the locale encoding (not in UTF-8). So if a new file is added
and pushed with an old Mercurial version to a "Unicode compliant" (new)
Mercurial server, "it doesn't work" (I don't think that the server can ask the
client for its locale encoding, and the hash will be different if the filename
is stored differently...).

That's why I see this new Unicode mode as a requirement (.hg/requires).
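
The hash problem can be illustrated with a few lines of Python 3. This is a simplified sketch, not Mercurial's actual manifest hashing: it only shows that the same filename encoded in two ways yields different bytes and therefore different SHA-1 digests.

```python
import hashlib

name = "caf\u00e9.txt"  # "cafe" with a precomposed e-acute

as_utf8 = name.encode("utf-8")      # what a Unicode-mode repository would store
as_cp1252 = name.encode("cp1252")   # what an old client on Windows would store

print(as_utf8, hashlib.sha1(as_utf8).hexdigest())
print(as_cp1252, hashlib.sha1(as_cp1252).hexdigest())
```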

Laurens Holst

Nov 5, 2011, 6:57:23 AM
to mercuri...@selenic.com
On 5-11-2011 0:48, Victor Stinner wrote:

> On Friday, November 4, 2011 18:20:28, Andrey wrote:
>> Great work.
>>
>> On Friday, November 4, 2011 1:47:14 PM UTC+1, Victor Stinner wrote:
>>> The default kind will be bytes until enough third-party tools are
>>> compatible
>>> with Unicode (e.g. make).

In other words: never.

I think a better point to switch the default would be when systems with
non-Unicode encoding are a thing of the past. Or to switch it right now,
and have a little more sensible fallback behaviour on non-Unicode
systems than ‘you can’t update at all’.

> The problem is that you will need the latest Mercurial version to check out (and
> work on) such a repository.
>
> Well, we may allow checking out such a repository, but old Mercurial versions
> store filenames in the locale encoding (not in UTF-8). So if a new file is added
> and pushed with an old Mercurial version to a "Unicode compliant" (new)
> Mercurial server, "it doesn't work"

Wouldn’t the behaviour for old client versions be identical to when the
repository was created on a UTF-8 system? That is, it checks out fine on
a UTF-8 system, and gets the usual garbling of non-ASCII characters on a
Latin-1 system.

Another possibility might be to add two configuration options, one
describing the repository encoding and one the target encoding. Without
these set, Mercurial is encoding-agnostic (current behaviour); when you
set the repository encoding, it automatically recodes filenames to your
local system’s encoding, or to the target encoding (if set). I think
this is similar to what the eol extension does.
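
A minimal Python 3 sketch of how the two options could interact. The function `recode_filename` and its parameter names are hypothetical; they only mirror the behaviour described in the paragraph above and are not existing Mercurial settings.

```python
import locale


def recode_filename(stored, repo_encoding=None, target_encoding=None):
    """Recode a stored byte filename for the local system.

    - no repo_encoding: encoding-agnostic, bytes pass through unchanged
    - repo_encoding only: recode to the local system's encoding
    - both set: recode to the explicit target encoding
    """
    if repo_encoding is None:
        return stored
    if target_encoding is None:
        target_encoding = locale.getpreferredencoding(False)
    return stored.decode(repo_encoding).encode(target_encoding)


stored = b"caf\xc3\xa9.txt"  # UTF-8 bytes as kept in the repository

print(recode_filename(stored))                      # unchanged
print(recode_filename(stored, "utf-8", "latin-1"))  # recoded for a latin1 system
```

In this sketch, a filename that the target encoding cannot represent raises `UnicodeEncodeError`, which corresponds to the "cannot check out" limitation discussed earlier in the thread.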

~Laurens

--
~~ Ushiko-san! Kimi wa doushite, Ushiko-san nan da!! ~~
Laurens Holst, developer, Utrecht, the Netherlands
Website: www.grauw.nl. Working @ www.roughcookie.com


Victor Stinner

Nov 5, 2011, 12:10:22 PM
to mercuri...@selenic.com
On Saturday, November 5, 2011 11:57:23, Laurens Holst wrote:
> On 5-11-2011 0:48, Victor Stinner wrote:
> > On Friday, November 4, 2011 18:20:28, Andrey wrote:
> >> Great work.
> >>
> >> On Friday, November 4, 2011 1:47:14 PM UTC+1, Victor Stinner wrote:
> >>> The default kind will be bytes until enough third-party tools are
> >>> compatible
> >>> with Unicode (e.g. make).
>
> In other words: never.

Windows has supported Unicode since Windows 95 (and non-BMP characters since
Windows 2000), but many Windows programs still use the ANSI (bytes) API (e.g.
Mercurial ;-)).

On Mac OS X, the kernel processes filenames as UTF-8, and most programs
indirectly use UTF-8 and so are Unicode compliant.

On UNIX, it really does depend on the locale encoding. There are still some
old systems using an encoding other than UTF-8, but all new systems use
UTF-8, and so, like Mac OS X, are Unicode compliant. But well, even if the
system uses the UTF-8 encoding, you may get mojibake if the encoding of a USB
key is not correctly detected, or if you unpack an old archive (e.g. a TAR
archive stores filenames as bytes: if you created your archive on a latin1
system, you must have a latin1 locale encoding).

So it *is* possible to have a fully Unicode compliant system today... if your
system is well configured, if you are careful, and if you don't have to handle
old content. There are many conditions, but it is possible ;-) And slowly it
is becoming easier to have such a system.

> I think a better point to switch the default would be when systems with
> non-Unicode encoding are a thing of the past.

As with any new feature, it is better to wait for user feedback to improve the
feature and maybe fix bugs before enabling it by default.

It would be too early to enable it by default right away, because users will
continue to use old Mercurial versions for some time (just as some people are
still using Python 2.4 even though Python 2.7 and 3.2 have been released), and
the new Unicode mode is not fully backward compatible.

> Or to switch it right now,
> and have a little more sensible fallback behaviour on non-Unicode
> systems than ‘you can’t update at all’.

The corner case is not "hg pull -u" but "hg push" (old repository => new
repository):

create on computer A (new Mercurial)

* create a new Unicode repository
* add content with non-ASCII filenames

work on computer B (old Mercurial)

* clone the repository
* add a new file with a non-ASCII filename
* hg ci
* hg push

After thinking twice, "hg push" is only a problem if you added new files with
names not decodable from UTF-8. It "works" if your locale encoding is UTF-8 or
if the filename is pure ASCII.

Mac OS X and most Linux distros use a UTF-8 locale encoding, but Windows does
not. So on Windows, with an old Mercurial, you will be limited to ASCII if you
add new files.

> Wouldn’t the behaviour for old client versions be identical to when the
> repository was created on a UTF-8 system? That is, it checks out fine on
> a UTF-8 system, and gets the usual garbling of non-ASCII characters on a
> Latin-1 system.

Yes.

> Another possibility might be to add two configuration options, one
> describing the repository encoding and one the target encoding. Without
> these set, Mercurial is encoding-agnostic (current behaviour); when you
> set the repository encoding, it automatically recodes filenames to your
> local system’s encoding, or to the target encoding (if set). I think
> this is similar to what the eol extension does.

It is not so different from my "Unicode mode", and so it has the same
constraints and limitations, except that it has an important advantage: it
helps to have a smoother transition (backward compatibility) if you work in a
homogeneous environment (e.g. only Windows with the cp1252 ANSI code page).
Python embeds most common encodings (e.g. most Windows code pages), so it can
work.

Being able to use latin1 (instead of UTF-8) would also help the corner case
because all byte strings are decodable from latin1.
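
This property of latin1 is easy to check in Python 3: each of the 256 byte values maps to the code point with the same number, so decoding never fails and the round trip is lossless. A small sketch:

```python
every_byte = bytes(range(256))

# latin1 maps byte N to code point U+00NN, so any byte string decodes.
text = every_byte.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))
assert text.encode("latin-1") == every_byte

# UTF-8, by contrast, rejects many byte sequences.
try:
    b"caf\xe9.txt".decode("utf-8")
except UnicodeDecodeError:
    print("not decodable from UTF-8, but decodable from latin1:",
          ascii(b"caf\xe9.txt".decode("latin-1")))
```

Decodable does not mean correctly displayed: bytes written in another encoding still decode from latin1, only to the wrong characters.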

It also avoids really converting the content of a repository: if the "new"
encoding is already able to decode all filenames, you don't have to transcode
filenames, and hashes are unchanged.

I like your idea :-)

Laurens Holst

Nov 6, 2011, 6:09:16 AM
to Victor Stinner, mercuri...@selenic.com
On 5-11-2011 17:10, Victor Stinner wrote:

>> Another possibility might be to add two configuration options, one
>> describing the repository encoding and one the target encoding. Without
>> these set, Mercurial is encoding-agnostic (current behaviour); when you
>> set the repository encoding, it automatically recodes filenames to your
>> local system’s encoding, or to the target encoding (if set). I think
>> this is similar to what the eol extension does.
> It is not so different from my "Unicode mode", and so it has the same
> constraints and limitations, except that it has an important advantage: it
> helps to have a smoother transition (backward compatibility) if you work in a
> homogeneous environment (e.g. only Windows with the cp1252 ANSI code page).
> Python embeds most common encodings (e.g. most Windows code pages), so it can
> work.
>
> Being able to use latin1 (instead of UTF-8) would also help the corner case
> because all byte strings are decodable from latin1.
>
> It also avoids really converting the content of a repository: if the "new"
> encoding is already able to decode all filenames, you don't have to transcode
> filenames, and hashes are unchanged.

I think this is the main advantage, yes.

The downside is that this way the transcoding is something the user needs to
set manually in their configuration file, even though the project itself should
know whether its build tools are encoding-agnostic (make) or
encoding-aware (ant). So this information could just as well be stored
in the repository. To make it more complicated, also consider the case
where I switch my build system from ant to make (would you want to recode
the entire working copy? uff).

This would be particularly useful for the case of a UTF-8 repository on
Windows. On Windows, if you use the ‘bytes’ API it uses cp1252 (on most
of our western systems), not UTF-8, and I don’t think this will ever
change, for backwards compatibility reasons. I wouldn’t even call it a
bytes API really. If you stored the origin encoding of the
repository in the repository itself, together with a transcode=true
flag, Windows could make a decision on what API to use.

Having such information stored in the repository (regardless of the
transcode flag) may also be useful to prevent inconsistent encodings in
the repository, and for hgweb as well.

So maybe it would be best to have a way to set the repository encoding on a
repository, but without having to convert an existing repository.
Perhaps through pushkeys? This would have the advantage of being able to
set this for the entire repo in retrospect. Or else perhaps a versioned
.hgencoding file.

p.s. Another thing, I may be wrong but I seem to recall that Mercurial
uses a particular flag to open files that is available on the bytes API,
but not on the unicode API? I’m not sure but perhaps worth checking out.
