A proposal for solving the encoding problem on Windows.

罗勇刚(Yonggang Luo)

Oct 20, 2011, 12:04:54 PM
to mercurial, mercuria...@googlegroups.com
Things that need to be done:
1. Pure-ASCII repositories won't be affected. This is easy.
2. No OS except Windows will be affected; checking sys.platform == 'win32' ensures that.
3. Use UTF-8 as the default encoding for new commits.
4. Support old Mercurial repositories with messed-up filenames; add a new --encoding option to do that.
5. Support automatic renaming from old messed-up filenames to new UTF-8 filenames.

Additions to this scheme, or issues with it, are welcome.
--
Yours sincerely,
罗勇刚 (Yonggang Luo)

罗勇刚(Yonggang Luo)

Oct 20, 2011, 12:20:31 PM
to mercurial, mercuria...@googlegroups.com
A proposal for solving the encoding problem on Windows.

Things that need to be done:
1. Pure-ASCII repositories won't be affected. This is easy.
2. No OS except Windows will be affected; checking sys.platform == 'win32' ensures that.
3. Use UTF-8 as the default encoding for new commits.
4. Support old Mercurial repositories with messed-up filenames; add a new --encoding option to do that.
5. Support automatic renaming from old messed-up filenames to new UTF-8 filenames.
6. Support accented characters seamlessly between Windows and Linux; a test case is needed.

General test case: commit and check out the following files without problems on Windows 2000 and later. Each file's content is its own filename, encoded as UTF-8.
Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt

Additions to this scheme, or issues with it, are welcome.

Paul Boddie

Oct 20, 2011, 12:25:22 PM
to luoyo...@gmail.com, mercurial
On 20/10/11 18:20, 罗勇刚(Yonggang Luo) wrote:
> A proposal on solve encoding problem on Windows.
> Things need to be done:
> 1. all pure-ASCII repostiroy won't be affected. this is easy
> 2. all OS except Windows won't be affected. using sys.platform == 'win32' to
> ensure that.
> 2. use utf8 as the default encoding for new commits.
> 4. supporting for messed up old mercurial repository. add new --encoding
> option to do that.
> 5. supporting automatically rename from old messed up filename to newly utf8
> filename.
> 6. supporting seamlessly 'accented characters' between windows and linux,
> testcase needed.

[...]

Might I suggest that you move the evolution of this proposal to a Wiki
page so that, interesting as this discussion has been, we do not have to
read messages for each edit to the proposal and each repeated assertion
and counter-assertion?

Paul

P.S. Dropping the Google group from the recipients list since it
requires a subscription.
_______________________________________________
Mercurial mailing list
Merc...@selenic.com
http://selenic.com/mailman/listinfo/mercurial

罗勇刚(Yonggang Luo)

Oct 20, 2011, 12:33:38 PM
to Paul Boddie, mercurial, mercuria...@googlegroups.com


2011/10/21 Paul Boddie <paul....@biotek.uio.no>

> On 20/10/11 18:20, 罗勇刚(Yonggang Luo) wrote:
>
>> A proposal on solve encoding problem on Windows.
>> Things need to be done:
>> 1. all pure-ASCII repostiroy won't be affected. this is easy
>> 2. all OS except Windows won't be affected. using sys.platform == 'win32' to
>> ensure that.
>> 2. use utf8 as the default encoding for new commits.
>> 4. supporting for messed up old mercurial repository. add new --encoding
>> option to do that.
>> 5. supporting automatically rename from old messed up filename to newly utf8
>> filename.
>> 6. supporting seamlessly 'accented characters' between windows and linux,
>> testcase needed.
>
> [...]
>
> Might I suggest that you move the evolution of this proposal to a Wiki page so that, interesting as this discussion has been, we do not have to read messages for each edit to the proposal and each repeated assertion and counter-assertion?

No problem, but how will I receive comments? If my proposal is just a joke, how would I know? I want feedback.

> Paul
>
> P.S. Dropping the Google group from the recipients list since it requires a subscription.



--

罗勇刚(Yonggang Luo)

Oct 20, 2011, 12:40:21 PM
to Paul Boddie, mercurial, mercuria...@googlegroups.com

Matt Mackall

Oct 20, 2011, 12:49:33 PM
to luoyo...@gmail.com, mercurial
On Fri, 2011-10-21 at 00:04 +0800, 罗勇刚(Yonggang Luo) wrote:
> Things need to be done:
> 1. all pure-ASCII repostiroy won't be affected. this is easy
> 2. all OS except Windows won't be affected. using sys.platform == 'win32' to
> ensure that.
> 2. use utf8 as the default encoding for new commits.

This is not sufficiently backwards-compatible. Alice using Mercurial 2.1
checks in a new file to the existing project, Bob and Carl and Dave and
Erica using Mercurial 1.8 can't check it out.

For many users, this would be a serious regression: Windows users using
the same SBCS code page can already share files just fine.

> 4. supporting for messed up old mercurial repository. add new --encoding
> option to do that.

Same problem. A solution that breaks things for existing users will not
be considered.


Here's an alternate scheme:

if windows:
  find manifest of the parent commit
  if manifest is empty:
    # brand new repo
    mode = utf8transcoding
  if all files in manifest are valid UTF-8:
    # repo is already in UTF-8 mode or is pure ASCII
    mode = utf8transcoding
  else:
    # existing repo, possibly using a Windows character set
    mode = passthrough
else:
  mode = passthrough

Notes:
1. We can reliably detect UTF-8 with very high probability
2. This automatically does the right thing on existing repos
3. This automatically does the right thing when working with Linux users
on UTF-8
4. Existing repos can be upgraded to UTF-8 if desired
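
The detection scheme above can be sketched in a few lines of Python. This is only an illustration, not Mercurial code: the function names, the mode strings and the idea of passing the manifest as a list of raw byte strings are assumptions made for the example.

```python
import sys


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def choose_mode(manifest_names, platform=sys.platform):
    """Pick a filename mode from the parent manifest (hypothetical helper).

    manifest_names is an iterable of filenames as raw bytes, as stored in
    the parent commit's manifest.
    """
    if platform != "win32":
        return "passthrough"
    # all() is True for an empty manifest, so a brand new repo also ends up
    # in UTF-8 transcoding mode, as does a pure-ASCII repo.
    if all(is_valid_utf8(name) for name in manifest_names):
        return "utf8transcoding"
    # Existing repo, possibly using a Windows character set.
    return "passthrough"
```

A name stored in cp936, such as "简体.txt".encode("cp936"), is not valid UTF-8, so a manifest containing it selects passthrough.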

--
Mathematics is the supreme nostalgia of our time.

Matt Mackall

Oct 20, 2011, 12:51:39 PM
to luoyo...@gmail.com, mercurial
On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> The wiki link is
> http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solve_encoding_problem_on_Windows.
> comment are welcome.

Putting it on the wiki was actually bad advice: the wiki is primarily
for reference. We don't want to collect lots of non-official and/or
half-baked proposals there. I've reverted it, we should keep the
discussion either here or on mercurial-devel.

--
Mathematics is the supreme nostalgia of our time.

Mike Meyer

Oct 20, 2011, 1:02:19 PM
to luoyo...@gmail.com, mercuria...@googlegroups.com, mercurial
On Thu, Oct 20, 2011 at 9:04 AM, 罗勇刚(Yonggang Luo) <luoyo...@gmail.com> wrote:
> Things need to be done:
> 1. all pure-ASCII repostiroy won't be affected. this is easy
> 2. all OS except Windows won't be affected. using sys.platform == 'win32' to ensure that.
> 2. use utf8 as the default encoding for new commits.
> 4. supporting for messed up old mercurial repository. add new --encoding option to do that.
> 5. supporting automatically rename from old messed up filename to newly utf8 filename.

So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?

   <mike

Ben Fritz

Oct 20, 2011, 1:34:26 PM
to merc...@selenic.com

On Oct 20, 11:51 am, Matt Mackall <m...@selenic.com> wrote:
> On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> > The wiki link is

> >http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solv....


> > comment are welcome.
>
> Putting it on the wiki was actually bad advice: the wiki is primarily
> for reference. We don't want to collect lots of non-official and/or
> half-baked proposals there. I've reverted it, we should keep the
> discussion either here or on mercurial-devel.
>

What about using the talk page?

Ben Fritz

Oct 20, 2011, 1:35:18 PM
to merc...@selenic.com

On Oct 20, 11:25 am, Paul Boddie <paul.bod...@biotek.uio.no> wrote:
>
> P.S. Dropping the Google group from the recipients list since it
> requires a subscription.

Google groups got the message anyway, I don't think you need to
explicitly CC it as long as you include the primary list address.

Ben Fritz

Oct 20, 2011, 1:36:29 PM
to merc...@selenic.com

On Oct 20, 12:34 pm, Ben Fritz <fritzophre...@gmail.com> wrote:
> On Oct 20, 11:51 am, Matt Mackall <m...@selenic.com> wrote:
>
> > On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> > > The wiki link is
> > >http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solv....
> > > comment are welcome.
>
> > Putting it on the wiki was actually bad advice: the wiki is primarily
> > for reference. We don't want to collect lots of non-official and/or
> > half-baked proposals there. I've reverted it, we should keep the
> > discussion either here or on mercurial-devel.
>
> What about using the talk page?

Never mind, stupid me didn't check whether the wiki software used on
Mercurial's wiki actually has such a concept. Please forgive the
noise :-)

Matt Mackall

Oct 20, 2011, 1:37:08 PM
to Ben Fritz, merc...@selenic.com
On Thu, 2011-10-20 at 10:34 -0700, Ben Fritz wrote:
>
> On Oct 20, 11:51 am, Matt Mackall <m...@selenic.com> wrote:
> > On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> > > The wiki link is
> > >http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solv....
> > > comment are welcome.
> >
> > Putting it on the wiki was actually bad advice: the wiki is primarily
> > for reference. We don't want to collect lots of non-official and/or
> > half-baked proposals there. I've reverted it, we should keep the
> > discussion either here or on mercurial-devel.
> >
>
> What about using the talk page?

It's possible, but it's not a standard convention on our wiki, nor does
Moin do much to facilitate it. Talk happens here or on IRC.

--
Mathematics is the supreme nostalgia of our time.

Andrey

Oct 21, 2011, 4:37:47 AM
to mercuria...@googlegroups.com, mercurial



> So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?
>
>    <mike

Your local Mercurial client on your Unix box should know which encoding is used for the file (it can be a general setting per user, per project or per file, similar to the eol extension). When you commit, the client transforms the bytes to UTF-8 and stores them in the repository. Windows users get the UTF-8 encoded name, and the Mercurial client for Windows transforms it to whatever is used on their system and creates the local file.
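
As a rough sketch of the transcoding step described above (not Mercurial's actual code; the function names and the explicit local_encoding parameter are assumptions for illustration), the commit side and the checkout side could look like this in Python:

```python
def to_repo_name(local_name: bytes, local_encoding: str) -> bytes:
    """Commit side: convert a filename from the local encoding to UTF-8."""
    return local_name.decode(local_encoding).encode("utf-8")


def to_local_name(repo_name: bytes, local_encoding: str) -> bytes:
    """Checkout side: convert a stored UTF-8 filename to the local encoding.

    Raises UnicodeEncodeError when the local encoding cannot represent
    the name.
    """
    return repo_name.decode("utf-8").encode(local_encoding)
```

A name committed from a Latin-1 Unix box and checked out on a cp1252 Windows box goes through both functions in turn. The checkout step fails with UnicodeEncodeError when the target code page lacks a character, which is the case the proposal would still have to handle.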

Andrey

Oct 21, 2011, 4:59:31 AM
to mercuria...@googlegroups.com, mercurial
The most important goal for me was actually this:
2 (3?). use utf8 as the default encoding for new commits.

Now I see (thanks, Matt), that it may introduce serious regression problems.
I need some time to think about a possible solution.
 

>   if all files in manifest are valid UTF-8:
>     # repo is already in UTF-8 mode or is pure ASCII
>     mode = utf8transcoding

This check is just a guess. We cannot rely on it. In general, it is not possible to detect the encoding from the sequence of bytes.

I am afraid the change should only be implemented for brand-new repositories.

P.S. I would prefer if someone (with a broad view) can maintain a wiki/document/specification for this proposal. Otherwise, it is rather difficult to follow the discussion...

Roger Gammans

Oct 21, 2011, 5:19:47 AM
to Matt Mackall, mercurial
On Thu, Oct 20, 2011 at 11:49:33AM -0500, Matt Mackall wrote:
> Here's an alternate scheme:
>
> if windows:
> find manifest of the parent commit
> if manifest is empty:
> # brand new repo
> mode = utf8transcoding
> if all files in manifest are valid UTF-8:
> # repo is already in UTF-8 mode or is pure ASCII
> mode = utf8transcoding
> else:
> # existing repo, possibly using a Windows character set
> mode = passthrough
> else:
> mode = passthrough

Instead of running a mode-guessing algorithm, couldn't Mercurial
store a repo encoding field, which, if blank or non-existent, makes
Mercurial default to the current behavior?

Then if it is set, hg can try to transcode to 'native', however that
is defined (it should also have a switch to disable this).

That means for existing repos nothing changes unless somebody adds a
commit with an encoding field.

And perhaps the field only needs to be added on the first commit
that adds a non-ANSI filename, which means everything stays the same
for those that don't need encodings. But I'm less sure about this part;
certainly I think you should also be able to 'force' the encoding (e.g.
set the repo encoding with 'hg set-encoding foo') and work with a
non-stated one (e.g. 'hg checkout --repo-encoding=bar --local-encoding=foo')
for when you are working with legacy/broken repos.

This seems so simple, I'm guessing there is a very good reason it hasn't
been suggested.

TTFN
--
Roger. Home| http://www.sandman.uklinux.net/
Master of Peng Shui. (Ancient oriental art of Penguin Arranging)
Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
New key Fpr: 0F2F E1DF 4CD2 5E7B EF9F B173 4CFA F143 ADBE 6B00

Andrey

Oct 21, 2011, 6:13:13 AM
to mercuria...@googlegroups.com, mercurial

> Instead of running a mode guess algorithm, couldn't mercurial
> store a repo encoding field. Which if blank /non-exsitant  mercurial
> defaults to the current behavior ?
>
> Then if it is set hg can try to transcode to 'native' however that
> is defined. (it should also have a switch to disable this) .
>
> That means for existing repos nothing changes unless somebody adds a
> commit with an encoding field.

Do you mean that the whole repository must be configured to contain only Cp1252 or UTF-16 encoded names?
Do you mean to change the manifest file to keep the encoding information per file?

Roger Gammans

Oct 21, 2011, 6:30:20 AM
to mercuria...@googlegroups.com, mercurial
On Fri, Oct 21, 2011 at 03:13:13AM -0700, Andrey wrote:
>
>
> > Instead of running a mode guess algorithm, couldn't mercurial
> > store a repo encoding field. Which if blank /non-exsitant mercurial
> > defaults to the current behavior ?
> >
> Do you mean that the whole repository must be configured to contain only
> Cp1252 or UTF-16 encoded names ?
> Do you mean to change the manifest file to keep the encoding information per
> file ?

Where it is stored is an implementation detail, except that I believe it
should be revision controlled.

But do you really have a use case for files in a single repo *stored*
with different encodings? It strikes me that that way madness lies.

I have to say I can't imagine one; doesn't UTF-8 encode all the same code points
as cp1252?

Have I missed something? Does the Windows wide-character API do something odd
when given the top portion of cp1252?
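
The first of these questions can be checked mechanically. The sketch below uses Python's own cp1252 codec table (an assumption: it says nothing about what the Windows wide-character API does) and round-trips every single byte through UTF-8. The function name is made up for the example.

```python
def check_cp1252_via_utf8():
    """Round-trip every cp1252 byte through UTF-8 and report problems.

    Returns (undefined, mismatched): byte values the codec does not define,
    and byte values that did not survive cp1252 -> UTF-8 -> cp1252.
    """
    undefined, mismatched = [], []
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # A few positions in the 0x80-0x9F range have no character.
            undefined.append(value)
            continue
        back = text.encode("utf-8").decode("utf-8").encode("cp1252")
        if back != raw:
            mismatched.append(value)
    return undefined, mismatched
```

Every byte the codec defines survives the trip, so the second list comes back empty; the only gaps are the handful of unassigned positions in the upper range.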

Andrey

Oct 21, 2011, 6:40:03 AM
to mercuria...@googlegroups.com, mercurial
Since I did not quite catch your answer, I will repeat the question.

If it should be a boolean setting, does it mean that all the files must have the same UTF-8 encoding (which is not the case at the moment)?
If it should be a value setting, do you mean that all the file names must be Cp1252 encoded (or whatever the setting is)?

罗勇刚(Yonggang Luo)

Oct 21, 2011, 6:57:17 AM
to Roger Gammans, Matt Mackall, mercurial
Agreed, +1

2011/10/21, Roger Gammans <rgam...@computer-surgery.co.uk>:

--
Sent from my mobile device

Yours sincerely,
罗勇刚 (Yonggang Luo)

罗勇刚(Yonggang Luo)

Oct 21, 2011, 7:56:43 AM
to mercurial, mercuria...@googlegroups.com
A proposal for solving the encoding problem on Windows.
Things that need to be done:
1. Pure-ASCII repositories won't be affected. This is easy.
2. No OS except Windows will be affected; checking sys.platform == 'win32' ensures that.
3. Use UTF-8 as the default encoding for new commits.
4. Support old Mercurial repositories with messed-up filenames; add a new --encoding option to do that.
5. Support automatic renaming from old messed-up filenames to new UTF-8 filenames.


Implementation details:
1. We need a per-repository configuration. (Per-user is not a good idea, because the encoding is a property of the repository, and different repositories will have different encodings.)
 The repository's encoding configuration should look like this:
old_repository_encoding: e.g. ascii, cp1251, cp936, cp1252, utf8 and so on
separator revision number: e.g. 128
repository_encoding: e.g. utf8. (This should be the constant utf8. It still makes sense to store it, so that the old repository's encoding is clearly identified without affecting old Mercurial versions while being easy for new Mercurial versions to handle.)

Three parameters are needed because, when migrating from an old repository to a new one, we face a problem: how do we check out old history, for example an old tag or an old branch? When checking out those revisions, we need to work the old way.


So the scheme should be something like this:
if windows:
  if there is configuration about encoding:
    rev = current revision
    if rev is working directory:
      # The working directory's encoding may differ from its parent's
      # encoding. When that happens, we should automatically handle the
      # rename of the messed-up filenames.
      mode = repository_encoding, parent_encoding
    elif rev <= separator revision:
      mode = old_repository_encoding
    else:
      mode = repository_encoding
  else:
    mode = passthrough
else:
  mode = passthrough


We also need to supply an upgrade tool that switches a repository created with old tools to UTF-8 as the repository encoding:
hg encoding --old ascii --new utf8
hg encoding --old cp936 --new utf8

Notes:
0. Everything happens only on Windows.
1. Everything is determined by configuration, so nothing has to be guessed; a single upgrade command is needed once, and after that everything is handled automatically.
2. Existing repos will not be affected at all unless the user runs the hg encoding command.
3. Newly created repositories default to UTF-8, but we can still supply an option to create old-style repositories.



General test case: commit and check out the following files without problems on Windows 2000 and later. Each file's content is its own filename, encoded as UTF-8.
Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt
áéíóúñ.txt 'accented characters'
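
The filesystem half of this test case can be tried without Mercurial. The sketch below is an assumption-laden illustration: it writes each file under its UTF-8 encoded name with its own name as content, then lists the directory and reads everything back. It relies on Python 3.6 or later, where byte paths are treated as UTF-8 on Windows, and it normalizes to NFC because some filesystems (notably on Mac OS) return decomposed names.

```python
import os
import tempfile
import unicodedata

TEST_NAMES = [
    "Chinese (Traditional).txt",
    "简体.txt",
    "繁体.txt",
    "중국어 (번체).txt",
    "Chinês (Tradicional).txt",
    "áéíóúñ.txt",
]


def write_and_read_back(names):
    """Create each file with its own UTF-8 name as content, then read back.

    Returns a dict mapping each listed filename to the file's content,
    both NFC-normalized.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        for name in names:
            raw = name.encode("utf-8")
            with open(os.path.join(tmp_bytes, raw), "wb") as handle:
                handle.write(raw)
        for raw_name in os.listdir(tmp_bytes):
            with open(os.path.join(tmp_bytes, raw_name), "rb") as handle:
                content = handle.read().decode("utf-8")
            listed = unicodedata.normalize("NFC", raw_name.decode("utf-8"))
            results[listed] = unicodedata.normalize("NFC", content)
    return results
```

A full test of the proposal would put a commit and a checkout between the write and the read; this only shows what "content equals filename" means in practice.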

罗勇刚(Yonggang Luo)

Oct 21, 2011, 8:08:37 AM
to mercurial, mercuria...@googlegroups.com

Wiki page attached for easier reading.

--

Andrey

Oct 21, 2011, 8:39:16 AM
to mercuria...@googlegroups.com, mercurial
You greatly simplified the task.



> Implement detail:
> 1.We need per-repository configuration about, (per-user is not a good idea, because encoding is based on repository, and differernt repository will have different encoding).
>  the repository’s encoding configuration should be like this:
> old_repository_encoding such as ‘ascii’ ‘cp1251’ cp936 cp1252 utf8 and so on

You make an assumption that _all_ the file names in the repository have the same encoding. This is not the case.

> Setting three parameters because when we migrating from old repository to new repository, we need to face a problem, how to checkout to old history? for example checkout old-tag, old-branch? When checkout those revision, then we need work in the old-way.

Please consider the case when the old client (for instance version 1.8), which is not aware of the settings you describe, wants to check out the files.

> Also which need to supply a upgrade tool to upgrade to using utf8 as the repository encoding on old tools.
> hg encoding --old ascii --new utf8
> hg encoding --old cp936 --new utf8

I am sorry, I do not understand this at all. Do you affect the working directory? Do you commit? Do you change the history?

> Notes:
> 0. Evertying happened under windows.
> 1. Everything is decided, so there is no possibility, only need a upgrade instruction at once, the everything will handled automatically.
> 2. Existing repos will not be affected at all, except the user execute the hg encoding instruction
> 3. Newly created repository default setting to utf8, but we can still supply option to create old way repository.

Unix is also affected. See the comments about Mac OS and different encodings for Unix systems.

-
Andrey

罗勇刚(Yonggang Luo)

Oct 21, 2011, 8:50:24 AM
to mercuria...@googlegroups.com


2011/10/21 Andrey <py4...@gmail.com>

> You greatly simplified the task.
>
>> Implement detail:
>> 1.We need per-repository configuration about, (per-user is not a good idea, because encoding is based on repository, and differernt repository will have different encoding).
>>  the repository’s encoding configuration should be like this:
>> old_repository_encoding such as ‘ascii’ ‘cp1251’ cp936 cp1252 utf8 and so on
>
> You make an assumption that _all_ the file names  in the repository have the same encoding. This is not the case.
>
>> Setting three parameters because when we migrating from old repository to new repository, we need to face a problem, how to checkout to old history? for example checkout old-tag, old-branch? When checkout those revision, then we need work in the old-way.
>
> Please consider the case when the old client (for instance version 1.8), which is not aware of the settings you describe, wants to check out the files.

That's impossible if we want to use UTF-8 as the encoding; it is like expecting Win98 to support Unicode properly.

>> Also which need to supply a upgrade tool to upgrade to using utf8 as the repository encoding on old tools.
>> hg encoding --old ascii --new utf8
>> hg encoding --old cp936 --new utf8
>
> I am sorry, I do not understand this at all. Do you affect the working directory ? Do you commit ?  Do you change the history ?
>
>> Notes:
>> 0. Evertying happened under windows.
>> 1. Everything is decided, so there is no possibility, only need a upgrade instruction at once, the everything will handled automatically.
>> 2. Existing repos will not be affected at all, except the user execute the hg encoding instruction
>> 3. Newly created repository default setting to utf8, but we can still supply option to create old way repository.
>
> Unix is also affected. See  the comments about Mac OS and different encodings for Unix systems.

Why would Unix be affected? Everything is guarded by
if windows:

> -
> Andrey


罗勇刚(Yonggang Luo)

Oct 21, 2011, 8:51:13 AM
to mercuria...@googlegroups.com
Update
== Things that need to be done ==
 *1. Pure-ASCII repositories won't be affected. This is easy.
 *2. No OS except Windows will be affected; checking sys.platform == 'win32' ensures that.
 *3. Use UTF-8 as the default encoding for new commits.
 *4. Support old Mercurial repositories with messed-up filenames; add a new --encoding option to do that.
 *5. Support automatic renaming from old messed-up filenames to new UTF-8 filenames.


== Implementation details ==
 * We need a per-repository configuration. (Per-user is not a good idea, because the encoding is a property of the repository, and different repositories will have different encodings.)
 * The repository's encoding configuration should look like this:
{{{
old_repository_encoding: e.g. ascii, cp1251, cp936, cp1252, utf8 and so on
separator revision number: e.g. 128
repository_encoding: e.g. utf8 or ascii. (This should normally be the constant utf8. It still makes sense to store it, so that the old repository's encoding is clearly identified without affecting old Mercurial versions while being easy for new Mercurial versions to handle.)
}}}

Three parameters are needed because, when migrating from an old repository to a new one, we face a problem: how do we check out old history, for example an old tag or an old branch? When checking out those revisions, we need to work the old way.


== The scheme ==
{{{#!python
if windows:
  if there is configuration about encoding:
    rev = current revision
    if rev is working directory:
      '''
      The working directory's encoding may differ from its parent's encoding.
      When that happens, we should automatically handle the rename of the
      messed-up filenames.
      '''
      mode = repository_encoding, parent_encoding
    elif rev <= separator revision:
      mode = old_repository_encoding
    else:
      mode = repository_encoding
  else:
    mode = passthrough
else:
  mode = passthrough
}}}
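
The scheme above is pseudocode; a runnable Python sketch of the same decision might look like the following. The class and function names, the use of None to stand for the working directory, and the explicit parent_rev argument are all assumptions made for illustration.

```python
import sys
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class EncodingConfig:
    """Hypothetical per-repository settings from the proposal."""
    old_repository_encoding: str
    separator_revision: int
    repository_encoding: str = "utf8"


def encoding_for_revision(rev: int, config: EncodingConfig) -> str:
    """Revisions up to the separator use the old encoding, later ones the new."""
    if rev <= config.separator_revision:
        return config.old_repository_encoding
    return config.repository_encoding


def encoding_mode(
    rev: Optional[int],
    config: Optional[EncodingConfig],
    parent_rev: int = 0,
    platform: str = sys.platform,
) -> Union[str, Tuple[str, str]]:
    """Return the filename mode; rev=None stands for the working directory."""
    if platform != "win32" or config is None:
        return "passthrough"
    if rev is None:
        # New names are written in the repository encoding; the parent's
        # encoding is kept alongside so renames can be detected.
        return (config.repository_encoding,
                encoding_for_revision(parent_rev, config))
    return encoding_for_revision(rev, config)
```

With EncodingConfig("cp936", 128), revision 100 is read as cp936 and revision 129 as utf8. Mercurial revision numbers are local to a clone, so an actual implementation would probably need to record the separator as a changeset hash rather than a number.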
== Upgrade ==
We also need to supply an upgrade tool that switches a repository created with old tools to UTF-8 as the repository encoding.
{{{#!cmd
hg encoding upgrade --old ascii --new utf8 --sep 1920
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --new utf8 # The old encoding is that of the current locale
hg encoding upgrade # The old encoding is that of the current locale and the new one is utf8
hg encoding --verify # Iterate over the whole repository to verify that each path in each revision is encoded in the configured encoding.
hg encoding upgrade --clear
}}}
Option explanations:
{{{
--old The old repository's original encoding.
--new In most cases this is utf8. When someone wants to use Mercurial the old way, it can be set to something else (such as ascii, to make sure every committed filename is encoded in ASCII).
--sep Sets the separator revision. When not specified, it is the newest revision; if the old encoding is ascii, the separator revision is 0.
--clear Regenerates the repository with every path in every revision converted to the new encoding (utf8). (Some people may want this.)
}}}
== Notes ==
 *1. Everything happens only on Windows.
 *2. Everything is determined by configuration, so nothing needs to be guessed; a single upgrade command is needed once, and after that everything is handled automatically.
 *3. Existing repos will not be affected at all unless the user runs the hg encoding command to upgrade the repository.
 *4. Newly created repositories default to UTF-8, but we can still supply an option to create a repository the old way.

== General test case ==
Commit and check out the following files without problems on Windows 2000 and later. Each file's content is its own filename, encoded as UTF-8.
{{{
Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt
áéíóúñ.txt 'accented characters'
}}}

罗勇刚(Yonggang Luo)

Oct 21, 2011, 9:00:48 AM
to mercuria...@googlegroups.com, mercurial


2011/10/21 罗勇刚(Yonggang Luo) <luoyo...@gmail.com>



2011/10/21 Andrey <py4...@gmail.com>
>> You greatly simplified the task.
>>
>>> Implement detail:
>>> 1.We need per-repository configuration about, (per-user is not a good idea, because encoding is based on repository, and differernt repository will have different encoding).
>>>  the repository’s encoding configuration should be like this:
>>> old_repository_encoding such as ‘ascii’ ‘cp1251’ cp936 cp1252 utf8 and so on
>>
>> You make an assumption that _all_ the file names  in the repository have the same encoding. This is not the case.

Sorry, not the case in what way? Can you give me more detailed information?

>>> Setting three parameters because when we migrating from old repository to new repository, we need to face a problem, how to checkout to old history? for example checkout old-tag, old-branch? When checkout those revision, then we need work in the old-way.
>>
>> Please consider the case when the old client (for instance version 1.8), which is not aware of the settings you describe, wants to check out the files.
>
> That's impossible, if we  want to use utf8 as the encoding, that's likes you want win98 support for Unicode properly.

For example, suppose you create a repository on Linux with hg 1.8 using UTF-8 (the default on Linux) and commit Unicode filenames; how do you check it out on Windows? Now just replace "Linux with hg 1.8" by "Windows with a Unicode-aware hg 2.x": it is exactly the same situation.
-- 

Andrey

Oct 21, 2011, 9:46:44 AM
to mercuria...@googlegroups.com
Please do not narrow down the problem to just 'Unix against Windows'.

1) The problem is that Mercurial must work properly when the OS expects file names in an encoding other than UTF-8 (the encoding is known, it must be the same for all files in the working directory, and it must be configurable). Unix can be configured to use an encoding other than UTF-8.
2) It should be possible for users to use different encodings but contribute to the same repository. For instance, Unix and Windows users work together on the same project.
Please note that the encoding is not a property of the repository but a property of the working directory.
3) Old Mercurial versions should work either the same or better (no regression).

> You make an assumption that _all_ the file names  in the repository have the same encoding. This is not the case.
> Sorry, Not the case? for what? can you give me more detail information.

I am afraid I was wrong here. Indeed, to work properly, all the file names in the repository must currently use the same encoding.

Roger Gammans

Oct 21, 2011, 10:34:53 AM
to Andrey, merc...@selenic.com
On Fri, Oct 21, 2011 at 06:46:44AM -0700, Andrey wrote:
> Please note, that the encoding is not the property of the repository but the
> property of the working directory.
No. BOTH have this property, and they might be different.

You can guess or make a stab at the environment encoding. But you can't guess the
repo encoding, as this is the encoding of the environment where the commit
occurred.

So my suggestion was to add a repo encoding property, so that this
is known for repos which set it. This then means that meaningful
transcoding between the repo and the local encoding can be done.

And with a '--local-encoding' switch allowing override of the
'guess', the behaviour is as predictable as possible.

> 3) the old Mercurial versions should work either the same or better (no
> regression)

Unfortunately I don't see how that is possible. In my mind we would
add a .hg/requires entry when the encoding was set, and repos
with an encoding set would not be usable with previous versions. I don't see
how they can reliably be, as they won't 'guarantee' the repo encoding.
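
To illustrate the .hg/requires idea: the file lists one requirement per line, and a client is expected to refuse repositories that list something it does not know. The sketch below reads that file and reports unknown entries. The entry name "utf8filenames" and both function names are invented for this example and are not actual Mercurial identifiers.

```python
import os

# Hypothetical requirement name, made up for illustration only.
ENCODING_REQUIREMENT = "utf8filenames"


def read_requirements(repo_root):
    """Return the set of entries listed in <repo_root>/.hg/requires."""
    path = os.path.join(repo_root, ".hg", "requires")
    try:
        with open(path, encoding="ascii") as handle:
            return {line.strip() for line in handle if line.strip()}
    except FileNotFoundError:
        # Very old repositories have no requires file at all.
        return set()


def missing_requirements(repo_root, supported):
    """Return the requirements this client does not support."""
    return read_requirements(repo_root) - set(supported)
```

A client that supports only "revlogv1" and "store" would get {"utf8filenames"} back for an upgraded repository and could abort with a clear message instead of misreading the filenames.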

TTFN
--
Roger. Home| http://www.sandman.uklinux.net/
Master of Peng Shui. (Ancient oriental art of Penguin Arranging)
Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
New key Fpr: 0F2F E1DF 4CD2 5E7B EF9F B173 4CFA F143 ADBE 6B00

Martin Geisler

Oct 21, 2011, 10:41:41 AM
to mercurial
Andrey <py4...@gmail.com> writes:

> The most important goal for me was actually this: 2 (3?). use utf8 as
> the default encoding for new commits.
>
> Now I see (thanks, Matt), that it may introduce serious regression
> problems. I need some time to think about a possible solution.
>
>> if all files in manifest are valid UTF-8:
>> # repo is already in UTF-8 mode or is pure ASCII
>> mode = utf8transcoding
>
> This check is just a guess. We cannot rely on it. In general, it is
> not possible to detect the encoding from the sequence of bytes.

You're right in principle: a Latin-1 encoded text containing "pÃ¦re" also
happens to be the UTF-8 encoding of "pære". However, what Matt writes is
that the chance of that happening is small, and so it is okay with him to
declare a text to be UTF-8 if it can be correctly decoded as such.
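
The ambiguity is easy to reproduce in Python. This small sketch (the function names are made up for the example) decodes the same bytes both ways, and also shows why the heuristic usually works: ordinary Latin-1 text with accented letters is rarely valid UTF-8.

```python
def interpretations(raw: bytes):
    """Decode the same bytes as UTF-8 and as Latin-1."""
    return raw.decode("utf-8"), raw.decode("latin-1")


def looks_like_utf8(raw: bytes) -> bool:
    """The heuristic under discussion: valid UTF-8 is assumed to be UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The bytes 70 C3 A6 72 65 are "pære" in UTF-8 and "pÃ¦re" in Latin-1.
ambiguous = "pære".encode("utf-8")
# The Latin-1 bytes of "pære" (70 E6 72 65) are not valid UTF-8.
typical_latin1 = "pære".encode("latin-1")
```

Here looks_like_utf8(ambiguous) is True while looks_like_utf8(typical_latin1) is False, which is the "small chance" argument in concrete form.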

--
Martin Geisler

Mercurial links: http://mercurial.ch/

罗勇刚(Yonggang Luo)

Oct 21, 2011, 10:47:11 AM
to Roger Gammans, mercurial, mercuria...@googlegroups.com


2011/10/21 Roger Gammans <rgam...@computer-surgery.co.uk>

>
> On Fri, Oct 21, 2011 at 06:46:44AM -0700, Andrey wrote:
> > Please note, that the encoding is not the property of the repository but the
> > property of the working directory.
> No.  BOTH they have this property and they might be different.
>
> You can guess or make a stab a the enviromnent encoding. But you can't guess the
> repo encoding as this is the encoding on the environment where the commit
> occured.
>
> So my suggestion was to add a repo encoding property, so that this
> was known on repo which set it. This then means that meaningful
> transcodes between the repo and the local encoding can be done.

That's right, that's what my proposal mentioned

 * The repository’s encoding configuration should be like this:

{{{
old_repository_encoding: for example 'ascii', 'cp1251', 'cp936', 'cp1252', 'utf8', and so on
separator revision number: for example 128
repository_encoding: for example utf8 or ascii. (In practice this should always be utf8. Recording it still makes sense: together with the old encoding it clearly identifies how the old repository was encoded, it does not affect old Mercurial versions, and it is easy for a new Mercurial to handle.)
}}}
This repository_encoding may be what you call the repo encoding.

The following commands change the encoding.
Note that too many encodings in one repository are not recommended; only two should be possible,
the local encoding and utf8. Otherwise things get too complicated. The encoding change also should not
go in a direction such as utf8->ascii or utf8->cp936, because that is not reversible: UTF-8's code points are a superset of those of cp936 and of every other non-Unicode
encoding.

{{{#!cmd
hg encoding upgrade --old ascii --new utf8 --sep 1920
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --new utf8 # The old encoding is the current locale
hg encoding upgrade # The old encoding is the current locale and the new one is utf8
hg encoding --verify # Iterate over the whole repository to verify that each path in each revision is encoded in the configured encoding.
hg encoding upgrade --clear
}}}
>
> And with a '--local-encoding' switch allowing override of the
> 'guess' so the behaviour is a predicatble as possible.
>
> > 3) the old Mercurial versions should work either the same or better (no
> > regression)
>
> Unfortuantely I don't see how that is possible, in my mind we would
> add a .hg/requires entry when the encoding was set, and repo's
> with a encoding set would not be usable on previous versions. I don't see
> how they can reliably be, as they won't 'guarantee' the repo encoding.
>
> TTFN
> --
> Roger.                          Home| http://www.sandman.uklinux.net/
> Master of Peng Shui.      (Ancient oriental art of Penguin Arranging)
> Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
>  New key Fpr: 0F2F E1DF 4CD2 5E7B EF9F  B173 4CFA F143 ADBE 6B00
> _______________________________________________
> Mercurial mailing list
> Merc...@selenic.com
> http://selenic.com/mailman/listinfo/mercurial



罗勇刚(Yonggang Luo)

Oct 21, 2011, 10:54:08 AM10/21/11
to mercuria...@googlegroups.com
2011/10/21 Andrey <py4...@gmail.com>

>
> Please do not narrow down the problem to just 'Unix against Windows'
> 1) the problem is that Mercurial shall work properly when the OS expects a file name with the encoding different than UTF-8 (the encoding is known, it must be the same for all files for the working directory, it must be configurable). Unix can be configured to use other then UTF-8 encoding.

I am focusing on Windows. As you say, Unix may also need an encoding
configuration, but maybe that needs a separate proposal.
Something such as:
if os is windows
  then do something
else if os is unix
  then do something
else
  passthrough.


>
> 2) it should be possible that users use different encodings but contribute to the same repository. For instance, Unix and Windows users work together on the same project.

Maybe you want something like this:
if an encoding is configured: # for all OSes (Unix, Windows, Mac OS)?
   do things the new way
else:
   do things the old way (pass through)

Andrey

Oct 21, 2011, 11:22:04 AM10/21/11
to mercuria...@googlegroups.com, merc...@selenic.com
Since I do not have a clear picture myself, it is not simple to answer the questions...


> Please note, that the encoding is not the property of the repository but the
> property of the working directory.
No.  BOTH they have this property and they might be different.

You can guess or make a stab a the enviromnent encoding. But you can't guess the
repo encoding as this is the encoding on the environment where the commit
occured.


My initial thought was that only UTF-8 shall be the encoding for the repository. (that is why one setting is enough) Now I see that for backwards compatibility we have to keep the information about the file names encoding for the existing repositories (which might be other then UTF-8) to allow old Mercurial clients to continue to work.
Indeed, it must be both.

> 3) the old Mercurial versions should work either the same or better (no
> regression)

Unfortuantely I don't see how that is possible

I also do not see now how it is possible. 

罗勇刚(Yonggang Luo)

Oct 21, 2011, 11:26:42 AM10/21/11
to mercuria...@googlegroups.com, mercurial
>> > 3) the old Mercurial versions should work either the same or better (no
>> > regression)
>>
>
> I also do not see now how it is possible.
From my point of view, the old Mercurial still has the existing problem;
it gets neither better nor worse.
It just works as it does now. Anyway, do not try to use an old Mercurial
with a repository that was created or upgraded by the new Unicode-aware
Mercurial; that would be a bad choice.


--

         此致

罗勇刚
Yours
    sincerely,
Yonggang Luo

Matt Mackall

Oct 21, 2011, 11:30:49 AM10/21/11
to luoyo...@gmail.com, mercurial
On Fri, 2011-10-21 at 20:08 +0800, 罗勇刚(Yonggang Luo) wrote:
> http://mercurial.selenic.com/wiki/UnicodeOnWindows
>
> Wiki page attached for clearly reading.

Please read the wiki style guide, especially this section:

http://mercurial.selenic.com/wiki/WikiStyleGuide#Development_plans_and_other_speculative_pages

--
Mathematics is the supreme nostalgia of our time.

Andrey

Oct 21, 2011, 11:48:29 AM10/21/11
to mercuria...@googlegroups.com, mercurial
What I mean is that UTF-16 encoded text may look like (the same bytes) as the UTF-8 encoded text

Without BOM (byte order mark) we cannot make any conclusions about the content. Do you mean that BOM _is_ stored in the repository ? 

罗勇刚(Yonggang Luo)

Oct 21, 2011, 11:56:21 AM10/21/11
to mercuria...@googlegroups.com, mercurial
2011/10/21 Andrey <py4...@gmail.com>:
A BOM is never stored in the filename; it is only stored at the
beginning of the file's content, so it should not be considered here.
It's off topic.
Also, UTF-16 should not be a filename encoding stored in a Mercurial
repository, because it's not compatible with ASCII.
For example, the space character is represented as the two bytes 0x20
0x00 in LE, and the 0x00 byte should not appear.
UTF-32 is not an option for almost the same reason (4 bytes instead of 2).
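
A short Python 3 check of the NUL-byte argument follows. It is only an illustrative sketch; the file name is made up:

```python
name = "my file.txt"  # hypothetical ASCII-only file name

utf8_bytes = name.encode("utf-8")
utf16_bytes = name.encode("utf-16-le")

# UTF-8 leaves ASCII untouched, so ASCII-only tools see the same bytes.
assert utf8_bytes == name.encode("ascii")

# UTF-16 LE pads every ASCII character with a 0x00 byte.
assert " ".encode("utf-16-le") == b"\x20\x00"
assert b"\x00" in utf16_bytes
assert b"\x00" not in utf8_bytes
```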


--

         此致

罗勇刚
Yours
    sincerely,
Yonggang Luo

Andrey

Oct 21, 2011, 12:08:20 PM10/21/11
to mercuria...@googlegroups.com, mercurial, luoyo...@gmail.com

>
> What I mean is that UTF-16 encoded text may look like (the same bytes) as
> the UTF-8 encoded text
> Without BOM (byte order mark) we cannot make any conclusions about the
> content. Do you mean that BOM _is_ stored in the repository ?
BOM never stored in the filename, it's just stored in the content(the
begging) of the file. so it's should not be considerated here.
It's off topic.
 
BOM is not (only) about files. It is an indicator in the beginning of the byte sequence. In the repository, the manifest file (the file which contains the names of the files in the repository) may contain BOM in the beginning of each file name. But as far as I know, it does not.

Also, UTF-16 should not be a filename encoding stored in Mercurial
repository,  because it's not compatible with ASCII.
for example the space character will be represent as two byte 0x20
0x00 in LE, the 0x00 should not be appeared.


As far as I understand, if you create a file now on Windows, you will get UTF-16 encoded name, because this is how Python 2 gets the name from the OS. That is why the existing repositories do contain UTF-16 encoded names.
 

Mike Meyer

Oct 21, 2011, 12:11:58 PM10/21/11
to merc...@selenic.com
Sigh. Not sure this made it to the list, as it only went to the google groups address, which claims it bounced it. Sorry if you see it twice.


---------- Forwarded message ----------
From: Mike Meyer <m...@mired.org>
Date: Fri, Oct 21, 2011 at 8:50 AM
Subject: Re: A proposal on solve encoding problem on Windows.
To: mercuria...@googlegroups.com



On Fri, Oct 21, 2011 at 1:37 AM, Andrey <py4...@gmail.com> wrote:
>> So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?
> Your local Mercurial client on your Unix box shall be aware which encoding should used for the file

Sure, if you change the Unix clients as well, then this isn't a problem. But the proposal on the table - and the one I was responding to, which you failed to quote - was that only the Windows behavior be fixed, and that the behavior on Linux systems *not* change. That behavior is not encoding aware, which is why the answer to this question is of interest.

I can't see this proposal going forward until this question is answered.

   <mike
 

罗勇刚(Yonggang Luo)

Oct 21, 2011, 12:12:27 PM10/21/11
to mercuria...@googlegroups.com
2011/10/22 Andrey <py4...@gmail.com>:

>>
>> >
>> > What I mean is that UTF-16 encoded text may look like (the same bytes)
>> > as
>> > the UTF-8 encoded text
>> > Without BOM (byte order mark) we cannot make any conclusions about the
>> > content. Do you mean that BOM _is_ stored in the repository ?
>> BOM never stored in the filename, it's just stored in the content(the
>> begging) of the file. so it's should not be considerated here.
>> It's off topic.
>
>
> BOM is not (only) about files. It is an indicator in the beginning of the
That's right, but

> byte sequence. In the repository, the manifest file (the file which contains
> the names of the files in the repository) may contain BOM in the beginning
> of each file name. But as far as I know, it does not.
So a BOM still won't be added to it, right?

>>
>> Also, UTF-16 should not be a filename encoding stored in Mercurial
>> repository,  because it's not compatible with ASCII.
>> for example the space character will be represent as two byte 0x20
>> 0x00 in LE, the 0x00 should not be appeared.
>>
> As far as I understand, if you create a file now on Windows, you will get
> UTF-16 encoded name, because this is how Python 2 gets the name from the OS.
> That is why the existing repositories do contain UTF-16 encoded names.

That's impossible; I don't know how that could happen. In fact, the name
that Python 2 gets from the OS is always encoded in the local encoding.
Those APIs force a conversion of the UTF-16 encoded names to the local
encoding.
For example, my OS uses CP936, so when I call the API from Python 2,
the filename the function returns is encoded in CP936, not in UTF-16.
This needs to be explained.

罗勇刚(Yonggang Luo)

Oct 21, 2011, 12:19:25 PM10/21/11
to Mike Meyer, mercuria...@googlegroups.com, mercurial
2011/10/22 Mike Meyer <m...@mired.org>:
Which question? Do you mean:

So what happens when I create and check in a file that uses a non-ascii
file name encoded in something other than utf8 on a Unix box and you try and
check it out on your windows box?
If that is the question, then the answer is:
refer to http://mercurial.selenic.com/wiki/UnicodeOnWindows
1. First, you know which encoding you used on the Unix box. Suppose the
encoding is cp936.
2. On Windows, execute the command
hg encoding upgrade --old cp936 --new cp936
Then it will work fine for you on Windows.
>    <mike

>
>

--

         此致

罗勇刚
Yours
    sincerely,
Yonggang Luo

罗勇刚(Yonggang Luo)

Oct 21, 2011, 12:35:11 PM10/21/11
to Mike Meyer, mercuria...@googlegroups.com


2011/10/22 Mike Meyer <m...@mired.org>:
> On Fri, Oct 21, 2011 at 9:19 AM, 罗勇刚(Yonggang Luo)
> <luoyo...@gmail.com> wrote:
>>
>> 2011/10/22 Mike Meyer <m...@mired.org>:

>> > On Fri, Oct 21, 2011 at 1:37 AM, Andrey <py4...@gmail.com> wrote:
>> >>> So what happens when I create and check in a file that uses a non-ascii
>> >>> file name encoded in something other than utf8 on a Unix box and you try and
>> >>> check it out on your windows box?
>> Which question? do you means
>> So what happens when I create and check in a file that uses a non-ascii
>> file name encoded in something other than utf8 on a Unix box and you try and
>> check it out on your windows box?
>
> Yes, that's the question.

>
>> If is this question, then the answer is:
>> refer to http://mercurial.selenic.com/wiki/UnicodeOnWindows
>> 1, first, you know which encoding you used under Unix box. suppose the
>> encoding is cp936.
>> 2. On windows, execute the command
>> hg encoding upgrade --old cp936 --new cp936
>> Then you will working fine with it on window
>
>  From reading the wiki, upgrade is for upgrading an *old* repository
> from whatever it's encoding is (which of course wrongly assumes that
> all the files in an old repository have the same encoding). I'm not
> asking about that case, I'm asking about the case where a *new*
> repository is used on a system that is not Windows and not utf8 which,
> according to the wiki page, "won't be affected".
Of course it's not affected; when creating a new repository you can still use the new Mercurial in the old way on Windows.

Notes

1. Everything happens under Windows.

2. Everything is decided up front, so we need to guess nothing. A single upgrade instruction is needed once; after that, everything is handled automatically.
3. Existing repos will not be affected at all unless the user executes the hg encoding instruction to upgrade the repository.
4. Newly created repositories default to utf8, but we can still supply an option to create a repository in the old way.
>
>   <mike

Mike Meyer

Oct 21, 2011, 12:46:43 PM10/21/11
to luoyo...@gmail.com, merc...@selenic.com
On Fri, Oct 21, 2011 at 9:35 AM, 罗勇刚(Yonggang Luo)

I'm sorry, but I don't see how this answers the question. Could you
provide details? Or maybe I should. Here's the scenario:

- I create a new repository on Windows, and start checking in files.
- I clone it to a Unix system, and check in files with non-utf8,
non-ascii file names.
- I pull changes back to the Unix system and update.

So - what becomes of those non-utf8, non-ascii file names on my Windows box?

> Notes
>
> 1. Evertying happened under windows.

No, it doesn't. That's part of the problem. Mercurial is a
cross-platform tool with unix leanings. Any change has to both not
break things on Unix, *and* not break things for people who are using
it on both Windows and Unix.

Matt Mackall

Oct 21, 2011, 12:55:58 PM10/21/11
to Mike Meyer, merc...@selenic.com

The same as happens today.

The only vaguely reasonable approaches for doing cross-platform work
are:

a) ASCII only (works perfectly today)
b) UTF-8 on Unix, UTF-16 on Windows, and be sure you avoid the 'makefile
problem'

Anything else is doomed to failure so I'm only interested in adding
support for (b).

--
Mathematics is the supreme nostalgia of our time.

Matt Mackall

Oct 21, 2011, 12:56:39 PM10/21/11
to mercuria...@googlegroups.com, mercurial
On Fri, 2011-10-21 at 08:48 -0700, Andrey wrote:
> >
> > > The most important goal for me was actually this: 2 (3?). use utf8 as
> > > the default encoding for new commits.
> > >
> > > Now I see (thanks, Matt), that it may introduce serious regression
> > > problems. I need some time to think about a possible solution.
> > >
> > >> if all files in manifest are valid UTF-8:
> > >> # repo is already in UTF-8 mode or is pure ASCII
> > >> mode = utf8transcoding
> > >
> > > This check is just a guess. We cannot rely on it. In general, it is
> > > not possible to detect the encoding from the sequence of bytes.
> >
> > You're right in principle: a Latin-1 encoded text with "pÃ¦re" also
> > happens to be the UTF-8 encoding of "pære". However, what Matt writes is
> > that the chance of that happening is small, and so it is okay with him to
> > declare a text to be UTF-8 if it can be correctly decoded as such.
> >
> > --
> > Martin Geisler
> >
> >
> > What I mean is that UTF-16 encoded text may look like (the same bytes) as
> the UTF-8 encoded text

This is incorrect.

>>> u = u'Here is a string. Български'
>>> u.encode('utf-8')
'Here is a string. \xd0\x91\xd1\x8a\xd0\xbb\xd0\xb3\xd0\xb0\xd1\x80\xd1\x81\xd0\xba\xd0\xb8'
>>> u.encode('utf-16')
'\xff\xfeH\x00e\x00r\x00e\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00.\x00 \x00\x11\x04J\x04;\x043\x040\x04@\x04A\x04:\x048\x04'

UTF-8 has a couple key properties:

- if a byte looks like ASCII, it is ASCII
- the first byte of a multibyte character is always of the form 11xxxxxx
- the first byte encodes the length of the character in bytes
- the second and later bytes are of the form 10xxxxxx

..which means that it's very easy to recognize properly encoded UTF-8
from even a small sample with very high probability. Here's a quick
program that generates 50k random byte strings of each length and
reports what percentage are valid UTF-8:

$ python uc.py
1 50.19400000%
2 28.46400000%
3 15.50600000%
4 8.90000000%
5 5.05600000%
6 2.90200000%
7 1.56200000%
8 0.96000000%
9 0.46400000%
10 0.28000000%
11 0.13400000%
12 0.08000000%
13 0.05200000%
14 0.03000000%
15 0.02600000%
16 0.01200000%
17 0.00400000%
18 0.00200000%
19 0.00000000%
20 0.00600000%

Given our manifest will be more like 10k - 1MB rather than just 20
bytes, the odds of getting confused here are really quite negligible.
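
The uc.py script itself was not posted. A program of that kind could look roughly like the following Python 3 sketch; the function names are hypothetical, and the exact percentages will vary from run to run and may differ slightly from the table above because Python 3's strict decoder also rejects surrogates:

```python
import os


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_fraction(length: int, samples: int = 50_000) -> float:
    """Fraction of random byte strings of this length that are valid UTF-8."""
    hits = sum(is_valid_utf8(os.urandom(length)) for _ in range(samples))
    return hits / samples


if __name__ == "__main__":
    for length in range(1, 21):
        print(f"{length} {valid_fraction(length) * 100:.8f}%")
```

For length 1 the result should hover around 50%, since exactly the 128 ASCII byte values are valid single-byte UTF-8.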

Also, as it's impossible to store UTF-16 in Mercurial's -manifest-
(where we store filenames) due to the presence of NUL bytes, there's no
chance of confusion.

And elsewhere:

> As far as I understand, if you create a file now on Windows, you will get
> UTF-16 encoded name, because this is how Python 2 gets the name from the OS.
> That is why the existing repositories do contain UTF-16 encoded names.

Also wrong. What actually happens is that UTF-16 gets decoded into the
8-bit "filesystem encoding" when using the standard C interfaces that
Python 2 wraps. This encoding may be different from the console encoding
or the GUI encoding. And the console and GUI encodings generally don't
agree anyway.
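
As a rough way to see these separate encodings on a given machine, the following Python 3 sketch prints them side by side. The helper name is made up, and note that this is Python 3, where the Windows filesystem encoding has been UTF-8 by default since 3.6, whereas Python 2 reported "mbcs" there:

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings Python reports for this process."""
    return {
        # Used to convert file names between str and bytes.
        "filesystem": sys.getfilesystemencoding(),
        # The locale's preferred encoding (the ANSI code page on Windows).
        "locale": locale.getpreferredencoding(False),
        # What the console or pipe attached to stdout expects, if known.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in report_encodings().items():
        print(f"{key}: {value}")
```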

罗勇刚(Yonggang Luo)

Oct 21, 2011, 12:58:45 PM10/21/11
to Mike Meyer, mercurial
I create a new repository on Windows with encoding cp936, and start checking in files.
I clone it to a Unix system, and check in files with non-utf8,
non-ascii file names; suppose your encoding there is cp936.

I pull changes back to the Unix system and update.

So, things work. Anyway, there must be no more than TWO different encodings in the same repository.

 
>
> So - what becomes of those non-utf8, non-ascii file names on my Windows box?
>
> > Notes
> >
> > 1. Evertying happened under windows. 
>
>  
>
> No, it doesn't. That's part of the problem. Mercurial is a
> cross-platform tool with unix leanings. Any change has to both not
> break things on Unix, *and* not break things for people who are using
> it on both Windows and Unix.

Er, only NEW repositories are affected. Please don't insist again and again.

Mike Meyer

Oct 21, 2011, 1:02:58 PM10/21/11
to Matt Mackall, merc...@selenic.com
On Fri, Oct 21, 2011 at 9:55 AM, Matt Mackall <m...@selenic.com> wrote:
> On Fri, 2011-10-21 at 09:46 -0700, Mike Meyer wrote:

>> - I create a new repository on Windows, and start checking in files.
>> - I clone it to a Unix system, and check in files with non-utf8,
>> non-ascii file names.
>> - I pull changes back to the Unix system and update.
>>
>> So - what becomes of those non-utf8, non-ascii file names on my Windows box?
>
> The same as happens today.

In that case, I'm missing something in either the proposed change or
what happens today. I expect that today, the bytes would be the same
on both ends, so you'd get a gibberish display but avoid the makefile
problem. With the proposed changes, the files would be decoded as if
they were utf8, which would mean that you either get a gibberish
display *and* the makefile problem, or the decoding fails and you get
????. Could you tell me where I went wrong?
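
The two failure modes described here can be reproduced with a small Python 3 sketch. The file name and the choice of cp936 and cp1252 are only assumptions for illustration:

```python
name = "简体.txt"
raw = name.encode("cp936")  # bytes a cp936 system would store for this name

# Case 1: strict decoding as UTF-8 fails, because 0xbc cannot start a character.
try:
    as_utf8 = raw.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None

# Case 2: decoding with the wrong 8-bit code page succeeds but shows gibberish.
mojibake = raw.decode("cp1252")

print(repr(as_utf8), repr(mojibake))
```

Only decoding with the encoding that was actually used at commit time gives the name back, which is the information the thread is arguing about where to store.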

<mike

罗勇刚(Yonggang Luo)

Oct 21, 2011, 1:03:17 PM10/21/11
to Matt Mackall, mercurial


2011/10/22 Matt Mackall <m...@selenic.com>
Yes, detecting UTF-8 is fairly easy, but after migrating from an old repo to a UTF-8 repo
the detection becomes meaningless, because the same repository then has two different encodings: the old one (such as CP936, cp1251, and so on) along with UTF-8.
Also, as it's impossible to store UTF-16 in Mercurial's -manifest-
(where we store filenames) due to the presence of NUL bytes, there's no
chance of confusion.

And elsewhere:

> As far as I understand, if you create a file now on Windows, you will get
> UTF-16 encoded name, because this is how Python 2 gets the name from the OS.
> That is why the existing repositories do contain UTF-16 encoded names.

Also wrong. What actually happens is that UTF-16 gets decoded into the
8-bit "filesystem encoding" when using the standard C interfaces that
Python 2 wraps. This encoding may be different from the console encoding
or the GUI encoding. And the console and GUI encodings generally don't
agree anyway.

--
Mathematics is the supreme nostalgia of our time.



罗勇刚(Yonggang Luo)

Oct 21, 2011, 1:06:18 PM10/21/11
to Matt Mackall, mercurial, mercuria...@googlegroups.com


2011/10/22 Matt Mackall <m...@selenic.com>
That's definitely right: for cross-platform work, ASCII is the limited but easiest way to do that, and UTF-8 is the only solution for
international support :)
 

Anything else is doomed to failure so I'm only interested in adding
support for (b).

--
Mathematics is the supreme nostalgia of our time.


Mike Meyer

Oct 21, 2011, 1:07:37 PM10/21/11
to luoyo...@gmail.com, mercurial
On Fri, Oct 21, 2011 at 9:58 AM, 罗勇刚(Yonggang Luo)

So you're saying that if I'm working with something other than utf8 on
Unix, I *have* to create the repository on windows with the encoding I
want to use on Unix?


>> > 1. Evertying happened under windows.
>> No, it doesn't. That's part of the problem. Mercurial is a
>> cross-platform tool with unix leanings. Any change has to both not
>> break things on Unix, *and* not break things for people who are using
>> it on both Windows and Unix.
> Er, Only those NEW repository affected. please don't insist again and again.

Right. I'm *asking* about new repositories. Please don't answer the
wrong question again and again.

罗勇刚(Yonggang Luo)

Oct 21, 2011, 1:09:35 PM10/21/11
to Mike Meyer, mercuria...@googlegroups.com


2011/10/22 Mike Meyer <m...@mired.org>

On Fri, Oct 21, 2011 at 9:55 AM, Matt Mackall <m...@selenic.com> wrote:
> On Fri, 2011-10-21 at 09:46 -0700, Mike Meyer wrote:

>> - I create a new repository on Windows, and start checking in files.
>> - I clone it to a Unix system, and check in files with non-utf8,
>> non-ascii file names.
>> - I pull changes back to the Unix system and update.
>>
>> So - what becomes of those non-utf8, non-ascii file names on my Windows box?
>
> The same as happens today.

In that case, I'm missing something in either the proposed change or
what happens today. I expect that today, the bytes would be the same
on both ends, so you'd get a gibberish display but avoid the makefile
problem. With the proposed changes, the files would be decoded as if
they were utf8, which would mean that you either get a gibberish
That depends on you: if you choose to use utf8, then utf8 works; if you choose
another encoding, that also works.
display *and* the makefile problem, or the decoding fails and you get
????. Could you tell me where I went wrong?

  <mike

罗勇刚(Yonggang Luo)

Oct 21, 2011, 1:11:50 PM10/21/11
to Mike Meyer, mercurial, mercuria...@googlegroups.com
Yes, it's an option, but you are not limited to it. You can also do something like this:

1. First, you create a new repository under Unix with Mercurial 1.8; the encoding is cp936.
2. On Windows, using the new Mercurial, execute the following command:

hg encoding upgrade --old cp936 --new cp936
Then it will work fine for you on Windows.


>> > 1. Evertying happened under windows.
>> No, it doesn't. That's part of the problem. Mercurial is a
>> cross-platform tool with unix leanings. Any change has to both not
>> break things on Unix, *and* not break things for people who are using
>> it on both Windows and Unix.
> Er, Only those NEW repository affected. please don't insist again and again.

Right. I'm *asking* about new repositories. Please don't answer the
wrong question again and again.

  <mike

Mike Meyer

Oct 21, 2011, 1:24:53 PM10/21/11
to luoyo...@gmail.com, mercurial
On Fri, Oct 21, 2011 at 10:11 AM, 罗勇刚(Yonggang Luo)

Is it going to be a requirement that I *not* upgrade mercurial on Unix
if I want to use something other than utf8 or ascii? That doesn't seem
right, as the proposal says that mercurials behavior doesn't change on
non-Windows systems. So I ought to get the same thing using the new
unicode-aware-on-windows version as I do with 1.8.

罗勇刚(Yonggang Luo)

Oct 21, 2011, 1:28:19 PM10/21/11
to Mike Meyer, mercurial
Sorry for insisting. I only gave an example to show that an earlier Mercurial version is also not affected.
The new version works the same way.

 <mike

Mike Meyer

Oct 21, 2011, 2:06:18 PM10/21/11
to luoyo...@gmail.com, mercurial
On Fri, Oct 21, 2011 at 10:28 AM, 罗勇刚(Yonggang Luo)

So the new version of Mercurial will create an "old-style" repository
on Unix which you then upgrade? Fair enough. The requirement seems to
be that 1) everyone uses the same repo encoding, and 2) you have to
know that encoding the first time you use the repo on Windows
(either upgrading to that encoding if you cloned it, or creating it
with that encoding).

Pierre Asselin

Oct 21, 2011, 9:15:36 PM10/21/11
to merc...@selenic.com
Roger Gammans <rgam...@computer-surgery.co.uk> wrote:

> You can guess or make a stab a the enviromnent encoding. But you can't guess the
> repo encoding as this is the encoding on the environment where the commit
> occured.

I don't see why the repo needs an encoding (where by "repo"
I mean the stuff under .hg).

The repository is private Mercurial data. It is Mercurial, not
the OS, that decides what bytes get stored in the manifest. If
Mercurial decides that Windows file names are serialized as UTF-8,
then that's that.

The filenames under .hg/store are already mangled to ASCII
and hopefully the OS won't sabotage that.


> So my suggestion was to add a repo encoding property, so that this
> was known on repo which set it. This then means that meaningful
> transcodes between the repo and the local encoding can be done.

You only need a working copy encoding for that. If somebody
wants to clone a repo under codepage-whatever only the working
copy is affected (with an error if one of the Unicode characters
won't map).
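
A minimal Python 3 sketch of that idea follows, assuming the manifest stores UTF-8 and only the working copy has a configurable encoding. The function name is hypothetical and this is not Mercurial's actual code:

```python
def to_working_copy(stored_name: bytes, wc_encoding: str) -> bytes:
    """Transcode a UTF-8 name from the manifest into the working copy encoding.

    Raises UnicodeEncodeError if a character has no mapping in wc_encoding.
    """
    return stored_name.decode("utf-8").encode(wc_encoding)


stored = "дятел.txt".encode("utf-8")

# A Cyrillic name maps cleanly into cp1251 ...
print(to_working_copy(stored, "cp1251"))

# ... but a Korean name does not, so checkout would stop with an error.
try:
    to_working_copy("중국어.txt".encode("utf-8"), "cp1251")
except UnicodeEncodeError as exc:
    print("cannot map:", exc.reason)
```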

--
pa at panix dot com

Pierre Asselin

Oct 21, 2011, 9:20:20 PM10/21/11
to merc...@selenic.com
罗勇刚(Yonggang Luo) <luoyo...@gmail.com> wrote:


> * The repository's encoding configuration should be like this:
> {{{
> old_repository_encoding: such as 'ascii' 'cp1251' cp936 cp1252 utf8 and so
> on
> separator revision number: such as 128

One boundary revision is not enough. You can't guarantee that all
the old-encoding changesets come before the first new-encoding
changeset. Mercurial is distributed. Even in an all-Windows
environment, after multiple cloning-pulling-pushing between multiple
developers the changeset order will vary internally among the clones
--even after everybody syncs up.

I'm not sure you need a repository encoding though, except
possibly as a transitional measure.

Andrey

Oct 21, 2011, 10:39:19 PM10/21/11
to mercuria...@googlegroups.com, merc...@selenic.com

I don't see why the repo needs an encoding (where by "repo"
I mean the stuff under .hg).


The same file name is encoded differently for different platforms. For instance 'дятел.txt' cannot be  exchanged between Unix and Windows.

-
Andrey

罗勇刚(Yonggang Luo)

Oct 21, 2011, 11:37:20 PM10/21/11
to Pierre Asselin, mercurial, mercuria...@googlegroups.com


On Saturday, October 22, 2011, Pierre Asselin wrote:
罗勇刚(Yonggang Luo)  <luoyo...@gmail.com> wrote:


>  * The repository's encoding configuration should be like this:
> {{{
> old_repository_encoding: such as 'ascii' 'cp1251' cp936 cp1252 utf8 and so
> on
> separator revision number: such as 128

One boundary revision is not enough.  You can't guarantee that all
the old-encoding changesets come before the first new-encoding
changeset.  Mercurial is distributed.  Even in an all-Windows
environment, after multiple cloning-pulling-pushing between multiple
developers the changeset order will vary internally among the clones
--even after everybody syncs up.

That's what I am worried about, and I have no idea how to solve it. Does anyone have a better suggestion?
Anyway, this is definitely necessary if we want to migrate from an old repository to a new one in the simplest way.
Another method is to convert all of the old repository's paths to the new encoding (UTF-8) and recommit them. I don't know whether such a
modification would disturb something, for example by changing the hashes (this is important).
A second method is to set a ctx.extra['encoding'] property, so that when you commit a revision, the encoding information
is attached as well. When we work with a ctx, the following scheme applies:

if 'encoding' in ctx.extra():
   ctx_encoding = ctx.extra()['encoding']
else:
   ctx_encoding = old repository encoding


I'm not sure you need a repository encoding though, except
possibly as a transitional measure.


--
pa at panix dot com



罗勇刚(Yonggang Luo)

Oct 22, 2011, 12:03:03 AM10/22/11
to mercurial, mercuria...@googlegroups.com
Wiki updated to a new revision.
The main open questions are in
== Per-repository encoding configuration needed ==
and
=== Other choice about encoding configuration ===
Which one should we use, or is there a better way to configure the
encoding properly? Anyway, I think a clear encoding configuration for each repository is needed, so
that we don't need to guess the repository's encoding.

<!> This page is intended for developers.
<!> This is a proposed feature, last updated Oct 21, 2011.

== Things need to be done ==
 *1. Pure-ASCII repositories won't be affected. This is easy.
 *2. No OS except Windows will be affected; use sys.platform == 'win32' to ensure that.
 *3. Use utf8 as the default encoding for new commits.
 *4. Support messed-up old Mercurial repositories; add a new --encoding option to do that.
 *5. Support automatic renaming from old messed-up filenames to new utf8 filenames.


== Per-repository encoding configuration needed ==
We need a per-repository configuration (per-user is not a good idea, because the encoding belongs to the repository, and different repositories will have different encodings).
The repository's encoding configuration should look like this:
  *1. old_repository_encoding
      For example 'ascii', 'cp1251', cp936, cp1252, utf8, and so on. This is the encoding of the old repository.
  *2. repository_encoding
      For example utf8 or ascii.
      In most cases it's utf8. But some users don't want to use utf8, so it can still be another encoding, such as cp936, cp1251, or cp1252. Those users won't feel lost with the new Mercurial and can still do everything the old way.
  *3. separator_revision, >=0
      For example 0, 128, 312. This may not be a good choice; please refer to [[Other choice about encoding configuration]].
      0 means the repository contains only one encoding, so [0, tip] is encoded in repository_encoding.
      Any other value means [0, separator_revision) is encoded in old_repository_encoding and [separator_revision, tip] is encoded in repository_encoding.

We set three parameters because migrating from an old repository to a new one raises a problem: how do we check out old history, for example an old tag or an old branch? When checking out those revisions we need to work the old way.

=== Other choice about encoding configuration ===
  *1. Convert all paths in the old repository to the new encoding (UTF8) and recommit them. I don't know what such a modification would disturb; for example, the hash of each ctx would change (this is important).
  *2. Set a ctx.extra['encoding'] property, so that the encoding information is attached when a revision is committed. Then, when working with a ctx, the following scheme applies.{{{#!python
if ctx is working directory:
  ctx_encoding = repository_encoding # Such as UTF8
elif 'encoding' in ctx.extra():
  ctx_encoding = ctx.extra()['encoding']
else:
  ctx_encoding = old_repository_encoding
}}}
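A minimal Python sketch of the second choice. The function name `changeset_encoding`, its parameters and the default encodings are assumptions made for illustration, and `extra` is a plain dict standing in for what ctx.extra() returns; this is not existing Mercurial code.

```python
def changeset_encoding(extra, is_working_directory,
                       repository_encoding="utf-8",
                       old_repository_encoding="cp936"):
    """Pick the filename encoding for one changeset.

    extra: dict standing in for ctx.extra() (hypothetical stand-in).
    is_working_directory: True when the context is the working directory.
    """
    if is_working_directory:
        # New commits always use the repository encoding.
        return repository_encoding
    # Changesets that recorded their encoding keep it; older ones
    # fall back to the repository-wide old encoding.
    return extra.get("encoding", old_repository_encoding)


print(changeset_encoding({}, True))
print(changeset_encoding({"encoding": "utf-8"}, False))
print(changeset_encoding({}, False))
```

Because the encoding travels with each changeset, this variant does not depend on a single separator revision.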



== Schema for Windows only ==
{{{#!python
if windows:
  if there is configuration about encoding:
    rev = current revision
    if rev is working directory:
      '''
      The working directory's encoding may differ from its parent's encoding.
      When that happens, we should automatically handle the renaming of the
      messed-up filenames.
      '''
      mode = repository_encoding, parent_encoding
    elif rev < separator revision:
      mode = old_repository_encoding
    else:
      mode = repository_encoding
  else:
    mode = passthrough
else:
  mode = passthrough
}}}
== Schema for all OSes ==
{{{#!python
if there is configuration about encoding:
  rev = current revision
  if rev is working directory:
    '''
    The working directory's encoding may differ from its parent's encoding.
    When that happens, we should automatically handle the renaming of the
    messed-up filenames.
    '''
    mode = repository_encoding, parent_encoding
  elif rev < separator revision:
    mode = old_repository_encoding
  else:
    mode = repository_encoding
else:
  mode = passthrough
}}}
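The schema can be sketched as a small Python function. The name `filename_mode`, the dict layout of `config` and the use of `None` for the working directory are assumptions for illustration. The sketch follows the half-open interval [0, separator_revision) defined earlier, and it leaves out the extra comparison against the parent's encoding for the working directory.

```python
PASSTHROUGH = "passthrough"


def filename_mode(rev, config):
    """Return the encoding mode for a revision.

    rev: integer revision number, or None for the working directory.
    config: dict with the three proposed settings, or None when the
            repository has no encoding configuration.
    """
    if config is None:
        return PASSTHROUGH
    if rev is None:
        # Working directory: new commits use the repository encoding.
        return config["repository_encoding"]
    if rev < config["separator_revision"]:
        return config["old_repository_encoding"]
    return config["repository_encoding"]


config = {
    "old_repository_encoding": "cp936",
    "repository_encoding": "utf-8",
    "separator_revision": 128,
}
for rev in (0, 127, 128, None):
    print(rev, filename_mode(rev, config))
```

With separator_revision set to 0 no revision is below the separator, so every revision uses repository_encoding, as the configuration section describes.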
== Upgrade ==
We also need to supply an upgrade tool, so that repositories created with old tools can be upgraded to use utf8 as the repository encoding.
{{{#!cmd
hg encoding upgrade --old ascii --new utf8 --sep 1920
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --new utf8 #The old is the current locale
hg encoding upgrade #The old is the current locale and the new is utf8
hg encoding --verify # Iterate over the whole repository to verify that each path in each revision is encoded in the configured encoding.
hg encoding upgrade --clear
}}}
Explanation of the options
{{{
--old The old repository's original encoding.
--new In most cases utf8. When someone wants to use Mercurial the old way, it can be set to something else (such as ascii, to make sure every committed filename is encoded in ascii).
--sep Sets the separator revision. When not specified, it is the newest revision; if the old encoding is ascii, the separator revision is 0.
--clear Regenerates the repository with every path in every revision converted to the new encoding (utf8). (Some people may want this.)
}}}
== Notes ==
 *1. Evertying happened under windows.
 *2. Everything is determined explicitly, so nothing needs to be guessed; a single upgrade command is needed once, and after that everything is handled automatically.
 *3. Existing repositories are not affected at all unless the user runs the hg encoding command to upgrade the repository.
 *4. Newly created repositories default to utf8, but we can still supply an option to create a repository the old way.

== Compatibility ==
    An old Mercurial works as before when the repository has no encoding configuration. If the repository's encoding is the same as the locale encoding, it also works properly, and if the repository contains only ASCII paths (0-0x7F), it works as well. Otherwise a checkout may succeed but will not work properly. In any case, do not use an old Mercurial with a repository that the new, Unicode-aware Mercurial has upgraded or created (that is, one configured with utf8); that is bound to be a bad choice.

== Questions and Answers ==
  *Question
    So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?
  *Answer
    First, you know which encoding you used on the Unix box. Suppose the encoding is cp936. On Windows, execute the following command
   {{{
hg encoding upgrade --old cp936 --new cp936
}}} It only needs to be executed once, while the repository's encoding is still undecided. After that you will be able to work with the repository on Windows.
  *Question
    I don't see why the repo needs an encoding (where by "repo" I mean the stuff under .hg).
  *Answer(By Andrey)
    The same file name is encoded differently for different platforms. For instance 'дятел.txt' cannot be exchanged between Unix and Windows.
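
A short Python sketch of why the bytes differ. It assumes a Windows machine using code page cp1251 and a Unix machine using UTF-8; the variable names are illustrative.

```python
name = "дятел.txt"

as_cp1251 = name.encode("cp1251")  # bytes a cp1251 Windows system would use
as_utf8 = name.encode("utf-8")     # bytes a UTF-8 Unix system would use

print(len(as_cp1251), as_cp1251)
print(len(as_utf8), as_utf8)

# Interpreting the UTF-8 bytes as cp1251 gives a different, garbled name.
misread = as_utf8.decode("cp1251", errors="replace")
print(misread)
```

The two byte strings have different lengths and contents, so a manifest that stores raw bytes from one system does not name the same file on the other.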

== General test case ==
Commit and check out the following files without problems on Windows 2000 and later, with each file's content set to its own filename, encoded in utf8.
{{{
Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt
áéíóúñ.txt 'accented characters'
}}}
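A sketch of how the test files could be generated in Python. The helper names `expected_content` and `write_test_files` are hypothetical, and writing the files assumes the local filesystem can store these names.

```python
import os

TEST_NAMES = [
    "Chinese (Traditional).txt",
    "简体.txt",
    "繁体.txt",
    "중국어 (번체).txt",
    "Chinês (Tradicional).txt",
    "áéíóúñ.txt",
]


def expected_content(name):
    """Each file contains its own filename, encoded in UTF-8."""
    return name.encode("utf-8")


def write_test_files(directory):
    """Create every test file in `directory` and return the paths."""
    paths = []
    for name in TEST_NAMES:
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(expected_content(name))
        paths.append(path)
    return paths
```

After a commit and a fresh checkout, reading each file back and comparing it with `expected_content(name)` would show whether the filename survived the round trip.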
----
CategoryDeveloper CategoryNewFeatures

Pierre Asselin

Oct 22, 2011, 8:25:21 PM10/22/11
to merc...@selenic.com
Andrey <py4...@gmail.com> wrote:

> > I don't see why the repo needs an encoding (where by "repo"
> > I mean the stuff under .hg).
> >

> The same file name is encoded differently for different platforms. For

> instance '?????.txt' cannot be exchanged between Unix and Windows.

Hmf, nor between the list and Usenet. Okay, I can see it on pipermail.

I'm not sure Yonggang Luo is asking for cross-platform interoperability.
Can you create a repo with that file on Windows and clone it
successfully on Windows?

Anyway, Why doesn't it work?

I'm assuming UTF-8 on Unix. Of course you can put byte sequences
that are not UTF-8 in a Unix filename, but that wouldn't be nice
when interoperating with Windows. Just like committing a file
"foo" and a file "Foo".

Under .hg the filename shows up in two places: 1) the manifest, 2)
the mangled filename under .hg/store . In the manifest, Mercurial
copies the bytes retrieved from the filesystem. In the store,
Mercurial mangles any bytes above 127 into an ascii representation.
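
A simplified Python sketch of that kind of mangling. The function name `mangle_store_name` and the `~xx` escape form are assumptions for illustration; the real store encoding also rewrites other characters, which this sketch ignores.

```python
def mangle_store_name(name: bytes) -> str:
    """Escape every byte above 127 as ~xx and keep ASCII bytes unchanged."""
    parts = []
    for byte in name:
        if byte > 127:
            parts.append("~%02x" % byte)
        else:
            parts.append(chr(byte))
    return "".join(parts)


print(mangle_store_name("é.txt".encode("utf-8")))
print(mangle_store_name(b"foo.txt"))
```

The result is pure ASCII regardless of which encoding produced the original bytes, which is why the store itself does not need to know the encoding.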

After the Windows pull, what bytes are in the manifest and what filename
shows up under .hg/store ? I would assume the same as Unix, but I
don't have a Windows box handy to verify that. If the file doesn't
show up correctly in the working copy it must be a problem creating
the working copy.

If the bytes are UTF-8 in the manifest and mangled UTF-8 in the store,
"all" Mercurial would have to do is re-encode in Windows' UTF-16
and pass that to the Windows API to create a file with a "Unicode"
filename. Doesn't it do that? I see an extension in the wiki,
http://mercurial.selenic.com/wiki/FixUtf8Extension
is the information on this page current?
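
The re-encoding step can be sketched in Python as follows. The function name `manifest_name_to_utf16` is hypothetical, and the sketch assumes the manifest bytes really are UTF-8; it only shows the byte conversion, not the Windows API call.

```python
def manifest_name_to_utf16(raw: bytes) -> bytes:
    """Decode UTF-8 manifest bytes and re-encode them as UTF-16LE.

    Raises UnicodeDecodeError if the manifest bytes are not valid UTF-8.
    """
    text = raw.decode("utf-8")
    return text.encode("utf-16-le")


stored = "дятел.txt".encode("utf-8")
print(manifest_name_to_utf16(stored).hex())
```

If the manifest holds bytes in some other encoding, the decode step either fails or silently yields the wrong characters, which is where knowing the repository's encoding matters.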

[
Full interoperability is harder because the *contents* of the
files also matters. If a C source file contains UTF-8 for

#include "дятел.h"

will the Windows C preprocessor grok that?
]

Pierre Asselin

Oct 22, 2011, 8:35:33 PM10/22/11
to merc...@selenic.com
罗勇刚(Yonggang Luo) <luoyo...@gmail.com> wrote:
> >
> > One boundary revision is not enough. You can't guarantee that all
> > the old-encoding changesets come before the first new-encoding
> > changeset.
>
> That's what I am worried about, I have no idea about this, is there any
> better suggestion from people have good
> idea on this?

Hmmm, one encoding per changeset, as metadata in the manifest revlog,
plus one repository-wide default for old changesets where the manifest
has no such metadata.

No idea how to handle back-compatibility. Is it just a new word
in .hg/requires ?

====

I have a question. Are you trying to interoperate with Linux
or is the problem entirely within Windows?

罗勇刚(Yonggang Luo)

Oct 22, 2011, 11:51:15 PM10/22/11
to Pierre Asselin, mercurial, mercuria...@googlegroups.com


2011/10/23 Pierre Asselin <p...@panix.com>

> Andrey <py4...@gmail.com> wrote:
>
> > > I don't see why the repo needs an encoding (where by "repo"
> > > I mean the stuff under .hg).
> >
> > The same file name is encoded differently for different platforms. For
> > instance '?????.txt' cannot be exchanged between Unix and Windows.
>
> Hmf, nor between the list and Usenet.  Okay, I can see it on pipermail.
>
> I'm not sure Yonggang Luo is asking for cross-platform interoperability.

Yes, I am, but only Windows is affected.

> Can you create a repo with that file on Windows and clone it
> successfully on Windows?

At present it is impossible to commit those files in one shot on Windows.
That is why I made this proposal.

> Anyway, Why doesn't it work?

Because the code points of a local encoding are limited; it is a SET problem.
cp936 is a SUBSET of UTF8, and cp1251 is also a SUBSET of UTF8,
so you cannot use cp936 to represent all code points of UTF8, and the same goes for cp1251, and so on.

> I'm assuming UTF-8 on Unix.  Of course you can put byte sequences
> that are not UTF-8 in a Unix filename, but that wouldn't be nice
> when interoperating with Windows.  Just like committing a file
> "foo" and a file "Foo".
>
> Under .hg the filename shows up in two places: 1) the manifest, 2)
> the mangled filename under .hg/store .  In the manifest, Mercurial
> copies the bytes retrieved from the filesystem.  In the store,
> Mercurial mangles any bytes above 127 into an ascii representation.
>
> After the Windows pull, what bytes are in the manifest and what filename
> shows up under .hg/store ?  I would assume the same as Unix, but I
> don't have a Windows box handy to verify that.

They are still the same as on Unix.

> If the file doesn't
> show up correctly in the working copy it must be a problem creating
> the working copy.

But the encoding is the problem: the encoding on Unix is UTF8, and at present
Windows doesn't support UTF8, so those files cannot be checked out properly.

> If the bytes are UTF-8 in the manifest and mangled UTF-8 in the store,
> "all" Mercurial would have to do to is re-encode in Windows' UTF-16
> and pass that to the Windows API to create a file with a "Unicode"
> filename.  Doesn't it do that?  I see an extension in the wiki,
> http://mercurial.selenic.com/wiki/FixUtf8Extension
> is the information on this page current?

Yes, this extension is fine, but I think you have already noticed the comment on its page:

    This extension is still in beta, use it at your own risk.

Why is it risky? Because this extension is based on GUESSING the encoding,
so its result cannot be trusted. My proposal deals with this problem by assigning an explicit encoding to each repository,
so that we can work with the repository without RISK.

> [
> Full interoperability is harder because the *contents* of the
> files also matters.  If a C source file contains UTF-8 for
>
>    #include "дятел.h"

Sorry, I am afraid this is off topic. Mercurial only cares about the encoding of the FILENAME, not the CONTENT of the file;
otherwise it would become very, very complicated.

> will the Windows C preprocessor grok that?

That is what the COMPILER should care about :(

> ]

--
pa at panix dot com


罗勇刚(Yonggang Luo)

Oct 22, 2011, 11:52:37 PM10/22/11
to Pierre Asselin, mercurial, mercuria...@googlegroups.com


2011/10/23 Pierre Asselin <p...@panix.com>

> 罗勇刚(Yonggang Luo)  <luoyo...@gmail.com> wrote:
> > >
> > > One boundary revision is not enough.  You can't guarantee that all
> > > the old-encoding changesets come before the first new-encoding
> > > changeset.
> >
> > That's what I am worried about, I have no idea about this, is there any
> > better suggestion from people have good
> > idea on this?
>
> Hmmm, one encoding per changeset, as metadata in the manifest revlog,
> plus one repository-wide default for old changesets where the manifest
> has no such metadata.
>
> No idea how to handle back-compatibility.  Is it just a new word
> in .hg/requires ?

If we want to support UTF8 on Windows, we have to drop backward compatibility; if UTF8 is not
mandatory, backward compatibility can be maintained.

> ====
>
> I have a question.  Are you trying to interoperate with Linux
> or is the problem entirely within Windows?



--
pa at panix dot com


Dennis Brakhane

Oct 24, 2011, 2:29:13 PM10/24/11
to luoyo...@gmail.com, mercuria...@googlegroups.com, mercurial, Pierre Asselin
On Sun, Oct 23, 2011 at 5:51 AM, 罗勇刚(Yonggang Luo)
<luoyo...@gmail.com> wrote:
> Because the local encoding codepont is limited, that's a SET problem
> cp936 is a SUBSET of UTF8, cp1251 is also the SUBSET of UTF8.
> so you can not use cp936 representing all codepoint of UTF8, so is cp1251,
> and so on.

I'm pretty sure you mean Unicode, not UTF-8. UTF-8 is (the most used)
encoding scheme for Unicode characters.
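
The distinction can be checked with a few lines of Python. The helper name `representable` is hypothetical; the codec names are the ones Python's standard library uses.

```python
def representable(name, codec):
    """Return True if every character of `name` can be encoded with `codec`."""
    try:
        name.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for codec in ("cp936", "cp1251", "utf-8"):
    print(codec,
          representable("简体.txt", codec),
          representable("중국어 (번체).txt", codec))
```

Each code page covers only part of the Unicode character set, while UTF-8 can encode all of it, so the limitation lies in the character repertoire of the code page rather than in any relationship to UTF-8 itself.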

罗勇刚(Yonggang Luo)

Oct 24, 2011, 10:49:57 PM10/24/11
to Dennis Brakhane, mercurial, mercuria...@googlegroups.com


2011/10/25 Dennis Brakhane <brak...@googlemail.com>

On Sun, Oct 23, 2011 at 5:51 AM, 罗勇刚(Yonggang Luo)
<luoyo...@gmail.com> wrote:
> Because the local encoding codepont is limited, that's a SET problem
> cp936 is a SUBSET of UTF8, cp1251 is also the SUBSET of UTF8.
> so you can not use cp936 representing all codepoint of UTF8, so is cp1251,
> and so on.

I'm pretty sure you mean Unicode, not UTF-8. UTF-8 is (the most used)
encoding scheme for Unicode characters.

UTF8 is a part of Unicode, and UTF8 is the only choice for Mercurial.

Andrey

Oct 25, 2011, 4:21:54 AM10/25/11
to mercuria...@googlegroups.com, Dennis Brakhane, mercurial, luoyo...@gmail.com




> > I'm pretty sure you mean Unicode, not UTF-8. UTF-8 is (the most used)
> > encoding scheme for Unicode characters.
>
> UTF8 is a part of Unicode, and UTF8 is the only choice for Mercurial.


Yonggang,
UTF-8 is just the encoding scheme for Unicode characters. There are many ways to encode the same Unicode character set.

I see now that this topic causes so much misunderstanding that it is hardly possible to make any progress...

-
Andrey
 

罗勇刚(Yonggang Luo)

Oct 25, 2011, 4:55:18 AM10/25/11
to mercuria...@googlegroups.com
I don't know why you insist that one character can be encoded in multiple
code points; what does that have to do with this proposal? Can you give
me a simple example to show what you mean?

2011/10/25, Andrey <py4...@gmail.com>:

--
Sent from my mobile device

Andrey

Oct 25, 2011, 9:16:50 AM10/25/11
to mercuria...@googlegroups.com


> I don't know why you insist that one character can be encoded in multiple
> code points; what does that have to do with this proposal? Can you give
> me a simple example to show what you mean?


The 'A' Unicode character will occupy:
1 byte in UTF-8 encoding scheme
2 bytes in UTF-16BE
2 bytes in UTF-16LE 
4 bytes in UTF-32BE
4 bytes in UTF-32LE

As you can see, 5 different byte sequences lead to the same character. 
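
The same list can be reproduced in Python. The codec names are those of Python's standard library; `encoded_forms` is an illustrative helper name.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encoded_forms(character):
    """Map each encoding scheme to the bytes it produces for `character`."""
    return {name: character.encode(name) for name in ENCODINGS}


forms = encoded_forms("A")
for name, raw in forms.items():
    print(name, len(raw), raw.hex())
```

The character is one Unicode code point throughout; only the byte sequence chosen to store it changes.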

-
Andrey

罗勇刚(Yonggang Luo)

Oct 25, 2011, 9:23:00 AM10/25/11
to mercuria...@googlegroups.com


2011/10/25 Andrey <py4...@gmail.com>
I am really confused by your statement. Can anyone give a clear explanation?

-
Andrey



--

Andrey

Oct 25, 2011, 10:25:21 AM10/25/11
to mercuria...@googlegroups.com



> > The 'A' Unicode character will occupy:
> > 1 byte in UTF-8 encoding scheme
> > 2 bytes in UTF-16BE
> > 2 bytes in UTF-16LE
> > 4 bytes in UTF-32BE
> > 4 bytes in UTF-32LE
> >
> > As you can see, 5 different byte sequences lead to the same character.
>
> I am really confused by your statement. Can anyone give a clear explanation?



I think the FAQ link gives a very good explanation:

-
Andrey

罗勇刚(Yonggang Luo)

Oct 25, 2011, 10:29:52 AM10/25/11
to mercuria...@googlegroups.com
> I think the FAQ link gives a very good explanation:

I mean that this is not related to Mercurial's support for Unicode on Windows.
Indeed, why UTF8 is the only choice was explained before.

Jouni Airaksinen

Oct 26, 2011, 3:39:15 AM10/26/11
to merc...@selenic.com
I've been following this conversation and it seems to me that people are
trying to overengineer this?

Why can't old repositories work as-is and new repositories could be
created to use storing in UTF-8 regardless of platform the operations are
done? Platform would talk with it's own file system with encoding it
needs. If you need to convert old repository, that's why we have convert
tool. Converting to new repository UTF-8 format of course means older
clients will have trouble (Linux clients would work fine as they always
been UTF-8). The UTF-8 requirement would just be indicated in the requires
file like the repo formats normally are. Doesn't older clients warn about
missing requirements anyhow?

I understand there are problems converting certain characters, i.e. from UTF-8
to target encodings (Windows or Mac), but how does it differ from the current
situation? It doesn't. We all get much better support for many characters
such as the umlauts (common in the Finnish language, and people tend to use
them in documentation file names). If you really have characters which
cannot work at all between systems then just don't use them. Right now
it's basically only A-Z so any improvement would be good. Or what would
English-speaking people think if there was a version control system which couldn't
store, say, the letter Q because someone thinks it's a useless character for
him/her as it's so rarely used. Let people use O instead if they need it?

Anyway... for me umlauts supported across Windows, Linux and Mac would be
great. Right now we have few repositories which need to be cloned in
Windows VM under Mac if you want to use the repositories. Not very ideal.

Or at least have possibility to add crossplatform character constraint
requires to repositories to not allow commits with "invalid" characters in
filenames, if utf-8 is out of ouestion.. (pun intended).

-jouni


ps. what's the difference between mercuria...@googlegroups.com and merc...@selenic.com?

Martin Geisler

Oct 26, 2011, 4:20:39 AM10/26/11
to Jouni Airaksinen, merc...@selenic.com
Jouni Airaksinen <Jouni.Ai...@descom.fi> writes:

> I've been following this conversation and it seems to me that people
> are trying to overengineer this?

Yes, and there's a lot of confused statements too in this thread.

> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations
> are done? Platform would talk with it's own file system with encoding
> it needs. If you need to convert old repository, that's why we have
> convert tool. Converting to new repository UTF-8 format of course
> means older clients will have trouble (Linux clients would work fine
> as they always been UTF-8). The UTF-8 requirement would just be
> indicated in the requires file like the repo formats normally are.
> Doesn't older clients warn about missing requirements anyhow?

Yes, an old client won't access a repository on disk if it doesn't
understand the .hg/requires file.
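
A rough Python sketch of such a check, assuming the requires file lists one requirement per line. The function name `missing_requirements` and the `utf8names` requirement are hypothetical; nothing here is taken from Mercurial's own code.

```python
SUPPORTED = {"revlogv1", "store", "fncache"}


def missing_requirements(requires_text, supported=SUPPORTED):
    """Return the requirements this client does not understand, sorted."""
    wanted = {line.strip() for line in requires_text.splitlines() if line.strip()}
    return sorted(wanted - set(supported))


# "utf8names" is a made-up requirement standing in for a UTF-8 manifest format.
text = "revlogv1\nstore\nfncache\nutf8names\n"
unknown = missing_requirements(text)
if unknown:
    print("cannot open repository, unknown requirements:", ", ".join(unknown))
```

This protects access to the repository on disk; it says nothing about what an old client sees when it receives the same changesets some other way.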

But there are funny upgrade scenarios to think about: if I create a
repository on Windows with a future version that supports Unicode, then
I'll create a manifest encoded in UTF-8. You can then no longer clone
that with hg-1.9 onto another Windows machine -- you'll get scrambled
filenames.

This is just like the situation today if I use the fixutf8 extension and
you don't. Personally, I think that is acceptable, but it's a regression
from today where we two Windows users can talk to each other using any
pair of versions.

> ps. what's the difference between mercuria...@googlegroups.com
> and merc...@selenic.com?

The @googlegroups.com address is for Google's mirror. It should not
really be used -- @selenic.com is the real address and sending to that
address ought to be enough to get the mail into Google's system too.

--
Martin Geisler

Mercurial links: http://mercurial.ch/

Tom Anderson

Oct 26, 2011, 4:21:52 AM10/26/11
to Jouni Airaksinen, merc...@selenic.com
On 26 October 2011 08:39, Jouni Airaksinen <Jouni.Ai...@descom.fi> wrote:

> I've been following this conversation and it seems to me that people are
> trying to overengineer this?
>
> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations are
> done?

That would obviously require knowing the encoding being used by the
local platform, so that filenames could be transcoded on commit.

I believe the objection to this idea is that Mercurial can't
accurately know what that encoding is. Unix systems mostly do use
UTF-8, as it happens, but that is not necessarily the case - they
could use Latin-1, or any other encoding. Because the encoding in use
is not recorded anywhere definitive, Mercurial can't tell for sure. It
would have to do something like assume UTF-8 unless configured
otherwise, which leads to the situation where someone using a Latin-1
system forgets to configure that, and they end up committing files
with corrupt names.

That said, i think this is essentially the right solution, but it
should be optional, not the default. I have a very wordy email sitting
in my drafts about this, but basically, i think we should have two
modes, set on a per-repository basis, and burned into the repository
when created:

- Old mode, in which repositories use a 'passthrough' encoding for
both the filesystem and the repository; the same bytes are used in
both places (if you like, think of this as, if you'll excuse some
Java, char decode(byte b) {return (char)(b & 0xff);} byte encode(char
ch) {return (byte)ch;}). This reproduces the current behaviour.
- New mode, in which repositories use UTF-8 for the repository, and a
local encoding for the filesystem. That could be configured in hgrc in
the usual way. If it was not configured, Mercurial could either (a)
guess an encoding based on some combination of system settings (the
LANG environment variable on Unix, don't know about Windows), a survey
of the bytes in some local filenames, whatever, or (b) refuse to
commit (like how it won't commit until you specify a username). The
former would be easier on users; the latter would be safer and more
Pythonic ("In the face of ambiguity, refuse the temptation to
guess."). It could perhaps follow a canny middle way: as long as any
path being committed appears to be plain ASCII (which works for
everyone except users of EBCDIC and PETSCII machines - not many of
them around), then guess that it's ASCII, but if it has any high bits
set, throw a strop and demand to be configured properly.
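
The "canny middle way" for the new mode could look roughly like this in Python. All names here (`EncodingNotConfigured`, `filesystem_encoding_for`, `to_repository_name`) are hypothetical, and `configured` stands in for whatever hgrc setting would carry the local encoding.

```python
class EncodingNotConfigured(Exception):
    """Raised when a non-ASCII path is seen and no encoding is configured."""


def filesystem_encoding_for(path: bytes, configured=None):
    """Pick the local encoding for one path without guessing."""
    if configured is not None:
        return configured
    if all(byte < 0x80 for byte in path):
        # Plain ASCII reads the same in nearly every local encoding.
        return "ascii"
    raise EncodingNotConfigured(
        "path has high bits set; configure the filesystem encoding first")


def to_repository_name(path: bytes, configured=None) -> bytes:
    """Transcode a local filename to the UTF-8 form stored in the repository."""
    encoding = filesystem_encoding_for(path, configured)
    return path.decode(encoding).encode("utf-8")


print(to_repository_name(b"README.txt"))
print(to_repository_name("é.txt".encode("latin-1"), configured="latin-1"))
```

Refusing to commit in the unconfigured case mirrors how a commit is refused until a username is set.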

I assume old mode would have to be the default, so as not to trip up
users currently depending on that behaviour. Personally, i would like
to see the new mode be the default, but i don't think that will fly.

You could actually convert a repository between these states as long
as it only contained ASCII or UTF-8 filenames, i think.

tom

--
Tom Anderson         |                e2x Ltd, 1 Norton Folgate, London E1 6DB
(e) t...@e2x.co.uk    |    (m) +44 (7960) 989794    |    (f) +44 (20) 7100 3749

罗勇刚(Yonggang Luo)

Oct 26, 2011, 9:46:59 AM10/26/11
to Tom Anderson, mercurial, mercuria...@googlegroups.com
Tom, please have a look at
http://mercurial.selenic.com/wiki/UnicodeOnWindows

I have considered most of the cases; feel free to edit the page with your revisions :)
2011/10/26 Tom Anderson <tom.an...@e2x.co.uk>

罗勇刚(Yonggang Luo)

Oct 26, 2011, 9:49:04 AM10/26/11
to Martin Geisler, mercurial, mercuria...@googlegroups.com


2011/10/26 Martin Geisler <m...@lazybytes.net>

Jouni Airaksinen <Jouni.Ai...@descom.fi> writes:

> I've been following this conversation and it seems to me that people
> are trying to overengineer this?

Yes, and there's a lot of confused statements too in this thread.

> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations
> are done? Platform would talk with it's own file system with encoding
> it needs. If you need to convert old repository, that's why we have
> convert tool. Converting to new repository UTF-8 format of course
> means older clients will have trouble (Linux clients would work fine
> as they always been UTF-8). The UTF-8 requirement would just be
> indicated in the requires file like the repo formats normally are.
> Doesn't older clients warn about missing requirements anyhow?

Yes, an old client wont access a repository on disk if it doesn't
understand the .hg/requires file.

But there are funny upgrade scenarios to think about: if I create a
repository on Windows with a future version that supports Unicode, then
I'll create a manifest encoded in UTF-8. You can then no longer clone
that with hg-1.9 onto another Windows machine -- you'll get scrambled
filenames.

This is just like the situation today if I use the fixutf8 extension and
you don't. Personally, I think that is acceptable, but it's a regression
From today where we two Windows users can talk to each other using any
pair of versions.

Those who WANT utf8 will use the newest Mercurial,
and those who want the old behaviour can still use the newest Mercurial.
I don't see any problem with supporting utf8 on Windows.


> ps. what's the difference between mercuria...@googlegroups.com
> and merc...@selenic.com?

The @googlegroups.com address is for Google's mirror. It should not
really be used -- @selenic.com is the real address and sending to that
address ought to be enough to get the mail into Google's system too.

--
Martin Geisler

Mercurial links: http://mercurial.ch/


Paul Boddie

Oct 26, 2011, 11:09:59 AM10/26/11
to luoyo...@gmail.com, mercurial
On 26/10/11 15:49, 罗勇刚(Yonggang Luo) wrote:
> http://mercurial.selenic.com/wiki/UnicodeOnWindows
> For those people WANT utf8, they will use the newest Mercurial,
> for those people want old way Mercurial, they still can use the newest
> Merurial.
> I didn't found any problem on support for utf8 on Windows.

I've taken the liberty of tidying up that page to fix the formatting and
to make the wording a bit clearer. There were also vague statements like...

"Evertying happened under windows."

...which don't really clarify matters even if repeated several times,
and so I've attempted to interpret the real meaning here as...

"The filename conversion only occurs on Windows."

You may wish to correct this, but keep it explicit: "everything" means
nothing in this case. I've also added a section which should describe
the motivation for the proposal.

I must admit that I haven't really looked at Mercurial's internals, but
I am skeptical about the proposal in that it seems to defer the
interpretation of filenames until a Windows system is involved, by which
time the repository might be stuffed with all kinds of different byte
sequences for any given character sequence.

> 2011/10/26 Martin Geisler<m...@lazybytes.net>
>
>> Jouni Airaksinen<Jouni.Ai...@descom.fi> writes:
>>> ps. what's the difference between mercuria...@googlegroups.com
>>> and merc...@selenic.com?
>>
>> The @googlegroups.com address is for Google's mirror. It should not
>> really be used -- @selenic.com is the real address and sending to that
>> address ought to be enough to get the mail into Google's system too.

I've removed it from this reply since I now appear to get personal
replies from that service when people see fit to post there instead of
on this list. Since this list is the only thing I have explicitly chosen
to subscribe to, I don't see why I should be pestered by the Google
service in question.

Paul

P.S. You might like to use the preview function on the Wiki a bit more:
saving a new page version every minute or so fills up the history very
quickly and makes it awkward to go back and see older edits.
