[...]
Might I suggest that you move the evolution of this proposal to a Wiki
page so that, interesting as this discussion has been, we do not have to
read messages for each edit to the proposal and each repeated assertion
and counter-assertion?
Paul
P.S. Dropping the Google group from the recipients list since it
requires a subscription.
_______________________________________________
Mercurial mailing list
Merc...@selenic.com
http://selenic.com/mailman/listinfo/mercurial
On 20/10/11 18:20, 罗勇刚(Yonggang Luo) wrote:[...]
A proposal to solve the encoding problem on Windows.
Things that need to be done:
1. All pure-ASCII repositories won't be affected. This is easy.
2. All OSes except Windows won't be affected, using sys.platform == 'win32' to
ensure that.
2. Use utf8 as the default encoding for new commits.
4. Support messed-up old Mercurial repositories; add a new --encoding
option to do that.
5. Support automatic renaming from old messed-up filenames to new utf8
filenames.
6. Support 'accented characters' seamlessly between Windows and Linux;
a test case is needed.
This is not sufficiently backwards-compatible. Alice using Mercurial 2.1
checks in a new file to the existing project, Bob and Carl and Dave and
Erica using Mercurial 1.8 can't check it out.
For many users, this would be a serious regression: Windows users using
the same SBCS code page can already share files just fine.
> 4. supporting for messed up old mercurial repository. add new --encoding
> option to do that.
Same problem. A solution that breaks things for existing users will not
be considered.
Here's an alternate scheme:
if windows:
    find manifest of the parent commit
    if manifest is empty:
        # brand new repo
        mode = utf8transcoding
    if all files in manifest are valid UTF-8:
        # repo is already in UTF-8 mode or is pure ASCII
        mode = utf8transcoding
    else:
        # existing repo, possibly using a Windows character set
        mode = passthrough
else:
    mode = passthrough
Notes:
1. We can reliably detect UTF-8 with very high probability
2. This automatically does the right thing on existing repos
3. This automatically does the right thing when working with Linux users
on UTF-8
4. Existing repos can be upgraded to UTF-8 if desired
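For readers who want to experiment, the scheme above can be sketched in Python roughly as follows. This is only an illustration, not Mercurial code: the function names, the mode strings, and the idea of passing the manifest as a list of byte strings are all assumptions made for the example.

```python
import sys


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def choose_mode(manifest_names, platform=sys.platform):
    """Pick a filename handling mode (hypothetical helper).

    manifest_names: iterable of filenames (bytes) from the parent manifest.
    """
    if platform != "win32":
        # Non-Windows systems keep today's behaviour.
        return "passthrough"
    # An empty manifest (brand new repo) passes this test vacuously,
    # and so does a pure-ASCII one.
    if all(is_valid_utf8(name) for name in manifest_names):
        return "utf8transcoding"
    # Existing repo, possibly using a Windows character set.
    return "passthrough"
```

A name such as b"caf\xe9.txt" (cp1252) is not valid UTF-8, so a repository containing it would stay in passthrough mode.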
--
Mathematics is the supreme nostalgia of our time.
Putting it on the wiki was actually bad advice: the wiki is primarily
for reference. We don't want to collect lots of non-official and/or
half-baked proposals there. I've reverted it, we should keep the
discussion either here or on mercurial-devel.
--
Mathematics is the supreme nostalgia of our time.
On Oct 20, 11:51 am, Matt Mackall <m...@selenic.com> wrote:
> On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> > The wiki link is
> >http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solv....
> > comment are welcome.
>
> Putting it on the wiki was actually bad advice: the wiki is primarily
> for reference. We don't want to collect lots of non-official and/or
> half-baked proposals there. I've reverted it, we should keep the
> discussion either here or on mercurial-devel.
>
What about using the talk page?
On Oct 20, 11:25 am, Paul Boddie <paul.bod...@biotek.uio.no> wrote:
>
> P.S. Dropping the Google group from the recipients list since it
> requires a subscription.
Google groups got the message anyway, I don't think you need to
explicitly CC it as long as you include the primary list address.
On Oct 20, 12:34 pm, Ben Fritz <fritzophre...@gmail.com> wrote:
> On Oct 20, 11:51 am, Matt Mackall <m...@selenic.com> wrote:
>
> > On Fri, 2011-10-21 at 00:40 +0800, 罗勇刚(Yonggang Luo) wrote:
> > > The wiki link is
> > >http://mercurial.selenic.com/wiki/EncodingStrategy#A_proposal_on_solv....
> > > comment are welcome.
>
> > Putting it on the wiki was actually bad advice: the wiki is primarily
> > for reference. We don't want to collect lots of non-official and/or
> > half-baked proposals there. I've reverted it, we should keep the
> > discussion either here or on mercurial-devel.
>
> What about using the talk page?
Never mind, stupid me didn't check that the wiki software used on
Mercurial's wiki actually has such a concept. Please forgive the
noise :-)
It's possible, but it's not a standard convention on our wiki, nor does
Moin do much to facilitate it. Talk happens here or on IRC.
--
Mathematics is the supreme nostalgia of our time.
So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?
<mike
> if all files in manifest are valid UTF-8:
>     # repo is already in UTF-8 mode or is pure ASCII
>     mode = utf8transcoding
Instead of running a mode-guessing algorithm, couldn't Mercurial
store a repo encoding field? If it is blank or non-existent, Mercurial
defaults to the current behavior.
Then, if it is set, hg can try to transcode to 'native', however that
is defined (it should also have a switch to disable this).
That means for existing repos nothing changes unless somebody adds a
commit with an encoding field.
And perhaps the field only needs to be added on the first commit
that adds a non-ANSI filename, which means everything stays the same
for those that don't need encodings. But I'm less sure about this part;
certainly I think you should also be able to 'force' the encoding (e.g.
set the repo encoding with 'hg set-encoding foo') and work with a
non-stated one (e.g. 'hg checkout --repo-encoding=bar --local-encoding=foo')
for when you are working with legacy/broken repos.
This seems so simple, I'm guessing there is a very good reason it hasn't
been suggested.
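A rough sketch of what I mean, in Python. The function name and parameters are made up for illustration; no such field or option exists in Mercurial today.

```python
def to_local(name: bytes, repo_encoding=None, local_encoding="utf-8") -> bytes:
    """Transcode a stored filename for the local system (hypothetical).

    With no repo encoding recorded, the bytes pass through untouched,
    which is the current behaviour.
    """
    if not repo_encoding:
        return name
    return name.decode(repo_encoding).encode(local_encoding)
```

So to_local(b"caf\xe9", "cp1252", "utf-8") transcodes the name, while to_local(b"caf\xe9") leaves it alone.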
TTFN
--
Roger. Home| http://www.sandman.uklinux.net/
Master of Peng Shui. (Ancient oriental art of Penguin Arranging)
Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
New key Fpr: 0F2F E1DF 4CD2 5E7B EF9F B173 4CFA F143 ADBE 6B00
Where it is stored is an implementation detail, except that I believe it should
be revision controlled.
But do you really have a use case for files in a single repo *stored*
with different encodings? It strikes me that that way madness lies.
I have to say I can't imagine one; doesn't utf8 encode all the same codepoints
as cp-1252?
Have I missed something - does the Windows wide character API do something odd
when given the top portion of cp1252?
2011/10/21, Roger Gammans <rgam...@computer-surgery.co.uk>:
--
Sent from my mobile device
Yours sincerely,
罗勇刚 (Yonggang Luo)
Implementation details:
1. We need per-repository configuration (per-user is not a good idea, because the encoding belongs to the repository, and different repositories will have different encodings).
The repository's encoding configuration should look like this:
old_repository_encoding: such as 'ascii', 'cp1251', 'cp936', 'cp1252', 'utf8' and so on
We set three parameters because, when migrating from an old repository to a new one, we face a problem: how do we check out old history, for example an old tag or an old branch? When checking out those revisions, we need to work the old way.
We also need to supply an upgrade tool to upgrade repositories created with old tools to use utf8 as the repository encoding:
hg encoding --old ascii --new utf8
hg encoding --old cp936 --new utf8
Notes:
0. Everything happens under Windows.
1. Everything is determined, so nothing is guessed; a single upgrade instruction is needed once, and then everything is handled automatically.
2. Existing repos will not be affected at all unless the user executes the hg encoding instruction.
3. Newly created repositories default to utf8, but we can still supply an option to create an old-style repository.
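As a minimal sketch of what such an upgrade step would do to each stored filename (the `hg encoding` command is only proposed here and does not exist; the helper name below is an assumption for illustration):

```python
def transcode_name(raw: bytes, old: str = "cp936", new: str = "utf-8") -> bytes:
    """Re-encode one stored filename from the old codepage to the new encoding."""
    return raw.decode(old).encode(new)


# A name as an old Windows client on a cp936 system would have stored it.
old_name = "文件.txt".encode("cp936")
new_name = transcode_name(old_name)
```

Pure-ASCII names come out unchanged, since ASCII bytes are the same in cp936 and utf8.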
You greatly simplified the task.
> Implement detail:
> 1.We need per-repository configuration about, (per-user is not a good idea, because encoding is based on repository, and different repository will have different encoding).
> the repository’s encoding configuration should be like this:
> old_repository_encoding such as ‘ascii’ ‘cp1251’ cp936 cp1252 utf8 and so on
You make an assumption that _all_ the file names in the repository have the same encoding. This is not the case.
> Setting three parameters because when we migrating from old repository to new repository, we need to face a problem, how to checkout to old history? for example checkout old-tag, old-branch? When checkout those revision, then we need work in the old-way.
Please consider the case when the old client (for instance version 1.8), which is not aware of the settings you describe, wants to check out the files.
> Also which need to supply a upgrade tool to upgrade to using utf8 as the repository encoding on old tools.
> hg encoding --old ascii --new utf8
> hg encoding --old cp936 --new utf8
I am sorry, I do not understand this at all. Do you affect the working directory? Do you commit? Do you change the history?
> Notes:
> 0. Everything happened under windows.
> 1. Everything is decided, so there is no possibility, only need a upgrade instruction at once, the everything will handled automatically.
> 2. Existing repos will not be affected at all, except the user execute the hg encoding instruction
> 3. Newly created repository default setting to utf8, but we can still supply option to create old way repository.
Unix is also affected. See the comments about Mac OS and different encodings for Unix systems.
-Andrey
2011/10/21 Andrey <py4...@gmail.com>
> Please consider the case when the old client (for instance version 1.8), which is not aware of the settings you describe, wants to check out the files.
That's impossible if we want to use utf8 as the encoding. It's like wanting Win98 to support Unicode properly.
> You make an assumption that _all_ the file names in the repository have the same encoding. This is not the case.
You can guess or make a stab at the environment encoding. But you can't guess the
repo encoding, as this is the encoding of the environment where the commit
occurred.
So my suggestion was to add a repo encoding property, so that this
is known for any repo which sets it. This then means that meaningful
transcodes between the repo and the local encoding can be done.
And with a '--local-encoding' switch allowing override of the
'guess', so the behaviour is as predictable as possible.
> 3) the old Mercurial versions should work either the same or better (no
> regression)
Unfortunately I don't see how that is possible. In my mind we would
add a .hg/requires entry when the encoding was set, and repos
with an encoding set would not be usable with previous versions. I don't see
how they could reliably be, as they won't 'guarantee' the repo encoding.
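To illustrate how a requires entry would lock out older clients, here is a small sketch. The supported names are existing Mercurial requirements; "utf8names" is a made-up requirement name used only for this example, and the function is not Mercurial's actual code.

```python
# Requirements an older client is assumed to understand.
SUPPORTED = {"revlogv1", "store", "fncache", "dotencode"}


def missing_requirements(requires_text: str, supported=SUPPORTED):
    """Return the entries of a .hg/requires file this client does not know."""
    wanted = {line.strip() for line in requires_text.splitlines() if line.strip()}
    return sorted(wanted - supported)


# A repo that has had its encoding set (hypothetical "utf8names" entry).
unknown = missing_requirements("revlogv1\nstore\nfncache\nutf8names\n")
```

A client that finds anything in that list refuses to open the repository, which is exactly the regression for old versions that I can't see a way around.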
TTFN
--
Roger. Home| http://www.sandman.uklinux.net/
Master of Peng Shui. (Ancient oriental art of Penguin Arranging)
Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
New key Fpr: 0F2F E1DF 4CD2 5E7B EF9F B173 4CFA F143 ADBE 6B00
> The most important goal for me was actually this: 2 (3?). use utf8 as
> the default encoding for new commits.
>
> Now I see (thanks, Matt), that it may introduce serious regression
> problems. I need some time to think about a possible solution.
>
>> if all files in manifest are valid UTF-8:
>> # repo is already in UTF-8 mode or is pure ASCII
>> mode = utf8transcoding
>
> This check is just a guess. We cannot rely on it. In general, it is
> not possible to detect the encoding from the sequence of bytes.
You're right in principle: a Latin-1 encoded text with "pÃ¦re" also
happens to be the UTF-8 encoding of "pære". However, what Matt writes is
that the chance of that happening is small and so it is okay with him to
declare a text to be UTF-8 if it can be correctly decoded as such.
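The ambiguity is easy to reproduce in Python. This is just an illustration of the point, with variable names chosen for the example:

```python
word = "pære"

# The same four-character word, encoded two ways.
utf8_bytes = word.encode("utf-8")      # æ becomes the two bytes C3 A6
latin1_bytes = word.encode("latin-1")  # æ becomes the single byte E6

# Reading the UTF-8 bytes as Latin-1 always "works", it just gives other text.
misread = utf8_bytes.decode("latin-1")

# The reverse direction fails: the Latin-1 bytes are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

So every byte string is valid Latin-1, but only a small share of non-ASCII Latin-1 text happens to be valid UTF-8, which is why the guess works well in practice.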
--
Martin Geisler
Mercurial links: http://mercurial.ch/
I am focusing on Windows. As you see, Unix also needs to configure
Unicode handling; maybe a separate proposal is needed,
such as
if os is windows
    then do something
if os is unix
    then do something
else
    passthrough.
>
> 2) it should be possible that users use different encodings but contribute to the same repository. For instance, Unix and Windows users work together on the same project.
Maybe you want something like this.
If encoding is configured: # For all OSes (Unix, Windows, Mac OS)?
    do things the new way
else:
    do things the old way, i.e. pass through
> Please note, that the encoding is not the property of the repository but the
> property of the working directory.
No. They BOTH have this property, and they might be different.
You can guess or make a stab at the environment encoding. But you can't guess the
repo encoding, as this is the encoding of the environment where the commit
occurred.
> 3) the old Mercurial versions should work either the same or better (no
> regression)
Unfortunately I don't see how that is possible
--
Yours sincerely,
罗勇刚 (Yonggang Luo)
Please read the wiki style guide, especially this section:
http://mercurial.selenic.com/wiki/WikiStyleGuide#Development_plans_and_other_speculative_pages
--
Mathematics is the supreme nostalgia of our time.
--
Yours sincerely,
罗勇刚 (Yonggang Luo)
>
> What I mean is that UTF-16 encoded text may look like (the same bytes) as
> the UTF-8 encoded text
> Without BOM (byte order mark) we cannot make any conclusions about the
> content. Do you mean that BOM _is_ stored in the repository ?
A BOM is never stored in a filename; it is only stored in the content (at the
beginning) of a file, so it should not be considered here.
It's off topic.
Also, UTF-16 should not be used as a filename encoding stored in a Mercurial
repository, because it's not compatible with ASCII.
For example, the space character is represented as the two bytes 0x20
0x00 in LE, and 0x00 must not appear.
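A quick Python check of the point above (illustration only, the variable names are mine):

```python
# In UTF-16 little-endian, every ASCII character carries a NUL byte.
space_le = " ".encode("utf-16-le")
name_le = "a.txt".encode("utf-16-le")

# UTF-8 never produces a NUL byte for ordinary filename characters.
name_utf8 = "文件 a.txt".encode("utf-8")

has_nul_utf16 = b"\x00" in name_le
has_nul_utf8 = b"\x00" in name_utf8
```

The UTF-16 form contains NUL bytes even for a plain ASCII name, while the UTF-8 form keeps ASCII bytes as they are.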
> So what happens when I create and check in a file that uses a non-ascii file name encoded in something other than utf8 on a Unix box and you try and check it out on your windows box?
Your local Mercurial client on your Unix box should be aware of which encoding should be used for the file.
>>
>> Also, UTF-16 should not be a filename encoding stored in Mercurial
>> repository, because it's not compatible with ASCII.
>> for example the space character will be represent as two byte 0x20
>> 0x00 in LE, the 0x00 should not be appeared.
>>
> As far as I understand, if you create a file now on Windows, you will get
> UTF-16 encoded name, because this is how Python 2 gets the name from the OS.
> That is why the existing repositories do contain UTF-16 encoded names.
That's impossible; I don't know why this would happen. In fact, the
names Python 2 gets from the OS are always encoded in the local
encoding: those APIs force conversion of the UTF-16 encoded names to
the local encoding.
For example, my OS uses CP936, so when I call the API from Python 2,
the returned filename is encoded in CP936, not in UTF-16. This needs
to be explained.
--
Yours sincerely,
罗勇刚 (Yonggang Luo)
I'm sorry, but I don't see how this answers the question. Could you
provide details? Or maybe I should. Here's the scenario:
- I create a new repository on Windows, and start checking in files.
- I clone it to a Unix system, and check in files with non-utf8,
non-ascii file names.
- I pull changes back to the Windows system and update.
So - what becomes of those non-utf8, non-ascii file names on my Windows box?
> Notes
>
> 1. Evertying happened under windows.
No, it doesn't. That's part of the problem. Mercurial is a
cross-platform tool with unix leanings. Any change has to both not
break things on Unix, *and* not break things for people who are using
it on both Windows and Unix.
The same as happens today.
The only vaguely reasonable approaches for doing cross-platform work
are:
a) ASCII only (works perfectly today)
b) UTF-8 on Unix, UTF-16 on Windows, and be sure you avoid the 'makefile
problem'
Anything else is doomed to failure so I'm only interested in adding
support for (b).
--
Mathematics is the supreme nostalgia of our time.
This is incorrect.
>>> u = u'Here is a string. Български'
>>> u.encode('utf-8')
'Here is a string. \xd0\x91\xd1\x8a\xd0\xbb\xd0\xb3\xd0\xb0\xd1\x80\xd1
\x81\xd0\xba\xd0\xb8'
>>> u.encode('utf-16')
'\xff\xfeH\x00e\x00r\x00e\x00 \x00i\x00s\x00 \x00a\x00 \x00s\x00t\x00r
\x00i\x00n\x00g\x00.\x00 \x00\x11\x04J\x04;\x043\x040\x04@\x04A\x04:
\x048\x04'
UTF-8 has a couple key properties:
- if a byte looks like ASCII, it is ASCII
- the first byte of a multibyte character is always of the form 11xxxxxx
- the first byte encodes the length of the character in bytes
- the second and later bytes are of the form 10xxxxxx
..which means that it's very easy to recognize properly encoded UTF-8
from even a small sample with very high probability. Here's a quick
program that generates 50k random byte strings of each length and
reports what percentage are valid UTF-8:
$ python uc.py
1 50.19400000%
2 28.46400000%
3 15.50600000%
4 8.90000000%
5 5.05600000%
6 2.90200000%
7 1.56200000%
8 0.96000000%
9 0.46400000%
10 0.28000000%
11 0.13400000%
12 0.08000000%
13 0.05200000%
14 0.03000000%
15 0.02600000%
16 0.01200000%
17 0.00400000%
18 0.00200000%
19 0.00000000%
20 0.00600000%
Given our manifest will be more like 10k - 1MB rather than just 20
bytes, the odds of getting confused here are really quite negligible.
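The uc.py script itself was not posted. A minimal Python 3 sketch of such an experiment might look like the following; it relies on Python's strict UTF-8 decoder (which also rejects overlong forms and surrogates), so its percentages will not match the table above exactly. All names here are assumptions for illustration.

```python
import random


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def valid_fraction(length: int, trials: int = 50000, rng=None) -> float:
    """Fraction of random byte strings of the given length that are valid UTF-8."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        sample = bytes(rng.getrandbits(8) for _ in range(length))
        if is_valid_utf8(sample):
            hits += 1
    return hits / trials


def report(max_length: int = 20, trials: int = 50000) -> None:
    """Print one line per length, in the style of the table above."""
    for n in range(1, max_length + 1):
        print("%d %.8f%%" % (n, 100 * valid_fraction(n, trials)))
```

Calling report() prints the table; for length 1 the result is close to 50%, since exactly the 128 ASCII byte values are valid on their own.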
Also, as it's impossible to store UTF-16 in Mercurial's -manifest-
(where we store filenames) due to the presence of NUL bytes, there's no
chance of confusion.
And elsewhere:
> As far as I understand, if you create a file now on Windows, you will get
> UTF-16 encoded name, because this is how Python 2 gets the name from the OS.
> That is why the existing repositories do contain UTF-16 encoded names.
Also wrong. What actually happens is that UTF-16 gets decoded into the
8-bit "filesystem encoding" when using the standard C interfaces that
Python 2 wraps. This encoding may be different from the console encoding
or the GUI encoding. And the console and GUI encodings generally don't
agree anyway.
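As a rough analogy in Python 3: squeezing a Unicode name into an 8-bit codepage is lossy when the codepage lacks the characters. Python's errors="replace" handler is used here only to imitate the effect; the actual conversion Windows applies may behave differently in detail.

```python
name = "Български.txt"

# A Cyrillic codepage can represent the name exactly.
as_cp1251 = name.encode("cp1251")

# A Western European codepage cannot; each Cyrillic letter is replaced.
as_cp1252 = name.encode("cp1252", errors="replace")
```

Once the name has been reduced to question marks, there is no way to recover the original from the bytes stored in the manifest.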
So you're saying that if I'm working with something other than utf8 on
Unix, I *have* to create the repository on Windows with the encoding I
want to use on Unix?
>> > 1. Evertying happened under windows.
>> No, it doesn't. That's part of the problem. Mercurial is a
>> cross-platform tool with unix leanings. Any change has to both not
>> break things on Unix, *and* not break things for people who are using
>> it on both Windows and Unix.
> Er, Only those NEW repository affected. please don't insist again and again.
Right. I'm *asking* about new repositories. Please don't answer the
wrong question again and again.
On Fri, Oct 21, 2011 at 9:55 AM, Matt Mackall <m...@selenic.com> wrote:
> On Fri, 2011-10-21 at 09:46 -0700, Mike Meyer wrote:
>> - I create a new repository on Windows, and start checking in files.
>> - I clone it to a Unix system, and check in files with non-utf8,
>> non-ascii file names.
>> - I pull changes back to the Windows system and update.
>>
>> So - what becomes of those non-utf8, non-ascii file names on my Windows box?
>
> The same as happens today.
In that case, I'm missing something in either the proposed change or
what happens today. I expect that today, the bytes would be the same
on both ends, so you'd get a gibberish display but avoid the makefile
problem. With the proposed changes, the files would be decoded as if
they were utf8, which would mean that you either get a gibberish
display *and* the makefile problem, or the decoding fails and you get
????. Could you tell me where I went wrong?
<mike
Is it going to be a requirement that I *not* upgrade Mercurial on Unix
if I want to use something other than utf8 or ascii? That doesn't seem
right, as the proposal says that Mercurial's behavior doesn't change on
non-Windows systems. So I ought to get the same thing using the new
unicode-aware-on-windows version as I do with 1.8.
<mike
So the new version of Mercurial will create an "old-style" repository
on Unix which you then upgrade? Fair enough. The requirement seems to
be that 1) everyone uses the same repo encoding, and 2) you have to
know that encoding the first time you use the repo on Windows
(either upgrading to that encoding if you cloned it, or creating it
with that encoding).
> You can guess or make a stab a the enviromnent encoding. But you can't guess the
> repo encoding as this is the encoding on the environment where the commit
> occured.
I don't see why the repo needs an encoding (where by "repo"
I mean the stuff under .hg).
The repository is private Mercurial data. It is Mercurial, not
the OS, that decides what bytes get stored in the manifest. If
Mercurial decides that Windows file names are serialized as UTF-8,
then that's that.
The filenames under .hg/store are already mangled to ASCII
and hopefully the OS won't sabotage that.
> So my suggestion was to add a repo encoding property, so that this
> was known on repo which set it. This then means that meaningful
> transcodes between the repo and the local encoding can be done.
You only need a working copy encoding for that. If somebody
wants to clone a repo under codepage-whatever only the working
copy is affected (with an error if one of the Unicode characters
won't map).
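Something like this sketch, assuming the manifest holds UTF-8 and the working copy encoding is known. The function name is hypothetical and this is not how Mercurial currently behaves.

```python
def to_working_copy(manifest_name: bytes, wc_encoding: str) -> bytes:
    """Encode a UTF-8 manifest name for a working copy in another encoding.

    Raises ValueError if a character has no mapping in that encoding.
    """
    text = manifest_name.decode("utf-8")
    try:
        return text.encode(wc_encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            "cannot represent %r in %s" % (text, wc_encoding)
        ) from exc
```

Checking out "café" under cp1252 works; checking out a Chinese name under cp1252 stops with an error instead of silently creating a mangled file.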
--
pa at panix dot com
> * The repository's encoding configuration should be like this:
> {{{
> old_repository_encoding: such as 'ascii' 'cp1251' cp936 cp1252 utf8 and so
> on
> separator revision number: such as 128
One boundary revision is not enough. You can't guarantee that all
the old-encoding changesets come before the first new-encoding
changeset. Mercurial is distributed. Even in an all-Windows
environment, after multiple cloning-pulling-pushing between multiple
developers the changeset order will vary internally among the clones
--even after everybody syncs up.
I'm not sure you need a repository encoding though, except
possibly as a transitional measure.
--
pa at panix dot com
Andrey <py4...@gmail.com> wrote:
> > I don't see why the repo needs an encoding (where by "repo"
> > I mean the stuff under .hg).
> >
> The same file name is encoded differently for different platforms. For
> instance '?????.txt' cannot be exchanged between Unix and Windows.
Hmf, nor between the list and Usenet. Okay, I can see it on pipermail.
I'm not sure Yonggang Luo is asking for cross-platform interoperability.
Can you create a repo with that file on Windows and clone it
successfully on Windows?
Anyway, Why doesn't it work?
I'm assuming UTF-8 on Unix. Of course you can put byte sequences
that are not UTF-8 in a Unix filename, but that wouldn't be nice
when interoperating with Windows. Just like committing a file
"foo" and a file "Foo".
Under .hg the filename shows up in two places: 1) the manifest, 2)
the mangled filename under .hg/store . In the manifest, Mercurial
copies the bytes retrieved from the filesystem. In the store,
Mercurial mangles any bytes above 127 into an ascii representation.
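For illustration, here is a simplified Python sketch of that kind of mangling. It is written from memory of how the store escapes names (reserved and non-ASCII bytes become ~xx, capitals become _x) and should be treated as an assumption, not as Mercurial's actual implementation.

```python
WIN_RESERVED = b'\\:*?"<>|'


def mangle(name: bytes) -> str:
    """Escape a manifest filename into a pure-ASCII store name (sketch)."""
    out = []
    for b in name:
        ch = chr(b)
        if b < 32 or b >= 126 or b in WIN_RESERVED:
            out.append("~%02x" % b)      # control, high and reserved bytes
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())  # survive case-insensitive filesystems
        elif ch == "_":
            out.append("__")
        else:
            out.append(ch)
    return "".join(out)
```

With this scheme the UTF-8 name for "é.txt" is stored as "~c3~a9.txt", so the store never depends on the filesystem's own encoding.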
After the Windows pull, what bytes are in the manifest and what filename
shows up under .hg/store ? I would assume the same as Unix, but I
don't have a Windows box handy to verify that. If the file doesn't
show up correctly in the working copy it must be a problem creating
the working copy.
If the bytes are UTF-8 in the manifest and mangled UTF-8 in the store,
"all" Mercurial would have to do to is re-encode in Windows' UTF-16
and pass that to the Windows API to create a file with a "Unicode"
filename. Doesn't it do that? I see an extension in the wiki,
http://mercurial.selenic.com/wiki/FixUtf8Extension
is the information on this page current?
This extension is still in beta, use it at your own risk.
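The re-encoding step itself is small. A Python sketch, assuming the manifest bytes really are UTF-8 (the function name and the sample filename are made up for the example):

```python
def manifest_to_windows(name_utf8: bytes):
    """Turn UTF-8 manifest bytes into what a wide Windows API would want.

    Returns the decoded text and its UTF-16 little-endian form.
    """
    text = name_utf8.decode("utf-8")
    return text, text.encode("utf-16-le")


# Hypothetical Cyrillic filename as stored by a UTF-8 Unix client.
text, wide = manifest_to_windows("файл.txt".encode("utf-8"))
```

The hard part is not this conversion but deciding when the manifest bytes can be trusted to be UTF-8 in the first place.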
[
Full interoperability is harder because the *contents* of the
files also matters. If a C source file contains UTF-8 for
#include "дятел.h"
will the Windows C preprocessor grok that?
]
--
pa at panix dot com
罗勇刚(Yonggang Luo) <luoyo...@gmail.com> wrote:
> >
> > One boundary revision is not enough. You can't guarantee that all
> > the old-encoding changesets come before the first new-encoding
> > changeset.
>
> That's what I am worried about, I have no idea about this, is there any
> better suggestion from people have good
> idea on this?
Hmmm, one encoding per changeset, as metadata in the manifest revlog,
plus one repository-wide default for old changesets where the manifest
has no such metadata.
No idea how to handle back-compatibility. Is it just a new word
in .hg/requires ?
====
I have a question. Are you trying to interoperate with Linux
or is the problem entirely within Windows?
--
pa at panix dot com
On Sun, Oct 23, 2011 at 5:51 AM, 罗勇刚(Yonggang Luo)
<luoyo...@gmail.com> wrote:
> Because the local encoding codepoint is limited, that's a SET problem
> cp936 is a SUBSET of UTF8, cp1251 is also the SUBSET of UTF8.
> so you can not use cp936 representing all codepoint of UTF8, so is cp1251,
> and so on.
I'm pretty sure you mean Unicode, not UTF-8. UTF-8 is (the most used)
encoding scheme for Unicode characters.
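The subset point can be checked with a few lines of Python. This is only an illustration using Python's own codec tables; `encodable` is a hypothetical helper name.

```python
def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# cp936 and cp1251 each cover only part of Unicode; UTF-8 covers all of it.
for enc in ("cp936", "cp1251", "utf-8"):
    print(enc, encodable("罗", enc), encodable("罗д", enc))
```

A code page is a subset of the Unicode character set; UTF-8 is one way of writing any Unicode character as bytes.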
UTF-8 is a part of Unicode, and UTF-8 is the only choice for Mercurial.
2011/10/25, Andrey <py4...@gmail.com>:
--
Sent from my mobile device
I don't know why you insist that one character can be encoded in multiple
code points. What does that have to do with this proposal? Can you give
me a simple example to show your meaning?
-Andrey
The 'A' Unicode character will occupy:
1 byte in the UTF-8 encoding scheme
2 bytes in UTF-16BE
2 bytes in UTF-16LE
4 bytes in UTF-32BE
4 bytes in UTF-32LE
As you can see, 5 different byte sequences lead to the same character.

I am really confused by your statement. Can anyone give a clear explanation?
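The byte counts in that list can be reproduced with Python's standard codecs. This is just a small sketch; `encoded_forms` is a name chosen for the example.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encoded_forms(ch):
    """Map each encoding scheme to the bytes it produces for one character."""
    return {enc: ch.encode(enc) for enc in ENCODINGS}


# One character (one code point, U+0041), five different byte sequences.
for enc, data in encoded_forms("A").items():
    print(enc, len(data), data.hex())
```

The code point stays the same in every case; only the byte sequence used to store it changes.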
I think the FAQ link gives a very good explanation:
> I've been following this conversation and it seems to me that people
> are trying to overengineer this?
Yes, and there's a lot of confused statements too in this thread.
> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations
> are done? Platform would talk with it's own file system with encoding
> it needs. If you need to convert old repository, that's why we have
> convert tool. Converting to new repository UTF-8 format of course
> means older clients will have trouble (Linux clients would work fine
> as they always been UTF-8). The UTF-8 requirement would just be
> indicated in the requires file like the repo formats normally are.
> Doesn't older clients warn about missing requirements anyhow?
Yes, an old client won't access a repository on disk if it doesn't
understand the .hg/requires file.
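A simplified sketch of that check in Python, for illustration only. The requirement name "utf8-filenames" is hypothetical, the two client sets are made up for the example, and this is not Mercurial's own code.

```python
def missing_requirements(requires_text, supported):
    """List entries from a requires file that this client does not know."""
    wanted = {line.strip() for line in requires_text.splitlines() if line.strip()}
    return sorted(wanted - set(supported))


OLD_CLIENT = {"revlogv1", "store", "fncache"}
NEW_CLIENT = OLD_CLIENT | {"utf8-filenames"}  # hypothetical new requirement

requires = "revlogv1\nstore\nfncache\nutf8-filenames\n"
print(missing_requirements(requires, OLD_CLIENT))  # old client should refuse
print(missing_requirements(requires, NEW_CLIENT))  # new client can proceed
```

A client that finds any unknown entry refuses to open the repository, which is what makes a new word in the requires file a clean way to signal a format change on disk.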
But there are funny upgrade scenarios to think about: if I create a
repository on Windows with a future version that supports Unicode, then
I'll create a manifest encoded in UTF-8. You can then no longer clone
that with hg-1.9 onto another Windows machine -- you'll get scrambled
filenames.
This is just like the situation today if I use the fixutf8 extension and
you don't. Personally, I think that is acceptable, but it's a regression
from today, where two Windows users can talk to each other using any
pair of versions.
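The scrambling is easy to reproduce in Python. This sketch assumes the old client simply treats the manifest bytes as being in its local code page; the function name and the choice of cp1251 are illustrative.

```python
def seen_by_old_client(name, local_codepage):
    """UTF-8 manifest bytes, interpreted as if they were in a local code page."""
    raw = name.encode("utf-8")
    return raw.decode(local_codepage, errors="replace")


# A non-ASCII name comes out scrambled; a pure-ASCII name is unaffected.
print(seen_by_old_client("дятел.txt", "cp1251"))
print(seen_by_old_client("readme.txt", "cp1251"))
```

Pure-ASCII repositories are immune because ASCII bytes mean the same thing in UTF-8 and in the Windows code pages.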
> ps. what's the difference between mercuria...@googlegroups.com
> and merc...@selenic.com?
The @googlegroups.com address is for Google's mirror. It should not
really be used -- @selenic.com is the real address and sending to that
address ought to be enough to get the mail into Google's system too.
--
Martin Geisler
Mercurial links: http://mercurial.ch/
> I've been following this conversation and it seems to me that people are
> trying to overengineer this?
>
> Why can't old repositories work as-is and new repositories could be
> created to use storing in UTF-8 regardless of platform the operations are
> done?
That would obviously require knowing the encoding being used by the
local platform, so that filenames could be transcoded on commit.
I believe the objection to this idea is that Mercurial can't
accurately know what that encoding is. Unix systems mostly do use
UTF-8, as it happens, but that is not necessarily the case - they
could use Latin-1, or any other encoding. Because the encoding in use
is not recorded anywhere definitive, Mercurial can't tell for sure. It
would have to do something like assume UTF-8 unless configured
otherwise, which leads to the situation where someone using a Latin-1
system forgets to configure that, and they end up committing files
with corrupt names.
That said, i think this is essentially the right solution, but it
should be optional, not the default. I have a very wordy email sitting
in my drafts about this, but basically, i think we should have two
modes, set on a per-repository basis, and burned into the repository
when created:
- Old mode, in which repositories use a 'passthrough' encoding for
both the filesystem and the repository; the same bytes are used in
both places (if you like, think of this as, if you'll excuse some
Java, char decode(byte b) {return (char)(b & 0xff);} byte encode(char
ch) {return (byte)ch;}). This reproduces the current behaviour.
- New mode, in which repositories use UTF-8 for the repository, and a
local encoding for the filesystem. That could be configured in hgrc in
the usual way. If it was not configured, Mercurial could either (a)
guess an encoding based on some combination of system settings (the
LANG environment variable on Unix, don't know about Windows), a survey
of the bytes in some local filenames, whatever, or (b) refuse to
commit (like how it won't commit until you specify a username). The
former would be easier on users; the latter would be safer and more
Pythonic ("In the face of ambiguity, refuse the temptation to
guess."). It could perhaps follow a canny middle way: as long as any
path being committed appears to be plain ASCII (which works for
everyone except users of EBCDIC and PETSCII machines - not many of
them around), then guess that it's ASCII, but if it has any high bits
set, throw a strop and demand to be configured properly.
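The two modes, with the "canny middle way" for unconfigured systems, might look roughly like this in Python. All names here (`to_repo_name`, the mode strings, `EncodingNotConfigured`) are invented for the sketch and do not correspond to anything in Mercurial.

```python
class EncodingNotConfigured(Exception):
    """Raised when a non-ASCII name is committed without a local encoding."""


def to_repo_name(fs_bytes, mode, local_encoding=None):
    """Convert a filesystem name (bytes) to the bytes stored in the repo."""
    if mode == "passthrough":
        # Old mode: same bytes on both sides, no interpretation at all.
        return fs_bytes
    if mode != "utf8":
        raise ValueError("unknown mode: %r" % mode)
    if all(b < 128 for b in fs_bytes):
        # Plain ASCII: identical in UTF-8, so no configuration is needed.
        return fs_bytes
    if local_encoding is None:
        # High bits set and nothing configured: refuse rather than guess.
        raise EncodingNotConfigured("set a filesystem encoding before committing")
    return fs_bytes.decode(local_encoding).encode("utf-8")


print(to_repo_name(b"\xe9.txt", "passthrough"))
print(to_repo_name(b"\xe9.txt", "utf8", local_encoding="latin-1"))
```

Refusing on non-ASCII names keeps a forgetful Latin-1 user from committing corrupt names, at the cost of one extra configuration step.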
I assume old mode would have to be the default, so as not to trip up
users currently depending on that behaviour. Personally, i would like
to see the new mode be the default, but i don't think that will fly.
You could actually convert a repository between these states as long
as it only contained ASCII or UTF-8 filenames, i think.
tom
--
Tom Anderson | e2x Ltd, 1 Norton Folgate, London E1 6DB
(e) t...@e2x.co.uk | (m) +44 (7960) 989794 | (f) +44 (20) 7100 3749
I've taken the liberty of tidying up that page to fix the formatting and
to make the wording a bit clearer. There were also vague statements like...
"Evertying happened under windows."
...which don't really clarify matters even if repeated several times,
and so I've attempted to interpret the real meaning here as...
"The filename conversion only occurs on Windows."
You may wish to correct this, but keep it explicit: "everything" means
nothing in this case. I've also added a section which should describe
the motivation for the proposal.
I must admit that I haven't really looked at Mercurial's internals, but
I am skeptical about the proposal in that it seems to defer the
interpretation of filenames until a Windows system is involved, by which
time the repository might be stuffed with all kinds of different byte
sequences for any given character sequence.
> 2011/10/26 Martin Geisler<m...@lazybytes.net>
>
>> Jouni Airaksinen<Jouni.Ai...@descom.fi> writes:
>>> ps. what's the difference between mercuria...@googlegroups.com
>>> and merc...@selenic.com?
>>
>> The @googlegroups.com address is for Google's mirror. It should not
>> really be used -- @selenic.com is the real address and sending to that
>> address ought to be enough to get the mail into Google's system too.
I've removed it from this reply since I now appear to get personal
replies from that service when people see fit to post there instead of
on this list. Since this list is the only thing I have explicitly chosen
to subscribe to, I don't see why I should be pestered by the Google
service in question.
Paul
P.S. You might like to use the preview function on the Wiki a bit more:
saving a new page version every minute or so fills up the history very
quickly and makes it awkward to go back and see older edits.